Package htsjdk.variant.bcf2
Class BCF2Utils
java.lang.Object
htsjdk.variant.bcf2.BCF2Utils
Common utilities for working with BCF2 files.
Includes convenience methods for encoding and decoding BCF2 type descriptors (size + type).
- Since:
- 5/12
-
Field Summary
-
Method Summary
Modifier and Type, Method, Description
static String collapseStringList(List<String> strings)
    Collapse multiple strings into a comma separated list ["s1", "s2", "s3"] => ",s1,s2,s3"
static int decodeSize(byte typeDescriptor)
static BCF2Type decodeType(byte typeDescriptor)
static int decodeTypeID(byte typeDescriptor)
static BCF2Type determineIntegerType(int value)
static BCF2Type determineIntegerType(int[] values)
static BCF2Type determineIntegerType(List<Integer> values)
static byte encodeTypeDescriptor(int nElements, BCF2Type type)
static List<String> explodeStringList(String collapsed)
    Inverse operation of collapseStringList.
static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
    Are the elements and their order in the output and input headers consistent so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header is different than in the input header we must decode the blocks using the input header and then recode them based on the new output order.
static boolean isCollapsedString(String s)
static ArrayList<String> makeDictionary(VCFHeader header)
    Create a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT) fields.
static BCF2Type maxIntegerType(BCF2Type t1, BCF2Type t2)
    Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16 returns INT16.
static byte readByte(InputStream stream)
static final File shadowBCF(File vcfFile)
    Returns a good name for a shadow BCF file for vcfFile.
static boolean sizeIsOverflow(byte typeDescriptor)
static <T> List<T> toList(Class<T> c, Object o)
    Helper function that takes an object and returns a list representation of it: o == null => [], o is a list => o, else => [o]
-
Field Details
-
MAX_ALLELES_IN_GENOTYPES
public static final int MAX_ALLELES_IN_GENOTYPES
-
OVERFLOW_ELEMENT_MARKER
public static final int OVERFLOW_ELEMENT_MARKER
-
MAX_INLINE_ELEMENTS
public static final int MAX_INLINE_ELEMENTS
-
INTEGER_TYPES_BY_SIZE
-
ID_TO_ENUM
-
-
Method Details
-
makeDictionary
Create a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (the IDs of FILTER, INFO, and FORMAT fields). It is critical that the list be deduplicated and ordered the same way every time: BCF2 offsets are encoded relative to this dictionary, so a dictionary that is not derived from the header in exactly the same way each time makes those offsets resolve to the wrong identifiers.
- Parameters:
header - the VCFHeader from which to build the dictionary
- Returns:
a non-null dictionary of elements, may be empty
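The deduplication and ordering requirement can be illustrated with a minimal, self-contained sketch. It does not use htsjdk: `DictionarySketch`, `buildDictionary`, and `offsetOf` are hypothetical names, the identifiers are passed in as plain strings already extracted in header order, and pinning "PASS" at offset 0 is an assumption taken from the BCF2 convention rather than from this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DictionarySketch {
    // Hypothetical helper, not the htsjdk API: builds a dictionary from
    // identifier strings supplied in header order.
    static List<String> buildDictionary(List<String> idsInHeaderOrder) {
        // LinkedHashSet drops duplicates and keeps first-seen order, so the
        // same header always yields the same dictionary.
        Set<String> seen = new LinkedHashSet<>();
        seen.add("PASS"); // assumption: PASS always occupies offset 0
        seen.addAll(idsInHeaderOrder);
        return new ArrayList<>(seen);
    }

    // The offset a record would store in place of the identifier string.
    static int offsetOf(List<String> dictionary, String id) {
        return dictionary.indexOf(id);
    }
}
```

Because records store only offsets, any change in how the list is deduplicated or ordered silently changes which identifier an offset refers to.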
-
encodeTypeDescriptor
-
decodeSize
public static int decodeSize(byte typeDescriptor) -
decodeTypeID
public static int decodeTypeID(byte typeDescriptor) -
decodeType
-
sizeIsOverflow
public static boolean sizeIsOverflow(byte typeDescriptor) -
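As a rough illustration of what "size + type" packing looks like, the sketch below stores the element count in the high four bits of the descriptor byte and the type ID in the low four bits, with a size of 15 acting as the overflow marker. That layout and the value 15 are assumptions based on the BCF2 format description, not something this page states. `TypeDescriptorSketch` is a hypothetical class that takes a raw integer type ID instead of a BCF2Type.

```java
final class TypeDescriptorSketch {
    // Assumption: a size nibble of 15 means "the real size follows separately".
    static final int OVERFLOW_MARKER = 15;

    // Pack the element count (capped at the overflow marker) and the type ID
    // into a single byte: high nibble = size, low nibble = type.
    static byte encode(int nElements, int typeId) {
        int size = Math.min(nElements, OVERFLOW_MARKER);
        return (byte) ((size << 4) | (typeId & 0x0F));
    }

    // Mask before shifting so the sign extension of the byte is discarded.
    static int decodeSize(byte descriptor) {
        return (descriptor & 0xF0) >> 4;
    }

    static int decodeTypeId(byte descriptor) {
        return descriptor & 0x0F;
    }

    static boolean sizeIsOverflow(byte descriptor) {
        return decodeSize(descriptor) == OVERFLOW_MARKER;
    }
}
```

Under these assumptions, a vector with more than 14 elements cannot carry its length inline, which is presumably why the class exposes both MAX_INLINE_ELEMENTS and OVERFLOW_ELEMENT_MARKER.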
readByte
- Throws:
IOException
-
collapseStringList
Collapse multiple strings into a comma-separated list with a leading comma: ["s1", "s2", "s3"] => ",s1,s2,s3"
- Parameters:
strings - size > 1 list of strings
-
explodeStringList
Inverse operation of collapseStringList: ",s1,s2,s3" => ["s1", "s2", "s3"]
- Parameters:
collapsed
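The round trip between the two forms can be sketched in plain Java as follows. `StringListSketch` and its methods are hypothetical stand-ins written from the examples above, not the htsjdk implementation. Treating a leading comma as the marker of a collapsed string is an assumption suggested by the ",s1,s2,s3" form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StringListSketch {
    // ["s1", "s2", "s3"] => ",s1,s2,s3": every element is preceded by a comma.
    static String collapse(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(',').append(s);
        }
        return sb.toString();
    }

    // Assumption: a collapsed string is recognized by its leading comma.
    static boolean isCollapsed(String s) {
        return !s.isEmpty() && s.charAt(0) == ',';
    }

    // ",s1,s2,s3" => ["s1", "s2", "s3"]: drop the leading comma, then split.
    static List<String> explode(String collapsed) {
        if (!isCollapsed(collapsed)) {
            throw new IllegalArgumentException("not a collapsed string: " + collapsed);
        }
        // limit -1 keeps trailing empty elements instead of discarding them
        return new ArrayList<>(Arrays.asList(collapsed.substring(1).split(",", -1)));
    }
}
```

Note that a scheme like this only round-trips when no element itself contains a comma.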
-
isCollapsedString
-
shadowBCF
Returns a good name for a shadow BCF file for vcfFile:
foo.vcf => foo.bcf
foo.xxx => foo.xxx.bcf
If the resulting BCF file cannot be written, returns null. This happens, for example, when vcfFile is /dev/null.
- Parameters:
vcfFile
- Returns:
the BCF file, or null if it cannot be written
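The naming rule alone can be sketched as below. `ShadowNameSketch.shadowName` is a hypothetical helper that covers only the two name mappings shown above. It deliberately omits the writability check, so unlike shadowBCF it never returns null.

```java
import java.io.File;

final class ShadowNameSketch {
    // foo.vcf => foo.bcf; any other name gets ".bcf" appended.
    // Sketch only: does not test whether the result can be written.
    static File shadowName(File vcfFile) {
        String path = vcfFile.getPath();
        String suffix = ".vcf";
        if (path.endsWith(suffix)) {
            return new File(path.substring(0, path.length() - suffix.length()) + ".bcf");
        }
        return new File(path + ".bcf");
    }
}
```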
-
determineIntegerType
-
determineIntegerType
-
maxIntegerType
Returns the larger of the BCF2 integer types t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
- Parameters:
t1
t2
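A minimal sketch of how choosing and widening integer types might work is given below. `IntegerTypeSketch` and its `IntType` enum are hypothetical stand-ins for BCF2Type. The value ranges assume each width reserves its most negative value as a marker. That is an assumption, and the exact bounds htsjdk uses may differ.

```java
final class IntegerTypeSketch {
    // Declared narrowest to widest so ordinal() reflects relative size.
    enum IntType { INT8, INT16, INT32 }

    // Smallest type that can hold the value.
    // Assumption: the most negative value of each width is reserved.
    static IntType determine(int value) {
        if (value > Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return IntType.INT8;
        if (value > Short.MIN_VALUE && value <= Short.MAX_VALUE) return IntType.INT16;
        if (value > Integer.MIN_VALUE) return IntType.INT32;
        throw new IllegalArgumentException("value not encodable: " + value);
    }

    // The wider of the two types, e.g. max(INT8, INT16) is INT16.
    static IntType max(IntType t1, IntType t2) {
        return t1.ordinal() >= t2.ordinal() ? t1 : t2;
    }

    // Smallest type that can hold every value in the array.
    static IntType determine(int[] values) {
        IntType widest = IntType.INT8;
        for (int v : values) {
            widest = max(widest, determine(v));
            if (widest == IntType.INT32) break; // cannot get any wider
        }
        return widest;
    }
}
```

Folding the per-value result through `max` is what lets a whole vector be written with a single type descriptor.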
-
determineIntegerType
-
toList
Helper function that takes an object and returns a list representation of it:
o == null => []
o is a list => o
else => [o]
- Parameters:
c - the class of the object
o - the object to convert to a Java List
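The three cases can be written out as a small sketch. `ToListSketch.toList` is a hypothetical re-implementation based only on the rules listed above. The real method may handle additional cases, such as arrays, that this sketch does not.

```java
import java.util.Collections;
import java.util.List;

final class ToListSketch {
    @SuppressWarnings("unchecked")
    static <T> List<T> toList(Class<T> c, Object o) {
        if (o == null) {
            return Collections.emptyList();       // o == null => []
        }
        if (o instanceof List) {
            return (List<T>) o;                   // o is a list => o (elements are not type-checked)
        }
        return Collections.singletonList(c.cast(o)); // else => [o]
    }
}
```

Passing the class token lets the single-object case fail fast with a ClassCastException if o is not actually a T.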
-
headerLinesAreOrderedConsistently
public static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header differs from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order. If they are consistent, we can simply pass through the raw genotypes block bytes, which is a *huge* performance win for large blocks.
Many common operations on BCF2 files (merging them for -nt, selecting a subset of records, etc.) don't modify the ordering of the header fields and so can safely pass through the genotypes undecoded. Some operations, such as those that add filters or info fields, can change the ordering of the header fields and so produce invalid BCF2 files if the genotypes aren't decoded.
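One way to picture the consistency test is as a check that every identifier keeps its position. The sketch below is an assumption about what "consistent" could mean, not the htsjdk implementation. `HeaderOrderSketch.canPassThrough` is a hypothetical helper that works on plain lists of identifier strings and treats the headers as consistent when the input identifiers form a prefix of the output identifiers, so every offset stored in an existing block still points at the same entry.

```java
import java.util.List;

final class HeaderOrderSketch {
    // Assumption: raw blocks stay valid if each input identifier sits at the
    // same index in the output header; extra output entries may only be appended.
    static boolean canPassThrough(List<String> outputIds, List<String> inputIds) {
        if (outputIds.size() < inputIds.size()) {
            return false; // the output header is missing entries the blocks refer to
        }
        for (int i = 0; i < inputIds.size(); i++) {
            if (!inputIds.get(i).equals(outputIds.get(i))) {
                return false; // an identifier moved, so stored offsets would be wrong
            }
        }
        return true;
    }
}
```

When the check fails, the caller falls back to the slow path described above: decode with the input header, then recode against the output header.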
-