Class BCF2Utils

java.lang.Object
htsjdk.variant.bcf2.BCF2Utils

public final class BCF2Utils extends Object
Common utilities for working with BCF2 files. Includes convenience methods for encoding and decoding BCF2 type descriptors (size + type).
Since:
5/12
  • Field Details

    • MAX_ALLELES_IN_GENOTYPES

      public static final int MAX_ALLELES_IN_GENOTYPES
    • OVERFLOW_ELEMENT_MARKER

      public static final int OVERFLOW_ELEMENT_MARKER
    • MAX_INLINE_ELEMENTS

      public static final int MAX_INLINE_ELEMENTS
    • INTEGER_TYPES_BY_SIZE

      public static final BCF2Type[] INTEGER_TYPES_BY_SIZE
    • ID_TO_ENUM

      public static final BCF2Type[] ID_TO_ENUM
  • Method Details

    • makeDictionary

      public static ArrayList<String> makeDictionary(VCFHeader header)
      Create a strings dictionary from the VCF header. The dictionary is an ordered list of common VCF identifiers (FILTER, INFO, and FORMAT fields). Note that it is critical that the list be deduplicated and sorted in a consistent manner each time: BCF2 offsets are encoded relative to this dictionary, so if it is not determined exactly the same way as in the header each time, the encoded offsets will refer to the wrong entries.
      Parameters:
      header - the VCFHeader from which to build the dictionary
      Returns:
      a non-null dictionary of elements, may be empty
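The ordering requirement above can be illustrated with a minimal, self-contained sketch. It does not use the htsjdk API: `DictionarySketch` and `buildDictionary` are hypothetical names, and keeping first-seen order is an assumption about what "consistent" means here. The point is only that the same sequence of header IDs always yields the same offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of htsjdk.
final class DictionarySketch {

    // Removes repeated IDs while keeping the order in which they first appear,
    // so the same header always produces the same dictionary offsets.
    static ArrayList<String> buildDictionary(List<String> idsInHeaderOrder) {
        return new ArrayList<>(new LinkedHashSet<>(idsInHeaderOrder));
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("PASS", "q10", "DP", "AF", "DP", "GT");
        List<String> dictionary = buildDictionary(ids);
        // The offset of an ID is its index in the dictionary.
        System.out.println(dictionary);               // [PASS, q10, DP, AF, GT]
        System.out.println(dictionary.indexOf("AF")); // 3
    }
}
```

A set with stable iteration order is used here because an unordered set could return the IDs in a different order from one run to the next.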
    • encodeTypeDescriptor

      public static byte encodeTypeDescriptor(int nElements, BCF2Type type)
    • decodeSize

      public static int decodeSize(byte typeDescriptor)
    • decodeTypeID

      public static int decodeTypeID(byte typeDescriptor)
    • decodeType

      public static BCF2Type decodeType(byte typeDescriptor)
    • sizeIsOverflow

      public static boolean sizeIsOverflow(byte typeDescriptor)
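The type descriptor methods above pack a size and a type into a single byte. The sketch below shows one way this can work, following the layout in the BCF2 specification: the element count sits in the upper four bits and the type ID in the lower four, with a size of 15 signalling that the real count is stored separately. The class name, the method names, and the clamping of large counts are illustrative assumptions, not the htsjdk source.

```java
// Hypothetical stand-in for the descriptor helpers; uses plain ints instead of BCF2Type.
final class TypeDescriptorSketch {

    // Assumed overflow marker: the largest value that fits in four bits.
    static final int OVERFLOW_MARKER = 15;

    // Packs the element count (upper nibble) and the type ID (lower nibble).
    // Counts that do not fit inline are replaced by the overflow marker.
    static byte encode(int nElements, int typeId) {
        int size = Math.min(nElements, OVERFLOW_MARKER);
        return (byte) ((size << 4) | (typeId & 0x0F));
    }

    // Masking before shifting discards the sign extension of the byte.
    static int decodeSize(byte descriptor) {
        return (descriptor & 0xF0) >> 4;
    }

    static int decodeTypeId(byte descriptor) {
        return descriptor & 0x0F;
    }

    static boolean sizeIsOverflow(byte descriptor) {
        return decodeSize(descriptor) == OVERFLOW_MARKER;
    }

    public static void main(String[] args) {
        byte small = encode(3, 1);  // three elements of type ID 1, i.e. 0x31
        System.out.println(decodeSize(small) + " " + decodeTypeId(small));
        byte large = encode(20, 2); // too many elements to store inline
        System.out.println(sizeIsOverflow(large));
    }
}
```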
    • readByte

      public static byte readByte(InputStream stream) throws IOException
      Throws:
      IOException
    • collapseStringList

      public static String collapseStringList(List<String> strings)
      Collapse multiple strings into a comma-separated list: ["s1", "s2", "s3"] => ",s1,s2,s3"
      Parameters:
      strings - size > 1 list of strings
      Returns:
      the collapsed string, which begins with a comma
    • explodeStringList

      public static List<String> explodeStringList(String collapsed)
      Inverse operation of collapseStringList. ",s1,s2,s3" => ["s1", "s2", "s3"]
      Parameters:
      collapsed - a collapsed string in the format produced by collapseStringList
      Returns:
      the list of individual strings
    • isCollapsedString

      public static boolean isCollapsedString(String s)
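The collapse and explode round trip documented above can be sketched as follows. This is a self-contained illustration under stated assumptions: the class and method names are hypothetical, and treating any non-empty string with a leading comma as "collapsed" is a guess at what isCollapsedString checks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical re-implementation for illustration; not the htsjdk source.
final class StringListSketch {

    // ["s1", "s2", "s3"] => ",s1,s2,s3": every element is preceded by a comma.
    static String collapse(List<String> strings) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(',').append(s);
        }
        return sb.toString();
    }

    // ",s1,s2,s3" => ["s1", "s2", "s3"]: drop the leading comma, then split.
    // The -1 limit keeps trailing empty elements instead of discarding them.
    static List<String> explode(String collapsed) {
        return new ArrayList<>(Arrays.asList(collapsed.substring(1).split(",", -1)));
    }

    // Assumption: a collapsed string is recognised by its leading comma.
    static boolean isCollapsed(String s) {
        return !s.isEmpty() && s.charAt(0) == ',';
    }

    public static void main(String[] args) {
        String collapsed = collapse(Arrays.asList("s1", "s2", "s3"));
        System.out.println(collapsed);          // ,s1,s2,s3
        System.out.println(explode(collapsed)); // [s1, s2, s3]
    }
}
```

The leading comma is what lets a reader tell a collapsed list apart from an ordinary single string, at the cost that elements containing commas cannot be round-tripped.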
    • shadowBCF

      public static final File shadowBCF(File vcfFile)
      Returns a good name for a shadow BCF file for vcfFile: foo.vcf => foo.bcf, foo.xxx => foo.xxx.bcf. If the resulting BCF file cannot be written, returns null. This happens when vcfFile = /dev/null, for example.
      Parameters:
      vcfFile - the VCF file for which to derive a shadow BCF file
      Returns:
      the shadow BCF file, or null if it cannot be written
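The naming rule described for shadowBCF can be sketched on plain path strings. This hypothetical helper covers only the renaming step (foo.vcf => foo.bcf, foo.xxx => foo.xxx.bcf). It does not attempt the writability check that makes the real method return null.

```java
// Hypothetical helper showing only the naming rule; not the htsjdk source.
final class ShadowNameSketch {

    private static final String VCF_SUFFIX = ".vcf";
    private static final String BCF_SUFFIX = ".bcf";

    static String shadowName(String vcfPath) {
        if (vcfPath.endsWith(VCF_SUFFIX)) {
            // Replace the .vcf extension with .bcf.
            String stem = vcfPath.substring(0, vcfPath.length() - VCF_SUFFIX.length());
            return stem + BCF_SUFFIX;
        }
        // Any other name keeps its extension and gains .bcf.
        return vcfPath + BCF_SUFFIX;
    }

    public static void main(String[] args) {
        System.out.println(shadowName("foo.vcf")); // foo.bcf
        System.out.println(shadowName("foo.xxx")); // foo.xxx.bcf
    }
}
```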
    • determineIntegerType

      public static BCF2Type determineIntegerType(int value)
    • determineIntegerType

      public static BCF2Type determineIntegerType(int[] values)
    • maxIntegerType

      public static BCF2Type maxIntegerType(BCF2Type t1, BCF2Type t2)
      Returns the maximum BCF2 integer size of t1 and t2. For example, if t1 == INT8 and t2 == INT16, returns INT16.
      Parameters:
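      t1 - the first BCF2 integer type
      t2 - the second BCF2 integer type
      Returns:
      whichever of t1 and t2 is the larger integer type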
    • determineIntegerType

      public static BCF2Type determineIntegerType(List<Integer> values)
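The determineIntegerType overloads and maxIntegerType pick the narrowest integer width that can hold one or more values. A minimal sketch of that idea follows. The `IntType` enum is a hypothetical stand-in for BCF2Type, and its symmetric bounds are an assumption: BCF2 reserves the lowest values of each width for special markers such as "missing", so the exact limits in the library may differ.

```java
// Hypothetical stand-in for the integer-type selection; not the htsjdk source.
final class IntegerTypeSketch {

    // Declared from narrowest to widest so that ordinal() reflects the size.
    // Bounds are assumed; the most negative value of each width is left out
    // because BCF2 reserves low values for special markers.
    enum IntType {
        INT8(-127, 127),
        INT16(-32767, 32767),
        INT32(-Integer.MAX_VALUE, Integer.MAX_VALUE);

        final int min;
        final int max;

        IntType(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean holds(int value) {
            return value >= min && value <= max;
        }
    }

    // Smallest type whose range contains the value.
    static IntType determine(int value) {
        for (IntType t : IntType.values()) {
            if (t.holds(value)) {
                return t;
            }
        }
        throw new IllegalArgumentException("Value out of range: " + value);
    }

    // For example, max(INT8, INT16) is INT16.
    static IntType max(IntType t1, IntType t2) {
        return t1.ordinal() >= t2.ordinal() ? t1 : t2;
    }

    // Widest type needed by any element; stops early once INT32 is reached.
    static IntType determine(int[] values) {
        IntType widest = IntType.INT8;
        for (int v : values) {
            widest = max(widest, determine(v));
            if (widest == IntType.INT32) {
                break;
            }
        }
        return widest;
    }

    public static void main(String[] args) {
        System.out.println(determine(100));                   // INT8
        System.out.println(determine(new int[] {1, 300, 7})); // INT16
    }
}
```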
    • toList

      public static <T> List<T> toList(Class<T> c, Object o)
      Helper function that takes an object and returns a list representation of it: o == null => [], o is a list => o, else => [o]
      Parameters:
      c - the class of the object
      o - the object to convert to a Java List
      Returns:
      a list representation of o
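The three cases listed for toList map directly onto a short method. The sketch below is a hypothetical re-implementation, not the htsjdk source. Using `c.cast(o)` for the single-object case is a design choice made here so that a mismatched type fails immediately, and the real method may behave differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical re-implementation for illustration.
final class ToListSketch {

    @SuppressWarnings("unchecked")
    static <T> List<T> toList(Class<T> c, Object o) {
        if (o == null) {
            return Collections.emptyList();          // o == null => []
        }
        if (o instanceof List) {
            return (List<T>) o;                      // o is a list => o (element type is trusted)
        }
        return Collections.singletonList(c.cast(o)); // else => [o]
    }

    public static void main(String[] args) {
        System.out.println(toList(String.class, null));
        System.out.println(toList(String.class, "a"));
        System.out.println(toList(String.class, Arrays.asList("a", "b")));
    }
}
```

Note that the list case returns the same instance rather than a copy, so changes made by the caller are visible through both references.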
    • headerLinesAreOrderedConsistently

      public static boolean headerLinesAreOrderedConsistently(VCFHeader outputHeader, VCFHeader genotypesBlockHeader)
      Are the elements and their order in the output and input headers consistent, so that we can write out the raw genotypes block without decoding and recoding it? If the order of INFO, FILTER, or contig elements in the output header is different from that in the input header, we must decode the blocks using the input header and then recode them based on the new output order. If they are consistent, we can simply pass through the raw genotypes block bytes, which is a *huge* performance win for large blocks. Many common operations on BCF2 files (merging them for -nt, selecting a subset of records, etc.) don't modify the ordering of the header fields and so can safely pass through the genotypes undecoded. Some operations -- those that add filters or info fields -- can change the ordering of the header fields and so produce invalid BCF2 files if the genotypes aren't decoded.
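One way to frame the consistency check described above is as a prefix comparison over ordered ID lists. The sketch below is an assumption about how such a check could work, not the htsjdk implementation: it treats the headers as consistent when every ID in the input list sits at the same offset in the output list, so IDs appended at the end are tolerated. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical check for illustration; not the htsjdk source.
final class HeaderOrderSketch {

    // True when each input ID keeps its offset in the output list,
    // i.e. the input list is a prefix of the output list.
    static boolean isOrderedPrefix(List<String> inputIds, List<String> outputIds) {
        if (inputIds.size() > outputIds.size()) {
            return false;
        }
        return outputIds.subList(0, inputIds.size()).equals(inputIds);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("PASS", "DP", "AF");
        // Same order plus one appended ID: offsets of existing IDs are unchanged.
        System.out.println(isOrderedPrefix(input, Arrays.asList("PASS", "DP", "AF", "MQ"))); // true
        // A new ID inserted in the middle shifts AF, so raw bytes would be misread.
        System.out.println(isOrderedPrefix(input, Arrays.asList("PASS", "DP", "MQ", "AF"))); // false
    }
}
```

When a check like this returns false, the genotypes block has to be decoded with the input header and re-encoded against the output header, as the description above explains.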