Class SequenceUtil

java.lang.Object
htsjdk.samtools.util.SequenceUtil

public class SequenceUtil extends Object
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
     
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte
    Byte typed variables for all normal bases.
    static final byte[]
     
    static final byte[]
     
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean
    Returns true if both parameters are null or equal, otherwise returns false
    static void
    Throws an exception if both parameters are non-null and unequal.
    static void
    Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).
    static void
    Throws an exception if both parameters are non-null and unequal, including the filenames.
    static void
    default signature that forces the lists to be the same size
    static void
    Throws an exception only if both (first) parameters are not null optionally check that one list is a (nonempty) prefix of the other.
    static boolean
    basesEqual(byte lhs, byte rhs)
    Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case), without considering the set relationships between ambiguous codes.
    static boolean
    bisulfiteBasesEqual(boolean negativeStrand, byte read, byte reference)
    Returns true if the bases are equal OR if the mismatch can be accounted for by bisulfite treatment.
    static boolean
    bisulfiteBasesEqual(byte read, byte reference)
     
    static boolean
    bisulfiteBasesMatchWithAmbiguity(boolean negativeStrand, byte read, byte reference)
    Same as above, but use readBaseMatchesRefBaseWithAmbiguity instead of basesEqual.
    static double
    calculateGc(byte[] bases)
    Calculates the fraction of bases that are G/C in the sequence
    static byte[]
    calculateMD5(byte[] data, int offset, int len)
     
    static String
    calculateMD5String(byte[] data)
     
    static String
    calculateMD5String(byte[] data, int offset, int len)
     
    static void
    calculateMdAndNmTags(SAMRecord record, byte[] ref, boolean calcMD, boolean calcNM)
    Calculate MD and NM similarly to Samtools, except that N->N is a match.
    static int
    calculateSamNmTag(SAMRecord read, byte[] referenceBases)
    Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
    static int
    calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset)
    Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
    static int
    calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
    Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
    static int
    Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone.
    static byte
    complement(byte b)
    Returns the complement of a single byte.
    static int
     
    static int
     
    static int
     
    static int
     
    static int
    countMismatches(SAMRecord read, byte[] referenceBases)
    Calculates the number of mismatches between the read and the reference sequence provided.
    static int
    countMismatches(SAMRecord read, byte[] referenceBases, boolean bisulfiteSequence)
    Calculates the number of mismatches between the read and the reference sequence provided.
    static int
    countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
    Calculates the number of mismatches between the read and the reference sequence provided.
    static int
    countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
    Calculates the number of mismatches between the read and the reference sequence provided.
    static int
    countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence, boolean matchAmbiguousRef)
     
    static List<byte[]>
    generateAllKmers(int length)
    Generates all possible unambiguous kmers (upper-case) of length and returns them as byte[]s.
    static String
    Returns all IUPAC codes as a string
    static byte[]
    getRandomBases(Random random, int length)
    Returns an array of bytes containing random DNA bases
    static void
    getRandomBases(Random random, int length, byte[] bases)
    Fills an array of bytes with random DNA bases.
    static String
    Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file.
    static boolean
    isBamReadBase(byte base)
    Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY'
    static boolean
    isBisulfiteConverted(byte read, byte reference)
     
    static boolean
    isBisulfiteConverted(byte read, byte reference, boolean negativeStrand)
    Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand.
    static boolean
    isIUPAC(byte base)
    Checks if the given base is a IUPAC code
    static boolean
    isNoCall(byte base)
    returns true if the value of base represents a no call
    static boolean
    isUpperACGTN(byte base)
    Check if the given base is one of upper case ACGTN
    static boolean
    isValidBase(byte b)
    Returns true if the byte is in [acgtACGT].
    static String
    makeCigarStringWithIndelPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength, int indelPosition, int indelLength)
    Create a cigar string for a gapped alignment, which may have soft clipping at either end
    static String
    makeCigarStringWithPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength)
    Create a simple ungapped cigar string, which might have soft clipping at either end
    static byte[]
    makeReferenceFromAlignment(SAMRecord rec, boolean includeReferenceBasesForDeletions)
    Produce reference bases from an aligned SAMRecord with MD string and Cigar.
    static String
    makeSoftClipCigar(int clipLength)
     
    static String
    md5DigestToString(byte[] digest)
    Convets the result of an md5Digest to a string
    static boolean
    readBaseMatchesRefBaseWithAmbiguity(byte readBase, byte refBase)
    Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from a reference sequence, using the reference code as a 'pattern' that the read base must match.
    static void
    reverse(byte[] array, int offset, int len)
     
    static void
    reverseComplement(byte[] bases)
    Reverses and complements the bases in place.
    static void
    reverseComplement(byte[] bases, int offset, int len)
     
    static String
    reverseComplement(String sequenceData)
    Calculate the reverse complement of the specified sequence (Stolen from Reseq)
    static void
    reverseQualities(byte[] quals)
    Reverses the quals in place.
    static int
    sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases)
    Calculates the sum of qualities for mismatched bases in the read.
    static int
    sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
    Calculates the sum of qualities for mismatched bases in the read.
    static int
    sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
    Calculates the sum of qualities for mismatched bases in the read.
    static byte[]
    toBamReadBasesInPlace(byte[] bases)
    Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with N
    static byte
    upperCase(byte base)
     
    static byte[]
    upperCase(byte[] bases)
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • a

      public static final byte a
      Byte typed variables for all normal bases.
      See Also:
    • c

      public static final byte c
      Byte typed variables for all normal bases.
      See Also:
    • g

      public static final byte g
      Byte typed variables for all normal bases.
      See Also:
    • t

      public static final byte t
      Byte typed variables for all normal bases.
      See Also:
    • n

      public static final byte n
      Byte typed variables for all normal bases.
      See Also:
    • A

      public static final byte A
      Byte typed variables for all normal bases.
      See Also:
    • C

      public static final byte C
      Byte typed variables for all normal bases.
      See Also:
    • G

      public static final byte G
      Byte typed variables for all normal bases.
      See Also:
    • T

      public static final byte T
      Byte typed variables for all normal bases.
      See Also:
    • N

      public static final byte N
      Byte typed variables for all normal bases.
      See Also:
    • VALID_BASES_UPPER

      public static final byte[] VALID_BASES_UPPER
    • VALID_BASES_LOWER

      public static final byte[] VALID_BASES_LOWER
  • Constructor Details

    • SequenceUtil

      public SequenceUtil()
  • Method Details

    • reverseComplement

      public static String reverseComplement(String sequenceData)
      Calculate the reverse complement of the specified sequence (Stolen from Reseq)
      Parameters:
      sequenceData -
      Returns:
      reverse complement
    • basesEqual

      public static boolean basesEqual(byte lhs, byte rhs)
      Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case), without considering the set relationships between ambiguous codes.
    • readBaseMatchesRefBaseWithAmbiguity

      public static boolean readBaseMatchesRefBaseWithAmbiguity(byte readBase, byte refBase)
      Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from a reference sequence, using the reference code as a 'pattern' that the read base must match. We take ambiguous codes into account, returning true if the set of possible bases represented by the read value is a (non-strict) subset of the possible bases represented by the reference value. Since the comparison is directional, make sure to pass read / ref codes in correct order.
    • isNoCall

      public static boolean isNoCall(byte base)
      returns true if the value of base represents a no call
    • isValidBase

      public static boolean isValidBase(byte b)
      Returns true if the byte is in [acgtACGT].
    • isUpperACGTN

      public static boolean isUpperACGTN(byte base)
      Check if the given base is one of upper case ACGTN
    • getIUPACCodesString

      public static String getIUPACCodesString()
      Returns all IUPAC codes as a string
    • isIUPAC

      public static boolean isIUPAC(byte base)
      Checks if the given base is a IUPAC code
    • calculateGc

      public static double calculateGc(byte[] bases)
      Calculates the fraction of bases that are G/C in the sequence
    • isBamReadBase

      public static boolean isBamReadBase(byte base)
      Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY'
    • toBamReadBasesInPlace

      public static byte[] toBamReadBasesInPlace(byte[] bases)
      Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with N
    • assertSequenceListsEqual

      public static void assertSequenceListsEqual(List<SAMSequenceRecord> s1, List<SAMSequenceRecord> s2)
      default signature that forces the lists to be the same size
      Parameters:
      s1 - a list of sequence headers
      s2 - a second list of sequence headers
    • assertSequenceListsEqual

      public static void assertSequenceListsEqual(List<SAMSequenceRecord> s1, List<SAMSequenceRecord> s2, boolean checkPrefixOnly)
      Throws an exception only if both (first) parameters are not null optionally check that one list is a (nonempty) prefix of the other.
      Parameters:
      s1 - a list of sequence headers
      s2 - a second list of sequence headers
      checkPrefixOnly - a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
    • areSequenceDictionariesEqual

      public static boolean areSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2)
      Returns true if both parameters are null or equal, otherwise returns false
      Parameters:
      s1 - a list of sequence headers
      s2 - a second list of sequence headers
    • assertSequenceDictionariesEqual

      public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2)
      Throws an exception if both parameters are non-null and unequal.
      Parameters:
      s1 - a list of sequence headers
      s2 - a second list of sequence headers
    • assertSequenceDictionariesEqual

      public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, boolean checkPrefixOnly)
      Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).
      Parameters:
      s1 - a list of sequence headers
      s2 - a second list of sequence headers
      checkPrefixOnly - a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
    • assertSequenceDictionariesEqual

      public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, File f1, File f2)
      Throws an exception if both parameters are non-null and unequal, including the filenames.
    • makeCigarStringWithPossibleClipping

      public static String makeCigarStringWithPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength)
      Create a simple ungapped cigar string, which might have soft clipping at either end
      Parameters:
      alignmentStart - raw aligment start, which may result in read hanging off beginning or end of read
      Returns:
      cigar string that may have S operator at beginning or end, and has M operator for the rest of the read
    • makeCigarStringWithIndelPossibleClipping

      public static String makeCigarStringWithIndelPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength, int indelPosition, int indelLength)
      Create a cigar string for a gapped alignment, which may have soft clipping at either end
      Parameters:
      alignmentStart - raw alignment start, which may result in read hanging off beginning or end of read
      readLength -
      referenceSequenceLength -
      indelPosition - number of matching bases before indel. Must be > 0
      indelLength - length of indel. Positive for insertion, negative for deletion.
      Returns:
      cigar string that may have S operator at beginning or end, has one or two M operators, and an I or a D.
    • makeSoftClipCigar

      public static String makeSoftClipCigar(int clipLength)
    • countMismatches

      public static int countMismatches(SAMRecord read, byte[] referenceBases)
      Calculates the number of mismatches between the read and the reference sequence provided.
    • countMismatches

      public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
      Calculates the number of mismatches between the read and the reference sequence provided.
    • countMismatches

      public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
      Calculates the number of mismatches between the read and the reference sequence provided.
      Parameters:
      referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
      referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
      bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
    • countMismatches

      public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence, boolean matchAmbiguousRef)
    • countMismatches

      public static int countMismatches(SAMRecord read, byte[] referenceBases, boolean bisulfiteSequence)
      Calculates the number of mismatches between the read and the reference sequence provided.
      Parameters:
      referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
      bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
    • sumQualitiesOfMismatches

      public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases)
      Calculates the sum of qualities for mismatched bases in the read.
      Parameters:
      referenceBases - Array of ASCII bytes in which the 0th position in the array corresponds to the first element of the reference sequence to which read is aligned.
    • sumQualitiesOfMismatches

      public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
      Calculates the sum of qualities for mismatched bases in the read.
      Parameters:
      referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
      referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
    • sumQualitiesOfMismatches

      public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
      Calculates the sum of qualities for mismatched bases in the read.
      Parameters:
      referenceBases - Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.
      referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
      bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
    • countInsertedBases

      public static int countInsertedBases(Cigar cigar)
    • countDeletedBases

      public static int countDeletedBases(Cigar cigar)
    • countInsertedBases

      public static int countInsertedBases(SAMRecord read)
    • countDeletedBases

      public static int countDeletedBases(SAMRecord read)
    • calculateSamNmTag

      public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases)
      Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
    • calculateSamNmTag

      public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset)
      Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
      Parameters:
      referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
    • calculateSamNmTag

      public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
      Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (see readBaseMatchesRefBaseWithAmbiguity method).
      Parameters:
      referenceOffset - 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
      bisulfiteSequence - If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
    • calculateSamNmTagFromCigar

      public static int calculateSamNmTagFromCigar(SAMRecord record)
      Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone. It may calculate incorrectly if ambiguous operators (Like M) are used. Needed for testing infrastructure: SAMRecordSetBuilder
    • complement

      public static byte complement(byte b)
      Returns the complement of a single byte.
    • bisulfiteBasesEqual

      public static boolean bisulfiteBasesEqual(boolean negativeStrand, byte read, byte reference)
      Returns true if the bases are equal OR if the mismatch can be accounted for by bisulfite treatment. C->T on the positive strand and G->A on the negative strand do not count as mismatches.
    • bisulfiteBasesEqual

      public static boolean bisulfiteBasesEqual(byte read, byte reference)
    • bisulfiteBasesMatchWithAmbiguity

      public static boolean bisulfiteBasesMatchWithAmbiguity(boolean negativeStrand, byte read, byte reference)
      Same as above, but use readBaseMatchesRefBaseWithAmbiguity instead of basesEqual. Note that isBisulfiteConverted is not affected because it only applies when the reference base is non-ambiguous.
    • isBisulfiteConverted

      public static boolean isBisulfiteConverted(byte read, byte reference, boolean negativeStrand)
      Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand.
    • isBisulfiteConverted

      public static boolean isBisulfiteConverted(byte read, byte reference)
    • makeReferenceFromAlignment

      public static byte[] makeReferenceFromAlignment(SAMRecord rec, boolean includeReferenceBasesForDeletions)
      Produce reference bases from an aligned SAMRecord with MD string and Cigar.
      Parameters:
      rec - Must contain non-empty CIGAR and MD attribute.
      includeReferenceBasesForDeletions - If true, include reference bases that are deleted in the read. This will make the returned array not line up with the read if there are deletions.
      Returns:
      References bases corresponding to the read. If there is an insertion in the read, reference contains '-'. If the read is soft-clipped, reference contains '0'. If there is a skipped region and includeReferenceBasesForDeletions==true, reference will have Ns for the skipped region.
    • reverseComplement

      public static void reverseComplement(byte[] bases)
      Reverses and complements the bases in place.
    • reverseQualities

      public static void reverseQualities(byte[] quals)
      Reverses the quals in place.
    • reverse

      public static void reverse(byte[] array, int offset, int len)
    • reverseComplement

      public static void reverseComplement(byte[] bases, int offset, int len)
    • calculateMD5String

      public static String calculateMD5String(byte[] data)
    • calculateMD5String

      public static String calculateMD5String(byte[] data, int offset, int len)
    • md5DigestToString

      public static String md5DigestToString(byte[] digest)
      Convets the result of an md5Digest to a string
      Parameters:
      digest - digest that needs to be converted to a string
      Returns:
      string representing the md5
    • calculateMD5

      public static byte[] calculateMD5(byte[] data, int offset, int len)
    • calculateMdAndNmTags

      public static void calculateMdAndNmTags(SAMRecord record, byte[] ref, boolean calcMD, boolean calcNM)
      Calculate MD and NM similarly to Samtools, except that N->N is a match.
      Parameters:
      record - Input record for which to calculate NM and MD. The appropriate tags will be added/updated in the record
      ref - The reference bases for the sequence to which the record is mapped
      calcMD - A flag indicating whether to update the MD tag in the record
      calcNM - A flag indicating whether to update the NM tag in the record
    • upperCase

      public static byte upperCase(byte base)
    • upperCase

      public static byte[] upperCase(byte[] bases)
    • generateAllKmers

      public static List<byte[]> generateAllKmers(int length)
      Generates all possible unambiguous kmers (upper-case) of length and returns them as byte[]s.
    • getSamReadNameFromFastqHeader

      public static String getSamReadNameFromFastqHeader(String fastqHeader)
      Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file. Any letters after the first space are ignored. Ths method also strips trailing "/1" or "/2" so that paired end reads have the same name.
      Parameters:
      fastqHeader - the header from a FastqRecord.
      Returns:
      a read name appropriate for output in a SAM/BAM file.
    • getRandomBases

      public static byte[] getRandomBases(Random random, int length)
      Returns an array of bytes containing random DNA bases
      Parameters:
      random - A Random object to use for drawing randomw bases
      length - How many bases to return.
      Returns:
      an array of random DNA bases of the requested length.
    • getRandomBases

      public static void getRandomBases(Random random, int length, byte[] bases)
      Fills an array of bytes with random DNA bases. Will overwrite first length byte with new bases. if length is less than the size of the input array, the remaining bases will not be modified.
      Parameters:
      random - A Random object to use for drawing random bases
      length - How many bases to return.
      bases - Array to use for bases (from index 0)