Package htsjdk.samtools.util
Class SequenceUtil
java.lang.Object
htsjdk.samtools.util.SequenceUtil
-
Nested Class Summary
-
Field Summary
Modifier and TypeFieldDescriptionstatic final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte
Byte typed variables for all normal bases.static final byte[]
static final byte[]
-
Constructor Summary
-
Method Summary
Modifier and TypeMethodDescriptionstatic boolean
Returns true if both parameters are null or equal, otherwise returns falsestatic void
Throws an exception if both parameters are non-null and unequal.static void
assertSequenceDictionariesEqual
(SAMSequenceDictionary s1, SAMSequenceDictionary s2, boolean checkPrefixOnly) Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).static void
assertSequenceDictionariesEqual
(SAMSequenceDictionary s1, SAMSequenceDictionary s2, File f1, File f2) Throws an exception if both parameters are non-null and unequal, including the filenames.static void
default signature that forces the lists to be the same sizestatic void
assertSequenceListsEqual
(List<SAMSequenceRecord> s1, List<SAMSequenceRecord> s2, boolean checkPrefixOnly) Throws an exception only if both (first) parameters are not null optionally check that one list is a (nonempty) prefix of the other.static boolean
basesEqual
(byte lhs, byte rhs) Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case), without considering the set relationships between ambiguous codes.static boolean
bisulfiteBasesEqual
(boolean negativeStrand, byte read, byte reference) Returns true if the bases are equal OR if the mismatch can be accounted for by bisulfite treatment.static boolean
bisulfiteBasesEqual
(byte read, byte reference) static boolean
bisulfiteBasesMatchWithAmbiguity
(boolean negativeStrand, byte read, byte reference) Same as above, but usereadBaseMatchesRefBaseWithAmbiguity
instead ofbasesEqual
.static double
calculateGc
(byte[] bases) Calculates the fraction of bases that are G/C in the sequencestatic byte[]
calculateMD5
(byte[] data, int offset, int len) static String
calculateMD5String
(byte[] data) static String
calculateMD5String
(byte[] data, int offset, int len) static void
calculateMdAndNmTags
(SAMRecord record, byte[] ref, boolean calcMD, boolean calcNM) Calculate MD and NM similarly to Samtools, except that N->N is a match.static int
calculateSamNmTag
(SAMRecord read, byte[] referenceBases) Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method).static int
calculateSamNmTag
(SAMRecord read, byte[] referenceBases, int referenceOffset) Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method).static int
calculateSamNmTag
(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method).static int
calculateSamNmTagFromCigar
(SAMRecord record) Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone.static byte
complement
(byte b) Returns the complement of a single byte.static int
countDeletedBases
(Cigar cigar) static int
countDeletedBases
(SAMRecord read) static int
countInsertedBases
(Cigar cigar) static int
countInsertedBases
(SAMRecord read) static int
countMismatches
(SAMRecord read, byte[] referenceBases) Calculates the number of mismatches between the read and the reference sequence provided.static int
countMismatches
(SAMRecord read, byte[] referenceBases, boolean bisulfiteSequence) Calculates the number of mismatches between the read and the reference sequence provided.static int
countMismatches
(SAMRecord read, byte[] referenceBases, int referenceOffset) Calculates the number of mismatches between the read and the reference sequence provided.static int
countMismatches
(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the number of mismatches between the read and the reference sequence provided.static int
countMismatches
(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence, boolean matchAmbiguousRef) static List<byte[]>
generateAllKmers
(int length) Generates all possible unambiguous kmers (upper-case) of length and returns them as byte[]s.static String
Returns all IUPAC codes as a stringstatic byte[]
getRandomBases
(Random random, int length) Returns an array of bytes containing random DNA basesstatic void
getRandomBases
(Random random, int length, byte[] bases) Fills an array of bytes with random DNA bases.static String
getSamReadNameFromFastqHeader
(String fastqHeader) Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file.static boolean
isBamReadBase
(byte base) Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY'static boolean
isBisulfiteConverted
(byte read, byte reference) static boolean
isBisulfiteConverted
(byte read, byte reference, boolean negativeStrand) Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand.static boolean
isIUPAC
(byte base) Checks if the given base is a IUPAC codestatic boolean
isNoCall
(byte base) returns true if the value of base represents a no callstatic boolean
isUpperACGTN
(byte base) Check if the given base is one of upper case ACGTNstatic boolean
isValidBase
(byte b) Returns true if the byte is in [acgtACGT].static String
makeCigarStringWithIndelPossibleClipping
(int alignmentStart, int readLength, int referenceSequenceLength, int indelPosition, int indelLength) Create a cigar string for a gapped alignment, which may have soft clipping at either endstatic String
makeCigarStringWithPossibleClipping
(int alignmentStart, int readLength, int referenceSequenceLength) Create a simple ungapped cigar string, which might have soft clipping at either endstatic byte[]
makeReferenceFromAlignment
(SAMRecord rec, boolean includeReferenceBasesForDeletions) Produce reference bases from an aligned SAMRecord with MD string and Cigar.static String
makeSoftClipCigar
(int clipLength) static String
md5DigestToString
(byte[] digest) Convets the result of an md5Digest to a stringstatic boolean
readBaseMatchesRefBaseWithAmbiguity
(byte readBase, byte refBase) Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from a reference sequence, using the reference code as a 'pattern' that the read base must match.static void
reverse
(byte[] array, int offset, int len) static void
reverseComplement
(byte[] bases) Reverses and complements the bases in place.static void
reverseComplement
(byte[] bases, int offset, int len) static String
reverseComplement
(String sequenceData) Calculate the reverse complement of the specified sequence (Stolen from Reseq)static void
reverseQualities
(byte[] quals) Reverses the quals in place.static int
sumQualitiesOfMismatches
(SAMRecord read, byte[] referenceBases) Calculates the sum of qualities for mismatched bases in the read.static int
sumQualitiesOfMismatches
(SAMRecord read, byte[] referenceBases, int referenceOffset) Calculates the sum of qualities for mismatched bases in the read.static int
sumQualitiesOfMismatches
(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the sum of qualities for mismatched bases in the read.static byte[]
toBamReadBasesInPlace
(byte[] bases) Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with Nstatic byte
upperCase
(byte base) static byte[]
upperCase
(byte[] bases)
-
Field Details
-
a
public static final byte aByte typed variables for all normal bases.- See Also:
-
c
public static final byte cByte typed variables for all normal bases.- See Also:
-
g
public static final byte gByte typed variables for all normal bases.- See Also:
-
t
public static final byte tByte typed variables for all normal bases.- See Also:
-
n
public static final byte nByte typed variables for all normal bases.- See Also:
-
A
public static final byte AByte typed variables for all normal bases.- See Also:
-
C
public static final byte CByte typed variables for all normal bases.- See Also:
-
G
public static final byte GByte typed variables for all normal bases.- See Also:
-
T
public static final byte TByte typed variables for all normal bases.- See Also:
-
N
public static final byte NByte typed variables for all normal bases.- See Also:
-
VALID_BASES_UPPER
public static final byte[] VALID_BASES_UPPER -
VALID_BASES_LOWER
public static final byte[] VALID_BASES_LOWER
-
-
Constructor Details
-
SequenceUtil
public SequenceUtil()
-
-
Method Details
-
reverseComplement
Calculate the reverse complement of the specified sequence (Stolen from Reseq)- Parameters:
sequenceData
-- Returns:
- reverse complement
-
basesEqual
public static boolean basesEqual(byte lhs, byte rhs) Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case), without considering the set relationships between ambiguous codes. -
readBaseMatchesRefBaseWithAmbiguity
public static boolean readBaseMatchesRefBaseWithAmbiguity(byte readBase, byte refBase) Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from a reference sequence, using the reference code as a 'pattern' that the read base must match. We take ambiguous codes into account, returning true if the set of possible bases represented by the read value is a (non-strict) subset of the possible bases represented by the reference value. Since the comparison is directional, make sure to pass read / ref codes in correct order. -
isNoCall
public static boolean isNoCall(byte base) returns true if the value of base represents a no call -
isValidBase
public static boolean isValidBase(byte b) Returns true if the byte is in [acgtACGT]. -
isUpperACGTN
public static boolean isUpperACGTN(byte base) Check if the given base is one of upper case ACGTN -
getIUPACCodesString
Returns all IUPAC codes as a string -
isIUPAC
public static boolean isIUPAC(byte base) Checks if the given base is a IUPAC code -
calculateGc
public static double calculateGc(byte[] bases) Calculates the fraction of bases that are G/C in the sequence -
isBamReadBase
public static boolean isBamReadBase(byte base) Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY' -
toBamReadBasesInPlace
public static byte[] toBamReadBasesInPlace(byte[] bases) Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with N -
assertSequenceListsEqual
default signature that forces the lists to be the same size- Parameters:
s1
- a list of sequence headerss2
- a second list of sequence headers
-
assertSequenceListsEqual
public static void assertSequenceListsEqual(List<SAMSequenceRecord> s1, List<SAMSequenceRecord> s2, boolean checkPrefixOnly) Throws an exception only if both (first) parameters are not null optionally check that one list is a (nonempty) prefix of the other.- Parameters:
s1
- a list of sequence headerss2
- a second list of sequence headerscheckPrefixOnly
- a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
-
areSequenceDictionariesEqual
public static boolean areSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2) Returns true if both parameters are null or equal, otherwise returns false- Parameters:
s1
- a list of sequence headerss2
- a second list of sequence headers
-
assertSequenceDictionariesEqual
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2) Throws an exception if both parameters are non-null and unequal.- Parameters:
s1
- a list of sequence headerss2
- a second list of sequence headers
-
assertSequenceDictionariesEqual
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, boolean checkPrefixOnly) Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).- Parameters:
s1
- a list of sequence headerss2
- a second list of sequence headerscheckPrefixOnly
- a flag specifying whether to only look at the first records in the lists. This will then check that the records of the smaller dictionary are equal to the records of the beginning of the larger dictionary, which can be useful since sometimes different pipelines choose to use only the first contigs of a standard reference.
-
assertSequenceDictionariesEqual
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, File f1, File f2) Throws an exception if both parameters are non-null and unequal, including the filenames. -
makeCigarStringWithPossibleClipping
public static String makeCigarStringWithPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength) Create a simple ungapped cigar string, which might have soft clipping at either end- Parameters:
alignmentStart
- raw aligment start, which may result in read hanging off beginning or end of read- Returns:
- cigar string that may have S operator at beginning or end, and has M operator for the rest of the read
-
makeCigarStringWithIndelPossibleClipping
public static String makeCigarStringWithIndelPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength, int indelPosition, int indelLength) Create a cigar string for a gapped alignment, which may have soft clipping at either end- Parameters:
alignmentStart
- raw alignment start, which may result in read hanging off beginning or end of readreadLength
-referenceSequenceLength
-indelPosition
- number of matching bases before indel. Must be > 0indelLength
- length of indel. Positive for insertion, negative for deletion.- Returns:
- cigar string that may have S operator at beginning or end, has one or two M operators, and an I or a D.
-
makeSoftClipCigar
-
countMismatches
Calculates the number of mismatches between the read and the reference sequence provided. -
countMismatches
Calculates the number of mismatches between the read and the reference sequence provided. -
countMismatches
public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the number of mismatches between the read and the reference sequence provided.- Parameters:
referenceBases
- Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.referenceOffset
- 0-based offset of the first element of referenceBases relative to the start of that reference sequence.bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
-
countMismatches
public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence, boolean matchAmbiguousRef) -
countMismatches
Calculates the number of mismatches between the read and the reference sequence provided.- Parameters:
referenceBases
- Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
-
sumQualitiesOfMismatches
Calculates the sum of qualities for mismatched bases in the read.- Parameters:
referenceBases
- Array of ASCII bytes in which the 0th position in the array corresponds to the first element of the reference sequence to which read is aligned.
-
sumQualitiesOfMismatches
public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset) Calculates the sum of qualities for mismatched bases in the read.- Parameters:
referenceBases
- Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.referenceOffset
- 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
-
sumQualitiesOfMismatches
public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the sum of qualities for mismatched bases in the read.- Parameters:
referenceBases
- Array of ASCII bytes that covers at least the the portion of the reference sequence to which read is aligned from getReferenceStart to getReferenceEnd.referenceOffset
- 0-based offset of the first element of referenceBases relative to the start of that reference sequence.bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
-
countInsertedBases
-
countDeletedBases
-
countInsertedBases
-
countDeletedBases
-
calculateSamNmTag
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method). -
calculateSamNmTag
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method).- Parameters:
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start of that reference sequence.
-
calculateSamNmTag
public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence) Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels) For the purposes for calculating mismatches, we do not yet support IUPAC ambiguous codes (seereadBaseMatchesRefBaseWithAmbiguity
method).- Parameters:
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start of that reference sequence.bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated and C->T on the positive strand and G->A on the negative strand will not be counted as mismatches.
-
calculateSamNmTagFromCigar
Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone. It may calculate incorrectly if ambiguous operators (Like M) are used. Needed for testing infrastructure: SAMRecordSetBuilder -
complement
public static byte complement(byte b) Returns the complement of a single byte. -
bisulfiteBasesEqual
public static boolean bisulfiteBasesEqual(boolean negativeStrand, byte read, byte reference) Returns true if the bases are equal OR if the mismatch can be accounted for by bisulfite treatment. C->T on the positive strand and G->A on the negative strand do not count as mismatches. -
bisulfiteBasesEqual
public static boolean bisulfiteBasesEqual(byte read, byte reference) -
bisulfiteBasesMatchWithAmbiguity
public static boolean bisulfiteBasesMatchWithAmbiguity(boolean negativeStrand, byte read, byte reference) Same as above, but usereadBaseMatchesRefBaseWithAmbiguity
instead ofbasesEqual
. Note thatisBisulfiteConverted
is not affected because it only applies when the reference base is non-ambiguous. -
isBisulfiteConverted
public static boolean isBisulfiteConverted(byte read, byte reference, boolean negativeStrand) Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand. -
isBisulfiteConverted
public static boolean isBisulfiteConverted(byte read, byte reference) -
makeReferenceFromAlignment
public static byte[] makeReferenceFromAlignment(SAMRecord rec, boolean includeReferenceBasesForDeletions) Produce reference bases from an aligned SAMRecord with MD string and Cigar.- Parameters:
rec
- Must contain non-empty CIGAR and MD attribute.includeReferenceBasesForDeletions
- If true, include reference bases that are deleted in the read. This will make the returned array not line up with the read if there are deletions.- Returns:
- References bases corresponding to the read. If there is an insertion in the read, reference contains '-'. If the read is soft-clipped, reference contains '0'. If there is a skipped region and includeReferenceBasesForDeletions==true, reference will have Ns for the skipped region.
-
reverseComplement
public static void reverseComplement(byte[] bases) Reverses and complements the bases in place. -
reverseQualities
public static void reverseQualities(byte[] quals) Reverses the quals in place. -
reverse
public static void reverse(byte[] array, int offset, int len) -
reverseComplement
public static void reverseComplement(byte[] bases, int offset, int len) -
calculateMD5String
-
calculateMD5String
-
md5DigestToString
Convets the result of an md5Digest to a string- Parameters:
digest
- digest that needs to be converted to a string- Returns:
- string representing the md5
-
calculateMD5
public static byte[] calculateMD5(byte[] data, int offset, int len) -
calculateMdAndNmTags
public static void calculateMdAndNmTags(SAMRecord record, byte[] ref, boolean calcMD, boolean calcNM) Calculate MD and NM similarly to Samtools, except that N->N is a match.- Parameters:
record
- Input record for which to calculate NM and MD. The appropriate tags will be added/updated in the recordref
- The reference bases for the sequence to which the record is mappedcalcMD
- A flag indicating whether to update the MD tag in the recordcalcNM
- A flag indicating whether to update the NM tag in the record
-
upperCase
public static byte upperCase(byte base) -
upperCase
public static byte[] upperCase(byte[] bases) -
generateAllKmers
Generates all possible unambiguous kmers (upper-case) of length and returns them as byte[]s. -
getSamReadNameFromFastqHeader
Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file. Any letters after the first space are ignored. Ths method also strips trailing "/1" or "/2" so that paired end reads have the same name.- Parameters:
fastqHeader
- the header from aFastqRecord
.- Returns:
- a read name appropriate for output in a SAM/BAM file.
-
getRandomBases
Returns an array of bytes containing random DNA bases- Parameters:
random
- ARandom
object to use for drawing randomw baseslength
- How many bases to return.- Returns:
- an array of random DNA bases of the requested length.
-
getRandomBases
Fills an array of bytes with random DNA bases. Will overwrite firstlength
byte with new bases. iflength
is less than the size of the input array, the remaining bases will not be modified.- Parameters:
random
- ARandom
object to use for drawing random baseslength
- How many bases to return.bases
- Array to use for bases (from index 0)
-