public class SequenceUtil
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class |
SequenceUtil.SequenceListsDifferException |
Modifier and Type | Field and Description |
---|---|
static byte |
a
Byte typed variables for all normal bases.
|
static byte |
A
Byte typed variables for all normal bases.
|
static byte |
c
Byte typed variables for all normal bases.
|
static byte |
C
Byte typed variables for all normal bases.
|
static byte |
g
Byte typed variables for all normal bases.
|
static byte |
G
Byte typed variables for all normal bases.
|
static byte |
n
Byte typed variables for all normal bases.
|
static byte |
N
Byte typed variables for all normal bases.
|
static byte |
t
Byte typed variables for all normal bases.
|
static byte |
T
Byte typed variables for all normal bases.
|
static byte[] |
VALID_BASES_LOWER |
static byte[] |
VALID_BASES_UPPER |
Constructor and Description |
---|
SequenceUtil() |
Modifier and Type | Method and Description |
---|---|
static boolean |
areSequenceDictionariesEqual(SAMSequenceDictionary s1,
SAMSequenceDictionary s2)
Returns true if both parameters are null or equal, otherwise returns false
|
static void |
assertSequenceDictionariesEqual(SAMSequenceDictionary s1,
SAMSequenceDictionary s2)
Throws an exception if both parameters are non-null and unequal.
|
static void |
assertSequenceDictionariesEqual(SAMSequenceDictionary s1,
SAMSequenceDictionary s2,
boolean checkPrefixOnly)
Throws an exception if both (first) parameters are non-null and unequal (if checkPrefixOnly, checks prefix of lists only).
|
static void |
assertSequenceDictionariesEqual(SAMSequenceDictionary s1,
SAMSequenceDictionary s2,
java.io.File f1,
java.io.File f2)
Throws an exception if both parameters are non-null and unequal, including the filenames.
|
static void |
assertSequenceListsEqual(java.util.List<SAMSequenceRecord> s1,
java.util.List<SAMSequenceRecord> s2)
Default signature that forces the lists to be the same size.
|
static void |
assertSequenceListsEqual(java.util.List<SAMSequenceRecord> s1,
java.util.List<SAMSequenceRecord> s2,
boolean checkPrefixOnly)
Throws an exception only if both (first) parameters are not null and the lists differ;
with checkPrefixOnly, checks only that one list is a (nonempty) prefix of the other.
|
static boolean |
basesEqual(byte lhs,
byte rhs)
Efficiently compare two IUPAC base codes, simply returning true if they are equal (ignoring case),
without considering the set relationships between ambiguous codes.
|
static boolean |
bisulfiteBasesEqual(boolean negativeStrand,
byte read,
byte reference)
Returns true if the bases are equal OR if the mismatch can be accounted for by
bisulfite treatment.
|
static boolean |
bisulfiteBasesEqual(byte read,
byte reference) |
static boolean |
bisulfiteBasesMatchWithAmbiguity(boolean negativeStrand,
byte read,
byte reference)
Same as bisulfiteBasesEqual, but uses
readBaseMatchesRefBaseWithAmbiguity instead of basesEqual. |
static double |
calculateGc(byte[] bases)
Calculates the fraction of bases that are G/C in the sequence
|
static byte[] |
calculateMD5(byte[] data,
int offset,
int len) |
static java.lang.String |
calculateMD5String(byte[] data) |
static java.lang.String |
calculateMD5String(byte[] data,
int offset,
int len) |
static void |
calculateMdAndNmTags(SAMRecord record,
byte[] ref,
boolean calcMD,
boolean calcNM)
Calculate MD and NM similarly to Samtools, except that N->N is a match.
|
static int |
calculateSamNmTag(SAMRecord read,
byte[] referenceBases)
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see
readBaseMatchesRefBaseWithAmbiguity method). |
static int |
calculateSamNmTag(SAMRecord read,
byte[] referenceBases,
int referenceOffset)
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see
readBaseMatchesRefBaseWithAmbiguity method). |
static int |
calculateSamNmTag(SAMRecord read,
byte[] referenceBases,
int referenceOffset,
boolean bisulfiteSequence)
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see
readBaseMatchesRefBaseWithAmbiguity method). |
static int |
calculateSamNmTagFromCigar(SAMRecord record)
Attempts to calculate the predefined NM tag from the SAM spec using the cigar string alone.
|
static byte |
complement(byte b)
Returns the complement of a single byte.
|
static int |
countDeletedBases(Cigar cigar) |
static int |
countDeletedBases(SAMRecord read) |
static int |
countInsertedBases(Cigar cigar) |
static int |
countInsertedBases(SAMRecord read) |
static int |
countMismatches(SAMRecord read,
byte[] referenceBases)
Calculates the number of mismatches between the read and the reference sequence provided.
|
static int |
countMismatches(SAMRecord read,
byte[] referenceBases,
boolean bisulfiteSequence)
Calculates the number of mismatches between the read and the reference sequence provided.
|
static int |
countMismatches(SAMRecord read,
byte[] referenceBases,
int referenceOffset)
Calculates the number of mismatches between the read and the reference sequence provided.
|
static int |
countMismatches(SAMRecord read,
byte[] referenceBases,
int referenceOffset,
boolean bisulfiteSequence)
Calculates the number of mismatches between the read and the reference sequence provided.
|
static int |
countMismatches(SAMRecord read,
byte[] referenceBases,
int referenceOffset,
boolean bisulfiteSequence,
boolean matchAmbiguousRef) |
static java.util.List<byte[]> |
generateAllKmers(int length)
Generates all possible unambiguous kmers (upper-case) of the given length and returns them as byte[]s.
|
static java.lang.String |
getIUPACCodesString()
Returns all IUPAC codes as a string
|
static java.lang.String |
getSamReadNameFromFastqHeader(java.lang.String fastqHeader)
Returns a read name from a FASTQ header string suitable for use in a SAM/BAM file.
|
static boolean |
isBamReadBase(byte base)
Check if the given base belongs to BAM read base set '=ABCDGHKMNRSTVWY'
|
static boolean |
isBisulfiteConverted(byte read,
byte reference) |
static boolean |
isBisulfiteConverted(byte read,
byte reference,
boolean negativeStrand)
Checks for bisulfite conversion, C->T on the positive strand and G->A on the negative strand.
|
static boolean |
isIUPAC(byte base)
Checks if the given base is an IUPAC code
|
static boolean |
isNoCall(byte base)
Returns true if the value of base represents a no call
|
static boolean |
isUpperACGTN(byte base)
Check if the given base is one of upper case ACGTN
|
static boolean |
isValidBase(byte b)
Returns true if the byte is in [acgtACGT].
|
static java.lang.String |
makeCigarStringWithIndelPossibleClipping(int alignmentStart,
int readLength,
int referenceSequenceLength,
int indelPosition,
int indelLength)
Create a cigar string for a gapped alignment, which may have soft clipping at either end
|
static java.lang.String |
makeCigarStringWithPossibleClipping(int alignmentStart,
int readLength,
int referenceSequenceLength)
Create a simple ungapped cigar string, which might have soft clipping at either end
|
static byte[] |
makeReferenceFromAlignment(SAMRecord rec,
boolean includeReferenceBasesForDeletions)
Produce reference bases from an aligned SAMRecord with MD string and Cigar.
|
static java.lang.String |
makeSoftClipCigar(int clipLength) |
static boolean |
readBaseMatchesRefBaseWithAmbiguity(byte readBase,
byte refBase)
Efficiently compare two IUPAC base codes, one coming from a read sequence and the other coming from
a reference sequence, using the reference code as a 'pattern' that the read base must match.
|
static void |
reverse(byte[] array,
int offset,
int len) |
static void |
reverseComplement(byte[] bases)
Reverses and complements the bases in place.
|
static void |
reverseComplement(byte[] bases,
int offset,
int len) |
static java.lang.String |
reverseComplement(java.lang.String sequenceData)
Calculate the reverse complement of the specified sequence
(Stolen from Reseq)
|
static void |
reverseQualities(byte[] quals)
Reverses the quals in place.
|
static int |
sumQualitiesOfMismatches(SAMRecord read,
byte[] referenceBases)
Calculates the sum of qualities for mismatched bases in the read.
|
static int |
sumQualitiesOfMismatches(SAMRecord read,
byte[] referenceBases,
int referenceOffset)
Calculates the sum of qualities for mismatched bases in the read.
|
static int |
sumQualitiesOfMismatches(SAMRecord read,
byte[] referenceBases,
int referenceOffset,
boolean bisulfiteSequence)
Calculates the sum of qualities for mismatched bases in the read.
|
static byte[] |
toBamReadBasesInPlace(byte[] bases)
Update and return the given array of bases by upper casing and then replacing all non-BAM read bases with N
|
static byte |
upperCase(byte base) |
static byte[] |
upperCase(byte[] bases) |
public static final byte a
public static final byte c
public static final byte g
public static final byte t
public static final byte n
public static final byte A
public static final byte C
public static final byte G
public static final byte T
public static final byte N
public static final byte[] VALID_BASES_UPPER
public static final byte[] VALID_BASES_LOWER
public static java.lang.String reverseComplement(java.lang.String sequenceData)
sequenceData
- the sequence to reverse-complement
public static boolean basesEqual(byte lhs, byte rhs)
public static boolean readBaseMatchesRefBaseWithAmbiguity(byte readBase, byte refBase)
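The difference between plain equality (basesEqual) and pattern matching (readBaseMatchesRefBaseWithAmbiguity) can be illustrated with a small stand-alone sketch. It is not the library's implementation. The class name, the bit-mask table, and the rule that a read base matches when its set of possible bases is a subset of the reference's set are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the SequenceUtil implementation.
final class IupacMatchSketch {
    // One bit per unambiguous base: A=1, C=2, G=4, T=8.
    private static final Map<Character, Integer> MASKS = new HashMap<>();
    static {
        String codes = "ACGTRYSWKMBDHVN";
        int[] masks = {1, 2, 4, 8, 5, 10, 6, 9, 12, 3, 14, 13, 11, 7, 15};
        for (int i = 0; i < codes.length(); i++) {
            MASKS.put(codes.charAt(i), masks[i]);
        }
    }

    private static char upper(byte b) {
        return Character.toUpperCase((char) (b & 0xFF));
    }

    // Plain comparison: equal codes (ignoring case), no set relationships.
    static boolean basesEqualIgnoringCase(byte lhs, byte rhs) {
        return upper(lhs) == upper(rhs);
    }

    // Assumed rule: the reference code is a pattern, and the read matches
    // when every base it may represent is allowed by the reference code.
    static boolean readMatchesRef(byte readBase, byte refBase) {
        Integer read = MASKS.get(upper(readBase));
        Integer ref = MASKS.get(upper(refBase));
        if (read == null || ref == null) {
            return false; // unknown code: treated as a mismatch in this sketch
        }
        return (read & ~ref) == 0;
    }

    public static void main(String[] args) {
        System.out.println(readMatchesRef((byte) 'A', (byte) 'R')); // true: R allows A or G
        System.out.println(readMatchesRef((byte) 'R', (byte) 'A')); // false: R may also be G
        System.out.println(basesEqualIgnoringCase((byte) 'a', (byte) 'A')); // true
    }
}
```

Under this reading the relation is asymmetric: an ambiguous reference accepts a specific read base, but an ambiguous read base does not match a specific reference base.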
public static boolean isNoCall(byte base)
public static boolean isValidBase(byte b)
public static boolean isUpperACGTN(byte base)
public static java.lang.String getIUPACCodesString()
public static boolean isIUPAC(byte base)
public static double calculateGc(byte[] bases)
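As a rough illustration of what "fraction of bases that are G/C" means, the following stand-alone sketch counts G and C bytes and divides by the array length. The class name is hypothetical, and two behaviours are assumptions rather than documented facts: lower-case g/c are counted, and an empty array yields 0.0.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the SequenceUtil implementation.
final class GcFractionSketch {
    static double gcFraction(byte[] bases) {
        if (bases.length == 0) {
            return 0.0; // assumption: avoid dividing by zero
        }
        int gc = 0;
        for (byte b : bases) {
            // assumption: case-insensitive comparison
            if (b == 'G' || b == 'C' || b == 'g' || b == 'c') {
                gc++;
            }
        }
        return gc / (double) bases.length;
    }

    public static void main(String[] args) {
        byte[] bases = "GCAT".getBytes(StandardCharsets.US_ASCII);
        System.out.println(gcFraction(bases)); // 0.5
    }
}
```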
public static boolean isBamReadBase(byte base)
public static byte[] toBamReadBasesInPlace(byte[] bases)
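The summary above says toBamReadBasesInPlace upper-cases the array and replaces every byte outside the BAM read base set '=ABCDGHKMNRSTVWY' with N. A minimal stand-alone sketch of that normalisation, using a hypothetical class name and plain ASCII arithmetic, might look like this:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the SequenceUtil implementation.
final class BamReadBasesSketch {
    private static final String BAM_READ_BASES = "=ABCDGHKMNRSTVWY";

    static boolean isBamReadBase(byte base) {
        return BAM_READ_BASES.indexOf(base) >= 0;
    }

    // Modifies the array in place and returns the same array.
    static byte[] toBamReadBasesInPlace(byte[] bases) {
        for (int i = 0; i < bases.length; i++) {
            byte b = bases[i];
            if (b >= 'a' && b <= 'z') {
                b = (byte) (b - 32); // ASCII upper-casing
            }
            bases[i] = isBamReadBase(b) ? b : (byte) 'N';
        }
        return bases;
    }

    public static void main(String[] args) {
        byte[] bases = "acgu.x=".getBytes(StandardCharsets.US_ASCII);
        toBamReadBasesInPlace(bases);
        System.out.println(new String(bases, StandardCharsets.US_ASCII)); // ACGNNN=
    }
}
```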
public static void assertSequenceListsEqual(java.util.List<SAMSequenceRecord> s1, java.util.List<SAMSequenceRecord> s2)
s1
- a list of sequence headers
s2
- a second list of sequence headers
public static void assertSequenceListsEqual(java.util.List<SAMSequenceRecord> s1, java.util.List<SAMSequenceRecord> s2, boolean checkPrefixOnly)
s1
- a list of sequence headers
s2
- a second list of sequence headers
checkPrefixOnly
- a flag specifying whether to only look at the first records in the lists. This will then check that the
records of the smaller dictionary are equal to the records at the beginning of the larger dictionary, which can be useful since
different pipelines sometimes choose to use only the first contigs of a standard reference.
public static boolean areSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2)
s1
- a list of sequence headers
s2
- a second list of sequence headers
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2)
s1
- a list of sequence headers
s2
- a second list of sequence headers
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, boolean checkPrefixOnly)
s1
- a list of sequence headers
s2
- a second list of sequence headers
checkPrefixOnly
- a flag specifying whether to only look at the first records in the lists. This will then check that the
records of the smaller dictionary are equal to the records at the beginning of the larger dictionary, which can be useful since
different pipelines sometimes choose to use only the first contigs of a standard reference.
public static void assertSequenceDictionariesEqual(SAMSequenceDictionary s1, SAMSequenceDictionary s2, java.io.File f1, java.io.File f2)
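The checkPrefixOnly comparison described above can be sketched without the library types. In this illustration each sequence header is reduced to a hypothetical "name:length" string, and the method returns a boolean instead of throwing, so it shows the comparison rule only. Treating an empty shared prefix as a failure follows the "(nonempty) prefix" wording in the method summary.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the SequenceUtil implementation.
final class SequenceListPrefixSketch {
    static boolean listsMatch(List<String> s1, List<String> s2, boolean checkPrefixOnly) {
        if (s1 == null || s2 == null) {
            return true; // the check applies only when both lists are present
        }
        if (!checkPrefixOnly && s1.size() != s2.size()) {
            return false; // default behaviour: lists must be the same size
        }
        int shared = Math.min(s1.size(), s2.size());
        if (checkPrefixOnly && shared == 0) {
            return false; // the prefix must be nonempty
        }
        for (int i = 0; i < shared; i++) {
            if (!s1.get(i).equals(s2.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> full = Arrays.asList("chr1:1000", "chr2:900", "chr3:800");
        List<String> subset = Arrays.asList("chr1:1000", "chr2:900");
        System.out.println(listsMatch(full, subset, true));  // true
        System.out.println(listsMatch(full, subset, false)); // false
    }
}
```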
public static java.lang.String makeCigarStringWithPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength)
alignmentStart
- raw alignment start, which may result in the read hanging off the beginning or end of the reference
public static java.lang.String makeCigarStringWithIndelPossibleClipping(int alignmentStart, int readLength, int referenceSequenceLength, int indelPosition, int indelLength)
alignmentStart
- raw alignment start, which may result in the read hanging off the beginning or end of the reference
readLength
- length of the read
referenceSequenceLength
- length of the reference sequence
indelPosition
- number of matching bases before indel. Must be > 0
indelLength
- length of indel. Positive for insertion, negative for deletion.
public static java.lang.String makeSoftClipCigar(int clipLength)
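To make the "possible clipping" idea concrete, here is a stand-alone sketch of how an ungapped cigar string could be derived when the raw alignment start leaves part of the read off either end of the reference. It is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, alignmentStart is taken to be 1-based, and at least one read base is assumed to overlap the reference.

```java
// Hypothetical sketch, not the SequenceUtil implementation.
final class ClippedCigarSketch {
    static String cigarWithPossibleClipping(int alignmentStart, int readLength,
                                            int referenceSequenceLength) {
        // Bases before reference position 1 are soft clipped (1-based assumption).
        int leftClip = Math.max(0, 1 - alignmentStart);
        // Bases past the last reference position are soft clipped.
        int alignmentEnd = alignmentStart + readLength - 1;
        int rightClip = Math.max(0, alignmentEnd - referenceSequenceLength);
        int matched = readLength - leftClip - rightClip;

        StringBuilder cigar = new StringBuilder();
        if (leftClip > 0) {
            cigar.append(leftClip).append('S');
        }
        if (matched > 0) {
            cigar.append(matched).append('M');
        }
        if (rightClip > 0) {
            cigar.append(rightClip).append('S');
        }
        return cigar.toString();
    }

    public static void main(String[] args) {
        System.out.println(cigarWithPossibleClipping(-1, 10, 100)); // 2S8M
        System.out.println(cigarWithPossibleClipping(95, 10, 100)); // 6M4S
        System.out.println(cigarWithPossibleClipping(10, 10, 100)); // 10M
    }
}
```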
public static int countMismatches(SAMRecord read, byte[] referenceBases)
public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
referenceBases
- Array of ASCII bytes that covers at least the portion of the reference sequence
to which read is aligned from getReferenceStart to getReferenceEnd.
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start
of that reference sequence.
bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated
and C->T on the positive strand and G->A on the negative strand will not be counted
as mismatches.
public static int countMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence, boolean matchAmbiguousRef)
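The referenceOffset parameter is easiest to understand with a worked example. The sketch below counts mismatches for a single ungapped alignment block. It is not the library's method, which works on a SAMRecord. The class name, the raw byte[] read, and the 1-based alignmentStart argument are assumptions for illustration, and indels, clipping, and bisulfite handling are deliberately left out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the SequenceUtil implementation.
final class MismatchCountSketch {
    private static byte upper(byte b) {
        return (b >= 'a' && b <= 'z') ? (byte) (b - 32) : b;
    }

    static int countMismatches(byte[] readBases, int alignmentStart,
                               byte[] referenceBases, int referenceOffset) {
        // referenceBases[0] sits at 0-based reference position referenceOffset,
        // so a 1-based alignment start maps to this array index:
        int first = (alignmentStart - 1) - referenceOffset;
        int mismatches = 0;
        for (int i = 0; i < readBases.length; i++) {
            if (upper(readBases[i]) != upper(referenceBases[first + i])) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Window covering 1-based reference positions 101..108.
        byte[] window = "ACGTACGT".getBytes(StandardCharsets.US_ASCII);
        byte[] read = "GTAA".getBytes(StandardCharsets.US_ASCII);
        // The read starts at position 103, so it is compared against "GTAC".
        System.out.println(countMismatches(read, 103, window, 100)); // 1
    }
}
```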
public static int countMismatches(SAMRecord read, byte[] referenceBases, boolean bisulfiteSequence)
referenceBases
- Array of ASCII bytes that covers at least the portion of the reference sequence
to which read is aligned from getReferenceStart to getReferenceEnd.
bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated
and C->T on the positive strand and G->A on the negative strand will not be counted
as mismatches.
public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases)
referenceBases
- Array of ASCII bytes in which the 0th position in the array corresponds
to the first element of the reference sequence to which read is aligned.
public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset)
referenceBases
- Array of ASCII bytes that covers at least the portion of the reference sequence
to which read is aligned from getReferenceStart to getReferenceEnd.
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start
of that reference sequence.
public static int sumQualitiesOfMismatches(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)
referenceBases
- Array of ASCII bytes that covers at least the portion of the reference sequence
to which read is aligned from getReferenceStart to getReferenceEnd.
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start
of that reference sequence.
bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated
and C->T on the positive strand and G->A on the negative strand will not be counted
as mismatches.
public static int countInsertedBases(Cigar cigar)
public static int countDeletedBases(Cigar cigar)
public static int countInsertedBases(SAMRecord read)
public static int countDeletedBases(SAMRecord read)
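Counting inserted and deleted bases amounts to summing the lengths of the I and D elements of a cigar. The stand-alone sketch below does this on a cigar string rather than on the Cigar or SAMRecord types used by the methods above. The class name and the regex-based parsing are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the SequenceUtil implementation.
final class CigarIndelCountSketch {
    // A cigar element is a length followed by one operator character.
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    static int countBases(String cigar, char operator) {
        int total = 0;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            if (m.group(2).charAt(0) == operator) {
                total += Integer.parseInt(m.group(1));
            }
        }
        return total;
    }

    static int countInsertedBases(String cigar) {
        return countBases(cigar, 'I');
    }

    static int countDeletedBases(String cigar) {
        return countBases(cigar, 'D');
    }

    public static void main(String[] args) {
        String cigar = "10M2I5M3D20M1I";
        System.out.println(countInsertedBases(cigar)); // 3
        System.out.println(countDeletedBases(cigar));  // 3
    }
}
```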
public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases)
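Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see readBaseMatchesRefBaseWithAmbiguity method).
public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset)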
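Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see readBaseMatchesRefBaseWithAmbiguity method).
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start
of that reference sequence.
public static int calculateSamNmTag(SAMRecord read, byte[] referenceBases, int referenceOffset, boolean bisulfiteSequence)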
Calculates the predefined NM tag from the SAM spec: (# of mismatches + # of indels).
For the purposes of calculating mismatches, we do not yet support IUPAC ambiguous codes
(see readBaseMatchesRefBaseWithAmbiguity method).
referenceOffset
- 0-based offset of the first element of referenceBases relative to the start
of that reference sequence.
bisulfiteSequence
- If this is true, it is assumed that the reads were bisulfite treated
and C->T on the positive strand and G->A on the negative strand will not be counted
as mismatches.
public static int calculateSamNmTagFromCigar(SAMRecord record)
public static byte complement(byte b)
public static boolean bisulfiteBasesEqual(boolean negativeStrand, byte read, byte reference)
public static boolean bisulfiteBasesEqual(byte read, byte reference)
public static boolean bisulfiteBasesMatchWithAmbiguity(boolean negativeStrand, byte read, byte reference)
Same as bisulfiteBasesEqual, but uses readBaseMatchesRefBaseWithAmbiguity instead of basesEqual.
Note that isBisulfiteConverted is not affected because it only applies when the
reference base is non-ambiguous.
public static boolean isBisulfiteConverted(byte read, byte reference, boolean negativeStrand)
public static boolean isBisulfiteConverted(byte read, byte reference)
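The bisulfite rules stated in the method summaries (C->T on the positive strand, G->A on the negative strand, and a match if the bases are equal or the mismatch is explained by conversion) can be written out as a short stand-alone sketch. The class name and the case-insensitive comparison are assumptions; this is an illustration of the rule, not the library's code.

```java
// Hypothetical sketch, not the SequenceUtil implementation.
final class BisulfiteSketch {
    private static boolean same(byte a, char b) {
        return Character.toUpperCase((char) (a & 0xFF)) == b;
    }

    // True when the read base is what bisulfite treatment would turn
    // the reference base into on the given strand.
    static boolean isBisulfiteConverted(byte read, byte reference, boolean negativeStrand) {
        if (negativeStrand) {
            return same(reference, 'G') && same(read, 'A');
        }
        return same(reference, 'C') && same(read, 'T');
    }

    // Equal bases match, and so do mismatches explained by conversion.
    static boolean bisulfiteBasesEqual(boolean negativeStrand, byte read, byte reference) {
        boolean equal = Character.toUpperCase((char) (read & 0xFF))
                == Character.toUpperCase((char) (reference & 0xFF));
        return equal || isBisulfiteConverted(read, reference, negativeStrand);
    }

    public static void main(String[] args) {
        System.out.println(bisulfiteBasesEqual(false, (byte) 'T', (byte) 'C')); // true
        System.out.println(bisulfiteBasesEqual(true, (byte) 'T', (byte) 'C'));  // false
        System.out.println(bisulfiteBasesEqual(true, (byte) 'A', (byte) 'G'));  // true
    }
}
```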
public static byte[] makeReferenceFromAlignment(SAMRecord rec, boolean includeReferenceBasesForDeletions)
rec
- Must contain non-empty CIGAR and MD attribute.
includeReferenceBasesForDeletions
- If true, include reference bases that are deleted in the read.
This will make the returned array not line up with the read if there are deletions.
public static void reverseComplement(byte[] bases)
public static void reverseQualities(byte[] quals)
public static void reverse(byte[] array, int offset, int len)
public static void reverseComplement(byte[] bases, int offset, int len)
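An in-place reverse complement over a sub-range of an array can be done with two indices moving toward each other, complementing as they swap. The sketch below illustrates that approach with hypothetical names. Leaving bytes other than A, C, G, T (in either case) unchanged is an assumption, not documented behaviour.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the SequenceUtil implementation.
final class ReverseComplementSketch {
    static byte complement(byte b) {
        switch (b) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            case 'a': return 't';
            case 'c': return 'g';
            case 'g': return 'c';
            case 't': return 'a';
            default:  return b; // assumption: other codes are left as-is
        }
    }

    // Reverse-complements bases[offset .. offset + len - 1] in place.
    static void reverseComplement(byte[] bases, int offset, int len) {
        int lo = offset;
        int hi = offset + len - 1;
        while (lo < hi) {
            byte tmp = complement(bases[lo]);
            bases[lo] = complement(bases[hi]);
            bases[hi] = tmp;
            lo++;
            hi--;
        }
        if (lo == hi) {
            bases[lo] = complement(bases[lo]); // middle base of an odd-length range
        }
    }

    public static void main(String[] args) {
        byte[] bases = "ACGTN".getBytes(StandardCharsets.US_ASCII);
        reverseComplement(bases, 0, bases.length);
        System.out.println(new String(bases, StandardCharsets.US_ASCII)); // NACGT
    }
}
```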
public static java.lang.String calculateMD5String(byte[] data) throws java.security.NoSuchAlgorithmException
java.security.NoSuchAlgorithmException
public static java.lang.String calculateMD5String(byte[] data, int offset, int len)
public static byte[] calculateMD5(byte[] data, int offset, int len)
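For orientation, an MD5 digest over a slice of a byte array can be computed with the standard java.security.MessageDigest API, whose getInstance call is the source of the NoSuchAlgorithmException in the signature above. The sketch below uses hypothetical names and wraps that checked exception. Rendering the digest as a zero-padded, lower-case, 32-character hex string is an assumption about the string form, not something this page states.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch, not the SequenceUtil implementation.
final class Md5Sketch {
    static byte[] md5(byte[] data, int offset, int len) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            digest.update(data, offset, len); // hash only the requested slice
            return digest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    static String md5Hex(byte[] data, int offset, int len) {
        // %032x keeps leading zeros that BigInteger would otherwise drop.
        return String.format("%032x", new BigInteger(1, md5(data, offset, len)));
    }

    public static void main(String[] args) {
        // MD5 of the empty input is a well-known constant.
        System.out.println(md5Hex(new byte[0], 0, 0)); // d41d8cd98f00b204e9800998ecf8427e
    }
}
```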
public static void calculateMdAndNmTags(SAMRecord record, byte[] ref, boolean calcMD, boolean calcNM)
record
- Input record for which to calculate NM and MD.
The appropriate tags will be added/updated in the record.
ref
- The reference bases for the sequence to which the record is mapped.
calcMD
- A flag indicating whether to update the MD tag in the record.
calcNM
- A flag indicating whether to update the NM tag in the record.
public static byte upperCase(byte base)
public static byte[] upperCase(byte[] bases)
public static java.util.List<byte[]> generateAllKmers(int length)
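There are 4^length unambiguous kmers of a given length, and one simple way to enumerate them is to treat each index from 0 to 4^length - 1 as a base-4 number whose digits select A, C, G, or T. The sketch below illustrates this with hypothetical names. The ordering (AAA first, TTT last) and the upper bound on length are choices made for the example, not documented properties of generateAllKmers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the SequenceUtil implementation.
final class KmerSketch {
    private static final byte[] BASES = {'A', 'C', 'G', 'T'};

    static List<byte[]> generateAllKmers(int length) {
        if (length < 0 || length > 12) {
            // 4^12 is already about 16.8 million arrays; keep the sketch bounded.
            throw new IllegalArgumentException("length must be between 0 and 12");
        }
        int total = 1 << (2 * length); // 4^length
        List<byte[]> kmers = new ArrayList<>(total);
        for (int index = 0; index < total; index++) {
            byte[] kmer = new byte[length];
            int remaining = index;
            // Fill from the right so the last position varies fastest.
            for (int pos = length - 1; pos >= 0; pos--) {
                kmer[pos] = BASES[remaining & 3];
                remaining >>= 2;
            }
            kmers.add(kmer);
        }
        return kmers;
    }

    public static void main(String[] args) {
        List<byte[]> kmers = generateAllKmers(3);
        System.out.println(kmers.size());             // 64
        System.out.println(new String(kmers.get(0))); // AAA
        System.out.println(new String(kmers.get(1))); // AAC
    }
}
```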
public static java.lang.String getSamReadNameFromFastqHeader(java.lang.String fastqHeader)
fastqHeader
- the header from a FastqRecord.