Package htsjdk.samtools
Class SAMUtils
java.lang.Object
htsjdk.samtools.SAMUtils
Utility methods.
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and TypeMethodDescriptionstatic String
calculateOATagValue
(SAMRecord record) Function to create the OA tag value from a record.static String
calculateReadGroupRecordChecksum
(File input, File referenceFasta) Calculate a hash code from identifying information in the RG (read group) records in a SAM file's header.static void
chainSAMProgramRecord
(SAMFileHeader header, SAMProgramRecord program) Chains program in front of the first "head" item in the list of SAMProgramRecords in header.static boolean
cigarMapsNoBasesToRef
(Cigar cigar) Determines if a cigar has any element that both consumes read bases and consumes reference bases (e.g. is not all soft-clipped).static SAMRecord
clipOverlappingAlignedBases
(SAMRecord record, boolean noSideEffects) Returns a (possibly new) record that has been clipped if the input is a mapped, paired read with bases that overlap its mate.static SAMRecord
clipOverlappingAlignedBases
(SAMRecord record, int numOverlappingBasesToClip, boolean noSideEffects) Returns a (possibly new) SAMRecord with the given number of bases soft-clipped at the end of the read if it is a mapped, paired read with bases that overlap its mate.static int
combineMapqs
(int m1, int m2) Hokey algorithm for combining two MAPQs into values that are comparable, being cognizant of the fact that in MAPQ world, 1 > 255 > 0.static int
compareMapqs
(int mapq1, int mapq2) static byte[]
compressedBasesToBytes
(int length, byte[] compressedBases, int compressedOffset) Convert from a byte array with bases stored in nybbles, with, for example, =, A, C, G, T, N represented as 0, 1, 2, 4, 8, 15, to a byte array containing =AaCcGgTtNn represented as ASCII.static void
fastqToPhred
(byte[] fastq) Converts printable qualities in Sanger fastq format to binary phred scores.static int
fastqToPhred
(char ch) Convert a single printable ASCII FASTQ format phred score to binary phred score.static byte[]
fastqToPhred
(String fastq) Convert a string with phred scores in printable ASCII FASTQ format to an array of binary phred scores.static long
findVirtualOffsetOfFirstRecordInBam
(SeekableStream seekableStream) Returns the virtual file offset of the first record in a BAM file, i.e. the virtual file offset after skipping over the text header and the sequence records.static long
findVirtualOffsetOfFirstRecordInBam
(File bamFile) Returns the virtual file offset of the first record in a BAM file, i.e. the virtual file offset after skipping over the text header and the sequence records.static long
findVirtualOffsetOfFirstRecordInBam
(Path bamFile) static List<AlignmentBlock>
getAlignmentBlocks
(Cigar cigar, int alignmentStart, String cigarTypeName) Given a Cigar, returns blocks of the sequence that have been aligned directly to the reference sequence.static String
getCanonicalRecordName
(SAMRecord record) Returns a string that is the read group ID and read name separated by a colon.static List<AlignmentBlock>
static int
This method uses the MateCigar value as determined from the attribute MC.static Cigar
getMateCigar
(SAMRecord rec) Returns the Mate Cigar or null if there is none.static Cigar
getMateCigar
(SAMRecord rec, boolean withValidation) Returns the Mate Cigar or null if there is none.static int
static String
Returns the Mate Cigar String as stored in the attribute 'MC'.static int
static int
static int
Returns the number of bases that need to be clipped due to overlapping pairs.getOtherCanonicalAlignments
(SAMRecord record) Extract a List of 'other canonical alignments' from a SAM record.static int
getUnclippedEnd
(int alignmentEnd, Cigar cigar) static int
getUnclippedStart
(int alignmentStart, Cigar cigar) static boolean
hasMateCigar
(SAMRecord rec) Checks to see if it is valid for this record to have a mate CIGAR (MC) and then if there is a mate CIGAR available.static boolean
See if any tags pertaining to original mapping information have been set.static boolean
Deprecated. Because the method does the exact opposite of what it says.static boolean
Checks if the reference sequence is incompatible with the BAI indexing format.static boolean
isValidUnsignedIntegerAttribute
(long value) Checks if a long attribute value is within the allowed range of a 32-bit unsigned integer.static void
Strip mapping information from a SAMRecord.static void
Strip mapping information from a SAMRecord, but preserve it in the 'O' tags if it isn't already set.static String
phredToFastq
(byte[] data) Convert an array of bytes, in which each byte is a binary phred quality score, to printable ASCII representation of the quality scores, ala FASTQ format.static String
phredToFastq
(byte[] buffer, int offset, int length) Convert an array of bytes, in which each byte is a binary phred quality score, to printable ASCII representation of the quality scores, ala FASTQ format.static char
phredToFastq
(int phredScore) Convert a single binary phred score to printable ASCII representation, ala FASTQ format.static void
processValidationError
(SAMValidationError validationError, ValidationStringency validationStringency) static void
processValidationErrors
(List<SAMValidationError> validationErrors, long samRecordIndex, ValidationStringency validationStringency) Handle a list of validation errors according to the validation stringency.static boolean
Tests if the provided record is mapped entirely beyond the end of the reference (i.e., the alignment start is greater than the length of the sequence to which the record is mapped).static List<SAMValidationError>
validateCigar
(SAMRecord rec, Cigar cigar, Integer referenceIndex, List<AlignmentBlock> alignmentBlocks, long recordNumber, String cigarTypeName) Run all validations of the given CIGAR.static List<SAMValidationError>
validateMateCigar
(SAMRecord rec, long recordNumber) Run all validations of the mate's CIGAR.
-
Field Details
-
MAX_PHRED_SCORE
public static final int MAX_PHRED_SCORE
-
-
Constructor Details
-
SAMUtils
public SAMUtils()
-
-
Method Details
-
compressedBasesToBytes
public static byte[] compressedBasesToBytes(int length, byte[] compressedBases, int compressedOffset)
Convert from a byte array with bases stored in nybbles, with, for example, =, A, C, G, T, N represented as 0, 1, 2, 4, 8, 15, to a byte array containing =AaCcGgTtNn represented as ASCII.
Parameters:
length - Number of bases (not bytes) to convert.
compressedBases - Bases represented as nybbles, in BAM binary format.
compressedOffset - Byte offset in compressedBases to start.
Returns:
New byte array with bases as ASCII bytes.
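The nybble decoding described above can be sketched without htsjdk. This is a minimal illustration, not the library's implementation: the class and method names are hypothetical, and the full 16-entry code table is taken from the BAM format's 4-bit base encoding (the text above lists only six of the codes). It assumes the first base of each pair sits in the high nybble.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not part of htsjdk.
final class NybbleBasesSketch {
    // Index = 4-bit code. '=' is 0, A is 1, C is 2, G is 4, T is 8, N is 15.
    private static final byte[] CODE_TO_BASE =
            "=ACMGRSVTWYHKDBN".getBytes(StandardCharsets.US_ASCII);

    // Decodes `length` bases starting at byte `offset` of `packed`.
    static byte[] decode(int length, byte[] packed, int offset) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int b = packed[offset + i / 2] & 0xFF;
            // Even base index: high nybble; odd base index: low nybble.
            int code = (i % 2 == 0) ? (b >> 4) : (b & 0x0F);
            out[i] = CODE_TO_BASE[code];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = {(byte) 0x12, (byte) 0x48, (byte) 0xF0};
        // Five bases occupy two and a half bytes.
        System.out.println(new String(decode(5, packed, 0), StandardCharsets.US_ASCII));
    }
}
```

Note that the length argument counts bases, so an odd length leaves the low nybble of the last byte unused.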
-
phredToFastq
Convert an array of bytes, in which each byte is a binary phred quality score, to printable ASCII representation of the quality scores, ala FASTQ format. Equivalent to phredToFastq(data, 0, data.length).
Parameters:
data - Array of bytes in which each byte is a binary phred score.
Returns:
String with ASCII representation of those quality scores.
-
phredToFastq
Convert an array of bytes, in which each byte is a binary phred quality score, to printable ASCII representation of the quality scores, ala FASTQ format.
Parameters:
buffer - Array of bytes in which each byte is a binary phred score.
offset - Where in buffer to start conversion.
length - How many bytes of buffer to convert.
Returns:
String with ASCII representation of those quality scores.
-
phredToFastq
public static char phredToFastq(int phredScore)
Convert a single binary phred score to printable ASCII representation, ala FASTQ format.
Parameters:
phredScore - binary phred score.
Returns:
Printable ASCII representation of phred score.
-
fastqToPhred
Convert a string with phred scores in printable ASCII FASTQ format to an array of binary phred scores.
Parameters:
fastq - Phred scores in FASTQ printable ASCII format.
Returns:
byte array of binary phred scores in which each byte corresponds to a character in the input string.
-
fastqToPhred
public static void fastqToPhred(byte[] fastq)
Converts printable qualities in Sanger fastq format to binary phred scores.
-
fastqToPhred
public static int fastqToPhred(char ch)
Convert a single printable ASCII FASTQ format phred score to binary phred score.
Parameters:
ch - Printable ASCII FASTQ format phred score.
Returns:
Binary phred score.
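The conversions between binary phred scores and printable FASTQ characters can be illustrated with a small stand-alone sketch. It assumes the Sanger (Phred+33) convention, in which a score q is written as the ASCII character with code q + 33, and it assumes a maximum score of 93 (the last printable character, '~', minus 33). The class and method names are hypothetical and the range checks are a design choice of the sketch, not a statement about the library's behavior.

```java
// Hypothetical stand-alone sketch of Phred+33 conversion; not part of htsjdk.
final class PhredFastqSketch {
    static final int OFFSET = 33;     // Sanger FASTQ offset (assumed)
    static final int MAX_SCORE = 93;  // '~' (126) minus 33 (assumed)

    // Binary score -> printable character.
    static char scoreToChar(int phred) {
        if (phred < 0 || phred > MAX_SCORE) {
            throw new IllegalArgumentException("Phred score out of range: " + phred);
        }
        return (char) (phred + OFFSET);
    }

    // Slice of binary scores -> printable string.
    static String scoresToString(byte[] buffer, int offset, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = offset; i < offset + length; i++) {
            sb.append(scoreToChar(buffer[i]));
        }
        return sb.toString();
    }

    // Printable character -> binary score.
    static int charToScore(char ch) {
        if (ch < OFFSET || ch > OFFSET + MAX_SCORE) {
            throw new IllegalArgumentException("Not a Phred+33 character: " + (int) ch);
        }
        return ch - OFFSET;
    }

    // Printable string -> array of binary scores, one byte per character.
    static byte[] stringToScores(String fastq) {
        byte[] out = new byte[fastq.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) charToScore(fastq.charAt(i));
        }
        return out;
    }
}
```

For example, the scores 0, 30 and 40 map to the characters '!', '?' and 'I'.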
-
processValidationErrors
public static void processValidationErrors(List<SAMValidationError> validationErrors, long samRecordIndex, ValidationStringency validationStringency)
Handle a list of validation errors according to the validation stringency.
Parameters:
validationErrors - List of errors to report, or null if there are no errors.
samRecordIndex - Record number of the SAMRecord corresponding to the validation errors, or -1 if the record number is not known.
validationStringency - If STRICT, throw a SAMFormatException. If LENIENT, print the validation errors to stderr. If SILENT, do nothing.
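The three stringency behaviors can be sketched as a simple dispatch. This is an illustration only: the enum, the use of plain strings for errors, and the IllegalStateException standing in for SAMFormatException are all assumptions made to keep the example self-contained.

```java
import java.util.List;

// Hypothetical stand-alone sketch; the real method works on SAMValidationError
// objects and throws SAMFormatException.
final class StringencySketch {
    enum Stringency { STRICT, LENIENT, SILENT }

    static void process(List<String> errors, long recordIndex, Stringency stringency) {
        // A null list means "no errors"; SILENT ignores everything.
        if (errors == null || errors.isEmpty() || stringency == Stringency.SILENT) {
            return;
        }
        for (String error : errors) {
            // -1 means the record number is not known.
            String message = recordIndex >= 0 ? "Record " + recordIndex + ": " + error : error;
            if (stringency == Stringency.STRICT) {
                throw new IllegalStateException(message);
            }
            System.err.println("Ignoring validation error: " + message);
        }
    }
}
```

Under STRICT the first error aborts processing, so later errors in the list are never reported in this sketch.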
-
processValidationError
public static void processValidationError(SAMValidationError validationError, ValidationStringency validationStringency)
-
calculateReadGroupRecordChecksum
Calculate a hash code from identifying information in the RG (read group) records in a SAM file's header. This hash code changes any time read groups are added or removed. Comparing one file's hash code to another's tells you if the read groups in the BAM files are different.
-
chainSAMProgramRecord
Chains program in front of the first "head" item in the list of SAMProgramRecords in header. This method should not be used when there are multiple chains of program groups in a header, only when it can safely be assumed that there is only one chain. It correctly handles the case where program has already been added to the header, so it can be used whether creating a SAMProgramRecord with a constructor or when calling SAMFileHeader.createProgramRecord().
-
makeReadUnmapped
Strip mapping information from a SAMRecord. WARNING: by clearing the secondary and supplementary flags, this may have the effect of producing multiple distinct records with the same read name and flags, which may lead to invalid SAM/BAM output. Callers of this method should make sure to deal with this issue.
-
makeReadUnmappedWithOriginalTags
Strip mapping information from a SAMRecord, but preserve it in the 'O' tags if it isn't already set.
-
hasOriginalMappingInformation
See if any tags pertaining to original mapping information have been set.
-
cigarMapsNoBasesToRef
Determines if a cigar has any element that both consumes read bases and consumes reference bases (e.g. is not all soft-clipped).
-
recordMapsEntirelyBeyondEndOfReference
Tests if the provided record is mapped entirely beyond the end of the reference (i.e., the alignment start is greater than the length of the sequence to which the record is mapped).
Parameters:
record - must not have a null SAMFileHeader
-
compareMapqs
public static int compareMapqs(int mapq1, int mapq2)
Returns:
negative if mapq1 < mapq2, zero if they are equal, positive if mapq1 > mapq2. Note that MAPQ(0) < MAPQ(255) < MAPQ(1)
-
combineMapqs
public static int combineMapqs(int m1, int m2)
Hokey algorithm for combining two MAPQs into values that are comparable, being cognizant of the fact that in MAPQ world, 1 > 255 > 0. In this algorithm, 255 is treated as if it were 0.01, so that CombinedMapq(1,0) > CombinedMapq(255, 255) > CombinedMapq(0, 0). The return value should not be used for anything other than comparing to the return value of other invocations of this method.
-
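The MAPQ ordering (0 < 255 < 1) and the "255 counts as 0.01" rule can be made concrete with a short sketch. It is one way to satisfy the ordering stated above, under the assumption that ordinary MAPQs are scaled by 100 so that 255 can be represented as the integer 1. It should not be read as the library's exact code, and the names are hypothetical.

```java
// Hypothetical stand-alone sketch of MAPQ comparison and combination; not part of htsjdk.
final class MapqSketch {
    // Orders MAPQs so that 0 < 255 < 1 < 2 < ... < 254.
    static int compare(int mapq1, int mapq2) {
        if (mapq1 == mapq2) return 0;
        if (mapq1 == 0) return -1;    // 0 is below everything else
        if (mapq2 == 0) return 1;
        if (mapq1 == 255) return -1;  // 255 ("unknown") is below every known non-zero value
        if (mapq2 == 255) return 1;
        return Integer.compare(mapq1, mapq2);
    }

    // Scale by 100 so that 255 can be weighted as 0.01 * 100 = 1.
    private static int weight(int mapq) {
        return mapq == 255 ? 1 : mapq * 100;
    }

    // Result is only meaningful when compared with other results of combine().
    static int combine(int m1, int m2) {
        return weight(m1) + weight(m2);
    }
}
```

With this weighting, combine(1, 0) is 100, combine(255, 255) is 2, and combine(0, 0) is 0, which reproduces the ordering given in the description.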
findVirtualOffsetOfFirstRecordInBam
-
findVirtualOffsetOfFirstRecordInBam
Returns the virtual file offset of the first record in a BAM file, i.e. the virtual file offset after skipping over the text header and the sequence records.
-
findVirtualOffsetOfFirstRecordInBam
Returns the virtual file offset of the first record in a BAM file, i.e. the virtual file offset after skipping over the text header and the sequence records.
-
getAlignmentBlocks
public static List<AlignmentBlock> getAlignmentBlocks(Cigar cigar, int alignmentStart, String cigarTypeName)
Given a Cigar, returns blocks of the sequence that have been aligned directly to the reference sequence. Note that clipped portions, and inserted and deleted bases (vs. the reference) are not represented in the alignment blocks.
Parameters:
cigar - The cigar containing the alignment information
alignmentStart - The start (1-based) of the alignment
cigarTypeName - The type of cigar passed - for error logging.
Returns:
List of alignment blocks
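How a CIGAR turns into alignment blocks can be sketched as a walk over its elements. The example below is a simplified, stand-alone illustration: it parses a CIGAR string with a regular expression and returns each block as an int array of {readStart, referenceStart, length}, both positions 1-based. The class name, the array representation, and the absence of validation are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of htsjdk.
final class AlignmentBlockSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    // Each result is {readStart, referenceStart, length}, positions 1-based.
    static List<int[]> blocks(String cigar, int alignmentStart) {
        List<int[]> out = new ArrayList<>();
        int readPos = 1;
        int refPos = alignmentStart;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            switch (op) {
                case 'M': case '=': case 'X':  // aligned: emit a block, advance both
                    out.add(new int[]{readPos, refPos, len});
                    readPos += len;
                    refPos += len;
                    break;
                case 'I': case 'S':            // read bases only
                    readPos += len;
                    break;
                case 'D': case 'N':            // reference bases only
                    refPos += len;
                    break;
                default:                       // H and P consume neither
                    break;
            }
        }
        return out;
    }
}
```

For the CIGAR 5S10M2I8M3D4M with an alignment start of 100, this yields three blocks: {6, 100, 10}, {18, 110, 8} and {26, 121, 4}. The clipped, inserted and deleted bases appear only as gaps between the blocks.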
-
getUnclippedStart
Parameters:
alignmentStart - The start (1-based) of the alignment
cigar - The cigar containing the alignment information
Returns:
the alignment start (1-based, inclusive) adjusted for clipped bases. For example, if the read has an alignment start of 100 but the first 4 bases were clipped (hard or soft clipped), then this method will return 96. Invalid to call on an unmapped read. Invalid to call with cigar = null.
-
getUnclippedEnd
Parameters:
alignmentEnd - The end (1-based) of the alignment
cigar - The cigar containing the alignment information
Returns:
the alignment end (1-based, inclusive) adjusted for clipped bases. For example, if the read has an alignment end of 100 but the last 7 bases were clipped (hard or soft clipped), then this method will return 107. Invalid to call on an unmapped read. Invalid to call with cigar = null.
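The clipping adjustment for both ends can be sketched on a CIGAR string: leading S/H lengths are subtracted from the alignment start, and trailing S/H lengths are added to the alignment end. The class and method names are hypothetical, and the sketch assumes a well-formed CIGAR with at least one non-clipping element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of htsjdk.
final class UnclippedSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    private static boolean isClip(char op) {
        return op == 'S' || op == 'H';
    }

    // Subtracts every clipping element that precedes the first non-clipping element.
    static int unclippedStart(int alignmentStart, String cigar) {
        int start = alignmentStart;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find() && isClip(m.group(2).charAt(0))) {
            start -= Integer.parseInt(m.group(1));
        }
        return start;
    }

    // Adds every clipping element that follows the last non-clipping element.
    static int unclippedEnd(int alignmentEnd, String cigar) {
        int trailing = 0;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            // Reset whenever a non-clipping element is seen, so only the tail counts.
            trailing = isClip(m.group(2).charAt(0)) ? trailing + len : 0;
        }
        return alignmentEnd + trailing;
    }
}
```

This reproduces the two examples in the text: a start of 100 with CIGAR 4S96M gives 96, and an end of 100 with CIGAR 93M7S gives 107.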
-
getMateCigarString
Returns the Mate Cigar String as stored in the attribute 'MC'.
Parameters:
rec - the SAM record
Returns:
Mate Cigar String, or null if there is none.
-
getMateCigar
Returns the Mate Cigar or null if there is none.
Parameters:
rec - the SAM record
withValidation - true if we are to validate the mate cigar before returning, false otherwise.
Returns:
Cigar object for the read's mate, or null if there is none.
-
getMateCigar
Returns the Mate Cigar or null if there is none. No validation is done on the returned cigar.
Parameters:
rec - the SAM record
Returns:
Cigar object for the read's mate, or null if there is none.
-
getMateCigarLength
Parameters:
rec - the SAM record
Returns:
number of cigar elements (number + operator) in the mate cigar string.
-
getMateAlignmentEnd
This method uses the MateCigar value as determined from the attribute MC. It must be non-null.
Parameters:
rec - the SAM record
Returns:
1-based inclusive rightmost position of the clipped mate sequence, or 0 if unmapped.
-
getMateUnclippedStart
Parameters:
rec - the SAM record
Returns:
the mate alignment start (1-based, inclusive) adjusted for clipped bases. For example, if the mate has an alignment start of 100 but the first 4 bases were clipped (hard or soft clipped), then this method will return 96. Invalid to call on an unmapped read.
-
getMateUnclippedEnd
Parameters:
rec - the SAM record
Returns:
the mate alignment end (1-based, inclusive) adjusted for clipped bases. For example, if the mate has an alignment end of 100 but the last 7 bases were clipped (hard or soft clipped), then this method will return 107. Invalid to call on an unmapped read.
-
getMateAlignmentBlocks
Returns blocks of the mate sequence that have been aligned directly to the reference sequence. Note that clipped portions of the mate and inserted and deleted bases (vs. the reference) are not represented in the alignment blocks.
Parameters:
rec - the SAM record
-
validateCigar
public static List<SAMValidationError> validateCigar(SAMRecord rec, Cigar cigar, Integer referenceIndex, List<AlignmentBlock> alignmentBlocks, long recordNumber, String cigarTypeName)
Run all validations of the given CIGAR. These include validation that the CIGAR makes sense independent of placement, plus validation that CIGAR + placement yields all bases with M operator within the range of the reference.
Parameters:
rec - the SAM record
cigar - The cigar containing the alignment information
referenceIndex - The reference index
alignmentBlocks - The alignment blocks (parsed from the cigar)
recordNumber - For error reporting. -1 if not known.
cigarTypeName - For error reporting. "Read CIGAR" or "Mate Cigar"
Returns:
List of errors, or null if no errors.
-
validateMateCigar
Run all validations of the mate's CIGAR. These include validation that the CIGAR makes sense independent of placement, plus validation that CIGAR + placement yields all bases with M operator within the range of the reference.
Parameters:
rec - the SAM record
recordNumber - For error reporting. -1 if not known.
Returns:
List of errors, or null if no errors.
-
hasMateCigar
Checks to see if it is valid for this record to have a mate CIGAR (MC) and then if there is a mate CIGAR available. This is done by checking that this record is paired, its mate is mapped, and that it returns a non-null mate CIGAR.
Parameters:
rec - the SAM record
Returns:
true if the record is paired, its mate is mapped, and a non-null mate CIGAR is available; false otherwise.
-
getCanonicalRecordName
Returns a string that is the read group ID and read name separated by a colon. This is meant to canonically identify a given record within a set of records.
Parameters:
record - SAMRecord for which "canonical" read name is requested
Returns:
The record's readgroup-id (if non-null) and the read name, separated by a colon, ':'
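The naming rule can be stated in a couple of lines. The sketch below takes the read group ID and read name as plain strings instead of reading them from a SAMRecord. The fallback to the bare read name when the read group ID is null is how this sketch interprets "(if non-null)"; the names are hypothetical.

```java
// Hypothetical stand-alone sketch; not part of htsjdk.
final class CanonicalNameSketch {
    // Prefixes the read name with "<readGroupId>:" only when a read group ID is present.
    static String canonicalName(String readGroupId, String readName) {
        return readGroupId == null ? readName : readGroupId + ":" + readName;
    }
}
```

Two reads with the same name in different read groups therefore get different canonical names, for example rg1:read42 and rg2:read42.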
-
getNumOverlappingAlignedBasesToClip
Returns the number of bases that need to be clipped due to overlapping pairs. If the record is not paired, or the given record's start position is greater than its mate's start position, zero is automatically returned. NB: This method assumes that the record's mate is not contained within the given record's alignment.
Parameters:
rec - SAMRecord that needs clipping due to overlapping pairs.
Returns:
the number of bases at the end of the read that need to be clipped such that there would be no overlapping bases with its mate. Read bases include only those from insertion, match, or mismatch Cigar operators.
-
clipOverlappingAlignedBases
Returns a (possibly new) record that has been clipped if the input is a mapped, paired read with bases that overlap its mate. See getNumOverlappingAlignedBasesToClip(SAMRecord) for how the number of overlapping bases is computed. NB: this does not properly consider a cigar like: 100M20S10H. NB: This method assumes that the record's mate is not contained within the given record's alignment.
Parameters:
record - the record from which to clip bases.
noSideEffects - if true a modified clone of the original record is returned, otherwise we modify the record directly.
Returns:
a (possibly new) record that has been clipped
-
clipOverlappingAlignedBases
public static SAMRecord clipOverlappingAlignedBases(SAMRecord record, int numOverlappingBasesToClip, boolean noSideEffects)
Returns a (possibly new) SAMRecord with the given number of bases soft-clipped at the end of the read if it is a mapped, paired read with bases that overlap its mate. NB: this does not properly consider a cigar like: 100M20S10H. NB: This method assumes that the record's mate is not contained within the given record's alignment.
Parameters:
record - the record from which to clip bases.
numOverlappingBasesToClip - the number of bases to clip at the end of the read.
noSideEffects - if true a modified clone of the original record is returned, otherwise we modify the record directly.
Returns:
a (possibly new) SAMRecord with the given number of bases soft-clipped
-
isValidUnsignedIntegerAttribute
public static boolean isValidUnsignedIntegerAttribute(long value)
Checks if a long attribute value is within the allowed range of a 32-bit unsigned integer.
Parameters:
value - a long value to check
Returns:
true if value is >= 0 and <= BinaryCodec.MAX_UINT, and false otherwise
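The range check amounts to a single comparison. The sketch below assumes the upper bound is the largest 32-bit unsigned value, 4294967295 (2^32 - 1), and defines its own constant instead of referencing BinaryCodec.MAX_UINT so that it compiles on its own. The class name is hypothetical.

```java
// Hypothetical stand-alone sketch; not part of htsjdk.
final class UnsignedIntSketch {
    // Largest value representable as a 32-bit unsigned integer: 2^32 - 1.
    static final long MAX_UINT = 0xFFFFFFFFL;

    static boolean isValidUnsignedInt(long value) {
        return value >= 0 && value <= MAX_UINT;
    }
}
```

A long is needed for the argument because values above Integer.MAX_VALUE (2147483647) are valid unsigned 32-bit integers but do not fit in a Java int.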
-
getOtherCanonicalAlignments
Extract a List of 'other canonical alignments' from a SAM record. Those alignments are stored as a string in the 'SA' tag as defined in the SAM specification. The name, sequence, qualities, and mate data are copied from the original record.
Parameters:
record - must be non null and must have a non-null associated header.
Returns:
a list of 'other canonical alignments' SAMRecords. The list is empty if the 'SA' attribute is missing.
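The structure of the 'SA' tag can be illustrated by splitting its value. Per the SAM specification, the value is a semicolon-delimited list of entries, each of the form rname,pos,strand,CIGAR,mapQ,NM. The sketch below only splits the string into fields; building SAMRecords from them, as the method above does, is omitted. The class name and the String[] representation are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of htsjdk.
final class SaTagSketch {
    // Each result holds {rname, pos, strand, cigar, mapq, nm} as strings.
    static List<String[]> parse(String saValue) {
        List<String[]> out = new ArrayList<>();
        if (saValue == null || saValue.isEmpty()) {
            return out;  // missing tag: empty list
        }
        for (String entry : saValue.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            String[] fields = entry.split(",");
            if (fields.length != 6) {
                throw new IllegalArgumentException("Malformed SA entry: " + entry);
            }
            out.add(fields);
        }
        return out;
    }
}
```

For example, the value `chr1,100,+,50M50S,60,0;chr2,200,-,50S50M,30,2;` yields two entries, the second on the reverse strand of chr2.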
-
isReferenceSequenceCompatibleWithBAI
Deprecated. Because the method does the exact opposite of what it says. Use the correctly named isReferenceSequenceIncompatibleWithBAI() instead.
-
isReferenceSequenceIncompatibleWithBAI
Checks if the reference sequence is incompatible with the BAI indexing format.
Parameters:
sequence - reference sequence.
-
calculateOATagValue
Function to create the OA tag value from a record. The OA tag contains the mapping information of a record encoded as a comma-separated string (REF,POS,STRAND,CIGAR,MAPPING_QUALITY,NM_TAG_VALUE).
Parameters:
record - to use for generating the OA tag
Returns:
the OA tag string value
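The field layout of the OA value can be sketched with plain string joining. The example takes the six fields as arguments instead of reading them from a SAMRecord. The trailing semicolon follows the SAM specification's definition of OA, where each entry is terminated by ';', and is an assumption relative to the description above. Leaving the NM field empty when it is unknown is likewise a choice made for this sketch, and the names are hypothetical.

```java
// Hypothetical stand-alone sketch; not part of htsjdk.
final class OaTagSketch {
    // Builds "REF,POS,STRAND,CIGAR,MAPQ,NM;" with an empty NM field when nm is null.
    static String oaValue(String referenceName, int position, boolean negativeStrand,
                          String cigar, int mappingQuality, Integer nm) {
        return String.join(",",
                referenceName,
                Integer.toString(position),
                negativeStrand ? "-" : "+",
                cigar,
                Integer.toString(mappingQuality),
                nm == null ? "" : nm.toString()) + ";";
    }
}
```

A forward-strand read at chr1:100 with CIGAR 50M, MAPQ 60 and NM 0 would be encoded as `chr1,100,+,50M,60,0;`.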
-