Class SAMUtils

java.lang.Object
htsjdk.samtools.SAMUtils

public final class SAMUtils extends Object
Utility methods.
  • Field Details

  • Constructor Details

    • SAMUtils

      public SAMUtils()
  • Method Details

    • compressedBasesToBytes

      public static byte[] compressedBasesToBytes(int length, byte[] compressedBases, int compressedOffset)
      Convert from a byte array with bases stored in nybbles (with, for example, =, A, C, G, T, N represented as 0, 1, 2, 4, 8, 15) to a byte array containing =AaCcGgTtNn represented as ASCII.
      Parameters:
      length - Number of bases (not bytes) to convert.
      compressedBases - Bases represented as nybbles, in BAM binary format.
      compressedOffset - Byte offset in compressedBases to start.
      Returns:
      New byte array with bases as ASCII bytes.
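The nybble layout described above can be illustrated with a small standalone sketch. It is not htsjdk's implementation: the class name `NybbleBasesSketch` is hypothetical, and it assumes the BAM convention that the 16-entry code table is "=ACMGRSVTWYHKDBN" and that the first base of each byte sits in the high nybble.

```java
import java.nio.charset.StandardCharsets;

final class NybbleBasesSketch {
    // 4-bit code -> base character (assumed BAM code table).
    private static final byte[] LOOKUP =
            "=ACMGRSVTWYHKDBN".getBytes(StandardCharsets.US_ASCII);

    /** Decodes `length` bases (not bytes), starting at byte `offset` of `compressed`. */
    static byte[] decode(int length, byte[] compressed, int offset) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int packed = compressed[offset + i / 2] & 0xFF;
            // Even-numbered bases use the high nybble, odd-numbered bases the low nybble.
            int code = (i % 2 == 0) ? (packed >> 4) : (packed & 0x0F);
            out[i] = LOOKUP[code];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = {(byte) 0x12, (byte) 0x48};
        System.out.println(new String(decode(4, packed, 0), StandardCharsets.US_ASCII)); // ACGT
    }
}
```

Because `length` counts bases, an odd length simply ignores the low nybble of the last byte read.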
    • phredToFastq

      public static String phredToFastq(byte[] data)
      Convert an array of bytes, in which each byte is a binary phred quality score, to a printable ASCII representation of the quality scores, as in FASTQ format.

      Equivalent to phredToFastq(data, 0, data.length)

      Parameters:
      data - Array of bytes in which each byte is a binary phred score.
      Returns:
      String with ASCII representation of those quality scores.
    • phredToFastq

      public static String phredToFastq(byte[] buffer, int offset, int length)
      Convert an array of bytes, in which each byte is a binary phred quality score, to a printable ASCII representation of the quality scores, as in FASTQ format.
      Parameters:
      buffer - Array of bytes in which each byte is a binary phred score.
      offset - Where in buffer to start conversion.
      length - How many bytes of buffer to convert.
      Returns:
      String with ASCII representation of those quality scores.
    • phredToFastq

      public static char phredToFastq(int phredScore)
      Convert a single binary phred score to its printable ASCII representation, as in FASTQ format.
      Parameters:
      phredScore - binary phred score.
      Returns:
      Printable ASCII representation of phred score.
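As a rough illustration of the three phredToFastq overloads, the sketch below applies the Sanger (phred+33) encoding. The class name `PhredToFastqSketch` is hypothetical, and both the upper bound of 93 and the choice to throw on out-of-range scores are assumptions, not documented htsjdk behavior.

```java
final class PhredToFastqSketch {
    static final int OFFSET = 33;     // Sanger / phred+33 offset
    static final int MAX_PHRED = 93;  // assumed cap so the result stays at or below '~' (ASCII 126)

    /** Single score -> printable character. */
    static char phredToFastq(int phredScore) {
        if (phredScore < 0 || phredScore > MAX_PHRED) {
            throw new IllegalArgumentException("Phred score out of range: " + phredScore);
        }
        return (char) (phredScore + OFFSET);
    }

    /** Converts `length` scores of `buffer`, starting at `offset`. */
    static String phredToFastq(byte[] buffer, int offset, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = offset; i < offset + length; i++) {
            sb.append(phredToFastq(buffer[i]));
        }
        return sb.toString();
    }

    /** Whole-array convenience form, equivalent to phredToFastq(data, 0, data.length). */
    static String phredToFastq(byte[] data) {
        return phredToFastq(data, 0, data.length);
    }

    public static void main(String[] args) {
        System.out.println(phredToFastq(new byte[] {0, 10, 40})); // !+I
    }
}
```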
    • fastqToPhred

      public static byte[] fastqToPhred(String fastq)
      Convert a string with phred scores in printable ASCII FASTQ format to an array of binary phred scores.
      Parameters:
      fastq - Phred scores in FASTQ printable ASCII format.
      Returns:
      byte array of binary phred scores in which each byte corresponds to a character in the input string.
    • fastqToPhred

      public static void fastqToPhred(byte[] fastq)
      Converts printable qualities in Sanger fastq format to binary phred scores.
    • fastqToPhred

      public static int fastqToPhred(char ch)
      Convert a single printable ASCII FASTQ format phred score to binary phred score.
      Parameters:
      ch - Printable ASCII FASTQ format phred score.
      Returns:
      Binary phred score.
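The reverse direction can be sketched the same way. The class `FastqToPhredSketch` is hypothetical, and two things are assumptions about the real methods: that characters outside the printable range 33 to 126 are rejected, and that the void overload overwrites its argument in place.

```java
import java.util.Arrays;

final class FastqToPhredSketch {
    /** Single printable character -> binary phred score. */
    static int fastqToPhred(char ch) {
        if (ch < 33 || ch > 126) {
            throw new IllegalArgumentException("Invalid FASTQ quality character code: " + (int) ch);
        }
        return ch - 33;
    }

    /** String form: one output byte per input character. */
    static byte[] fastqToPhred(String fastq) {
        byte[] scores = new byte[fastq.length()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = (byte) fastqToPhred(fastq.charAt(i));
        }
        return scores;
    }

    /** In-place form: each printable byte is overwritten with its binary score. */
    static void fastqToPhredInPlace(byte[] fastq) {
        for (int i = 0; i < fastq.length; i++) {
            fastq[i] = (byte) fastqToPhred((char) (fastq[i] & 0xFF));
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fastqToPhred("!+I"))); // [0, 10, 40]
    }
}
```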
    • processValidationErrors

      public static void processValidationErrors(List<SAMValidationError> validationErrors, long samRecordIndex, ValidationStringency validationStringency)
      Handle a list of validation errors according to the validation stringency.
      Parameters:
      validationErrors - List of errors to report, or null if there are no errors.
      samRecordIndex - Record number of the SAMRecord corresponding to the validation errors, or -1 if the record number is not known.
      validationStringency - If STRICT, throw a SAMFormatException. If LENIENT, print the validation errors to stderr. If SILENT, do nothing.
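The three stringency levels can be pictured with the following standalone sketch. Everything in it is a stand-in: the `Stringency` enum, the use of plain strings for errors, and `IllegalStateException` in place of SAMFormatException are hypothetical, and stopping at the first error under STRICT is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class StringencySketch {
    enum Stringency { STRICT, LENIENT, SILENT }

    /** Handles `errors` (may be null) for the record at `recordIndex` (-1 if unknown). */
    static void process(List<String> errors, long recordIndex, Stringency stringency) {
        if (errors == null || errors.isEmpty() || stringency == Stringency.SILENT) {
            return; // nothing to report, or reporting is switched off
        }
        String prefix = recordIndex >= 0 ? "Record " + recordIndex + ": " : "";
        for (String error : errors) {
            if (stringency == Stringency.STRICT) {
                // Stand-in for throwing SAMFormatException.
                throw new IllegalStateException(prefix + error);
            }
            System.err.println("Ignoring validation error: " + prefix + error);
        }
    }

    public static void main(String[] args) {
        process(Arrays.asList("example error"), 7, Stringency.LENIENT);
    }
}
```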
    • processValidationError

      public static void processValidationError(SAMValidationError validationError, ValidationStringency validationStringency)
    • calculateReadGroupRecordChecksum

      public static String calculateReadGroupRecordChecksum(File input, File referenceFasta)
      Calculate a hash code from identifying information in the RG (read group) records in a SAM or BAM file's header. This hash code changes any time read groups are added or removed. Comparing one file's hash code to another's tells you whether the read groups in the two files differ.
    • chainSAMProgramRecord

      public static void chainSAMProgramRecord(SAMFileHeader header, SAMProgramRecord program)
      Chains program in front of the first "head" item in the list of SAMProgramRecords in header. This method should not be used when there are multiple chains of program groups in a header, only when it can safely be assumed that there is only one chain. It correctly handles the case where program has already been added to the header, so it can be used whether creating a SAMProgramRecord with a constructor or when calling SAMFileHeader.createProgramRecord().
    • makeReadUnmapped

      public static void makeReadUnmapped(SAMRecord rec)
      Strip mapping information from a SAMRecord.

      WARNING: by clearing the secondary and supplementary flags, this may have the effect of producing multiple distinct records with the same read name and flags, which may lead to invalid SAM/BAM output. Callers of this method should make sure to deal with this issue.

    • makeReadUnmappedWithOriginalTags

      public static void makeReadUnmappedWithOriginalTags(SAMRecord rec)
      Strip mapping information from a SAMRecord, but preserve it in the 'O' tags if it isn't already set.
    • hasOriginalMappingInformation

      public static boolean hasOriginalMappingInformation(SAMRecord rec)
      See if any tags pertaining to original mapping information have been set.
    • cigarMapsNoBasesToRef

      public static boolean cigarMapsNoBasesToRef(Cigar cigar)
      Determines whether a cigar has no element that both consumes read bases and consumes reference bases (e.g. it is all soft-clipped). Returns true when no such element exists.
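A minimal sketch of this check, working on a CIGAR string rather than a Cigar object. The class `CigarMappingSketch` is hypothetical. It relies on the SAM convention that only the M, = and X operators consume both read and reference bases.

```java
final class CigarMappingSketch {
    /** Returns true if no operator in `cigar` consumes both read and reference bases. */
    static boolean mapsNoBasesToRef(String cigar) {
        for (int i = 0; i < cigar.length(); i++) {
            char op = cigar.charAt(i);
            // Digits and the operators I, D, N, S, H, P are skipped; only M, = and X
            // place read bases onto the reference.
            if (op == 'M' || op == '=' || op == 'X') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(mapsNoBasesToRef("10S"));   // true
        System.out.println(mapsNoBasesToRef("5S10M")); // false
    }
}
```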
    • recordMapsEntirelyBeyondEndOfReference

      public static boolean recordMapsEntirelyBeyondEndOfReference(SAMRecord record)
      Tests if the provided record is mapped entirely beyond the end of the reference (i.e., the alignment start is greater than the length of the sequence to which the record is mapped).
      Parameters:
      record - must not have a null SAMFileHeader
    • compareMapqs

      public static int compareMapqs(int mapq1, int mapq2)
      Returns:
      negative if mapq1 < mapq2, etc. Note that MAPQ(0) < MAPQ(255) < MAPQ(1)
    • combineMapqs

      public static int combineMapqs(int m1, int m2)
      Hokey algorithm for combining two MAPQs into values that are comparable, being cognizant of the fact that in MAPQ world, 1 > 255 > 0. In this algorithm, 255 is treated as if it were 0.01, so that CombinedMapq(1,0) > CombinedMapq(255, 255) > CombinedMapq(0, 0). The return value should not be used for anything other than comparing to the return value of other invocations of this method.
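The ordering 0 < 255 < 1 described for compareMapqs and combineMapqs can be reproduced with a single rank function. This is a sketch under stated assumptions, not the library's code: the class `MapqSketch` and the helper `rank` are hypothetical, and scaling by 100 is just one way to make "255 counts as 0.01" an integer.

```java
final class MapqSketch {
    /** Maps a MAPQ to a comparable rank: 0 -> 0, 255 -> 1, n -> n * 100. */
    static int rank(int mapq) {
        return mapq == 255 ? 1 : mapq * 100;
    }

    /** Negative if mapq1 ranks below mapq2, zero if equal, positive otherwise. */
    static int compareMapqs(int mapq1, int mapq2) {
        return Integer.compare(rank(mapq1), rank(mapq2));
    }

    /** Combined value; meaningful only when compared with other combined values. */
    static int combineMapqs(int m1, int m2) {
        return rank(m1) + rank(m2);
    }

    public static void main(String[] args) {
        System.out.println(compareMapqs(255, 1) < 0);                   // true
        System.out.println(combineMapqs(1, 0) > combineMapqs(255, 255)); // true
    }
}
```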
    • findVirtualOffsetOfFirstRecordInBam

      public static long findVirtualOffsetOfFirstRecordInBam(Path bamFile)
    • findVirtualOffsetOfFirstRecordInBam

      public static long findVirtualOffsetOfFirstRecordInBam(File bamFile)
      Returns the virtual file offset of the first record in a BAM file - i.e. the virtual file offset after skipping over the text header and the sequence records.
    • findVirtualOffsetOfFirstRecordInBam

      public static long findVirtualOffsetOfFirstRecordInBam(SeekableStream seekableStream)
      Returns the virtual file offset of the first record in a BAM file - i.e. the virtual file offset after skipping over the text header and the sequence records.
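For readers unfamiliar with virtual file offsets, the sketch below shows how such a value is laid out in BGZF-compressed files: the compressed block's file position in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits. The class `VirtualOffsetSketch` and its method names are hypothetical; it only packs and unpacks values and does not read a BAM file.

```java
final class VirtualOffsetSketch {
    /** Packs a block's file position and an offset inside the uncompressed block. */
    static long make(long blockAddress, int offsetInBlock) {
        if (blockAddress < 0 || blockAddress >= (1L << 48)) {
            throw new IllegalArgumentException("Block address out of range: " + blockAddress);
        }
        if (offsetInBlock < 0 || offsetInBlock > 0xFFFF) {
            throw new IllegalArgumentException("Offset in block out of range: " + offsetInBlock);
        }
        return (blockAddress << 16) | offsetInBlock;
    }

    /** File position of the compressed block. */
    static long blockAddress(long virtualOffset) {
        return virtualOffset >>> 16;
    }

    /** Position inside the block after decompression. */
    static int offsetInBlock(long virtualOffset) {
        return (int) (virtualOffset & 0xFFFF);
    }

    public static void main(String[] args) {
        System.out.println(make(100, 5)); // 6553605
    }
}
```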
    • getAlignmentBlocks

      public static List<AlignmentBlock> getAlignmentBlocks(Cigar cigar, int alignmentStart, String cigarTypeName)
      Given a Cigar, returns blocks of the sequence that have been aligned directly to the reference sequence. Note that clipped portions, and inserted and deleted bases (vs. the reference) are not represented in the alignment blocks.
      Parameters:
      cigar - The cigar containing the alignment information
      alignmentStart - The start (1-based) of the alignment
      cigarTypeName - The type of cigar passed - for error logging.
      Returns:
      List of alignment blocks
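The block computation can be sketched on a plain CIGAR string. This is an illustration rather than the library's code: the class `AlignmentBlockSketch` is hypothetical, each block is represented as an `int[]` of {readStart, referenceStart, length} instead of an AlignmentBlock, and the read start is assumed to be 1-based like the reference start.

```java
import java.util.ArrayList;
import java.util.List;

final class AlignmentBlockSketch {
    /** Returns {readStart, referenceStart, length} triples for each M, = or X element. */
    static List<int[]> blocks(String cigar, int alignmentStart) {
        List<int[]> result = new ArrayList<>();
        int readPos = 1;              // 1-based position in the read
        int refPos = alignmentStart;  // 1-based position on the reference
        int length = 0;
        for (int i = 0; i < cigar.length(); i++) {
            char c = cigar.charAt(i);
            if (Character.isDigit(c)) {
                length = length * 10 + (c - '0');
                continue;
            }
            switch (c) {
                case 'M': case '=': case 'X':   // aligned: emit a block, advance both
                    result.add(new int[] {readPos, refPos, length});
                    readPos += length;
                    refPos += length;
                    break;
                case 'I': case 'S':             // read-only bases
                    readPos += length;
                    break;
                case 'D': case 'N':             // reference-only bases
                    refPos += length;
                    break;
                case 'H': case 'P':             // consume neither
                    break;
                default:
                    throw new IllegalArgumentException("Unknown CIGAR operator: " + c);
            }
            length = 0;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] b : blocks("3S5M2I4M1D6M", 100)) {
            System.out.println("read " + b[0] + ", ref " + b[1] + ", length " + b[2]);
        }
    }
}
```

For the CIGAR in `main`, the soft clip and insertion shift only the read position and the deletion shifts only the reference position, so three blocks result.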
    • getUnclippedStart

      public static int getUnclippedStart(int alignmentStart, Cigar cigar)
      Parameters:
      alignmentStart - The start (1-based) of the alignment
      cigar - The cigar containing the alignment information
      Returns:
      the alignment start (1-based, inclusive) adjusted for clipped bases. For example if the read has an alignment start of 100 but the first 4 bases were clipped (hard or soft clipped) then this method will return 96.

      Invalid to call on an unmapped read. Invalid to call with cigar = null

    • getUnclippedEnd

      public static int getUnclippedEnd(int alignmentEnd, Cigar cigar)
      Parameters:
      alignmentEnd - The end (1-based) of the alignment
      cigar - The cigar containing the alignment information
      Returns:
      the alignment end (1-based, inclusive) adjusted for clipped bases. For example if the read has an alignment end of 100 but the last 7 bases were clipped (hard or soft clipped) then this method will return 107.

      Invalid to call on an unmapped read. Invalid to call with cigar = null
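The arithmetic behind getUnclippedStart and getUnclippedEnd can be sketched as follows, again on CIGAR strings. The class `UnclippedSketch` and its helper are hypothetical. The sketch sums the soft and hard clips at the relevant end of the CIGAR and reproduces the two examples given above (100 becomes 96, and 100 becomes 107).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnclippedSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    static int unclippedStart(int alignmentStart, String cigar) {
        return alignmentStart - clippedBases(cigar, true);
    }

    static int unclippedEnd(int alignmentEnd, String cigar) {
        return alignmentEnd + clippedBases(cigar, false);
    }

    /** Sums S and H lengths from one end, stopping at the first other operator. */
    private static int clippedBases(String cigar, boolean fromStart) {
        List<Integer> lengths = new ArrayList<>();
        List<Character> ops = new ArrayList<>();
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            lengths.add(Integer.parseInt(m.group(1)));
            ops.add(m.group(2).charAt(0));
        }
        int total = 0;
        for (int k = 0; k < ops.size(); k++) {
            int i = fromStart ? k : ops.size() - 1 - k;
            char op = ops.get(i);
            if (op != 'S' && op != 'H') {
                break;
            }
            total += lengths.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(unclippedStart(100, "4S96M")); // 96
        System.out.println(unclippedEnd(100, "93M7H"));   // 107
    }
}
```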

    • getMateCigarString

      public static String getMateCigarString(SAMRecord rec)
      Returns the Mate Cigar String as stored in the attribute 'MC'.
      Parameters:
      rec - the SAM record
      Returns:
      Mate Cigar String, or null if there is none.
    • getMateCigar

      public static Cigar getMateCigar(SAMRecord rec, boolean withValidation)
      Returns the Mate Cigar or null if there is none.
      Parameters:
      rec - the SAM record
      withValidation - true if we are to validate the mate cigar before returning, false otherwise.
      Returns:
      Cigar object for the read's mate, or null if there is none.
    • getMateCigar

      public static Cigar getMateCigar(SAMRecord rec)
      Returns the Mate Cigar or null if there is none. No validation is done on the returned cigar.
      Parameters:
      rec - the SAM record
      Returns:
      Cigar object for the read's mate, or null if there is none.
    • getMateCigarLength

      public static int getMateCigarLength(SAMRecord rec)
      Parameters:
      rec - the SAM record
      Returns:
      number of cigar elements (number + operator) in the mate cigar string.
    • getMateAlignmentEnd

      public static int getMateAlignmentEnd(SAMRecord rec)
      This method uses the MateCigar value as determined from the attribute MC. It must be non-null.
      Parameters:
      rec - the SAM record
      Returns:
      1-based inclusive rightmost position of the clipped mate sequence, or 0 if unmapped.
    • getMateUnclippedStart

      public static int getMateUnclippedStart(SAMRecord rec)
      Parameters:
      rec - the SAM record
      Returns:
      the mate alignment start (1-based, inclusive) adjusted for clipped bases. For example if the mate has an alignment start of 100 but the first 4 bases were clipped (hard or soft clipped) then this method will return 96.

      Invalid to call on an unmapped read.

    • getMateUnclippedEnd

      public static int getMateUnclippedEnd(SAMRecord rec)
      Parameters:
      rec - the SAM record
      Returns:
      the mate alignment end (1-based, inclusive) adjusted for clipped bases. For example if the mate has an alignment end of 100 but the last 7 bases were clipped (hard or soft clipped) then this method will return 107.

      Invalid to call on an unmapped read.

    • getMateAlignmentBlocks

      public static List<AlignmentBlock> getMateAlignmentBlocks(SAMRecord rec)
      Parameters:
      rec - the SAM record
      Returns:
      blocks of the mate sequence that have been aligned directly to the reference sequence. Note that clipped portions of the mate and inserted and deleted bases (vs. the reference) are not represented in the alignment blocks.
    • validateCigar

      public static List<SAMValidationError> validateCigar(SAMRecord rec, Cigar cigar, Integer referenceIndex, List<AlignmentBlock> alignmentBlocks, long recordNumber, String cigarTypeName)
      Run all validations of the given CIGAR (the read's or the mate's, as named by cigarTypeName). These include validation that the CIGAR makes sense independent of placement, plus validation that CIGAR + placement yields all bases with M operator within the range of the reference.
      Parameters:
      rec - the SAM record
      cigar - The cigar containing the alignment information
      referenceIndex - The reference index
      alignmentBlocks - The alignment blocks (parsed from the cigar)
      recordNumber - For error reporting. -1 if not known.
      cigarTypeName - For error reporting. "Read CIGAR" or "Mate Cigar"
      Returns:
      List of errors, or null if no errors.
    • validateMateCigar

      public static List<SAMValidationError> validateMateCigar(SAMRecord rec, long recordNumber)
      Run all validations of the mate's CIGAR. These include validation that the CIGAR makes sense independent of placement, plus validation that CIGAR + placement yields all bases with M operator within the range of the reference.
      Parameters:
      rec - the SAM record
      recordNumber - For error reporting. -1 if not known.
      Returns:
      List of errors, or null if no errors.
    • hasMateCigar

      public static boolean hasMateCigar(SAMRecord rec)
      Checks to see if it is valid for this record to have a mate CIGAR (MC) and then if there is a mate CIGAR available. This is done by checking that this record is paired, its mate is mapped, and that it returns a non-null mate CIGAR.
      Parameters:
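      rec - the SAM record
      Returns:
      true if the record is paired, its mate is mapped, and a non-null mate CIGAR is available; false otherwise.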
    • getCanonicalRecordName

      public static String getCanonicalRecordName(SAMRecord record)
      Returns a string that is the read group ID and read name separated by a colon. This is meant to canonically identify a given record within a set of records.
      Parameters:
      record - SAMRecord for which "canonical" read name is requested
      Returns:
      The record's readgroup-id (if non-null) and the read name, separated by a colon, ':'
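In sketch form, the naming rule looks like this. The class `CanonicalNameSketch` is hypothetical and takes the read group ID and read name as plain strings. Returning the bare read name when there is no read group is an assumption drawn from the "(if non-null)" wording above.

```java
final class CanonicalNameSketch {
    /** Joins read group ID and read name with ':'; falls back to the read name alone. */
    static String canonicalName(String readGroupId, String readName) {
        return readGroupId == null ? readName : readGroupId + ":" + readName;
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("rg1", "read7")); // rg1:read7
        System.out.println(canonicalName(null, "read7"));  // read7
    }
}
```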
    • getNumOverlappingAlignedBasesToClip

      public static int getNumOverlappingAlignedBasesToClip(SAMRecord rec)
      Returns the number of bases that need to be clipped due to overlapping pairs. If the record is not paired, or the given record's start position is greater than its mate's start position, zero is automatically returned. NB: This method assumes that the record's mate is not contained within the given record's alignment.
      Parameters:
      rec - SAMRecord that needs clipping due to overlapping pairs.
      Returns:
      the number of bases at the end of the read that need to be clipped such that there would be no overlapping bases with its mate. Read bases include only those from insertion, match, or mismatch Cigar operators.
    • clipOverlappingAlignedBases

      public static SAMRecord clipOverlappingAlignedBases(SAMRecord record, boolean noSideEffects)
      Returns a (possibly new) record that has been clipped if the input is mapped and paired and has overlapping bases with its mate. See getNumOverlappingAlignedBasesToClip(SAMRecord) for how the number of overlapping bases is computed. NB: this does not properly consider a cigar like: 100M20S10H. NB: This method assumes that the record's mate is not contained within the given record's alignment.
      Parameters:
      record - the record from which to clip bases.
      noSideEffects - if true a modified clone of the original record is returned, otherwise we modify the record directly.
      Returns:
      a (possibly new) record that has been clipped
    • clipOverlappingAlignedBases

      public static SAMRecord clipOverlappingAlignedBases(SAMRecord record, int numOverlappingBasesToClip, boolean noSideEffects)
      Returns a (possibly new) SAMRecord with the given number of bases soft-clipped at the end of the read if it is mapped and paired and has overlapping bases with its mate. NB: this does not properly consider a cigar like: 100M20S10H. NB: This method assumes that the record's mate is not contained within the given record's alignment.
      Parameters:
      record - the record from which to clip bases.
      numOverlappingBasesToClip - the number of bases to clip at the end of the read.
      noSideEffects - if true a modified clone of the original record is returned, otherwise we modify the record directly.
      Returns:
      a (possibly new) SAMRecord with the given number of bases soft-clipped
    • isValidUnsignedIntegerAttribute

      public static boolean isValidUnsignedIntegerAttribute(long value)
      Checks if a long attribute value is within the allowed range of a 32-bit unsigned integer.
      Parameters:
      value - a long value to check
      Returns:
      true if value is >= 0 and <= BinaryCodec.MAX_UINT, and false otherwise
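The range check is simple enough to show directly. In this sketch the class `UnsignedIntSketch` is hypothetical, and its `MAX_UINT` constant is assumed to equal BinaryCodec.MAX_UINT, taken here as 2^32 - 1.

```java
final class UnsignedIntSketch {
    // Largest 32-bit unsigned value, 2^32 - 1 (assumed to match BinaryCodec.MAX_UINT).
    static final long MAX_UINT = 0xFFFFFFFFL;

    static boolean isValidUnsignedInt(long value) {
        return value >= 0 && value <= MAX_UINT;
    }

    public static void main(String[] args) {
        System.out.println(isValidUnsignedInt(4294967295L)); // true
        System.out.println(isValidUnsignedInt(4294967296L)); // false
    }
}
```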
    • getOtherCanonicalAlignments

      public static List<SAMRecord> getOtherCanonicalAlignments(SAMRecord record)
      Extract a List of 'other canonical alignments' from a SAM record. Those alignments are stored as a string in the 'SA' tag as defined in the SAM specification. The name, sequence and qualities, mate data are copied from the original record.
      Parameters:
      record - must be non null and must have a non-null associated header.
      Returns:
      a list of 'other canonical alignments' SAMRecords. The list is empty if the 'SA' attribute is missing.
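To show what the 'SA' string contains, the sketch below parses it into simple value objects. It assumes the SAM specification's layout of semicolon-terminated entries of the form rname,pos,strand,CIGAR,mapQ,NM. The class `SaTagSketch` and its nested `Alignment` type are hypothetical, and unlike the real method it does not build SAMRecords or copy name, sequence, qualities or mate data.

```java
import java.util.ArrayList;
import java.util.List;

final class SaTagSketch {
    static final class Alignment {
        final String reference;
        final int position;
        final boolean negativeStrand;
        final String cigar;
        final int mapq;
        final int editDistance;

        Alignment(String reference, int position, boolean negativeStrand,
                  String cigar, int mapq, int editDistance) {
            this.reference = reference;
            this.position = position;
            this.negativeStrand = negativeStrand;
            this.cigar = cigar;
            this.mapq = mapq;
            this.editDistance = editDistance;
        }
    }

    /** Parses an SA value; a null value (tag missing) yields an empty list. */
    static List<Alignment> parse(String saValue) {
        List<Alignment> result = new ArrayList<>();
        if (saValue == null) {
            return result;
        }
        for (String entry : saValue.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            String[] f = entry.split(",");
            if (f.length != 6) {
                throw new IllegalArgumentException("Malformed SA entry: " + entry);
            }
            result.add(new Alignment(f[0], Integer.parseInt(f[1]), f[2].equals("-"),
                    f[3], Integer.parseInt(f[4]), Integer.parseInt(f[5])));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("chr1,100,+,50M,60,0;chr2,200,-,30M20S,20,1;").size()); // 2
    }
}
```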
    • isReferenceSequenceCompatibleWithBAI

      @Deprecated public static boolean isReferenceSequenceCompatibleWithBAI(SAMSequenceRecord sequence)
      Deprecated.
      because the method does the exact opposite of what it says. Use the correctly named isReferenceSequenceIncompatibleWithBAI() instead.
    • isReferenceSequenceIncompatibleWithBAI

      public static boolean isReferenceSequenceIncompatibleWithBAI(SAMSequenceRecord sequence)
      Checks if a reference sequence is incompatible with the BAI indexing format. Returns true when the sequence cannot be indexed with BAI.
      Parameters:
      sequence - reference sequence.
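One way to picture this check is as a length comparison against the span a BAI index can address. The sketch below assumes that limit is 2^29 bases (512 Mbp) and that a sequence is incompatible only when it is strictly longer. The class `BaiLimitSketch` and the constant name are hypothetical, so consult the library source for the exact threshold.

```java
final class BaiLimitSketch {
    // Assumed maximum reference length addressable by the BAI binning scheme: 2^29 bases.
    static final int BAI_MAX_SPAN = 1 << 29;

    /** Returns true if a sequence of this length cannot be indexed with BAI. */
    static boolean isIncompatibleWithBai(int sequenceLength) {
        return sequenceLength > BAI_MAX_SPAN;
    }

    public static void main(String[] args) {
        System.out.println(isIncompatibleWithBai(248_956_422)); // false
        System.out.println(isIncompatibleWithBai(600_000_000)); // true
    }
}
```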
    • calculateOATagValue

      public static String calculateOATagValue(SAMRecord record)
      Function to create the OA tag value from a record. The OA tag contains the mapping information of a record encoded as a comma-separated string (REF,POS,STRAND,CIGAR,MAPPING_QUALITY,NM_TAG_VALUE)
      Parameters:
      record - to use for generating the OA tag
      Returns:
      the OA tag string value
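The field order listed above can be illustrated with a small formatter. The class `OaTagSketch` is hypothetical and takes the mapping fields directly instead of a SAMRecord. Two details are assumptions about the real method's output: the trailing semicolon (taken from the SAM optional-field convention for OA) and an empty NM field when no NM value is available.

```java
final class OaTagSketch {
    /** Builds "REF,POS,STRAND,CIGAR,MAPQ,NM;" from already-extracted fields. */
    static String oaValue(String reference, int position, boolean negativeStrand,
                          String cigar, int mapq, Integer nm) {
        String strand = negativeStrand ? "-" : "+";
        String nmField = nm == null ? "" : nm.toString(); // assumed: empty when NM is absent
        return reference + "," + position + "," + strand + "," + cigar + ","
                + mapq + "," + nmField + ";";
    }

    public static void main(String[] args) {
        System.out.println(oaValue("chr1", 100, false, "50M", 60, 2)); // chr1,100,+,50M,60,2;
    }
}
```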