Class MarkDuplicates

All Implemented Interfaces:
MarkDuplicatesHelper
Direct Known Subclasses:
SimpleMarkDuplicatesWithMateCigar

@DocumentedFeature public class MarkDuplicates extends AbstractMarkDuplicatesCommandLineProgram implements MarkDuplicatesHelper
A better duplication marking algorithm that handles all cases including clipped and gapped alignments.
  • Field Details

    • DUPLICATE_TYPE_TAG

      public static final String DUPLICATE_TYPE_TAG
      The optional attribute in SAM/BAM/CRAM files used to store the duplicate type.
    • DUPLICATE_TYPE_LIBRARY

      public static final String DUPLICATE_TYPE_LIBRARY
      The duplicate type tag value for duplicate type: library.
    • DUPLICATE_TYPE_SEQUENCING

      public static final String DUPLICATE_TYPE_SEQUENCING
      The duplicate type tag value for duplicate type: sequencing (optical & pad-hopping, or "co-localized").
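      For illustration only, a minimal htsjdk-based sketch of how downstream code might test these constants against a record's DT attribute (the tag is only present when the tool was run with a TAGGING_POLICY that records duplicate types). The helper class below is hypothetical and not part of Picard:

        import htsjdk.samtools.SAMRecord;
        import picard.sam.markduplicates.MarkDuplicates;

        // Hypothetical helper, shown only to illustrate use of the tag constants.
        final class DuplicateTypeUtil {
            // True when the record's DT attribute marks it as a sequencing
            // (optical / pad-hopping) duplicate rather than a library duplicate.
            static boolean isSequencingDuplicate(final SAMRecord rec) {
                final String dt = rec.getStringAttribute(MarkDuplicates.DUPLICATE_TYPE_TAG);
                return MarkDuplicates.DUPLICATE_TYPE_SEQUENCING.equals(dt);
            }
        }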
    • DUPLICATE_SET_INDEX_TAG

      public static final String DUPLICATE_SET_INDEX_TAG
      The attribute in the SAM/BAM file used to store which read was selected as the representative out of a duplicate set.
    • DUPLICATE_SET_SIZE_TAG

      public static final String DUPLICATE_SET_SIZE_TAG
      The attribute in the SAM/BAM file used to store the size of a duplicate set.
    • MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP

      @Argument(shortName="MAX_SEQS", doc="This option is obsolete. ReadEnds will always be spilled to disk.") public int MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP
      If there are more than this many sequences in the SAM file, don't spill to disk because there will not be enough file handles.
    • MAX_FILE_HANDLES_FOR_READ_ENDS_MAP

      @Argument(shortName="MAX_FILE_HANDLES", doc="Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of file that may be open. This number can be found by executing the \'ulimit -n\' command on a Unix system.") public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP
    • SORTING_COLLECTION_SIZE_RATIO

      @Argument(doc="This number, plus the maximum RAM available to the JVM, determine the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number.") public double SORTING_COLLECTION_SIZE_RATIO
    • BARCODE_TAG

      @Argument(doc="Barcode SAM tag (ex. BC for 10X Genomics)", optional=true) public String BARCODE_TAG
    • READ_ONE_BARCODE_TAG

      @Argument(doc="Read one barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_ONE_BARCODE_TAG
    • READ_TWO_BARCODE_TAG

      @Argument(doc="Read two barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_TWO_BARCODE_TAG
    • TAG_DUPLICATE_SET_MEMBERS

      @Argument(doc="If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2 which occurs when two reads map to the same portion of the reference only one of which is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set.", optional=true) public boolean TAG_DUPLICATE_SET_MEMBERS
    • REMOVE_SEQUENCING_DUPLICATES

      @Argument(doc="If true remove \'optical\' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored.") public boolean REMOVE_SEQUENCING_DUPLICATES
    • TAGGING_POLICY

      @Argument(doc="Determines how duplicate types are recorded in the DT optional attribute.") public MarkDuplicates.DuplicateTaggingPolicy TAGGING_POLICY
    • CLEAR_DT

      @Argument(doc="Clear DT tag from input SAM records. Should be set to false if input SAM doesn\'t have this tag. Default true") public boolean CLEAR_DT
    • DUPLEX_UMI

      @Argument(doc="Treat UMIs as being duplex stranded. This option requires that the UMI consist of two equal length strings that are separated by a hyphen (e.g. \'ATC-GTC\'). Reads are considered duplicates if, in addition to standard definition, have identical normalized UMIs. A UMI from the \'bottom\' strand is normalized by swapping its content around the hyphen (eg. ATC-GTC becomes GTC-ATC). A UMI from the \'top\' strand is already normalized as it is. Both reads from a read pair considered top strand if the read 1 unclipped 5\' coordinate is less than the read 2 unclipped 5\' coordinate. All chimeric reads and read fragments are treated as having come from the top strand. With this option is it required that the BARCODE_TAG hold non-normalized UMIs. Default false.") public boolean DUPLEX_UMI
    • MOLECULAR_IDENTIFIER_TAG

      @Argument(doc="SAM tag to uniquely identify the molecule from which a read was derived. Use of this option requires that the BARCODE_TAG option be set to a non null value. Default null.", optional=true) public String MOLECULAR_IDENTIFIER_TAG
    • flowBasedArguments

      @ArgumentCollection public MarkDuplicatesForFlowArgumentCollection flowBasedArguments
    • pairSort

      protected htsjdk.samtools.util.SortingCollection<ReadEndsForMarkDuplicates> pairSort
    • fragSort

      protected htsjdk.samtools.util.SortingCollection<ReadEndsForMarkDuplicates> fragSort
    • duplicateIndexes

      protected htsjdk.samtools.util.SortingLongCollection duplicateIndexes
    • opticalDuplicateIndexes

      protected htsjdk.samtools.util.SortingLongCollection opticalDuplicateIndexes
    • representativeReadIndicesForDuplicates

      protected htsjdk.samtools.util.SortingCollection<RepresentativeReadIndexer> representativeReadIndicesForDuplicates
    • libraryIdGenerator

      protected LibraryIdGenerator libraryIdGenerator
  • Constructor Details

    • MarkDuplicates

      public MarkDuplicates()
  • Method Details

    • doWork

      protected int doWork()
      Main work method. Reads the SAM file once and collects sorted information about the 5' ends of both reads in each pair (or of the single end, for unpaired reads). Then makes a pass through that information to determine which records are duplicates, before re-reading the input file and writing it out with the duplicate flags set correctly.
      Specified by:
      doWork in class CommandLineProgram
      Returns:
      program exit status.
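      doWork() is normally reached through the standard Picard command-line machinery rather than called directly. A minimal sketch, assuming the instanceMain(String[]) entry point inherited from the Picard CommandLineProgram base class, the usual INPUT/OUTPUT/METRICS_FILE arguments, and placeholder file names:

        import picard.sam.markduplicates.MarkDuplicates;

        public class RunMarkDuplicates {
            public static void main(final String[] args) {
                // Placeholder file names; the argument list is not exhaustive.
                final String[] picardArgs = {
                        "INPUT=input.bam",
                        "OUTPUT=marked.bam",
                        "METRICS_FILE=duplicate_metrics.txt"
                };
                // instanceMain parses the arguments and then invokes doWork(),
                // returning the program exit status.
                final int exitStatus = new MarkDuplicates().instanceMain(picardArgs);
                System.exit(exitStatus);
            }
        }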
    • getReadDuplicateScore

      public short getReadDuplicateScore(htsjdk.samtools.SAMRecord rec, ReadEndsForMarkDuplicates pairedEnds)
      Calculates the duplicate score for the read.
      Specified by:
      getReadDuplicateScore in interface MarkDuplicatesHelper
      Parameters:
      rec - the read
      pairedEnds - the location of the read ends
      Returns:
      the read score calculated according to the DUPLICATE_SCORING_STRATEGY: SUM_OF_BASE_QUALITIES (default), TOTAL_MAPPED_REFERENCE_LENGTH, or RANDOM
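      As a rough illustration of the SUM_OF_BASE_QUALITIES idea only: the exact computation lives elsewhere and handles pairing and quality details, so the hypothetical helper below is an approximation, not the Picard scoring itself.

        import htsjdk.samtools.SAMRecord;

        // Naive sketch: the read with the higher total base quality is preferred as
        // the representative of its duplicate set. This approximates the
        // SUM_OF_BASE_QUALITIES strategy; it is not the exact Picard computation.
        final class ScoreSketch {
            static int sumOfBaseQualities(final SAMRecord rec) {
                int score = 0;
                for (final byte quality : rec.getBaseQualities()) {
                    score += quality;
                }
                return score;
            }
        }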
    • buildReadEnds

      public ReadEndsForMarkDuplicates buildReadEnds(htsjdk.samtools.SAMFileHeader header, long index, htsjdk.samtools.SAMRecord rec, boolean useBarcodes)
      Builds a read ends object that represents a single read.
      Specified by:
      buildReadEnds in interface MarkDuplicatesHelper
    • generateDuplicateIndexes

      public void generateDuplicateIndexes(boolean useBarcodes, boolean indexOpticalDuplicates)
      Goes through the accumulated ReadEndsForMarkDuplicates objects and determines which of them are to be marked as duplicates.
      Specified by:
      generateDuplicateIndexes in interface MarkDuplicatesHelper
    • handleChunk

      protected void handleChunk(List<ReadEndsForMarkDuplicates> nextChunk)
    • areComparableForDuplicates

      protected boolean areComparableForDuplicates(ReadEndsForMarkDuplicates lhs, ReadEndsForMarkDuplicates rhs, boolean compareRead2, boolean useBarcodes)
    • markDuplicateFragments

      protected void markDuplicateFragments(List<ReadEndsForMarkDuplicates> list, boolean containsPairs)
      Takes a list of ReadEndsForMarkDuplicates objects and removes from it all objects that should not be marked as duplicates. This will set the duplicate index only for list items that are fragments.
      Parameters:
      containsPairs - true if the list also contains objects containing pairs, false otherwise.