Class EstimateLibraryComplexity


@DocumentedFeature public class EstimateLibraryComplexity extends AbstractOpticalDuplicateFinderCommandLineProgram

Attempts to estimate library complexity from sequence alone. It does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads whose first N bases are identical to one another for duplicates. Reads are considered duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
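
For illustration, here is a minimal sketch of that duplicate test: a gapless, position-by-position comparison of two reads' bases against a mismatch-rate cutoff. The method name and the bases-as-byte-array representation are assumptions for this sketch, not the actual Picard code.

    // Sketch only: decide whether two reads look like duplicates by gapless comparison.
    // Bases are assumed to be byte arrays, as returned by htsjdk's SAMRecord.getReadBases().
    static boolean looksLikeDuplicate(final byte[] read1, final byte[] read2, final double maxDiffRate) {
        final int length = Math.min(read1.length, read2.length);
        if (length == 0) return false;
        int mismatches = 0;
        for (int i = 0; i < length; i++) {
            if (read1[i] != read2[i]) mismatches++;
        }
        // Duplicates if the overall mismatch rate is at or below MAX_DIFF_RATE (0.03 by default).
        return mismatches / (double) length <= maxDiffRate;
    }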

Reads of poor quality are filtered out so as to provide a more accurate estimate. The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than MIN_MEAN_QUALITY across either the first or second read.
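
A comparably small sketch of that quality filter: a read is discarded if any of its first N bases is a no-call, or if its mean base quality falls below MIN_MEAN_QUALITY. The helper name and parameters are illustrative assumptions, not the actual implementation.

    // Sketch only: true if a single read survives the quality filter; a pair is kept only if both reads pass.
    static boolean passesQualityFilter(final byte[] bases, final byte[] quals,
                                       final int minIdenticalBases, final int minMeanQuality) {
        // Reject reads with a no-call ('N') anywhere in the first N bases.
        for (int i = 0; i < minIdenticalBases && i < bases.length; i++) {
            if (bases[i] == 'N' || bases[i] == 'n') return false;
        }
        // Reject reads whose mean base quality is below the minimum.
        long total = 0;
        for (final byte q : quals) total += q;
        return quals.length > 0 && (double) total / quals.length >= minMeanQuality;
    }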

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. Also, since there is no alignment with which to screen out technical reads, one further filter is applied to the data. After examining all reads, a Histogram is built of [#reads in duplicate set -> # of duplicate sets]; all bins that contain exactly one duplicate set are then removed from the Histogram as outliers before the library size is estimated.
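
To make those last two steps concrete, here is a hedged sketch of dropping singleton histogram bins and then estimating library size. The estimate solves unique = X * (1 - exp(-total / X)) for X, the usual saturation-style estimator (and, to my understanding, the form used by Picard's DuplicationMetrics); the map-based histogram and the bisection solver are illustrative assumptions rather than the actual implementation.

    import java.util.Map;
    import java.util.TreeMap;

    final class LibrarySizeSketch {
        // Drop bins of [duplicate-set size -> number of sets] that contain exactly one set,
        // since the description above treats those as likely technical artifacts.
        static Map<Integer, Long> dropSingletonBins(final Map<Integer, Long> setSizeToCount) {
            final Map<Integer, Long> trimmed = new TreeMap<>();
            setSizeToCount.forEach((size, count) -> { if (count > 1) trimmed.put(size, count); });
            return trimmed;
        }

        // Solve uniquePairs = x * (1 - exp(-readPairs / x)) for x (the library size) by bisection.
        // Assumes uniquePairs < readPairs, i.e. at least one duplicate pair was observed.
        static double estimateLibrarySize(final double readPairs, final double uniquePairs) {
            double lo = uniquePairs;            // the library cannot be smaller than the unique pairs seen
            double hi = uniquePairs * 1e9;      // generous upper bound for the search
            for (int i = 0; i < 80; i++) {
                final double mid = (lo + hi) / 2;
                final double predictedUnique = mid * (1 - Math.exp(-readPairs / mid));
                if (predictedUnique > uniquePairs) hi = mid; else lo = mid;
            }
            return (lo + hi) / 2;
        }
    }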

  • Field Details

    • INPUT

      @Argument(shortName="I", doc="One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.") public List<File> INPUT
    • OUTPUT

      @Argument(shortName="O", doc="Output file to writes per-library metrics to.") public File OUTPUT
    • MIN_IDENTICAL_BASES

      @Argument(doc="The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.") public int MIN_IDENTICAL_BASES
    • MAX_DIFF_RATE

      @Argument(doc="The maximum rate of differences between two reads to call them identical.") public double MAX_DIFF_RATE
    • MIN_MEAN_QUALITY

      @Argument(doc="The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.") public int MIN_MEAN_QUALITY
    • MAX_GROUP_RATIO

      @Argument(doc="Do not process self-similar groups that are this many times over the mean expected group size. I.e. if the input contains 10m read pairs and MIN_IDENTICAL_BASES is set to 5, then the mean expected group size would be approximately 10 reads.") public int MAX_GROUP_RATIO
    • BARCODE_TAG

      @Argument(doc="Barcode SAM tag (ex. BC for 10X Genomics)", optional=true) public String BARCODE_TAG
    • READ_ONE_BARCODE_TAG

      @Argument(doc="Read one barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_ONE_BARCODE_TAG
    • READ_TWO_BARCODE_TAG

      @Argument(doc="Read two barcode SAM tag (ex. BX for 10X Genomics)", optional=true) public String READ_TWO_BARCODE_TAG
    • MAX_READ_LENGTH

      @Argument(doc="The maximum number of bases to consider when comparing reads (0 means no maximum).", optional=true) public int MAX_READ_LENGTH
    • MIN_GROUP_COUNT

      @Argument(doc="Minimum number group count. On a per-library basis, we count the number of groups of duplicates that have a particular size. Omit from consideration any count that is less than this value. For example, if we see only one group of duplicates with size 500, we omit it from the metric calculations if MIN_GROUP_COUNT is set to two. Setting this to two may help remove technical artifacts from the library size calculation, for example, adapter dimers.", optional=true) public int MIN_GROUP_COUNT
  • Constructor Details

    • EstimateLibraryComplexity

      public EstimateLibraryComplexity()
  • Method Details

    • customCommandLineValidation

      protected String[] customCommandLineValidation()
      Description copied from class: CommandLineProgram
Put any custom command-line validation in an override of this method. clp is initialized at this point and can be used to print usage and access argv. Any options set by the command-line parser can be validated.
      Overrides:
      customCommandLineValidation in class AbstractOpticalDuplicateFinderCommandLineProgram
      Returns:
null if the command line is valid. If the command line is invalid, returns an array of error messages to be written to the appropriate place.
    • getBarcodeValue

      public int getBarcodeValue(htsjdk.samtools.SAMRecord record)
    • getReadBarcodeValue

      public static int getReadBarcodeValue(htsjdk.samtools.SAMRecord record, String tag)
    • doWork

      protected int doWork()
Method that does most of the work. Reads through the input BAM file, extracts the read sequences of each read pair, and sorts them via a SortingCollection. It then traverses the sorted reads and looks at small groups at a time to find duplicates; a simplified sketch of this grouping step appears at the end of this page.
      Specified by:
      doWork in class CommandLineProgram
      Returns:
      program exit status.
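
As referenced under doWork above, here is a simplified sketch of the sort-then-group traversal: read pairs are keyed on the first N bases of each read so that potential duplicates sort next to each other, and the sorted stream is walked one key-group at a time. The in-memory list stands in for the disk-backed SortingCollection, and keying on both reads of the pair is an assumption made for illustration.

    import java.util.ArrayList;
    import java.util.List;

    final class PrefixGroupingSketch {
        // Stand-in for the per-pair record that the real code writes to a SortingCollection.
        record PairedReadSequence(String read1Bases, String read2Bases) {}

        // Key on the first N bases of each read so potential duplicates become adjacent after sorting.
        // Assumes every read is at least n bases long.
        static String groupKey(final PairedReadSequence pair, final int n) {
            return pair.read1Bases().substring(0, n) + "|" + pair.read2Bases().substring(0, n);
        }

        // Sort by key, then walk the sorted pairs and emit each run of identical keys as one group.
        static List<List<PairedReadSequence>> groupByPrefix(final List<PairedReadSequence> pairs, final int n) {
            pairs.sort((a, b) -> groupKey(a, n).compareTo(groupKey(b, n)));
            final List<List<PairedReadSequence>> groups = new ArrayList<>();
            List<PairedReadSequence> current = new ArrayList<>();
            String currentKey = null;
            for (final PairedReadSequence pair : pairs) {
                final String key = groupKey(pair, n);
                if (!key.equals(currentKey)) {
                    if (!current.isEmpty()) groups.add(current);
                    current = new ArrayList<>();
                    currentKey = key;
                }
                current.add(pair);
            }
            if (!current.isEmpty()) groups.add(current);
            return groups;
        }
    }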