Class MarkDuplicatesWithMateCigar


@DocumentedFeature public class MarkDuplicatesWithMateCigar extends AbstractMarkDuplicatesCommandLineProgram
An even better duplication marking algorithm that handles all cases including clipped and gapped alignments.

This tool differs from MarkDuplicates in that it may break ties differently. Furthermore, because it is a one-pass algorithm, it cannot know in advance which program records in the file should be chained. It can therefore only examine the header to infer which program group records have no associated previous program group record. If a read is encountered without a program record, or with one that is not among those inferred from the header, that read will not be updated.

This tool also does not work with alignments that have large gaps or skips, such as those from RNA-seq data. It must buffer small genomic windows to ensure the integrity of the duplicate marking, and large skips (e.g., skipped introns) in the alignment records would force that window to become very large, exhausting memory.
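To make the memory concern concrete, the following is a minimal sketch of how much reference sequence a single alignment covers, given its CIGAR string. It is illustrative only and is not how the tool itself computes spans: the class name CigarSpan and the method referenceSpan are hypothetical, and the parsing assumes a well-formed CIGAR string. The only fact relied on is the SAM convention that the M, D, N, = and X operators consume reference bases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Picard: measures the reference span of a CIGAR string. */
public class CigarSpan {
    // One CIGAR element: a length followed by a single operator character.
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /** Returns the number of reference bases covered by the given CIGAR string. */
    public static int referenceSpan(String cigar) {
        int span = 0;
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            int length = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            // M, D, N, = and X consume reference bases; I, S, H and P do not.
            if ("MDN=X".indexOf(op) >= 0) {
                span += length;
            }
        }
        return span;
    }

    public static void main(String[] args) {
        // An ungapped 100-base read.
        System.out.println(referenceSpan("100M"));
        // A 100-base spliced read that skips a 10,000-base intron.
        System.out.println(referenceSpan("50M10000N50M"));
    }
}
```

The spliced read has the same number of sequenced bases as the ungapped one but covers 10,100 reference bases, so a buffer sized for ordinary reads would have to grow by roughly two orders of magnitude to hold it.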

  • Field Details

    • MINIMUM_DISTANCE

      @Argument(doc="The minimum distance to buffer records to account for clipping on the 5\' end of the records. For a given alignment, this parameter controls the width of the window to search for duplicates of that alignment. Due to 5\' read clipping, duplicates do not necessarily have the same 5\' alignment coordinates, so the algorithm needs to search around the neighborhood. For single end sequencing data, the neighborhood is only determined by the amount of clipping (assuming no split reads), thus setting MINIMUM_DISTANCE to twice the sequencing read length should be sufficient. For paired end sequencing, the neighborhood is also determined by the fragment insert size, so you may want to set MINIMUM_DISTANCE to something like twice the 99.5% percentile of the fragment insert size distribution (see CollectInsertSizeMetrics). Or you can set this number to -1 to use either a) twice the first read\'s read length, or b) 100, whichever is smaller. Note that the larger the window, the greater the RAM requirements, so you could run into performance limitations if you use a value that is unnecessarily large.", optional=true) public int MINIMUM_DISTANCE
    • BLOCK_SIZE

      @Argument(doc="The block size for use in the coordinate-sorted record buffer.", optional=true) public int BLOCK_SIZE
  • Constructor Details

    • MarkDuplicatesWithMateCigar

      public MarkDuplicatesWithMateCigar()
  • Method Details

    • doWork

      protected int doWork()
      Main work method.
      Specified by:
      doWork in class CommandLineProgram
      Returns:
      program exit status.
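
The sizing guidance in the MINIMUM_DISTANCE description can be sketched as plain arithmetic. This is a rough illustration under stated assumptions, not code from the tool: the class MinimumDistanceSketch and its methods are hypothetical, the insert sizes are assumed to be available as an array (for example, taken from CollectInsertSizeMetrics output), and the nearest-rank percentile used here is one of several reasonable definitions.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Picard: applies the documented MINIMUM_DISTANCE guidance. */
public class MinimumDistanceSketch {

    /** Single-end data: twice the sequencing read length. */
    public static int forSingleEnd(int readLength) {
        return 2 * readLength;
    }

    /** Paired-end data: twice the 99.5th percentile of the insert size distribution. */
    public static int forPairedEnd(int[] insertSizes) {
        if (insertSizes.length == 0) {
            throw new IllegalArgumentException("at least one insert size is required");
        }
        int[] sorted = insertSizes.clone();
        Arrays.sort(sorted);
        // Nearest-rank percentile (an assumption): rank = ceil(0.995 * n), in integer arithmetic.
        int rank = (int) ((995L * sorted.length + 999) / 1000);
        return 2 * sorted[rank - 1];
    }

    /** The documented behavior for -1: twice the first read's length or 100, whichever is smaller. */
    public static int whenSetToMinusOne(int firstReadLength) {
        return Math.min(2 * firstReadLength, 100);
    }

    public static void main(String[] args) {
        System.out.println(forSingleEnd(150));
        System.out.println(whenSetToMinusOne(36));
        System.out.println(whenSetToMinusOne(150));
    }
}
```

Because RAM use grows with the window, a value derived this way is best treated as a starting point and reduced if memory becomes a constraint.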