Class PathSeqBuildReferenceTaxonomy

java.lang.Object
org.broadinstitute.hellbender.cmdline.CommandLineProgram
org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqBuildReferenceTaxonomy
All Implemented Interfaces:
org.broadinstitute.barclay.argparser.CommandLinePluginProvider

@DocumentedFeature public class PathSeqBuildReferenceTaxonomy extends CommandLineProgram
Build an annotated taxonomy data file for a given microbe reference. The output file of this tool is required to run the PathSeq pipeline.

The tool reads the list of sequence accessions from the given reference. For each accession, it looks up the NCBI taxonomic ID of the corresponding organism and builds a taxonomic tree containing only organisms that are represented in the reference. The reference should contain only sequences from the NCBI RefSeq and/or GenBank databases.
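The lookup-and-prune step described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the class name, method name, and the plain-map representations of the catalog (accession to taxonomic ID) and taxonomy dump (taxonomic ID to parent ID) are all assumptions made for illustration, and the taxonomic IDs in the example are toy values.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; names and data structures are NOT taken from GATK.
public class TaxonomyPruneSketch {

    /**
     * Returns the IDs of all taxonomy nodes that are represented in the
     * reference or are ancestors of a represented node.
     *
     * @param referenceAccessions accessions read from the reference
     * @param accessionToTaxId    accession -> taxonomic ID (assumed catalog view)
     * @param parentOf            taxonomic ID -> parent ID (assumed taxonomy view);
     *                            the root is assumed to map to itself
     */
    public static Set<Integer> retainedNodes(List<String> referenceAccessions,
                                             Map<String, Integer> accessionToTaxId,
                                             Map<Integer, Integer> parentOf) {
        Set<Integer> retained = new HashSet<>();
        for (String accession : referenceAccessions) {
            Integer node = accessionToTaxId.get(accession);
            // Accessions missing from the catalog are simply skipped in this sketch.
            // Walk toward the root, stopping once an already-retained node is reached.
            while (node != null && retained.add(node)) {
                Integer parent = parentOf.get(node);
                if (parent == null || parent.equals(node)) {
                    break; // reached the root or an unknown parent
                }
                node = parent;
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        // Toy taxonomy: 1 is the root; 2 and 3 are its children; 20 and 30 are leaves.
        Map<Integer, Integer> parentOf = Map.of(1, 1, 2, 1, 3, 1, 20, 2, 30, 3);
        Map<String, Integer> catalog = Map.of("ACC_A", 20);
        Set<Integer> kept = retainedNodes(List.of("ACC_A", "ACC_MISSING"), catalog, parentOf);
        System.out.println(new TreeSet<>(kept));
    }
}
```

In this sketch, nodes with no descendant in the reference (3 and 30 in the toy taxonomy) never enter the retained set, which mirrors the idea of a tree containing only organisms represented in the reference.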

Input

  • An indexed microbe reference in FASTA format (NCBI RefSeq/GenBank sequences)
  • Downloaded NCBI RefSeq (and/or GenBank) catalog archive file(s)
  • Downloaded NCBI taxonomy archive file

See argument documentation for information about where to download the archive files.

Output

  • A binary file containing reference taxonomy information

Usage examples

 gatk PathSeqBuildReferenceTaxonomy \
   --reference microbe_reference.fasta \
   --output taxonomy.db \
   --refseq-catalog RefSeq-releaseXX.catalog.gz \
   --tax-dump taxdump.tar.gz \
   --min-non-virus-contig-length 2000
 

Notes

The reference sequences, NCBI catalog, and taxonomy archive are often inconsistent with one another. To minimize such inconsistencies, retrieve all input files on the same date.

  • Field Details

    • REFSEQ_CATALOG_LONG_NAME

      public static final String REFSEQ_CATALOG_LONG_NAME
    • REFSEQ_CATALOG_SHORT_NAME

      public static final String REFSEQ_CATALOG_SHORT_NAME
    • GENBANK_CATALOG_LONG_NAME

      public static final String GENBANK_CATALOG_LONG_NAME
    • GENBANK_CATALOG_SHORT_NAME

      public static final String GENBANK_CATALOG_SHORT_NAME
    • TAX_DUMP_LONG_NAME

      public static final String TAX_DUMP_LONG_NAME
    • TAX_DUMP_SHORT_NAME

      public static final String TAX_DUMP_SHORT_NAME
    • MIN_NON_VIRUS_CONTIG_LENGTH_LONG_NAME

      public static final String MIN_NON_VIRUS_CONTIG_LENGTH_LONG_NAME
    • MIN_NON_VIRUS_CONTIG_LENGTH_SHORT_NAME

      public static final String MIN_NON_VIRUS_CONTIG_LENGTH_SHORT_NAME
    • referenceArguments

      @ArgumentCollection protected final ReferenceInputArgumentCollection referenceArguments
    • outputPath

      @Argument(doc="Local path for the output file. By convention, the extension should be \".db\"", shortName="O", fullName="output") public String outputPath
    • refseqCatalogPath

      @Argument(doc="Local path to catalog file (RefSeq-releaseXX.catalog.gz available at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/)", fullName="refseq-catalog", shortName="RC", optional=true) public String refseqCatalogPath
    • genbankCatalogPath

      @Argument(doc="Local path to Genbank catalog file (gbXXX.catalog.XXX.txt.gz at ftp://ftp.ncbi.nlm.nih.gov/genbank/catalog/)", fullName="genbank-catalog", shortName="GC", optional=true) public String genbankCatalogPath
      This may be supplied alone or in addition to the RefSeq catalog in the case that sequences from GenBank are present in the reference.
    • taxdumpPath

      @Argument(doc="Local path to taxonomy dump tarball (taxdump.tar.gz available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)", fullName="tax-dump", shortName="TD") public String taxdumpPath
    • minNonVirusContigLength

      @Argument(doc="Minimum reference contig length for non-viruses", fullName="min-non-virus-contig-length", shortName="min-non-virus-contig-length", minValue=0.0, minRecommendedValue=500.0, maxRecommendedValue=10000.0) public int minNonVirusContigLength
      Sequences from non-virus organisms that are shorter than this length are filtered out, so any reads aligning to them are ignored. This is a quality control measure to remove shorter sequences from draft genomes, which are likely to contain artifacts such as cross-species contamination or sequencing adapters. Note that this may also remove some bacterial plasmid sequences.
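The length filter described for min-non-virus-contig-length amounts to a simple predicate, sketched below. The class and method names are hypothetical, and how a contig is classified as viral is left as an input because that detail is not specified here.

```java
// Hypothetical sketch of the contig length filter; names are NOT taken from GATK.
public class ContigLengthFilterSketch {

    /**
     * Returns true if a reference contig would be kept under the rule
     * "non-virus sequences shorter than the minimum are filtered out".
     *
     * @param isVirus                 whether the contig belongs to a virus
     *                                (classification is assumed to be done elsewhere)
     * @param contigLength            length of the contig in bases
     * @param minNonVirusContigLength minimum length required of non-virus contigs
     */
    public static boolean keepContig(boolean isVirus, long contigLength,
                                     int minNonVirusContigLength) {
        // Viral contigs are exempt; all others must meet the minimum length.
        return isVirus || contigLength >= minNonVirusContigLength;
    }

    public static void main(String[] args) {
        int min = 2000; // value used in the usage example above
        System.out.println(keepContig(false, 1500, min));
        System.out.println(keepContig(false, 2000, min));
        System.out.println(keepContig(true, 1500, min));
    }
}
```

A contig exactly at the threshold is kept in this sketch, since only sequences shorter than the minimum are described as filtered.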
  • Constructor Details

    • PathSeqBuildReferenceTaxonomy

      public PathSeqBuildReferenceTaxonomy()
  • Method Details

    • doWork

      public Object doWork()
      Description copied from class: CommandLineProgram
      Do the work after the command line has been parsed. A RuntimeException may be thrown by this method and is reported appropriately.
      Specified by:
      doWork in class CommandLineProgram
      Returns:
      the return value, or null if there is none.