Class PathSeqBuildReferenceTaxonomy
java.lang.Object
org.broadinstitute.hellbender.cmdline.CommandLineProgram
org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqBuildReferenceTaxonomy
- All Implemented Interfaces:
org.broadinstitute.barclay.argparser.CommandLinePluginProvider
Build an annotated taxonomy datafile for a given microbe reference. The output file from this tool is required to run the PathSeq pipeline.
The tool reads the list of sequence accessions from the given reference. For each accession, it looks up the NCBI taxonomic ID of the corresponding organism and builds a taxonomic tree containing only organisms that are represented in the reference. The reference should contain only sequences from the NCBI RefSeq and/or GenBank databases.
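The lookup-and-prune step described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the class name, method name, and the two in-memory maps are hypothetical stand-ins for data parsed from the catalog and taxonomy archives. It assumes the root taxon lists itself as its own parent, as in the NCBI taxonomy dump.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; all names here are hypothetical, not GATK API. */
public class TaxonomyPruneSketch {

    /**
     * Returns a child -> parent map restricted to taxa that are referenced by at
     * least one accession, plus all of their ancestors up to the root.
     *
     * @param referenceAccessions accessions found in the reference
     * @param accessionToTaxId    assumed to be parsed from a catalog file
     * @param parentOf            assumed to be parsed from a taxonomy dump;
     *                            the root maps to itself
     */
    public static Map<Integer, Integer> pruneToReference(
            List<String> referenceAccessions,
            Map<String, Integer> accessionToTaxId,
            Map<Integer, Integer> parentOf) {
        Map<Integer, Integer> pruned = new TreeMap<>();
        for (String accession : referenceAccessions) {
            // Accessions missing from the catalog yield null and are skipped.
            Integer taxId = accessionToTaxId.get(accession);
            // Walk up the lineage until the root or an already-kept node is reached.
            while (taxId != null && !pruned.containsKey(taxId)) {
                Integer parent = parentOf.get(taxId);
                if (parent == null) {
                    break; // taxon absent from the taxonomy: lineage cannot continue
                }
                pruned.put(taxId, parent);
                taxId = parent.equals(taxId) ? null : parent; // root is its own parent
            }
        }
        return pruned;
    }
}
```

Taxa that no reference accession maps to never enter the result, which is how the pruned tree ends up containing only represented organisms and their ancestors. Accessions missing from the catalog are silently skipped in this sketch.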
Input
- An indexed microbe reference in FASTA format (NCBI RefSeq/GenBank sequences)
- Downloaded NCBI RefSeq (and/or GenBank) catalog archive file(s)
- Downloaded NCBI taxonomy archive file
See argument documentation for information about where to download the archive files.
Output
- A binary file containing reference taxonomy information
Usage examples
gatk PathSeqBuildReferenceTaxonomy \
    --reference microbe_reference.fasta \
    --output taxonomy.db \
    --refseq-catalog RefSeq-releaseXX.catalog.gz \
    --tax-dump taxdump.tar.gz \
    --min-non-virus-contig-length 2000
Notes
The reference sequences, NCBI catalog, and taxonomy archive are often inconsistent with one another. To minimize such mismatches, retrieve all input files on the same date.
Nested Class Summary
Nested classes/interfaces inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
CommandLineProgram.AutoCloseableNoCheckedExceptions
Field Summary
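Fields
Modifier and Type, Field, Description
static final String GENBANK_CATALOG_LONG_NAME
static final String GENBANK_CATALOG_SHORT_NAME
String genbankCatalogPath
    This may be supplied alone or in addition to the RefSeq catalog in the case that sequences from GenBank are present in the reference.
static final String MIN_NON_VIRUS_CONTIG_LENGTH_LONG_NAME
static final String MIN_NON_VIRUS_CONTIG_LENGTH_SHORT_NAME
int minNonVirusContigLength
    Sequences from non-virus organisms less than this length will be filtered out such that any reads aligning to them will be ignored.
String outputPath
protected final ReferenceInputArgumentCollection referenceArguments
static final String REFSEQ_CATALOG_LONG_NAME
static final String REFSEQ_CATALOG_SHORT_NAME
String refseqCatalogPath
static final String TAX_DUMP_LONG_NAME
static final String TAX_DUMP_SHORT_NAME
String taxdumpPath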
Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor Summary
Constructors
PathSeqBuildReferenceTaxonomy()
Method Summary
Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getPluginDescriptors, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, onStartup, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
Field Details
REFSEQ_CATALOG_LONG_NAME
REFSEQ_CATALOG_SHORT_NAME
GENBANK_CATALOG_LONG_NAME
GENBANK_CATALOG_SHORT_NAME
TAX_DUMP_LONG_NAME
TAX_DUMP_SHORT_NAME
MIN_NON_VIRUS_CONTIG_LENGTH_LONG_NAME
MIN_NON_VIRUS_CONTIG_LENGTH_SHORT_NAME
referenceArguments
outputPath
@Argument(doc="Local path for the output file. By convention, the extension should be \".db\"", shortName="O", fullName="output")
public String outputPath
refseqCatalogPath
@Argument(doc="Local path to catalog file (RefSeq-releaseXX.catalog.gz available at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/)", fullName="refseq-catalog", shortName="RC", optional=true)
public String refseqCatalogPath
genbankCatalogPath
@Argument(doc="Local path to Genbank catalog file (gbXXX.catalog.XXX.txt.gz at ftp://ftp.ncbi.nlm.nih.gov/genbank/catalog/)", fullName="genbank-catalog", shortName="GC", optional=true)
public String genbankCatalogPath
This may be supplied alone or in addition to the RefSeq catalog in the case that sequences from GenBank are present in the reference.
taxdumpPath
@Argument(doc="Local path to taxonomy dump tarball (taxdump.tar.gz available at ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)", fullName="tax-dump", shortName="TD")
public String taxdumpPath
minNonVirusContigLength
@Argument(doc="Minimum reference contig length for non-viruses", fullName="min-non-virus-contig-length", shortName="min-non-virus-contig-length", minValue=0.0, minRecommendedValue=500.0, maxRecommendedValue=10000.0)
public int minNonVirusContigLength
Sequences from non-virus organisms shorter than this length will be filtered out, so that any reads aligning to them are ignored. This is a quality control measure to remove shorter sequences from draft genomes that are likely to contain sequencing artifacts such as cross-species contamination or sequencing adapters. Note that this may remove some bacterial plasmid sequences.
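The length filter described for minNonVirusContigLength can be pictured with a small sketch. It is an illustration under stated assumptions, not the tool's code: the Contig type, its isVirus flag, and the filter method are hypothetical, and how the tool actually decides whether a sequence is viral is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the types and names are hypothetical, not GATK API. */
public class ContigLengthFilterSketch {

    /** Minimal stand-in for a reference contig. */
    public static final class Contig {
        public final String accession;
        public final int length;
        public final boolean isVirus; // assumed to be known from the taxonomy

        public Contig(String accession, int length, boolean isVirus) {
            this.accession = accession;
            this.length = length;
            this.isVirus = isVirus;
        }
    }

    /**
     * Keeps every viral contig, and every non-viral contig whose length is at
     * least minNonVirusContigLength. Input order is preserved.
     */
    public static List<Contig> filter(List<Contig> contigs, int minNonVirusContigLength) {
        List<Contig> kept = new ArrayList<>();
        for (Contig contig : contigs) {
            if (contig.isVirus || contig.length >= minNonVirusContigLength) {
                kept.add(contig);
            }
        }
        return kept;
    }
}
```

With a threshold of 2000, an 800-base viral contig is kept while an 800-base bacterial contig is dropped. Viral sequences are exempt because complete viral genomes can be shorter than typical thresholds.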
Constructor Details
PathSeqBuildReferenceTaxonomy
public PathSeqBuildReferenceTaxonomy()
Method Details
doWork
Description copied from class: CommandLineProgram
Do the work after the command line has been parsed. A RuntimeException may be thrown by this method and is reported appropriately.
- Specified by:
doWork in class CommandLineProgram
- Returns:
the return value, or null if there is none.