@DocumentedFeature public class PathSeqPipelineSpark extends GATKSparkTool
PathSeq is a suite of tools for detecting microbial organisms in deeply sequenced biological samples. It is capable of (1) quantifying microbial abundances in metagenomic samples containing a mixture of organisms, (2) detecting extremely low-abundance (<0.001%) organisms, and (3) identifying unknown sequences that may belong to novel organisms. The pipeline is based on a previously published tool of the same name (Kostic et al. 2011), which has been used in a wide range of studies to investigate novel associations between pathogens and human disease.
The pipeline consists of three phases: (1) removing reads that are low-quality, low-complexity, or match a given host (e.g. human) reference, (2) aligning the remaining reads to a microorganism reference, and (3) determining the taxonomic classification of each read and estimating microbe abundances. These steps can be performed individually using PathSeqFilterSpark, PathSeqBwaSpark, and PathSeqScoreSpark. To simplify using the pipeline, this tool combines the three steps into one. Further details can be found in the individual tools' documentation.
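For orientation, the three stages can also be chained manually. The following is an illustrative sketch only; the paired/unpaired input and output argument names shown here are assumptions that should be verified against each tool's documentation:

 # Stage 1: quality, complexity, and host filtering (argument names assumed; verify with --help)
 gatk PathSeqFilterSpark \
     --input input_reads.bam \
     --kmer-file host_kmers.bfi \
     --filter-bwa-image host_reference.img \
     --paired-output filtered_paired.bam \
     --unpaired-output filtered_unpaired.bam

 # Stage 2: alignment of surviving reads to the microbe reference
 gatk PathSeqBwaSpark \
     --paired-input filtered_paired.bam \
     --unpaired-input filtered_unpaired.bam \
     --microbe-bwa-image microbe_reference.img \
     --microbe-fasta reference.fa \
     --paired-output aligned_paired.bam \
     --unpaired-output aligned_unpaired.bam

 # Stage 3: taxonomic classification and abundance scoring
 gatk PathSeqScoreSpark \
     --paired-input aligned_paired.bam \
     --unpaired-input aligned_unpaired.bam \
     --taxonomy-file taxonomy.db \
     --scores-output scores.txt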
The filtering phase ensures that only high-fidelity, non-host reads are classified, thus reducing computational costs and false positives. Note that while generally applicable to any type of biological sample (e.g. saliva, stool), PathSeq is particularly efficient for samples containing a high percentage of host reads (e.g. blood, tissue, CSF). PathSeq is able to detect evidence of low-abundance organisms and scales to comprehensive genomic reference databases (e.g. > 100 Gbp). Lastly, because PathSeq identifies both host and known microbial sequences, it can also be used to discover novel pathogens by reducing the sample to sequences of unknown origin, which may be followed by de novo assembly.
Because sequence alignment is computationally burdensome, PathSeq is integrated with Apache Spark, enabling parallelization of every step in the pipeline on multi-core workstations and cluster environments. This mitigates the high computational cost and permits rapid turnaround times (minutes to hours) for deeply sequenced samples.
Before running the PathSeq pipeline, the host and microbe references must be built. Prebuilt references for a standard microbial set are available in the GATK Resource Bundle.
To build custom references, users must provide FASTA files of the host and pathogen sequences. Tools are included to generate the necessary files: the host k-mer database (PathSeqBuildKmers), BWA-MEM index image files of the host and pathogen references (BwaMemIndexImageCreator), and a taxonomic tree of the pathogen reference (PathSeqBuildReferenceTaxonomy).
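As a sketch of building custom references (hedged: the PathSeqBuildKmers and PathSeqBuildReferenceTaxonomy argument names shown here, such as --refseq-catalog and --tax-dump, are assumptions to verify against each tool's help text):

 # Host k-mer database (a Bloom-filter or hopscotch-set file)
 gatk PathSeqBuildKmers \
     --reference host.fasta \
     --output host_kmers.bfi

 # BWA-MEM index images for the host and microbe references
 gatk BwaMemIndexImageCreator -I host.fasta -O host_reference.img
 gatk BwaMemIndexImageCreator -I microbe.fasta -O microbe_reference.img

 # Taxonomy database mapping microbe reference sequences to the NCBI taxonomy
 gatk PathSeqBuildReferenceTaxonomy \
     --reference microbe.fasta \
     --refseq-catalog RefSeq-release.catalog.gz \
     --tax-dump taxdump.tar.gz \
     --output taxonomy.db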
This tool can be run without explicitly specifying Spark options; without them, the local example below runs on a single machine. See Tutorial#10060 for an example of how to set up and run a Spark tool on a cloud Spark cluster.
Local example:

 gatk PathSeqPipelineSpark \
     --input input_reads.bam \
     --kmer-file host_kmers.bfi \
     --filter-bwa-image host_reference.img \
     --microbe-bwa-image microbe_reference.img \
     --microbe-fasta reference.fa \
     --taxonomy-file taxonomy.db \
     --min-clipped-read-length 60 \
     --min-score-identity 0.90 \
     --identity-margin 0.02 \
     --scores-output scores.txt \
     --output output_reads.bam \
     --filter-metrics filter_metrics.txt \
     --score-metrics score_metrics.txt
Spark cluster example:

 gatk PathSeqPipelineSpark \
     --input gs://my-gcs-bucket/input_reads.bam \
     --kmer-file hdfs://my-cluster-m:8020//host_kmers.bfi \
     --filter-bwa-image /references/host_reference.img \
     --microbe-bwa-image /references/microbe_reference.img \
     --microbe-fasta hdfs://my-cluster-m:8020//reference.fa \
     --taxonomy-file hdfs://my-cluster-m:8020//taxonomy.db \
     --min-clipped-read-length 60 \
     --min-score-identity 0.90 \
     --identity-margin 0.02 \
     --scores-output gs://my-gcs-bucket/scores.txt \
     --output gs://my-gcs-bucket/output_reads.bam \
     --filter-metrics gs://my-gcs-bucket/filter_metrics.txt \
     --score-metrics gs://my-gcs-bucket/score_metrics.txt \
     -- \
     --sparkRunner GCS \
     --cluster my_cluster \
     --driver-memory 8G \
     --executor-memory 32G \
     --num-executors 4 \
     --executor-cores 30 \
     --conf spark.yarn.executor.memoryOverhead=132000
Note that the host and microbe BWA images must be copied to the same paths on every worker node. The microbe FASTA, host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.
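For example (the worker host names below are hypothetical), the files might be staged to match the paths used in the cluster example above:

 # Hypothetical staging: copy the BWA index images to the same local path on every worker
 for host in worker-0 worker-1 worker-2; do
     scp host_reference.img microbe_reference.img ${host}:/references/
 done

 # The microbe FASTA, host k-mer file, and taxonomy file can instead be placed on HDFS
 hdfs dfs -put host_kmers.bfi taxonomy.db reference.fa /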
| Modifier and Type | Field and Description |
|---|---|
| PSBwaArgumentCollection | bwaArgs |
| PSFilterArgumentCollection | filterArgs |
| java.lang.String | outputPath |
| static java.lang.String | READS_PER_PARTITION_LONG_NAME |
| static java.lang.String | READS_PER_PARTITION_SHORT_NAME |
| int | readsPerPartition |
| int | readsPerPartitionOutput. Because numReducers is based on the input size, it causes too many partitions to be produced when the output size is much smaller. |
| PSScoreArgumentCollection | scoreArgs |
Fields inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool:
BAM_PARTITION_SIZE_LONG_NAME, bamPartitionSplitSize, features, intervalArgumentCollection, NUM_REDUCERS_LONG_NAME, numReducers, OUTPUT_SHARD_DIR_LONG_NAME, readArguments, referenceArguments, sequenceDictionaryValidationArguments, SHARDED_OUTPUT_LONG_NAME, shardedOutput, shardedPartsDir

Fields inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram:
programName, SPARK_PROGRAM_NAME_LONG_NAME, sparkArgs

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram:
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, QUIET, specialArgumentsCollection, TMP_DIR, useJdkDeflater, useJdkInflater, VERBOSITY
| Constructor and Description |
|---|
| PathSeqPipelineSpark() |
| Modifier and Type | Method and Description |
|---|---|
| boolean | requiresReads(). Does this tool require reads? Tools that do should override to return true. |
| protected void | runTool(org.apache.spark.api.java.JavaSparkContext ctx). Runs the tool itself after initializing and validating inputs. |
Methods inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool:
editIntervals, getBestAvailableSequenceDictionary, getDefaultReadFilters, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getHeaderForReads, getIntervals, getPluginDescriptors, getReads, getReadSourceName, getRecommendedNumReducers, getReference, getReferenceSequenceDictionary, getReferenceWindowFunction, getSequenceDictionaryValidationArgumentCollection, getTargetPartitionSize, getUnfilteredReads, hasIntervals, hasReads, hasReference, makeReadFilter, makeReadFilter, makeVariantAnnotations, requiresIntervals, requiresReference, runPipeline, useVariantAnnotations, validateSequenceDictionaries, writeReads, writeReads

Methods inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram:
afterPipeline, doWork, getProgramName

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram:
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, onStartup, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String READS_PER_PARTITION_LONG_NAME
public static final java.lang.String READS_PER_PARTITION_SHORT_NAME
@ArgumentCollection public PSFilterArgumentCollection filterArgs
@ArgumentCollection public PSBwaArgumentCollection bwaArgs
@ArgumentCollection public PSScoreArgumentCollection scoreArgs
@Argument(doc="Output BAM", fullName="output", shortName="O", optional=true) public java.lang.String outputPath
@Argument(doc="Number of reads per partition to use for alignment and scoring.", fullName="pipeline-reads-per-partition", shortName="pipeline-reads-per-partition", optional=true, minValue=100.0) public int readsPerPartition
@Argument(doc="Number of reads per partition for output. Use this to control the number of sharded BAMs (not --num-reducers).", fullName="readsPerPartitionOutput", optional=true, minValue=100.0, minRecommendedValue=100000.0) public int readsPerPartitionOutput
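As an illustration of the two partition-size arguments above (flag names are taken from the annotations; the values are illustrative only, and the required reference arguments from the earlier examples are omitted for brevity):

 # Sketch: tune alignment/scoring partitions and the number of output shards
 gatk PathSeqPipelineSpark \
     --input input_reads.bam \
     --output output_reads.bam \
     --pipeline-reads-per-partition 5000 \
     --readsPerPartitionOutput 1000000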
public boolean requiresReads()
Description copied from class: GATKSparkTool
Does this tool require reads? Tools that do should override to return true.
Overrides:
requiresReads in class GATKSparkTool
protected void runTool(org.apache.spark.api.java.JavaSparkContext ctx)
Description copied from class: GATKSparkTool
Runs the tool itself after initializing and validating inputs.
Specified by:
runTool in class GATKSparkTool
Parameters:
ctx - our Spark context