@DocumentedFeature public class PathSeqPipelineSpark extends GATKSparkTool
PathSeq is a suite of tools for detecting microbial organisms in deeply sequenced biological samples. It is capable of (1) quantifying microbial abundances in metagenomic samples containing a mixture of organisms, (2) detecting extremely low-abundance (<0.001%) organisms, and (3) identifying unknown sequences that may belong to novel organisms. The pipeline is based on a previously published tool of the same name (Kostic et al. 2011), which has been used in a wide range of studies to investigate novel associations between pathogens and human disease.
The pipeline consists of three phases: (1) removing reads that are low-quality, low-complexity, or match a given host (e.g. human) reference, (2) aligning the remaining reads to a microorganism reference, and (3) determining the taxonomic classification of each read and estimating microbe abundances. These steps can be performed individually using PathSeqFilterSpark, PathSeqBwaSpark, and PathSeqScoreSpark. To simplify using the pipeline, this tool combines the three steps into one. Further details can be found in the individual tools' documentation.
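For orientation, the three stages can also be chained manually. The following is an illustrative sketch only; the paired/unpaired input and output argument names shown here are assumptions that should be verified against each tool's documentation:

 # Stage 1: quality, complexity, and host filtering (argument names assumed; verify with --help)
 gatk PathSeqFilterSpark \
     --input input_reads.bam \
     --kmer-file host_kmers.bfi \
     --filter-bwa-image host_reference.img \
     --paired-output filtered_paired.bam \
     --unpaired-output filtered_unpaired.bam

 # Stage 2: alignment of surviving reads to the microbe reference
 gatk PathSeqBwaSpark \
     --paired-input filtered_paired.bam \
     --unpaired-input filtered_unpaired.bam \
     --microbe-bwa-image microbe_reference.img \
     --microbe-fasta reference.fa \
     --paired-output aligned_paired.bam \
     --unpaired-output aligned_unpaired.bam

 # Stage 3: taxonomic classification and abundance scoring
 gatk PathSeqScoreSpark \
     --paired-input aligned_paired.bam \
     --unpaired-input aligned_unpaired.bam \
     --taxonomy-file taxonomy.db \
     --scores-output scores.txt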
The filtering phase ensures that only high-fidelity, non-host reads are classified, thus reducing computational costs and false positives. Note that while generally applicable to any type of biological sample (e.g. saliva, stool), PathSeq is particularly efficient for samples containing a high percentage of host reads (e.g. blood, tissue, CSF). PathSeq is able to detect evidence of low-abundance organisms and scales to comprehensive genomic reference databases (e.g. > 100 Gbp). Lastly, because PathSeq identifies both host and known microbial sequences, it can also be used to discover novel pathogens by reducing the sample to sequences of unknown origin, which may be followed by de novo assembly.
Because sequence alignment is computationally burdensome, PathSeq is integrated with Apache Spark, enabling parallelization of every step in the pipeline on multi-core workstations and cluster environments. This mitigates the high computational cost and permits rapid turnaround times (minutes to hours) for deeply sequenced samples.
Before running the PathSeq pipeline, the host and microbe references must be built. Prebuilt references for a standard microbial set are available in the GATK Resource Bundle.
To build custom references, users must provide FASTA files of the host and pathogen sequences. Tools are included to generate the necessary files: the host k-mer database (PathSeqBuildKmers), BWA-MEM index image files of the host and pathogen references (BwaMemIndexImageCreator), and a taxonomic tree of the pathogen reference (PathSeqBuildReferenceTaxonomy).
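As a sketch of building custom references (hedged: the PathSeqBuildKmers and PathSeqBuildReferenceTaxonomy argument names shown here, such as --refseq-catalog and --tax-dump, are assumptions to verify against each tool's help text):

 # Host k-mer database (a Bloom-filter or hopscotch-set file)
 gatk PathSeqBuildKmers \
     --reference host.fasta \
     --output host_kmers.bfi

 # BWA-MEM index images for the host and microbe references
 gatk BwaMemIndexImageCreator -I host.fasta -O host_reference.img
 gatk BwaMemIndexImageCreator -I microbe.fasta -O microbe_reference.img

 # Taxonomy database mapping microbe reference sequences to the NCBI taxonomy
 gatk PathSeqBuildReferenceTaxonomy \
     --reference microbe.fasta \
     --refseq-catalog RefSeq-release.catalog.gz \
     --tax-dump taxdump.tar.gz \
     --output taxonomy.db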
This tool can be run without explicitly specifying Spark options; without them, the local example below runs on a single machine. See Tutorial#10060 for an example of how to set up and run a Spark tool on a cloud Spark cluster.
Local example:

 gatk PathSeqPipelineSpark \
     --input input_reads.bam \
     --kmer-file host_kmers.bfi \
     --filter-bwa-image host_reference.img \
     --microbe-bwa-image microbe_reference.img \
     --microbe-fasta reference.fa \
     --taxonomy-file taxonomy.db \
     --min-clipped-read-length 60 \
     --min-score-identity 0.90 \
     --identity-margin 0.02 \
     --scores-output scores.txt \
     --output output_reads.bam \
     --filter-metrics filter_metrics.txt \
     --score-metrics score_metrics.txt
Spark cluster example:

 gatk PathSeqPipelineSpark \
     --input gs://my-gcs-bucket/input_reads.bam \
     --kmer-file hdfs://my-cluster-m:8020//host_kmers.bfi \
     --filter-bwa-image /references/host_reference.img \
     --microbe-bwa-image /references/microbe_reference.img \
     --microbe-fasta hdfs://my-cluster-m:8020//reference.fa \
     --taxonomy-file hdfs://my-cluster-m:8020//taxonomy.db \
     --min-clipped-read-length 60 \
     --min-score-identity 0.90 \
     --identity-margin 0.02 \
     --scores-output gs://my-gcs-bucket/scores.txt \
     --output gs://my-gcs-bucket/output_reads.bam \
     --filter-metrics gs://my-gcs-bucket/filter_metrics.txt \
     --score-metrics gs://my-gcs-bucket/score_metrics.txt \
     -- \
     --sparkRunner GCS \
     --cluster my_cluster \
     --driver-memory 8G \
     --executor-memory 32G \
     --num-executors 4 \
     --executor-cores 30 \
     --conf spark.yarn.executor.memoryOverhead=132000
Note that the host and microbe BWA images must be copied to the same paths on every worker node. The microbe FASTA, host k-mer file, and taxonomy file may also be copied to a single path on every worker node or to HDFS.
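For example (the worker host names below are hypothetical), the files might be staged to match the paths used in the cluster example above:

 # Hypothetical staging: copy the BWA index images to the same local path on every worker
 for host in worker-0 worker-1 worker-2; do
     scp host_reference.img microbe_reference.img ${host}:/references/
 done

 # The microbe FASTA, host k-mer file, and taxonomy file can instead be placed on HDFS
 hdfs dfs -put host_kmers.bfi taxonomy.db reference.fa /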
| Modifier and Type | Field and Description |
|---|---|
| PSBwaArgumentCollection | bwaArgs |
| PSFilterArgumentCollection | filterArgs |
| java.lang.String | outputPath |
| static java.lang.String | READS_PER_PARTITION_LONG_NAME |
| static java.lang.String | READS_PER_PARTITION_SHORT_NAME |
| int | readsPerPartition |
| int | readsPerPartitionOutput. Because numReducers is based on the input size, it causes too many partitions to be produced when the output size is much smaller. |
| PSScoreArgumentCollection | scoreArgs |
Fields inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool:
BAM_PARTITION_SIZE_LONG_NAME, bamPartitionSplitSize, features, intervalArgumentCollection, NUM_REDUCERS_LONG_NAME, numReducers, OUTPUT_SHARD_DIR_LONG_NAME, readArguments, referenceArguments, sequenceDictionaryValidationArguments, SHARDED_OUTPUT_LONG_NAME, shardedOutput, shardedPartsDir

Fields inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram:
programName, SPARK_PROGRAM_NAME_LONG_NAME, sparkArgs

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram:
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, QUIET, specialArgumentsCollection, TMP_DIR, useJdkDeflater, useJdkInflater, VERBOSITY
| Constructor and Description |
|---|
| PathSeqPipelineSpark() |
| Modifier and Type | Method and Description |
|---|---|
| boolean | requiresReads(). Does this tool require reads? Tools that do should override to return true. |
| protected void | runTool(org.apache.spark.api.java.JavaSparkContext ctx). Runs the tool itself after initializing and validating inputs. |
Methods inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool:
editIntervals, getBestAvailableSequenceDictionary, getDefaultReadFilters, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getHeaderForReads, getIntervals, getPluginDescriptors, getReads, getReadSourceName, getRecommendedNumReducers, getReference, getReferenceSequenceDictionary, getReferenceWindowFunction, getSequenceDictionaryValidationArgumentCollection, getTargetPartitionSize, getUnfilteredReads, hasIntervals, hasReads, hasReference, makeReadFilter, makeReadFilter, makeVariantAnnotations, requiresIntervals, requiresReference, runPipeline, useVariantAnnotations, validateSequenceDictionaries, writeReads, writeReads

Methods inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram:
afterPipeline, doWork, getProgramName

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram:
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, onStartup, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String READS_PER_PARTITION_LONG_NAME
public static final java.lang.String READS_PER_PARTITION_SHORT_NAME
@ArgumentCollection public PSFilterArgumentCollection filterArgs
@ArgumentCollection public PSBwaArgumentCollection bwaArgs
@ArgumentCollection public PSScoreArgumentCollection scoreArgs
@Argument(doc="Output BAM", fullName="output", shortName="O", optional=true) public java.lang.String outputPath
@Argument(doc="Number of reads per partition to use for alignment and scoring.", fullName="pipeline-reads-per-partition", shortName="pipeline-reads-per-partition", optional=true, minValue=100.0) public int readsPerPartition
@Argument(doc="Number of reads per partition for output. Use this to control the number of sharded BAMs (not --num-reducers).", fullName="readsPerPartitionOutput", optional=true, minValue=100.0, minRecommendedValue=100000.0) public int readsPerPartitionOutput
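As an illustration of the two partition-size arguments above (flag names are taken from the annotations; the values are illustrative only, and the required reference arguments from the earlier examples are omitted for brevity):

 # Sketch: tune alignment/scoring partitions and the number of output shards
 gatk PathSeqPipelineSpark \
     --input input_reads.bam \
     --output output_reads.bam \
     --pipeline-reads-per-partition 5000 \
     --readsPerPartitionOutput 1000000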
public boolean requiresReads()
Description copied from class: GATKSparkTool
Does this tool require reads? Tools that do should override to return true.
Overrides:
requiresReads in class GATKSparkTool
protected void runTool(org.apache.spark.api.java.JavaSparkContext ctx)
Description copied from class: GATKSparkTool
Runs the tool itself after initializing and validating inputs.
Specified by:
runTool in class GATKSparkTool
Parameters:
ctx - our Spark context