public abstract class GATKSparkTool extends SparkCommandLineProgram
- Tools must implement runTool(org.apache.spark.api.java.JavaSparkContext) (see the sketch after this list).
- Tools should override requiresReference(), requiresReads(), and/or requiresIntervals() as appropriate to indicate required inputs.
- Tools can query whether certain inputs are present via hasReference(), hasReads(), and hasUserSuppliedIntervals().
- Tools can load the reads via getReads(), access the reference via getReference(), and access the intervals via getIntervals(). Any intervals specified are automatically applied to the reads. Input metadata is available via getHeaderForReads(), getReferenceSequenceDictionary(), and getBestAvailableSequenceDictionary().
- Tools that require a custom reference window function (extra bases of reference context around each read) may override getReferenceWindowFunction() to supply one. This function will be propagated to the reference source returned by getReference().
Modifier and Type | Class and Description
---|---
static class | GATKSparkTool.ReadInputMergingPolicy
Modifier and Type | Field and Description
---|---
boolean | addOutputVCFCommandLine
static java.lang.String | BAM_PARTITION_SIZE_LONG_NAME
protected long | bamPartitionSplitSize
static java.lang.String | CREATE_OUTPUT_BAM_SPLITTING_INDEX_LONG_NAME
boolean | createOutputBamIndex
boolean | createOutputBamSplittingIndex
boolean | createOutputVariantIndex
protected FeatureManager | features
protected IntervalArgumentCollection | intervalArgumentCollection
static java.lang.String | NUM_REDUCERS_LONG_NAME
protected int | numReducers
static java.lang.String | OUTPUT_SHARD_DIR_LONG_NAME
ReadInputArgumentCollection | readArguments
ReferenceInputArgumentCollection | referenceArguments
protected SequenceDictionaryValidationArgumentCollection | sequenceDictionaryValidationArguments
static java.lang.String | SHARDED_OUTPUT_LONG_NAME
protected boolean | shardedOutput
protected java.lang.String | shardedPartsDir
static java.lang.String | USE_NIO
protected boolean | useNio
Fields inherited from class SparkCommandLineProgram: programName, SPARK_PROGRAM_NAME_LONG_NAME, sparkArgs

Fields inherited from class CommandLineProgram: GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
GATKSparkTool() |
Modifier and Type | Method and Description
---|---
protected static java.lang.String | addReferenceFilesForSpark(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String referenceFile): Register the reference file (and associated dictionary and index) to be downloaded to every node using Spark's copying mechanism (SparkContext#addFile()).
protected static java.util.List<java.lang.String> | addVCFsForSpark(org.apache.spark.api.java.JavaSparkContext ctx, java.util.List<java.lang.String> vcfFileNames): Register the VCF file (and associated index) to be downloaded to every node using Spark's copying mechanism (SparkContext#addFile()).
protected java.util.List<SimpleInterval> | editIntervals(java.util.List<SimpleInterval> rawIntervals): Transform the intervals during loading.
htsjdk.samtools.SAMSequenceDictionary | getBestAvailableSequenceDictionary(): Returns the "best available" sequence dictionary.
java.util.List<ReadFilter> | getDefaultReadFilters(): Returns the default list of ReadFilters that are used for this tool.
protected java.util.Set<htsjdk.variant.vcf.VCFHeaderLine> | getDefaultToolVCFHeaderLines()
java.util.List<java.lang.Class<? extends Annotation>> | getDefaultVariantAnnotationGroups()
java.util.List<Annotation> | getDefaultVariantAnnotations()
protected org.apache.spark.api.java.JavaRDD<GATKRead> | getGatkReadJavaRDD(TraversalParameters traversalParameters, ReadsSparkSource source, java.lang.String input)
htsjdk.samtools.SAMFileHeader | getHeaderForReads()
java.util.List<SimpleInterval> | getIntervals()
java.util.List<? extends org.broadinstitute.barclay.argparser.CommandLinePluginDescriptor<?>> | getPluginDescriptors(): Return the list of GATKCommandLinePluginDescriptor objects to be used for this CLP.
GATKSparkTool.ReadInputMergingPolicy | getReadInputMergingPolicy(): Does this tool support multiple inputs? Tools that do should override this method with the desired GATKSparkTool.ReadInputMergingPolicy.
org.apache.spark.api.java.JavaRDD<GATKRead> | getReads(): Loads the reads into a JavaRDD using the intervals specified, and filters them using the filter returned by makeReadFilter().
protected java.util.LinkedHashMap<java.lang.String,htsjdk.samtools.SAMFileHeader> | getReadSourceHeaderMap(): Returns a map of read input to header.
protected java.util.List<java.lang.String> | getReadSourceName(): Returns the name of the source of reads data.
int | getRecommendedNumReducers(): Return the recommended number of reducers for a pipeline processing the reads.
ReferenceMultiSparkSource | getReference()
htsjdk.samtools.SAMSequenceDictionary | getReferenceSequenceDictionary()
SerializableFunction<GATKRead,SimpleInterval> | getReferenceWindowFunction(): Window function that controls how much reference context to return for each read when using the reference source returned by getReference().
protected SequenceDictionaryValidationArgumentCollection | getSequenceDictionaryValidationArgumentCollection(): Subclasses can override this to provide different default behavior for sequence dictionary validation.
int | getTargetPartitionSize(): Returns the size of each input partition (in bytes) that is used to determine the recommended number of reducers for running a processing pipeline.
org.apache.spark.api.java.JavaRDD<GATKRead> | getUnfilteredReads(): Loads the reads into a JavaRDD using the intervals specified, and returns them without applying any filtering.
boolean | hasReads(): Are sources of reads available?
boolean | hasReference(): Is a source of reference data available?
boolean | hasUserSuppliedIntervals(): Are sources of intervals available?
ReadFilter | makeReadFilter(): Returns a read filter (simple or composite) that can be applied to the reads returned from getReads().
protected ReadFilter | makeReadFilter(htsjdk.samtools.SAMFileHeader samFileHeader): Like makeReadFilter() but with the ability to pass a different SAMFileHeader.
java.util.Collection<Annotation> | makeVariantAnnotations()
boolean | requiresIntervals(): Does this tool require intervals? Tools that do should override to return true.
boolean | requiresReads(): Does this tool require reads? Tools that do should override to return true.
boolean | requiresReference(): Does this tool require reference data? Tools that do should override to return true.
protected void | runPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext): Runs the pipeline.
protected abstract void | runTool(org.apache.spark.api.java.JavaSparkContext ctx): Runs the tool itself after initializing and validating inputs.
boolean | useVariantAnnotations()
protected void | validateSequenceDictionaries(): Validates standard tool inputs against each other.
void | writeReads(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String outputFile, org.apache.spark.api.java.JavaRDD<GATKRead> reads): Writes the reads from a JavaRDD to an output file.
void | writeReads(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String outputFile, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, boolean sortReadsToHeader): Writes the reads from a JavaRDD to an output file.
Methods inherited from class SparkCommandLineProgram: afterPipeline, doWork, getProgramName

Methods inherited from class CommandLineProgram: customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, onStartup, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String BAM_PARTITION_SIZE_LONG_NAME
public static final java.lang.String NUM_REDUCERS_LONG_NAME
public static final java.lang.String SHARDED_OUTPUT_LONG_NAME
public static final java.lang.String OUTPUT_SHARD_DIR_LONG_NAME
public static final java.lang.String CREATE_OUTPUT_BAM_SPLITTING_INDEX_LONG_NAME
public static final java.lang.String USE_NIO
@ArgumentCollection public final ReferenceInputArgumentCollection referenceArguments
@ArgumentCollection public final ReadInputArgumentCollection readArguments
@ArgumentCollection protected IntervalArgumentCollection intervalArgumentCollection
@Argument(doc="maximum number of bytes to read from a file into each partition of reads. Setting this higher will result in fewer partitions. Note that this will not be equal to the size of the partition in memory. Defaults to 0, which uses the default split size (determined by the Hadoop input format, typically the size of one HDFS block).", fullName="bam-partition-size", optional=true) protected long bamPartitionSplitSize
@Argument(doc="Whether to use NIO or the Hadoop filesystem (default) for reading files. (Note that the Hadoop filesystem is always used for writing files.)", fullName="use-nio", optional=true) protected boolean useNio
@ArgumentCollection protected SequenceDictionaryValidationArgumentCollection sequenceDictionaryValidationArguments
@Argument(fullName="add-output-vcf-command-line", shortName="add-output-vcf-command-line", doc="If true, adds a command line header line to created VCF files.", optional=true, common=true) public boolean addOutputVCFCommandLine
@Argument(doc="For tools that write an output, write the output in multiple pieces (shards)", fullName="sharded-output", optional=true, mutex="output-shard-tmp-dir") protected boolean shardedOutput
@Argument(doc="when writing a bam, in single sharded mode this directory to write the temporary intermediate output shards, if not specified .parts/ will be used", fullName="output-shard-tmp-dir", optional=true, mutex="sharded-output") protected java.lang.String shardedPartsDir
@Argument(doc="For tools that shuffle data or write an output, sets the number of reducers. Defaults to 0, which gives one partition per 10MB of input.", fullName="num-reducers", optional=true) protected int numReducers
@Argument(fullName="create-output-bam-index", shortName="OBI", doc="If true, create a BAM index when writing a coordinate-sorted BAM file.", optional=true, common=true) public boolean createOutputBamIndex
@Argument(fullName="create-output-bam-splitting-index", doc="If true, create a BAM splitting index (SBI) when writing a coordinate-sorted BAM file.", optional=true, common=true) public boolean createOutputBamSplittingIndex
@Argument(fullName="create-output-variant-index", shortName="OVI", doc="If true, create a VCF index when writing a coordinate-sorted VCF file.", optional=true, common=true) public boolean createOutputVariantIndex
protected FeatureManager features
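Subclasses declare their own tool-specific options the same way, using Barclay @Argument annotations on fields. A minimal hypothetical declaration (the option name, short name, and doc string are illustrative, not fields of GATKSparkTool):

```java
// Hypothetical tool-specific argument declared in a subclass; not part of GATKSparkTool itself.
@Argument(fullName = "output", shortName = "O", doc = "Path to write the output BAM/CRAM.")
private String output;
```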
public java.util.List<? extends org.broadinstitute.barclay.argparser.CommandLinePluginDescriptor<?>> getPluginDescriptors()
Return the list of GATKCommandLinePluginDescriptor objects to be used for this CLP.
Specified by: getPluginDescriptors in interface org.broadinstitute.barclay.argparser.CommandLinePluginProvider
Overrides: getPluginDescriptors in class CommandLineProgram
public boolean requiresReference()
public boolean requiresReads()
public GATKSparkTool.ReadInputMergingPolicy getReadInputMergingPolicy()
Does this tool support multiple inputs? Tools that do should override this method with the desired GATKSparkTool.ReadInputMergingPolicy.

public boolean requiresIntervals()
public final boolean hasReference()
public final boolean hasReads()
public final boolean hasUserSuppliedIntervals()
public SerializableFunction<GATKRead,SimpleInterval> getReferenceWindowFunction()
Window function that controls how much reference context to return for each read when using the reference source returned by getReference(). Tools should override as appropriate. The default function is the identity function (i.e., return exactly the reference bases that span each read). A hypothetical override is sketched below.

protected SequenceDictionaryValidationArgumentCollection getSequenceDictionaryValidationArgumentCollection()
Subclasses can override this to provide different default behavior for sequence dictionary validation.
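As a sketch of the getReferenceWindowFunction() override mentioned above, the following hypothetical implementation pads the reference window by 10 bases on each side of every read. The padding value is illustrative, and trimming windows that run past the end of a contig is not addressed here.

```java
// Hypothetical override; requires org.broadinstitute.hellbender.utils.SerializableFunction,
// org.broadinstitute.hellbender.utils.SimpleInterval, and GATKRead.
@Override
public SerializableFunction<GATKRead, SimpleInterval> getReferenceWindowFunction() {
    return read -> new SimpleInterval(
            read.getContig(),
            Math.max(1, read.getStart() - 10),  // clamp the lower bound at position 1
            read.getEnd() + 10);
}
```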
public htsjdk.samtools.SAMSequenceDictionary getBestAvailableSequenceDictionary()
public htsjdk.samtools.SAMSequenceDictionary getReferenceSequenceDictionary()
public htsjdk.samtools.SAMFileHeader getHeaderForReads()
public org.apache.spark.api.java.JavaRDD<GATKRead> getReads()
Loads the reads into a JavaRDD using the intervals specified, and filters them using the filter returned by makeReadFilter(). If no intervals were specified, returns all the reads (both mapped and unmapped).
Returns: a JavaRDD of reads, bounded by intervals if specified, and filtered using the filter from makeReadFilter().

public org.apache.spark.api.java.JavaRDD<GATKRead> getUnfilteredReads()
Loads the reads into a JavaRDD using the intervals specified, and returns them without applying any filtering. If no intervals were specified, returns all the reads (both mapped and unmapped).
Returns: a JavaRDD of reads, bounded by intervals if specified, and unfiltered.

protected org.apache.spark.api.java.JavaRDD<GATKRead> getGatkReadJavaRDD(TraversalParameters traversalParameters, ReadsSparkSource source, java.lang.String input)
public void writeReads(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String outputFile, org.apache.spark.api.java.JavaRDD<GATKRead> reads)
Writes the reads from a JavaRDD to an output file.
Parameters:
ctx - the JavaSparkContext to write.
outputFile - path to the output bam/cram.
reads - reads to write.

public void writeReads(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String outputFile, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, boolean sortReadsToHeader)
Writes the reads from a JavaRDD to an output file.
Parameters:
ctx - the JavaSparkContext to write.
outputFile - path to the output bam/cram.
reads - reads to write.
header - the header to write.
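A hypothetical runTool() body that pairs getReads() with the five-argument writeReads() overload; the output field stands in for a tool-defined @Argument path (see the sketched argument declaration in the field detail section above).

```java
// Hypothetical usage; "output" is an illustrative @Argument-backed path, not part of GATKSparkTool.
@Override
protected void runTool(final JavaSparkContext ctx) {
    // getReads() returns interval-bounded, filtered reads; write them back out with the input header.
    final JavaRDD<GATKRead> reads = getReads();
    writeReads(ctx, output, reads, getHeaderForReads(), true);
}
```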
public int getRecommendedNumReducers()
Return the recommended number of reducers for a pipeline processing the reads, based on getTargetPartitionSize(). Subclasses that want to control the recommended number of reducers should typically override getTargetPartitionSize() rather than this method.

public int getTargetPartitionSize()
Returns the size of each input partition (in bytes) that is used to determine the recommended number of reducers for running a processing pipeline.
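Assuming a tool wants larger partitions (and therefore fewer recommended reducers), it could override getTargetPartitionSize() as in this hypothetical sketch:

```java
// Hypothetical override: target roughly 100 MB of input per partition instead of the default.
@Override
public int getTargetPartitionSize() {
    return 100 * 1024 * 1024;
}
```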
public ReadFilter makeReadFilter()
Returns a read filter (simple or composite) that can be applied to the reads returned from getReads().
This implementation combines the default read filters for this tool (returned by getDefaultReadFilters()) along with any read filter command line directives specified by the user (such as enabling other filters or disabling default filters), and returns a single composite filter resulting from the list by and'ing them together.
NOTE: Most tools will not need to override this method, and should only do so in order to provide custom behavior or processing of the final merged read filter. To change the default read filters used by the tool, override getDefaultReadFilters() instead.
Multiple filters can be composed by using ReadFilter composition methods.

protected ReadFilter makeReadFilter(htsjdk.samtools.SAMFileHeader samFileHeader)
Like makeReadFilter() but with the ability to pass a different SAMFileHeader.

public java.util.List<ReadFilter> getDefaultReadFilters()
Returns the default list of ReadFilters that are used for this tool. The default implementation uses the WellformedReadFilter filter with all default options. Subclasses can override to provide alternative filters.
Note: this method is called before command line parsing begins, and thus before a SAMFileHeader is available through getHeaderForReads(). The actual SAMFileHeader is propagated to the read filters by makeReadFilter() after the filters have been merged with command line arguments.
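A hypothetical override of getDefaultReadFilters() that keeps the standard defaults and additionally requires mapped reads might look like this (ReadFilterLibrary.MAPPED is a standard GATK filter; the combination shown is illustrative):

```java
// Hypothetical override; requires java.util.ArrayList, java.util.List,
// org.broadinstitute.hellbender.engine.filters.ReadFilter and ReadFilterLibrary.
@Override
public List<ReadFilter> getDefaultReadFilters() {
    final List<ReadFilter> filters = new ArrayList<>(super.getDefaultReadFilters());
    filters.add(ReadFilterLibrary.MAPPED);  // additionally drop unmapped reads
    return filters;
}
```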
public boolean useVariantAnnotations()
See Also: GATKTool.useVariantAnnotations()
public java.util.List<Annotation> getDefaultVariantAnnotations()
See Also: GATKTool.getDefaultVariantAnnotations()
public java.util.List<java.lang.Class<? extends Annotation>> getDefaultVariantAnnotationGroups()
protected java.util.Set<htsjdk.variant.vcf.VCFHeaderLine> getDefaultToolVCFHeaderLines()
public java.util.Collection<Annotation> makeVariantAnnotations()
See Also: GATKTool.makeVariantAnnotations()
protected java.util.List<java.lang.String> getReadSourceName()
protected java.util.LinkedHashMap<java.lang.String,htsjdk.samtools.SAMFileHeader> getReadSourceHeaderMap()
public ReferenceMultiSparkSource getReference()
public java.util.List<SimpleInterval> getIntervals()
protected void runPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext)
Runs the pipeline.
Specified by: runPipeline in class SparkCommandLineProgram
protected java.util.List<SimpleInterval> editIntervals(java.util.List<SimpleInterval> rawIntervals)
Transform the intervals during loading.
Parameters:
rawIntervals - Intervals specified on command line by user (-L). Can be null.
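A hypothetical editIntervals() override that pads each user-supplied interval by 100 bases on either side (the padding value is illustrative; trimming past the end of a contig is not handled here):

```java
// Hypothetical override; requires java.util.List, java.util.stream.Collectors and SimpleInterval.
@Override
protected List<SimpleInterval> editIntervals(final List<SimpleInterval> rawIntervals) {
    if (rawIntervals == null) {
        return null;  // no -L intervals were supplied
    }
    return rawIntervals.stream()
            .map(i -> new SimpleInterval(i.getContig(), Math.max(1, i.getStart() - 100), i.getEnd() + 100))
            .collect(Collectors.toList());
}
```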
protected void validateSequenceDictionaries()
protected static java.lang.String addReferenceFilesForSpark(org.apache.spark.api.java.JavaSparkContext ctx, java.lang.String referenceFile)
Register the reference file (and associated dictionary and index) to be downloaded to every node using Spark's copying mechanism (SparkContext#addFile()).
Parameters:
ctx - the Spark context
referenceFile - the reference file, can be a local file or a remote path
Returns: the name of the reference file, which can be used to locate the downloaded copy on each node via SparkFiles#get()
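A hypothetical runTool() fragment showing how a tool might register its reference for executor-side access; the literal path is illustrative (tools normally take it from referenceArguments).

```java
// Hypothetical usage; the reference path shown is made up for illustration.
@Override
protected void runTool(final JavaSparkContext ctx) {
    final String referenceName = addReferenceFilesForSpark(ctx, "/data/ref/example.fasta");
    // Executor code can locate the downloaded copy via org.apache.spark.SparkFiles.get(referenceName).
}
```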
protected static java.util.List<java.lang.String> addVCFsForSpark(org.apache.spark.api.java.JavaSparkContext ctx, java.util.List<java.lang.String> vcfFileNames)
Register the VCF file (and associated index) to be downloaded to every node using Spark's copying mechanism (SparkContext#addFile()).
Parameters:
ctx - the Spark context
vcfFileNames - the VCF files, can be local files or remote paths
Returns: the names of the VCF files, which can be used to locate the downloaded copies on each node via SparkFiles#get()
protected abstract void runTool(org.apache.spark.api.java.JavaSparkContext ctx)
Runs the tool itself after initializing and validating inputs.
Parameters:
ctx - our Spark context