@DocumentedFeature @BetaFeature public class BaseRecalibratorSpark extends GATKSparkTool
This walker generates recalibration tables based on user-specified covariates. It performs a by-locus traversal, operating only at sites that are not in the known-sites resources. ExAC, gnomAD, or dbSNP resources can be used as known sites of variation. All reference mismatches seen outside of known variant sites are assumed to be sequencing errors, indicative of poor base quality. Because a large amount of data is available, one can then calculate an empirical probability of error given the particular covariates seen at each site, where p(error) = num mismatches / num observations. The output file is a table of the covariate values, number of observations, number of mismatches, and empirical quality score.
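As a rough illustration of that calculation (a minimal sketch, not GATK's actual recalibration engine; the counts below are hypothetical), the empirical quality score is simply the Phred-scaled error probability for one covariate bin:

```java
// Tallies for one combination of covariate values (hypothetical numbers).
long numMismatches = 120;
long numObservations = 1_000_000;

// p(error) = num mismatches / num observations, as described above.
double pError = (double) numMismatches / numObservations;

// Phred scale: Q = -10 * log10(p). Here pError = 1.2e-4, so Q ≈ 39.
long empiricalQuality = Math.round(-10.0 * Math.log10(pError));
```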
The output is a GATK Report file with many tables.

Usage example (Spark cluster on Google Cloud Dataproc):

```
gatk BaseRecalibratorSpark \
    -I gs://my-gcs-bucket/my_reads.bam \
    -R gs://my-gcs-bucket/reference.fasta \
    --known-sites gs://my-gcs-bucket/sites_of_variation.vcf \
    --known-sites gs://my-gcs-bucket/another/optional/setOfSitesToMask.vcf \
    -O gs://my-gcs-bucket/recal_data.table \
    -- \
    --sparkRunner GCS \
    --cluster my-dataproc-cluster
```
Modifier and Type | Field and Description |
---|---|
`int` | `readShardPadding` |
`int` | `readShardSize` |
Fields inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool: BAM_PARTITION_SIZE_LONG_NAME, bamPartitionSplitSize, features, intervalArgumentCollection, NUM_REDUCERS_LONG_NAME, numReducers, OUTPUT_SHARD_DIR_LONG_NAME, readArguments, referenceArguments, sequenceDictionaryValidationArguments, SHARDED_OUTPUT_LONG_NAME, shardedOutput, shardedPartsDir

Fields inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram: programName, SPARK_PROGRAM_NAME_LONG_NAME, sparkArgs

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram: GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, QUIET, specialArgumentsCollection, TMP_DIR, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
`BaseRecalibratorSpark()` |
Modifier and Type | Method and Description |
---|---|
`java.util.List<ReadFilter>` | `getDefaultReadFilters()` Returns the default list of ReadFilters that are used for this tool. |
`SerializableFunction<GATKRead,SimpleInterval>` | `getReferenceWindowFunction()` Window function that controls how much reference context to return for each read when using the reference source returned by `GATKSparkTool.getReference()`. |
`boolean` | `requiresReads()` Does this tool require reads? Tools that do should override to return true. |
`boolean` | `requiresReference()` Does this tool require reference data? Tools that do should override to return true. |
`protected void` | `runTool(org.apache.spark.api.java.JavaSparkContext ctx)` Runs the tool itself after initializing and validating inputs. |
Methods inherited from class org.broadinstitute.hellbender.engine.spark.GATKSparkTool: editIntervals, getBestAvailableSequenceDictionary, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getHeaderForReads, getIntervals, getPluginDescriptors, getReads, getReadSourceName, getRecommendedNumReducers, getReference, getReferenceSequenceDictionary, getSequenceDictionaryValidationArgumentCollection, getTargetPartitionSize, getUnfilteredReads, hasIntervals, hasReads, hasReference, makeReadFilter, makeReadFilter, makeVariantAnnotations, requiresIntervals, runPipeline, useVariantAnnotations, validateSequenceDictionaries, writeReads, writeReads

Methods inherited from class org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram: afterPipeline, doWork, getProgramName

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram: customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, onStartup, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
@Argument(fullName="read-shard-size", doc="Maximum size of each read shard, in bases. Only applies when using the OVERLAPS_PARTITIONER join strategy.", optional=true) public int readShardSize
@Argument(fullName="read-shard-padding", doc="Each read shard has this many bases of extra context on each side. Only applies when using the OVERLAPS_PARTITIONER join strategy.", optional=true) public int readShardPadding
public boolean requiresReads()

Description copied from class: GATKSparkTool
Does this tool require reads? Tools that do should override to return true.

Overrides: requiresReads in class GATKSparkTool
public boolean requiresReference()

Description copied from class: GATKSparkTool
Does this tool require reference data? Tools that do should override to return true.

Overrides: requiresReference in class GATKSparkTool
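Taken together, these two overrides are how a Spark tool declares its required inputs to the engine; a minimal sketch of the assumed pattern:

```java
// BaseRecalibratorSpark needs both reads and a reference, so it returns true
// from both hooks; the engine then validates those inputs before runTool runs.
@Override
public boolean requiresReads() {
    return true;
}

@Override
public boolean requiresReference() {
    return true;
}
```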
public SerializableFunction<GATKRead,SimpleInterval> getReferenceWindowFunction()

Description copied from class: GATKSparkTool
Window function that controls how much reference context to return for each read when using the reference source returned by GATKSparkTool.getReference(). Tools should override as appropriate. The default function is the identity function (i.e., return exactly the reference bases that span each read).

Overrides: getReferenceWindowFunction in class GATKSparkTool
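For example, a tool that needs extra reference context around each read could override this with a padded window. A minimal sketch (the pad size is a hypothetical value, and a real implementation would also clamp the end coordinate against the sequence dictionary):

```java
private static final int WINDOW_PAD = 10; // hypothetical padding, in bases

@Override
public SerializableFunction<GATKRead, SimpleInterval> getReferenceWindowFunction() {
    // Expand each read's span by WINDOW_PAD bases on both sides, clamping the
    // start at position 1; the end should also be clamped to the contig length.
    return read -> new SimpleInterval(
            read.getContig(),
            Math.max(1, read.getStart() - WINDOW_PAD),
            read.getEnd() + WINDOW_PAD);
}
```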
public java.util.List<ReadFilter> getDefaultReadFilters()

Description copied from class: GATKSparkTool
Returns the default list of ReadFilters that are used for this tool. The default implementation uses the WellformedReadFilter filter with all default options. Subclasses can override to provide alternative filters.

Note: this method is called before command line parsing begins, and thus before a SAMFileHeader is available through GATKSparkTool.getHeaderForReads(). The actual SAMFileHeader is propagated to the read filters by GATKSparkTool.makeReadFilter() after the filters have been merged with command line arguments.

Overrides: getDefaultReadFilters in class GATKSparkTool
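A subclass that wants one more filter on top of the defaults could extend the inherited list; a minimal sketch (the MappingQualityReadFilter threshold of 20 is an arbitrary example, assuming that filter's single-int constructor):

```java
@Override
public java.util.List<ReadFilter> getDefaultReadFilters() {
    // Start from the superclass defaults, then append an extra filter. The
    // SAMFileHeader is attached later by makeReadFilter(), per the note above.
    final java.util.List<ReadFilter> filters =
            new java.util.ArrayList<>(super.getDefaultReadFilters());
    filters.add(new MappingQualityReadFilter(20)); // example threshold
    return filters;
}
```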
protected void runTool(org.apache.spark.api.java.JavaSparkContext ctx)

Description copied from class: GATKSparkTool
Runs the tool itself after initializing and validating inputs.

Specified by: runTool in class GATKSparkTool

Parameters: ctx - our Spark context
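To see where runTool fits in the tool lifecycle, a minimal sketch of a hypothetical GATKSparkTool subclass (not BaseRecalibratorSpark's actual logic): the engine hands runTool a live JavaSparkContext only after arguments are parsed and inputs validated, so the body can go straight to the distributed work.

```java
public final class ReadCountSketch extends GATKSparkTool {
    private static final long serialVersionUID = 1L;

    @Override
    public boolean requiresReads() {
        return true;
    }

    @Override
    protected void runTool(final org.apache.spark.api.java.JavaSparkContext ctx) {
        // getReads() (inherited from GATKSparkTool) yields the filtered reads
        // as a Spark RDD; count() triggers the distributed computation.
        final long totalReads = getReads().count();
        logger.info("Processed " + totalReads + " reads.");
    }
}
```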