@DocumentedFeature public class VariantRecalibrator extends MultiVariantWalker
This tool performs the first pass in a two-stage process called Variant Quality Score Recalibration (VQSR). Specifically, it builds the model that will be used in the second step to actually filter variants. This model attempts to describe the relationship between variant annotations (such as QD, MQ and ReadPosRankSum, for example) and the probability that a variant is a true genetic variant versus a sequencing or data processing artifact. It is developed adaptively based on "true sites" provided as input, typically HapMap sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array (in humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The result is a score called the VQSLOD that gets added to the INFO field of each variant. This score is the log odds of being a true variant versus being false under the trained Gaussian mixture model.
The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. These probabilities can then be used to filter the variants with a greater level of accuracy and flexibility than can typically be achieved by traditional hard-filter (filtering on individual annotation value thresholds). The first pass consists of building a model that describes how variant annotation values co-vary with the truthfulness of variant calls in a training set, and then scoring all input variants according to the model. The second pass simply consists of specifying a target sensitivity value (which corresponds to an empirical VQSLOD cutoff) and applying filters to each variant call according to their ranking. The result is a VCF file in which variants have been assigned a score and filter status.
VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.
gatk VariantRecalibrator \ -R Homo_sapiens_assembly38.fasta \ -V input.vcf.gz \ --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \ --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \ --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \ --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode SNP \ -O output.recal \ --tranches-file output.tranches \ --rscript-file output.plots.R
gatk VariantRecalibrator \ -R Homo_sapiens_assembly38.fasta \ -V input.vcf.gz \ -AS \ --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \ --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \ --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \ --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode SNP \ -O output.AS.recal \ --tranches-file output.AS.tranches \ --rscript-file output.plots.AS.R
Note that to use the allele-specific (AS) mode, the input VCF must have been produced using allele-specific annotations in HaplotypeCaller. Note also that each allele will have a separate line in the output recalibration file with its own VQSLOD and `culprit`, which will be transferred to the final VCF by the ApplyRecalibration tool.
Modifier and Type | Field and Description |
---|---|
protected int |
max_attempts
The statistical model being built by this tool may fail due to simple statistical sampling
issues.
|
multiVariantInputArgumentCollection
FEATURE_CACHE_LOOKAHEAD
addOutputSAMProgramRecord, addOutputVCFCommandLine, cloudIndexPrefetchBuffer, cloudPrefetchBuffer, createOutputBamIndex, createOutputBamMD5, createOutputVariantIndex, createOutputVariantMD5, disableBamIndexCaching, intervalArgumentCollection, lenientVCFProcessing, outputSitesOnlyVCFs, progressMeter, readArguments, referenceArguments, SECONDS_BETWEEN_PROGRESS_UPDATES_NAME, seqValidationArguments
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, QUIET, specialArgumentsCollection, TMP_DIR, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
VariantRecalibrator() |
Modifier and Type | Method and Description |
---|---|
void |
apply(htsjdk.variant.variantcontext.VariantContext vc,
ReadsContext readsContext,
ReferenceContext ref,
FeatureContext featureContext)
Process an individual variant.
|
void |
closeTool()
This method is called by the GATK framework at the end of the
GATKTool.doWork() template method. |
protected org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel |
GMMFromTables(GATKReportTable muTable,
GATKReportTable sigmaTable,
GATKReportTable pmixTable,
int numAnnotations,
int numVariants)
Rebuild a Gaussian Mixture Model from gaussian means and co-variates stored in a GATKReportTables
|
protected GATKReportTable |
makeVectorTable(java.lang.String tableName,
java.lang.String tableDescription,
java.util.List<java.lang.String> annotationList,
double[] perAnnotationValues,
java.lang.String columnName,
java.lang.String formatString) |
void |
onTraversalStart()
Operations performed just prior to the start of traversal.
|
java.lang.Object |
onTraversalSuccess()
Operations performed immediately after a successful traversal (ie when no uncaught exceptions were thrown during the traversal).
|
protected GATKReport |
writeModelReport(org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel goodModel,
org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel badModel,
java.util.List<java.lang.String> annotationList) |
getDrivingVariantsFeatureInputs, getHeaderForVariants, getMultiVariantInputArgumentCollection, getSamplesForVariants, getSequenceDictionaryForDrivingVariants, getSpliteratorForDrivingVariants, initializeDrivingVariants, onShutdown, onStartup
getBestAvailableSequenceDictionary, getProgressMeterRecordLabel, getTransformedVariantStream, makePostVariantFilterTransformer, makePreVariantFilterTransformer, makeVariantFilter, requiresFeatures, traverse
addFeatureInputsAfterInitialization, addFeatureInputsAfterInitialization, createSAMWriter, createSAMWriter, createVCFWriter, doWork, getDefaultCloudIndexPrefetchBufferSize, getDefaultCloudPrefetchBufferSize, getDefaultReadFilters, getDefaultToolVCFHeaderLines, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getHeaderForFeatures, getHeaderForReads, getHeaderForSAMWriter, getMasterSequenceDictionary, getPluginDescriptors, getReferenceDictionary, getSequenceDictionaryValidationArgumentCollection, getToolName, getTransformedReadStream, getTraversalIntervals, hasFeatures, hasReads, hasReference, hasUserSuppliedIntervals, makePostReadFilterTransformer, makePreReadFilterTransformer, makeReadFilter, makeVariantAnnotations, requiresIntervals, requiresReads, requiresReference, useVariantAnnotations
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
@Advanced @Argument(fullName="max-attempts", doc="Number of attempts to build a model before failing", optional=true) protected int max_attempts
public void onTraversalStart()
GATKTool
onTraversalStart
in class GATKTool
public void apply(htsjdk.variant.variantcontext.VariantContext vc, ReadsContext readsContext, ReferenceContext ref, FeatureContext featureContext)
VariantWalkerBase
apply
in class VariantWalkerBase
vc
- Current variant being processed.readsContext
- Reads overlapping the current variant. Will be an empty, but non-null, context object
if there is no backing source of reads data (in which case all queries on it will return
an empty array/iterator)ref
- Reference bases spanning the current variant. Will be an empty, but non-null, context object
if there is no backing source of reference data (in which case all queries on it will return
an empty array/iterator). Can request extra bases of context around the current variant's interval
by invoking ReferenceContext.setWindow(int, int)
on this object before calling ReferenceContext.getBases()
featureContext
- Features spanning the current variant. Will be an empty, but non-null, context object
if there is no backing source of Feature data (in which case all queries on it will return an
empty List).public java.lang.Object onTraversalSuccess()
GATKTool
onTraversalSuccess
in class GATKTool
public void closeTool()
GATKTool
GATKTool.doWork()
template method.
It is called regardless of whether the GATKTool.traverse()
has succeeded or not.
It is called after the GATKTool.onTraversalSuccess()
has completed (successfully or not)
but before the GATKTool.doWork()
method returns.
In other words, on successful runs both GATKTool.onTraversalSuccess()
and GATKTool.closeTool()
will be called (in this order) while
on failed runs (when GATKTool.traverse()
causes an exception), only GATKTool.closeTool()
will be called.
The default implementation does nothing.
Subclasses should override this method to close any resources that must be closed regardless of the success of traversal.protected org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel GMMFromTables(GATKReportTable muTable, GATKReportTable sigmaTable, GATKReportTable pmixTable, int numAnnotations, int numVariants)
muTable
- Table of Gaussian meanssigmaTable
- Table of Gaussian co-variatespmixTable
- Table of PMixLog10 valuesnumAnnotations
- Number of annotations, i.e. Dimension of the annotation space in which the Gaussians liveprotected GATKReport writeModelReport(org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel goodModel, org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel badModel, java.util.List<java.lang.String> annotationList)
protected GATKReportTable makeVectorTable(java.lang.String tableName, java.lang.String tableDescription, java.util.List<java.lang.String> annotationList, double[] perAnnotationValues, java.lang.String columnName, java.lang.String formatString)