VariantRecalibrator (gatk 4.0.8.0 API)

java.lang.Object
- org.broadinstitute.hellbender.cmdline.CommandLineProgram
  - org.broadinstitute.hellbender.engine.GATKTool
    - org.broadinstitute.hellbender.engine.VariantWalkerBase
      - org.broadinstitute.hellbender.engine.MultiVariantWalker
        - org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator

All Implemented Interfaces:

org.broadinstitute.barclay.argparser.CommandLinePluginProvider
```
@DocumentedFeature
public class VariantRecalibrator
extends MultiVariantWalker
```
Build a recalibration model to score variant quality for filtering purposes
This tool performs the first pass in a two-stage process called Variant Quality Score Recalibration (VQSR). Specifically, it builds the model that will be used in the second step to actually filter variants. This model attempts to describe the relationship between variant annotations (such as QD, MQ and ReadPosRankSum) and the probability that a variant is a true genetic variant versus a sequencing or data processing artifact. It is developed adaptively based on "true sites" provided as input, typically HapMap sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array (in humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The result is a score called the VQSLOD that gets added to the INFO field of each variant. This score is the log odds of being a true variant versus being a false one under the trained Gaussian mixture model.
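To make the idea of a log-odds score concrete, the following is a minimal sketch, not the tool's implementation. It assumes a single annotation (one dimension), hand-picked mixture parameters, and a base-10 logarithm, and it leaves out the per-resource priors and everything else the real multivariate model does. The class and method names (`LogOddsSketch`, `Component`, `log10Odds`) are hypothetical.

```java
final class LogOddsSketch {
    /** One univariate Gaussian component: mixture weight, mean, standard deviation. */
    static final class Component {
        final double weight, mean, sd;

        Component(double weight, double mean, double sd) {
            this.weight = weight;
            this.mean = mean;
            this.sd = sd;
        }

        /** Density of this component at x, scaled by its mixture weight. */
        double weightedDensity(double x) {
            double z = (x - mean) / sd;
            return weight * Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
        }
    }

    /** Sum of the weighted component densities at x. */
    static double mixtureDensity(Component[] mixture, double x) {
        double sum = 0.0;
        for (Component c : mixture) {
            sum += c.weightedDensity(x);
        }
        return sum;
    }

    /** log10 of the density under the "good" mixture over the density under the "bad" mixture. */
    static double log10Odds(Component[] good, Component[] bad, double x) {
        return Math.log10(mixtureDensity(good, x)) - Math.log10(mixtureDensity(bad, x));
    }

    public static void main(String[] args) {
        // Hypothetical parameters: "good" sites cluster around +1, "bad" sites around -1.
        Component[] good = { new Component(1.0, 1.0, 1.0) };
        Component[] bad = { new Component(1.0, -1.0, 1.0) };
        System.out.println(log10Odds(good, bad, 1.0));   // positive: looks like a true variant
        System.out.println(log10Odds(good, bad, -1.0));  // negative: looks like an artifact
    }
}
```

A value near the "good" cluster gets a positive score and a value near the "bad" cluster gets a negative one, which is the property the filtering step relies on.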

Summary of the VQSR procedure

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. These probabilities can then be used to filter the variants with a greater level of accuracy and flexibility than can typically be achieved by traditional hard-filtering (filtering on individual annotation value thresholds). The first pass consists of building a model that describes how variant annotation values co-vary with the truthfulness of variant calls in a training set, and then scoring all input variants according to the model. The second pass simply consists of specifying a target sensitivity value (which corresponds to an empirical VQSLOD cutoff) and applying filters to each variant call according to its ranking. The result is a VCF file in which variants have been assigned a score and filter status.
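The correspondence between a target sensitivity and an empirical VQSLOD cutoff can be illustrated with a small sketch. It assumes that sensitivity is measured as the fraction of truth-set sites whose score is at or above the cutoff; the exact tranche computation in the tool may differ, and the names below (`SensitivityCutoffSketch`, `cutoffForSensitivity`) are hypothetical.

```java
import java.util.Arrays;

final class SensitivityCutoffSketch {
    /**
     * Returns the highest cutoff such that at least targetSensitivity of the
     * truth-site scores are greater than or equal to it.
     */
    static double cutoffForSensitivity(double[] truthScores, double targetSensitivity) {
        if (truthScores.length == 0) {
            throw new IllegalArgumentException("need at least one truth-site score");
        }
        if (!(targetSensitivity > 0.0 && targetSensitivity <= 1.0)) {
            throw new IllegalArgumentException("target sensitivity must be in (0, 1]");
        }
        double[] sorted = truthScores.clone();
        Arrays.sort(sorted); // ascending order
        // Number of truth sites that must pass the filter.
        int needed = Math.max(1, (int) Math.ceil(targetSensitivity * sorted.length));
        // The needed-th highest score is the cutoff.
        return sorted[sorted.length - needed];
    }

    public static void main(String[] args) {
        double[] scores = { -2.0, 0.0, 1.0, 3.0 }; // hypothetical VQSLOD values at truth sites
        System.out.println(cutoffForSensitivity(scores, 0.75));
    }
}
```

Raising the target sensitivity lowers the cutoff, so more calls pass the filter at the cost of admitting more false positives.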

VQSR is probably the hardest part of the Best Practices to get right, so be sure to read the method documentation, parameter recommendations and tutorial to really understand what these tools do and how to use them for best results on your own data.

Inputs
- The input variants to be recalibrated. These variant calls must be annotated with the annotations that will be used for modeling. If the calls come from multiple samples, they must have been obtained by joint calling the samples, either directly (running HaplotypeCaller on all samples together) or via the GVCF workflow (HaplotypeCaller with -ERC GVCF per-sample then GenotypeGVCFs on the resulting gVCFs) which is more scalable.
- Known, truth, and training sets to be used by the algorithm. See the method documentation linked above for more details.
Outputs
- A recalibration table file that will be used by the ApplyVQSR tool.
- A tranches file that shows various metrics of the recalibration callset for slices of the data.
Usage example

Recalibrating SNPs in exome data
```
 gatk VariantRecalibrator \
   -R Homo_sapiens_assembly38.fasta \
   -V input.vcf.gz \
   --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
   --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
   --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
   --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
   -mode SNP \
   -O output.recal \
   --tranches-file output.tranches \
   --rscript-file output.plots.R
 
```
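The `--resource` values in the example above follow the pattern `name,key=value,...:path`. As a reading aid only, here is a minimal sketch that splits such a string into its parts. It is not GATK's argument parser, and the class and method names are hypothetical. It assumes the first colon separates the tags from the file path, so that colons inside the path itself are preserved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ResourceArgSketch {
    /** Splits "name,key=value,...:path" into an ordered map with "name", the tags, and "path". */
    static Map<String, String> parse(String arg) {
        int colon = arg.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected <name,tags>:<path> but got: " + arg);
        }
        String[] tags = arg.substring(0, colon).split(",");
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("name", tags[0]);
        for (int i = 1; i < tags.length; i++) {
            String[] keyValue = tags[i].split("=", 2);
            if (keyValue.length != 2) {
                throw new IllegalArgumentException("expected key=value but got: " + tags[i]);
            }
            parsed.put(keyValue[0], keyValue[1]);
        }
        parsed.put("path", arg.substring(colon + 1));
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse(
            "hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz"));
    }
}
```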
Allele-specific version of the SNP recalibration (beta)
```
 gatk VariantRecalibrator \
   -R Homo_sapiens_assembly38.fasta \
   -V input.vcf.gz \
   -AS \
   --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
   --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
   --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
   --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
   -mode SNP \
   -O output.AS.recal \
   --tranches-file output.AS.tranches \
   --rscript-file output.plots.AS.R
 
```
Note that to use the allele-specific (AS) mode, the input VCF must have been produced using allele-specific annotations in HaplotypeCaller. Note also that each allele will have a separate line in the output recalibration file with its own VQSLOD and `culprit`, which will be transferred to the final VCF by the ApplyVQSR tool.
Caveats
- The values used in the examples above are only meant to show how the command lines are composed. They are not meant to be taken as specific recommendations of values to use in your own work, and they may be different from the values cited elsewhere in our documentation. For the latest and greatest recommendations on how to set parameter values for your own analyses, please read the Best Practices section of the documentation, especially the FAQ document on VQSR parameters.
- Whole genomes and exomes take slightly different parameters, so make sure you adapt your commands accordingly! See the documents linked above for details.
- If you work with small datasets (e.g. targeted capture experiments or small number of exomes), you will run into problems. Read the docs linked above for advice on how to deal with those issues.
- In order to create the model reporting plots, the Rscript executable needs to be in your environment PATH (this is the scripting version of R, not the interactive version). See http://www.r-project.org for more information on how to download and install R.
Additional notes
- This tool only accepts a single input variant file, unlike earlier versions of GATK, which accepted multiple input variant files.
- SNPs and indels must be recalibrated in separate runs, but it is not necessary to separate them into different files. See the tutorial linked above for an example workflow. Note that mixed records are treated as indels.

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`max_attempts` The statistical model being built by this tool may fail due to simple statistical sampling issues.
- Fields inherited from class org.broadinstitute.hellbender.engine.MultiVariantWalker
  multiVariantInputArgumentCollection
- Fields inherited from class org.broadinstitute.hellbender.engine.VariantWalkerBase
  FEATURE_CACHE_LOOKAHEAD
- Fields inherited from class org.broadinstitute.hellbender.engine.GATKTool
  addOutputSAMProgramRecord, addOutputVCFCommandLine, cloudIndexPrefetchBuffer, cloudPrefetchBuffer, createOutputBamIndex, createOutputBamMD5, createOutputVariantIndex, createOutputVariantMD5, disableBamIndexCaching, intervalArgumentCollection, lenientVCFProcessing, outputSitesOnlyVCFs, progressMeter, readArguments, referenceArguments, SECONDS_BETWEEN_PROGRESS_UPDATES_NAME, seqValidationArguments
- Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
  GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, QUIET, specialArgumentsCollection, TMP_DIR, useJdkDeflater, useJdkInflater, VERBOSITY

Constructor Summary

Constructors
Constructor and Description

VariantRecalibrator()

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`apply(htsjdk.variant.variantcontext.VariantContext vc, ReadsContext readsContext, ReferenceContext ref, FeatureContext featureContext)` Process an individual variant.
`void`	`closeTool()` This method is called by the GATK framework at the end of the `GATKTool.doWork()` template method.
`protected org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel`	`GMMFromTables(GATKReportTable muTable, GATKReportTable sigmaTable, GATKReportTable pmixTable, int numAnnotations, int numVariants)` Rebuild a Gaussian mixture model from Gaussian means and co-variates stored in GATKReportTables.
`protected GATKReportTable`	`makeVectorTable(java.lang.String tableName, java.lang.String tableDescription, java.util.List<java.lang.String> annotationList, double[] perAnnotationValues, java.lang.String columnName, java.lang.String formatString)`
`void`	`onTraversalStart()` Operations performed just prior to the start of traversal.
`java.lang.Object`	`onTraversalSuccess()` Operations performed immediately after a successful traversal (i.e. when no uncaught exceptions were thrown during the traversal).
`protected GATKReport`	`writeModelReport(org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel goodModel, org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel badModel, java.util.List<java.lang.String> annotationList)`

Methods inherited from class org.broadinstitute.hellbender.engine.MultiVariantWalker
getDrivingVariantsFeatureInputs, getHeaderForVariants, getMultiVariantInputArgumentCollection, getSamplesForVariants, getSequenceDictionaryForDrivingVariants, getSpliteratorForDrivingVariants, initializeDrivingVariants, onShutdown, onStartup

Methods inherited from class org.broadinstitute.hellbender.engine.VariantWalkerBase
getBestAvailableSequenceDictionary, getProgressMeterRecordLabel, getTransformedVariantStream, makePostVariantFilterTransformer, makePreVariantFilterTransformer, makeVariantFilter, requiresFeatures, traverse

Methods inherited from class org.broadinstitute.hellbender.engine.GATKTool
addFeatureInputsAfterInitialization, addFeatureInputsAfterInitialization, createSAMWriter, createSAMWriter, createVCFWriter, doWork, getDefaultCloudIndexPrefetchBufferSize, getDefaultCloudPrefetchBufferSize, getDefaultReadFilters, getDefaultToolVCFHeaderLines, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getHeaderForFeatures, getHeaderForReads, getHeaderForSAMWriter, getMasterSequenceDictionary, getPluginDescriptors, getReferenceDictionary, getSequenceDictionaryValidationArgumentCollection, getToolName, getTransformedReadStream, getTraversalIntervals, hasFeatures, hasReads, hasReference, hasUserSuppliedIntervals, makePostReadFilterTransformer, makePreReadFilterTransformer, makeReadFilter, makeVariantAnnotations, requiresIntervals, requiresReads, requiresReference, useVariantAnnotations

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - max_attempts
```
@Advanced
 @Argument(fullName="max-attempts",
          doc="Number of attempts to build a model before failing",
          optional=true)
protected int max_attempts
```
    The statistical model being built by this tool may fail due to simple statistical sampling issues. Rather than dying immediately when the initial model fails, this argument allows the tool to restart with a different random seed and try to build the model again. The first successfully built model will be kept. Note that the most common underlying cause of model building failure is that there is insufficient data to build a really robust model. This argument provides a workaround for that issue but it is preferable to provide this tool with more data (typically by including more samples or more territory) in order to generate a more robust model.
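    The retry behaviour described above can be pictured with the following minimal sketch. It is not the tool's code: the way a new seed is derived for each attempt (`baseSeed + attempt`) is an assumption, and `RetrySketch`, `buildWithRetries` and the `builder` callback are hypothetical names.

```java
import java.util.Optional;
import java.util.Random;
import java.util.function.Function;

final class RetrySketch {
    /**
     * Calls builder up to maxAttempts times, each time with a differently seeded
     * random number generator, and returns the first model that is built successfully.
     * An empty Optional means every attempt failed.
     */
    static <M> Optional<M> buildWithRetries(int maxAttempts, long baseSeed,
                                            Function<Random, Optional<M>> builder) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Assumption for illustration: each attempt gets its own seed.
            Random rng = new Random(baseSeed + attempt);
            Optional<M> model = builder.apply(rng);
            if (model.isPresent()) {
                return model; // keep the first successfully built model
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical builder that "fails" whenever the first random draw is below 0.5.
        Optional<String> model = buildWithRetries(4, 1L,
            rng -> rng.nextDouble() < 0.5 ? Optional.<String>empty() : Optional.of("model"));
        System.out.println(model.isPresent() ? "built a model" : "all attempts failed");
    }
}
```

    As the description notes, retrying only works around sampling failures; it does not make a model built from too little data more robust.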
- Constructor Detail
  - VariantRecalibrator
```
public VariantRecalibrator()
```
- Method Detail
  - onTraversalStart
```
public void onTraversalStart()
```
    Description copied from class: GATKTool
    
    Operations performed just prior to the start of traversal. Should be overridden by tool authors who need to process arguments local to their tool or perform other kinds of local initialization. Default implementation does nothing.
    
    Overrides:
    
    onTraversalStart in class GATKTool
  - apply
```
public void apply(htsjdk.variant.variantcontext.VariantContext vc,
                  ReadsContext readsContext,
                  ReferenceContext ref,
                  FeatureContext featureContext)
```
    Description copied from class: VariantWalkerBase
    
    Process an individual variant. Must be implemented by tool authors. In general, tool authors should simply stream their output from apply(), and maintain as little internal state as possible.
    
    Specified by:
    
    apply in class VariantWalkerBase
    
    Parameters:
    
    vc - Current variant being processed.
    
    readsContext - Reads overlapping the current variant. Will be an empty, but non-null, context object if there is no backing source of reads data (in which case all queries on it will return an empty array/iterator)
    
    ref - Reference bases spanning the current variant. Will be an empty, but non-null, context object if there is no backing source of reference data (in which case all queries on it will return an empty array/iterator). Can request extra bases of context around the current variant's interval by invoking ReferenceContext.setWindow(int, int) on this object before calling ReferenceContext.getBases()
    
    featureContext - Features spanning the current variant. Will be an empty, but non-null, context object if there is no backing source of Feature data (in which case all queries on it will return an empty List).
  - onTraversalSuccess
```
public java.lang.Object onTraversalSuccess()
```
    Description copied from class: GATKTool
    
    Operations performed immediately after a successful traversal (i.e. when no uncaught exceptions were thrown during the traversal). Should be overridden by tool authors who need to close local resources, etc., after traversal. Also allows tools to return a value representing the traversal result, which is printed by the engine. Default implementation does nothing and returns null.
    
    Overrides:
    
    onTraversalSuccess in class GATKTool
    
    Returns:
    
    Object representing the traversal result, or null if a tool does not return a value
  - closeTool
```
public void closeTool()
```
    Description copied from class: GATKTool
    
    This method is called by the GATK framework at the end of the GATKTool.doWork() template method. It is called regardless of whether the GATKTool.traverse() has succeeded or not. It is called after the GATKTool.onTraversalSuccess() has completed (successfully or not) but before the GATKTool.doWork() method returns. In other words, on successful runs both GATKTool.onTraversalSuccess() and GATKTool.closeTool() will be called (in this order) while on failed runs (when GATKTool.traverse() causes an exception), only GATKTool.closeTool() will be called. The default implementation does nothing. Subclasses should override this method to close any resources that must be closed regardless of the success of traversal.
    
    Overrides:
    
    closeTool in class GATKTool
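    The call ordering described for `closeTool()` matches a plain try/finally pattern. The sketch below only records the order of calls and does not use any GATK classes. Unlike the real framework, it catches the simulated failure so that the recorded order can be returned; `LifecycleSketch` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class LifecycleSketch {
    /** Simulates one tool run and returns the lifecycle hooks in the order they were called. */
    static List<String> run(boolean traversalFails) {
        List<String> calls = new ArrayList<>();
        try {
            calls.add("onTraversalStart");
            calls.add("traverse");
            if (traversalFails) {
                throw new IllegalStateException("simulated traversal failure");
            }
            // Reached only when traversal completed without an exception.
            calls.add("onTraversalSuccess");
        } catch (IllegalStateException e) {
            // Swallowed here only so the sketch can return the recorded order.
        } finally {
            // Runs whether or not traversal succeeded.
            calls.add("closeTool");
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```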
  - GMMFromTables
```
protected org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel GMMFromTables(GATKReportTable muTable,
                                                                                              GATKReportTable sigmaTable,
                                                                                              GATKReportTable pmixTable,
                                                                                              int numAnnotations,
                                                                                              int numVariants)
```
    Rebuild a Gaussian mixture model from Gaussian means and co-variates stored in GATKReportTables.
    
    Parameters:
    
    muTable - Table of Gaussian means
    
    sigmaTable - Table of Gaussian co-variates
    
    pmixTable - Table of PMixLog10 values
    
    numAnnotations - Number of annotations, i.e. the dimension of the annotation space in which the Gaussians live
    
    Returns:
    
    a GaussianMixtureModel whose state reflects the state recorded in the tables.
  - writeModelReport
```
protected GATKReport writeModelReport(org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel goodModel,
                                      org.broadinstitute.hellbender.tools.walkers.vqsr.GaussianMixtureModel badModel,
                                      java.util.List<java.lang.String> annotationList)
```
  - makeVectorTable
```
protected GATKReportTable makeVectorTable(java.lang.String tableName,
                                          java.lang.String tableDescription,
                                          java.util.List<java.lang.String> annotationList,
                                          double[] perAnnotationValues,
                                          java.lang.String columnName,
                                          java.lang.String formatString)
```
