HaplotypeCaller (gatk 4.1.4.0 API)

java.lang.Object
- org.broadinstitute.hellbender.cmdline.CommandLineProgram
  - org.broadinstitute.hellbender.engine.GATKTool
    - org.broadinstitute.hellbender.engine.WalkerBase
      - org.broadinstitute.hellbender.engine.AssemblyRegionWalker
        - org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller

All Implemented Interfaces:

org.broadinstitute.barclay.argparser.CommandLinePluginProvider
```
@DocumentedFeature
public final class HaplotypeCaller
extends AssemblyRegionWalker
```
Call germline SNPs and indels via local re-assembly of haplotypes
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF workflow used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way. The GVCF workflow enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).

In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note, however, that the algorithms used to calculate variant likelihoods are not well suited to extreme allele frequencies (relative to ploidy), so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use Mutect2 instead.

Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers, on the condition that the input read data has previously been processed according to our recommendations as documented here.

How HaplotypeCaller works

1. Define active regions

The program determines which regions of the genome it needs to operate on (active regions), based on the presence of evidence for variation.
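As a rough illustration of this step, the sketch below groups consecutive loci into active spans by comparing a per-locus activity probability against a threshold. It is a simplification under stated assumptions: the class and method names are hypothetical, the `>=` comparison and the threshold value are arbitrary choices, and region padding and minimum/maximum region sizes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: class and method names are hypothetical, not GATK API.
final class ActiveSpanSketch {
    /**
     * Groups consecutive loci whose activity probability meets the threshold.
     * Each returned element is {start, end} with end exclusive.
     */
    static List<int[]> activeSpans(double[] activityProb, double threshold) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside an active span"
        for (int i = 0; i < activityProb.length; i++) {
            boolean active = activityProb[i] >= threshold;
            if (active && start < 0) {
                start = i;                        // open a new span
            } else if (!active && start >= 0) {
                spans.add(new int[] {start, i});  // close the current span
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, activityProb.length}); // span runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        // Made-up per-locus probabilities and an arbitrary threshold.
        double[] prob = {0.0, 0.6, 0.9, 0.1, 0.0, 0.7, 0.0};
        for (int[] span : activeSpans(prob, 0.5)) {
            System.out.println("active: [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```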

2. Determine haplotypes by assembly of the active region

For each active region, the program builds a De Bruijn-like graph to reassemble the active region and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm to identify potentially variant sites.
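To make the graph idea concrete, here is a minimal sketch of a k-mer graph built from reads. It assumes a textbook construction in which each k-mer is a vertex and an edge links k-mers that are adjacent in a read, with a count of how often that adjacency was seen. The class name and data layout are hypothetical; the actual assembler performs further steps that are not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a textbook k-mer graph, not the GATK assembler.
final class KmerGraphSketch {
    /**
     * Returns edges as: source k-mer -> (target k-mer -> multiplicity).
     * An edge links two k-mers that overlap by k-1 bases within a read.
     */
    static Map<String, Map<String, Integer>> build(List<String> reads, int k) {
        Map<String, Map<String, Integer>> edges = new LinkedHashMap<>();
        for (String read : reads) {
            // i + k < length guarantees that the successor k-mer also fits in the read.
            for (int i = 0; i + k < read.length(); i++) {
                String from = read.substring(i, i + k);
                String to = read.substring(i + 1, i + k + 1);
                edges.computeIfAbsent(from, key -> new LinkedHashMap<>())
                     .merge(to, 1, Integer::sum);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Two made-up reads that differ at one base: the graph branches after "ACG".
        Map<String, Map<String, Integer>> graph = build(List.of("ACGTA", "ACGGA"), 3);
        graph.forEach((from, targets) -> System.out.println(from + " -> " + targets));
    }
}
```

A vertex with more than one outgoing edge marks a point where the reads disagree, and each distinct path through the graph is a candidate haplotype.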

3. Determine likelihoods of the haplotypes given the read data

For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.
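The marginalization can be sketched as follows. This assumes one common convention, in which the likelihood of an allele for a given read is the best (maximum) log10 likelihood among the haplotypes that carry that allele. That convention, the class name, and the array layout are assumptions made for illustration, and the numbers in `main` are invented.

```java
import java.util.Arrays;

// Illustrative only: names and the max-based rule are assumptions, not GATK API.
final class MarginalizeSketch {
    /**
     * @param log10ReadByHaplotype [read][haplotype] log10 likelihoods (e.g. from a PairHMM)
     * @param haplotypeAllele      for each haplotype, the index of the allele it carries at the site
     * @param alleleCount          number of alleles at the site
     * @return [read][allele] log10 likelihoods; NEGATIVE_INFINITY if no haplotype carries the allele
     */
    static double[][] toAlleleLikelihoods(double[][] log10ReadByHaplotype,
                                          int[] haplotypeAllele,
                                          int alleleCount) {
        double[][] result = new double[log10ReadByHaplotype.length][alleleCount];
        for (int r = 0; r < result.length; r++) {
            Arrays.fill(result[r], Double.NEGATIVE_INFINITY);
            for (int h = 0; h < haplotypeAllele.length; h++) {
                int allele = haplotypeAllele[h];
                // Keep the best-supported haplotype for each allele.
                result[r][allele] = Math.max(result[r][allele], log10ReadByHaplotype[r][h]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] readByHap = {{-1.0, -5.0, -3.0}, {-4.0, -2.0, -6.0}};
        int[] haplotypeAllele = {0, 1, 1}; // haplotype 0 carries ref; haplotypes 1 and 2 carry alt
        System.out.println(Arrays.deepToString(toAlleleLikelihoods(readByHap, haplotypeAllele, 2)));
    }
}
```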

4. Assign sample genotypes

For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.
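A minimal sketch of this step for the simplest case, a diploid sample at a biallelic site, might look like the following. It assumes each read is equally likely to come from either chromosome copy, takes per-read allele likelihoods in linear space (all strictly positive), and uses a caller-supplied prior. The class, the method names, and the example numbers are hypothetical.

```java
// Illustrative only: diploid, biallelic genotyping with Bayes' rule.
final class GenotypeSketch {
    static final String[] GENOTYPES = {"0/0", "0/1", "1/1"};

    /**
     * @param readAlleleLik for each read, {P(read | ref allele), P(read | alt allele)}
     * @param prior         prior probability of each genotype in GENOTYPES order
     * @return normalized posterior probability of each genotype
     */
    static double[] posteriors(double[][] readAlleleLik, double[] prior) {
        double[] logPost = new double[GENOTYPES.length];
        for (int g = 0; g < GENOTYPES.length; g++) {
            double altFraction = g / 2.0; // 0, 0.5 or 1 copies of alt out of 2
            double logLik = 0.0;
            for (double[] lik : readAlleleLik) {
                // A read comes from either chromosome copy with equal probability.
                logLik += Math.log((1 - altFraction) * lik[0] + altFraction * lik[1]);
            }
            logPost[g] = logLik + Math.log(prior[g]); // numerator of Bayes' rule
        }
        // Normalize in a numerically stable way.
        double max = Math.max(logPost[0], Math.max(logPost[1], logPost[2]));
        double sum = 0.0;
        double[] post = new double[GENOTYPES.length];
        for (int g = 0; g < post.length; g++) {
            post[g] = Math.exp(logPost[g] - max);
            sum += post[g];
        }
        for (int g = 0; g < post.length; g++) {
            post[g] /= sum;
        }
        return post;
    }

    /** Returns the genotype with the highest posterior probability. */
    static String call(double[][] readAlleleLik, double[] prior) {
        double[] post = posteriors(readAlleleLik, prior);
        int best = 0;
        for (int g = 1; g < post.length; g++) {
            if (post[g] > post[best]) {
                best = g;
            }
        }
        return GENOTYPES[best];
    }

    public static void main(String[] args) {
        // Made-up data: two reads favour ref, two favour alt; flat prior.
        double[][] reads = {{0.99, 0.01}, {0.99, 0.01}, {0.01, 0.99}, {0.01, 0.99}};
        double[] flatPrior = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        System.out.println(call(reads, flatPrior));
    }
}
```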

Input

Input BAM file(s) from which to make variant calls.

Output

Either a VCF or GVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (Best Practice) or hard-filtering before use in downstream analyses. If using the GVCF workflow, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.

Usage examples

These are example commands that show how to run HaplotypeCaller for typical use cases. Have a look at the method documentation for the basic GVCF workflow.

Single-sample GVCF calling (outputs intermediate GVCF)
```
gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I input.bam \
  -O output.g.vcf.gz \
  -ERC GVCF
```
Single-sample GVCF calling with allele-specific annotations
```
gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I input.bam \
  -O output.g.vcf.gz \
  -ERC GVCF \
  -G Standard \
  -G AS_Standard
```
Variant calling with bamout to show realigned reads
```
gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I input.bam \
  -O output.vcf.gz \
  -bamout bamout.bam
```
Caveats
- We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.
Special note on ploidy

This tool is able to handle many non-diploid use cases; the desired ploidy can be specified using the -ploidy argument. Note however that very high ploidies (such as are encountered in large pooled experiments) may cause performance challenges including excessive slowness. We are working on resolving these limitations.
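As background on why high ploidy can be expensive (offered as general combinatorics rather than a description of GATK internals), the number of distinct unordered genotypes for a given ploidy and allele count is the multiset coefficient C(ploidy + alleles - 1, ploidy). The sketch below computes it; the class and method names are hypothetical.

```java
// Illustrative only: counts unordered genotypes, not a GATK API.
final class GenotypeCountSketch {
    /**
     * Number of ways to choose `ploidy` alleles from `alleleCount` alleles,
     * with repetition and ignoring order: C(ploidy + alleleCount - 1, ploidy).
     * May overflow a long for very large inputs.
     */
    static long genotypeCount(int ploidy, int alleleCount) {
        long result = 1;
        for (int i = 1; i <= ploidy; i++) {
            // After this step, result == C(alleleCount - 1 + i, i), so the division is exact.
            result = result * (alleleCount - 1 + i) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ploidies = {2, 4, 10, 20};
        for (int ploidy : ploidies) {
            System.out.println("ploidy " + ploidy + ", 3 alleles: "
                    + genotypeCount(ploidy, 3) + " genotypes");
        }
    }
}
```

For a diploid sample with two alleles this gives the familiar three genotypes; the count rises quickly as ploidy or the number of alleles increases.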

Additional Notes
- When working with PCR-free data, be sure to set `--pcr-indel-model NONE` (see argument below).
- When running in `-ERC GVCF` or `-ERC BP_RESOLUTION` mode, the confidence threshold is automatically set to 0. This cannot be overridden from the command line. The threshold can be set manually to the desired level in the next step of the workflow (GenotypeGVCFs).
- We recommend using a list of intervals to speed up analysis. See this document for details.

Field Summary

Fields
Modifier and Type	Field and Description
`static double`	`DEFAULT_ACTIVE_PROB_THRESHOLD`
`static int`	`DEFAULT_ASSEMBLY_REGION_PADDING`
`static int`	`DEFAULT_MAX_ASSEMBLY_REGION_SIZE`
`static int`	`DEFAULT_MAX_PROB_PROPAGATION_DISTANCE`
`static int`	`DEFAULT_MAX_READS_PER_ALIGNMENT`
`static int`	`DEFAULT_MIN_ASSEMBLY_REGION_SIZE`
`java.lang.String`	`outputVCF` A raw, unfiltered, highly sensitive callset in VCF format.

Fields inherited from class org.broadinstitute.hellbender.engine.GATKTool
addOutputSAMProgramRecord, addOutputVCFCommandLine, cloudIndexPrefetchBuffer, cloudPrefetchBuffer, createOutputBamIndex, createOutputBamMD5, createOutputVariantIndex, createOutputVariantMD5, disableBamIndexCaching, features, intervalArgumentCollection, lenientVCFProcessing, outputSitesOnlyVCFs, progressMeter, readArguments, referenceArguments, SECONDS_BETWEEN_PROGRESS_UPDATES_NAME, seqValidationArguments

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY

Constructor Summary

Constructors
Constructor and Description
`HaplotypeCaller()`

Method Summary

Modifier and Type	Method and Description
`void`	`apply(AssemblyRegion region, ReferenceContext referenceContext, FeatureContext featureContext)` Process an individual AssemblyRegion.
`AssemblyRegionEvaluator`	`assemblyRegionEvaluator()`
`void`	`closeTool()` This method is called by the GATK framework at the end of the `GATKTool.doWork()` template method.
`protected double`	`defaultActiveProbThreshold()`
`protected int`	`defaultAssemblyRegionPadding()`
`protected int`	`defaultMaxAssemblyRegionSize()`
`protected int`	`defaultMaxProbPropagationDistance()`
`protected int`	`defaultMaxReadsPerAlignmentStart()`
`protected int`	`defaultMinAssemblyRegionSize()`
`java.util.List<ReadFilter>`	`getDefaultReadFilters()` Returns the default list of CommandLineReadFilters that are used for this tool.
`java.util.List<java.lang.Class<? extends Annotation>>`	`getDefaultVariantAnnotationGroups()` Returns the default list of annotation groups that are used for this tool.
`protected boolean`	`includeReadsWithDeletionsInIsActivePileups()`
`java.util.Collection<Annotation>`	`makeVariantAnnotations()` In reference confidence mode, some annotations in the standard HaplotypeCaller set are no longer relevant, so they are filtered out before the VariantAnnotationEngine is constructed. This is possible at that stage because the user arguments will already have been parsed.
`void`	`onTraversalStart()` Operations performed just prior to the start of traversal.
`boolean`	`useVariantAnnotations()` Must be overridden in order to add annotation arguments to the engine.

Methods inherited from class org.broadinstitute.hellbender.engine.AssemblyRegionWalker
createDownsampler, getProgressMeterRecordLabel, onShutdown, onStartup, requiresReads, requiresReference, traverse

Methods inherited from class org.broadinstitute.hellbender.engine.WalkerBase
directlyAccessEngineFeatureManager, directlyAccessEngineReadsDataSource, directlyAccessEngineReferenceDataSource

Methods inherited from class org.broadinstitute.hellbender.engine.GATKTool
addFeatureInputsAfterInitialization, createSAMWriter, createSAMWriter, createVCFWriter, createVCFWriter, doWork, getBestAvailableSequenceDictionary, getDefaultCloudIndexPrefetchBufferSize, getDefaultCloudPrefetchBufferSize, getDefaultToolVCFHeaderLines, getDefaultVariantAnnotations, getGenomicsDBOptions, getHeaderForFeatures, getHeaderForReads, getHeaderForSAMWriter, getMasterSequenceDictionary, getPluginDescriptors, getReferenceDictionary, getSequenceDictionaryValidationArgumentCollection, getToolName, getTransformedReadStream, getTraversalIntervals, hasFeatures, hasReads, hasReference, hasUserSuppliedIntervals, makePostReadFilterTransformer, makePreReadFilterTransformer, makeReadFilter, onTraversalSuccess, requiresFeatures, requiresIntervals, transformTraversalIntervals

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - DEFAULT_MIN_ASSEMBLY_REGION_SIZE
```
public static final int DEFAULT_MIN_ASSEMBLY_REGION_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_ASSEMBLY_REGION_SIZE
```
public static final int DEFAULT_MAX_ASSEMBLY_REGION_SIZE
```
    See Also:
    
    Constant Field Values
  - DEFAULT_ASSEMBLY_REGION_PADDING
```
public static final int DEFAULT_ASSEMBLY_REGION_PADDING
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_READS_PER_ALIGNMENT
```
public static final int DEFAULT_MAX_READS_PER_ALIGNMENT
```
    See Also:
    
    Constant Field Values
  - DEFAULT_ACTIVE_PROB_THRESHOLD
```
public static final double DEFAULT_ACTIVE_PROB_THRESHOLD
```
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_PROB_PROPAGATION_DISTANCE
```
public static final int DEFAULT_MAX_PROB_PROPAGATION_DISTANCE
```
    See Also:
    
    Constant Field Values
  - outputVCF
```
@Argument(fullName="output",
          shortName="O",
          doc="File to which variants should be written")
public java.lang.String outputVCF
```
    A raw, unfiltered, highly sensitive callset in VCF format.
- Constructor Detail
  - HaplotypeCaller
```
public HaplotypeCaller()
```
- Method Detail
  - defaultMinAssemblyRegionSize
```
protected int defaultMinAssemblyRegionSize()
```
    Specified by:
    
    defaultMinAssemblyRegionSize in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.minAssemblyRegionSize parameter, if none is provided on the command line
  - defaultMaxAssemblyRegionSize
```
protected int defaultMaxAssemblyRegionSize()
```
    Specified by:
    
    defaultMaxAssemblyRegionSize in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.maxAssemblyRegionSize parameter, if none is provided on the command line
  - defaultAssemblyRegionPadding
```
protected int defaultAssemblyRegionPadding()
```
    Specified by:
    
    defaultAssemblyRegionPadding in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.assemblyRegionPadding parameter, if none is provided on the command line
  - defaultMaxReadsPerAlignmentStart
```
protected int defaultMaxReadsPerAlignmentStart()
```
    Specified by:
    
    defaultMaxReadsPerAlignmentStart in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.maxReadsPerAlignmentStart parameter, if none is provided on the command line
  - defaultActiveProbThreshold
```
protected double defaultActiveProbThreshold()
```
    Specified by:
    
    defaultActiveProbThreshold in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.activeProbThreshold parameter, if none is provided on the command line
  - defaultMaxProbPropagationDistance
```
protected int defaultMaxProbPropagationDistance()
```
    Specified by:
    
    defaultMaxProbPropagationDistance in class AssemblyRegionWalker
    
    Returns:
    
    Default value for the AssemblyRegionWalker.maxProbPropagationDistance parameter, if none is provided on the command line
  - includeReadsWithDeletionsInIsActivePileups
```
protected boolean includeReadsWithDeletionsInIsActivePileups()
```
    Specified by:
    
    includeReadsWithDeletionsInIsActivePileups in class AssemblyRegionWalker
    
    Returns:
    
    If true, include reads with deletions at the current locus in the pileups passed to the AssemblyRegionEvaluator.
  - getDefaultReadFilters
```
public java.util.List<ReadFilter> getDefaultReadFilters()
```
    Description copied from class: AssemblyRegionWalker
    
    Returns the default list of CommandLineReadFilters that are used for this tool. The filters returned by this method are subject to selective enabling/disabling and customization by the user via the command line. The default implementation uses the WellformedReadFilter filter with all default options, as well as the ReadFilterLibrary.MappedReadFilter. Subclasses can override to provide alternative filters. Note: this method is called before command line parsing begins, and thus before a SAMFileHeader is available through GATKTool.getHeaderForReads().
    
    Overrides:
    
    getDefaultReadFilters in class AssemblyRegionWalker
    
    Returns:
    
    List of default filter instances to be applied for this tool.
  - getDefaultVariantAnnotationGroups
```
public java.util.List<java.lang.Class<? extends Annotation>> getDefaultVariantAnnotationGroups()
```
    Description copied from class: GATKTool
    
    Returns the default list of annotation groups that are used for this tool. The annotations returned by this method will have default arguments, which can be overridden with specific arguments using GATKTool.getDefaultVariantAnnotations(). Returned annotation groups are subject to selective enabling/disabling by the user via the command line. The default implementation returns an empty list.
    
    Overrides:
    
    getDefaultVariantAnnotationGroups in class GATKTool
    
    Returns:
    
    List of annotation groups to be applied for this tool.
  - useVariantAnnotations
```
public boolean useVariantAnnotations()
```
    Description copied from class: GATKTool
    
    Must be overridden in order to add annotation arguments to the engine. If this is set to true, the engine will dynamically discover all Annotations in the package defined by org.broadinstitute.hellbender.cmdline.GATKPlugin.GATKAnnotationPluginDescriptor#pluginPackageName and automatically generate and add command line arguments allowing the user to specify which annotations or groups of annotations to use. To specify default annotations for a tool, use GATKTool.getDefaultVariantAnnotationGroups() or GATKTool.getDefaultVariantAnnotations(). To access instantiated annotation objects, use GATKTool.makeVariantAnnotations().
    
    Overrides:
    
    useVariantAnnotations in class GATKTool
  - makeVariantAnnotations
```
public java.util.Collection<Annotation> makeVariantAnnotations()
```
    In reference confidence mode, some annotations in the standard HaplotypeCaller set are no longer relevant, so they are filtered out before the VariantAnnotationEngine is constructed. This is possible at that stage because the user arguments will already have been parsed.
    
    Overrides:
    
    makeVariantAnnotations in class GATKTool
    
    Returns:
    
    a collection of annotation arguments with alterations depending on hcArgs.emitReferenceConfidence
    
    See Also:
    
    GATKTool.makeVariantAnnotations()
  - assemblyRegionEvaluator
```
public AssemblyRegionEvaluator assemblyRegionEvaluator()
```
    Specified by:
    
    assemblyRegionEvaluator in class AssemblyRegionWalker
    
    Returns:
    
    The evaluator to be used to determine whether each locus is active or not. Must be implemented by tool authors. The results of this per-locus evaluator are used to determine the bounds of each active and inactive region.
  - onTraversalStart
```
public void onTraversalStart()
```
    Description copied from class: GATKTool
    
    Operations performed just prior to the start of traversal. Should be overridden by tool authors who need to process arguments local to their tool or perform other kinds of local initialization. Default implementation does nothing.
    
    Overrides:
    
    onTraversalStart in class GATKTool
  - apply
```
public void apply(AssemblyRegion region,
                  ReferenceContext referenceContext,
                  FeatureContext featureContext)
```
    Description copied from class: AssemblyRegionWalker
    
    Process an individual AssemblyRegion. Must be implemented by tool authors. Each region will come pre-marked as either "active" or "inactive" using the results of the configured AssemblyRegionWalker.assemblyRegionEvaluator(). This method will be called once for each active AND inactive region, and it is up to the implementation how to handle/process active vs. inactive regions.
    
    Specified by:
    
    apply in class AssemblyRegionWalker
    
    Parameters:
    
    region - region to process (pre-marked as either active or inactive)
    
    referenceContext - reference data overlapping the full extended span of the assembly region
    
    featureContext - features overlapping the full extended span of the assembly region
  - closeTool
```
public void closeTool()
```
    Description copied from class: GATKTool
    
    This method is called by the GATK framework at the end of the GATKTool.doWork() template method. It is called regardless of whether the GATKTool.traverse() has succeeded or not. It is called after the GATKTool.onTraversalSuccess() has completed (successfully or not) but before the GATKTool.doWork() method returns. In other words, on successful runs both GATKTool.onTraversalSuccess() and GATKTool.closeTool() will be called (in this order) while on failed runs (when GATKTool.traverse() causes an exception), only GATKTool.closeTool() will be called. The default implementation does nothing. Subclasses should override this method to close any resources that must be closed regardless of the success of traversal.
    
    Overrides:
    
    closeTool in class GATKTool
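The call order described for closeTool() can be pictured with a small template-method sketch. This is a hypothetical stand-in written to match the description above (closeTool() runs whether or not traversal succeeds, and after onTraversalSuccess() on successful runs). It is not the GATKTool source, and the classes `ToolLifecycleSketch` and `RecordingTool` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mirrors the documented call order, not the GATK implementation.
abstract class ToolLifecycleSketch {
    protected void onTraversalStart() {}
    protected abstract void traverse();
    protected void onTraversalSuccess() {}
    protected void closeTool() {}

    /** Template method: closeTool() always runs, even if an earlier hook throws. */
    final void doWork() {
        try {
            onTraversalStart();
            traverse();
            onTraversalSuccess();
        } finally {
            closeTool();
        }
    }
}

/** Records which hooks ran so the order can be inspected. */
final class RecordingTool extends ToolLifecycleSketch {
    final List<String> events = new ArrayList<>();
    private final boolean failDuringTraversal;

    RecordingTool(boolean failDuringTraversal) {
        this.failDuringTraversal = failDuringTraversal;
    }

    @Override protected void onTraversalStart() { events.add("onTraversalStart"); }

    @Override protected void traverse() {
        events.add("traverse");
        if (failDuringTraversal) {
            throw new IllegalStateException("simulated traversal failure");
        }
    }

    @Override protected void onTraversalSuccess() { events.add("onTraversalSuccess"); }

    @Override protected void closeTool() { events.add("closeTool"); }

    public static void main(String[] args) {
        RecordingTool tool = new RecordingTool(false);
        tool.doWork();
        System.out.println(tool.events);
    }
}
```

The practical consequence for subclasses is that anything opened in onTraversalStart() (writers, for example) is best released in closeTool() rather than in onTraversalSuccess().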
