@DocumentedFeature public final class HaplotypeCaller extends AssemblyRegionWalker
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.
In the GVCF workflow used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way. The GVCF workflow enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).
In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use Mutect2 instead.
Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers, on the condition that the input read data has previously been processed according to our recommendations as documented here.
The program determines which regions of the genome it needs to operate on (active regions), based on the presence of
evidence for variation.
For each active region, the program builds a De Bruijn-like graph to reassemble the active region and identifies what are the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.
For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.
For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.
Input bam file(s) from which to make variant calls
Either a VCF or GVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (Best Practice) or hard-filtering before use in downstream analyses. If using the GVCF workflow, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.
These are example commands that show how to run HaplotypeCaller for typical use cases. Have a look at the method documentation for the basic GVCF workflow.
gatk --java-options "-Xmx4g" HaplotypeCaller \ -R Homo_sapiens_assembly38.fasta \ -I input.bam \ -O output.g.vcf.gz \ -ERC GVCF
gatk --java-options "-Xmx4g" HaplotypeCaller \ -R Homo_sapiens_assembly38.fasta \ -I input.bam \ -O output.g.vcf.gz \ -ERC GVCF \ -G Standard \ -G AS_Standard
gatk --java-options "-Xmx4g" HaplotypeCaller \ -R Homo_sapiens_assembly38.fasta \ -I input.bam \ -O output.vcf.gz \ -bamout bamout.bam
This tool is able to handle many non-diploid use cases; the desired ploidy can be specified using the -ploidy argument. Note however that very high ploidies (such as are encountered in large pooled experiments) may cause performance challenges including excessive slowness. We are working on resolving these limitations.
Modifier and Type | Field and Description |
---|---|
static double |
DEFAULT_ACTIVE_PROB_THRESHOLD |
static int |
DEFAULT_ASSEMBLY_REGION_PADDING |
static int |
DEFAULT_MAX_ASSEMBLY_REGION_SIZE |
static int |
DEFAULT_MAX_PROB_PROPAGATION_DISTANCE |
static int |
DEFAULT_MAX_READS_PER_ALIGNMENT |
static int |
DEFAULT_MIN_ASSEMBLY_REGION_SIZE |
java.lang.String |
outputVCF
A raw, unfiltered, highly sensitive callset in VCF format.
|
activeProbThreshold, activityProfileOut, ASSEMBLY_PADDING_LONG_NAME, ASSEMBLY_REGION_OUT_LONG_NAME, assemblyRegionOut, assemblyRegionPadding, FORCE_ACTIVE_REGIONS_LONG_NAME, forceActive, MAX_ASSEMBLY_LONG_NAME, MAX_STARTS_LONG_NAME, maxAssemblyRegionSize, maxProbPropagationDistance, maxReadsPerAlignmentStart, MIN_ASSEMBLY_LONG_NAME, minAssemblyRegionSize, PROFILE_OUT_LONG_NAME, PROPAGATION_LONG_NAME, THRESHOLD_LONG_NAME
addOutputSAMProgramRecord, addOutputVCFCommandLine, cloudIndexPrefetchBuffer, cloudPrefetchBuffer, createOutputBamIndex, createOutputBamMD5, createOutputVariantIndex, createOutputVariantMD5, disableBamIndexCaching, features, intervalArgumentCollection, lenientVCFProcessing, outputSitesOnlyVCFs, progressMeter, readArguments, referenceArguments, SECONDS_BETWEEN_PROGRESS_UPDATES_NAME, seqValidationArguments
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
HaplotypeCaller() |
Modifier and Type | Method and Description |
---|---|
void |
apply(AssemblyRegion region,
ReferenceContext referenceContext,
FeatureContext featureContext)
Process an individual AssemblyRegion.
|
AssemblyRegionEvaluator |
assemblyRegionEvaluator() |
void |
closeTool()
This method is called by the GATK framework at the end of the
GATKTool.doWork() template method. |
protected double |
defaultActiveProbThreshold() |
protected int |
defaultAssemblyRegionPadding() |
protected int |
defaultMaxAssemblyRegionSize() |
protected int |
defaultMaxProbPropagationDistance() |
protected int |
defaultMaxReadsPerAlignmentStart() |
protected int |
defaultMinAssemblyRegionSize() |
java.util.List<ReadFilter> |
getDefaultReadFilters()
Returns the default list of CommandLineReadFilters that are used for this tool.
|
java.util.List<java.lang.Class<? extends Annotation>> |
getDefaultVariantAnnotationGroups()
Returns the default list of annotation groups that are used for this tool.
|
protected boolean |
includeReadsWithDeletionsInIsActivePileups() |
java.util.Collection<Annotation> |
makeVariantAnnotations()
If we are in reference confidence mode we want to filter the annotations as there are certain annotations in the standard
HaplotypeCaller set which are no longer relevant, thus we filter them out before constructing the
VariantAnnotationEngine because the user args will have been parsed by that point.
|
void |
onTraversalStart()
Operations performed just prior to the start of traversal.
|
boolean |
useVariantAnnotations()
Must be overridden in order to add annotation arguments to the engine.
|
createDownsampler, getProgressMeterRecordLabel, onShutdown, onStartup, requiresReads, requiresReference, traverse
directlyAccessEngineFeatureManager, directlyAccessEngineReadsDataSource, directlyAccessEngineReferenceDataSource
addFeatureInputsAfterInitialization, createSAMWriter, createSAMWriter, createVCFWriter, createVCFWriter, doWork, getBestAvailableSequenceDictionary, getDefaultCloudIndexPrefetchBufferSize, getDefaultCloudPrefetchBufferSize, getDefaultToolVCFHeaderLines, getDefaultVariantAnnotations, getGenomicsDBOptions, getHeaderForFeatures, getHeaderForReads, getHeaderForSAMWriter, getMasterSequenceDictionary, getPluginDescriptors, getReferenceDictionary, getSequenceDictionaryValidationArgumentCollection, getToolName, getTransformedReadStream, getTraversalIntervals, hasFeatures, hasReads, hasReference, hasUserSuppliedIntervals, makePostReadFilterTransformer, makePreReadFilterTransformer, makeReadFilter, onTraversalSuccess, requiresFeatures, requiresIntervals, transformTraversalIntervals
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final int DEFAULT_MIN_ASSEMBLY_REGION_SIZE
public static final int DEFAULT_MAX_ASSEMBLY_REGION_SIZE
public static final int DEFAULT_ASSEMBLY_REGION_PADDING
public static final int DEFAULT_MAX_READS_PER_ALIGNMENT
public static final double DEFAULT_ACTIVE_PROB_THRESHOLD
public static final int DEFAULT_MAX_PROB_PROPAGATION_DISTANCE
@Argument(fullName="output", shortName="O", doc="File to which variants should be written") public java.lang.String outputVCF
protected int defaultMinAssemblyRegionSize()
defaultMinAssemblyRegionSize
in class AssemblyRegionWalker
AssemblyRegionWalker.minAssemblyRegionSize
parameter, if none is provided on the command lineprotected int defaultMaxAssemblyRegionSize()
defaultMaxAssemblyRegionSize
in class AssemblyRegionWalker
AssemblyRegionWalker.maxAssemblyRegionSize
parameter, if none is provided on the command lineprotected int defaultAssemblyRegionPadding()
defaultAssemblyRegionPadding
in class AssemblyRegionWalker
AssemblyRegionWalker.assemblyRegionPadding
parameter, if none is provided on the command lineprotected int defaultMaxReadsPerAlignmentStart()
defaultMaxReadsPerAlignmentStart
in class AssemblyRegionWalker
AssemblyRegionWalker.maxReadsPerAlignmentStart
parameter, if none is provided on the command lineprotected double defaultActiveProbThreshold()
defaultActiveProbThreshold
in class AssemblyRegionWalker
AssemblyRegionWalker.activeProbThreshold
parameter, if none is provided on the command lineprotected int defaultMaxProbPropagationDistance()
defaultMaxProbPropagationDistance
in class AssemblyRegionWalker
AssemblyRegionWalker.maxProbPropagationDistance
parameter, if none is provided on the command lineprotected boolean includeReadsWithDeletionsInIsActivePileups()
includeReadsWithDeletionsInIsActivePileups
in class AssemblyRegionWalker
public java.util.List<ReadFilter> getDefaultReadFilters()
AssemblyRegionWalker
WellformedReadFilter
filter with all default options, as well as the ReadFilterLibrary.MappedReadFilter
.
Subclasses can override to provide alternative filters.
Note: this method is called before command line parsing begins, and thus before a SAMFileHeader is
available through {link #getHeaderForReads}.getDefaultReadFilters
in class AssemblyRegionWalker
public java.util.List<java.lang.Class<? extends Annotation>> getDefaultVariantAnnotationGroups()
GATKTool
GATKTool.getDefaultVariantAnnotations()
. Returned annotation groups are subject to selective enabling/disabling
by the user via the command line. The default implementation returns an empty list.getDefaultVariantAnnotationGroups
in class GATKTool
public boolean useVariantAnnotations()
GATKTool
Annotation
s in the package defined by org.broadinstitute.hellbender.cmdline.GATKPlugin.GATKAnnotationPluginDescriptor#pluginPackageName
and automatically
generate and add command line arguments allowing the user to specify which annotations or groups of annotations to use.
To specify default annotations for a tool simply specify them using GATKTool.getDefaultVariantAnnotationGroups()
or GATKTool.getDefaultVariantAnnotations()
To access instantiated annotation objects simply use GATKTool.makeVariantAnnotations()
.useVariantAnnotations
in class GATKTool
public java.util.Collection<Annotation> makeVariantAnnotations()
makeVariantAnnotations
in class GATKTool
GATKTool.makeVariantAnnotations()
public AssemblyRegionEvaluator assemblyRegionEvaluator()
assemblyRegionEvaluator
in class AssemblyRegionWalker
public void onTraversalStart()
GATKTool
onTraversalStart
in class GATKTool
public void apply(AssemblyRegion region, ReferenceContext referenceContext, FeatureContext featureContext)
AssemblyRegionWalker
AssemblyRegionWalker.assemblyRegionEvaluator()
. This method will be called once for each active AND inactive region,
and it is up to the implementation how to handle/process active vs. inactive regions.apply
in class AssemblyRegionWalker
region
- region to process (pre-marked as either active or inactive)referenceContext
- reference data overlapping the full extended span of the assembly regionfeatureContext
- features overlapping the full extended span of the assembly regionpublic void closeTool()
GATKTool
GATKTool.doWork()
template method.
It is called regardless of whether the GATKTool.traverse()
has succeeded or not.
It is called after the GATKTool.onTraversalSuccess()
has completed (successfully or not)
but before the GATKTool.doWork()
method returns.
In other words, on successful runs both GATKTool.onTraversalSuccess()
and GATKTool.closeTool()
will be called (in this order) while
on failed runs (when GATKTool.traverse()
causes an exception), only GATKTool.closeTool()
will be called.
The default implementation does nothing.
Subclasses should override this method to close any resources that must be closed regardless of the success of traversal.