All Classes and Interfaces

Class
Description
Class to translate back and forth from absolute long-typed base positions to relative ones (the usual contig, position pairs).
 
Abstract class that coordinates the general task of taking in a set of alignment information, possibly in SAM format, possibly in other formats, and merging that with the set of all reads for which alignment was attempted, stored in an unmapped SAM file.
Abstract class that coordinates the general task of taking in a set of alignment information, possibly in SAM format, possibly in other formats, and merging that with the set of all reads for which alignment was attempted, stored in an unmapped SAM file.
 
AbstractBCICodec<F extends htsjdk.tribble.Feature>
 
Base class for concordance walkers, which process one variant at a time from one or more sources of variants, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
Stores a truth VC in case of a false negative, an eval VC in case of a false positive, or a concordance pair of truth and eval in case of a true positive.
A FeatureSink that buffers and resolves (by merging, or by checking for redundancy) features that occur on the same interval.
 
The position files of Illumina are nearly the same form: Pos files consist of text based tabbed x-y coordinate float pairs, locs files are binary x-y float pairs, clocs are compressed binary x-y float pairs.
Class for parsing text files where each line consists of fields separated by whitespace.
AbstractLocatableCollection<METADATA extends LocatableMetadata,RECORD extends htsjdk.samtools.util.Locatable>
Represents a sequence dictionary, an immutable, coordinate-sorted (with no overlaps allowed) collection of records that extend Locatable (although contigs are assumed to be non-null when writing to file), a set of mandatory column headers given by a TableColumnCollection, and lambdas for reading and writing records.
Abstract class that holds parameters and methods common to classes that perform duplicate detection and/or marking within SAM/BAM/CRAM files.
Little class used to package up a header and an iterable/iterator.
Abstract class that holds parameters and methods common to classes that perform optical duplicate detection.
Read threading graph class intended to contain duplicated code between ReadThreadingGraph and JunctionTreeLinkedDeBruijnGraph.
Edge factory that encapsulates the numPruningSamples assembly parameter
 
Represents AbstractRecordCollection (which can be represented as a SAMFileHeader), an immutable collection of records, a set of mandatory column headers given by a TableColumnCollection, and lambdas for reading and writing records.
AbstractSampleLocatableCollection<RECORD extends htsjdk.samtools.util.Locatable>
Represents a sample name, a sequence dictionary, an immutable, coordinate-sorted (with no overlaps allowed) collection of records that extend Locatable (although contigs are assumed to be non-null when writing to file), a set of mandatory column headers given by a TableColumnCollection, and lambdas for reading and writing records.
Represents a sample name, an immutable collection of records, a set of mandatory column headers given by a TableColumnCollection, and lambdas for reading and writing records.
AbstractWgsMetricsCollector<T extends htsjdk.samtools.util.AbstractRecordAndOffset>
Class for collecting data on reference coverage, base qualities and excluded bases from one AbstractLocusInfo object for CollectWgsMetrics.
Combines multiple Picard QualityYieldMetrics files into a single file.
Combines multiple Variant Calling Metrics files into a single file.
Class holding information about per-base activity scores for assembly region traversal
Captures the probability that a specific locus in the genome represents an "active" site containing real variation.
The type of the value returned by ActivityProfileState.getResultValue()
Given a MultiIntervalShard of GATKRead, iterates over each locus within that shard, and calculates the ActivityProfileState there, using the provided AssemblyRegionEvaluator to determine if each site is active.
An efficient way of representing a set of ActivityProfileStates in an interval; used by Spark.
Store one or more AdapterPairs to use to mark adapter sequence of SAMRecords.
 
 
Trims (hard clips) adapter sequences from read ends.
A utility class for matching reads to adapters.
 
Metropolis MCMC sampler using an adaptive step size that increases / decreases in order to decrease / increase acceptance rate to some desired value.
A tool to add comments to a BAM file header.
 
 
Assigns all the reads in a file to a single new read-group.
Describes the results of the AFCalc. Only the bare essentials are represented here, as all AFCalc models must return meaningful results for all of these fields.
Categorical sample trait for association and analysis. Samples can have unknown status, be affected or unaffected by the categorical trait, or they can be marked as actually having an other trait value (stored in an associated value in the Sample class).
Class for managing aliases and querying Funcotation to determine fields.
Holding necessary information about a local assembly for use in SV discovery.
 
An assembly with its contigs aligned to reference, or a reason that there isn't an assembly.
 
Locally assembled contig: its name, its sequence as produced by the assembler (no reverse complement like in the SAM record if it maps to '-' strand), and its stripped-down alignment information.
After configuration scoring and picking, the original alignments can be classified as good or bad mappings. Good: the ones used in the picked configuration. Bad: unused alignments in the chosen configuration; these likely contain more noise than information. They can be turned into a string representation following the format in AlignmentInterval.toPackedString().
 
Loads various upstream assembly and alignment formats and turns them into the custom AlignedContig format in the discovery stage.
Filter out reads where the alignment does not match the contents of the header.
Bundles together an AlignmentContext and a ReferenceContext.
Bundles together a pileup and a location.
 
Create an iterator for traversing alignment contexts in a specified manner.
Each assembled contig should have at least one such accompanying structure, or 0 when it is unmapped.
 
Steps a single read along its alignment to the genome. The logical model for generating extended events is as follows: the "record state" implements the traversal along the reference; thus stepForwardOnGenome() returns on every and only on actual reference bases.
High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
 
 
 
 
This class is similar to LocationAndAlleles but allows keeping only an allele/ref pair rather than a list of alleles.
Filters out a record if the allele balance for heterozygotes is out of a defined range across all samples.
The purpose of this set of utilities is to downsample a set of reads to remove contamination.
Stratifies the eval RODs by the allele count of the alternate allele. Looks first at the MLEAC value in the INFO field, and uses that value if present.
Filtering haplotypes that contribute weak alleles to the genotyping.
Filtering haplotypes that contribute weak alleles to the genotyping.
Filtering haplotypes that contribute weak alleles to the genotyping.
Helps read and set allele specific filters in the INFO field.
Variant allele fraction for each sample.
 
Segments alternate-allele-fraction data using kernel segmentation.
Given segments and counts of alt and ref reads over a list of het sites, infers the minor-allele fraction of each segment.
Enumerates the parameters for AlleleFractionState.
Represents priors for the allele-fraction model.
Stratifies the eval RODs by the allele frequency of the alternate allele. Either uses a constant 0.005 frequency grid and projects the AF INFO field value onto it, or uses a logit scale from -30 to 30.
 
Allele frequency calculations for the Exac dataset
This tool uses VariantEval to bin variants in Thousand Genomes by allele frequency.
 
Allele frequency utilities that are dataset-agnostic
AlleleLikelihoods<EVIDENCE extends htsjdk.samtools.util.Locatable,A extends htsjdk.variant.variantcontext.Allele>
Evidence-likelihoods container implementation based on integer indexed arrays.
 
AlleleList<A extends htsjdk.variant.variantcontext.Allele>
Minimal interface for random access to a collection of Alleles.
AlleleList.ActualPermutation<A extends htsjdk.variant.variantcontext.Allele>
 
AlleleList.NonPermutation<A extends htsjdk.variant.variantcontext.Allele>
This is the identity permutation.
AlleleListPermutation<A extends htsjdk.variant.variantcontext.Allele>
Marks allele list permutation implementation classes.
Useful when you know the interval and the alleles of interest ahead of the counting.
 
This is a marker interface used to indicate which annotations are allele-specific.
A class to encapsulate the raw data for allele-specific classes compatible with the ReducibleAnnotation interface
Utilities class containing methods for restricting VariantContext and GenotypesContext objects to a reduced set of alleles, as well as for choosing the best set of alleles to keep and for cleaning up annotations and genotypes after subsetting.
Utilities class containing methods for restricting VariantContext and GenotypesContext objects to a reduced set of alleles, as well as for choosing the best set of alleles to keep and for cleaning up annotations and genotypes after subsetting.
Reference and alternate allele counts at a site specified by an interval.
Simple data structure to pass and read/write a List of AllelicCount objects.
Collects reference/alternate allele counts at specified sites.
A super-simplified/stripped-down/faster version of IntervalAlignmentContextIterator that takes a locus iterator and a *single* interval, and returns an AlignmentContext for every locus in the interval.
Created by tsato on 10/11/17.
Writer
Filters out reads that have greater than the threshold number for unknown (N) bases.
Enum to hold the amino acids and their standard codons.
 
 
Evaluate and compare base quality score recalibration tables
Process reads from a saturation mutagenesis experiment.
 
Represents an interval with a set of annotations.
Simple class that just has an interval and sorted name-value pairs.
Reads AnnotatedIntervals from an xsv file (see XsvLocatableTableCodec).
Represents a collection of intervals annotated with CopyNumberAnnotations.
Represents a collection of annotated intervals.
 
Converts an annotated interval representing a segment to a variant context.
 
 
Given identified pair of breakpoints for a simple SV and its supportive evidence, i.e.
Annotates intervals with GC content, and optionally, mappability and segmental-duplication content.
Annotate every variant in a VCF with the depth at that locus in a bam.
Given mixing weights of different samples in a pooled bam, annotate a corresponding vcf containing individual sample genotypes.
An annotation group is a set of annotations that have something in common and should be added at the same time.
Exception thrown when loading gene annotations.
Represents a key for a named, typed annotation.
Represents an immutable ordered collection of named, typed annotations for an interval.
 
Performs singular value decomposition (and pseudoinverse calculation) in pure Java, using Commons Math.
Apply base quality score recalibration
The collection of all arguments needed for ApplyBQSR.
Apply base quality score recalibration with Spark.
 
The collection of those arguments for ApplyBQSR that are not already defined in RecalibrationArgumentCollection.
Apply a score cutoff to filter variants based on a recalibration table
 
 
A simple class to store names and counts for the Control Information fields that are stored in an Illumina GTC file.
Container for the artifact prior probabilities for the read orientation model
 
 
Container class for ArtifactPrior objects.
This enum encapsulates the domain of the discrete latent random variable z
Easy-to-use creator of artificial BAM files for testing. Allows us to make a stream of reads or an indexed BAM file with reads having the following properties: coming from n samples; of fixed read length and aligned to the genome with the M operator; having N reads per alignment start; skipping N bases between each alignment start; starting at a given alignment start.
This fake iterator allows us to look at how specific piles of reads are handled.
 
 
Allele-specific Rank Sum Test of REF versus ALT base quality scores
Allele-specific strand bias estimated using Fisher's Exact Test
Allele-specific likelihood-based test for the inbreeding among samples
Allele-specific Rank Sum Test for mapping qualities of REF versus ALT reads
Allele-specific call confidence normalized by depth of sample reads supporting the allele
Allele-specific implementation of rank sum test annotations
Allele-specific Rank Sum Test for relative positioning of REF versus ALT allele within reads
Allele-specific Root Mean Square of the mapping quality of reads across all samples.
This is a marker interface used to indicate which annotations are "Standard" and allele-specific.
Adds the strand bias table annotation for use in Mutect filters
Allele-specific implementation of strand bias annotations
Allele-specific strand bias estimated by the Symmetric Odds Ratio test
Calculate read counts per allele for allele-specific expression analysis of RNAseq data
 
 
 
Set of arguments for Assembly Based Callers
Created by davidben on 9/8/16.
 
A simple heuristic optimizer based on extensive manual review of alignments produced by the aligner (currently "bwa mem -x intractg") with the aim of picking a configuration that provides "optimal coverage" for the input assembly contig.
 
A wrapper around AlignedContig to represent a mapped assembly contig whose alignments went through AssemblyContigAlignmentsRDDProcessor and may represent SV breakpoints.
 
 
 
Region of the genome that gets assembled by the local assembly engine.
 
Classes that implement this interface have the ability to evaluate how likely it is that a site is "active" (contains potential real variation).
Given an iterator of ActivityProfileState, finds AssemblyRegions.
Given a MultiIntervalShard of GATKRead, iterates over each AssemblyRegion within that shard, using the provided AssemblyRegionEvaluator to determine the boundaries between assembly regions.
 
Helper component to manage active region trimming
An AssemblyRegionWalker is a tool that processes an entire region of reads at a time, each marked as either "active" (containing possible variation) or "inactive" (not likely to contain actual variation).
Encapsulates an AssemblyRegion with its ReferenceContext and FeatureContext.
A Spark version of AssemblyRegionWalker.
Result of assembling, with the resulting graph and status
Status of the assembly result
Collection of read assemblies using several kmerSizes.
A service that can be used to write to a stream using a background thread and an executor service.
Wrapper around a CloseableIterator that reads in a separate thread, for cases in which that might be efficient.
Describes
An AutoCloseable collection that will automatically close all of its elements.
Reference to another object that performs some action when closed.
 
Biallelic-frequency of a sample at some locus.
Codec to handle BafEvidence in BlockCompressedInterval files
Codec to handle BafEvidence in tab-delimited text files
Imposes additional ordering of same-locus BafEvidence by sample.
 
Designs baits for hybrid selection!
Set of possible design strategies for bait design.
Command line program to print statistics from a BAM index (.bai) file. Statistics include count of aligned and unaligned reads for each reference sequence and a count of all records with no start coordinate.
Converts a BAM file into a BFQ (binary fastq formatted) file.
Deprecated.
A band pass filtering version of the activity profile. Applies a band pass filter with a Gaussian kernel to the input state probabilities to smooth them over an interval.
 
 
 
These are features that only the walker can override.
A class for finding the distance between multiple (matched) barcodes and multiple barcode reads.
BarcodeExtractor is used to match barcodes and collect barcode match metrics.
Utility class to hang onto data about the best match for a given barcode
Created by jcarey on 3/13/14.
Reads a single barcode file line by line and returns the barcode if there was a match or NULL otherwise.
Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
 
An interface that can take a collection of bases (provided as SamLocusIterator.RecordAndOffset and SamLocusAndReferenceIterator.SAMLocusAndReference) and generate an ErrorMetric from them.
Tools that process sequencing machine data, e.g.
BasecallsConverter utilizes an underlying IlluminaDataProvider to convert parsed and decoded sequencing data from standard Illumina formats to specific output records (FASTA records/SAM records).
Interface that defines a converter that takes ClusterData and returns OUTPUT_RECORD type objects.
Interface that defines a writer that will write out OUTPUT_RECORD type objects.
BasecallsConverterBuilder creates and configures BasecallsConverter objects.
 
 
Simple edge class for connecting nodes in the graph.
An interface and implementations for classes that apply a RecordAndOffsetStratifier to put bases into various "bins" and then compute an ErrorMetric on these bases using a BaseErrorCalculator.
 
An error metric for the errors in bases.
 
BaseGraph<V extends BaseVertex,E extends BaseEdge>
Common code for graphs used for local assembly.
Parses various formats and versions of Illumina Basecall files, and uses them to populate ClusterData objects.
Collection of baseline copy-number states.
Median base quality of bases supporting each allele.
Clips reads on both ends using base quality scores
 
 
Rank Sum Test of REF versus ALT base quality scores
 
 
Reference window function for BQSR.
First pass of the base quality score recalibration.
Spark version of the first pass of the base quality score recalibration.
 
BaseUtils contains some basic utilities for manipulating nucleotides.
 
 
 
A graph vertex that holds some sequence information
TextFileParser which reads a single text file.
A source of reference base calls.
 
 
 
 
Created by jcarey on 3/14/14.
A class that implements the IlluminaData interfaces provided by this parser. One BclData object is returned to IlluminaDataProvider per cluster, and each first-level array in bases and qualities represents a single read in that cluster.
 
 
Annoyingly, there are two different files with extension .bci in NextSeq output.
Describes a mechanism for revising and evaluating qualities read from a BCL file.
BCL Files are base call and quality score binary files containing a (base,quality) pair for successive clusters.
Implementation of the GATKRead interface for the AlignmentRecord class.
 
For an aligner that aligns each end independently, select the alignment for each end with the best MAPQ, and make that the primary.
For an aligner that aligns each end independently, select the alignment for each end with the best MAPQ, and make that the primary.
This strategy was designed for TopHat output, but could be of general utility.
This strategy was designed for TopHat output, but could be of general utility.
 
Beta-binomial using the Apache Math3 Framework.
 
 
 
Utility class for dealing with BigQuery connections, tables, queries, etc.
Abstract base class for readers of table with records stored in binary.
Abstract file writing class for record tables stored in binary format.
CNV defragmenter for when the intervals used for coverage collection are available.
 
 
 
BlockCompressedIntervalStream.Reader<T extends htsjdk.tribble.Feature>
 
BlockCompressedIntervalStream.WriteFunc<F extends htsjdk.tribble.Feature>
 
BlockCompressedIntervalStream.Writer<F extends htsjdk.tribble.Feature>
 
A simple program to convert an Illumina bpm (bead pool manifest file) into a normalization manifest (bpm.csv) file. The normalization manifest (bpm.csv) is a simple text file generated by Illumina tools; it has a specific format and is used by ZCall.
The full BQSR pipeline in one tool to run on Spark.
 
 
 
 
 
 
 
A helper struct for annotating complications that make the locations represented by its associated NovelAdjacencyAndAltHaplotype a little ambiguous.
For novel adjacency between reference locations that are on the same chromosome, and with a strand switch.
 
For this specific complication, we support what could be defined as an incomplete picture that involves inverted duplication: two overlapping alignments to the reference (first alignment: -------------------->, second alignment: <---------------------), with the second alignment divided into |--------||----------| Seg.1 and Seg.2. At least Seg.1 is invert duplicated, and Seg.2 is inverted trans-inserted between the two copies (one of which is inverted).
 
For simple deletion, insertion, and replacement (del and ins at the same time).
 
For duplications small enough that we seemingly have assembled across the whole event.
This is for dealing with the case when the duplicated range could NOT be inferred exactly, but only from a simple optimization scheme.
 
 
 
A class that acts as a filter for breakpoint evidence.
Various types of read anomalies that provide evidence of genomic breakpoints.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
A class to examine a stream of BreakpointEvidence, and group it into Intervals.
Based on the alignment signature of the input simple chimera, and the evidence contig having the chimera, infers: the exact position of breakpoints following the left-aligning convention; the alt haplotype sequence based on the given contig sequence; and complications such as homology, inserted sequence and duplicated ref region, if any.
Utilities for dealing with Google buckets.
A class to represent an 'Extended' Illumina Manifest file.
A class to represent a record (line) from an Extended Illumina Manifest [Assay] entry
 
 
Command line program to generate a BAM index (.bai) file from a BAM (.bam) file
The "bunny" log format: =[**]= START, =[**]= STEPEND, =[**]= END. The functions here create an id for you and keep track of it, and format the various strings, sending them to a logger if you provided one.
Runs BWA and MarkDuplicates on Spark.
A collection of the arguments that are used for BWA.
Utils to move data from a BwaMemAlignment into a GATKRead, or into a SAM tag.
Manage a global collection of BwaMemIndex instances.
Create a BWA-MEM index image file for use with GATK BWA tools
 
The BwaSparkEngine provides a simple interface for transforming a JavaRDD in which the reads are paired and unaligned, into a JavaRDD of aligned reads, and does so lazily.
Takes a VCFFileReader and an IntervalList and provides a single iterator over all variants in all the intervals.
Trivial adapter class allowing a primitive byte[] array to be accessed using the java.util.Iterator interface
A caching version of the IndexedFastaSequenceFile that avoids going to disk as often as the raw indexer.
(Internal) Collects read metrics relevant to structural variant discovery
 
Calculates the fraction of reads coming from cross-sample contamination, given results from GetPileupSummaries.
Calculates various metrics on a sample fingerprint, indicating whether the fingerprint satisfies the assumptions we have.
Calculate genotype posterior probabilities given family and/or known population genotypes
Given a VCF of known variants from multiple samples, calculate how much each sample contributes to a pooled BAM.
 
Estimates the parameters for the DRAGstr model for an input sample.
Calls copy-ratio segments as amplified, deleted, or copy-number neutral.
 
 
 
Carries the result of a call to #assignGenotypeLikelihoods
 
Represents a CBS-style segmentation to enable IGV-compatible plotting.
Collects variants and generates metrics about them.
 
Class for collapsing a collection of similar SVCallRecord objects, such as clusters produced by CanonicalSVLinkage, into a single representative call.
Define strategies for collapsing alt alleles with different subtypes.
Define strategies for collapsing variant intervals.
Main class for SV clustering.
 
Stream output captured from a stream.
Stream output captured from a streaming stream.
A read name encoder conforming to the standard described by Illumina Casava 1.8.
This class provides the data structure for cbcls.
CBCL header layout: bytes 0-1, version number (current version is 1), unsigned 16-bit little-endian integer; bytes 2-5, header size, unsigned 32-bit little-endian integer; byte 6, number of bits per basecall, unsigned; byte 7, number of bits per q-score, unsigned.
ChainPruner<V extends BaseVertex,E extends BaseEdge>
 
 
 
Checks the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF) against a set of known genotypes in the supplied genotype file (in VCF format).
Program to check a lane of an Illumina output directory.
Compare GATK's internal pileup to a reference Samtools pileup
Check a BAM/VCF for compatibility against specified references.
TableWriter to format and write the table output.
Simple class to check the terminator block of a SAM file.
 
Counts and frequency of alleles in called genotypes
This class allows code that manipulates cigars to do so naively by handling complications such as merging consecutive identical operators within the builder.
 
 
Implementation of a circular byte buffer that uses a large byte[] internally and supports basic read/write operations from/to other byte[]s passed as arguments.
Utilities for dealing with reflection.
 
FuncotationFilter matching variants which: occur on a gene in the American College of Medical Genomics (ACMG)'s list of clinically-significant variants; have been labeled by ClinVar as pathogenic or likely pathogenic; and have a max MAF of 5% across sub-populations of ExAC or gnomAD.
Represents a clip on a read.
Rank Sum Test for hard-clipped bases on REF versus ALT reads
How should we represent clipped bases in a read?
Utilities to clip the adapter sequence from a SAMRecord read
Read clipping based on quality, position or sequence matching.
 
The clocs file format is one of 3 Illumina formats (pos, locs, and clocs) that store position data exclusively.
An Iterator that automatically closes a resource when the end of the iteration is reached.
Efficiently clusters a set of evaluation ("eval") SVs with their closest truth SVs.
Output container for an evaluation record and its closest truth record.
Summary
Store the information from Illumina files for a single cluster with one or more reads.
Converts ClusterData provided by an IlluminaDataProvider into one or two SAMRecords, as appropriate, optionally marking adapter sequence.
A metric class to hold the result of ClusterCrosscheckMetrics fingerprints.
 
Stores clustering parameters for different combinations of supporting algorithm types (depth-only/depth-only, depth-only/PESR, and PESR/PESR)
Annotate a VCF with scores from a Convolutional Neural Network (CNN).
Train a Convolutional Neural Network (CNN) for filtering variants.
Write variant tensors for training a Convolutional Neural Network (CNN) for filtering variants.
 
Clustering engine class for defragmenting depth-based DEL/DUP calls, such as those produced by GermlineCNVCaller.
A command line tool to read a BAM file and produce standard alignment metrics that would be applicable to any alignment.
 
Collects reference and alternate allele counts at specified sites.
Collects summary and per-sample metrics about variant calls in a VCF file.
 
 
 
 
Collects base distribution per cycle in SAM/BAM/CRAM file(s).
Collect DuplicateMark'ing metrics from an input file that was already Duplicate-Marked.
At each genomic locus, count the number of F1R2/F2R1 alt reads.
 
Tool to collect information about GC bias in the reads in a given BAM file.
Collect metrics regarding the reason for reads (sequenced by HiSeqX) not passing the Illumina PF Filter.
A metric class for describing PF-failing reads from an Illumina HiSeqX lane
Metrics produced by the GetHiSeqXPFFailMetrics program.
 
 
This tool takes a SAM/BAM file input and collects metrics that are specific for sequence datasets generated through hybrid-selection.
A command line tool to collect Illumina Basecalling metrics for a sequencing run. Requires a Lane and an input file of Barcodes to expect.
Utility for collating Tile records from the Illumina TileMetrics file into lane-level and phasing-level metrics.
A CLP that, given a BAM and a VCF with genotypes of the same sample, estimates the rate of independent replication of reads within the bam.
Command line program to read non-duplicate insert sizes, create a Histogram and report distribution statistics.
Collects insert size distribution information in alignment data.
Command-line program to compute metrics about outward-facing pairs, inward-facing pairs, and chimeras in a jumping library.
Class that is designed to instantiate and execute multiple metrics programs that extend SinglePassSamProgram while making only a single pass through the SAM file and supplying each program with the records as it goes.
 
 
Runs multiple metrics collection modules for a given alignment file.
 
 
Class for trying to quantify the CpCG->CpCA error rate.
Metrics class for outputs.
Command line program to calculate quality yield metrics
A set of metrics used to describe the general quality of a BAM file
 
 
Collects quality yield metrics in SAM/BAM/CRAM file(s).
Computes a number of metrics that are useful for evaluating coverage and performance of whole genome sequencing experiments, same implementation as CollectWgsMetrics, with different defaults: lacks baseQ and mappingQ filters and has much higher coverage cap.
 
Collects read counts at specified intervals.
 
 
Calculates and reports QC metrics for RRBS data based on the methylation status at individual C/G bases as well as CpG sites across all reads in the input BAM/SAM file.
 
Program to collect error metrics on bases stratified in various ways.
Quantify substitution errors caused by mismatched base pairings during various stages of sample / library prep.
Creates discordant read pair, split read evidence, site depth, and read depth files for use in the GATK-SV pipeline.
 
Both CollectTargetedPCRMetrics and CollectHsSelection share virtually identical program structures except for the name of their targeting mechanisms (e.g.
This tool calculates a set of PCR-related metrics from an aligned SAM or BAM file containing targeted sequencing data.
Collects summary and per-sample metrics about variant calls in a VCF file.
A collection of metrics relating to snps and indels within a variant-calling file (VCF) for a given sample.
A collection of metrics relating to snps and indels within a variant-calling file (VCF).
Computes a number of metrics that are useful for evaluating coverage and performance of whole genome sequencing experiments.
 
 
 
Metrics for evaluating the performance of whole genome sequencing experiments.
 
A simple program to combine multiple genotyping array VCFs into one VCF. The input VCFs must have the same sequence dictionary and same list of variant loci.
Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file
 
Adapter shim/alternate GATK entry point for use by GATK tests to run tools in command line argument validation mode.
Main class to be used as an alternative entry point to org.broadinstitute.hellbender.Main for performing command line validation only rather than executing the tool.
Embodies defaults for global values that affect how the Picard Command Line operates.
Abstract class to facilitate writing command-line programs.
Abstract class to facilitate writing command-line programs.
A shim to make use of try-with-resources for tool shutdown
Class for handling translation of Picard-style command line argument syntax to POSIX-style argument syntax; used for running tests written with Picard style syntax against the Barclay command line parser.
Split a collection of middle nodes in a graph into their shared prefix and suffix values. This code performs the following transformation.
Compares the base qualities of two SAM/BAM/CRAM files.
Determine if two potentially identical BAMs have the same duplicate reads.
A simple tool to compare two Illumina GTC files.
 
CompareMatrix contains a square matrix of linear dimension QualityUtils.MAX_SAM_QUAL_SCORE.
Compare two metrics files.
 
Display reference comparison as a tab-delimited table and summarize reference differences.
 
TableWriter to format and write the table output.
TableWriter to format and write SNP table output.
 
Rudimentary SAM comparer.
Required stratification grouping output by each comp
A Spark Partitioner that puts tasks with greater complexities into earlier partitions.
This tool looks for low-complexity STR sequences along the reference that are later used to estimate the Dragstr model during single-sample auto-calibration (see CalibrateDragstrModel).
Class to produce multiple Funcotator outputs at the same time.
 
A class to represent data as a list of <value,count> pairs.
Evaluate site-level concordance of an input VCF against a truth VCF.
Created by davidben on 3/2/17.
Created by tsato on 2/8/17.
 
 
Combines adjacent intervals in DepthEvidence files.
A singleton class to act as a user interface for loading configuration files from org.aeonbits.owner.
Keep reads that do NOT contain one or more kmers from a set of SVKmerShorts
Wrapper for ContainsKmerReadFilter to avoid serializing the kmer filter in Spark
 
This is the probabilistic contamination model that we use to distinguish homs from hets. The model is similar to that of ContEst, in that it assumes that each contaminant read is independently drawn from the population.
Created by David Benjamin on 2/13/17.
 
 
Stratifies the evaluation by each contig in the reference sequence.
 
 
 
This class scans the chimeric alignments of an input AlignedContig, filters out the alignments that offer weak evidence for a breakpoint, and makes an interpretation based on the SimpleChimera extracted.
 
This is a troubleshooting utility that converts a headerless BAM shard (e.g., a part-r-00000.bam, part-r-00001.bam, etc.), produced by a Spark tool with --sharded-output set to true, into a readable BAM file by adding a header and a BGZF terminator.
 
 
 
 
A record containing the integer copy-number posterior distribution for a single interval.
Collection of integer copy-number posteriors.
Tools that analyze read coverage to detect copy number variants
 
 
 
Segments copy-ratio data using kernel segmentation.
Represents a segmented model for copy ratio fit to denoised log2 copy-ratio data.
Enumerates the parameters for CopyRatioState.
 
 
Factory for creating Funcotations by handling a SQLite database containing information from COSMIC.
Count and print to standard output (and optionally to a file) the total number of bases in a SAM/BAM/CRAM file
Counts the number of times each base occurs in a reference, and prints the counts to standard output (and optionally to a file).
Calculate the overall number of bases in a SAM/BAM/CRAM file
Class for managing a list of Counters of integer, provides methods to access data from Counters with respect to an offset.
Count variants which were not filtered in a VCF.
Counting filter that discards reads that are unaligned or aligned with MQ==0 and whose 5' ends look like adapter sequence
Counting filter that discards reads that have been marked as duplicates.
A SamRecordFilter that counts the number of bases in the reads which it filters out.
Counting filter that discards reads below a configurable mapping quality threshold.
Counting filter that discards reads that are unpaired in sequencing and paired reads whose mates are not mapped.
Wrapper/adapter for ReadFilter that counts the number of reads filtered, and provides a filter count summary.
Private class for Counting AND filters
Wrapper/adapter for VariantFilter that counts the number of variants filtered, and provides a filter count summary.
Private class for Counting AND filters
Apply a read-based annotation that reports the number of Ns seen at a given site.
Count and print to standard output (and optionally to a file) the total number of reads in a SAM/BAM/CRAM file.
Calculate the overall number of reads in a SAM/BAM file
Count variant records in a VCF file, regardless of filter status.
 
 
The Covariate interface.
 
Total depth of coverage per sample and over all samples.
Tools that count coverage, e.g.
This is a class for managing the output formatting/files for DepthOfCoverage.
 
Represents total coverage over each contig in an ordered set associated with a named sample.
Represents a sequence dictionary and total coverage over each contig in an ordered set associated with a cohort of named samples.
CpG is a stratification module for VariantEval that divides the input data by within/not within a CpG site
This struct contains two key pieces of information that provide interpretation of the event:
 
One of the two fundamental classes (the other is CpxVariantCanonicalRepresentation) for complex variant interpretation and alt haplotype extraction.
 
This deals with the special case where a contig has multiple (> 2) alignments and seemingly has the complete alt haplotype assembled.
(Internal) Tries to extract simple variants from a provided GATK-SV CPX.vcf
A simple program to create a standard picard metrics file from the output of bafRegress
Create an Extended Illumina Manifest by performing a liftover to Build 37.
Create a Hadoop BAM splitting index and optionally a BAM index from a BAM file.
Creates a panel of normals (PoN) for read-count denoising given the read counts for samples in the panel.
Create a SAM/BAM file from a fasta containing reference sequence.
 
Create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2.
A simple program to create a standard picard metrics file from the output of VerifyIDIntensity
Checks that all data in the set of input files appear to come from the same individual.
A class to hold the result of crosschecking fingerprints.
The data type.
 
Deprecated.
6/6/2017 Use CrosscheckFingerprints instead.
 
Converts a given string into a Boolean after trimming whitespace from that string.
Produces custom MAF fields (e.g.
The Cycle covariate.
Flow Annotation: cycle skip status: cycle-skip, possible-cycle-skip, non-skip
Interface for tagging any class that represents a collection of datasets required to update posterior samples for Markov-Chain Monte Carlo sampling using samplers implementing the ParameterSampler interface.
Table data-line string array wrapper.
 
An abstract class to allow for the creation of a Funcotation for a given data source.
Utilities for reading / working with / manipulating Data Sources.
 
 
Utility class to use with DbSnp files to determine if a locus is a dbSnp site.
Utility class to use with DbSnp files to determine if a locus is a dbSnp site.
Little tuple class to contain one bitset for SNPs and another for Indels.
Little tuple class to contain one bitset for SNPs and another for Indels.
Enum to hold the possible types of dbSnps.
Enumerates individual deciles.
Represents a set of deciles.
Default GATKReadFilterArgumentCollection applied in GATK for optional read filters in the command line.
Arguments for requesting VariantContext annotations to be processed by VariantAnnotatorEngine for tools that process variants objects.
Experimental stratification by the degeneracy of an amino acid, according to VCF annotation.
Iterate through a delimited text file in which columns are found by looking at a header line rather than by position.
Denoises read counts to produce denoised copy ratios.
When a tool is removed from GATK (after having been tagged with @DeprecatedFeature for a suitable period), an entry should be added to this list to issue a message when the user tries to run that tool.
Read counts for an indefinite number of samples on some interval.
Codec to handle DepthEvidence in BlockCompressedInterval files
Codec to handle DepthEvidence in tab-delimited text files
Merges records for the same interval into a single record, when possible, throws if not possible.
Filters out a record if all variant samples have depth lower than the given value.
Assess sequence coverage by a wide array of metrics, partitioned by sample, read group, or library
A class helper for storing running intervalPartition data.
A class for storing summarized coverage statistics for DepthOfCoverage.
Holds histograms of alt depth=1 sites for reference contexts.
Depth of coverage of each allele per sample
Depth of informative coverage for each sample.
Determines the integer ploidy state of all contigs for germline samples given counts data.
 
Tools that collect sequencing quality-related and comparative metrics
 
A genotype produced by one of the concrete implementations of AbstractAlleleCaller.
Simple enum to represent the three possible combinations of major/major, major/minor and minor/minor haplotypes for a diploid individual.
The Dirichlet distribution is a distribution on multinomial distributions: if pi is a vector of positive multinomial weights such that sum_i pi[i] = 1, the Dirichlet pdf is P(pi) = [Gamma(sum_i alpha[i]) / prod_i Gamma(alpha[i])] * prod_i pi[i]^(alpha[i] - 1). The vector alpha comprises the sufficient statistics for the Dirichlet distribution.
Documents evidence of a too-close or too-far-apart read pair.
Codec to handle DiscordantPairEvidence in BlockCompressedInterval files
Codec to handle DiscordantPairEvidence in tab-delimited text files
(Internal) Examines aligned contigs from local assemblies and calls structural variants
Disk-based implementation of ReadEndsForMarkDuplicatesMap.
 
Models a single output file in the DoC walker.
 
 
 
Classes annotated with this annotation are NOT intended or designed to be extended and should be treated as final.
User argument to specify a sequence of doubles with 3 values in the format "start:step:limit".
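A minimal sketch of how a "start:step:limit" specification could be expanded into a list of doubles. The class and method names are hypothetical and not taken from the GATK source. Treating the limit as inclusive and requiring a positive step are assumptions made here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the GATK class: expands "start:step:limit" into values.
final class DoubleSequenceSketch {

    static List<Double> expand(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected start:step:limit, got: " + spec);
        }
        double start = Double.parseDouble(parts[0].trim());
        double step = Double.parseDouble(parts[1].trim());
        double limit = Double.parseDouble(parts[2].trim());
        if (!Double.isFinite(start) || !Double.isFinite(step) || !Double.isFinite(limit)) {
            throw new IllegalArgumentException("start, step and limit must be finite");
        }
        if (step <= 0.0) {
            throw new IllegalArgumentException("step must be positive (assumption of this sketch)");
        }
        List<Double> values = new ArrayList<>();
        // Compute start + i * step rather than accumulating, to limit rounding drift.
        // Assumption: the limit itself is included when it is reached exactly.
        for (int i = 0; start + i * step <= limit; i++) {
            values.add(start + i * step);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand("0:0.5:2"));
    }
}
```

Computing each value from the index avoids the slow error build-up that repeated addition of the step would introduce.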
A simple shard implementation intended to be used for splitting reads by partition in Spark tools
Given a bam grouped by the same unique molecular identifier (UMI), this tool drops a specified fraction of duplicate sets and returns a new bam.
The basic downsampler API, with no reads-specific operations.
Summary
Type of downsampling method to invoke.
Describes the method for downsampling reads at a given locus.
This is the DRAGEN-GATK genotyper model.
Read transformer intended to replicate DRAGEN behavior for handling mapping qualities.
 
Holds information about a locus on the reference that might be used to estimate the DRAGstr model parameters.
 
Represents the DRAGstr model fitting relevant stats at a given locus on the genome for the target sample.
Collection of Dragstr Locus cases constrained to a particular period and (minimum) repeat-length
 
 
Pair-HMM score imputator based on the DRAGstr model parameters.
Holds the values of the DRAGstr model parameters for different combinations of repeat unit length (period) and number of repeat units.
Partial mutable collection of Dragstr Parameters used to compose the final immutable DragstrParams.
Utils to read and write DragstrParams instances from and to files and other resources.
Utility to find short-tandem-repeats on read sequences.
Tool to figure out the period and repeat-length (in units) of STRs in a reference sequence.
 
 
A walker that processes duplicate reads that share the same unique molecular identifier (UMI) as a single unit.
Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
Factory class that creates either regular or flow-based duplication metrics.
Masks read bases and base qualities using the symmetric DUST algorithm
When it is necessary to pick a primary alignment from a group of alignments for a read, pick the one that maps the earliest base in the read.
When it is necessary to pick a primary alignment from a group of alignments for a read, pick the one that maps the earliest base in the read.
 
 
Dummy class representing a mated read fragment at a particular start position to be used for accounting when deciding whether to duplicate unmatched fragments.
 
Codec to decode data in GTF format from ENSEMBL.
Created by farjoun on 6/26/18.
 
Summary metrics produced by CollectSequencingArtifactMetrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution.
Errors in Mutect2 fall into three major categories -- technical artifacts that depend on (usually hidden) features and do not follow the independent reads assumption of the somatic likelihoods model, non-somatic variants such as germline mutations and contamination, and sequencing errors that are captured by the base qualities and the somatic likelihoods model.
Attempts to estimate library complexity from sequence alone.
Required stratification grouping output by each eval
Compare INFO field values between two VCFs or compare two different INFO fields from one VCF.
 
Extract simple VariantContext events from a single haplotype
 
This class holds information about pairs of intervals on the reference that are connected by one or more BreakpointEvidence objects that have distal targets.
 
This class is responsible for iterating over a collection of BreakpointEvidence to find clusters of evidence with distal targets (discordant read pairs or split reads) that agree in their location and target intervals and strands.
Example/toy program that shows how to implement the AssemblyRegionWalker interface.
Example/toy program that shows how to implement the AssemblyRegionWalker interface.
Example Spark tool for collecting multi-level metrics.
Example Spark tool for collecting example single-level metrics.
Example/toy program that shows how to implement the FeatureWalker interface.
Example/toy program that shows how to implement the IntervalWalker interface.
Example/toy program that shows how to implement the IntervalWalker interface.
Example/toy program that shows how to implement the LocusWalker interface.
Example/toy program that shows how to implement the LocusWalker interface.
Example subclass that shows how to use the MultiFeatureWalker class.
An example multi-level metrics collector that just counts the number of reads (per unit/level)
Example argument collection for multi-level metrics.
Example multi-level metrics collector for illustrating how to collect metrics on specified accumulation levels.
Example implementation of a multi-level Spark metrics collector.
Example/toy program that prints reads from the provided file or files with corresponding reference bases (if a reference is provided).
Example/toy ReadWalker program that uses a Python script.
Program group for Example programs
Example/toy program that prints reads from the provided file or files with corresponding reference bases (if a reference is provided).
Example/toy program that prints reads from the provided file or files with corresponding reference bases (if a reference is provided).
Example/toy program that prints reads from the provided file or files along with overlapping variants (if a source of variants is provided).
Example/toy program that prints reads from the provided file or files along with overlapping variants (if a source of variants is provided).
Counts the number of times each reference context is seen as well as how many times it's overlapped by reads and variants.
 
Argument collection for example single-level metrics.
ExampleSingleMetricsCollector for Spark.
Example ReadWalker program that uses a Python streaming executor to stream summary data from a BAM input file to a Python process through an asynchronous stream writer.
This walker makes two traversals through variants in a vcf.
Example/toy program that shows how to implement the VariantWalker interface.
Example/toy program that shows how to implement the VariantWalker interface.
Phred-scaled p-value for exact test of excess heterozygosity.
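As a hedged illustration of what "Phred-scaled" means for a p-value: a probability p is conventionally reported as Q = -10 * log10(p), so smaller p-values give larger scores. The class and method names below are hypothetical and do not come from the GATK source. The floor applied to p is an assumption made here to keep the score finite.

```java
// Hypothetical helper, not a GATK class: converts a probability to the Phred scale.
final class PhredScaleSketch {

    // Assumed floor so that p == 0 does not produce an infinite score.
    static final double MIN_P = Double.MIN_NORMAL;

    static double toPhred(double p) {
        if (Double.isNaN(p) || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1], got: " + p);
        }
        double clamped = Math.max(p, MIN_P);
        // Adding 0.0 turns the -0.0 produced for p == 1 into +0.0.
        return -10.0 * Math.log10(clamped) + 0.0;
    }

    public static void main(String[] args) {
        System.out.println(toPhred(0.001));
    }
}
```

Under this convention a p-value of 0.001 corresponds to a score of about 30, and a p-value of 1 corresponds to 0.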
Filter out reads where the number of soft-/hard-clipped bases on either end is above a certain threshold.
Created by davidben on 11/30/15.
 
Program to create a fingerprint for the contaminating sample when the level of contamination is both known and uniform in the genome.
Determine the barcode for each read in an Illumina lane.
Extracts barcodes and accumulates metrics for an entire tile.
Subsets reads by name (basically a parallel version of "grep -f", or "grep -vf")
Simple command line program that allows sub-sequences represented by an interval list to be extracted from a reference sequence file.
(Internal) Extracts evidence of structural variations from reads
Extracts site-level variant annotations, labels, and other metadata from a VCF file to HDF5 files.
 
Created by tsato on 3/14/18.
 
Created by tsato on 2/14/17.
Stratifies the eval RODs by each family in the eval ROD, as described by the pedigree.
Utility to compute genotype posteriors given family priors.
Generate an alternative reference sequence over the specified interval
Create a subset of a FASTA reference sequence
Converts a FASTQ file to an unaligned BAM or SAM file.
Class represents fast algorithm for collecting data from AbstractLocusInfo with a list of aligned EdgingRecordAndOffset objects.
Wrapper around FeatureManager that presents Feature data from a particular interval to a client tool without improperly exposing engine internals.
FeatureDataSource<T extends htsjdk.tribble.Feature>
Enables traversals and queries over sources of Features, which are metadata associated with a location on the genome in a format supported by our file parsing framework, Tribble.
FeatureInput<T extends htsjdk.tribble.Feature>
Class to represent a Feature-containing input file.
Handles discovery of available codecs and Feature arguments, file format detection and codec selection, and creation/management/querying of FeatureDataSources for each source of Features.
 
FeatureOutputCodec<F extends htsjdk.tribble.Feature,S extends FeatureSink<F>>
A FeatureOutputCodec can encode Features into some type of FeatureSink.
This class knows about all FeatureOutputCodec implementations, and allows you to find an appropriate codec to create a given file type.
FeatureOutputStream<F extends htsjdk.tribble.Feature>
Class for output streams that encode Tribble Features.
FeatureSink<F extends htsjdk.tribble.Feature>
 
FeatureWalker<F extends htsjdk.tribble.Feature>
A FeatureWalker is a tool that processes a Feature at a time from a source of Features, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
For each sample and for each allele, a list of feature vectors of supporting reads. In order to reduce the number of delimiter characters, we flatten featurized reads.
LocalAssemblyHandler that uses FermiLite.
Summary
 
Stratifies by the FILTER status (PASS, FAIL) of the eval records
Filter false positive alignment artifacts from a VCF callset.
 
Iterator that dynamically applies filter strings to VariantContext records supplied by an underlying iterator.
 
Created by jcarey on 3/13/14.
Illumina uses an algorithm described in "Theory of RTA" that determines whether or not a cluster passes filter ("PF").
Filter variants based on clinically-significant Funcotations.
The allele frequency data source that was used when Funcotating the input VCF.
The version of the Human Genome reference which was used when Funcotating the input VCF.
 
 
Helper class used on the final pass of FilterMutectCalls to record total expected true positives, false positives, and false negatives, as well as false positives and false negatives attributable to each filter
Given specified intervals, annotated intervals output by AnnotateIntervals, and/or counts output by CollectReadCounts, outputs a filtered Picard interval list.
Filter variants in a Mutect2 VCF callset.
Summary
 
 
Stratifies by the FILTER type(s) for each line, with PASS used for passing
Apply tranche filtering to VCF based on scores from an annotation in the INFO field.
Applies a set of hard filters to Variants and to Genotypes within a VCF.
Find assembly regions from reads in a distributed Spark setting.
Identifies sequences that occur at high frequency in a reference
(Internal) Produces local assemblies of genomic regions that may harbor structural variants
 
 
Summary
class to represent a genetic fingerprint as a set of HaplotypeProbabilities objects that give the relative probabilities of each of the possible haplotypes at a locus.
Major class that coordinates the activities involved in comparing genetic fingerprint data whether the source is from a genotyping platform or derived from sequence data.
class to hold the details of an element of fingerprinting PU tag
Detailed metrics about an individual SNP/Haplotype comparison within a fingerprint comparison.
Summary fingerprinting metrics and statistics about the comparison of the sequence data from a single read group (lane or index within a lane) vs.
Class for holding metrics on a single fingerprint.
Class that is used to represent the results of comparing a read group within a SAM file, or a sample within a VCF against one or more set of fingerprint genotypes.
A set of utilities used in the fingerprinting environment
A class that holds VariantContexts sorted by genomic position
Implements the Fisher's exact test for 2x2 tables assuming the null hypothesis of odds ratio of 1.
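A minimal sketch of a two-sided Fisher's exact test for a 2x2 table under the null hypothesis of an odds ratio of 1. This is not the GATK implementation. The class name, the "sum all tables no more probable than the observed one" definition of two-sidedness, and the tie tolerance are assumptions chosen for illustration.

```java
// Hypothetical sketch, not the GATK class: two-sided Fisher's exact test on
// the table [[a, b], [c, d]] using hypergeometric probabilities in log space.
final class FisherExactSketch {

    // Assumed tolerance for treating two table probabilities as tied.
    private static final double TIE_TOLERANCE = 1e-7;

    static double twoSidedPValue(int a, int b, int c, int d) {
        if (a < 0 || b < 0 || c < 0 || d < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        int row1 = a + b, row2 = c + d, col1 = a + c, n = row1 + row2;

        // logFactorial[i] = log(i!), built by summing logs.
        double[] logFactorial = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFactorial[i] = logFactorial[i - 1] + Math.log(i);
        }

        double logObserved = logProbability(a, row1, row2, col1, logFactorial);
        int lowest = Math.max(0, col1 - row2);   // smallest feasible top-left cell
        int highest = Math.min(row1, col1);      // largest feasible top-left cell

        double p = 0.0;
        for (int k = lowest; k <= highest; k++) {
            double lp = logProbability(k, row1, row2, col1, logFactorial);
            if (lp <= logObserved + TIE_TOLERANCE) {
                p += Math.exp(lp);
            }
        }
        return Math.min(1.0, p);
    }

    // log of C(row1, k) * C(row2, col1 - k) / C(row1 + row2, col1)
    private static double logProbability(int k, int row1, int row2, int col1, double[] lf) {
        return logChoose(row1, k, lf) + logChoose(row2, col1 - k, lf)
                - logChoose(row1 + row2, col1, lf);
    }

    private static double logChoose(int n, int k, double[] lf) {
        return lf[n] - lf[k] - lf[n - k];
    }

    public static void main(String[] args) {
        System.out.println(twoSidedPValue(3, 1, 1, 3));
    }
}
```

For the table [[3, 1], [1, 3]] the feasible tables have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value under this definition is 34/70, roughly 0.486.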
Strand bias estimated using Fisher's Exact Test
Filters records based on the phred scaled p-value from the Fisher Strand test stored in the FS attribute.
Summary
 
Tool for replacing or fixing up a VCF header.
Accumulate flag statistics given a BAM file, e.g.
 
Spark tool to accumulate flag statistics given a BAM file, e.g.
Simple struct container class for the 5'/3' flank settings.
A little shim that lets you implement a mapPartitions operation (which takes an iterator over all items in the partition, and returns an iterator over all items to which they are mapped) in terms of a flatMap function (which takes a single input item, and returns an iterator over any number of output items).
Base class for flow based annotations. Some flow based annotations depend on the results from other annotations, regardless if they were called for by user arguments.
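The idea of expressing a per-partition mapping through a per-item flatMap function can be sketched with a lazy iterator adapter. The class below is a hypothetical illustration that uses only the Java standard library. It is not the GATK class and does not depend on Spark.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch: wraps an iterator over a partition's items and lazily
// concatenates the output iterators produced by a per-item flatMap function.
final class FlatMapIteratorSketch<I, O> implements Iterator<O> {
    private final Iterator<I> inputs;
    private final Function<I, Iterator<O>> flatMapFunc;
    private Iterator<O> current = Collections.emptyIterator();

    FlatMapIteratorSketch(Iterator<I> inputs, Function<I, Iterator<O>> flatMapFunc) {
        this.inputs = inputs;
        this.flatMapFunc = flatMapFunc;
    }

    @Override
    public boolean hasNext() {
        // Advance past inputs whose output is empty until something is available.
        while (!current.hasNext() && inputs.hasNext()) {
            current = flatMapFunc.apply(inputs.next());
        }
        return current.hasNext();
    }

    @Override
    public O next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("a b", "", "c").iterator();
        Iterator<String> tokens = new FlatMapIteratorSketch<String, String>(
                lines,
                line -> Arrays.stream(line.split(" ")).filter(s -> !s.isEmpty()).iterator());
        tokens.forEachRemaining(System.out::println);
    }
}
```

Because the adapter pulls one input at a time, only the output of the current item needs to be held in memory, which is the usual motivation for this pattern over materializing a whole partition.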
 
Flow based replacement for PairHMM likelihood calculation.
 
 
Haplotype that also keeps information on the flow space (see FlowBasedRead). Haplotype can't be extended, so this extends Allele.
A common base class for flow based filters which test for conditions on an hmer basis
Flow Based HMM, intended to incorporate the scoring model of the FlowBasedAlignmentLikelihoodEngine while allowing for frame-shift insertions and deletions for better genotyping.
 
Class for performing the pair HMM for global alignment in FlowSpace.
Tools that perform variant calling and genotyping for short variants (SNPs, SNVs and Indels) on flow-based sequencing platforms
Adds flow information to the usual GATKRead.
 
utility class for flow based read
 
A read filter to test if the TP values for each hmer in a flow based read form a palindrome (as they should)
A read filter to test if the TP values for each hmer in a flow based read are within the allowed range (being the possible lengths of hmers - maxHmer)
Finds specific features in reads, scores the confidence of each feature relative to the reference in each read and writes them into a VCF file.
 
 
Set of arguments for the FlowFeatureMapper
Class representing a single read fragment at a particular start location without a mapped mate.
Fractional Downsampler: selects a specified fraction of the reads for inclusion.
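A hedged sketch of fractional downsampling: each submitted item is kept independently with probability equal to the requested fraction. The class below is hypothetical and is not the GATK downsampler. The per-item random draw and the seeded generator are assumptions made for illustration, and the real implementation may select items differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: keeps each submitted item with probability `fraction`.
final class FractionalDownsamplerSketch<T> {
    private final double fraction;
    private final Random rng;
    private final List<T> kept = new ArrayList<>();
    private int discarded = 0;

    FractionalDownsamplerSketch(double fraction, long seed) {
        if (Double.isNaN(fraction) || fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1], got: " + fraction);
        }
        this.fraction = fraction;
        this.rng = new Random(seed); // seeded so that runs are reproducible
    }

    void submit(T item) {
        // nextDouble() is in [0, 1), so fraction 1.0 keeps everything and 0.0 keeps nothing.
        if (rng.nextDouble() < fraction) {
            kept.add(item);
        } else {
            discarded++;
        }
    }

    List<T> keptItems() {
        return Collections.unmodifiableList(kept);
    }

    int discardedCount() {
        return discarded;
    }

    public static void main(String[] args) {
        FractionalDownsamplerSketch<Integer> sampler = new FractionalDownsamplerSketch<>(0.25, 42L);
        for (int i = 0; i < 1000; i++) {
            sampler.submit(i);
        }
        System.out.println("kept " + sampler.keptItems().size() + " of 1000");
    }
}
```

The number of items kept is random, so only the expected proportion matches the requested fraction.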
All available evidence coming from a single biological fragment.
Class representing a single read fragment at a particular start location without a mapped mate.
Represents the results of the reads -> fragment calculation.
Fragment depth of coverage of each allele per sample
Median fragment length of reads supporting each allele.
 
Keep only read pairs (0x1) with absolute insert length less than or equal to the specified maximum, and/or greater than or equal to the specified minimum.
 
Perform functional annotation on a segment file (tsv).
Abstract class representing a Funcotator annotation.
A filter to apply to Funcotations in FilterFuncotations.
A linked map of transcript IDs to funcotations.
Represents metadata information for fields in a Funcotation.
 
Funcotator (FUNCtional annOTATOR) analyzes given variants for their function (as retrieved from a set of data sources) and produces the analysis in a specified output file.
Class to store argument definitions specific to Funcotator.
An enum to handle the different types of input files for data sources.
The file format of the output file.
 
FuncotatorDataSourceDownloader is a tool to download the latest data sources for Funcotator.
Class that performs functional annotation of variants.
 
 
A type to keep track of different specific genuses.
Class representing exceptions that arise when trying to create a coding sequence for a variant:
Arguments to be used by the Funcotator GATKTool, which are specific to Funcotator.
Stratifies by nonsense, missense, silent, and all annotations in the input ROD, from the INFO field annotation.
 
The scheme is defined in the constructor.
The default scheme is derived from the GA4GH Benchmarking Work Group's proposed evaluation scheme.
Efficiently concatenate BAM files that resulted from a scattered parallel analysis.
 
 
 
 
Simple little class that combines multiple VCFs that have exactly the same set of samples and nonoverlapping sets of loci.
This tool combines together rows of variant calls from multiple VCFs, e.g.
 
An abstract ArgumentCollection for defining the set of annotation descriptor plugin arguments that are exposed to the user on the command line.
A plugin descriptor for managing the dynamic discovery of both InfoFieldAnnotation and GenotypeAnnotation objects within the packages defined by the method getPackageNames() (default org.broadinstitute.hellbender.tools.walkers.annotator).
 
Configuration file for GATK options.
A GATKDataSource is something that can be iterated over from start to finish and/or queried by genomic interval.
Custom DocWorkUnit used for generating GATK help/documentation.
Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
Class GATKException.
 
 
For wrapping errors that are believed to never be reachable
Utility class containing various methods for working with GenomicsDB. Contains code to modify the GenomicsDB import input using the Protobuf API. References: GenomicsDB Protobuf structs ( https://github.com/GenomicsDB/GenomicsDB/blob/master/src/resources/genomicsdb_vid_mapping.proto ), Protobuf generated Java code guide ( https://developers.google.com/protocol-buffers/docs/javatutorial#the-protocol-buffer-api and https://developers.google.com/protocol-buffers/docs/reference/java-generated ).
Class representing a GSONWorkUnit for GATK work units.
Custom Barclay-based Javadoc Doclet used for generating GATK help/documentation.
The GATK Documentation work unit handler class that is the companion to GATKHelpDoclet.
GATK tool command line arguments that are input or output resources.
Unified read interface for use throughout the GATK.
An abstract ArgumentCollection for defining the set of read filter descriptor plugin arguments that are exposed to the user on the command line.
A CommandLinePluginDescriptor for ReadFilter plugins
Converts a GATKRead to a BDG AlignmentRecord
Interface for classes that are able to write GATKReads to some output destination.
GATKRegistrator registers Serializers for our project.
Container class for GATK report tables
column information within a GATK report table
Column width and left/right alignment.
 
The gatherable data types acceptable in a GATK report column.
 
 
 
 
 
Base class for GATK spark tools that accept standard kinds of inputs (reads, reference, and/or intervals).
 
 
 
 
 
 
Base class for all GATK tools.
Variant is (currently) a minimal variant interface needed by the Hellbender pipeline.
 
 
 
 
This class contains any constants (primarily FORMAT/INFO keys) in VCF files used by the GATK.
This class contains the VCFHeaderLine definitions for the annotation keys in GATKVCFConstants.
Choose the Tribble indexing strategy
Custom Barclay-based Javadoc Doclet used for generating tool WDL.
The GATK WDL work unit handler.
Learn multiplicative correction factors as a function of GC content using a simple regression.
Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
 
Calculates GC Bias Metrics on multiple levels. Created by kbergin on 3/23/15.
High level metrics that capture how biased the coverage in a certain lane is.
Utilities to calculate GC Bias. Created by kbergin on 9/23/15.
Flow Annotation: percentage of G or C in the window around hmer
A class to represent a Functional Annotation.
Represents the type and severity of a variant.
 
A builder object to create GencodeFuncotations.
A factory to create GencodeFuncotations.
A Gencode GTF Feature representing a CDS.
Tribble Codec to read data from a GENCODE GTF file.
A Gencode GTF Feature representing an exon.
A GencodeGtfFeature represents data in a GENCODE GTF file.
Keyword identifying the source of the feature, like a program (e.g.
Additional relevant information appended to a feature.
Type of the feature represented in a single line of a GENCODE GTF File.
Indication of whether a feature is new, tentative, or already known.
Whether the first base of the CDS segment is the first (frame 0), second (frame 1) or third (frame 2) in the codon of the ORF.
Biotype / transcript type for the transcript or gene represented in a feature.
Status of how a position was annotated / verified: 1 - verified locus; 2 - manually annotated locus; 3 - automatically annotated locus. For more information, see https://www.gencodegenes.org/data_format.html and https://en.wikipedia.org/wiki/General_feature_format
 
Attribute that indicates the status of the mapping.
Attribute that compares the mapping to the existing target annotations.
Transcript score according to how well mRNA and EST alignments match over its full length.
Struct-like container class for the fields in a GencodeGtfFeature. This is designed to be a basic dummy class to make feature instantiation easier.
A Gencode GTF Feature representing a gene.
A Gencode GTF Feature representing a selenocysteine.
A Gencode GTF Feature representing a start codon.
A Gencode GTF Feature representing a stop codon.
A Gencode GTF Feature representing a transcript.
A Gencode GTF Feature representing an untranslated region.
Holds annotation of a gene for storage in an OverlapDetector.
Load gene annotations into an OverlapDetector of Gene objects.
Evaluate gene expression from RNA-seq reads aligned to genome.
 
 
 
 
This class can only work on segments.
Genome location representation.
Factory class for creating GenomeLocs
How much validation should we do at runtime with this parser?
Class GenomeLocCollection
 
Constants related to GenomicsDB
Import single-sample GVCFs into GenomicsDB before joint genotyping.
Encapsulates the GenomicsDB-specific options relevant to the FeatureDataSource
Collection of allele counts for a genotype.
Represents an annotation that is computed for a single genotype.
Created by davidben on 6/10/16.
 
Summary
A simple structure to return the results of getAlleles.
Class that holds metrics about the Genotype Concordance contingency tables.
A class to store the counts for various truth and call state classifications relative to a reference.
Class that holds detail metrics about Genotype Concordance
This defines, for each valid TruthState and CallState tuple, the set of contingency table entries to which the tuple should contribute.
Created by kbergin on 6/19/15.
Created by kbergin on 7/30/15.
A class to store the various classifications for: 1.
These states represent the relationship between the call genotype and the truth genotype relative to a reference sequence.
A specific state for a 2x2 contingency table.
A minute class to store the truth and call state respectively.
These states represent the relationship between a truth genotype and the reference sequence.
Class that holds summary metrics about Genotype Concordance
 
An interface for classes that perform Genotype filtration.
Created by bimber on 5/17/2017.
Perform joint genotyping on one or more samples pre-called with HaplotypeCaller
 
Engine class to allow for other classes to replicate the behavior of GenotypeGVCFs.
 
Helper to calculate genotype likelihoods for DRAGEN advanced genotyping models (BQD - Base Quality Dropout, and FRD - Foreign Reads Detection).
Genotype likelihood calculator utility.
Class to compose genotype prior probability calculators.
Genotype filter that filters out genotypes below a given quality threshold.
Summarize genotype statistics from all samples at the site level
 
Miscellaneous tools, e.g.
GenotypingData<A extends htsjdk.variant.variantcontext.Allele>
Encapsulates the data used to make the genotype calls.
Base class for genotyper engines.
GenotypingLikelihoods<A extends htsjdk.variant.variantcontext.Allele>
Genotyping Likelihoods collection.
A wrapping interface between the various versions of genotypers so as to keep them interchangeable.
 
Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy.
 
 
Helper class for PostprocessGermlineCNVCalls for single-sample postprocessing of GermlineCNVCaller calls into genotyped intervals.
This class stores naming standards in the GermlineCNVCaller.
Helper class for PostprocessGermlineCNVCalls for single-sample postprocessing of segmented GermlineCNVCaller calls.
GermlineCNVVariantComposer<DATA extends htsjdk.samtools.util.Locatable>
 
 
 
 
 
Usage example
Summarizes counts of reads that support reference, alternate and other alleles for given sites.
Emit a single sample name from the bam header into an output file.
Implements Gibbs sampling of a multivariate probability density function.
Perform "quick and dirty" joint genotyping on one or more samples pre-called with HaplotypeCaller
Guts of the GnarlyGenotyper
Efficient algorithm to obtain the list of best haplotypes given the instance.
Utility functions used in the graphs package
Created by farjoun on 11/2/16.
 
An internal tool to produce a flexible and robust ground truth set for base calling training.
Class to convert an Illumina GTC file into a VCF file.
 
Combines variants into GVCF blocks.
Turns an iterator of VariantContext into one which combines GVCF blocks.
An accumulator for collecting metrics about a single-sample GVCF.
Genome-wide VCF writer. Merges blocks based on GQ.
 
Utility class that allows easy creation of destinations for the HaplotypeBAMWriters
A BAMWriter that aligns reads to haplotypes and emits their best alignments to a destination.
Possible modes for writing haplotypes to BAMs
Calculate likelihood matrix for each Allele in VCF against a set of Reads limited by a set of Haplotypes
Set of arguments for the HaplotypeBasedVariantRecaller
Represents information about a group of SNPs that form a haplotype in perfect LD with one another.
Call germline SNPs and indels via local re-assembly of haplotypes
Set of arguments for the HaplotypeCallerEngine
The different flow modes, in terms of their parameters and their values. NOTE: a parameter value ending with /o is optional, meaning the process will not fail if the parameter does not exist in the target parameters collection.
The core engine for the HaplotypeCaller that does all of the actual work of the tool.
A short helper class that manages a singleton debug stream for HaplotypeCaller genotyping information that is useful for debugging.
HaplotypeCaller's genotyping strategy implementation.
 
This tool DOES NOT match the output of HaplotypeCaller.
Set of annotations meant to be reflective of HaplotypeFiltering operations that were applied in FlowBased HaplotypeCaller.
A collection of metadata about Haplotype Blocks including multiple in memory "indices" of the data to make it easy to query the correct HaplotypeBlock or Snp by snp names, positions etc.
Abstract class for storing and calculating various likelihoods and probabilities for haplotype alleles given evidence.
Log10(P(evidence| haplotype)) for the 3 different possible haplotypes {aa, ab, bb}
Represents the probability of the underlying haplotype of the contaminating sample given the data.
Represents a set of HaplotypeProbabilities that were derived from a single SNP genotype at a point in time.
Represents the likelihood of the HaplotypeBlock given the GenotypeLikelihoods (GL field from a VCF, which is actually a log10-likelihood) for each of the SNPs in that block.
Represents the probability of the underlying haplotype given the data.
A wrapper class for any HaplotypeProbabilities instance that will assume that the given evidence is that of a tumor sample and provide an hp for the normal sample that tumor came from.
A service class for HaplotypeBasedVariantRecaller that reads a SAM/BAM file, interprets the reads as haplotypes, and calls a provided consumer with the 'best' haplotypes found for a given query location.
Base class for Hard filters that are applied at the allele level
 
An output stream which stops at the threshold instead of potentially triggering early.
Indicates that this object has a genomic location and provides a systematic interface to get it.
Helper class for SimpleCountCollection used to read/write HDF5.
Represents the SVD panel of normals to be created by CreateReadCountPanelOfNormals.
TODO move into hdf5-java-bindings
A comparator for headerless SAMRecords that exactly matches the ordering of the SAMRecordCoordinateComparator
 
 
General heterogeneous ploidy model.
A class containing utility methods used in the calculation of annotations related to cohort heterozygosity, e.g.
Class used for storing a list of doubles as a run length encoded histogram that compresses the data into bins spaced at defined intervals.
Holds all the hits (alignments) for a read or read pair.
 
Flow Annotation: length of the hmer indel, if so
Flow Annotation: nucleotide of the hmer indel, if so
Flow Annotation: motifs to the left and right of the indel
A read filter to test if the quality values for each hmer in a flow based read form a palindrome (as they should).
PloidyModel implementation tailored to work with a homogeneous constant ploidy across samples and positions.
Homo sapiens genome constants.
Multiset implementation that provides low memory overhead with a high load factor by using the hopscotch algorithm.
 
 
A uniquely keyed map with O(1) operations.
A uniquely keyed map with O(1) operations.
 
A map that can contain multiple values for a given key.
A map that can contain multiple values for a given key.
 
Implements Set by imposing a unique-element constraint on HopscotchCollection.
Implements Set by imposing a unique-element constraint on HopscotchCollection.
 
A map that can contain multiple values for a given key, but distinct entries.
A map that can contain multiple values for a given key, but distinct entries.
 
Filters out reads above a threshold identity (number of matches less deletions), given in bases.
Calculates HS metrics for a given SAM or BAM file.
Metrics generated by CollectHsMetrics for the analysis of target-capture sequencing experiments.
Classes of data that can be requested in an htsget request as defined by the spec
Class allowing deserialization from json htsget error response
Formats currently supported by htsget as defined by spec
A tool that downloads a file hosted on an htsget server to a local file
Builder for an htsget request that allows converting the request to a URI after validating that it is properly formed
Fields which can be used to filter a htsget request as defined by the spec
Class allowing deserialization from json htsget response
 
 
 
 
Program to create a fingerprint for the contaminating sample when the level of contamination is both known and uniform in the genome.
Utilities for interacting with IGV-specific formats.
Describes adapters used on each pair of strands
A class to encompass writing an Illumina adpc.bin file.
 
Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
 
Simple switch to control the read name format to emit.
IlluminaBasecallsToSam transforms a lane of Illumina data file formats (bcl, locs, clocs, qseqs, etc.) into SAM, BAM or CRAM file format.
A class to parse the contents of an Illumina Bead Pool Manifest (BPM) file. A BPM file contains metadata (including the alleles, mapping and normalization information) on an Illumina Genotyping Array. Each type of genotyping array has a specific BPM.
A simple class to represent a locus entry in an Illumina Bead Pool Manifest (BPM) file
IlluminaDataProviderFactory accepts options for parsing Illumina data files for a lane and creates an IlluminaDataProvider, an iterator over the ClusterData for that lane, which utilizes these options.
List of data types of interest when parsing Illumina data.
General utils for dealing with IlluminaFiles as well as utils for specific, supported formats.
 
 
Embodies characteristics that describe a lane.
A class to represent an Illumina Manifest file.
A class to represent a record (line) from an Illumina Manifest [Assay] entry
 
Illumina's TileMetricsOut.bin file codes various metrics, both concrete (all density id's are code 100) or as a base code (e.g.
Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
A read name encoder following the encoding initially produced by picard fastq writers.
Misc utilities for working with Illumina specific files and data
Describes adapters used on each pair of strands
 
Likelihood-based test for the consanguinity among samples
Flow Annotation: indel class: ins, del, NA
A calculator that estimates the error rate of the bases it observes for indels only.
Metric to be used for InDel errors
Flow Annotation: length of indel
Simple utility for histogramming indel lengths. Based on code from chartl.
Stratifies the eval RODs by the indel size. Indel sizes are stratified from sizes -100 to +100.
 
A class to store information relevant for biological rate estimation
This class delegates genotyping to allele count- and ploidy-dependent GenotypeLikelihoodCalculators under the assumption that sample genotypes are independent conditional on their population frequencies.
IndexedAlleleList<A extends htsjdk.variant.variantcontext.Allele>
Allele list implementation using an indexed-set.
Simple implementation of a sample-list using an indexed-set.
Set where each element can be referenced by a unique integer index that runs from 0 to the size of the set - 1.
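The indexed-set idea above can be illustrated with a minimal sketch. This is not the library's implementation; the class name `IndexedSetSketch` and its methods are hypothetical, and the sketch assumes elements are never removed, so indices stay stable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: a set whose elements are addressable by index 0..size()-1. */
final class IndexedSetSketch<E> {
    private final List<E> elements = new ArrayList<>();              // index -> element
    private final Map<E, Integer> indexByElement = new HashMap<>();  // element -> index

    /** Adds the element if absent; returns true if the set changed. */
    boolean add(E e) {
        if (indexByElement.containsKey(e)) {
            return false;
        }
        indexByElement.put(e, elements.size());
        elements.add(e);
        return true;
    }

    /** Returns the element stored at the given index (0 to size() - 1). */
    E get(int index) {
        return elements.get(index);
    }

    /** Returns the index of the element, or -1 if it is not in the set. */
    int indexOf(E e) {
        return indexByElement.getOrDefault(e, -1);
    }

    int size() {
        return elements.size();
    }
}
```

Keeping a list and a map side by side gives constant-time lookup in both directions, at the cost of storing each element reference twice.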
This tool creates an index file for the various kinds of feature-containing files supported by GATK (such as VCF and BED files).
Represents a 0-based integer index range.
 
A class to provide methods for accessing Illumina Infinium Data Files.
A class to parse the contents of an Illumina Infinium cluster (EGT) file. A cluster file contains information about the clustering information used in mapping red / green intensity information to genotype calls.
A class to encapsulate the table of contents for an Illumina Infinium Data File.
A class to parse the contents of an Illumina Infinium genotype (GTC) file. A GTC file is the output of Illumina's genotype calling software (either Autocall or Autoconvert) and contains genotype calls, confidence scores, basecalls and raw intensities for all calls made on the chip.
 
A class to parse the contents of an Illumina Infinium Normalization Manifest file. An Illumina Infinium Normalization Manifest file contains a subset of the information contained in the Illumina Manifest file, in addition to the normalization ID, which is needed for normalizing intensities in GtcToVcf.
 
A class to store fields that are specific to a VCF generated from an Illumina GTC file.
 
Keeps track of concordance between two info fields.
Table reading class for InfoConcordanceRecords
Table writing class for InfoConcordanceRecords
Annotations relevant to the INFO field of the variant file (ie annotations for sites).
Settings that define text to write to the process stdin.
Holds the information characterizing an insert size distribution.
Supported insert size distribution shapes.
Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insertSizeMetrics".
Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
ArgumentCollection for InsertSizeMetrics collectors.
Collects InsertSizeMetrics on the specified accumulationLevels
Collects InsertSizeMetrics on the specified accumulationLevels using
Worker class to collect insert size metrics, add metrics to file, and provides accessors to stats of groups of different level.
Created by davidben on 8/19/16.
A genotyped integer copy-number segment.
Represents a collection of IntegerCopyNumberSegment for a sample.
This class represents integer copy number states.
Created by tsato on 5/1/17.
The channels in a FourChannelIntensityData object, and the channels produced by a ClusterIntensityFileReader, for cases in which it is desirable to handle these abstractly rather than having the specific names in the source code.
For special cases where we want to emit AlignmentContexts regardless of whether we have an overlap with a given interval.
Intended to be used as an @ArgumentCollection for specifying intervals at the command line.
Base interface for an interval argument collection.
The bundle of integer copy-number posterior distribution and baseline integer copy-number state for an interval.
Class to find the coverage of the intervals.
 
An interface for a class that scatters IntervalLists.
A base class for scatterers that scatter by uniqued base count.
Scatters IntervalList by interval count so that the resulting IntervalLists have the same number of intervals in them.
Scatters an IntervalList into `interval count` shards so that the resulting IntervalLists have approximately the same number of intervals in them.
A BaseCount scatterer that avoids breaking up intervals.
Like IntervalListScattererWithoutSubdivision but will overflow current list if the projected size of the remaining lists is bigger than the "ideal".
An IntervalListScatterer that attempts to place the same number of (uniquified) bases in each output interval list.
An enum to control the creation of the various IntervalListScatter objects
Trivially simple command line program to convert an IntervalList file to a BED file.
Performs various IntervalList manipulations.
 
Returns a SimpleInterval for each locus in a set of intervals.
A class we use to determine the merging rules for intervals passed to the GATK.
IntervalOverlappingIterator<T extends htsjdk.samtools.util.Locatable>
Wraps an iterator of Locatable with a list of sorted intervals to return only the objects which overlap with them.
A simple read filter that allows for the user to specify intervals at the filtering stage.
 
 
 
Set operators for combining lists of intervals.
Tools that process genomic intervals in various formats.
Stratifies the variants by whether they overlap an interval in the set provided on the command line.
Parse text representations of interval strings that can appear in GATK-based applications.
An enum to classify breakpoints whether the breakpoint is the start or end of a region.
An IntervalWalker is a tool that processes a single interval at a time, with the ability to query optional overlapping sources of reads, reference data, and/or variants/features.
Encapsulates a SimpleInterval with the reads that overlap it (the ReadsContext) and its ReferenceContext and FeatureContext.
A Spark version of IntervalWalker.
Histogram of observations on a compact set of non-negative integer values.
 
 
 
Created by davidben on 8/19/16.
A helper class to maintain a cache of an int to double function defined on n = 0, 1, 2.
Utility class for defining a "not" allele concept that is used to score haplotypes that are not supporting the allele.
 
 
A read transformer to convert IUPAC bases (i.e.
Stratifies the eval RODs by user-supplied JEXL expressions. See https://gatk.broadinstitute.org/hc/en-us/articles/360035891011-JEXL-filtering-expressions for more details.
Keep only reads whose attributes meet a given set of JEXL expressions.
Joins an RDD of GATKReads to variant data by copying the variants files to every node, using Spark's file copying mechanism.
Merge GCNV segments VCFs. This tool takes in segmented VCFs produced by PostprocessGermlineCNVCalls.
A best haplotype object for being used with junction trees.
FORMAT annotations that look at more inputs than regular annotations
INFO annotations that look at more inputs than regular annotations
High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
 
Experimental version of the ReadThreadingGraph with support for threading reads to generate JunctionTrees for resolving connectivity information at longer ranges.
Represents a result from a K-best haplotype search.
A common interface for the different KBestHaplotypeFinder implementations to conform to
Segments data (i.e., finds multiple changepoints) using a method based on the kernel-segmentation algorithm described in https://hal.inria.fr/hal-01413230/document, which gives a framework to quickly calculate the cost of a segment given a low-rank approximation to a specified kernel.
 
Fast wrapper for byte[] kmers. This object has several important features that make it better than using a raw byte[] for a kmer: (1) it can be created from a range of a larger byte[], allowing us to avoid Arrays.copyOfRange; (2) it has fast equals and hashCode methods; (3) it can return the actual byte[] of the kmer, even if it comes from a larger byte[], and it does that work only once, updating its internal state.
A <Kmer,count> pair.
 
A <Kmer,IntervalId> pair.
 
Eliminates dups, and removes over-represented kmers.
Iterates over reads, kmerizing them, and counting up just the kmers that appear in a passed-in set.
Generic utility class that counts kmers. Basically you add kmers to the counter, and it tells you how many occurrences of each kmer it's seen.
Common interface for those graphs that implement vertex by kmer look-up.
 
KV<K,V>
Replacement for the dataflow Key-Value class; don't use this anywhere new.
Represents a collection of LabeledVariantAnnotationsDatum as a list of lists of datums.
Base walker for both ExtractVariantAnnotations and ScoreVariantAnnotations, which enforces identical variant-extraction behavior in both tools via LabeledVariantAnnotationsWalker.extractVariantMetadata(htsjdk.variant.variantcontext.VariantContext, org.broadinstitute.hellbender.engine.FeatureContext, boolean).
Helper class used to transform tile data for a lane into a collection of IlluminaPhasingMetrics
Set of longs that is larger than the max Java array size ( ~ 2^31 ~ 2 billion) and therefore cannot fit into a single LongHopscotchSet.
 
Learn the prior probability of read orientation artifact from the output of CollectF1R2Counts of Mutect2. Details of the model may be found in docs/mutect/mutect.pdf.
 
Left-align indels in a variant callset
Left-aligns indels in read data
 
Represents a CBS-style segmentation to enable IGV-compatible plotting.
Leveling Downsampler: Given a set of Lists of arbitrary items and a target size, removes items from the Lists in an even fashion until the total size of all Lists is <= the target size.
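One way to read "removes items in an even fashion" is to trim the largest list first until the total fits. The sketch below follows that reading only; it is an assumption, not the library's algorithm, and the names `LevelingSketch` and `levelDown` are hypothetical. Which item is dropped (here, the last one) is also an arbitrary choice.

```java
import java.util.List;

/** Hypothetical illustration of leveling: shrink the largest group until the total fits. */
final class LevelingSketch {
    /**
     * Removes the last item of the currently largest list, one item at a time,
     * until the combined size of all lists is <= targetSize.
     * The lists must be mutable.
     */
    static <T> void levelDown(List<List<T>> groups, int targetSize) {
        if (targetSize < 0) {
            throw new IllegalArgumentException("targetSize must be >= 0");
        }
        int total = 0;
        for (List<T> g : groups) {
            total += g.size();
        }
        while (total > targetSize) {
            // total > 0 here, so at least one group is non-empty
            List<T> largest = groups.get(0);
            for (List<T> g : groups) {
                if (g.size() > largest.size()) {
                    largest = g;
                }
            }
            largest.remove(largest.size() - 1);
            total--;
        }
    }
}
```

With groups of sizes 5, 2 and 1 and a target of 5, this trims only the first group, leaving sizes 2, 2 and 1, so small groups are preserved while large ones are leveled.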
A class to generate library Ids and keep duplication metrics by library IDs.
A class to generate library Ids and keep duplication metrics by library IDs.
Splits readers by library name.
Keep only reads from the specified library.
Statistics of fragment length distributions.
 
Simple wrapper about the information LIBS needs about downsampling
Liftover SNPs in HaplotypeMaps from one reference to another
This tool adjusts the coordinates in an interval list on one reference to its homologous interval list on another reference, based on a chain file that describes the correspondence between the two references.
 
Summary
Set of arguments related to ReadLikelihoodCalculationEngine implementations
LikelihoodMatrix<EVIDENCE,A extends htsjdk.variant.variantcontext.Allele>
Likelihood matrix between a set of alleles and evidence.
Rank Sum Test of per-read likelihoods of REF versus ALT reads
Represents a value of copy ratio in linear space generated by GermlineCNVCaller with the corresponding interval.
Collection of copy ratios in linear space generated by GermlineCNVCaller with their corresponding intervals
A reader wrapper around a LineIterator.
FuncotationFilter matching variants which: Have been flagged by LMM as important for loss of function.
 
Something to throw when we have too many Contigs or Traversals to proceed with assembly.
An unbranched sequence of Kmers.
Initial or final Kmer in a Contig.
Simple implementation of Contig interface.
A list of Contigs that presents a reverse-complemented view of a List of Contigs.
 
Implementation of Contig for the reverse-complement of some other Contig.
fixed-size, immutable kmer.
A Kmer that remembers its predecessors and successors, and the number of times it's been observed in the assembly's input set of reads.
Class to implement KmerAdjacency for canonical Kmers.
Class to implement KmerAdjacency for Kmers that are the reverse-complement of a canonical Kmer.
Set of Kmers.
A path through the assembly graph for something (probably a read).
A helper class for Path building.
A single-Contig portion of a path across the assembly graph.
A part of a path that is present as a sub-sequence of some Contig.
A part of a path that isn't present in the graph.
A CharSequence that is a view of the reverse-complement of another sequence.
A count of the number of read Paths that cross through some Contig from some previous Contig to some subsequent Contig.
A list of Contigs through the assembly graph.
 
Set of traversals.
Per-Contig storage for depth-first graph walking.
 
Implements fields for use in known locatables.
Interface for marking objects that contain metadata associated with a collection of locatables.
Factory for creating TableFuncotations by handling `Separated Value` files with arbitrary delimiters (e.g.
This class exists to allow VariantContext objects to be compared based only on their location and set of alleles, providing a more liberal equals method so that VariantContext objects can be placed into a Set which retains only VCs that have non-redundant location and Allele lists.
 
Created by jcarey on 3/13/14.
The locs file format is one of 3 Illumina formats (pos, locs, and clocs) that store position data exclusively.
Describes the behavior of a locus relative to a gene.
Iterator that traverses a SAM File, accumulating information on a per-locus basis. Produces AlignmentContext objects, that contain ReadPileups of PileupElements.
A LocusWalker is a tool that processes reads that overlap a single position in a reference at a time from one or multiple sources of reads, with optional contextual information from a reference and/or sets of variants/Features.
An implementation of LocusWalker that supports arbitrary interval side inputs.
Encapsulates an AlignmentContext with its ReferenceContext and FeatureContext.
A Spark version of LocusWalker.
FuncotationFilter matching variants which: are classified as FRAME_SHIFT_*, NONSENSE, START_CODON_DEL, or SPLICE_SITE; occur on a gene where loss of function is a disease mechanism; and have a max MAF of 1% across sub-populations of ExAC or gnomAD.
 
Wrapper class so that the log10Factorial array is only calculated if it's used
Util class for performing the pair HMM for global alignment.
Logging utilities.
 
Bloom filter for primitive longs.
 
Utility class, useful for flow based applications, implementing a workaround for long homopolymers handling.
This class is based on the HopscotchCollection and HopscotchSet classes for storing Objects.
 
Iterator-like interface for collections of primitive longs.
Prune all chains from this graph where all edges in the path have multiplicity < pruneFactor. For example, A -[1]> B -[1]> C -[1]> D would be removed with pruneFactor 2, but A -[1]> B -[2]> C -[1]> D would not be, because the linear chain includes an edge with weight >= 2.
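The pruning rule described above reduces to a single predicate over the edge multiplicities of a linear chain. The following is a minimal sketch of that predicate only; `ChainPruningSketch` and `shouldPrune` are hypothetical names, and representing a chain as an `int[]` of edge multiplicities is an assumption made for illustration (the real class works on graph objects).

```java
import java.util.Arrays;

/** Hypothetical illustration of the low-weight chain pruning rule. */
final class ChainPruningSketch {
    /**
     * Returns true when every edge in the chain has multiplicity below pruneFactor,
     * i.e. when the whole chain would be removed.
     */
    static boolean shouldPrune(int[] edgeMultiplicities, int pruneFactor) {
        if (edgeMultiplicities.length == 0) {
            return false; // assumption: an empty chain has nothing to prune
        }
        return Arrays.stream(edgeMultiplicities).allMatch(m -> m < pruneFactor);
    }
}
```

For the two chains in the description, `shouldPrune(new int[]{1, 1, 1}, 2)` is true and `shouldPrune(new int[]{1, 2, 1}, 2)` is false, because one well-supported edge is enough to keep the whole chain.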
An LRU cache implemented as an extension to LinkedHashMap
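Extending LinkedHashMap is the standard JDK idiom for an LRU cache. The sketch below shows that idiom in general; the class name `LruCacheSketch` and its constructor are hypothetical and do not claim to match the class indexed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: a fixed-capacity LRU cache built on LinkedHashMap. */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCacheSketch(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", since "a" was touched more recently. Note that LinkedHashMap is not thread-safe, so concurrent use would need external synchronization.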
 
The different flow modes, in terms of their parameters and their values. NOTE: a parameter value ending with /o is optional, meaning the process will not fail if the parameter does not exist in the target parameters collection.
 
A Funcotator output renderer for writing to MAF files.
Class to hold all the constants required for the MafOutputRenderer.
This is the main class of Hellbender and is the way of executing individual command line programs.
Creates a VCF that contains all the site-level information for all records in the input VCF but no genotype information.
Creates a TSV from sample name to VCF/GVCF path, with one line per input.
Imported with changes from Picard private.
The ranked data in one list and a list of the number of ties.
The results of performing a rank sum test.
The values of U1, U2 and the transformed number of ties needed for the calculation of sigma in the normal approximation.
A variable that indicates if the test is one sided or two sided and if it's one sided which group is the dominator in the null hypothesis.
Median mapping quality of reads supporting each alt allele.
 
Rank Sum Test for mapping qualities of REF versus ALT reads
Keep only reads with mapping qualities within a specified range.
A read transformer that changes the mapping quality of reads with MQ=255 to MQ=60.
Count of all reads with MAPQ = 0 across all samples
A better duplication marking algorithm that handles all cases including clipped and gapped alignments.
Enum used to control how duplicates are flagged in the DT optional tag on each read.
Enum for the possible values that a duplicate read can be tagged with in the DT attribute.
 
MarkDuplicates calculation helper class for flow based mode. The class extends the behavior of MarkDuplicates which contains the complete code for the non-flow based mode.
 
This class helps us compute and compare duplicate scores, which are used for selecting the non-duplicate during duplicate marking (see MarkDuplicatesGATK).
MarkDuplicates on Spark
An argument collection for use with tools that mark optical duplicates.
A common interface for the data types that represent reads for mark duplicates spark.
 
Utility classes and functions for Mark Duplicates.
Wrapper object used for storing an object and some type of index information.
Comparator for TransientFieldPhysicalLocation objects by their attributes and strandedness.
An even better duplication marking algorithm that handles all cases including clipped and gapped alignments.
This will iterate through a coordinate sorted SAM file (iterator) and either mark or remove duplicates as appropriate.
 
Command line program to mark the location of adapter sequences.
This is the mark queue.
Represents the results of a fingerprint comparison between one dataset and a specific fingerprint file.
Keep only paired reads that are not near each other in a coordinate-sorted source of reads.
General math utilities
A collection of common math operations that work with log values.
MathUtils is a static class (no instantiation allowed!) with some useful math methods.
 
A utility class that computes on the fly average and standard deviation for a stream of numbers.
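Computing a mean and standard deviation "on the fly" is commonly done with Welford's online algorithm, which avoids storing the stream. The sketch below uses that technique as an assumption; it is not known whether the indexed class uses it, and `RunningStatsSketch` is a hypothetical name. The sample (n - 1) standard deviation is also an arbitrary choice here.

```java
/** Hypothetical illustration: streaming mean and standard deviation via Welford's algorithm. */
final class RunningStatsSketch {
    private long n;       // number of values seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one more value into the running statistics. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the updated mean, as Welford's update requires
    }

    long count() {
        return n;
    }

    /** Mean of the values seen so far; NaN if none were added. */
    double mean() {
        return n > 0 ? mean : Double.NaN;
    }

    /** Sample standard deviation (divides by n - 1); NaN with fewer than two values. */
    double sampleStdDev() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : Double.NaN;
    }
}
```

Each call to `add` costs constant time and memory, and the update is numerically steadier than accumulating a sum and a sum of squares.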
Static class for implementing some matrix summary stats that are not in Apache, Spark, etc
Program to generate a data table and chart of mean quality by cycle from a BAM file.
Program to generate a data table and chart of mean quality by cycle from a BAM file.
Map from String to ReadEnds object.
Class for the identification and tracking of mendelian violation.
Mendelian violation detection and counting
Describes the type and number of mendelian violations found within a Trio.
Created by farjoun on 6/25/16.
An extension of MetricBase that knows how to merge-by-adding fields that are appropriately annotated (MergeByAdding).
Metrics whose values can be merged by adding.
Metrics whose values should be equal when merging.
Metrics that are merged manually in MergeableMetricBase.merge(MergeableMetricBase).
Metrics that are not merged, but are subsequently derived from other metrics, for example by MergeableMetricBase.calculateDerivedFields().
Metrics that are not merged.
 
 
Summary
 
Merge the stats output by scatters of a single Mutect2 job.
Class to take genotype calls from a ped file output from zCall and merge them into a vcf from autocall.
This tool is used for combining SAM and/or BAM files from different runs or read groups into a single file, similar to the "merge" function of Samtools (http://www.htslib.org/doc/samtools.html).
Combines multiple variant files into a single variant file.
Interface for marking objects that contain metadata that can be represented as a SAMFileHeader.
 
 
Tools that perform metagenomic analysis, e.g.
Tools that perform methylation calling and methylation-based coverage for bisulfite BAMs
Identifies methylated bases from bisulfite sequencing data.
For use with Picard metrics programs that may output metrics for multiple levels of aggregation with an analysis.
For use with Picard metrics programs that may output metrics for multiple levels of aggregation with an analysis.
 
Base class for defining a set of metrics collector arguments.
Created by knoblett on 9/15/15.
Each metrics collector has to be able to run from 4 different contexts:

  - a standalone walker tool
  - the org.broadinstitute.hellbender.metrics.analysis.CollectMultipleMetrics walker tool
  - a standalone Spark tool
  - the CollectMultipleMetricsSpark tool

In order to allow a single collector implementation to be shared across all of these contexts (standalone and CollectMultiple, Spark and non-Spark), collectors should be factored into the following classes, where X in the class names represents the specific type of metrics being collected:

  XMetrics extends MetricBase: defines the aggregate metrics that we're trying to collect
  XMetricsArgumentCollection: defines parameters for XMetrics, extends MetricsArgumentCollection
  XMetricsCollector: processes a single read, and has a reduce/combiner
      For multi level collectors, XMetricsCollector is composed of several classes:
          XMetricsCollector extends MultiLevelReducibleCollector<XMetrics, HISTOGRAM_KEY, XMetricsCollectorArgs, XMetricsPerUnitCollector>
          XMetricsPerUnitCollector: per level collector, implements PerUnitMetricCollector<XMetrics, HISTOGRAM_KEY, XMetricsCollectorArgs> (requires a combiner)
          XMetricsCollectorArgs: per-record argument (type argument for MultiLevelReducibleCollector)
  XMetricsCollectorSpark: adapter/bridge between RDD and the (read-based) XMetricsCollector, implements MetricsCollectorSpark
  CollectXMetrics extends org.broadinstitute.hellbender.metrics.analysis.SinglePassSamProgram
  CollectXMetricsSpark extends MetricsCollectorSparkTool

The following schematic shows the general relationships of these collector component classes in the context of various tools, with the arrows indicating a "delegates to" relationship via composition or inheritance:

    CollectXMetrics          CollectMultipleMetrics
                   \        /
                    \      /
                     v    v
    _______________________________________
    |          XMetricsCollector =========|=========> MultiLevelReducibleCollector
    |                  |                  |                        |
    |                  V                  |                        |
    |               XMetrics              |                        V
    | XMetricsCollectorArgumentCollection |             PerUnitXMetricCollector
    ---------------------------------------
                       ^
                       |
                       |
            XMetricsCollectorSpark
                ^              ^
               /                \
              /                  \
    CollectXMetricsSpark   CollectMultipleMetricsSpark

The general lifecycle of a Spark collector (XMetricsCollectorSpark in the diagram above) looks like this:

    CollectorType collector = new CollectorType();
    CollectorArgType args = // get metric-specific input arguments

    // NOTE: getDefaultReadFilters is called before the collector's initialize
    // method is called, so the read filters cannot access argument values
    ReadFilter filter = collector.getDefaultReadFilters();

    // pass the input arguments to the collector for initialization
    collector.initialize(args, defaultMetricsHeaders);

    collector.collectMetrics(
        getReads().filter(filter),
        samFileHeader
    );
    collector.saveMetrics(getReadSourceName());
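The lifecycle ordering above (filters first, then initialize, collect, save) can be illustrated with a toy collector. This is only a sketch under stated assumptions: `SketchCollector`, `ReadCountCollector`, and `CollectorLifecycleDemo` are hypothetical names invented for illustration, reads are plain strings, and a `List` stands in for the Spark RDD. None of this is the actual GATK interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Drives the lifecycle steps in order; a plain List stands in for the RDD of reads.
final class CollectorLifecycleDemo {
    static String run(List<String> reads, int minLength, String sourceName) {
        ReadCountCollector collector = new ReadCountCollector();
        // 1. Filters are requested first, before any arguments are known.
        Predicate<String> filter = collector.getDefaultReadFilter();
        // 2. Only now does the collector see its arguments.
        collector.initialize(minLength, new ArrayList<>());
        // 3. Collect over the filtered reads.
        List<String> kept = new ArrayList<>();
        for (String read : reads) {
            if (filter.test(read)) {
                kept.add(read);
            }
        }
        collector.collectMetrics(kept);
        // 4. Save (here: just render) the result.
        return collector.saveMetrics(sourceName);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("ACGT", "", "AC", "ACGTA"), 4, "sample.bam"));
    }
}

// Hypothetical stand-in for a Spark metrics collector; NOT the GATK interface.
interface SketchCollector<ARGS, READ> {
    Predicate<READ> getDefaultReadFilter();
    void initialize(ARGS args, List<String> defaultHeaders);
    void collectMetrics(List<READ> filteredReads);
    String saveMetrics(String readSourceName);
}

// Toy collector: counts reads whose length is at least a configured minimum.
final class ReadCountCollector implements SketchCollector<Integer, String> {
    private int minLength;
    private long count;

    @Override
    public Predicate<String> getDefaultReadFilter() {
        // Must not depend on minLength: arguments are not available yet.
        return read -> !read.isEmpty();
    }

    @Override
    public void initialize(Integer args, List<String> defaultHeaders) {
        this.minLength = args;
    }

    @Override
    public void collectMetrics(List<String> filteredReads) {
        for (String read : filteredReads) {
            if (read.length() >= minLength) {
                count++;
            }
        }
    }

    @Override
    public String saveMetrics(String readSourceName) {
        return readSourceName + "\t" + count;
    }
}
```

The point of the toy is the constraint in step 1: because the filter is built before `initialize`, it can only express argument-independent conditions.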
Base class for standalone Spark metrics collector tools.
Filter out reads that: fail platform/vendor quality checks (0x200); are unmapped (0x4); represent secondary/supplementary alignments (0x100 or 0x800).
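As a rough illustration of how the flag values listed above combine, the check reduces to a single bitmask test on the SAM FLAG field. The class and method names here are hypothetical; the actual filter operates on read objects rather than raw integers.

```java
final class SamFlagSketch {
    static final int UNMAPPED = 0x4;
    static final int SECONDARY = 0x100;
    static final int FAILS_VENDOR_QC = 0x200;
    static final int SUPPLEMENTARY = 0x800;

    // Any read with one of these bits set is filtered out.
    static final int REJECT_MASK = UNMAPPED | SECONDARY | FAILS_VENDOR_QC | SUPPLEMENTARY;

    /** Returns true when none of the rejected bits is set in the SAM FLAG. */
    static boolean passes(int samFlag) {
        return (samFlag & REJECT_MASK) == 0;
    }

    public static void main(String[] args) {
        System.out.println("flag 99 passes: " + passes(99));
        System.out.println("flag 2048 passes: " + passes(2048));
    }
}
```

Note that the duplicate bit (0x400) is not part of the mask, matching the description, which does not mention duplicates.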
Utility methods for dealing with MetricsFile and related classes.
 
Implements slice sampling of a continuous, univariate, unnormalized probability density function (PDF), which is assumed to be unimodal.
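For readers unfamiliar with the technique, here is a minimal sketch of univariate slice sampling with stepping-out and shrinkage (after Neal, 2003). It is not the GATK implementation: the class name, the constructor parameters, and the choice to work with a log density are all assumptions made for illustration. It assumes the starting point has nonzero density and that the density is unimodal.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

final class SliceSamplerSketch {
    private final DoubleUnaryOperator logPdf; // unnormalized log density
    private final double width;               // initial bracket width
    private final Random rng;

    SliceSamplerSketch(DoubleUnaryOperator logPdf, double width, long seed) {
        this.logPdf = logPdf;
        this.width = width;
        this.rng = new Random(seed);
    }

    /** Draws the next point given the current point x0. */
    double next(double x0) {
        // 1. Pick a height uniformly under the density at x0 (in log space).
        double logY = logPdf.applyAsDouble(x0) + Math.log(1.0 - rng.nextDouble());
        // 2. Place a bracket of the given width randomly around x0 and step out
        //    until both ends fall outside the slice.
        double left = x0 - width * rng.nextDouble();
        double right = left + width;
        while (logPdf.applyAsDouble(left) > logY) {
            left -= width;
        }
        while (logPdf.applyAsDouble(right) > logY) {
            right += width;
        }
        // 3. Sample uniformly in the bracket, shrinking it toward x0 on rejection.
        while (true) {
            double x1 = left + (right - left) * rng.nextDouble();
            if (logPdf.applyAsDouble(x1) > logY) {
                return x1;
            }
            if (x1 < x0) {
                left = x1;
            } else {
                right = x1;
            }
        }
    }

    /** Runs the chain for n steps from the given start and returns every draw. */
    double[] sample(double start, int n) {
        double[] draws = new double[n];
        double x = start;
        for (int i = 0; i < n; i++) {
            x = next(x);
            draws[i] = x;
        }
        return draws;
    }

    public static void main(String[] args) {
        // Unnormalized standard normal: log f(x) = -x^2 / 2.
        SliceSamplerSketch sampler = new SliceSamplerSketch(x -> -0.5 * x * x, 1.0, 42L);
        for (double draw : sampler.sample(0.0, 5)) {
            System.out.println(draw);
        }
    }
}
```

Only ratios of the density matter, which is why an unnormalized PDF is sufficient.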
A stripped-down version of the former UnifiedGenotyper's genotyping strategy implementation, used only by the HaplotypeCaller for its isActive() determination.
MinimalVariant is a minimal implementation of the GATKVariant interface.
Created by David Benjamin on 2/13/17.
 
Checks for and errors out (or fixes if requested) when it detects reads with base qualities that are not encoded with phred-scaled quality scores.
Simple class for storing a sample and its mixing fraction within a pooled bam.
MMapBackedIteratorFactory is a file reader that takes a header size and a binary file, maps the file to a read-only byte buffer, and provides methods to retrieve the header as its own ByteBuffer and to create iterators of different data types over the values of the file (starting after the end of the header).
This class is a static helper for implementing 'mode arguments' by tools.
 
 
 
Models segmented copy ratios from denoised copy ratios and segmented minor-allele fractions from allelic counts.
 
A container class for the molecule ID, which consists of an integer ID and a binary strand.
Molten for @Analysis modules.
For a paired-end aligner that aligns each end independently, select the pair of alignments that result in the largest insert size.
For a paired-end aligner that aligns each end independently, select the pair of alignments that result in the largest insert size.
 
 
 
 
A DeBruijnVertex that supports multiple copies of the same kmer
Represents a segmented model for copy ratio and allele fraction.
MultiFeatureWalker<F extends htsjdk.tribble.Feature>
A MultiFeatureWalker is a tool that presents one Feature at a time in sorted order from multiple sources of Features.
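One common way to present items in sorted order from several individually sorted sources is a priority-queue merge. The sketch below shows that general technique on plain Java collections; the class name and signature are hypothetical, and it makes no claim about how MultiFeatureWalker is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SortedMergeSketch {
    // One queue entry: the current head of a source plus the iterator it came from.
    private static final class Entry<T> {
        final T head;
        final Iterator<T> source;

        Entry(T head, Iterator<T> source) {
            this.head = head;
            this.source = source;
        }
    }

    /** Merges individually sorted sources into one sorted list. */
    static <T> List<T> merge(List<? extends Iterable<T>> sources, Comparator<? super T> order) {
        PriorityQueue<Entry<T>> queue =
                new PriorityQueue<>((a, b) -> order.compare(a.head, b.head));
        for (Iterable<T> source : sources) {
            Iterator<T> it = source.iterator();
            if (it.hasNext()) {
                queue.add(new Entry<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            // Emit the smallest head, then refill from the same source.
            Entry<T> smallest = queue.poll();
            merged.add(smallest.head);
            if (smallest.source.hasNext()) {
                queue.add(new Entry<>(smallest.source.next(), smallest.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> sources = new ArrayList<>();
        sources.add(Arrays.asList(1, 4, 9));
        sources.add(Arrays.asList(2, 3, 10));
        System.out.println(merge(sources, Comparator.<Integer>naturalOrder()));
    }
}
```

The queue never holds more than one element per source, so memory use is proportional to the number of sources rather than the number of features.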
 
MultiFeatureWalker.MergingIterator<F extends htsjdk.tribble.Feature>
 
MultiFeatureWalker.PQContext<F extends htsjdk.tribble.Feature>
 
MultiFeatureWalker.PQEntry<F extends htsjdk.tribble.Feature>
 
Iterate over queryname-sorted SAM, and return each group of reads with the same queryname.
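Because the input is queryname-sorted, reads sharing a name are adjacent, so grouping needs only a single pass that starts a new group whenever the name changes. The following sketch shows that idea on generic records; the class name and the `nameOf` accessor are assumptions for illustration and do not correspond to a specific GATK or Picard signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class QuerynameGroupingSketch {
    /** Groups adjacent records that share a name; input must already be sorted by name. */
    static <R> List<List<R>> groupConsecutive(Iterable<R> sortedReads, Function<R, String> nameOf) {
        List<List<R>> groups = new ArrayList<>();
        List<R> current = new ArrayList<>();
        String currentName = null;
        for (R read : sortedReads) {
            String name = nameOf.apply(read);
            // A name change closes the current group.
            if (!current.isEmpty() && !name.equals(currentName)) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(read);
            currentName = name;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Toy records: "<queryname>/<end>".
        List<String> reads = Arrays.asList("q1/1", "q1/2", "q2/1");
        System.out.println(groupConsecutive(reads, s -> s.substring(0, s.indexOf('/'))));
    }
}
```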
A class to represent shards of read data spanning multiple intervals.
An interface to represent shards of arbitrary data spanning multiple intervals.
MultiLevelCollector<METRIC_TYPE extends htsjdk.samtools.metrics.MetricBase,HISTOGRAM_KEY extends Comparable<HISTOGRAM_KEY>,ARGTYPE>
MultiLevelCollector handles accumulating Metrics at different MetricAccumulationLevels (ALL_READS, SAMPLE, LIBRARY, READ_GROUP).
MultiLevelCollector<METRIC_TYPE extends htsjdk.samtools.metrics.MetricBase,Histogram_KEY extends Comparable,ARGTYPE>
MultiLevelCollector handles accumulating Metrics at different MetricAccumulationLevels (ALL_READS, SAMPLE, LIBRARY, READ_GROUP).
 
 
Abstract base class for reducible multi-level metrics collectors.
A MultiplePassReadWalker traverses input reads multiple times.
Implemented by MultiplePassReadWalker-derived tools.
A VariantWalker that makes multiple passes through the variants.
Edge class for connecting nodes in the graph that tracks some per-sample information.
Segments copy-ratio data and/or alternate-allele-fraction data from one or more samples using kernel segmentation.
Created by jcarey on 3/13/14.
NextSeq-style bcl's have all tiles for a cycle in a single file.
Parse .bcl.bgzf files that contain multiple tiles in a single file.
MultiTileFileUtil<OUTPUT_RECORD extends picard.illumina.parser.IlluminaData>
For file types for which there is one file per lane, with fixed record size, and all the tiles in it, so the s_.bci file can be used to figure out where each tile starts and ends.
Read filter file that contains multiple tiles in a single file.
Created by jcarey on 3/13/14.
Read locs file that contains multiple tiles in a single file.
MultiTileParser<OUTPUT_RECORD extends picard.illumina.parser.IlluminaData>
Abstract class for files with fixed-length records for multiple tiles, e.g.
MultiVariantDataSource aggregates multiple FeatureDataSources of variants, and enables traversals and queries over those sources through a single interface.
Class that defines the variant arguments used for a MultiVariantWalker.
 
A MultiVariantWalker is a tool that processes one variant at a time, in position order, from multiple sources of variants, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
A MultiVariantWalker that walks over multiple variant context sources in reference order and emits to client tools groups of all input variant contexts by their start position.
Class for executing MUMmer alignment pipeline to detect SNPs and INDELs in mismatching sequences.
Call somatic short mutations via local assembly of haplotypes.
Base class for filters that apply at the allele level.
Created by davidben on 9/15/16.
Base class for all Mutect2Filters
 
 
 
 
 
 
Naive methods for binomial genotyping of heterozygous sites from pileup allele counts.
 
Utilities to provide architecture-dependent native functions
 
A read transformer that refactors NDN cigar elements to one N element.
Utility class that error-corrects reads.
Wrapper utility class that holds, for each position in read, a list of bytes representing candidate corrections.
 
 
A canonical, master list of the standard NGS platforms.
Class to copy a file using java.nio.
An interface that defines a method to use to calculate a checksum on an InputStream.
An enum to allow for verbosity of logging progress of an NioFileCopierWithProgressMeter.
An object to hold the results of a copy operation performed by NioFileCopierWithProgressMeter.
An extension of Hadoop's LocalFileSystem that doesn't write (or verify) .crc files.
A collection representing a real valued vector generated by GermlineCNVCaller
A tool to count the number of non-N bases in a fasta file
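A minimal sketch of the counting idea, assuming FASTA input already split into lines: header lines start with `>`, and every other letter that is not `N` or `n` is counted toward the current contig. The class name and the per-contig map are illustrative choices, not the tool's actual output format.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class NonNBaseCountSketch {
    /** Returns, per contig (in file order), the number of bases that are not N or n. */
    static Map<String, Long> countNonN(Iterable<String> fastaLines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        String contig = null;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                // Contig name is the header text up to the first whitespace.
                contig = line.substring(1).trim().split("\\s+")[0];
                counts.put(contig, 0L);
            } else if (contig != null) {
                long nonN = 0;
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (Character.isLetter(c) && c != 'N' && c != 'n') {
                        nonN++;
                    }
                }
                counts.merge(contig, nonN, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countNonN(Arrays.asList(">chr1 desc", "ACGTNNAC", "nnGT", ">chr2", "NNNN")));
    }
}
```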
A version of the classic StandardPairHMMInputScoreImputator that allows for decoupled insertion and deletion penalties for the model.
 
 
 
Little program to "normalize" a fasta file to ensure that all lines of sequence are the same length, and are a reasonable length!
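The normalization described above amounts to re-wrapping each record's sequence at a fixed width. A minimal sketch, assuming the input is already split into lines and that a whole record fits in memory (a real tool would likely stream instead); the class name and `lineLength` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FastaWrapSketch {
    /** Re-wraps every record so each sequence line (except possibly the last) has lineLength bases. */
    static List<String> normalize(Iterable<String> lines, int lineLength) {
        if (lineLength < 1) {
            throw new IllegalArgumentException("lineLength must be positive");
        }
        List<String> out = new ArrayList<>();
        StringBuilder sequence = new StringBuilder();
        for (String line : lines) {
            if (line.startsWith(">")) {
                // A new header ends the previous record.
                flush(sequence, lineLength, out);
                out.add(line);
            } else {
                sequence.append(line.trim());
            }
        }
        flush(sequence, lineLength, out);
        return out;
    }

    // Writes the buffered sequence in fixed-width chunks and clears the buffer.
    private static void flush(StringBuilder sequence, int lineLength, List<String> out) {
        for (int start = 0; start < sequence.length(); start += lineLength) {
            int end = Math.min(start + lineLength, sequence.length());
            out.add(sequence.substring(start, end));
        }
        sequence.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(normalize(Arrays.asList(">a", "ACGTAC", "G", "", ">b", "TT"), 4));
    }
}
```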
Filters out reads marked as duplicates.
This class represents a pair of inferred genomic locations on the reference whose novel adjacency is generated due to an SV event (in other words, a simple rearrangement between two genomic locations) that is suggested by the input SimpleChimera, and complications as enclosed in BreakpointComplications in pinning down the locations to exact base pair resolution.
 
Stratifies by whether a site is in the list of known RODs (e.g., dbsnp by default)
 
Represents the nucleotide alphabet with support for IUPAC ambiguity codes.
Helper class to count the number of occurrences of each nucleotide code in a sequence.
 
Class to work with exclusive pairs of elements, for example, pairs of alleles that do not occur in the haplotypes
 
SVD using the ojAlgo library.
Stratifies the eval RODs into sites where the indel is 1 bp in length and those where the event is 2+.
A logger wrapper class which only outputs the first warning provided to it
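The behavior described, emitting only the first warning and silently dropping the rest, can be sketched with an atomic flag. This is an illustrative stand-in: the class name is hypothetical and a `Consumer<String>` replaces the real logger so the example stays self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class OneShotWarningSketch {
    private final Consumer<String> sink; // stands in for a real logger's warn method
    private final AtomicBoolean alreadyWarned = new AtomicBoolean(false);

    OneShotWarningSketch(Consumer<String> sink) {
        this.sink = sink;
    }

    /** Forwards the message only on the first call; later calls are ignored. */
    void warn(String message) {
        // compareAndSet succeeds for exactly one caller, even across threads.
        if (alreadyWarned.compareAndSet(false, true)) {
            sink.accept(message);
        }
    }

    public static void main(String[] args) {
        OneShotWarningSketch logger = new OneShotWarningSketch(System.err::println);
        logger.warn("shown once");
        logger.warn("never shown");
    }
}
```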
 
Contains methods for finding optical/co-localized/sequencing duplicates.
An argument collection for use with tools that mark optical duplicates.
Created by davidben on 4/27/16.
An argument collection for use with tools that accept zero or more input files containing Feature records (e.g., BED files, hapmap files, etc.).
An interval argument class that allows -L to be specified but does not require it.
An argument collection for use with tools that accept zero or more input files containing reads (e.g., BAM/SAM/CRAM files).
Picard default argument collection for an optional reference.
An argument collection for use with tools that optionally accept a reference file as input.
An ArgumentCollection with an optional output argument, and utility methods for printing String output to it. To use this class, add an @ArgumentCollection variable to your tool like so:
An argument collection for use with tools that accept zero or more input files containing VariantContext records (e.g., VCF files).
Count of read pairs in the F1R2 and F2R1 configurations supporting the reference and alternate alleles
Original Alignment annotation counts the number of alt reads where the original alignment contig doesn't match the current alignment contig
Miscellaneous tools, e.g.
Base interface for an output argument collection.
In multiple locations we need to know which cycles are output. As of now we output all non-skip cycles, but rather than sprinkle this knowledge throughout the parser code, OutputMapping provides all the data a client might want about the cycles to be output, including what ReadType they are.
Describes the mode of output for the caller.
An abstract class to allow for writing output for the Funcotator.
Settings that define text to capture from a process stream.
Filter out reads where the number of bases without soft-clips (M, I, X, and = CIGAR operators) is lower than a threshold.
The class manages reads and splices and tries to apply overhang clipping when appropriate.
An error metric for the errors involving bases in the overlapping region of a read-pair.
 
A calculator that estimates the error rate of the bases it observes, assuming that the reference is truth.
Class representing a pair of reads together with accompanying optical duplicate marking information.
Pair<X extends Comparable<X>,Y extends Comparable<Y>>
Simple Pair class.
Serializers for each subclass of PairedEnds which rely on implementations of serializations within each class itself
Struct-like class to store information about the paired reads for mark duplicates.
 
 
 
An iterator that takes a pair of iterators over VariantContexts and iterates over them in tandem.
Little class to hold a pair of VariantContexts that are in sync with one another.
Class for performing the pair HMM for global alignment.
 
 
Common interface for pair-hmm score calculators.
 
 
Helper class that implements calculations required by the PairHMM Finite State Automaton (FSA) model.
Arguments for native PairHMM implementations
 
Trims (hard clips) soft-clipped bases due to the following artifact: when a sequence and its reverse complement occur near opposite ends of a fragment, DNA damage (especially in the case of FFPE samples and ancient DNA) can disrupt base-pairing, causing a single-strand loop of the sequence and its reverse complement, after which end repair copies the true 5' end of the fragment onto the 3' end of the fragment.
A base class for Metrics for targeted panels.
 
Parallel copy a file or directory from Google Cloud Storage into the HDFS file system used by Spark
Represents a parameter value with a named ParameterEnum key.
 
Interface for tagging an enum that represents the name of every Parameter comprising a ParameterizedState.
 
Represents a parameterized model.
Builder for constructing a ParameterizedModel to be Gibbs sampled using GibbsSampler.
 
Represents a mapped collection of Parameter objects, i.e., named, ordered, enumerated keys associated with values of mixed type via a key -> key, value map.
Interface for generating random samples of a Parameter value, given a ParameterizedState and a DataCollection.
 
This class should eventually be merged into Utils, which is in hellbender, and then this class should be deleted.
A specialized read walker that may be gracefully stopped before the input stream ends. A tool derived from this class should implement PartialReadWalker.shouldExitEarly(GATKRead) to indicate when to stop.
It allows you to ask whether a given interval is near the beginning or end of the partition.
Dummy class used for preserving reads that need to be marked as non-duplicate despite not wanting to perform any processing on the reads.
Pass-Through Downsampler: Implementation of the ReadsDownsampler interface that does no downsampling whatsoever, and instead simply "passes-through" all the reads it's given.
Path<V extends BaseVertex,E extends BaseEdge>
A path through a BaseGraph. Class to keep track of paths.
Iterate through the lines of a Path.
A class whose purpose is to initialize the various plugins that provide Path support.
Produce a set of k-mers from the given host reference.
Build an annotated taxonomy datafile for a given microbe reference.
Align reads to a microbe reference using BWA-MEM and Spark.
Filters low complexity, low quality, duplicate, and host reads.
Combined tool that performs all PathSeq steps: read filtering, microbe reference alignment and abundance scoring
Classify reads and estimate abundances of each taxon in the reference.
Represents a .ped file of family information as documented here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. Stores the information in memory as a map of individualId -> Pedigree information for that individual.
A common interface for handling annotations that require pedigree file information either in the form of explicitly selected founderIDs or in the form of an imported pedigreeFile.
 
Reads PED file-formatted tabular text files. See http://www.broadinstitute.org/mpg/tagger/faq.html and http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped. The "ped" file format refers to the widely-used format for linkage pedigree data.
 
An enum that specifies which, if any, of the standard PED fields are missing from the input records.
Apply an annotation based on aggregation data from all reads supporting each allele.
A container for allele to value mapping.
 
Represent a permutation of an ordered set or list of elements.
Given 1-dimensional data, finds all local minima sorted by decreasing topological persistence.
 
 
PerTileParser<ILLUMINA_DATA extends picard.illumina.parser.IlluminaData>
Abstract base class for Parsers that open a single tile file at a time and iterate through them.
 
A Collector for individual ExampleMultiMetrics for a given SAMPLE or SAMPLE/LIBRARY or SAMPLE/LIBRARY/READ_GROUP (depending on aggregation levels)
A Collector for individual InsertSizeMetrics for a given SAMPLE or SAMPLE/LIBRARY or SAMPLE/LIBRARY/READ_GROUP (depending on aggregation levels)
PerUnitMetricCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable<HKEY>,ARGTYPE>
PerRecordCollector - An interface for classes that collect data in order to generate one or more metrics.
PerUnitMetricCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable,ARGTYPE>
PerRecordCollector - An interface for classes that collect data in order to generate one or more metrics.
Argument Collection which holds parameters common to classes that want to add PG tags to reads in SAM/BAM files
Small interface that provides access to the physical location information about a cluster.
Stores the minimal information needed for optical duplicate detection.
This stores records that are comparable for detecting optical duplicates.
Small class that provides access to the physical location information about a cluster.
Small class that provides access to the physical location information about a cluster.
This is the main class of Picard and is the way of executing individual command line programs.
Base class for all Picard tools.
Adapter shim for use within GATK to run Picard tools.
Basic Picard runtime exception that, for now, does nothing much
Custom Barclay-based Javadoc Doclet used for generating Picard help/documentation.
The Picard Documentation work unit handler class that is the companion to PicardHelpDoclet.
A Subclass of HtsPath with conversion to Path making use of IOUtil
Exception used to propagate non-zero return values from Picard tools.
Prints read alignments in samtools pileup format.
Helper class for handling pileup allele detection supplement for assembly.
Set of arguments for configuring the pileup detection code
Represents an individual base in a reads pileup.
 
Prints read alignments in samtools pileup format.
Created by David Benjamin on 2/14/17.
 
 
Keep only reads where the Read Group platform attribute (RG:PL tag) contains the given string.
Filter out reads where the platform unit attribute (PU tag) contains the given string.
Information about the number of chromosomes per sample at a given location.
 
Creates plots of standardized and denoised copy ratios.
Creates plots of denoised and segmented copy-ratio and minor-allele-fraction estimates.
 
Created by jcarey on 3/13/14.
The pos file format is one of 3 Illumina formats (pos, locs, and clocs) that store position data exclusively.
PositionalDownsampler: Downsample each stack of reads at each alignment start to a size <= a target coverage using a ReservoirDownsampler.
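The reservoir step can be sketched with classic reservoir sampling (Algorithm R), which keeps a uniform random subset of at most `capacity` items from a stream of unknown length. This is an illustration of the general technique only; the class below is hypothetical and is not the ReservoirDownsampler API. Under this scheme, a positional downsampler would hold one such reservoir per alignment start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReservoirSketch<T> {
    private final int capacity;
    private final Random rng;
    private final List<T> reservoir = new ArrayList<>();
    private long seen;

    ReservoirSketch(int capacity, long seed) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be non-negative");
        }
        this.capacity = capacity;
        this.rng = new Random(seed);
    }

    /** Offers one item; it is retained with probability capacity / (items seen so far). */
    void submit(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
        } else {
            // Pick a slot uniformly in [0, seen); replace only if it falls inside the reservoir.
            long slot = (long) (rng.nextDouble() * seen);
            if (slot < capacity) {
                reservoir.set((int) slot, item);
            }
        }
    }

    /** Returns a copy of the currently retained items. */
    List<T> items() {
        return new ArrayList<>(reservoir);
    }

    public static void main(String[] args) {
        ReservoirSketch<Integer> stack = new ReservoirSketch<>(3, 7L);
        for (int i = 0; i < 100; i++) {
            stack.submit(i);
        }
        System.out.println(stack.items().size() + " reads kept out of 100");
    }
}
```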
Summary
PosParser parses multiple files formatted as one of the three file formats that contain position information only (pos, locs, and clocs).
Existence of a de novo mutation in at least one of the given families
 
 
A class to wrangle all the various and sundry genotype posterior options, mostly from CalculateGenotypePosteriors
 
Postprocesses the output of GermlineCNVCaller and generates VCF files as well as a concatenated denoised copy ratio file.
Performs post-processing steps to get a bam aligned to a transcriptome ready for RSEM (https://github.com/deweylab/RSEM). Suppose the read name "Q1" aligns to multiple loci in the transcriptome.
 
Performs on-the-fly filtering of the provided VariantContext Iterator such that only variants that satisfy all predicates are emitted.
 
Prepares bins for coverage collection.
It is useful to define a key such that the key will occur at most once among the primary alignments in a given file (assuming the file is valid).
It is useful to define a key such that the key will occur at most once among the primary alignments in a given file (assuming the file is valid).
Given a set of alignments for a read or read pair, mark one alignment as primary, according to whatever strategy is appropriate.
Given a set of alignments for a read or read pair, mark one alignment as primary, according to whatever strategy is appropriate.
A diagnostic tool that prints information about the compressed blocks in a BGZF format file, such as a .vcf.gz file or a .bam file.
 
 
Prints (and optionally subsets) an rd (DepthEvidence) file or a counts file as one or more (for multi-sample DepthEvidence files) counts files for CNV determination.
Write reads from SAM format file (SAM/BAM/CRAM) that pass criteria to a new file.
 
 
Merges locus-sorted files of evidence for structural variation into a single output file.
Print out variants from a VCF file.
Facade to Runtime.exec() and java.lang.Process.
Command acknowledgements that are returned from a process managed by StreamingProcessController.
 
 
 
 
Facilitate consistent logging output when progressing through a stream of SAM records.
A basic progress meter to print out the number of records processed (and other metrics) during a traversal at a configurable time interval.
ProgressReportingDelegatingCodec<A extends htsjdk.tribble.Feature,B>
This class is useful when we want to report progress when indexing.
Utility for loading properties files from resources.
Class representing the change in the protein sequence for a specific reference/alternate allele pair in a variant.
 
Loads Bwa index and aligns reads.
Wrapper class for using the PathSeq Bwa aligner class in Spark.
 
Aligns using BWA and filters out reads above the minimum coverage and identity.
Utility functions for PathSeq Bwa tool
Performs PathSeq filtering steps and manages associated resources.
 
Dummy filter metrics class that does nothing
Logs filtering read counts to metrics file
Interface for filter metrics logging
Metrics that are calculated during the PathSeq filter
Kmer Bloom Filter class that encapsulates the filter, kmer size, and kmer mask
 
Classes that provide a way to test kmers for set membership and keep track of the kmer size and mask
Kmer Hopscotch set class that encapsulates the filter, kmer size, and kmer mask
 
PathSeq utilities for kmer libraries
Class for separating paired and unpaired reads in an RDD
Stores taxonomic IDs that were hits of a read pair.
Helper class for ClassifyReads that stores the name, taxonomic class and parent, reference length, and reference contig names of a given taxon in the pathogen reference.
Pathogen abundance scores assigned to a taxonomic node and reported by the PathSeqScoreSpark tool.
 
Logs number of mapped and unmapped reads to metrics file
Interface for score metrics logging
Metrics that are calculated during the PathSeq scoring
 
Important NCBI taxonomy database constants
Helper class for holding taxonomy data used by ClassifyReads
 
Represents a taxonomic tree with nodes assigned a name and taxonomic rank (e.g.
 
Node class for PSTree
 
Common functions for PathSeq
A class that receives a stream of elements and transforms or filters them in some way, such as by downsampling with a Downsampler.
Iterator wrapper around our generic PushPullTransformer interface.
Base class for services for executing Python Scripts.
Enum of possible executables that can be launched by this executor.
Generic service for executing Python Scripts.
Python script execution exception.
Given an HDF5 file containing annotations for a training set (in the format specified by VariantAnnotationsModel.trainAndSerialize(java.io.File, java.lang.String)), a Python script containing modeling code, and a JSON file containing hyperparameters, the PythonSklearnVariantAnnotationsModel.trainAndSerialize(java.io.File, java.lang.String) method can be used to train a model.
Given an HDF5 file containing annotations for a test set (in the format specified by VariantAnnotationsScorer.score(java.io.File, java.io.File)), a Python script containing scoring code, and a file containing a pickled Python lambda function for scoring, the PythonSklearnVariantAnnotationsScorer.score(java.io.File, java.io.File) method can be used to generate scores.
Filters out sites that have a QD annotation applied to them and where the QD value is lower than a lower limit.
A template name and an intervalId.
 
Class to find the template names associated with reads in specified intervals.
Iterates over reads, kmerizing them, and checking the kmers against a set of KmerAndIntervals to figure out which intervals (if any) a read belongs in.
Class that acts as a mapper from a stream of reads to a stream of KmerAndIntervals.
Class that acts as a mapper from a stream of reads to a stream of <kmer,qname> pairs for a set of interesting kmers.
Variant confidence normalized by unfiltered depth of variant samples
The Reported Quality Score covariate.
Charts quality score distribution within a BAM file.
Charts quality score distribution within a BAM file.
QualityUtils is a static class with some utility methods for manipulating quality scores.
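As background for what such utilities typically compute, the standard Phred relationship is Q = -10 log10(p), where p is the error probability. The sketch below shows the two conversions; the class and method names are hypothetical and are not QualityUtils' actual method names.

```java
final class PhredSketch {
    /** Error probability for a Phred-scaled quality: p = 10^(-Q/10). */
    static double errorProbability(int qual) {
        return Math.pow(10.0, -qual / 10.0);
    }

    /** Phred-scaled quality for an error probability, rounded to the nearest integer. */
    static int phredScaled(double errorProb) {
        return (int) Math.round(-10.0 * Math.log10(errorProb));
    }

    public static void main(String[] args) {
        System.out.println("Q30 error probability: " + errorProbability(30));
        System.out.println("p=0.01 as Phred: " + phredScaled(0.01));
    }
}
```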
A set of metrics used to describe the general quality of a BAM file
MetricsArgumentCollection argument collection for QualityYield metrics.
QualityYieldMetricsCollector for Spark.
A general algorithm for quantizing quality score distributions to use a specific number of levels. Takes a histogram of quality scores and a desired number of levels and produces a map from original quality scores -> quantized quality scores.
Class that encapsulates the information necessary for quality score quantization for BQSR
A BigQuery record for a row of results from a query.
A collection of helper utilities for iterating through reads that are in query-name sorted read order as pairs
 
 
 
This is a specialized HaplotypeCaller tool, designed to allow for breaking the monolithic haplotype caller process into smaller discrete steps.
 
 
 
This is a specialized haplotype caller engine, designed to allow for breaking the monolithic haplotype caller process into smaller discrete steps.
 
 
 
While structurally identical to CompositeIndex, this class is maintained as it makes code more readable when the two are used together (see QSeqParser)
Abstract root for all RankSum based annotations
INFO level annotation of the counts of genotypes with respect to the reference allele.
Replace bases in reads with reference bases.
Classes, methods, and enums that deal with the stratification of read bases and reference information.
Stratifies into quintiles of read cycle.
Stratifies according to the number of matching cigar operators (from CIGAR string) that the read has.
A CollectionStratifier is a stratifier that uses a collection of stratifiers to inform the stratification.
Types of consensus reads as determined by the number of duplicates used from first and second strands.
Stratify by tags used during duplex and single index consensus calling.
An enum designed to hold a binned version of any probability-like number (between 0 and 1) in quintiles
Stratifies bases by their read's tile, which is parsed from the read name.
Stratifies bases by their read's X coordinate, which is parsed from the read name.
Stratifies bases by their read's Y coordinate, which is parsed from the read name.
A stratifier that uses GC (of the read) to stratify.
Stratifies according to the length of an insertion or deletion.
Stratifies according to the number of indel bases (from CIGAR string) that the read has.
 
Stratify bases according to the type of Homopolymer that they belong to (repeating element, final reference base and whether the length is "long" or not).
Stratifies according to the overall mismatches (from SAMTag.NM) that the read has against the reference, NOT including the current base.
Stratify by the number of Ns found in the read.
An enum for holding a read's read-pair orientation (i.e.
A PairStratifier is a stratifier that uses two other stratifiers to inform the stratification.
An enum to hold information about the "properness" of a read pair
An enum for holding the direction for a read (positive strand or negative strand)
An enum to hold the ordinality of a read
The main interface for a stratifier.
Trivial wrapper around a GATKRead iterator that saves all reads returned in a cache, which can be periodically returned and emptied by the client.
Figures out what kind of BreakpointEvidence, if any, a read represents.
A comprehensive clipping tool.
Constants for use with the GATKRead interface
ReadContextData is additional data that's useful when processing reads.
Comparator for sorting Reads by coordinate.
The object temporarily held by a read that describes all of its covariates.
Data for a single end of a paired-end read, a barcode read, or for the entire read if not paired end.
Tools that manipulate read data in SAM, BAM or CRAM format
Represents one set of cycles in a ReadStructure (e.g.
Little struct-like class to hold read pair (and fragment) end data for duplicate marking.
Little struct-like class to hold read pair (and fragment) end data for MarkDuplicatesWithMateCigar
Codec for ReadEnds that just outputs the primitive fields and reads them back.
Interface for storing and retrieving ReadEnds objects.
 
Created by nhomer on 9/13/15.
A class to store individual records for MarkDuplicatesWithMateCigar.
 
Splits a reader by some value.
Filters which operate on GATKRead should subclass this by overriding ReadFilter.test(GATKRead). ReadFilter implements Predicate and Serializable.
 
 
 
An iterator that filters reads from an existing iterator of reads.
Standard ReadFilters
Do not filter out any read.
Filter out reads containing skipped region from the reference (CIGAR strings with 'N' operator).
Keep only reads that are first of pair (0x1 and 0x40).
Keep only reads containing good CIGAR strings.
Filter out reads without the SAM record RG (Read Group) tag.
Filter out unmapped reads.
Filter out reads without available mapping quality (MAPQ=255).
Filter out reads with mapping quality equal to zero.
Filter out reads where the bases and qualities do not match in length.
For paired reads (0x1), keep only reads that are mapped, have a mate that is mapped (read is not 0x8), and both the read and its mate are on different strands (when read is 0x20, it is not 0x10), as is the typical case.
Keep only reads that have a mate mapping to the same contig (RNEXT is "="), are single-ended (not 0x1), or have an unmapped mate (0x8).
Filter reads whose mate is unmapped as well as unmapped reads.
If original alignment and mate original alignment tags exist, filter reads that were originally chimeric (mates were on different contigs).
Filter out reads with fragment length (insert size) different from zero.
Filter out reads that do not align to the reference.
Filter out reads marked as duplicate (0x400).
Keep only paired reads that are marked as not properly paired (0x1 and !0x2).
Filter out reads representing secondary alignments (0x100).
Filter out reads representing supplementary alignments (0x800).
Filter out unpaired reads (not 0x1).
Filter out reads failing platform/vendor quality checks (0x200).
Keep only reads representing primary alignments (those that satisfy both the NotSecondaryAlignment and NotSupplementaryAlignment filters, or in terms of SAM flag values, must have neither of the 0x100 or 0x800 flags set).
Keep only paired reads that are properly paired (0x1 and 0x2).
Filter out reads where the read and CIGAR do not match in length.
Keep only paired reads (0x1) that are second of pair (0x80).
Keep only reads with sequenced bases.
Keep only reads where the read end corresponds to a proper alignment -- that is, the read ends after the start (non-negative number of bases in the reference).
Keep only reads that have a valid alignment start (POS larger than 0) or are unmapped.
 
Keep records that don't match the specified filter string(s).
The Read Group covariate.
A read filter to test if the read's readGroup has a flow order associated with it
Splits readers by read group ID.
Keep only reads from the specified read group.
Splits a reader based on a value from a read group.
An abstract argument collection for use with tools that accept input files containing reads (e.g., BAM/SAM/CRAM files).
Keep only reads whose length is ≥ min value and ≤ max value.
A cut-down version of AssemblyRegion that doesn't store reads, used in the strict implementation of FindAssemblyRegionsSpark.
Common interface for assembly-haplotype vs reads likelihood engines.
 
A bag of data about reads: contig name to id mapping, fragment length statistics by read group, mean length.
 
 
A class to track the genomic location of the start of the first and last mapped reads in a partition.
 
 
 
 
 
Provides access to the physical location information about a cluster.
Keep only reads with this read name.
Created by tsato on 3/28/18.
 
Data structure that contains the set of reads sharing the same queryname, including the primary, secondary (i.e.
Represents a pileup of reads at a given position.
Median distance of variant starts from ends of reads supporting each alt allele.
 
Rank Sum Test for relative positioning of REF versus ALT alleles within reads
Compares GATKReads by queryname; duplicates the exact ordering of SAMRecordQueryNameComparator.
 
Wrapper around ReadsDataSource that presents reads overlapping a specific interval to a client, without improperly exposing the entire ReadsDataSource interface.
An interface for managing traversals over sources of reads.
An extension of the basic downsampler API with reads-specific operations
Iterator wrapper around our generic ReadsDownsampler interface.
Find <intervalId,list> pairs for interesting template names.
Encodes a unique key for read, read pairs and fragments.
Key class for representing relevant duplicate marking identifiers into a single long key for fragment data.
Key class for representing relevant duplicate marking identifiers as two long key values for pair data.
Manages traversals and queries over sources of reads which are accessible via Paths (for now, SAM/BAM/CRAM files only).
ReadsPipelineSpark is our standard pipeline that takes unaligned or aligned reads and runs BWA (if specified), MarkDuplicates, BQSR, and HaplotypeCaller.
ReadsSparkSink writes GATKReads to a file.
Loads the reads from disk either serially (using samReaderFactory) or in parallel using Hadoop-BAM.
Keep only reads whose strand is either forward (not 0x10) or reverse (0x10), as specified.
Describes the intended logical output structure of clusters of an Illumina run.
A container class for a set of reads that share the same unique molecular identifier (UMI) as judged by FGBio GroupReadsByUmi (http://fulcrumgenomics.github.io/fgbio/tools/latest/GroupReadsByUmi.html). Examples of molecule IDs (MI tag): "0/A" (the first molecule in the BAM, A strand), "0/B" (the first molecule in the BAM, B strand), "99/A" (100th molecule in the BAM, A strand). For a given set of reads with the same molecule number, the strand with a larger number of reads is defined as the A strand.
Possible output formats when writing reads.
Keep only reads that contain a tag with a value that agrees with parameters as specified.
 
 
Set of arguments related to the ReadThreadingAssembler
Note: not final but only intended to be subclassed for testing.
Classes which perform transformations from GATKRead -> GATKRead should implement this interface by overriding SerializableFunction<GATKRead,GATKRead>#apply(GATKRead)
 
An iterator that transforms read (i.e.
A read type describes a stretch of cycles in a ReadStructure (e.g.
A miscellaneous collection of utilities for working with reads, headers, etc.
A ReadWalker is a tool that processes a single read at a time from one or multiple sources of reads, with optional contextual information from a reference and/or sets of variants/Features.
Encapsulates a GATKRead with its ReferenceContext and FeatureContext.
A Spark version of ReadWalker.
 
 
Condense homRef blocks in a single-sample GVCF
 
Combines variants into GVCF blocks.
 
 
An individual piece of recalibration data.
A collection of the arguments that are used for BQSR.
This class has all the static functionality for reading a recalibration report file into memory.
Utility class to facilitate base quality score recalibration.
This helper class holds the data HashMap as well as submaps that represent the marginal distributions collapsed over all needed dimensions.
An interface for annotations that are calculated using raw data across samples, rather than the median (or median of medians) of sample values. The raw annotation keeps some summary (one example might be a histogram of the raw values for each sample) of the individual sample-level (or allele-level) annotation.
A class to encapsulate the raw data for classes compatible with the ReducibleAnnotation interface
Base interface for a reference argument collection.
Local reference context at a variant position.
ReferenceBases stores the bases of the reference genome for a particular interval.
Evaluate GVCF reference block concordance of an input GVCF against a truth GVCF.
Reference confidence emission modes.
Code for estimating the reference confidence. This code can estimate the probability that the data for a single sample is consistent with a well-determined REF/REF diploid genotype.
Holds information about a genotype call of a single sample reference vs.
 
Variant context utilities related to merging variant-context instances.
 
Wrapper around ReferenceDataSource that presents data from a specific interval/window to a client, without improperly exposing the entire ReferenceDataSource interface.
Manages traversals and queries over reference data.
Manages traversals and queries over reference data (for now, fasta files only). Supports targeted queries over the reference by interval, but does not yet support complete iteration over the entire reference.
Class to load a reference sequence from a fasta file on Spark.
Class to load a reference sequence from a fasta file on HDFS.
An abstract ArgumentCollection for specifying a reference sequence file
Manages traversals and queries over in-memory reference data.
Wrapper to load a reference sequence from a file stored on HDFS, GCS, or locally.
Class representing a pair of references and their differences.
 
Tools that analyze and manipulate FASTA format references
Table utilized by CompareReferences tool to compare and analyze sequences found in specified references.
ReferenceShard is a section of the reference genome that's used for sharding work for pairing things with the reference.
Internal interface to load a reference sequence.
A ReferenceSource impl that is backed by a .2bit representation of a reference genome.
A collection of static methods for dealing with references.
A reference walker is a tool which processes each base in a given reference.
A library of reference window functions suitable for passing in to transforms such as AddContextDataToRead.
A function for requesting a fixed number of extra bases of reference context on either side of each read.
Loads gene annotations from a refFlat file into an OverlapDetector.
 
Class which contains utility functions that use reflection.
Allows for reading in RefSeq information.
The ref seq feature.
Created by IntelliJ IDEA.
Holds information about a genotype call of a single sample reference vs.
Remove indels that are close to another indel from a vcf file.
Renames a sample within a VCF or BCF.
Reorders a SAM/BAM input file according to the order of contigs in a second reference file.
 
Little struct-like class to hold a record index, the index of the corresponding representative read, and duplicate set size information.
Codec for read names and integers that outputs the primitive fields and reads them back.
An argument collection for use with tools that accept one or more input files containing Feature records (e.g., BED files, hapmap files, etc.), and require at least one such input.
An ArgumentCollection that requires one or more intervals be specified with -L at the command line
 
An argument collection for use with tools that accept one or more input files containing reads (e.g., BAM/SAM/CRAM files), and require at least one such input.
Argument collection for references that are required (and not common).
An argument collection for use with tools that require a reference file as input.
 
An argument collection for use with tools that accept one or more input files containing VariantContext records (e.g., VCF files), and require at least one such input.
Reservoir Downsampler: Selects n reads out of a stream whose size is not known in advance, with every read in the stream having an equal chance of being selected for inclusion.
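The selection scheme described for the reservoir downsampler can be illustrated with classic reservoir sampling (often called Algorithm R). The sketch below is a generic, hypothetical class and not the GATK implementation; the class name, the fixed seed parameter, and the limit of fewer than 2^31 submitted items are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling; not the GATK ReservoirDownsampler. */
final class ReservoirSketch<T> {
    private final int capacity;
    private final Random rng;
    private final List<T> reservoir;
    private int seen = 0; // number of items submitted so far (sketch assumes < 2^31)

    ReservoirSketch(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.rng = new Random(seed);
        this.reservoir = new ArrayList<>(capacity);
    }

    /** Offers one item from the stream; the stream length need not be known in advance. */
    void submit(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            // The first `capacity` items always enter the reservoir.
            reservoir.add(item);
        } else {
            // Draw a slot uniformly from [0, seen); keep the item only if the
            // slot falls inside the reservoir, replacing the current occupant.
            int slot = rng.nextInt(seen);
            if (slot < capacity) {
                reservoir.set(slot, item);
            }
        }
    }

    /** Returns a copy of the currently selected items. */
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }
}
```

After every call to `submit`, each item seen so far has the same probability of being in the reservoir, which is the property the description refers to.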
Stores a resource by path and a relative class.
 
This tool reverts the original base qualities (if specified) and adds the mate cigar tag to mapped SAM, BAM or CRAM files.
Used as a return for the canSkipSAMFile function.
Reverts a SAM file by optionally restoring original quality scores and by removing all alignment information.
 
Reverts a SAM file by optionally restoring original quality scores and by removing all alignment information.
 
Util class for executing R scripts.
Root Mean Square of the mapping quality of reads across all samples.
Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
 
 
Holds information about CpG sites encountered for RRBS processing QC
 
Holds summary statistics from RRBS processing QC
Generic service for executing RScripts
 
Libraries embedded in the StingUtils package.
 
Class that takes in a set of alignment information in SAM format and merges it with the set of all reads for which alignment was attempted, stored in an unmapped SAM file.
Class that takes in a set of alignment information in SAM format and merges it with the set of all reads for which alignment was attempted, stored in an unmapped SAM file.
Compare two SAM/BAM files.
 
Argument collection for SAM comparison
Metric for results of SamComparison.
Class used to direct output from a HaplotypeBAMWriter to a bam/sam file.
A GATKRead writer that writes to a SAM/BAM file.
Converts a BAM file to human-readable SAM output or vice versa
Decoder for single sample SAM pileup data.
Simple representation of a single base with associated quality from a SAM pileup
A tribble feature representing a SAM pileup.
Stratifies the eval RODs by each sample in the eval ROD.
Represents an individual under study.
Simple database for managing samples
Class for creating a temporary in memory database of samples.
List samples that are non-reference at a given variant site
An immutable, indexed set of samples.
Interface for marking objects that contain metadata associated with a collection of locatables associated with a single sample.
Interface for marking objects that contain metadata associated with a single sample.
A class to hold the mappings of sample names to VCF / VCF index paths.
Splits readers by sample name.
Utility class to determine the tumor-normal pairs that make up a VCF header
Keep only reads for a given sample.
An iterator that allows for traversals over a SamReader restricted to a set of intervals, unmapped reads, or both.
 
 
SAMRecordAndReferenceMultiLevelCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable<HKEY>>
 
SAMRecordAndReferenceMultiLevelCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable>
 
SAMRecordMultiLevelCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable<HKEY>>
Defines a MultilevelPerRecordCollector using the argument type of SAMRecord so that this doesn't have to be redefined for each subclass of MultilevelPerRecordCollector
SAMRecordMultiLevelCollector<BEAN extends htsjdk.samtools.metrics.MetricBase,HKEY extends Comparable>
Defines a MultilevelPerRecordCollector using the argument type of SAMRecord so that this doesn't have to be redefined for each subclass of MultilevelPerRecordCollector
Efficient serializer for SAMRecords that uses SAMRecordSparkCodec for encoding/decoding.
A class that uses a slightly adapted version of BAMRecordCodec for serialization/deserialization of SAMRecords.
Implementation of the GATKRead interface for the SAMRecord class.
Efficient serializer for SAMRecordToGATKReadAdapters that uses SAMRecordSparkCodec for encoding/decoding.
Wraps a SAMRecord iterator within an iterator of GATKReads.
This class sets the duplicate read flag as the result state when examining sets of records.
Class to take unmapped reads in SAM/BAM/CRAM file format and create Maq binary fastq format file(s) -- one or two of them, depending on whether it's a paired-end read.
Extracts read sequences and qualities from the input SAM/BAM file and writes them into the output file in Sanger FASTQ format.
Extracts read sequences and qualities from the input SAM/BAM file and SAM/BAM tags and writes them into output files in Sanger FASTQ format.
A builder class that expands functionality for SA tags.
A Tool for breaking up a reference into intervals of alternating regions of N and ACGT bases.
 
Scores variant calls in a VCF file based on site-level annotations using a previously trained model.
Base class for executors that find and run scripts in an external script engine process (R, Python, etc).
Base type for exceptions thrown by the ScriptExecutor.
For extracting simple variants from input GATK-SV complex variants.
 
 
 
 
Class that represents the exon numbers overlapped by a genomic region.
 
Select a subset of variants from a VCF file
A graph that contains base sequence at each node
A simple data object to hold a comparison between a reference sequence and an alternate allele.
A series of utility functions that enable the GATK to compare two sequence dictionaries -- from the reference, from BAMs, or from feature sources -- for consistency.
Class with helper methods for generating and writing SequenceDictionary objects.
 
 
Interface for argument collections that control how sequence dictionary validation should be handled.
Doesn't provide a configuration argument and always returns false; useful for tools that do not want to perform sequence dictionary validation, such as aligners.
Most tools will want to use this; it defaults to performing sequence dictionary validation but provides the option to disable it.
In broad terms, each sequencing platform can be classified by whether it flows nucleotides in some order such that homopolymers get sequenced in a single event (e.g., 454 or Ion) or it reads each position in the sequence one at a time, regardless of base composition (e.g., Illumina or SOLiD).
 
Bait bias artifacts broken down by context.
Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
Pre-adapter artifacts broken down by context.
Summary analysis of a single pre-adapter artifact.
A graph vertex containing a sequence of bases and a unique ID that allows multiple distinct nodes in the graph to have the same sequence.
 
Represents a Function that is Serializable.
 
 
Deprecated.
Fixes the NM, MD, and UQ tags in a SAM or BAM file.
Set size utility
ENUM of possible human sexes: male, female, or unknown
Represents the sex of an individual.
A Shard of records of type T covering a specific genomic interval, optionally expanded by a configurable amount of padded data, that provides the ability to iterate over its records.
Holds the bounds of a Shard, both with and without padding
A Shard backed by a ShardBoundary and a collection of records.
Iterator that will break up each input interval into shards.
Variant writer that splits output to multiple VCFs given the maximum records per file.
Adapts a normal Shard into a MultiIntervalShard that contains only the single wrapped shard. This is a temporary shim until we can fully adopt MultiIntervalShard in HaplotypeCallerSpark.
Merges the incoming vertices of a vertex V of a graph. Looks at the vertices that are incoming to V (i.e., have an outgoing edge connecting to V).
Split a collection of middle nodes in a graph into their shared prefix and suffix values. This code performs the following transformation.
Create a fasta with the bases shifted by an offset. Define delta1 = offset - 1 and delta2 = total - delta1. To shift forward, given a position in the regular fasta (pos_r), the position in the shifted fasta (pos_s) is: if pos_r > delta1, pos_s = pos_r - delta1 == pos_r - offset + 1; otherwise pos_s = pos_r + delta2 == pos_r + total - offset + 1. To shift back, given a position in the shifted fasta (pos_s), the position in the regular fasta (pos_r) is: if pos_s > delta2, pos_r = pos_s - delta2 == pos_s - total + offset - 1; otherwise pos_r = pos_s + delta1 == pos_s + offset - 1. Example command line: ShiftFasta -R "<CIRCULAR_REFERENCE.fasta>" (the reference to shift) -O "<SHIFTED_REFERENCE.fasta>" (output; the shifted fasta) --shift-back-output "<SHIFT_BACK.chain>" (output; the shift-back chain file to use when lifting over) --shift-offset-list "" (optional; specifies the offset to shift for each contig in the reference).
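The forward and backward coordinate formulas in the ShiftFasta description can be written directly as two small functions. This is a sketch of the arithmetic only, assuming 1-based positions on a single contig of length `total`; the class and method names are hypothetical and not part of the tool.

```java
/** Hypothetical sketch of the ShiftFasta coordinate arithmetic (1-based positions). */
final class ShiftCoordinatesSketch {
    private ShiftCoordinatesSketch() {}

    /** Maps a position in the regular fasta (pos_r) to the shifted fasta (pos_s). */
    static long toShifted(long posR, long offset, long total) {
        final long delta1 = offset - 1;
        final long delta2 = total - delta1;
        return posR > delta1 ? posR - delta1 : posR + delta2;
    }

    /** Maps a position in the shifted fasta (pos_s) back to the regular fasta (pos_r). */
    static long toRegular(long posS, long offset, long total) {
        final long delta1 = offset - 1;
        final long delta2 = total - delta1;
        return posS > delta2 ? posS - delta2 : posS + delta1;
    }
}
```

With `total = 10` and `offset = 4`, regular position 4 becomes shifted position 1, regular position 3 wraps around to shifted position 10, and `toRegular` inverts both mappings.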
Tools that perform variant calling and genotyping for short variants (SNPs, SNVs and Indels)
Conceptually, a simple chimera represents the junction on AssemblyContigWithFineTunedAlignments that have exactly two good alignments.
Struct to represent the (distance - 1) between boundaries of the two alignments represented by this CA, on reference, and on read.
 
This caller is loosely based on the legacy ReCapSeg caller that was originally implemented in ReCapSeg v1.4.5.0, but introduces major changes.
Represents a count at an interval.
 
Simple data structure to pass and read/write a List of SimpleCount objects.
CONTIG, START, END, COUNT. Note: Unlike the package-private enums in other collection classes, this enum and its TableColumnCollection are public so that they can be accessed by SimpleCountCodec, which must be in org.broadinstitute.hellbender.utils.codecs to be discovered as a codec.
A calculator that estimates the error rate of the bases it observes, assuming that the reference is truth.
This utility class performs a simple tagging of germline segments in a tumor segments file.
Minimal immutable class representing a 1-based closed ended genomic interval SimpleInterval does not allow null contig names.
Represents a collection of SimpleInterval.
Factory for creating TableFuncotations by handling `Separated Value` files with arbitrary delimiters (e.g.
 
Metadata associated with a collection of locatables.
This is a simple tool to mark duplicates using the DuplicateSetIterator, DuplicateSet, and SAMRecordDuplicateComparator.
Simply a wrapper to link together a NovelAdjacencyAndAltHaplotype and its evidence SimpleChimeras.
Utility structs for extraction information from the consensus NovelAdjacencyAndAltHaplotype out of multiple ChimericAlignments, to be later added to annotations of the VariantContext extracted.
 
This deals with the special case where a contig has exactly two alignments and seemingly has the complete alt haplotype assembled.
Masks read bases with a supra-threshold number of A/T's or G/C's within a given window size.
Metadata associated with a collection of locatables associated with a single sample.
Metadata associated with a single sample.
Simple implementation of the SVD interface for storing the matrices (and vector) of a SVD result.
 
 
 
 
 
 
 
 
This class is very versatile, but as a result, it must do some lazy loading after it receives the first write command.
A simple TSV/CSV/XSV writer with support for writing in the cloud with configurable delimiter.
A class for finding the distance between a single barcode and a barcode-read (with base qualities)
 
Super class that is designed to provide some consistent structure between subclasses that simply iterate once over a coordinate sorted BAM and collect information from the records as they go in order to produce some kind of output.
Encompasses an aligner to a single-sequence reference.
 
Perform singular value decomposition (and pseudoinverse calculation).
The read depth of each base call for a sample at some locus.
Codec to handle SiteDepths in BlockCompressedInterval files
Codec to handle SiteDepths in tab-delimited text files
Imposes additional ordering of same-locus SiteDepth records by sample.
Merges locus-sorted SiteDepth evidence files, and calculates the bi-allelic frequency (baf) for each sample and site, and writes these values as a BafEvidence output file.
Implements slice sampling of a continuous, univariate, unnormalized probability density function (PDF), which is assumed to be unimodal.
Interface and factory for Smith-Waterman aligners
 
 
This class collects the various SWParameters that are used for various alignment procedures.
SmithWatermanIntelAligner class that converts an instance of SWAlignerNativeBinding into a SmithWatermanIntelAligner. This is optimized for Intel architectures; it can fail if the machine does not support AVX, in which case it throws a UserException.
Pairwise discrete Smith-Waterman alignment implemented in pure Java. IMPORTANT NOTE: This class assumes that all bytes come from UPPERCASED chars!
The state of a trace step through the matrix
Class to represent a SNP in context of a haplotype block that is used in fingerprinting.
Stratifies variants as genes or coding regions, according to the effect modifier, as indicated by snpEff.
 
Created with IntelliJ IDEA.
 
 
An implementation of a feature mapper that finds SNPs (SNVs). This class only finds SNPs that are surrounded by a specific number of bases identical to the reference.
Filter out reads where the ratio of soft-clipped bases to total bases exceeds some given value.
A model for the allele fraction spectrum of somatic variation.
 
 
 
Genome-wide VCF writer for somatic (Mutect2) output Merges reference blocks based on TLOD
Created by David Benjamin on 3/9/17.
 
 
 
 
SortedBasecallsConverter utilizes an underlying IlluminaDataProvider to convert parsed and decoded sequencing data from standard Illumina formats to specific output records (FASTA records/SAM records).
Summary
Sorts a SAM or BAM file.
SortSam on Spark (works on SAM/BAM/CRAM)
Sorts one or more VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate.
Command line arguments needed for configuring a spark context
 
Manages creation of the Spark context.
Class with helper methods to convert objects (mostly matrices) to/from Spark (particularly, in MLLib)
Utility methods for sharding Locatable objects (such as reads) for given intervals, without using a shuffle.
SVD using MLLib
Miscellaneous Spark-related utilities
SplitCRAM - split a cram file into smaller cram files (shards) containing a minimal number of records while still respecting container boundaries.
This tool takes in intervals via the standard arguments of IntervalArgumentCollection and splits them into interval files for scattering.
Splits reads that contain Ns in their cigar string (e.g.
Documents evidence of reads (of some sample at some locus) that align well to the reference for some portion of the read, and fail to align for another portion of the read.
Codec to handle SplitReadEvidence in BlockCompressedInterval files
Codec to handle SplitReadEvidence in tab-delimited text files
Imposes additional ordering of same-locus SplitReadEvidence by sample and strand.
Outputs reads from a SAM/BAM/CRAM by read group, sample and library name
Command-line program to split a SAM/BAM/CRAM file into separate files based on library name.
Splits the input queryname sorted or query-grouped SAM/BAM/CRAM file and writes it into multiple BAM files, each with an approximately equal number of reads.
Splits the input VCF file into two, one for indels and one for SNPs.
This is a marker interface used to indicate which annotations are "Standard".
A set of String constants in which the name of the constant (minus the _SHORT_NAME suffix) is the standard long Option name, and the value of the constant is the standard shortName.
This is pulled out so that every caller isn't exposed to the arguments from every other caller.
Represents the list of standard BQSR covariates.
 
This is a marker interface used to indicate which annotations are part of the standard flow based group
This is a marker interface used to indicate which annotations are "Standard" for the HaplotypeCaller only.
This is a marker interface used to indicate which annotations are "Standard" for Mutect2 only.
A set of String constants in which the name of the constant (minus the _SHORT_NAME suffix) is the standard long Option name, and the value of the constant is the standard shortName.
Standard or classic pair-hmm score imputator.
 
 
 
 
 
 
Number of forward and reverse reads that support REF and ALT alleles
Class of tests to detect strand bias.
Common strand bias utilities used by allele specific strand bias annotators
Class to represent a strand-corrected Allele.
Simple container class to represent bases that have been corrected for strandedness already.
Represents an interval and strand from the reference genome.
 
Strand bias estimated by the Symmetric Odds Ratio test
For symbolizing the change of strand from one alignment to the next of an assembly contig.
Represents the full state space of all stratification combinations
 
A basic interface for a class to be used with the StratificationManager system
Represents a decimation table.
Facade to Runtime.exec() and java.lang.Process.
Python executor used to interact with a cooperative, keep-alive Python process.
Various constants used by StreamingProcessController that require synchronized equivalents in the companion process, i.e., if the streaming process is written in Python, there must be equivalent Python constants for use by the Python code.
Where to read/write a stream
The content of stdout or stderr.
 
Removes /1 or /2 and any whitespace from the end of the read name if present
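One way to express the read-name cleanup described above is a single anchored regular expression. The sketch below is an assumption about how such a transformation could look, not the actual class; in particular, the pattern, the class name, and the choice to strip whitespace that follows (rather than precedes) the mate suffix are illustrative.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: strips a trailing "/1" or "/2" and trailing whitespace from a read name. */
final class MateSuffixSketch {
    // Optional "/1" or "/2", followed by any amount of whitespace, anchored at the end of the name.
    private static final Pattern MATE_SUFFIX = Pattern.compile("(/[12])?\\s*$");

    private MateSuffixSketch() {}

    static String strip(String readName) {
        return MATE_SUFFIX.matcher(readName).replaceFirst("");
    }
}
```

Names without a mate suffix or trailing whitespace are returned unchanged, and suffixes other than `/1` or `/2` (for example `/3`) are left in place.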
Class to create and access STR table file contents.
Utility class to compose the contents of the STR Table file.
 
Tools that detect structural variants
 
 
 
 
Runs the structural variation discovery workflow on a single sample
 
SubsettedLikelihoodMatrix<EVIDENCE extends htsjdk.samtools.util.Locatable,A extends htsjdk.variant.variantcontext.Allele>
Fast wrapper for a LikelihoodMatrix that uses only a subset of alleles.
 
Simple allele counter for SVs.
Adds gene overlap, predicted functional consequence, and noncoding element overlap annotations to a structural variant (SV) VCF from the GATK-SV pipeline.
 
 
 
 
 
Clusters structural variants based on coordinates, event type, and supporting algorithms.
Base class for clustering items that possess start/end genomic coordinates.
Available clustering algorithms
 
Arguments for use with SVClusterEngine.
Some useful functions for creating different kinds of SVClusterEngine.
 
This tool calculates SV genotype concordance between an "evaluation" VCF and a "truth" VCF.
Generates SV records annotated with concordance metrics given a pair of "evaluation" and "truth" SVs.
 
Variant context with additional method to mine the structural variant specific information from structural variant records.
Interface for SVD implementation.
Represents copy ratios for a sample that has been standardized and denoised by an SVDReadCountPanelOfNormals.
Utility class for package-private methods for performing SVD-based denoising and related operations.
Entry point for creating an instance of SVD.
(Internal) Examines aligned contigs from local assemblies and calls structural variants or their breakpoints
 
 
 
 
 
Interface for the panel of normals (PoN) for SVD-based coverage denoising.
An iterator over kmers with a specified maximum DUST-style, low-complexity score.
Memory-economical utilities for producing a FASTQ file.
 
 
 
 
 
 
Naturally collating, simple interval. WARNING: THIS IS NOT THE SAME AS THE BED COORDINATE SYSTEM OR SimpleInterval!
 
 
A Red-Black tree with intervals for keys.
 
 
 
 
 
Iterator over successive Kmers from a sequence of characters.
 
An immutable SVKmerLong.
 
An immutable SVKmerShort.
 
Any class with loci that are potentially on different chromosomes should implement this interface.
 
 
 
 
Various types of structural variations.
Useful scraps of this and that.
 
 
A utility class that writes out variants to a VCF file.
A wrapper that converts instances of SWAlignerNativeBinding into a SmithWatermanAligner
An annotation to denote Configuration options that should be injected into the Java System Properties.
Parser for tab-delimited files
Parse a tabbed text file in which columns are found by looking at a header line rather than by position.
Reads tab-delimited tabular text files
Represents a list of table columns.
Feature representing a row in a text table.
A Funcotation to hold data from simple tabular data.
Reads the contents of a tab separated value formatted text input into records of an arbitrary type TableReader.
A reference to a BigQuery table by project, dataset, and table name, along with the contained fields.
Common constants for table readers and writers.
Class to write tab separated value files.
 
Enum for two-sided things, for example which end of a read has been clipped, which end of a chain within an assembly graph etc.
Tandem repeat unit composition and counts per allele
Stratifies the evals into sites that are tandem repeats
Metrics class for the analysis of reads obtained from targeted pcr experiments e.g.
Calculates HS metrics for a given SAM or BAM file.
TargetMetrics are metrics to measure how well we hit specific targets (or baits) when using a targeted sequencing process like hybrid selection or Targeted PCR Techniques (TSCA).
TargetMetrics are metrics to measure how well we hit specific targets (or baits) when using a targeted sequencing process like hybrid selection or Targeted PCR Techniques (TSCA).
TargetMetrics are metrics to measure how well we hit specific targets (or baits) when using a targeted sequencing process like hybrid selection or Targeted PCR Techniques (TSCA).
A simple class that is used to store the coverage information about an interval.
Indicates the ordinal of a fragment in a paired sequenced template.
TensorType documents the tensors available and what information they encode.
 
For internal test purposes only.
 
Program group for use with internal test CommandLinePrograms only.
Common utilities for dealing with text formatting.
 
 
 
 
 
Created by David Benjamin on 5/13/15.
 
TheoreticalSensitivityMetrics are metrics calculated from TheoreticalSensitivity and the parameters used in the calculation.
 
 
This version of the thread pool executor will throw an exception if any of the internal jobs have thrown exceptions while executing
 
 
Represents a tile from TileMetricsOut.bin.
Load a file containing 8-byte records like this: tile number (4-byte int), number of clusters in tile (4-byte int). The number of records to read is determined by reaching EOF.
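The record layout described above (two 4-byte integers per record, read until the end of the data) can be decoded with a `ByteBuffer`. The sketch below is hypothetical and makes two assumptions the description does not state: that the integers are little-endian, and that the whole file has already been read into a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: decodes 8-byte (tile number, cluster count) records. */
final class TileRecordSketch {
    private static final int RECORD_SIZE = 8; // two 4-byte ints

    private TileRecordSketch() {}

    /** Returns tile number -> cluster count, in file order. Byte order is ASSUMED little-endian. */
    static Map<Integer, Integer> parse(byte[] data) {
        if (data.length % RECORD_SIZE != 0) {
            throw new IllegalArgumentException("length is not a multiple of " + RECORD_SIZE + " bytes");
        }
        final ByteBuffer buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        final Map<Integer, Integer> clustersByTile = new LinkedHashMap<>();
        // No record count is stored; keep reading until the data is exhausted.
        while (buffer.remaining() >= RECORD_SIZE) {
            final int tileNumber = buffer.getInt();
            final int clusterCount = buffer.getInt();
            clustersByTile.put(tileNumber, clusterCount);
        }
        return clustersByTile;
    }
}
```

A caller could obtain the byte array with `java.nio.file.Files.readAllBytes` for small files; a streaming reader would be more appropriate for large inputs.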
 
Reads a TileMetricsOut file commonly found in the InterOp directory of an Illumina Run Folder.
Helper class which captures the combination of a lane, tile & metric code
IlluminaPhasingMetrics corresponds to a single record in a TileMetricsOut file
 
Utility for reading the tile data from an Illumina run directory's TileMetricsOut.bin file
Captures information about a phasing value - Which read it corresponds to, which phasing type and a median value
Defines the first or second template read for a tile
 
A set of training variants for use with VQSR.
Trains a model for scoring variant calls based on site-level annotations.
Created by gauthier on 7/13/17.
 
 
 
 
 
The manner to select a single transcript from a set of transcripts to report as the "best" or main transcript.
This tool takes a pair of SAM files sharing the same read names (e.g.
A common class for holding the fields in PhysicalLocation that we don't want to be serialized by kryo.
Enum representation of a transition from one base to any other.
 
A simple container class for parameters controlling which records get returned during traversals.
An enumeration to represent true, false, or unknown.
A service class for HaplotypeBasedVariableRecaller that reads SAM/BAM files.
A class for imposing a trio structure on three samples; a common paradigm
 
Convenience class for tumor normal pair.
 
 
 
This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
UmiGraph is used to identify UMIs that come from the same original source molecule.
Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords using the UmiAwareDuplicateSetIterator.
Finds a lower bound on the number of unique reads at a locus that support a non-reference allele.
Create a unique ID for an arbitrary object and wrap it.
Clears the 0x400 duplicate SAM flag from reads.
A utility class for dealing with unsigned types.
A utility class for dealing with unsigned types.
UnsortedBasecallsConverter utilizes an underlying IlluminaDataProvider to convert parsed and decoded sequencing data from standard Illumina formats to specific output records (FASTA records/SAM records).
Takes a VCF file and a Sequence Dictionary (from a variety of file types) and updates the Sequence Dictionary in VCF.
Updates the reference contigs in the header of the VCF format file, i.e.
Class UserException.
 
 
 
 
Class UserException.CouldNotCreateOutputFile
 
Class UserException.CouldNotReadInputFile
 
 
 
 
 
 
 
 
Class UserException.MalformedFile
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
This tool reports on the validity of a SAM or BAM file relative to the SAM format specification.
 
Validate a VCF file with a strict set of criteria
 
 
Describes the functionality for an executor that manages the delegation of work to VariantProcessor.Accumulators.
A VariantAccumulatorExecutor that breaks down work into chunks described by the provided VariantIteratorProducer and spreads them over the indicated number of threads.
 
Interface of all variant annotations.
File interface for passing annotations to a modeling backend and indicating a path prefix for resulting output.
 
File interface for passing annotations to a scoring backend and returning scores.
Annotate variant calls with context information
The class responsible for computing annotations for variants.
A container object for storing the objects necessary for carrying over expression annotations.
 
VariantContextVariantAdapter wraps the existing htsjdk VariantContext class so it can be used with the GATKVariant API.
 
Given a variant callset, it is common to calculate various quality control metrics.
The collection of arguments for VariantEval
 
A wrapper used internally by VariantEval and related classes to pass information related to the evaluation/stratification context, without exposing the entire walker to the consumer.
This class allows other classes to replicate the behavior of VariantEval. Usage: pass the genotype args into the constructor, which will then initialize the engine completely.
Class for writing the GATKReport for VariantEval. Accepts a fully evaluated (i.e., there's no more data coming) set of stratifications and evaluators and supports writing out the data in these evaluators to a GATKReport.
Tools that evaluate and refine variant calls, e.g.
 
 
Interface for classes that can generate filters for VariantContexts.
Tools that filter variants
Collects common variant filters.
Do not filter out any variants.
Filter out any variants that are symbolic or SV.
Filter out any variants that fail (variant-level) filters.
Filter variant calls based on INFO and/or FORMAT annotations
Keep only variants with any of these IDs.
A mechanism for iterating over a CloseableIterator of VariantContexts in some fashion, given VCF files and optionally an interval list.
VariantLocusWalker processes variants from a single source, grouped by locus overlap, or optionally one at a time in order, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
Tools that manipulate variant call format (VCF) data
Annotate the ID field and attribute overlap FLAGs for a VariantContext against a FeatureContext or a list of VariantContexts.
Describes an object that processes variants and produces a result.
Handles VariantContexts, and accumulates their data in some fashion internally.
Generates instances of VariantProcessor.Accumulators.
Simple builder of VariantProcessors.
Takes a collection of results produced by VariantProcessor.Accumulator.result() and merges them into a single RESULT.
Build a recalibration model to score variant quality for filtering purposes
 
 
VariantShard is a section of the genome that's used for sharding work for pairing things with variants.
VariantsSparkSink writes variants to a VCF file in parallel using Hadoop-BAM.
VariantsSparkSource loads Variants from files serially (using FeatureDataSource) or in parallel using Hadoop-BAM.
Extract fields from a VCF file to a tab-delimited table
 
 
 
Classes which perform transformations from VariantContext -> VariantContext should implement this interface by overriding <VariantContext, VariantContext>#apply(VariantContext). Created by jonn on 6/26/18.
Flow Annotation: type of variant: SNP/NON-H-INDEL/H-INDEL
Stratifies the eval variants by their type (SNP, INDEL, ETC)
This code and logic for determining variant types was mostly retained from VQSR.
Enum to hold the possible types of dbSnps.
Keep only variants with any of these variant types.
A VariantWalker is a tool that processes a variant at a time from a source of variants, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
Base class for variant walkers, which process variants from one or more sources of variants, with optional contextual information from a reference, sets of reads, and/or supplementary sources of Features.
Encapsulates a VariantContext with the reads that overlap it (the ReadsContext) and its ReferenceContext and FeatureContext.
A Spark version of VariantWalker.
Deprecated.
from 2022-03-17, Use VcfPathSegment
Deprecated.
from 2022-03-17, Use VcfPathSegmentGenerator
Converts an ASCII VCF file to a binary BCF or vice versa.
A class to create annotations from VCF feature sources.
A concrete class for FuncotationMetadata that can be easily built from a VCF Header.
A Funcotator output renderer for writing to VCF files.
Describes a segment of a particular VCF file.
Describes a mechanism for producing VcfPathSegments from a VCF file path.
A simple program to convert a Genotyping Arrays VCF to an ADPC file (Illumina intensity data file).
Converts a VCF or BCF file to a Picard Interval List.
 
Utils for dealing with VCF files.
Created by farjoun on 4/1/17.
Class for performing the pair HMM for global alignment using AVX instructions contained in a native shared library.
Type for implementation of VectorLoglessPairHMM
 
Prints a SAM or BAM file to the screen.
 
 
 
Base class for pre-packaged walker traversals in the GATK engine.
Tests whether a flow based read is "well-formed" -- that is, is free of major internal inconsistencies and issues that could lead to errors downstream.
Tests whether a read is "well-formed" -- that is, is free of major internal inconsistencies and issues that could lead to errors downstream.
Metrics for evaluating the performance of whole genome sequencing experiments.
Interface for processing data and generating results for CollectWgsMetrics
WgsMetricsProcessorImpl<T extends htsjdk.samtools.util.AbstractRecordAndOffset>
Implementation of WgsMetricsProcessor that gets input data from a given iterator and processes it with the help of a collector
Support for Python-like xreadlines() function as a class.
Codec class to read from XSV (e.g.
A feature to represent a line in an arbitrarily delimited (XSV) file (i.e.
Utility class to zip and unzip files.
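
The tile index entry earlier in this listing describes a file of 8-byte records (a 4-byte tile number followed by a 4-byte cluster count) that is read until EOF. A minimal sketch of that read loop follows. It is not the library's implementation: the class name TileIndexSketch, the TileRecord type, and the readAll method are hypothetical, and little-endian byte order is an assumption that the description does not state.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class TileIndexSketch {
    /** One 8-byte record: tile number, then number of clusters in that tile. */
    record TileRecord(int tile, int numClusters) {}

    /** Reads 8-byte records until EOF; the record count is not stored in the file. */
    static List<TileRecord> readAll(InputStream in) throws IOException {
        List<TileRecord> records = new ArrayList<>();
        byte[] buf = new byte[8];
        while (true) {
            int n = in.readNBytes(buf, 0, 8);
            if (n == 0) {
                break; // clean EOF on a record boundary
            }
            if (n < 8) {
                throw new EOFException("Truncated record: got " + n + " of 8 bytes");
            }
            // ASSUMPTION: little-endian ints; the listing does not specify byte order.
            ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
            records.add(new TileRecord(bb.getInt(), bb.getInt()));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Build two in-memory records instead of reading a real run folder.
        ByteBuffer bb = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(1101).putInt(250).putInt(1102).putInt(300);
        for (TileRecord r : readAll(new ByteArrayInputStream(bb.array()))) {
            System.out.println(r.tile() + "\t" + r.numClusters());
        }
    }
}
```

Treating a partial trailing record as an error, rather than silently dropping it, is a design choice of this sketch; the description only says that reading stops at EOF.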