@DocumentedFeature public final class DetermineGermlineContigPloidy extends CommandLineProgram
CollectReadCounts; TSV files may be compressed (e.g., with bgzip), but must then have filenames ending with the extension .gz. See the documentation for the input argument for details on enabling streaming of indexed count files from Google Cloud Storage.
Germline karyotyping is a frequently performed task in bioinformatics pipelines, e.g. for sex determination and aneuploidy identification. This tool uses counts data for germline karyotyping.
Performing germline karyotyping using counts data requires calibrating ("modeling") the technical coverage bias and variance for each contig. The Bayesian model and the associated inference scheme implemented in DetermineGermlineContigPloidy include provisions for inferring and explaining away much of the technical variation. Furthermore, karyotyping confidence is automatically adjusted for individual samples and contigs.
Running DetermineGermlineContigPloidy is the first computational step in the GATK germline CNV calling pipeline. It provides a baseline ("default") copy-number state for each contig/sample with respect to which the probability of alternative states is allocated.
The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java Virtual Machine using the gCNV computational Python module, gcnvkernel. It is crucial that the user has properly set up a Python conda environment with gcnvkernel and its dependencies installed. If the user intends to run DetermineGermlineContigPloidy using one of the official GATK Docker images, the Python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.
Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano configuration. For example, by running THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ..., users can specify the theano compilation directory (which is set to $HOME/.theano by default). See the theano documentation at http://deeplearning.net/software/theano/library/config.html.
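When the tool is launched from a wrapper script rather than an interactive shell, the same override can be applied by passing a modified environment to the child process. The following is a minimal sketch, not part of GATK: the helper name `theano_env` and the example directory are hypothetical, and the sketch replaces any THEANO_FLAGS value that is already set.

```python
import os


def theano_env(base_compiledir, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set.

    base_env defaults to the current process environment. Any existing
    THEANO_FLAGS value is overwritten (a simplification of this sketch).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = f"base_compiledir={base_compiledir}"
    return env


# Hypothetical compilation directory, used for illustration only.
env = theano_env("/tmp/theano_compiledir", base_env={"PATH": "/usr/bin"})
print(env["THEANO_FLAGS"])
```

The resulting dictionary can be handed to `subprocess.run(..., env=env)` together with the gatk command line.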
This tool has two operation modes as described below:
If a ploidy model parameter path is not provided via the model argument, the tool will run in the COHORT mode. In this mode, ploidy model parameters (e.g. coverage bias and variance for each contig) are inferred, along with baseline contig ploidy states of each sample. It is possible to run the tool over a subset of all intervals present in the input count files, which can be specified by -L; this can be used to pass a filtered interval list produced by FilterIntervals to mask intervals from modeling. Intervals may also be blacklisted using -XL. The specified intervals that result from resolving -L/-XL inputs must be exactly present in all of the input count files.
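The requirement that every resolved interval be present in every count file can be pre-checked before launching a long run. The sketch below assumes intervals have already been read into `(contig, start, end)` tuples; the function name and the in-memory representation are hypothetical, and the sketch does not read GATK file formats.

```python
def missing_intervals(requested, count_file_intervals):
    """Map each count file name to the requested intervals it lacks.

    requested: list of (contig, start, end) tuples after -L/-XL resolution.
    count_file_intervals: {file_name: iterable of (contig, start, end)}.
    Files that contain every requested interval are omitted from the result.
    """
    problems = {}
    for name, intervals in count_file_intervals.items():
        present = set(intervals)
        absent = [iv for iv in requested if iv not in present]
        if absent:
            problems[name] = absent
    return problems


requested = [("1", 1000, 2000), ("X", 500, 900)]
files = {
    "normal_1.counts.hdf5": [("1", 1000, 2000), ("X", 500, 900), ("Y", 10, 50)],
    "normal_2.counts.hdf5": [("1", 1000, 2000)],
}
print(missing_intervals(requested, files))
```

An exact tuple match is used here because the text requires intervals to be exactly present; overlapping but non-identical intervals are reported as missing.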
A TSV file specifying prior probabilities for each integer ploidy state and for each contig is required in this mode and must be specified via the contig-ploidy-priors argument. The following shows an example of such a table:
CONTIG_NAME | PLOIDY_PRIOR_0 | PLOIDY_PRIOR_1 | PLOIDY_PRIOR_2 | PLOIDY_PRIOR_3 |
---|---|---|---|---|
1 | 0.01 | 0.01 | 0.97 | 0.01 |
2 | 0.01 | 0.01 | 0.97 | 0.01 |
X | 0.01 | 0.49 | 0.49 | 0.01 |
Y | 0.50 | 0.50 | 0.00 | 0.00 |
Note that the contig names appearing under the CONTIG_NAME column must match contig names in the input counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.
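The rules for the priors table can be checked with a short script before running the tool. The following is an illustrative sketch, not the tool's own validation: the function names are hypothetical, and the check that each row sums to 1 is an assumption of this sketch rather than a requirement stated here.

```python
import csv
import io
import math

PREFIX = "PLOIDY_PRIOR_"


def parse_ploidy_priors(tsv_text):
    """Parse a tab-separated priors table into {contig: [p_0, p_1, ...]}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = [c for c in (reader.fieldnames or []) if c.startswith(PREFIX)]
    if not columns:
        raise ValueError("no PLOIDY_PRIOR_* columns found")
    columns.sort(key=lambda c: int(c[len(PREFIX):]))
    states = [int(c[len(PREFIX):]) for c in columns]
    if states != list(range(len(states))):
        raise ValueError("ploidy columns must cover 0..K without gaps")
    priors = {}
    for row in reader:
        probs = [float(row[c]) for c in columns]
        # Assumption of this sketch: each row is a normalized distribution.
        if min(probs) < 0 or not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
            raise ValueError(f"invalid prior row for contig {row['CONTIG_NAME']}")
        priors[row["CONTIG_NAME"]] = probs
    return priors


def allowed_states(probs):
    """Ploidy states with non-zero prior, i.e. not strictly forbidden."""
    return [state for state, p in enumerate(probs) if p > 0]


def missing_contigs(priors, contigs_in_counts):
    """Contigs seen in the count files that have no entry in the table."""
    return sorted(set(contigs_in_counts) - set(priors))


EXAMPLE = "\n".join([
    "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3",
    "1\t0.01\t0.01\t0.97\t0.01",
    "2\t0.01\t0.01\t0.97\t0.01",
    "X\t0.01\t0.49\t0.49\t0.01",
    "Y\t0.50\t0.50\t0.00\t0.00",
])

priors = parse_ploidy_priors(EXAMPLE)
highest_ploidy = len(priors["1"]) - 1
print(highest_ploidy, allowed_states(priors["Y"]))
```

With the example table, the highest ploidy state is 3 and the Y contig is restricted to states 0 and 1, matching the description above.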
The tool output in the COHORT mode will contain two subdirectories, one ending with "-model" and the other ending with "-calls". The model subdirectory contains the inferred parameters of the ploidy model, which may be used later on for karyotyping one or more similarly-sequenced samples (see below). The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the global read-depth, average ploidy, per-contig baseline ploidies, and per-contig coverage variance estimates.
If a path containing previously inferred ploidy model parameters is provided via the model argument, then the tool will run in the CASE mode. In this mode, the parameters of the ploidy model are loaded from the provided directory and only sample-specific quantities are inferred. The modeled intervals are specified by a file contained in the model directory; all interval-related arguments are ignored in this mode, and all model intervals must be present in all of the input count files. The tool output in the CASE mode is only the "-calls" subdirectory and is organized similarly to the COHORT mode.
In the CASE mode, the contig ploidy prior table is taken directly from the provided model parameters path and must not be provided again.
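The mode selection described above reduces to a small decision on two arguments. The sketch below illustrates that logic only; the enum mirrors the name of the nested RunMode class, while `resolve_run_mode` and its error messages are hypothetical and not taken from the tool's source.

```python
from enum import Enum


class RunMode(Enum):
    COHORT = "COHORT"
    CASE = "CASE"


def resolve_run_mode(model_dir=None, priors_file=None):
    """Choose the run mode from the model and contig-ploidy-priors arguments."""
    if model_dir is None:
        # COHORT mode: the model is inferred, so priors are required.
        if priors_file is None:
            raise ValueError("COHORT mode requires a contig ploidy priors table")
        return RunMode.COHORT
    # CASE mode: priors come from the model directory and must not be repeated.
    if priors_file is not None:
        raise ValueError("CASE mode must not be given a contig ploidy priors table")
    return RunMode.CASE


print(resolve_run_mode(priors_file="a_valid_ploidy_priors_table.tsv"))
print(resolve_run_mode(model_dir="a_valid_ploidy_model_dir"))
```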
The quality of ploidy model parametrization and the sensitivity/precision of germline karyotyping are sensitive to the choice of model hyperparameters, including standard deviation of mean contig coverage bias (set using the mean-bias-standard-deviation argument), mapping error rate (set using the mapping-error-rate argument), and the typical scale of contig- and sample-specific unexplained variance (set using the global-psi-scale and sample-psi-scale arguments, respectively). It is crucial to note that these hyperparameters are not universal and must be tuned for each sequencing protocol and properly set at runtime.
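Because these hyperparameters differ per sequencing protocol, it can help to keep them in one place and generate the command-line arguments from them. The following sketch is an illustration under assumptions: the helper name is hypothetical, and the numeric values are placeholders, not recommended or default settings.

```python
# Argument names as listed in the paragraph above.
HYPERPARAMETER_FLAGS = {
    "mean_bias_standard_deviation": "--mean-bias-standard-deviation",
    "mapping_error_rate": "--mapping-error-rate",
    "global_psi_scale": "--global-psi-scale",
    "sample_psi_scale": "--sample-psi-scale",
}


def hyperparameter_args(**values):
    """Turn keyword hyperparameters into a flat list of CLI arguments."""
    args = []
    for name, value in values.items():
        if name not in HYPERPARAMETER_FLAGS:
            raise KeyError(f"unknown hyperparameter: {name}")
        args += [HYPERPARAMETER_FLAGS[name], str(value)]
    return args


# Placeholder values for a hypothetical protocol; tune per protocol.
protocol_args = hyperparameter_args(mapping_error_rate=0.01, sample_psi_scale=0.001)
print(protocol_args)
```

The returned list can be appended to the rest of the gatk command when building it programmatically.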
The model underlying this tool assumes integer ploidy states (in contrast to fractional/variable ploidy states). Therefore, it is to be used strictly on germline samples and for the purpose of sex determination, autosomal aneuploidy detection, or as a part of the GATK germline CNV calling pipeline. The presence of large somatic events and mosaicism (e.g., sex chromosome loss and somatic trisomy) will naturally lead to unreliable results. We strongly recommend inspecting genotyping qualities (GQ) from the tool output and considering dropping low-GQ contigs in downstream analyses. Finally, given the Bayesian nature of this tool, we suggest including as many high-quality germline samples as possible for ploidy model parametrization in the COHORT mode. This will downplay the role of questionable samples and will yield a more reliable estimation of genuine sequencing biases.
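The recommendation to drop low-GQ contigs can be applied as a simple filter once the per-contig calls are loaded. This sketch assumes the calls are already available as dictionaries with `contig`, `ploidy`, and `gq` keys; those key names, the sample values, and the threshold of 20 are hypothetical and not taken from the tool's output format.

```python
def split_by_gq(contig_calls, min_gq):
    """Partition per-contig ploidy calls into (kept, dropped) by GQ."""
    kept, dropped = [], []
    for call in contig_calls:
        (kept if call["gq"] >= min_gq else dropped).append(call)
    return kept, dropped


# Made-up calls for one sample, for illustration only.
calls = [
    {"contig": "1", "ploidy": 2, "gq": 120.0},
    {"contig": "X", "ploidy": 1, "gq": 95.5},
    {"contig": "Y", "ploidy": 1, "gq": 4.2},
]
kept, dropped = split_by_gq(calls, min_gq=20.0)
print([c["contig"] for c in kept], [c["contig"] for c in dropped])
```

Keeping the dropped calls alongside the retained ones makes it easy to report which contigs were excluded from downstream analyses.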
Accurate germline karyotyping requires incorporating SNP allele-fraction data and counts data in a unified probabilistic model and is beyond the scope of the present tool. The current implementation uses only counts data for karyotyping and, while fast, may not provide the most reliable results.
COHORT mode:
gatk DetermineGermlineContigPloidy \
  --input normal_1.counts.hdf5 \
  --input normal_2.counts.hdf5 \
  ... \
  --contig-ploidy-priors a_valid_ploidy_priors_table.tsv \
  --output output_dir \
  --output-prefix normal_cohort
COHORT mode (with optional interval filtering):
gatk DetermineGermlineContigPloidy \
  -L intervals.interval_list \
  -XL blacklist_intervals.interval_list \
  --interval-merging-rule OVERLAPPING_ONLY \
  --input normal_1.counts.hdf5 \
  --input normal_2.counts.hdf5 \
  ... \
  --contig-ploidy-priors a_valid_ploidy_priors_table.tsv \
  --output output_dir \
  --output-prefix normal_cohort
CASE mode:
gatk DetermineGermlineContigPloidy \
  --model a_valid_ploidy_model_dir \
  --input normal_1.counts.hdf5 \
  --input normal_2.counts.hdf5 \
  ... \
  --output output_dir \
  --output-prefix normal_case
Modifier and Type | Class and Description |
---|---|
static class | DetermineGermlineContigPloidy.RunMode |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | CALLS_PATH_SUFFIX |
static java.lang.String | CASE_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT |
static java.lang.String | COHORT_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT |
static java.lang.String | CONTIG_PLOIDY_PRIORS_FILE_LONG_NAME |
static java.lang.String | INPUT_MODEL_INTERVAL_FILE |
protected IntervalArgumentCollection | intervalArgumentCollection |
static java.lang.String | MODEL_PATH_SUFFIX |
Fields inherited from class CommandLineProgram: GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
DetermineGermlineContigPloidy() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.Object | doWork() Do the work after command line has been parsed. |
protected void | onStartup() Perform initialization/setup after command-line argument parsing but before doWork() is invoked. |
Methods inherited from class CommandLineProgram: customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getPluginDescriptors, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String COHORT_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
public static final java.lang.String CASE_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
public static final java.lang.String INPUT_MODEL_INTERVAL_FILE
public static final java.lang.String MODEL_PATH_SUFFIX
public static final java.lang.String CALLS_PATH_SUFFIX
public static final java.lang.String CONTIG_PLOIDY_PRIORS_FILE_LONG_NAME
@ArgumentCollection protected IntervalArgumentCollection intervalArgumentCollection
protected void onStartup()
Overrides: onStartup in class CommandLineProgram
protected java.lang.Object doWork()
Overrides: doWork in class CommandLineProgram