DetermineGermlineContigPloidy (gatk 4.1.7.0 API)

java.lang.Object
- org.broadinstitute.hellbender.cmdline.CommandLineProgram
- - org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy

All Implemented Interfaces: org.broadinstitute.barclay.argparser.CommandLinePluginProvider

@DocumentedFeature
public final class DetermineGermlineContigPloidy
extends CommandLineProgram

Determines the integer ploidy state of all contigs for germline samples given counts data. These should be either HDF5 or TSV count files generated by CollectReadCounts; TSV files may be compressed (e.g., with bgzip), but must then have filenames ending with the extension .gz. See the documentation for the input argument for details on enabling streaming of indexed count files from Google Cloud Storage.

Introduction

Germline karyotyping is a frequently performed task in bioinformatics pipelines, e.g. for sex determination and aneuploidy identification. This tool uses counts data for germline karyotyping.

Performing germline karyotyping using counts data requires calibrating ("modeling") the technical coverage bias and variance for each contig. The Bayesian model and the associated inference scheme implemented in DetermineGermlineContigPloidy include provisions for inferring and explaining away much of the technical variation. Furthermore, karyotyping confidence is automatically adjusted for individual samples and contigs.

Running DetermineGermlineContigPloidy is the first computational step in the GATK germline CNV calling pipeline. It provides a baseline ("default") copy-number state for each contig/sample with respect to which the probability of alternative states is allocated.

Python environment setup

The computation done by this tool, aside from input data parsing and validation, is performed outside the Java Virtual Machine by the gCNV computational Python module, gcnvkernel. The user must have a properly configured Python conda environment with gcnvkernel and its dependencies installed. If the user runs DetermineGermlineContigPloidy from one of the official GATK Docker images, the Python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.

Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano configuration. For example, by running THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk DetermineGermlineContigPloidy ..., users can specify the theano compilation directory (which is set to $HOME/.theano by default). See theano documentation at http://deeplearning.net/software/theano/library/config.html.
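When the tool is launched from another Java program rather than from a shell, the same override can be applied through the child process environment. The following is a minimal sketch, not part of GATK: the class TheanoFlagsSketch and its method are hypothetical, and it assumes the gatk wrapper script is on the PATH. The process is only built here, not started.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher sketch; not part of GATK. */
final class TheanoFlagsSketch {

    /**
     * Builds (but does not start) a process for the gatk wrapper with THEANO_FLAGS set.
     * toolArgs holds whatever tool arguments the caller would normally pass.
     */
    static ProcessBuilder withCompileDir(String baseCompileDir, List<String> toolArgs) {
        List<String> command = new ArrayList<>();
        command.add("gatk"); // assumption: the gatk wrapper script is on the PATH
        command.add("DetermineGermlineContigPloidy");
        command.addAll(toolArgs);

        ProcessBuilder builder = new ProcessBuilder(command);
        // Same effect as prefixing the shell command with THEANO_FLAGS="base_compiledir=...".
        // Note that this replaces any THEANO_FLAGS value inherited from the parent process.
        builder.environment().put("THEANO_FLAGS", "base_compiledir=" + baseCompileDir);
        builder.inheritIO();
        return builder;
    }

    public static void main(String[] args) {
        ProcessBuilder builder = withCompileDir("/tmp/theano_compiledir", new ArrayList<>());
        System.out.println(builder.environment().get("THEANO_FLAGS"));
    }
}
```

Calling start() on the returned builder would launch the tool; the sketch stops short of that so it can be inspected without a GATK installation.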

Tool run modes

This tool has two operation modes as described below:

COHORT mode:

If a ploidy model parameter path is not provided via the model argument, the tool will run in the COHORT mode. In this mode, ploidy model parameters (e.g. coverage bias and variance for each contig) are inferred, along with baseline contig ploidy states of each sample. It is possible to run the tool over a subset of all intervals present in the input count files, which can be specified by -L; this can be used to pass a filtered interval list produced by FilterIntervals to mask intervals from modeling. Intervals may also be blacklisted using -XL. The specified intervals that result from resolving -L/-XL inputs must be exactly present in all of the input count files.
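The requirement that the resolved intervals be exactly present in every count file amounts to a set-containment check. The sketch below illustrates it under stated assumptions: intervals are represented as plain strings such as "1:1-1000", the per-file interval sets have already been read by the caller, and the class IntervalPresenceSketch is hypothetical rather than GATK code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical pre-flight check; not part of GATK. */
final class IntervalPresenceSketch {

    /**
     * Returns, for each count file, the requested intervals that the file does not
     * contain exactly. Files that contain every requested interval are omitted.
     */
    static Map<String, Set<String>> missingIntervals(
            Set<String> requested, Map<String, Set<String>> intervalsByFile) {
        Map<String, Set<String>> missing = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : intervalsByFile.entrySet()) {
            Set<String> absent = new LinkedHashSet<>(requested);
            absent.removeAll(entry.getValue()); // exact string match only
            if (!absent.isEmpty()) {
                missing.put(entry.getKey(), absent);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> requested = new LinkedHashSet<>(Arrays.asList("1:1-1000", "1:1001-2000"));
        Map<String, Set<String>> byFile = new LinkedHashMap<>();
        byFile.put("sample_1.counts.tsv", new HashSet<>(Arrays.asList("1:1-1000", "1:1001-2000")));
        byFile.put("sample_2.counts.tsv", new HashSet<>(Arrays.asList("1:1-1000")));
        // Only sample_2.counts.tsv is reported, because it lacks 1:1001-2000.
        System.out.println(missingIntervals(requested, byFile));
    }
}
```

An empty result means every input file covers the requested intervals; any entry identifies a file that would not satisfy the constraint.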

A TSV file specifying prior probabilities for each integer ploidy state and for each contig is required in this mode and must be specified via the contig-ploidy-priors argument. The following shows an example of such a table:

CONTIG_NAME	PLOIDY_PRIOR_0	PLOIDY_PRIOR_1	PLOIDY_PRIOR_2	PLOIDY_PRIOR_3
1	0.01	0.01	0.97	0.01
2	0.01	0.01	0.97	0.01
X	0.01	0.49	0.49	0.01
Y	0.50	0.50	0.00	0.00

Note that the contig names appearing under the CONTIG_NAME column must match contig names in the input counts files, and all contigs appearing in the input counts files must have a corresponding entry in the priors table. The order of contigs is immaterial in the priors table. The highest ploidy state is determined by the prior table (3 in the above example). A ploidy state can be strictly forbidden by setting its prior probability to 0. For example, the Y contig in the above example can only assume 0 and 1 ploidy states.
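The rules above can be checked before running the tool. The following sketch parses a priors table given as tab-separated lines, derives the highest ploidy state from the number of prior columns, lists the states that are not forbidden for a contig, and reports contigs that lack a priors row. The class PloidyPriorsSketch and its methods are hypothetical helpers written for illustration, not GATK code, and the set of contigs found in the counts files is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for inspecting a contig ploidy priors table; not part of GATK. */
final class PloidyPriorsSketch {

    /** Parses tab-separated lines (header first) into contig -> prior per ploidy state. */
    static Map<String, double[]> parse(List<String> lines) {
        String[] header = lines.get(0).split("\t");
        Map<String, double[]> priors = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split("\t");
            if (fields.length != header.length) {
                throw new IllegalArgumentException("Malformed row: " + line);
            }
            double[] row = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                row[i - 1] = Double.parseDouble(fields[i]);
            }
            priors.put(fields[0], row);
        }
        return priors;
    }

    /** The highest ploidy state equals the number of prior columns minus one. */
    static int highestPloidy(Map<String, double[]> priors) {
        if (priors.isEmpty()) {
            throw new IllegalArgumentException("Priors table has no rows");
        }
        return priors.values().iterator().next().length - 1;
    }

    /** Ploidy states with a nonzero prior, i.e. those that are not strictly forbidden. */
    static List<Integer> allowedStates(double[] row) {
        List<Integer> allowed = new ArrayList<>();
        for (int ploidy = 0; ploidy < row.length; ploidy++) {
            if (row[ploidy] > 0.0) {
                allowed.add(ploidy);
            }
        }
        return allowed;
    }

    /** Contigs seen in the counts files that have no row in the priors table. */
    static Set<String> missingContigs(Set<String> countContigs, Map<String, double[]> priors) {
        Set<String> missing = new TreeSet<>(countContigs);
        missing.removeAll(priors.keySet());
        return missing;
    }

    public static void main(String[] args) {
        Map<String, double[]> priors = parse(Arrays.asList(
                "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3",
                "1\t0.01\t0.01\t0.97\t0.01",
                "2\t0.01\t0.01\t0.97\t0.01",
                "X\t0.01\t0.49\t0.49\t0.01",
                "Y\t0.50\t0.50\t0.00\t0.00"));
        System.out.println(highestPloidy(priors));            // prints 3
        System.out.println(allowedStates(priors.get("Y")));   // prints [0, 1]
    }
}
```

A non-empty result from missingContigs indicates that the priors table does not cover every contig in the counts files.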

The tool output in the COHORT mode will contain two subdirectories, one ending with "-model" and the other ending with "-calls". The model subdirectory contains the inferred parameters of the ploidy model, which may be used later on for karyotyping one or more similarly-sequenced samples (see below). The calls subdirectory contains one subdirectory for each sample, listing various sample-specific quantities such as the global read-depth, average ploidy, per-contig baseline ploidies, and per-contig coverage variance estimates.

CASE mode:

If a path containing previously inferred ploidy model parameters is provided via the model argument, then the tool will run in the CASE mode. In this mode, the parameters of the ploidy model are loaded from the provided directory and only sample-specific quantities are inferred. The modeled intervals are then specified by a file contained in the model directory, all interval-related arguments are ignored in this mode, and all model intervals must be present in all of the input count files. The tool output in the CASE mode is only the "-calls" subdirectory and is organized similarly to the COHORT mode.

In the CASE mode, the contig ploidy prior table is taken directly from the provided model parameters path and must not be provided again.
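The mode selection and its argument constraints can be summarized in a few lines. This is an illustrative sketch only: RunModeSketch, its Mode enum, and the prefix parameter are hypothetical and do not reproduce the tool's actual implementation or its nested RunMode class. A null path stands for an argument that was not supplied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the run-mode rules; not GATK's implementation. */
final class RunModeSketch {

    enum Mode { COHORT, CASE }

    /** No model path means COHORT mode; a model path means CASE mode. */
    static Mode resolveMode(String modelPath, String priorsPath) {
        if (modelPath == null) {
            if (priorsPath == null) {
                throw new IllegalArgumentException(
                        "COHORT mode requires a contig ploidy priors table");
            }
            return Mode.COHORT;
        }
        if (priorsPath != null) {
            throw new IllegalArgumentException(
                    "CASE mode takes the priors from the model; do not provide them again");
        }
        return Mode.CASE;
    }

    /** COHORT mode writes "-model" and "-calls" subdirectories; CASE mode writes only "-calls". */
    static List<String> outputSubdirectories(Mode mode, String prefix) {
        List<String> subdirectories = new ArrayList<>();
        if (mode == Mode.COHORT) {
            subdirectories.add(prefix + "-model");
        }
        subdirectories.add(prefix + "-calls");
        return subdirectories;
    }

    public static void main(String[] args) {
        Mode mode = resolveMode(null, "priors.tsv");
        System.out.println(mode + " " + outputSubdirectories(mode, "example"));
    }
}
```

Rejecting a priors table in CASE mode mirrors the rule stated above: the priors travel with the model, so supplying them twice would leave it ambiguous which table applies.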

Important Remarks

Choice of hyperparameters: The quality of ploidy model parametrization and the sensitivity/precision of germline karyotyping depend on the choice of model hyperparameters, including the standard deviation of mean contig coverage bias (set using the mean-bias-standard-deviation argument), the mapping error rate (set using the mapping-error-rate argument), and the typical scale of contig- and sample-specific unexplained variance (set using the global-psi-scale and sample-psi-scale arguments, respectively). These hyperparameters are not universal: they must be tuned for each sequencing protocol and set explicitly at runtime.
Mosaicism and fractional ploidies: The model underlying this tool assumes integer ploidy states (in contrast to fractional/variable ploidy states). Therefore, it is to be used strictly on germline samples and for the purpose of sex determination, autosomal aneuploidy detection, or as a part of the GATK germline CNV calling pipeline. The presence of large somatic events and mosaicism (e.g., sex chromosome loss and somatic trisomy) will naturally lead to unreliable results. We strongly recommend inspecting genotyping qualities (GQ) from the tool output and considering dropping low-GQ contigs in downstream analyses. Finally, given the Bayesian nature of this tool, we suggest including as many high-quality germline samples as possible for ploidy model parametrization in the COHORT mode. This will downplay the role of questionable samples and will yield a more reliable estimation of genuine sequencing biases.
Coverage-based germline karyotyping: Accurate germline karyotyping requires incorporating SNP allele-fraction data and counts data in a unified probabilistic model, which is beyond the scope of the present tool. The current implementation uses only counts data for karyotyping; while fast, it may not provide the most reliable results.

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DetermineGermlineContigPloidy.RunMode`

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`CALLS_PATH_SUFFIX`
`static java.lang.String`	`CASE_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT`
`static java.lang.String`	`COHORT_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT`
`static java.lang.String`	`CONTIG_PLOIDY_PRIORS_FILE_LONG_NAME`
`static java.lang.String`	`INPUT_MODEL_INTERVAL_FILE`
`protected IntervalArgumentCollection`	`intervalArgumentCollection`
`static java.lang.String`	`MODEL_PATH_SUFFIX`

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY

Constructor Summary

Constructors
Constructor and Description
`DetermineGermlineContigPloidy()`

Method Summary

Methods
Modifier and Type	Method and Description
`protected java.lang.Object`	`doWork()` Do the work after command line has been parsed.
`protected void`	`onStartup()` Perform initialization/setup after command-line argument parsing but before doWork() is invoked.

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getPluginDescriptors, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - COHORT_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
```
public static final java.lang.String COHORT_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
```
    See Also:
    
    Constant Field Values
  - CASE_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
```
public static final java.lang.String CASE_DETERMINE_PLOIDY_AND_DEPTH_PYTHON_SCRIPT
```
    See Also:
    
    Constant Field Values
  - INPUT_MODEL_INTERVAL_FILE
```
public static final java.lang.String INPUT_MODEL_INTERVAL_FILE
```
    See Also:
    
    Constant Field Values
  - MODEL_PATH_SUFFIX
```
public static final java.lang.String MODEL_PATH_SUFFIX
```
    See Also:
    
    Constant Field Values
  - CALLS_PATH_SUFFIX
```
public static final java.lang.String CALLS_PATH_SUFFIX
```
    See Also:
    
    Constant Field Values
  - CONTIG_PLOIDY_PRIORS_FILE_LONG_NAME
```
public static final java.lang.String CONTIG_PLOIDY_PRIORS_FILE_LONG_NAME
```
    See Also:
    
    Constant Field Values
  - intervalArgumentCollection
```
@ArgumentCollection
protected IntervalArgumentCollection intervalArgumentCollection
```
- Constructor Detail
  - DetermineGermlineContigPloidy
```
public DetermineGermlineContigPloidy()
```
- Method Detail
  - onStartup
```
protected void onStartup()
```
    Description copied from class: CommandLineProgram
    
    Perform initialization/setup after command-line argument parsing but before doWork() is invoked. Default implementation does nothing. Subclasses can override to perform initialization.
    
    Overrides:
    
    onStartup in class CommandLineProgram
  - doWork
```
protected java.lang.Object doWork()
```
    Description copied from class: CommandLineProgram
    
    Do the work after the command line has been parsed. RuntimeExceptions may be thrown by this method and are reported appropriately.
    
    Specified by:
    
    doWork in class CommandLineProgram
    
    Returns:
    
    the return value, or null if there is none.
