GermlineCNVCaller (gatk 4.1.7.0 API)

java.lang.Object
- org.broadinstitute.hellbender.cmdline.CommandLineProgram
- - org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller

All Implemented Interfaces:

org.broadinstitute.barclay.argparser.CommandLinePluginProvider
```
@DocumentedFeature
public final class GermlineCNVCaller
extends CommandLineProgram
```
Calls copy-number variants in germline samples given their counts and the corresponding output of DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by CollectReadCounts; TSV files may be compressed (e.g., with bgzip), but must then have filenames ending with the extension .gz. See the documentation for the input argument for details on enabling streaming of indexed count files from Google Cloud Storage.
Introduction

Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole exome sequencing (WES), whole genome sequencing (WGS), and custom gene sequencing panels requires a comprehensive model to account for technical variation in library preparation and sequencing. The Bayesian model and the associated inference scheme implemented in GermlineCNVCaller include provisions for inferring and explaining away much of the technical variation. Furthermore, CNV calling confidence is automatically adjusted for each sample and genomic interval.
GermlineCNVCaller can automatically infer the parameters of the probabilistic model for read-depth bias and variance estimation (hereafter, "the coverage model") when provided with a cohort of germline samples sequenced using the same sequencing platform and library preparation protocol (in the case of WES, the same capture kit). We refer to this run mode of the tool as the COHORT mode. The number of samples required for the COHORT mode depends on several factors such as the sequencing depth, tissue type/quality and similarity in the cohort, and the stringency of following the library preparation and sequencing protocols. For WES and WGS samples, we recommend including at least 30 samples.

The parametrized coverage model can be used for CNV calling on future case samples provided that they are strictly compatible with the cohort used to generate the model parameters (in terms of tissue type(s), library preparation and sequencing protocols). We refer to this mode as the CASE run mode. There is no lower limit on the number of samples for running GermlineCNVCaller in CASE mode.

In both tool run modes, GermlineCNVCaller requires karyotyping and global read-depth information for all samples. Such information can be automatically generated by running DetermineGermlineContigPloidy on all samples, and passed on to GermlineCNVCaller by providing the ploidy output calls using the argument contig-ploidy-calls. The ploidy state of a contig is used as the baseline ("default") copy-number state of all intervals contained in that contig for the corresponding sample. All other copy-number states are treated as alternative states and get equal shares from the total alternative state probability (set using the p-alt argument).
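
The prior over copy-number states described above can be illustrated with a short Python sketch. This is an illustration only, not the gcnvkernel implementation: the function name `copy_number_prior` is hypothetical, and the sketch assumes that the baseline state receives the remaining probability 1 - p_alt while every other state up to the maximum copy number receives an equal share of p_alt.

```python
def copy_number_prior(baseline_state, max_copy_number, p_alt):
    """Return prior probabilities for copy-number states 0..max_copy_number.

    Hypothetical helper: the baseline (contig ploidy) state gets 1 - p_alt,
    and all other states split p_alt equally, as described in the text.
    """
    if not 0.0 <= p_alt <= 1.0:
        raise ValueError("p_alt must be a probability in [0, 1]")
    if not 0 <= baseline_state <= max_copy_number:
        raise ValueError("baseline_state must lie in 0..max_copy_number")

    n_states = max_copy_number + 1
    n_alt = n_states - 1
    if n_alt == 0:
        # Only one state exists, so it carries all of the probability.
        return [1.0]

    alt_share = p_alt / n_alt
    prior = [alt_share] * n_states
    prior[baseline_state] = 1.0 - p_alt
    return prior


# Diploid contig (baseline 2), states 0..5, total alternative probability 0.05.
prior = copy_number_prior(baseline_state=2, max_copy_number=5, p_alt=0.05)
```

For a contig called as haploid in a given sample (for example, X in a male sample), the same sketch would be called with `baseline_state=1`.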

Python environment setup

The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java Virtual Machine and using the gCNV computational python module, namely gcnvkernel. It is crucial that the user has properly set up a python conda environment with gcnvkernel and its dependencies installed. If the user intends to run GermlineCNVCaller using one of the official GATK Docker images, the python environment is already set up. Otherwise, the environment must be created and activated as described in the main GATK README.md file.

Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano configuration. For example, by running THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk GermlineCNVCaller ..., users can specify the theano compilation directory (which is set to $HOME/.theano by default). See theano documentation at http://deeplearning.net/software/theano/library/config.html.
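
When the tool is launched from a script rather than an interactive shell, the same override can be prepared as an environment mapping. The following is a minimal Python sketch; the helper name `env_with_theano_flags` is hypothetical, and the sketch replaces any THEANO_FLAGS value that is already set.

```python
import os


def env_with_theano_flags(base_compiledir, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set.

    Hypothetical helper: only the base_compiledir flag from the text is set,
    and any pre-existing THEANO_FLAGS value is overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = "base_compiledir=" + base_compiledir
    return env


env = env_with_theano_flags("PATH/TO/BASE_COMPILEDIR")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when invoking gatk.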

Tool run modes

COHORT mode:

The tool will be run in COHORT mode using the argument run-mode COHORT. In this mode, coverage model parameters are inferred simultaneously with the CNV states. Depending on available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified by -L; this can be used to pass a filtered interval list produced by FilterIntervals to mask intervals from modeling. The specified intervals must be present in all of the input count files. The output will contain two subdirectories, one ending with "-model" and the other with "-calls".

The model subdirectory contains a snapshot of the inferred parameters of the coverage model, which may be used later for CNV calling in one or more similarly-sequenced samples as mentioned earlier. Optionally, the path to a previously obtained coverage model parametrization can be provided via the model argument in COHORT mode, in which case the provided parameters will be used only for model initialization and a new parametrization will be generated based on the input count files. Furthermore, the genomic intervals are set to those used in creating the previous parametrization and interval-related arguments will be ignored. Note that the newly obtained parametrization ultimately reflects the input count files from the last run, regardless of whether or not an initialization parameter set is given. If the user wishes to model coverage data from two or more cohorts simultaneously, all of the input count files must be given to the tool at once.

The calls subdirectory contains sequentially-ordered subdirectories for each sample, each listing various sample-specific estimated quantities such as the probability of various copy-number states for each interval, a parametrization of the GC curve, sample-specific unexplained variance, read-depth, and loadings of coverage bias factors.

CASE mode:

The tool will be run in CASE mode using the argument run-mode CASE. The path to a previously obtained model directory must be provided via the model argument in this mode. The modeled intervals are then specified by a file contained in the model directory, all interval-related arguments are ignored in this mode, and all model intervals must be present in all of the input count files. The tool output in CASE mode is only the "-calls" subdirectory and is organized similarly to that in COHORT mode.
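
The output layout of the two run modes can be summarized with a small Python sketch. It assumes that each subdirectory is named by appending "-model" or "-calls" to the value of output-prefix and is placed directly under the output directory; the text above only states that the names end with these suffixes, so treat the exact layout and the helper name `expected_output_dirs` as assumptions.

```python
from pathlib import PurePosixPath


def expected_output_dirs(output_dir, output_prefix, run_mode):
    """Return the output subdirectories expected for a given run mode.

    Assumed layout: <output_dir>/<output_prefix>-calls in both modes,
    plus <output_dir>/<output_prefix>-model in COHORT mode only.
    """
    if run_mode not in ("COHORT", "CASE"):
        raise ValueError("run_mode must be 'COHORT' or 'CASE'")

    base = PurePosixPath(output_dir)
    dirs = {"calls": base / (output_prefix + "-calls")}
    if run_mode == "COHORT":
        dirs["model"] = base / (output_prefix + "-model")
    return dirs


cohort_dirs = expected_output_dirs("output_dir", "normal_cohort_run", "COHORT")
case_dirs = expected_output_dirs("output_dir", "normal_case_run", "CASE")
```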

Note that at the moment, this tool does not automatically verify the compatibility of the provided parametrization with the provided count files. Model compatibility may be assessed a posteriori by inspecting the magnitude of sample-specific unexplained variance of each sample, and asserting that they lie within the same range as those obtained from the cohort used to generate the parametrization. This manual step is expected to be made automatic in the future.
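
The a posteriori check described above could be scripted along the following lines. This Python sketch assumes the sample-specific unexplained variance values have already been extracted from the calls subdirectories into plain numbers (how to read them is not covered here), and it interprets "within the same range" as lying between the cohort minimum and maximum; the helper name `flag_incompatible_samples` is hypothetical.

```python
def flag_incompatible_samples(cohort_variances, case_variances):
    """Return names of case samples whose unexplained variance falls
    outside the [min, max] range observed in the cohort.

    cohort_variances: iterable of floats from the cohort run.
    case_variances: dict mapping case sample name -> float.
    """
    cohort_variances = list(cohort_variances)
    if not cohort_variances:
        raise ValueError("at least one cohort value is required")

    low, high = min(cohort_variances), max(cohort_variances)
    return sorted(
        name for name, value in case_variances.items()
        if not low <= value <= high
    )


# Illustrative numbers only; they are not taken from a real run.
flagged = flag_incompatible_samples(
    cohort_variances=[0.010, 0.014, 0.012],
    case_variances={"case_1": 0.011, "case_2": 0.050},
)
```

A stricter or more tolerant criterion (for example, a percentile range) may be more appropriate for large cohorts; the min/max rule is only the simplest reading of the text.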

Important Remarks

Choice of hyperparameters:

The quality of coverage model parametrization and the sensitivity/precision of germline CNV calling are sensitive to the choice of model hyperparameters, including the prior probability of alternative copy-number states (set using the p-alt argument), prevalence of active (i.e. CNV-rich) intervals (set via the p-active argument), the coherence length of CNV events and active/silent domains across intervals (set using the cnv-coherence-length and class-coherence-length arguments, respectively), and the typical scale of interval- and sample-specific unexplained variance (set using the interval-psi-scale and sample-psi-scale arguments, respectively). It is crucial to note that these hyperparameters are not universal and must be tuned for each sequencing protocol and properly set at runtime.

Running GermlineCNVCaller on a subset of intervals:

As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a genomically diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no fewer than 10000 consecutive intervals spanning at least 10 - 50 Mb.
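
The recommendation above can be turned into a quick sanity check before launching a run. The Python sketch below is hypothetical: it takes intervals as (contig, start, end) tuples with 1-based inclusive coordinates, uses the lower bound of 10 Mb from the text as the default threshold, and measures the span per contig from the first start to the last end, which is one possible reading of "spanning".

```python
def subset_is_large_enough(intervals, min_count=10_000, min_span_bp=10_000_000):
    """Check an interval subset against the recommended minimum size.

    intervals: list of (contig, start, end) tuples, 1-based inclusive.
    The span is summed over contigs, each measured from its smallest
    start to its largest end (an assumption made for this sketch).
    """
    if len(intervals) < min_count:
        return False

    bounds = {}
    for contig, start, end in intervals:
        low, high = bounds.get(contig, (start, end))
        bounds[contig] = (min(low, start), max(high, end))

    span = sum(high - low + 1 for low, high in bounds.values())
    return span >= min_span_bp


# 10000 contiguous 1 kb bins on a single contig span exactly 10 Mb.
bins = [("chr1", 1 + i * 1000, 1000 + i * 1000) for i in range(10_000)]
ok = subset_is_large_enough(bins)
```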

Memory requirements for the python subprocess ("gcnvkernel"):

The computation done by this tool, for the most part, is performed outside of the JVM and via a spawned python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy-number state, the number of bias factors, and the number of knobs on the GC curve. For example, the python subprocess requires approximately 16GB of physical memory for modeling 10000 intervals for 100 samples, with 16 maximum bias factors, maximum copy-number state of 10, and explicit GC bias modeling.
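
For planning purposes, the reference figure above can be extrapolated to other problem sizes. The Python sketch below is a rough estimate under a strong assumption: that the linear scaling in each factor combines multiplicatively, so that memory is proportional to the product of the factors. The number of GC curve knobs is left out because the text gives no reference value for it, and the function name `estimate_python_memory_gb` is hypothetical. Treat the result as an order-of-magnitude guide, not a guarantee.

```python
# Reference point quoted in the text: about 16 GB for this configuration.
REFERENCE_MEMORY_GB = 16.0
REFERENCE_INTERVALS = 10_000
REFERENCE_SAMPLES = 100
REFERENCE_BIAS_FACTORS = 16
REFERENCE_MAX_COPY_NUMBER = 10


def estimate_python_memory_gb(intervals, samples,
                              bias_factors=REFERENCE_BIAS_FACTORS,
                              max_copy_number=REFERENCE_MAX_COPY_NUMBER):
    """Extrapolate the quoted 16 GB figure, assuming memory is
    proportional to the product of the four factors (an assumption)."""
    scale = (
        (intervals / REFERENCE_INTERVALS)
        * (samples / REFERENCE_SAMPLES)
        * (bias_factors / REFERENCE_BIAS_FACTORS)
        * (max_copy_number / REFERENCE_MAX_COPY_NUMBER)
    )
    return REFERENCE_MEMORY_GB * scale


# Half the intervals and half the samples of the reference configuration.
estimate = estimate_python_memory_gb(intervals=5_000, samples=50)
```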

Usage examples

COHORT mode:
```
 gatk GermlineCNVCaller \
   --run-mode COHORT \
   -L intervals.interval_list \
   --interval-merging-rule OVERLAPPING_ONLY \
   --contig-ploidy-calls path_to_contig_ploidy_calls \
   --input normal_1.counts.hdf5 \
   --input normal_2.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_cohort_run
 
```
CASE mode:
```
 gatk GermlineCNVCaller \
   --run-mode CASE \
   --contig-ploidy-calls path_to_contig_ploidy_calls \
   --model previous_model_path \
   --input normal_1.counts.hdf5 \
   ... \
   --output output_dir \
   --output-prefix normal_case_run
 
```
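
When many count files are involved, the repeated input arguments in the commands above are easier to generate programmatically. The following Python sketch assembles an argument list using only the arguments shown in the usage examples; the helper name `build_gcnv_command` and its parameter names are hypothetical. Interval arguments are added only in COHORT mode, since they are ignored in CASE mode.

```python
def build_gcnv_command(run_mode, ploidy_calls, count_files, output_dir,
                       output_prefix, intervals=None, model=None):
    """Build a gatk GermlineCNVCaller argument list (hypothetical helper)."""
    if run_mode not in ("COHORT", "CASE"):
        raise ValueError("run_mode must be 'COHORT' or 'CASE'")
    if run_mode == "CASE" and model is None:
        raise ValueError("CASE mode requires a previously obtained model")
    if not count_files:
        raise ValueError("at least one count file is required")

    cmd = ["gatk", "GermlineCNVCaller", "--run-mode", run_mode]
    if run_mode == "COHORT" and intervals is not None:
        cmd += ["-L", intervals,
                "--interval-merging-rule", "OVERLAPPING_ONLY"]
    cmd += ["--contig-ploidy-calls", ploidy_calls]
    if model is not None:
        # Required in CASE mode; optional initialization in COHORT mode.
        cmd += ["--model", model]
    for count_file in count_files:
        cmd += ["--input", count_file]
    cmd += ["--output", output_dir, "--output-prefix", output_prefix]
    return cmd


case_cmd = build_gcnv_command(
    run_mode="CASE",
    ploidy_calls="path_to_contig_ploidy_calls",
    count_files=["normal_1.counts.hdf5"],
    output_dir="output_dir",
    output_prefix="normal_case_run",
    model="previous_model_path",
)
```

The returned list can be handed to `subprocess.run(case_cmd, check=True)` so that a non-zero exit status raises an exception.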

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`GermlineCNVCaller.RunMode`

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`CALLS_PATH_SUFFIX`
`static java.lang.String`	`CASE_SAMPLE_CALLING_PYTHON_SCRIPT`
`static java.lang.String`	`COHORT_DENOISING_CALLING_PYTHON_SCRIPT`
`static java.lang.String`	`CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME`
`static java.lang.String`	`INPUT_MODEL_INTERVAL_FILE`
`protected IntervalArgumentCollection`	`intervalArgumentCollection`
`static java.lang.String`	`MODEL_PATH_SUFFIX`
`static java.lang.String`	`RUN_MODE_LONG_NAME`
`static java.lang.String`	`TRACKING_PATH_SUFFIX`

Fields inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY

Constructor Summary

Constructors
Constructor and Description
`GermlineCNVCaller()`

Method Summary

Modifier and Type	Method and Description
`protected java.lang.Object`	`doWork()` Do the work after command line has been parsed.
`protected void`	`onStartup()` Perform initialization/setup after command-line argument parsing but before doWork() is invoked.

Methods inherited from class org.broadinstitute.hellbender.cmdline.CommandLineProgram
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getPluginDescriptors, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - COHORT_DENOISING_CALLING_PYTHON_SCRIPT
```
public static final java.lang.String COHORT_DENOISING_CALLING_PYTHON_SCRIPT
```
    See Also:
    
    Constant Field Values
  - CASE_SAMPLE_CALLING_PYTHON_SCRIPT
```
public static final java.lang.String CASE_SAMPLE_CALLING_PYTHON_SCRIPT
```
    See Also:
    
    Constant Field Values
  - INPUT_MODEL_INTERVAL_FILE
```
public static final java.lang.String INPUT_MODEL_INTERVAL_FILE
```
    See Also:
    
    Constant Field Values
  - MODEL_PATH_SUFFIX
```
public static final java.lang.String MODEL_PATH_SUFFIX
```
    See Also:
    
    Constant Field Values
  - CALLS_PATH_SUFFIX
```
public static final java.lang.String CALLS_PATH_SUFFIX
```
    See Also:
    
    Constant Field Values
  - TRACKING_PATH_SUFFIX
```
public static final java.lang.String TRACKING_PATH_SUFFIX
```
    See Also:
    
    Constant Field Values
  - CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME
```
public static final java.lang.String CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME
```
    See Also:
    
    Constant Field Values
  - RUN_MODE_LONG_NAME
```
public static final java.lang.String RUN_MODE_LONG_NAME
```
    See Also:
    
    Constant Field Values
  - intervalArgumentCollection
```
@ArgumentCollection
protected IntervalArgumentCollection intervalArgumentCollection
```
- Constructor Detail
  - GermlineCNVCaller
```
public GermlineCNVCaller()
```
- Method Detail
  - onStartup
```
protected void onStartup()
```
    Description copied from class: CommandLineProgram
    
    Perform initialization/setup after command-line argument parsing but before doWork() is invoked. Default implementation does nothing. Subclasses can override to perform initialization.
    
    Overrides:
    
    onStartup in class CommandLineProgram
  - doWork
```
protected java.lang.Object doWork()
```
    Description copied from class: CommandLineProgram
    
    Do the work after the command line has been parsed. RuntimeExceptions may be thrown by this method and are reported appropriately.
    
    Specified by:
    
    doWork in class CommandLineProgram
    
    Returns:
    
    the return value, or null if there is none.
