@DocumentedFeature public final class GermlineCNVCaller extends CommandLineProgram
Calls copy-number variants in germline samples given their counts and the corresponding output of
DetermineGermlineContigPloidy. The former should be either HDF5 or TSV count files generated by
CollectReadCounts; TSV files may be compressed (e.g., with bgzip), but must then have filenames ending
with the extension .gz. See the documentation for the input argument for details on enabling streaming
of indexed count files from Google Cloud Storage.
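The naming rule for compressed count files can be checked before launching the tool. The following is a minimal sketch, not part of GATK: the function name check_count_file_name is hypothetical, and it assumes that compression can be detected from the two-byte gzip magic number (bgzip output is gzip-compatible and starts with the same bytes).

```python
from pathlib import Path

# First two bytes of any gzip stream; bgzip files begin with the same bytes.
GZIP_MAGIC = b"\x1f\x8b"


def check_count_file_name(path):
    """Return None if the file name is acceptable, else an error message.

    Hypothetical pre-flight helper: it only enforces the rule that a
    compressed count file must have a name ending in ".gz".
    """
    p = Path(path)
    with p.open("rb") as handle:
        is_compressed = handle.read(2) == GZIP_MAGIC
    if is_compressed and not p.name.endswith(".gz"):
        return f"{p.name}: compressed count files must have names ending in .gz"
    return None
```

Running this helper over every path intended for the input argument reports misnamed files early, before any modeling starts.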
Reliable detection of copy-number variation (CNV) from read-depth ("coverage" or "counts") data such as whole
exome sequencing (WES), whole genome sequencing (WGS), and custom gene sequencing panels requires a comprehensive
model to account for technical variation in library preparation and sequencing. The Bayesian model and the associated
inference scheme implemented in GermlineCNVCaller include provisions for inferring and explaining away much
of the technical variation. Furthermore, CNV calling confidence is automatically adjusted for each sample and
genomic interval.
The parameters of the probabilistic model for read-depth bias and variance estimation (hereafter, "the coverage
model") can be inferred automatically by GermlineCNVCaller when it is given a cohort of germline samples
sequenced using the same sequencing platform and library preparation protocol (in the case of WES, the same
capture kit). We refer to this run mode of the tool as the COHORT mode. The number of samples required for the
COHORT mode depends on several factors, such as the sequencing depth, tissue type/quality and similarity in the
cohort, and how strictly the library preparation and sequencing protocols were followed. For WES and WGS
samples, we recommend including at least 30 samples.
The parametrized coverage model can be used for CNV calling on future case samples provided that they are
strictly compatible with the cohort used to generate the model parameters (in terms of tissue type(s), library
preparation and sequencing protocols). We refer to this mode as the CASE run mode. There is no lower
limit on the number of samples for running GermlineCNVCaller in CASE mode.
In both tool run modes, GermlineCNVCaller requires karyotyping and global read-depth information for
all samples. Such information can be automatically generated by running DetermineGermlineContigPloidy
on all samples, and passed on to GermlineCNVCaller by providing the ploidy output calls using the argument
contig-ploidy-calls. The ploidy state of a contig is used as the baseline
("default") copy-number state of all intervals contained in that contig for the corresponding sample. All other
copy-number states are treated as alternative states and get equal shares from the total alternative state
probability (set using the p-alt argument).
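The split of prior probability between the baseline and alternative states can be illustrated with a short sketch. It assumes that copy-number states run from 0 to a maximum state and that the baseline state receives the remaining probability 1 - p_alt; the function name and the example values are hypothetical and are not tool defaults.

```python
def copy_number_prior(baseline_state, max_copy_number, p_alt):
    """Prior over copy-number states 0..max_copy_number for one contig.

    Assumption: the baseline (ploidy) state keeps 1 - p_alt, and every
    other state receives an equal share of p_alt.
    """
    if not 0.0 <= p_alt <= 1.0:
        raise ValueError("p_alt must be a probability in [0, 1]")
    if not 0 <= baseline_state <= max_copy_number:
        raise ValueError("baseline_state must lie in 0..max_copy_number")
    num_alternative = max_copy_number  # all states except the baseline
    if num_alternative == 0:
        return [1.0]
    alt_share = p_alt / num_alternative
    prior = [alt_share] * (max_copy_number + 1)
    prior[baseline_state] = 1.0 - p_alt
    return prior


# Illustrative values only: diploid contig, states 0..5, p_alt = 0.05.
prior = copy_number_prior(baseline_state=2, max_copy_number=5, p_alt=0.05)
```

For a contig called as haploid in a given sample (for example, X in a karyotypically male sample), the same function would be called with baseline_state=1.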
The computation done by this tool, aside from input data parsing and validation, is performed outside of the Java
Virtual Machine and using the gCNV computational python module, namely gcnvkernel. It is crucial that
the user has properly set up a python conda environment with gcnvkernel and its dependencies
installed. If the user intends to run GermlineCNVCaller using one of the official GATK Docker images,
the python environment is already set up. Otherwise, the environment must be created and activated as described in the
main GATK README.md file.
Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano
configuration. For example, by running
THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk GermlineCNVCaller ...
users can specify the theano compilation directory (which is set to $HOME/.theano by default). See the theano
documentation at http://deeplearning.net/software/theano/library/config.html.
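When the tool is launched from a script rather than a shell, the same override can be passed through the process environment. The sketch below is an assumption-laden illustration: theano_env is a hypothetical helper, and it relies on THEANO_FLAGS being a comma-separated list of key=value pairs so that other flags already present can be preserved.

```python
import os


def theano_env(base_compiledir, base_env=None):
    """Return a copy of the environment with base_compiledir set in THEANO_FLAGS.

    Flags already present in THEANO_FLAGS are kept, except for any
    previous base_compiledir entry, which is replaced.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("THEANO_FLAGS", "")
    flags = [
        flag for flag in existing.split(",")
        if flag and not flag.startswith("base_compiledir=")
    ]
    flags.append(f"base_compiledir={base_compiledir}")
    env["THEANO_FLAGS"] = ",".join(flags)
    return env


# The result can be handed to subprocess.run(command, env=env, check=True),
# where command is the gatk GermlineCNVCaller argument list.
env = theano_env("PATH/TO/BASE_COMPILEDIR", base_env={})
```

Giving each concurrent run its own compilation directory is one common reason to set this flag, although the source does not prescribe it.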
The tool is run in COHORT mode by passing the argument run-mode COHORT.
In this mode, coverage model parameters are inferred simultaneously with the CNV states. Depending on
available memory, it may be necessary to run the tool over a subset of all intervals, which can be specified
by -L; this can be used to pass a filtered interval list produced by FilterIntervals to mask
intervals from modeling. The specified intervals must be present in all of the input count files. The output
will contain two subdirectories, one ending with "-model" and the other with "-calls".
The model subdirectory contains a snapshot of the inferred parameters of the coverage model, which may be
used later for CNV calling in one or more similarly-sequenced samples as mentioned earlier. Optionally, the path
to a previously obtained coverage model parametrization can be provided via the model argument
in COHORT mode, in which case the provided parameters will only be used for model initialization and
a new parametrization will be generated based on the input count files. Furthermore, the genomic intervals are
set to those used in creating the previous parametrization and interval-related arguments will be ignored.
Note that the newly obtained parametrization ultimately reflects the input count files from the last run,
regardless of whether or not an initialization parameter set is given. If the user wishes to model coverage
data from two or more cohorts simultaneously, all of the input count files must be given to the tool at once.
The calls subdirectory contains sequentially-ordered subdirectories for each sample, each listing various sample-specific estimated quantities such as the probability of various copy-number states for each interval, a parametrization of the GC curve, sample-specific unexplained variance, read-depth, and loadings of coverage bias factors.
The tool is run in CASE mode by passing the argument run-mode CASE. The path to a previously
obtained model directory must be provided via the model argument in this mode. The modeled intervals are
then specified by a file contained in the model directory; all interval-related arguments are ignored in this
mode, and all model intervals must be present in all of the input count files. The tool output in CASE mode
is only the "-calls" subdirectory and is organized similarly to that in COHORT mode.
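The output layout of the two run modes can be summarized in a few lines. This sketch assumes that each subdirectory is named by appending the "-model" or "-calls" suffix to the output prefix inside the output directory; the source only states the suffixes, and expected_output_dirs is a hypothetical helper.

```python
from pathlib import Path


def expected_output_dirs(output_dir, output_prefix, run_mode):
    """Map a run mode to the subdirectories it is expected to produce.

    Assumption: directory name = output prefix + suffix.
    """
    if run_mode not in ("COHORT", "CASE"):
        raise ValueError(f"unknown run mode: {run_mode}")
    root = Path(output_dir)
    dirs = {"calls": root / f"{output_prefix}-calls"}
    if run_mode == "COHORT":
        # Only COHORT mode writes a new model parametrization.
        dirs["model"] = root / f"{output_prefix}-model"
    return dirs


cohort_dirs = expected_output_dirs("output_dir", "normal_cohort_run", "COHORT")
case_dirs = expected_output_dirs("output_dir", "normal_case_run", "CASE")
```

Under the same assumption, the model path from a COHORT run is what a later CASE run would receive via the model argument.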
Note that at the moment, this tool does not automatically verify the compatibility of the provided parametrization with the provided count files. Model compatibility may be assessed a posteriori by inspecting the magnitude of sample-specific unexplained variance of each sample, and asserting that they lie within the same range as those obtained from the cohort used to generate the parametrization. This manual step is expected to be made automatic in the future.
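The a posteriori compatibility check described above amounts to a range comparison. The following sketch assumes that the sample-specific unexplained variance of each sample has already been extracted from the calls output into plain numbers; how to read those values from disk is not covered here, and the function name, the tolerance parameter, and the sample values are hypothetical.

```python
def samples_outside_cohort_range(cohort_variances, case_variances, tolerance=0.0):
    """Return case samples whose unexplained variance falls outside the cohort range.

    cohort_variances: iterable of floats, one per cohort sample.
    case_variances:   dict mapping sample name to its unexplained variance.
    tolerance:        optional fraction of the cohort range added as a margin.
    """
    cohort = list(cohort_variances)
    if not cohort:
        raise ValueError("at least one cohort value is required")
    low, high = min(cohort), max(cohort)
    margin = tolerance * (high - low)
    return {
        name: value
        for name, value in case_variances.items()
        if not (low - margin <= value <= high + margin)
    }


# Made-up numbers purely to show the call shape.
flagged = samples_outside_cohort_range(
    [0.010, 0.020, 0.030], {"case_a": 0.020, "case_b": 0.200}
)
```

A flagged sample is a hint that the case data may not be compatible with the cohort used to generate the parametrization; it is not a formal test.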
The quality of coverage model parametrization and the sensitivity/precision of germline CNV calling are
sensitive to the choice of model hyperparameters, including the prior probability of alternative copy-number states
(set using the p-alt argument), prevalence of active (i.e. CNV-rich) intervals (set via the
p-active argument), the coherence length of CNV events and active/silent domains
across intervals (set using the cnv-coherence-length and class-coherence-length arguments,
respectively), and the typical scale of interval- and sample-specific unexplained variance
(set using the interval-psi-scale and sample-psi-scale arguments, respectively). It is crucial
to note that these hyperparameters are not universal and must be tuned for each sequencing protocol
and properly set at runtime.
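Because the hyperparameters must be set explicitly for each sequencing protocol, it can help to keep them in one place and turn them into command-line arguments programmatically. This is a minimal sketch: the argument names are those listed above, while hyperparameter_args and the example values are hypothetical and are not recommendations.

```python
HYPERPARAMETER_NAMES = (
    "p-alt",
    "p-active",
    "cnv-coherence-length",
    "class-coherence-length",
    "interval-psi-scale",
    "sample-psi-scale",
)


def hyperparameter_args(settings):
    """Convert a {name: value} dict into a flat list of command-line arguments."""
    unknown = set(settings) - set(HYPERPARAMETER_NAMES)
    if unknown:
        raise ValueError(f"unknown hyperparameters: {sorted(unknown)}")
    for name in ("p-alt", "p-active"):
        # Both are described as probabilities, so they must lie in [0, 1].
        if name in settings and not 0.0 <= settings[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    args = []
    for name in HYPERPARAMETER_NAMES:  # fixed order keeps commands reproducible
        if name in settings:
            args += [f"--{name}", str(settings[name])]
    return args


# Placeholder values for a hypothetical protocol; tune these for real data.
example_args = hyperparameter_args({"p-alt": 0.001, "sample-psi-scale": 0.01})
```

The returned list can be appended to a gatk GermlineCNVCaller command; hyperparameters left out of the dict fall back to whatever the tool uses by default.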
Running GermlineCNVCaller on a subset of intervals:
As mentioned earlier, it may be necessary to run the tool over a subset of all intervals depending on available memory. The number of intervals must be large enough to include a genomically diverse set of regions for reliable inference of the GC bias curve, as well as other bias factors. For WES and WGS, we recommend no fewer than 10000 consecutive intervals spanning at least 10-50 Mb.
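The size recommendation for an interval subset can be checked mechanically. The sketch below takes the lower bounds from the text (10000 intervals, 10 Mb) and assumes that "spanning" means the genomic extent from the first start to the last end on each contig, summed over contigs; the function name and the tuple layout are hypothetical.

```python
MIN_INTERVALS = 10_000
MIN_SPAN_BP = 10_000_000  # lower end of the recommended 10-50 Mb range


def subset_is_large_enough(intervals):
    """Check a subset of intervals against the recommended minimum size.

    intervals: list of (contig, start, end) tuples with 1-based inclusive
    coordinates, assumed to be a consecutive run from the full interval list.
    """
    if len(intervals) < MIN_INTERVALS:
        return False
    extent = {}  # contig -> (lowest start, highest end)
    for contig, start, end in intervals:
        low, high = extent.get(contig, (start, end))
        extent[contig] = (min(low, start), max(high, end))
    total_span = sum(high - low + 1 for low, high in extent.values())
    return total_span >= MIN_SPAN_BP


# 10000 adjacent 1 kb bins on one contig span exactly 10 Mb.
bins = [("chr1", i * 1000 + 1, (i + 1) * 1000) for i in range(10_000)]
ok = subset_is_large_enough(bins)
```

A subset that fails this check is a candidate for merging with a neighboring subset before modeling.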
The computation done by this tool, for the most part, is performed outside of the JVM and via a spawned python subprocess. The Java heap memory is only used for loading sample counts and preparing raw data for the python subprocess. The user must ensure that the machine has enough free physical memory for spawning and executing the python subprocess. Generally speaking, the resource requirements of this tool scale linearly with each of the number of samples, the number of modeled intervals, the highest copy number state, the number of bias factors, and the number of knobs on the GC curve. For example, the python subprocess requires approximately 16 GB of physical memory for modeling 10000 intervals for 100 samples, with 16 maximum bias factors, maximum copy-number state of 10, and explicit GC bias modeling.
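The quoted data point can be turned into a crude first-order memory estimate. This sketch assumes strictly proportional scaling in the number of intervals and the number of samples, with all other settings held at the quoted configuration (16 bias factors, maximum copy-number state of 10, explicit GC bias modeling); the source gives only one reference point, so the result is a rough planning figure, not a guarantee.

```python
REFERENCE_INTERVALS = 10_000
REFERENCE_SAMPLES = 100
REFERENCE_MEMORY_GB = 16.0


def estimate_python_memory_gb(num_intervals, num_samples):
    """Extrapolate python subprocess memory from the single quoted data point.

    Assumption: memory is proportional to intervals times samples when the
    remaining model settings match the reference configuration.
    """
    if num_intervals <= 0 or num_samples <= 0:
        raise ValueError("counts must be positive")
    scale = (num_intervals / REFERENCE_INTERVALS) * (num_samples / REFERENCE_SAMPLES)
    return REFERENCE_MEMORY_GB * scale


estimate = estimate_python_memory_gb(num_intervals=20_000, num_samples=100)
```

Because the Java heap is used only for loading counts, this figure concerns free physical memory outside the JVM.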
COHORT mode:
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L intervals.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls path_to_contig_ploidy_calls \
    --input normal_1.counts.hdf5 \
    --input normal_2.counts.hdf5 \
    ... \
    --output output_dir \
    --output-prefix normal_cohort_run
CASE mode:
gatk GermlineCNVCaller \
    --run-mode CASE \
    --contig-ploidy-calls path_to_contig_ploidy_calls \
    --model previous_model_path \
    --input normal_1.counts.hdf5 \
    ... \
    --output output_dir \
    --output-prefix normal_case_run
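For pipelines that launch the tool from a script, the two commands above can be assembled by one function that also enforces the rule that CASE mode requires a model. This is a sketch under stated assumptions: build_gcnv_command is a hypothetical helper, it uses only the arguments shown in the usage examples, and it omits interval arguments in CASE mode because they are ignored there.

```python
def build_gcnv_command(run_mode, contig_ploidy_calls, inputs, output_dir,
                       output_prefix, model=None, intervals=None):
    """Assemble a gatk GermlineCNVCaller argument list for either run mode."""
    if run_mode not in ("COHORT", "CASE"):
        raise ValueError(f"unknown run mode: {run_mode}")
    if run_mode == "CASE" and model is None:
        raise ValueError("CASE mode requires a previously obtained model")
    if not inputs:
        raise ValueError("at least one count file is required")
    command = ["gatk", "GermlineCNVCaller",
               "--run-mode", run_mode,
               "--contig-ploidy-calls", contig_ploidy_calls]
    if run_mode == "COHORT" and intervals is not None:
        command += ["-L", intervals,
                    "--interval-merging-rule", "OVERLAPPING_ONLY"]
    if model is not None:
        # In COHORT mode a model is optional and only initializes inference.
        command += ["--model", model]
    for path in inputs:
        command += ["--input", path]
    command += ["--output", output_dir, "--output-prefix", output_prefix]
    return command


case_command = build_gcnv_command(
    "CASE", "path_to_contig_ploidy_calls", ["normal_1.counts.hdf5"],
    "output_dir", "normal_case_run", model="previous_model_path",
)
```

The list form is suitable for subprocess.run(case_command, check=True), which avoids shell quoting issues with file paths.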
Modifier and Type | Class and Description |
---|---|
static class | GermlineCNVCaller.RunMode |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | CALLS_PATH_SUFFIX |
static java.lang.String | CASE_SAMPLE_CALLING_PYTHON_SCRIPT |
static java.lang.String | COHORT_DENOISING_CALLING_PYTHON_SCRIPT |
static java.lang.String | CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME |
static java.lang.String | INPUT_MODEL_INTERVAL_FILE |
protected IntervalArgumentCollection | intervalArgumentCollection |
static java.lang.String | MODEL_PATH_SUFFIX |
static java.lang.String | RUN_MODE_LONG_NAME |
static java.lang.String | TRACKING_PATH_SUFFIX |
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
GermlineCNVCaller() |
Modifier and Type | Method and Description |
---|---|
protected java.lang.Object | doWork() Do the work after command line has been parsed. |
protected void | onStartup() Perform initialization/setup after command-line argument parsing but before doWork() is invoked. |
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getPluginDescriptors, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, onShutdown, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String COHORT_DENOISING_CALLING_PYTHON_SCRIPT
public static final java.lang.String CASE_SAMPLE_CALLING_PYTHON_SCRIPT
public static final java.lang.String INPUT_MODEL_INTERVAL_FILE
public static final java.lang.String MODEL_PATH_SUFFIX
public static final java.lang.String CALLS_PATH_SUFFIX
public static final java.lang.String TRACKING_PATH_SUFFIX
public static final java.lang.String CONTIG_PLOIDY_CALLS_DIRECTORY_LONG_NAME
public static final java.lang.String RUN_MODE_LONG_NAME
@ArgumentCollection protected IntervalArgumentCollection intervalArgumentCollection
protected void onStartup()
Overrides: onStartup in class CommandLineProgram
protected java.lang.Object doWork()
Overrides: doWork in class CommandLineProgram