@DocumentedFeature public final class GenomicsDBImport extends GATKTool
The GATK4 Best Practice Workflow for SNP and Indel calling uses GenomicsDBImport to merge GVCFs from multiple samples. GenomicsDBImport offers the same functionality as CombineGVCFs and initially came from the Intel-Broad Center for Genomics. The datastore transposes sample-centric variant information across genomic loci to make data more accessible to tools.
To query the contents of the GenomicsDB datastore, use SelectVariants. See Tutorial#11813 to get started.
Details on GenomicsDB are at https://github.com/GenomicsDB/GenomicsDB/wiki. In brief, GenomicsDB utilises a data storage system optimized for storing/querying sparse arrays. Genomics data is typically sparse in that each sample has few variants with respect to the entire reference genome. GenomicsDB contains specialized code for genomics applications, such as VCF parsing and INFO field annotation calculation.
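To make the idea of transposing sample-centric data across loci concrete, here is a toy sketch. It is an illustration only and says nothing about how GenomicsDB lays out its arrays; the class name `TransposeSketch`, the sample names, the positions, and the genotype strings are all made up for the example. Each sample stores only the positions where it has a call (the sparse part), and the transpose regroups those calls by position.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration (not GenomicsDB code): regroup sparse per-sample calls by locus.
final class TransposeSketch {

    /**
     * Input:  sample name -> (position -> call), holding only positions with a call.
     * Output: position -> (sample name -> call), sorted by position and then by sample.
     */
    static SortedMap<Integer, SortedMap<String, String>> transpose(
            Map<String, Map<Integer, String>> callsBySample) {
        SortedMap<Integer, SortedMap<String, String>> callsByLocus = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, String>> sample : callsBySample.entrySet()) {
            for (Map.Entry<Integer, String> call : sample.getValue().entrySet()) {
                callsByLocus
                        .computeIfAbsent(call.getKey(), position -> new TreeMap<>())
                        .put(sample.getKey(), call.getValue());
            }
        }
        return callsByLocus;
    }

    public static void main(String[] args) {
        // Hypothetical data: two samples, three calls in total.
        Map<String, Map<Integer, String>> callsBySample = Map.of(
                "mother", Map.of(1042, "0/1"),
                "father", Map.of(1042, "1/1", 5310, "0/1"));
        // prints {1042={father=1/1, mother=0/1}, 5310={father=0/1}}
        System.out.println(transpose(callsBySample));
    }
}
```

After the transpose, a tool that works locus by locus can read every sample's call at a position in one lookup instead of scanning each sample's records.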
One or more GVCFs produced by HaplotypeCaller with the `-ERC GVCF` or `-ERC BP_RESOLUTION` settings, containing the samples to joint-genotype.
A GenomicsDB workspace
gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf.gz \
    -V data/gvcfs/father.g.vcf.gz \
    -V data/gvcfs/son.g.vcf.gz \
    --genomicsdb-workspace-path my_database \
    --tmp-dir=/path/to/large/tmp \
    -L 20

Provide sample GVCFs in a map file.

gatk --java-options "-Xmx4g -Xms4g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --batch-size 50 \
    -L chr1:1000-10000 \
    --sample-name-map cohort.sample_map \
    --tmp-dir=/path/to/large/tmp \
    --reader-threads 5

The sample map is a tab-delimited text file with sample_name--tab--path_to_sample_vcf per line. Using a sample map saves the tool from having to download the GVCF headers in order to determine the sample names. Sample names in the sample name map file may contain non-tab whitespace, but may not begin or end with whitespace.

sample1      sample1.vcf.gz
sample2      sample2.vcf.gz
sample3      sample3.vcf.gz

Add new samples to an existing GenomicsDB workspace.

gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf.gz \
    -V data/gvcfs/father.g.vcf.gz \
    -V data/gvcfs/son.g.vcf.gz \
    --genomicsdb-update-workspace-path my_database \
    --tmp-dir=/path/to/large/tmp

In the incremental import case, no intervals are specified in the command because the tool will use the same intervals used in the initial import. A sample map is also supported for incremental import.

Get a Picard-style interval_list from an existing workspace.

gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    --genomicsdb-update-workspace-path my_database \
    --output-interval-list-to-file /output/path/to/file

The interval_list for the specified/existing workspace will be written to /output/path/to/file. This will output a Picard-style interval_list (with a sequence dictionary header).
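The sample map rules described above (one tab-separated name and path per line, no leading or trailing whitespace in the name, unique names) can be sketched as a small parser. This is a minimal illustration under stated assumptions, not the tool's own loader: the class `SampleMapSketch` and its `parse` method are hypothetical, the sketch assumes exactly two columns per line, and it keeps paths as plain strings rather than URIs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the sample-map format rules; not GATK's implementation.
final class SampleMapSketch {

    /** Parses lines of the form sample_name<TAB>path, keeping the order of the file. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> sampleToPath = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: exactly two tab-separated columns per line.
            String[] fields = line.split("\t", -1);
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
            }
            String name = fields[0];
            // Interior spaces are allowed; leading or trailing whitespace is not.
            if (name.isEmpty() || !name.equals(name.trim())) {
                throw new IllegalArgumentException("Bad sample name: '" + name + "'");
            }
            // Sample names must be unique.
            if (sampleToPath.putIfAbsent(name, fields[1]) != null) {
                throw new IllegalArgumentException("Duplicate sample name: " + name);
            }
        }
        return sampleToPath;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "sample1\tsample1.vcf.gz",
                "sample2\tsample2.vcf.gz",
                "sample3\tsample3.vcf.gz");
        // prints {sample1=sample1.vcf.gz, sample2=sample2.vcf.gz, sample3=sample3.vcf.gz}
        System.out.println(parse(lines));
    }
}
```

Wrapping the result in a `java.util.TreeMap` would give name-sorted order instead of file order, which mirrors the distinction between `loadSampleNameMapFile` and `loadSampleNameMapFileInSortedOrder` in the method summary.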
GenomicsDBFeatureReader
Modifier and Type | Field and Description |
---|---|
static java.lang.String | BATCHSIZE_ARG_LONG_NAME |
static java.lang.String | CONSOLIDATE_ARG_NAME |
static java.lang.String | INCREMENTAL_WORKSPACE_ARG_LONG_NAME |
static java.lang.String | INTERVAL_LIST_LONG_NAME |
static int | INTERVAL_LIST_SIZE_WARNING_THRESHOLD |
static java.lang.String | MAX_NUM_INTERVALS_TO_IMPORT_IN_PARALLEL |
static java.lang.String | MERGE_INPUT_INTERVALS_LONG_NAME |
static java.lang.String | OVERWRITE_WORKSPACE_LONG_NAME |
static java.lang.String | SAMPLE_NAME_MAP_LONG_NAME |
static java.lang.String | SEGMENT_SIZE_ARG_LONG_NAME |
static java.lang.String | VALIDATE_SAMPLE_MAP_LONG_NAME |
static java.lang.String | VCF_BUFFER_SIZE_ARG_NAME |
static java.lang.String | VCF_INITIALIZER_THREADS_LONG_NAME |
static java.lang.String | WORKSPACE_ARG_LONG_NAME |
addOutputSAMProgramRecord, addOutputVCFCommandLine, cloudIndexPrefetchBuffer, cloudPrefetchBuffer, createOutputBamIndex, createOutputBamMD5, createOutputVariantIndex, createOutputVariantMD5, disableBamIndexCaching, features, intervalArgumentCollection, lenientVCFProcessing, outputSitesOnlyVCFs, progressMeter, readArguments, referenceArguments, SECONDS_BETWEEN_PROGRESS_UPDATES_NAME, seqValidationArguments
GATK_CONFIG_FILE, logger, NIO_MAX_REOPENS, NIO_PROJECT_FOR_REQUESTER_PAYS, QUIET, specialArgumentsCollection, tmpDir, useJdkDeflater, useJdkInflater, VERBOSITY
Constructor and Description |
---|
GenomicsDBImport() |
Modifier and Type | Method and Description |
---|---|
htsjdk.samtools.SAMSequenceDictionary | getBestAvailableSequenceDictionary(): Overrides getBestAvailableSequenceDictionary() to prefer the mergedVCFHeader's sequence dictionary, if present, over any other dictionaries. |
int | getDefaultCloudIndexPrefetchBufferSize() |
int | getDefaultCloudPrefetchBufferSize() |
java.lang.String | getProgressMeterRecordLabel() |
protected void | initializeIntervals(): Loads our intervals using the best available sequence dictionary (as returned by getBestAvailableSequenceDictionary()) to parse/verify them. |
static java.util.LinkedHashMap<java.lang.String,java.net.URI> | loadSampleNameMapFile(java.nio.file.Path sampleToFileMapPath): Loads a tab-delimited, newline-separated file mapping sample names to URIs, keeping the keys in the order they appear in the file. This tool should only call loadSampleNameMapFileInSortedOrder(Path); this version is exposed for the benefit of FixCallSetSampleOrdering. Example: Sample1\tpathToSample1.vcf\n Sample2\tpathToSample2.vcf\n ... |
static java.util.SortedMap<java.lang.String,java.net.URI> | loadSampleNameMapFileInSortedOrder(java.nio.file.Path sampleToFileMapPath): Loads a tab-delimited, newline-separated file mapping sample names to URIs. Example: Sample1\tpathToSample1.vcf\n Sample2\tpathToSample2.vcf\n ... |
void | onShutdown(): Closes all data sources on shutdown. |
void | onStartup(): Before traversal starts, creates the feature readers for all the input GVCFs, creates the merged header, and initializes the intervals. |
void | onTraversalStart(): Before traversal, fixes configuration parameters and initializes GenomicsDB. |
java.lang.Object | onTraversalSuccess(): Operations performed immediately after a successful traversal (i.e. when no uncaught exceptions were thrown during the traversal). |
protected java.util.List<SimpleInterval> | transformTraversalIntervals(java.util.List<SimpleInterval> getIntervals, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary): Gets the largest interval per contig that contains the intervals specified on the command line. |
void | traverse(): A complete traversal from start to finish. |
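The description of `transformTraversalIntervals` in the table above, one interval per contig that contains the intervals given on the command line, can be illustrated with a short sketch. This is one reading of that sentence, not the tool's code: `SpanningIntervalSketch`, its nested `Interval` class (a stand-in for `SimpleInterval`), and `spanPerContig` are hypothetical names, and the sketch skips the validation against a sequence dictionary that the real method's second parameter implies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse a list of intervals to one spanning interval per contig.
final class SpanningIntervalSketch {

    /** Stand-in for SimpleInterval: a contig name with inclusive start and end coordinates. */
    static final class Interval {
        final String contig;
        final int start;
        final int end;

        Interval(String contig, int start, int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + end;
        }
    }

    /** Returns one interval per contig, from the smallest start to the largest end seen. */
    static List<Interval> spanPerContig(List<Interval> intervals) {
        // LinkedHashMap keeps contigs in the order they were first encountered.
        Map<String, Interval> spans = new LinkedHashMap<>();
        for (Interval interval : intervals) {
            spans.merge(interval.contig, interval, (a, b) ->
                    new Interval(a.contig, Math.min(a.start, b.start), Math.max(a.end, b.end)));
        }
        return new ArrayList<>(spans.values());
    }

    public static void main(String[] args) {
        List<Interval> requested = List.of(
                new Interval("chr1", 1000, 2000),
                new Interval("chr2", 50, 60),
                new Interval("chr1", 9000, 10000));
        // prints [chr1:1000-10000, chr2:50-60]
        System.out.println(spanPerContig(requested));
    }
}
```

Note that the spanning interval also covers the gap between the requested intervals (positions 2001 to 8999 on chr1 in the example).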
addFeatureInputsAfterInitialization, closeTool, createSAMWriter, createSAMWriter, createVCFWriter, createVCFWriter, directlyAccessEngineFeatureManager, directlyAccessEngineReadsDataSource, directlyAccessEngineReferenceDataSource, doWork, getDefaultReadFilters, getDefaultToolVCFHeaderLines, getDefaultVariantAnnotationGroups, getDefaultVariantAnnotations, getGenomicsDBOptions, getHeaderForFeatures, getHeaderForReads, getHeaderForSAMWriter, getMasterSequenceDictionary, getPluginDescriptors, getReferenceDictionary, getSequenceDictionaryValidationArgumentCollection, getToolName, getTransformedReadStream, getTraversalIntervals, hasFeatures, hasReads, hasReference, hasUserSuppliedIntervals, makePostReadFilterTransformer, makePreReadFilterTransformer, makeReadFilter, makeVariantAnnotations, requiresFeatures, requiresIntervals, requiresReads, requiresReference, useVariantAnnotations
customCommandLineValidation, getCommandLine, getCommandLineParser, getDefaultHeaders, getMetricsFile, getSupportInformation, getToolkitName, getToolkitShortName, getToolStatusWarning, getUsage, getVersion, instanceMain, instanceMainPostParseArgs, isBetaFeature, isExperimentalFeature, parseArgs, printLibraryVersions, printSettings, printStartupMessage, runTool, setDefaultHeaders, warnOnToolStatus
public static final java.lang.String WORKSPACE_ARG_LONG_NAME
public static final java.lang.String INCREMENTAL_WORKSPACE_ARG_LONG_NAME
public static final java.lang.String SEGMENT_SIZE_ARG_LONG_NAME
public static final java.lang.String OVERWRITE_WORKSPACE_LONG_NAME
public static final java.lang.String INTERVAL_LIST_LONG_NAME
public static final java.lang.String VCF_BUFFER_SIZE_ARG_NAME
public static final java.lang.String BATCHSIZE_ARG_LONG_NAME
public static final java.lang.String CONSOLIDATE_ARG_NAME
public static final java.lang.String SAMPLE_NAME_MAP_LONG_NAME
public static final java.lang.String VALIDATE_SAMPLE_MAP_LONG_NAME
public static final java.lang.String MERGE_INPUT_INTERVALS_LONG_NAME
public static final java.lang.String VCF_INITIALIZER_THREADS_LONG_NAME
public static final java.lang.String MAX_NUM_INTERVALS_TO_IMPORT_IN_PARALLEL
public static final int INTERVAL_LIST_SIZE_WARNING_THRESHOLD
protected java.util.List<SimpleInterval> transformTraversalIntervals(java.util.List<SimpleInterval> getIntervals, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary)
Overrides: transformTraversalIntervals in class GATKTool
Parameters:
getIntervals - intervals to be transformed
sequenceDictionary - used to validate intervals
public int getDefaultCloudPrefetchBufferSize()
Overrides: getDefaultCloudPrefetchBufferSize in class GATKTool
GATKConfig file.
public int getDefaultCloudIndexPrefetchBufferSize()
Overrides: getDefaultCloudIndexPrefetchBufferSize in class GATKTool
GATKTool.getDefaultCloudPrefetchBufferSize(). The default implementation returns -1. This value is maintained in the GATKConfig file.
public java.lang.String getProgressMeterRecordLabel()
Overrides: getProgressMeterRecordLabel in class GATKTool
ProgressMeter.DEFAULT_RECORD_LABEL, but tools may override to provide a more appropriate label (like "reads" or "regions")
public void onStartup()
public static java.util.LinkedHashMap<java.lang.String,java.net.URI> loadSampleNameMapFile(java.nio.file.Path sampleToFileMapPath)
Loads a tab-delimited, newline-separated file mapping sample names to URIs, keeping the keys in the order they appear in the file. This tool should only call loadSampleNameMapFileInSortedOrder(Path); this version is exposed for the benefit of FixCallSetSampleOrdering.
ex:
Sample1\tpathToSample1.vcf\n
Sample2\tpathToSample2.vcf\n
...
The sample names must be unique.
Parameters:
sampleToFileMapPath - path to the mapping file
public static java.util.SortedMap<java.lang.String,java.net.URI> loadSampleNameMapFileInSortedOrder(java.nio.file.Path sampleToFileMapPath)
Parameters:
sampleToFileMapPath - path to the mapping file
public void onTraversalStart()
Overrides: onTraversalStart in class GATKTool
public void traverse()
public java.lang.Object onTraversalSuccess()
Overrides: onTraversalSuccess in class GATKTool
protected void initializeIntervals()
Loads our intervals using the best available sequence dictionary (as returned by getBestAvailableSequenceDictionary()) to parse/verify them. Does nothing if no intervals were specified.
public void onShutdown()
Overrides: onShutdown in class GATKTool
public htsjdk.samtools.SAMSequenceDictionary getBestAvailableSequenceDictionary()
Overrides: getBestAvailableSequenceDictionary in class GATKTool