Checks that all data in the set of input files appear to come from the same
individual. Can be used to compare according to readgroups, libraries, samples, or files.
Operates on bams/sams and vcfs (including gvcfs).
Summary
Checks if all the genetic data within a set of files appear to come from the same individual.
It quickly determines whether a "group's" genotype matches that of an input SAM/BAM/VCF by selective sampling,
and has been designed to work well even for low-depth SAM/BAMs.
The tool collects "fingerprints" (essentially genotype information from different parts of the genome)
at the finest level available in the data (readgroup for SAM files
and sample for VCF files) and then optionally aggregates it by library, sample or file, to increase power and provide
results at the desired resolution. Output is in a "Moltenized" format, one row per comparison. The results will
be emitted into a metric file for the class CrosscheckMetric.
In this format the output includes the LOD score as well as a tumor-aware LOD score, which can
help assess identity even in the presence of severe loss of heterozygosity with high purity (a situation in which the
standard LOD score could otherwise fail to show that the samples are from the same individual).
A matrix output is also available to facilitate visual inspection of crosscheck results.
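The relationship between the one-row-per-comparison format and the matrix view can be illustrated with a minimal Python sketch. The record keys `left`, `right` and `lod` are hypothetical names chosen for illustration; they are not the column names of the CrosscheckMetric file, and this is not how the tool itself builds its matrix output.

```python
def molten_to_matrix(rows):
    """Pivot one-row-per-comparison records into a nested dict keyed by group."""
    matrix = {}
    for row in rows:
        # Each molten row fills exactly one cell: matrix[left][right] = lod.
        matrix.setdefault(row["left"], {})[row["right"]] = row["lod"]
    return matrix


# Hypothetical comparisons between two readgroups (values are made up).
rows = [
    {"left": "rg1", "right": "rg1", "lod": 12.0},
    {"left": "rg1", "right": "rg2", "lod": 9.5},
    {"left": "rg2", "right": "rg1", "lod": 9.5},
    {"left": "rg2", "right": "rg2", "lod": 11.0},
]
matrix = molten_to_matrix(rows)
```

With N groups the molten form has on the order of N x N rows, which is why a matrix (or a clustering step) is easier to inspect by eye.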
Since there can be many rows of output in the metric file, we recommend the use of ClusterCrosscheckMetrics
as a follow-up step to running CrosscheckFingerprints.
There are cases where one would like to identify a few groups out of a collection of many possible groups (say,
to link a bam to its correct sample in a multi-sample vcf). In this case one would not care about the cross-checking
of the various samples in the VCF against each other, but only about checking the identity of the bam against the various
samples in the vcf. The SECOND_INPUT argument is provided for this use-case. With SECOND_INPUT provided, CrosscheckFingerprints
does the following:
Aggregation of data happens independently for the input files in INPUT and SECOND_INPUT.
Aggregation of data happens at the SAMPLE level.
Each sample from INPUT will only be compared to that same sample in SECOND_INPUT.
MATRIX_OUTPUT is disabled.
In some cases, the groups collected may not have any observations (calls for a vcf, reads for a bam) at fingerprinting sites, or a sample in INPUT may be missing from the SECOND_INPUT.
These cases are handled as follows: If running in CHECK_SAME_SAMPLES mode with INPUT and SECOND_INPUT, and either INPUT or SECOND_INPUT includes a sample
not found in the other, or contains a sample with no observations at any fingerprinting sites, an error will be logged and the tool will return EXIT_CODE_WHEN_MISMATCH.
In all other running modes, when any group which is being crosschecked does not have any observations at fingerprinting sites, a warning is logged. As long as there is at least
one comparison where both sides have observations at fingerprinting sites, the tool will return zero. However, if all comparisons have at least one side with no observations
at fingerprinting sites, an error will be logged and the tool will return EXIT_CODE_WHEN_NO_VALID_CHECKS.
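The rule for the "all other running modes" case can be sketched as follows. This is only an illustration of the decision described above, not the tool's implementation: it ignores exit codes caused by mismatches, and `no_valid_checks_code` is a hypothetical parameter standing in for whatever value EXIT_CODE_WHEN_NO_VALID_CHECKS is set to.

```python
def exit_status(comparisons, no_valid_checks_code):
    """Decide the exit status from (left_has_obs, right_has_obs) pairs.

    Returns 0 if at least one comparison has observations on both sides,
    otherwise the caller-supplied (hypothetical) no-valid-checks code.
    """
    if any(left and right for left, right in comparisons):
        return 0
    return no_valid_checks_code


# One usable comparison is enough for a zero exit status.
status_ok = exit_status([(True, False), (True, True)], no_valid_checks_code=1)
# Every comparison has an empty side, so the no-valid-checks code is returned.
status_bad = exit_status([(True, False), (False, False)], no_valid_checks_code=1)
```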
Examples
Check that all the readgroups from a sample match each other:
java -jar picard.jar CrosscheckFingerprints \
INPUT=sample.with.many.readgroups.bam \
HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
LOD_THRESHOLD=-5 \
OUTPUT=sample.crosscheck_metrics
Check that all the readgroups match as expected when providing reads from two samples from the same individual:
java -jar picard.jar CrosscheckFingerprints \
INPUT=sample.one.with.many.readgroups.bam \
INPUT=sample.two.with.many.readgroups.bam \
HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
LOD_THRESHOLD=-5 \
EXPECT_ALL_GROUPS_TO_MATCH=true \
OUTPUT=sample.crosscheck_metrics
Detailed Explanation
This tool calculates the LOD score for identity check between "groups" of data in the INPUT files as defined by
the CROSSCHECK_BY argument. A positive value indicates that the data seem to have come from the same individual
or, in other words, that the identity checks out. The scale is logarithmic (base 10), so a LOD of 6 indicates
that it is 1,000,000 times more likely that the data match the genotypes than not. A negative value indicates
that the data do not match. A score that is near zero is inconclusive and can result from low coverage
or non-informative genotypes. Each group is assigned a sample identifier (for SAM this is taken from the SM tag in
the appropriate readgroup header line; for VCF this is taken from the column label in the file header).
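As a quick illustration of the base-10 scale described above, the following Python sketch converts a LOD score into the corresponding likelihood ratio. The function name is hypothetical and only serves the example.

```python
def lod_to_likelihood_ratio(lod):
    """Convert a base-10 LOD score into a likelihood ratio (match vs. no match)."""
    return 10.0 ** lod


ratio_match = lod_to_likelihood_ratio(6)        # strongly favors a match
ratio_neutral = lod_to_likelihood_ratio(0)      # no evidence either way
ratio_mismatch = lod_to_likelihood_ratio(-3)    # favors a mismatch
```

A LOD of 0 corresponds to a ratio of 1, which is why scores near zero are treated as inconclusive.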
After combining all the data from the same "group" together, an all-against-all comparison is performed. Results are
categorized as a CrosscheckMetric.FingerprintResult enum: EXPECTED_MATCH, EXPECTED_MISMATCH, UNEXPECTED_MATCH, UNEXPECTED_MISMATCH,
or AMBIGUOUS, depending on the LOD score and on whether the sample identifiers of the groups agree: LOD scores that are
less than LOD_THRESHOLD are considered mismatches, those greater than -LOD_THRESHOLD are considered matches, and scores in between are ambiguous.
If the sample identifiers are equal, the groups are expected to match. They are expected to mismatch otherwise.
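The categorization rule above can be written out as a small Python sketch. It assumes a non-positive LOD_THRESHOLD (such as the -5 used in the examples) and returns the result names listed above as plain strings. It illustrates the rule as described here and is not the tool's own code.

```python
def classify(lod, lod_threshold, left_sample, right_sample):
    """Categorize one comparison; assumes lod_threshold <= 0 (e.g. -5)."""
    expected_to_match = left_sample == right_sample
    if lod < lod_threshold:
        # Clear mismatch: expected only when the sample identifiers differ.
        return "UNEXPECTED_MISMATCH" if expected_to_match else "EXPECTED_MISMATCH"
    if lod > -lod_threshold:
        # Clear match: expected only when the sample identifiers agree.
        return "EXPECTED_MATCH" if expected_to_match else "UNEXPECTED_MATCH"
    # Anything between lod_threshold and -lod_threshold is not decisive.
    return "AMBIGUOUS"


result_same = classify(8.0, -5, "NA12878", "NA12878")
result_swap = classify(-7.0, -5, "NA12878", "NA12878")
```

With LOD_THRESHOLD=-5, a LOD of 8 between two groups sharing a sample identifier is an expected match, while a LOD of -7 between the same two groups signals a possible sample swap.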
The identity check makes use of haplotype blocks defined in the HAPLOTYPE_MAP file to enable it to have higher
statistical power for detecting identity or swap by aggregating data from several SNPs in the haplotype block. This
enables an identity check of samples with very low coverage (e.g. ~1x mean coverage).
When provided a VCF, the identity check looks at the PL, GL and GT fields (in that order) and uses the first one that
it finds.
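The field priority can be sketched in Python as below. The dictionary stands in for one sample's parsed FORMAT values in a VCF record; treating `None` as "field absent" is an assumption of this sketch, and the function name is hypothetical.

```python
FIELD_PRIORITY = ("PL", "GL", "GT")  # order in which the fields are consulted


def pick_genotype_field(sample_fields):
    """Return (name, value) for the first available field, or None if none exist."""
    for name in FIELD_PRIORITY:
        value = sample_fields.get(name)
        if value is not None:  # assumption: None marks a missing field
            return name, value
    return None


# GL is used here because PL is absent, even though GT is also present.
chosen = pick_genotype_field({"GT": "0/1", "GL": [-0.1, -1.0, -5.0]})
```

Preferring PL or GL over GT keeps the genotype uncertainty in the comparison, whereas GT alone only gives the hard call.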