Converts an RDD of ADAM read records into SAM records.
Returns a SAM/BAM formatted RDD of reads, as well as the file header.
Cuts reads into _k_-mers, and then counts the number of occurrences of each _k_-mer.
The value of _k_ to use for cutting _k_-mers.
Returns an RDD containing k-mer/count pairs.
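As a rough illustration of the cut-and-count step described above, here is a minimal sketch on plain Scala collections rather than an RDD. The object name `KmerCountSketch` and the function `countKmers` are hypothetical and are not part of the library; reads are modelled as bare sequence strings.

```scala
object KmerCountSketch {
  // Cuts each read sequence into overlapping k-mers and counts how often
  // each k-mer occurs across all reads. Hypothetical helper, not library API.
  def countKmers(reads: Seq[String], k: Int): Map[String, Long] = {
    require(k > 0, "k must be positive")
    reads
      .filter(_.length >= k)   // a read shorter than k yields no k-mers
      .flatMap(_.sliding(k))   // every overlapping window of length k
      .groupBy(identity)
      .map { case (kmer, hits) => kmer -> hits.size.toLong }
  }
}
```

On an RDD the same shape would typically be a `flatMap` over reads followed by a keyed reduction, but that wiring is an assumption and is not shown here.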
Runs a quality control pass akin to the Samtools FlagStat tool.
Returns a tuple of (failedQualityMetrics, passedQualityMetrics)
Returns all reference regions that overlap this read.
If a read is unaligned, it covers no reference region. If a read is aligned, we expect it to cover a single region. A chimeric read would cover multiple regions, but we store chimeric reads in a way similar to BAM, where the split alignments are stored in multiple separate reads.
Read to produce regions for.
The seq of reference regions this read covers.
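The unaligned and aligned cases above can be sketched as follows. The `Read` and `Region` case classes are simplified, hypothetical stand-ins for the library's own types, and an unaligned read is assumed to be one with no contig.

```scala
object RegionSketch {
  // Hypothetical, simplified stand-ins for the library's region and read types.
  final case class Region(contig: String, start: Long, end: Long)
  final case class Read(contig: Option[String], start: Long, end: Long)

  // An unaligned read (no contig) covers no region; an aligned read covers
  // exactly one. Chimeric alignments arrive as separate reads, so each of
  // them produces its own single region.
  def regionsFor(read: Read): Seq[Region] =
    read.contig match {
      case Some(name) => Seq(Region(name, read.start, read.end))
      case None       => Seq.empty
    }
}
```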
Groups all reads by record group and read name.
Returns SingleReadBuckets containing the primary, secondary, and unmapped reads.
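A minimal sketch of this grouping on plain collections is shown below. `Read` and `Bucket` are hypothetical simplifications; the real SingleReadBucket type may carry different fields.

```scala
object GroupSketch {
  // Hypothetical minimal read and bucket types for illustration only.
  final case class Read(recordGroup: String, name: String, mapped: Boolean, primary: Boolean)
  final case class Bucket(primaryMapped: Seq[Read], secondaryMapped: Seq[Read], unmapped: Seq[Read])

  // Groups reads by (record group, read name), then splits each group into
  // primary mapped, secondary mapped, and unmapped reads.
  def bucket(reads: Seq[Read]): Map[(String, String), Bucket] =
    reads.groupBy(r => (r.recordGroup, r.name)).map { case (key, group) =>
      val (mapped, unmapped) = group.partition(_.mapped)
      val (primary, secondary) = mapped.partition(_.primary)
      key -> Bucket(primary, secondary, unmapped)
    }
}
```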
Marks reads as possible fragment duplicates.
A new RDD where reads have the duplicate read flag set. Duplicate reads are NOT filtered out.
Realigns indels using a consensus-based heuristic.
The model to use for generating consensus sequences to realign against.
If the input data is sorted, setting this parameter to true avoids a second sort.
The size of the largest indel to use for realignment.
The maximum number of consensus sequences to realign against per target region.
Log-odds threshold to use when realigning; realignments are only finalized if the log-odds threshold is exceeded.
The maximum width of a single target region for realignment.
Returns an RDD of mapped reads which have been realigned.
Reassembles read pairs from two sets of unpaired reads.
Reassembles read pairs from two sets of unpaired reads. The assumption is that the two sets were _originally_ paired together.
The rdd containing the second read from the pairs.
How stringently to validate the reads.
Returns an RDD with the pair information recomputed.
The RDD that this is called on should be the RDD with the first read from the pair.
Runs base quality score recalibration on a set of reads.
Runs base quality score recalibration on a set of reads. Uses a table of known SNPs to mask true variation during the recalibration process.
A table of known SNPs to mask valid variants.
An optional local path to dump recalibration observations to.
Returns an RDD of recalibrated reads.
Replaces the underlying RDD and SequenceDictionary and emits a new object.
New RDD to replace current RDD.
New sequence dictionary to replace current dictionary.
Returns a new AlignmentRecordRDD.
Saves this RDD to disk, with the type identified by the extension.
Path to save the file at.
Whether the file is sorted or not.
Returns true if saving succeeded.
Saves AlignmentRecords as a directory of Parquet files or as SAM/BAM.
This method infers the output format from the file extension. Filenames ending in .sam/.bam are saved as SAM/BAM, and all other files are saved as Parquet.
Save configuration arguments.
If the output is sorted, this will modify the SAM/BAM header.
Returns true if saving succeeded.
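The extension rule described above (.sam/.bam saved as SAM/BAM, everything else as Parquet) can be sketched as a small function. The `Format` type and `inferFormat` name are hypothetical and only illustrate the dispatch, not the library's internals.

```scala
object SaveFormatSketch {
  // Hypothetical output formats, for illustration only.
  sealed trait Format
  case object Sam extends Format
  case object Bam extends Format
  case object Parquet extends Format

  // Filenames ending in .sam or .bam select SAM or BAM; any other
  // path falls through to Parquet.
  def inferFormat(path: String): Format =
    if (path.endsWith(".sam")) Sam
    else if (path.endsWith(".bam")) Bam
    else Parquet
}
```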
Saves reads in FASTQ format.
Path to save files at.
Optional second path for saving files. If set, two files will be saved.
If true, writes out reads with the base qualities from the original qualities (SAM "OQ") field. If false, writes out reads with the base qualities from the qual field. Default is false.
Whether to sort the FASTQ files by read name or not. Defaults to false. Sorting the output will recover pair order, if desired.
Iff strict, throw an exception if any read in this RDD is not accompanied by its mate.
An optional persistence level to set. If this level is set, then reads will be cached (at the given persistence level) between passes.
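To make the choice between the original qualities (SAM "OQ") field and the qual field concrete, here is a minimal sketch that formats one read as a four-line FASTQ record. The `Read` type and the `outputOriginalBaseQualities` parameter name are hypothetical, and falling back to the qual field when a read has no original qualities is an assumption, not documented behavior.

```scala
object FastqSketch {
  // Hypothetical minimal read: origQual models the optional SAM "OQ" field.
  final case class Read(name: String, sequence: String, qual: String, origQual: Option[String])

  // Formats one read as a FASTQ record: header, bases, separator, qualities.
  // Assumption: if original qualities are requested but absent, use qual.
  def toFastq(read: Read, outputOriginalBaseQualities: Boolean): String = {
    val quals =
      if (outputOriginalBaseQualities) read.origQual.getOrElse(read.qual)
      else read.qual
    s"@${read.name}\n${read.sequence}\n+\n$quals"
  }
}
```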
Saves these AlignmentRecords to two FASTQ files.
The files are one for the first mate in each pair, and the other for the second mate in the pair.
Path at which to save a FASTQ file containing the first mate of each pair.
Path at which to save a FASTQ file containing the second mate of each pair.
If true, writes out reads with the base qualities from the original qualities (SAM "OQ") field. If false, writes out reads with the base qualities from the qual field. Default is false.
Iff strict, throw an exception if any read in this RDD is not accompanied by its mate.
An optional persistence level to set. If this level is set, then reads will be cached (at the given persistence level) between passes.
Saves this RDD to disk as a Parquet file.
Path to save the file at.
Saves this RDD to disk as a Parquet file.
Path to save the file at.
Size per block.
Size per page.
Name of the compression codec to use.
Whether or not to disable bit-packing.
Saves this RDD to disk as a Parquet file.
Path to save the file at.
Size per block.
Size per page.
Name of the compression codec to use.
Whether or not to disable bit-packing. Default is false.
Saves RDD as a directory of Parquet files.
The RDD is written as a directory of Parquet files, with Parquet configuration described by the input param args. The provided sequence dictionary is written at args.outputPath/_seqdict.avro as Avro binary.
Save configuration arguments.
Saves this RDD to disk as a SAM/BAM file.
Path to save the file at.
If true, saves as SAM. If false, saves as BAM.
If true, saves output as a single file.
If the output is sorted, this will modify the header.
Saves an RDD of ADAM read data into the SAM/BAM format.
Path to save files to.
Selects whether to save as SAM or BAM. The default value is true (save in SAM format).
If true, saves output as a single file.
If the output is sorted, this will modify the header.
Converts an RDD into the SAM spec string it represents.
This method converts an RDD of AlignmentRecords back to an RDD of SAMRecordWritables and a SAMFileHeader, and then maps this RDD into a string on the driver that represents this file in SAM.
A string on the driver representing this RDD of reads in SAM format.
Saves Avro data to a Hadoop file system.
This method uses a SparkContext to identify our underlying file system, which we then save to.
Frustratingly enough, although all records generated by the Avro IDL compiler have a static SCHEMA$ field, this field does not belong to the SpecificRecordBase abstract class, or the SpecificRecord interface. As such, we must force the user to pass in the schema.
The type of the specific record we are saving.
Path to save records to.
SparkContext used for identifying underlying file system.
Schema of records we are saving.
Seq of records we are saving.
Called in saveAsParquet after saving RDD to Parquet to save metadata.
Writes any necessary metadata to disk. If not overridden, writes the sequence dictionary to disk as Avro.
Sorts our read data by reference positions, with contigs ordered by name.
Sorts reads by the location where they are aligned. Unaligned reads are put at the end and sorted by read name. Contigs are ordered lexicographically.
Returns a new RDD containing sorted reads.
See also: sortReadsByReferencePositionAndIndex
Sorts our read data by reference positions, with contigs ordered by index.
Sorts reads by the location where they are aligned. Unaligned reads are put at the end and sorted by read name. Contigs are ordered by their index in the SequenceDictionary.
Returns a new RDD containing sorted reads.
See also: sortReadsByReferencePosition
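The difference between the two sort orders (contigs ordered by name versus by index in the SequenceDictionary) can be sketched on plain collections. The `Read` type, the `byName` and `byIndex` functions, and the use of a `Seq[String]` of contig names as the dictionary are all hypothetical simplifications. Placing contigs that are missing from the dictionary last is an assumption.

```scala
object SortSketch {
  // Hypothetical minimal read: contig is None for unaligned reads.
  final case class Read(name: String, contig: Option[String], start: Long)

  // Aligned reads come first, ordered by contig key and then start position;
  // unaligned reads go at the end, ordered by read name.
  private def sortReads[K: Ordering](reads: Seq[Read])(contigKey: String => K): Seq[Read] = {
    val (aligned, unaligned) = reads.partition(_.contig.isDefined)
    // contig.get is safe here because aligned holds only reads with a contig
    aligned.sortBy(r => (contigKey(r.contig.get), r.start)) ++ unaligned.sortBy(_.name)
  }

  // Contigs ordered lexicographically by name.
  def byName(reads: Seq[Read]): Seq[Read] =
    sortReads[String](reads)(name => name)

  // Contigs ordered by their position in the dictionary.
  // Assumption: contigs absent from the dictionary sort last.
  def byIndex(reads: Seq[Read], dictionary: Seq[String]): Seq[Read] = {
    val index = dictionary.zipWithIndex.toMap
    sortReads[Int](reads)(name => index.getOrElse(name, Int.MaxValue))
  }
}
```

With contigs named chr2 and chr10, the name-based order places chr10 first because the comparison is lexicographic, while the index-based order follows whatever order the dictionary declares.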
Converts this set of reads into a corresponding CoverageRDD.
Determines whether to merge adjacent coverage elements with the same score into a single coverage element.
CoverageRDD containing mapped RDD of Coverage.
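One way to picture the merge of adjacent coverage elements with the same score is the sketch below. It works on a position-sorted sequence of a hypothetical `Coverage` case class with half-open intervals; the library's own type and merge logic may differ.

```scala
object CoverageSketch {
  // Hypothetical coverage element covering [start, end) on a contig.
  final case class Coverage(contig: String, start: Long, end: Long, count: Double)

  // Merges neighbouring elements that sit on the same contig, touch end to
  // start, and share the same count. Input must be sorted by position.
  def collapse(sorted: Seq[Coverage]): Seq[Coverage] = {
    val merged = sorted.foldLeft(List.empty[Coverage]) {
      case (prev :: rest, cur)
          if prev.contig == cur.contig && prev.end == cur.start && prev.count == cur.count =>
        prev.copy(end = cur.end) :: rest   // extend the previous element
      case (acc, cur) =>
        cur :: acc                         // start a new element
    }
    merged.reverse
  }
}
```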
Converts this set of reads into fragments.
Returns a FragmentRDD where all reads have been grouped together by the original sequence fragment they come from.