Loads alignments from a given path, and infers the input type.
This method can load:
* AlignmentRecords via Parquet (default)
* SAM/BAM/CRAM (.sam, .bam, .cram)
* FASTQ (interleaved, single end, paired end) (.ifq, .fq/.fastq)
* FASTA (.fa, .fasta)
* NucleotideContigFragments via Parquet (.contig.adam)
As hinted above, the input type is inferred from the file path extension.
Path to load data from.
The fields to project; ignored if not Parquet.
The path to load a second end of FASTQ data from. Ignored if not FASTQ.
Optional record group name to set if loading FASTQ.
Validation stringency used on FASTQ import/merging.
Returns an AlignmentRecordRDD which wraps the RDD of reads, the sequence dictionary representing the contigs these reads are aligned to (if the reads are aligned), and the record group dictionary for the reads (if one is available).
loadFasta
loadFastq
loadInterleavedFastq
loadParquetAlignments
loadBam
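The extension-based inference described for loadAlignments can be pictured with a small dispatch function. This is only a sketch of the idea in Python, not ADAM's implementation: the function name `infer_alignment_format` and the returned labels are hypothetical, and case-sensitive matching is an assumption.

```python
def infer_alignment_format(path: str) -> str:
    """Guess the input format from the path extension (illustrative only)."""
    # Assumption: matching is case-sensitive and looks only at the path suffix.
    if path.endswith((".sam", ".bam", ".cram")):
        return "SAM/BAM/CRAM"
    if path.endswith(".ifq"):
        return "interleaved FASTQ"
    if path.endswith((".fq", ".fastq")):
        return "FASTQ"
    if path.endswith((".fa", ".fasta")):
        return "FASTA"
    if path.endswith(".contig.adam"):
        return "NucleotideContigFragments via Parquet"
    # Anything unrecognized falls through to the documented default.
    return "AlignmentRecords via Parquet"


if __name__ == "__main__":
    for p in ["sample.bam", "sample.ifq", "ref.fa", "reads.adam"]:
        print(p, "->", infer_alignment_format(p))
```

Checking the more specific `.contig.adam` suffix before the fallback keeps the Parquet default as the last resort, which mirrors the "(default)" note in the list above.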
Loads a SAM/BAM file.
This reads the sequence and record group dictionaries from the SAM/BAM file header. SAMRecords are read from the file and converted to the AlignmentRecord schema.
Path to the file on disk.
Returns an AlignmentRecordRDD which wraps the RDD of reads, the sequence dictionary representing the contigs these reads are aligned to (if the reads are aligned), and the record group dictionary for the reads (if one is available).
loadAlignments
Loads features stored in BED6/12 format.
The path to the file to load.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism.
Optional stringency to pass. LENIENT stringency will warn when a malformed line is encountered, SILENT will ignore the malformed line, STRICT will throw an exception.
Returns a FeatureRDD.
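The three stringency levels described above can be illustrated with a toy line parser. This is a minimal sketch, not ADAM's parser: `parse_bed_lines` is a hypothetical name, the definition of "malformed" (fewer than three tab-separated fields, or non-numeric coordinates) is an assumption, and so is the choice to skip the line after warning under LENIENT.

```python
import warnings


def parse_bed_lines(lines, stringency="STRICT"):
    """Parse (contig, start, end) triples, handling bad lines per stringency."""
    if stringency not in ("STRICT", "LENIENT", "SILENT"):
        raise ValueError(f"Unknown stringency: {stringency}")
    parsed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
            parsed.append((fields[0], int(fields[1]), int(fields[2])))
        elif stringency == "STRICT":
            # STRICT: fail on the first malformed line.
            raise ValueError(f"Malformed line: {line!r}")
        elif stringency == "LENIENT":
            # LENIENT: warn, then (by assumption) skip the line.
            warnings.warn(f"Skipping malformed line: {line!r}")
        # SILENT: drop the malformed line without any message.
    return parsed


if __name__ == "__main__":
    print(parse_bed_lines(["chr1\t10\t20", "garbage"], "SILENT"))
```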
Loads file of Features to a CoverageRDD. Coverage is stored in the score attribute of Feature.
File path to load coverage from.
Returns a CoverageRDD containing an RDD of Coverage.
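Since coverage is carried in the score attribute of each Feature, the conversion amounts to copying the region and reading the score as a count. The sketch below uses hypothetical `Feature` and `Coverage` tuples whose field names are assumptions, not the ADAM schema.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    # Hypothetical, simplified stand-in for a Feature record.
    contig_name: str
    start: int
    end: int
    score: Optional[float]


class Coverage(NamedTuple):
    # Hypothetical, simplified stand-in for a Coverage record.
    contig_name: str
    start: int
    end: int
    count: float


def to_coverage(feature: Feature) -> Coverage:
    """Read the coverage value out of the feature's score attribute."""
    if feature.score is None:
        raise ValueError("Feature has no score, so it carries no coverage value")
    return Coverage(feature.contig_name, feature.start, feature.end, float(feature.score))


if __name__ == "__main__":
    print(to_coverage(Feature("chr1", 0, 100, 12.0)))
```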
Loads a FASTA file.
The path to load from.
The length to split contigs into. This sets the parallelism achievable.
Returns a NucleotideContigFragmentRDD containing the contigs.
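To see why the fragment length bounds the achievable parallelism, consider splitting one contig into fixed-length pieces: each piece can be processed independently, so shorter fragments yield more units of work. The function below is a rough Python illustration with a hypothetical name and tuple layout, not the loader's actual logic.

```python
def split_contig(name, sequence, fragment_length):
    """Split a contig into (name, start offset, bases) fragments."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return [
        (name, start, sequence[start:start + fragment_length])
        for start in range(0, len(sequence), fragment_length)
    ]


if __name__ == "__main__":
    # A 10-base contig split at length 4 gives three fragments; the last is shorter.
    for fragment in split_contig("chr1", "ACGTACGTAC", 4):
        print(fragment)
```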
Loads (possibly paired) FASTQ data.
The path to the first set of reads.
The path to the second set of reads, if provided.
The optional record group name to associate with the reads.
The validation stringency to use when validating the reads.
Returns the reads as an unaligned AlignmentRecordRDD.
loadUnpairedFastq
loadPairedFastq
Loads Features from a file, autodetecting the file type.
Loads files ending in .bed as BED6/12, .gff3 as GFF3, .gtf/.gff as GTF/GFF2, .narrow[pP]eak as NarrowPeak, and .interval_list as IntervalList. If none of these match, we fall back to Parquet.
The path to the file to load.
An optional projection to push down.
An optional minimum number of partitions to use. For textual formats, if this is None, we fall back to the Spark default parallelism.
Returns a FeatureRDD.
loadParquetFeatures
loadIntervalList
loadNarrowPeak
loadGff3
loadGtf
loadBed
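The autodetection rules for loadFeatures, including the `.narrow[pP]eak` pattern, can be expressed as an ordered table of regular expressions with Parquet as the fallback. This is a sketch under assumptions: the name `detect_feature_format` and the returned labels are hypothetical, and the real method may match extensions differently.

```python
import re

# Ordered (pattern, format) pairs taken from the description above.
_FEATURE_FORMATS = [
    (re.compile(r"\.bed$"), "BED6/12"),
    (re.compile(r"\.gff3$"), "GFF3"),
    (re.compile(r"\.(gtf|gff)$"), "GTF/GFF2"),
    (re.compile(r"\.narrow[pP]eak$"), "NarrowPeak"),
    (re.compile(r"\.interval_list$"), "IntervalList"),
]


def detect_feature_format(path: str) -> str:
    """Return the first matching format label, or Parquet if none match."""
    for pattern, label in _FEATURE_FORMATS:
        if pattern.search(path):
            return label
    return "Parquet"


if __name__ == "__main__":
    for p in ["peaks.narrowPeak", "genes.gff", "genes.gff3", "features.adam"]:
        print(p, "->", detect_feature_format(p))
```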
Auto-detects the file type and loads a FragmentRDD.
This method can load:
* Fragments via Parquet (default)
* SAM/BAM/CRAM (.sam, .bam, .cram)
* FASTQ (interleaved only, .ifq)
* AlignmentRecords via Parquet (autodetected from the .reads.adam extension)
Path to load data from.
Returns the loaded data as a FragmentRDD.
Auto-detects the file type and loads a GenotypeRDD.
If the file has a .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz extension, loads as VCF. Else, falls back to Parquet.
The path to load.
An optional subset of fields to load.
Returns a GenotypeRDD.
loadParquetGenotypes
loadVcf
Loads features stored in GFF3 format.
The path to the file to load.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism.
Optional stringency to pass. LENIENT stringency will warn when a malformed line is encountered, SILENT will ignore the malformed line, STRICT will throw an exception.
Returns a FeatureRDD.
Loads features stored in GFF2/GTF format.
The path to the file to load.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism.
Optional stringency to pass. LENIENT stringency will warn when a malformed line is encountered, SILENT will ignore the malformed line, STRICT will throw an exception.
Returns a FeatureRDD.
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within a specified ReferenceRegion. A BAM index file is required.
The path to the input data. Currently this path must correspond to a single BAM file. The associated BAM index file needs to have the same name.
The ReferenceRegion we are filtering on.
Functions like loadBam, but uses BAM index files to look at fewer blocks, and only returns records within the specified ReferenceRegions. A BAM index file is required.
The path to the input data. Currently this path must correspond to a single BAM file. The associated BAM index file needs to have the same name.
Iterable of ReferenceRegions we are filtering on.
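The end result of the indexed loaders is the set of records overlapping at least one query region. The sketch below shows only that overlap filter in plain Python. It does not model the BAM index, which in the real loader serves to skip file blocks that cannot contain matches. The `Region` tuple is hypothetical, and the 0-based, end-exclusive coordinate convention is an assumption.

```python
from typing import Iterable, List, NamedTuple


class Region(NamedTuple):
    # Hypothetical stand-in for a genomic region; end is exclusive (assumed).
    name: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions are on the same contig and share at least one base."""
    return a.name == b.name and a.start < b.end and b.start < a.end


def filter_by_regions(records: Iterable[Region], regions: Iterable[Region]) -> List[Region]:
    """Keep records that overlap any of the query regions."""
    queries = list(regions)
    return [r for r in records if any(overlaps(r, q) for q in queries)]


if __name__ == "__main__":
    reads = [Region("chr1", 0, 100), Region("chr1", 150, 200), Region("chr2", 0, 50)]
    print(filter_by_regions(reads, [Region("chr1", 90, 160)]))
```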
Loads a VCF file indexed by a tabix (tbi) file into an RDD.
The file to load.
Iterator of ReferenceRegions we are filtering on.
The validation stringency to use when validating the VCF.
Returns a VariantContextRDD.
Loads a VCF file indexed by a tabix (tbi) file into an RDD.
The file to load.
ReferenceRegions we are filtering on.
Returns a VariantContextRDD.
Loads reads from interleaved FASTQ.
In interleaved FASTQ, the two reads from a paired sequencing protocol are interleaved in a single file. This is a zipped representation of the typical paired FASTQ.
Path to load.
Returns the file as an unaligned AlignmentRecordRDD.
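The "zipped" relationship between paired FASTQ and interleaved FASTQ can be made concrete with a few lines of Python. This is an illustrative sketch only: it assumes well-formed four-line FASTQ records and mates that appear in the same order in both inputs, and the function names are hypothetical.

```python
def fastq_records(lines):
    """Group FASTQ lines into 4-line records (name, bases, separator, qualities)."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    return [stripped[i:i + 4] for i in range(0, len(stripped), 4)]


def interleave_fastq(first_lines, second_lines):
    """Zip two paired FASTQ inputs into one interleaved list of lines."""
    first = fastq_records(first_lines)
    second = fastq_records(second_lines)
    if len(first) != len(second):
        raise ValueError("Paired inputs must contain the same number of records")
    interleaved = []
    for read1, read2 in zip(first, second):
        interleaved.extend(read1)  # first-of-pair record
        interleaved.extend(read2)  # its mate follows immediately
    return interleaved


if __name__ == "__main__":
    r1 = ["@r1/1", "ACGT", "+", "IIII"]
    r2 = ["@r1/2", "TTGA", "+", "IIII"]
    print(interleave_fastq(r1, r2))
```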
Loads interleaved FASTQ data as Fragments.
Fragments represent all of the reads from a single sequenced fragment as a single object, which is a useful representation for some tasks.
The path to load.
Returns a FragmentRDD containing the paired reads grouped by sequencing fragment.
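Conceptually, building fragments means collecting every read that came from the same sequenced fragment under one key. The sketch below groups (name, sequence) pairs by fragment name. The `/1` and `/2` suffix convention and the function name are assumptions made for illustration, and this is not how ADAM itself builds Fragments.

```python
def group_by_fragment(reads):
    """Group (read name, sequence) pairs by fragment name, keeping input order."""
    fragments = {}
    for name, sequence in reads:
        # Assumption: mates are distinguished by a trailing "/1" or "/2".
        if name.endswith(("/1", "/2")):
            fragment_name = name[:-2]
        else:
            fragment_name = name
        fragments.setdefault(fragment_name, []).append(sequence)
    return fragments


if __name__ == "__main__":
    reads = [("f1/1", "AC"), ("f1/2", "GT"), ("f2/1", "TT"), ("f2/2", "AA")]
    print(group_by_fragment(reads))
```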
Loads features stored in IntervalList format.
The path to the file to load.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism.
Optional stringency to pass. LENIENT stringency will warn when a malformed line is encountered, SILENT will ignore the malformed line, STRICT will throw an exception.
Returns a FeatureRDD.
Loads features stored in NarrowPeak format.
The path to the file to load.
An optional minimum number of partitions to load. If not set, falls back to the configured Spark default parallelism.
Optional stringency to pass. LENIENT stringency will warn when a malformed line is encountered, SILENT will ignore the malformed line, STRICT will throw an exception.
Returns a FeatureRDD.
Loads paired FASTQ data from two files.
The path to the first set of reads.
The path to the second set of reads.
The optional record group name to associate with the reads.
The validation stringency to use when validating the reads.
Returns the reads as an unaligned AlignmentRecordRDD.
loadFastq
Creates a new RDD of records of the specified type from the input data.
The type of records to return.
The path to the input data.
An optional pushdown predicate to use when reading the data.
An optional projection schema to use when reading the data.
Returns an RDD with records of the specified type.
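The effect of a predicate and a projection can be shown on in-memory records: the predicate decides which rows are kept, and the projection decides which fields of each kept row are materialized. This is a conceptual Python sketch with a hypothetical `load_records` helper. In a real columnar reader both are pushed down to the storage layer so that unneeded rows and columns are never read.

```python
def load_records(records, predicate=None, projection=None):
    """Apply an optional row filter, then an optional field selection."""
    loaded = []
    for record in records:
        if predicate is not None and not predicate(record):
            continue  # row rejected by the predicate
        if projection is not None:
            # Keep only the requested fields that exist in this record.
            record = {field: record[field] for field in projection if field in record}
        loaded.append(record)
    return loaded


if __name__ == "__main__":
    rows = [
        {"contigName": "chr1", "start": 10, "mapq": 60},
        {"contigName": "chr2", "start": 20, "mapq": 5},
    ]
    print(load_records(rows, lambda r: r["mapq"] >= 30, ["contigName", "start"]))
```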
Loads alignment data from a Parquet file.
The path of the file to load.
An optional predicate to push down into the file.
An optional schema designating the fields to project.
Returns an AlignmentRecordRDD which wraps the RDD of reads, the sequence dictionary representing the contigs these reads are aligned to (if the reads are aligned), and the record group dictionary for the reads (if one is available).
The sequence dictionary is read from an Avro file stored at filePath/_seqdict.avro and the record group dictionary is read from an Avro file stored at filePath/_rgdict.avro. These files are pure Avro, not Parquet.
loadAlignments
Loads NucleotideContigFragments stored in Parquet, with metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a NucleotideContigFragmentRDD.
Loads Parquet file of Features to a CoverageRDD. Coverage is stored in the score attribute of Feature.
File path to load coverage from.
An optional predicate to push down into the file.
Returns a CoverageRDD containing an RDD of Coverage.
Loads Features stored in Parquet, with accompanying metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a FeatureRDD.
Loads Fragments stored in Parquet, with accompanying metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a FragmentRDD.
Loads Genotypes stored in Parquet with accompanying metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a GenotypeRDD.
Loads VariantAnnotations stored in Parquet, with metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a VariantAnnotationRDD.
Loads Variants stored in Parquet with accompanying metadata.
The path to load files from.
An optional predicate to push down into the file.
An optional projection to use for reading.
Returns a VariantRDD.
Auto-detects the file type and loads a broadcastable ReferenceFile.
If the file type is 2bit, loads a 2bit file. Else, uses loadSequences to load the reference as an RDD, which is then collected to the driver.
The path to load.
The length of fragment to use for splitting.
Returns a broadcastable ReferenceFile.
loadSequences
Auto-detects the file type and loads contigs as a NucleotideContigFragmentRDD.
Loads files ending in .fa/.fasta/.fa.gz/.fasta.gz as FASTA; otherwise, falls back to Parquet.
The path to load.
An optional subset of fields to load.
The length of fragment to use for splitting.
Returns a NucleotideContigFragmentRDD.
loadReferenceFile
loadParquetContigFragments
loadFasta
Loads unpaired FASTQ data from a single file.
The path to the reads.
The optional record group name to associate with the reads.
If true, sets the read as first from the fragment.
If true, sets the read as second from the fragment.
The validation stringency to use when validating the reads.
Returns the reads as an unaligned AlignmentRecordRDD.
loadFastq
Loads VariantAnnotations into an RDD, and automatically detects the underlying storage format.
Can load variant annotations from either Parquet or VCF.
The path to load files from.
An optional projection to use for reading.
Returns a VariantAnnotationRDD.
loadParquetVariantAnnotations
loadVcfAnnotations
Auto-detects the file type and loads a VariantRDD.
If the file has a .vcf/.vcf.gz/.vcf.bgzf/.vcf.bgz extension, loads as VCF. Else, falls back to Parquet.
The path to load.
An optional subset of fields to load.
Returns a VariantRDD.
loadParquetVariants
loadVcf
Loads a VCF file into an RDD.
The file to load.
The validation stringency to use when validating the VCF.
Returns a VariantContextRDD.
loadVcfAnnotations
Loads variant annotations stored in VCF format.
The path to the VCF file(s) to load annotations from.
Returns a VariantAnnotationRDD.
The SparkContext to wrap.
The ADAMContext provides functions on top of a SparkContext for loading genomic data.