Class FindAssemblyRegionsSpark
java.lang.Object
org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark
Finds assembly regions from reads in a distributed Spark setting. Two algorithms are available: fast,
which looks for assembly regions in each read shard in parallel, and strict, which looks for assembly regions
in each contig in parallel. Fast mode may produce artifacts at read shard boundaries that the walker version
does not. Strict mode should produce results identical to the walker version, at the cost of a longer runtime
than fast mode.
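The difference between the two modes can be illustrated with a toy model. The sketch below is not GATK code: the class `ShardBoundaryDemo`, its `Region` type, and the methods `findRegions`, `wholeContig`, and `perShard` are hypothetical names introduced only for this illustration, and "activity" is reduced to a plain boolean per locus. It shows one way a boundary artifact can arise in principle when each shard is scanned without knowledge of its neighbours; the real implementation is considerably more involved and is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of region finding; none of these names come from GATK. */
final class ShardBoundaryDemo {

    /** Half-open span [start, end) of consecutive active loci. */
    static final class Region {
        final int start;
        final int end;

        Region(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Collects maximal runs of active loci, looking only inside [from, to). */
    static List<Region> findRegions(boolean[] active, int from, int to) {
        List<Region> regions = new ArrayList<>();
        int runStart = -1; // -1 means "not currently inside a run"
        for (int i = from; i < to; i++) {
            if (active[i] && runStart < 0) {
                runStart = i;
            } else if (!active[i] && runStart >= 0) {
                regions.add(new Region(runStart, i));
                runStart = -1;
            }
        }
        if (runStart >= 0) {
            // The run reaches the end of the scanned window, so it is cut there.
            regions.add(new Region(runStart, to));
        }
        return regions;
    }

    /** Strict-style scan: a single pass over the whole contig. */
    static List<Region> wholeContig(boolean[] active) {
        return findRegions(active, 0, active.length);
    }

    /** Fast-style scan: every fixed-size shard is scanned independently. */
    static List<Region> perShard(boolean[] active, int shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<Region> regions = new ArrayList<>();
        for (int s = 0; s < active.length; s += shardSize) {
            int end = Math.min(s + shardSize, active.length);
            regions.addAll(findRegions(active, s, end));
        }
        return regions;
    }

    public static void main(String[] args) {
        boolean[] active = new boolean[10];
        for (int i = 3; i <= 6; i++) {
            active[i] = true; // loci 3..6 are active
        }
        System.out.println("whole contig: " + wholeContig(active));
        System.out.println("per shard:    " + perShard(active, 5));
    }
}
```

With ten loci of which 3 through 6 are active and a shard size of 5, the whole-contig scan reports a single region [3,7), while the per-shard scan reports two regions, [3,5) and [5,7), split at the shard boundary. The per-shard scans are independent of each other, which is what makes them easy to run in parallel.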

Constructor Summary
Constructors
FindAssemblyRegionsSpark()

Method Summary

Modifier and Type: static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext>
Method: getAssemblyRegionsFast(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle, boolean trackPileups)
Description: Get an RDD of assembly regions for the given reads and intervals using the fast algorithm (looks for assembly regions in each read shard in parallel).

Modifier and Type: static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext>
Method: getAssemblyRegionsStrict(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle)
Description: Get an RDD of assembly regions for the given reads and intervals using the strict algorithm (looks for assembly regions in each contig in parallel).

Constructor Details
FindAssemblyRegionsSpark
public FindAssemblyRegionsSpark()

Method Details
getAssemblyRegionsFast
public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsFast(
    org.apache.spark.api.java.JavaSparkContext ctx,
    org.apache.spark.api.java.JavaRDD<GATKRead> reads,
    htsjdk.samtools.SAMFileHeader header,
    htsjdk.samtools.SAMSequenceDictionary sequenceDictionary,
    String referenceFileName,
    FeatureManager features,
    List<ShardBoundary> intervalShards,
    org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast,
    AssemblyRegionReadShardArgumentCollection shardingArgs,
    AssemblyRegionArgumentCollection assemblyRegionArgs,
    boolean shuffle,
    boolean trackPileups)
Get an RDD of assembly regions for the given reads and intervals using the fast algorithm (looks for assembly regions in each read shard in parallel).
Parameters:
ctx - the Spark context
reads - the coordinate-sorted reads
header - the header for the reads
sequenceDictionary - the sequence dictionary for the reads
referenceFileName - the file name for the reference
features - source of arbitrary features (may be null)
intervalShards - the sharded intervals to find assembly regions for
assemblyRegionEvaluatorSupplierBroadcast - evaluator used to determine whether a locus is active
shardingArgs - the arguments for sharding reads
assemblyRegionArgs - the arguments for finding assembly regions
shuffle - whether to use a shuffle or not when sharding reads
Returns:
an RDD of assembly regions

getAssemblyRegionsStrict
public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsStrict(
    org.apache.spark.api.java.JavaSparkContext ctx,
    org.apache.spark.api.java.JavaRDD<GATKRead> reads,
    htsjdk.samtools.SAMFileHeader header,
    htsjdk.samtools.SAMSequenceDictionary sequenceDictionary,
    String referenceFileName,
    FeatureManager features,
    List<ShardBoundary> intervalShards,
    org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast,
    AssemblyRegionReadShardArgumentCollection shardingArgs,
    AssemblyRegionArgumentCollection assemblyRegionArgs,
    boolean shuffle)
Get an RDD of assembly regions for the given reads and intervals using the strict algorithm (looks for assembly regions in each contig in parallel).
Parameters:
ctx - the Spark context
reads - the coordinate-sorted reads
header - the header for the reads
sequenceDictionary - the sequence dictionary for the reads
referenceFileName - the file name for the reference
features - source of arbitrary features (may be null)
intervalShards - the sharded intervals to find assembly regions for
assemblyRegionEvaluatorSupplierBroadcast - evaluator used to determine whether a locus is active
shardingArgs - the arguments for sharding reads
assemblyRegionArgs - the arguments for finding assembly regions
shuffle - whether to use a shuffle or not when sharding reads
Returns:
an RDD of assembly regions