Class FindAssemblyRegionsSpark

java.lang.Object
org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark

public class FindAssemblyRegionsSpark extends Object
Find assembly regions from reads in a distributed Spark setting. Two algorithms are available: fast, which looks for assembly regions in each read shard in parallel, and strict, which looks for assembly regions in each contig in parallel. Fast mode may produce artifacts at read shard boundaries in the assembly regions it finds, compared to the walker version. Strict mode should be identical to the walker version, at the cost of increased runtime compared to the fast version.
  • Constructor Details

    • FindAssemblyRegionsSpark

      public FindAssemblyRegionsSpark()
  • Method Details

    • getAssemblyRegionsFast

      public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsFast(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle, boolean trackPileups)
      Get an RDD of assembly regions for the given reads and intervals using the fast algorithm (looks for assembly regions in each read shard in parallel).
      Parameters:
      ctx - the Spark context
      reads - the coordinate-sorted reads
      header - the header for the reads
      sequenceDictionary - the sequence dictionary for the reads
      referenceFileName - the file name for the reference
      features - source of arbitrary features (may be null)
      intervalShards - the sharded intervals to find assembly regions for
      assemblyRegionEvaluatorSupplierBroadcast - broadcast of a supplier of the evaluator used to determine whether a locus is active
      shardingArgs - the arguments for sharding reads
      assemblyRegionArgs - the arguments for finding assembly regions
      shuffle - whether to use a shuffle or not when sharding reads
      trackPileups - whether to track pileups while finding assembly regions
      Returns:
      an RDD of assembly regions
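      A minimal usage sketch follows. It assumes the enclosing Spark tool has already prepared the reads RDD, shard boundaries, broadcast evaluator supplier, and argument collections (the helper method and variable names below are hypothetical, and GATK engine imports are elided); only the getAssemblyRegionsFast call itself comes from this class.

      import java.util.List;
      import java.util.function.Supplier;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.broadcast.Broadcast;
      import htsjdk.samtools.SAMFileHeader;
      import htsjdk.samtools.SAMSequenceDictionary;
      // GATK engine types (GATKRead, ShardBoundary, FeatureManager, AssemblyRegionEvaluator,
      // AssemblyRegionWalkerContext and the argument collections) are assumed to be imported as well.

      // Hypothetical helper inside a Spark tool driver.
      static JavaRDD<AssemblyRegionWalkerContext> findRegionsFast(
              final JavaSparkContext ctx,
              final JavaRDD<GATKRead> reads,                   // must be coordinate-sorted
              final SAMFileHeader header,
              final SAMSequenceDictionary sequenceDictionary,
              final String referenceFileName,
              final FeatureManager features,                   // may be null
              final List<ShardBoundary> intervalShards,
              final Broadcast<Supplier<AssemblyRegionEvaluator>> evaluatorSupplierBroadcast,
              final AssemblyRegionReadShardArgumentCollection shardingArgs,
              final AssemblyRegionArgumentCollection assemblyRegionArgs) {
          // Fast mode finds assembly regions per read shard in parallel, so results near
          // shard boundaries may differ slightly from the single-machine walker.
          return FindAssemblyRegionsSpark.getAssemblyRegionsFast(
                  ctx, reads, header, sequenceDictionary, referenceFileName, features,
                  intervalShards, evaluatorSupplierBroadcast, shardingArgs, assemblyRegionArgs,
                  false /* shuffle */, false /* trackPileups */);
      }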
    • getAssemblyRegionsStrict

      public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsStrict(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle)
      Get an RDD of assembly regions for the given reads and intervals using the strict algorithm (looks for assembly regions in each contig in parallel).
      Parameters:
      ctx - the Spark context
      reads - the coordinate-sorted reads
      header - the header for the reads
      sequenceDictionary - the sequence dictionary for the reads
      referenceFileName - the file name for the reference
      features - source of arbitrary features (may be null)
      intervalShards - the sharded intervals to find assembly regions for
      assemblyRegionEvaluatorSupplierBroadcast - broadcast of a supplier of the evaluator used to determine whether a locus is active
      shardingArgs - the arguments for sharding reads
      assemblyRegionArgs - the arguments for finding assembly regions
      shuffle - whether to use a shuffle or not when sharding reads
      Returns:
      an RDD of assembly regions
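      By way of contrast, a sketch of the strict path, assuming the same prepared inputs as in the fast example above (variable names are illustrative); note that there is no trackPileups parameter here, and the final count is purely an illustrative downstream step.

      // Strict mode finds assembly regions per contig in parallel, so the results should match
      // the single-machine walker exactly, at the cost of a longer runtime than fast mode.
      final JavaRDD<AssemblyRegionWalkerContext> contexts =
              FindAssemblyRegionsSpark.getAssemblyRegionsStrict(
                      ctx, reads, header, sequenceDictionary, referenceFileName, features,
                      intervalShards, evaluatorSupplierBroadcast, shardingArgs, assemblyRegionArgs,
                      false /* shuffle */);

      // Illustrative downstream step: materialize the RDD and report how many regions were found.
      final long regionCount = contexts.count();
      System.out.println("Found " + regionCount + " assembly regions");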