Class FindAssemblyRegionsSpark

java.lang.Object
org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark

public class FindAssemblyRegionsSpark extends Object
Find assembly regions from reads in a distributed Spark setting. Two algorithms are available: fast, which looks for assembly regions in each read shard in parallel, and strict, which looks for assembly regions in each contig in parallel. Fast mode may produce artifacts at read shard boundaries in the assembly regions it finds, compared to the walker version. Strict mode should be identical to the walker version, at the cost of increased runtime compared to the fast version.
  • Constructor Details

    • FindAssemblyRegionsSpark

      public FindAssemblyRegionsSpark()
  • Method Details

    • getAssemblyRegionsFast

      public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsFast(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle, boolean trackPileups)
      Get an RDD of assembly regions for the given reads and intervals using the fast algorithm (looks for assembly regions in each read shard in parallel).
      Parameters:
      ctx - the Spark context
      reads - the coordinate-sorted reads
      header - the header for the reads
      sequenceDictionary - the sequence dictionary for the reads
      referenceFileName - the file name for the reference
      features - source of arbitrary features (may be null)
      intervalShards - the sharded intervals to find assembly regions for
      assemblyRegionEvaluatorSupplierBroadcast - broadcast of a supplier of the evaluator used to determine whether a locus is active
      shardingArgs - the arguments for sharding reads
      assemblyRegionArgs - the arguments for finding assembly regions
      shuffle - whether to use a shuffle or not when sharding reads
      trackPileups - whether to track pileups while finding assembly regions
      Returns:
      an RDD of assembly regions
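      A minimal usage sketch follows. It assumes the enclosing Spark tool has already prepared the reads RDD, shard boundaries, broadcast evaluator supplier, and argument collections (the helper method and variable names below are hypothetical, and GATK engine imports are elided); only the getAssemblyRegionsFast call itself comes from this class.

      import java.util.List;
      import java.util.function.Supplier;
      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.apache.spark.broadcast.Broadcast;
      import htsjdk.samtools.SAMFileHeader;
      import htsjdk.samtools.SAMSequenceDictionary;
      // GATK engine types (GATKRead, ShardBoundary, FeatureManager, AssemblyRegionEvaluator,
      // AssemblyRegionWalkerContext and the argument collections) are assumed to be imported as well.

      // Hypothetical helper inside a Spark tool driver.
      static JavaRDD<AssemblyRegionWalkerContext> findRegionsFast(
              final JavaSparkContext ctx,
              final JavaRDD<GATKRead> reads,                   // must be coordinate-sorted
              final SAMFileHeader header,
              final SAMSequenceDictionary sequenceDictionary,
              final String referenceFileName,
              final FeatureManager features,                   // may be null
              final List<ShardBoundary> intervalShards,
              final Broadcast<Supplier<AssemblyRegionEvaluator>> evaluatorSupplierBroadcast,
              final AssemblyRegionReadShardArgumentCollection shardingArgs,
              final AssemblyRegionArgumentCollection assemblyRegionArgs) {
          // Fast mode finds assembly regions per read shard in parallel, so results near
          // shard boundaries may differ slightly from the single-machine walker.
          return FindAssemblyRegionsSpark.getAssemblyRegionsFast(
                  ctx, reads, header, sequenceDictionary, referenceFileName, features,
                  intervalShards, evaluatorSupplierBroadcast, shardingArgs, assemblyRegionArgs,
                  false /* shuffle */, false /* trackPileups */);
      }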
    • getAssemblyRegionsStrict

      public static org.apache.spark.api.java.JavaRDD<AssemblyRegionWalkerContext> getAssemblyRegionsStrict(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, htsjdk.samtools.SAMSequenceDictionary sequenceDictionary, String referenceFileName, FeatureManager features, List<ShardBoundary> intervalShards, org.apache.spark.broadcast.Broadcast<Supplier<AssemblyRegionEvaluator>> assemblyRegionEvaluatorSupplierBroadcast, AssemblyRegionReadShardArgumentCollection shardingArgs, AssemblyRegionArgumentCollection assemblyRegionArgs, boolean shuffle)
      Get an RDD of assembly regions for the given reads and intervals using the strict algorithm (looks for assembly regions in each contig in parallel).
      Parameters:
      ctx - the Spark context
      reads - the coordinate-sorted reads
      header - the header for the reads
      sequenceDictionary - the sequence dictionary for the reads
      referenceFileName - the file name for the reference
      features - source of arbitrary features (may be null)
      intervalShards - the sharded intervals to find assembly regions for
      assemblyRegionEvaluatorSupplierBroadcast - broadcast of a supplier of the evaluator used to determine whether a locus is active
      shardingArgs - the arguments for sharding reads
      assemblyRegionArgs - the arguments for finding assembly regions
      shuffle - whether to use a shuffle or not when sharding reads
      Returns:
      an RDD of assembly regions
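      By way of contrast, a sketch of the strict path, assuming the same prepared inputs as in the fast example above (variable names are illustrative); note that there is no trackPileups parameter here, and the final count is purely an illustrative downstream step.

      // Strict mode finds assembly regions per contig in parallel, so the results should match
      // the single-machine walker exactly, at the cost of a longer runtime than fast mode.
      final JavaRDD<AssemblyRegionWalkerContext> contexts =
              FindAssemblyRegionsSpark.getAssemblyRegionsStrict(
                      ctx, reads, header, sequenceDictionary, referenceFileName, features,
                      intervalShards, evaluatorSupplierBroadcast, shardingArgs, assemblyRegionArgs,
                      false /* shuffle */);

      // Illustrative downstream step: materialize the RDD and report how many regions were found.
      final long regionCount = contexts.count();
      System.out.println("Found " + regionCount + " assembly regions");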