public final class SparkUtils
extends java.lang.Object
| Modifier and Type | Method and Description |
|---|---|
| static void | convertHeaderlessHadoopBamShardToBam(java.io.File bamShard, htsjdk.samtools.SAMFileHeader header, java.io.File destination) Converts a headerless Hadoop bam shard (e.g., a part0000, part0001, etc. file produced by ReadsSparkSink) into a readable bam file by adding a header and a BGZF terminator. |
| static <T> void | destroyBroadcast(org.apache.spark.broadcast.Broadcast<T> broadcast, java.lang.String whatBroadcast) Sometimes Spark has trouble destroying a broadcast variable, but we'd like the app to continue anyway. |
| static <K,V> java.util.Iterator<scala.Tuple2<K,java.lang.Iterable<V>>> | getSpanningIterator(java.util.Iterator<scala.Tuple2<K,V>> iterator) An iterator that groups values having the same key into iterable collections. |
| static boolean | pathExists(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.hadoop.fs.Path targetPath) Determine if the targetPath exists. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | putReadsWithTheSameNameInTheSamePartition(htsjdk.samtools.SAMFileHeader header, org.apache.spark.api.java.JavaRDD<GATKRead> reads, org.apache.spark.api.java.JavaSparkContext ctx) Ensure all reads with the same name appear in the same partition of a queryname sorted RDD. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | querynameSortReadsIfNecessary(org.apache.spark.api.java.JavaRDD<GATKRead> reads, int numReducers, htsjdk.samtools.SAMFileHeader header) Sort reads into queryname order if they are not already sorted. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | sortReadsAccordingToHeader(org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, int numReducers) Do a total sort of an RDD of GATKRead according to the sort order in the header. |
| static <T> org.apache.spark.api.java.JavaRDD<T> | sortUsingElementsAsKeys(org.apache.spark.api.java.JavaRDD<T> elements, java.util.Comparator<T> comparator, int numReducers) Do a global sort of an RDD using the given comparator. |
| static <K,V> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Iterable<V>> | spanByKey(org.apache.spark.api.java.JavaPairRDD<K,V> rdd) Like groupByKey, but assumes that values are already sorted by key, so no shuffle is needed, which is much faster. |
public static <T> void destroyBroadcast(org.apache.spark.broadcast.Broadcast<T> broadcast, java.lang.String whatBroadcast)

Sometimes Spark has trouble destroying a broadcast variable, but we'd like the app to continue anyway.
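The intent here, keep going even when Spark fails to tear down a broadcast, can be modeled without Spark as a small destroy-and-continue helper. The class and method names below are illustrative, not part of SparkUtils:

```java
public class DestroyQuietlyDemo {
    /** Run a cleanup action; warn and continue instead of propagating failures. */
    static void destroyQuietly(Runnable destroyAction, String whatBroadcast) {
        try {
            destroyAction.run();
        } catch (RuntimeException e) {
            // Mirrors the documented behavior: log the failure, let the app continue.
            System.err.println("Failed to destroy broadcast " + whatBroadcast + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        destroyQuietly(() -> { throw new IllegalStateException("executor lost"); }, "reference data");
        System.out.println("still running");
    }
}
```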
public static void convertHeaderlessHadoopBamShardToBam(java.io.File bamShard, htsjdk.samtools.SAMFileHeader header, java.io.File destination)

Converts a headerless Hadoop bam shard (e.g., a part0000, part0001, etc. file produced by ReadsSparkSink) into a readable bam file by adding a header and a BGZF terminator. This method is not intended for use with Hadoop bam shards that already have a header -- these shards are already readable using samtools. Currently ReadsSparkSink saves the "shards" with a header for the ReadsWriteFormat.SHARDED case, and without a header for the ReadsWriteFormat.SINGLE case.

Parameters:
- bamShard - the headerless Hadoop bam shard to convert
- header - the header for the BAM file to be created
- destination - the path to which to write the new BAM file
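Structurally the conversion is a concatenation: a BGZF-compressed header block, the raw shard bytes, then the standard 28-byte BGZF end-of-file marker from the SAM/BAM spec. A minimal sketch of that assembly, with hypothetical names and plain byte arrays standing in for the real header serialization and file I/O:

```java
import java.io.ByteArrayOutputStream;

public class ShardConcatDemo {
    // Standard 28-byte BGZF end-of-file marker (an empty BGZF block), per the SAM/BAM spec.
    static final byte[] BGZF_TERMINATOR = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, (byte) 0xff,
        0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
        0x03, 0x00, 0, 0, 0, 0, 0, 0, 0, 0
    };

    /** Assemble a readable BAM: compressed header block + headerless shard bytes + EOF terminator. */
    static byte[] assembleBam(byte[] compressedHeaderBlock, byte[] shardBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(compressedHeaderBlock); // in SparkUtils this comes from the SAMFileHeader
        out.writeBytes(shardBytes);            // the raw part0000/part0001 shard contents
        out.writeBytes(BGZF_TERMINATOR);       // without this, readers report a truncated file
        return out.toByteArray();
    }
}
```

Shards written for the SHARDED case already start with their own header, which is why this concatenation only applies to the headerless SINGLE case.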
public static boolean pathExists(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.hadoop.fs.Path targetPath)

Determine if the targetPath exists.

Parameters:
- ctx - the JavaSparkContext
- targetPath - the org.apache.hadoop.fs.Path object to check
public static org.apache.spark.api.java.JavaRDD<GATKRead> sortReadsAccordingToHeader(org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, int numReducers)

Do a total sort of an RDD of GATKRead according to the sort order in the header.

Parameters:
- reads - a JavaRDD of reads which may or may not be sorted
- header - a header which specifies the desired new sort order. Only SAMFileHeader.SortOrder#coordinate and SAMFileHeader.SortOrder#queryname are supported; all others will result in a GATKException.
- numReducers - the number of reducers to use when sorting

public static <T> org.apache.spark.api.java.JavaRDD<T> sortUsingElementsAsKeys(org.apache.spark.api.java.JavaRDD<T> elements, java.util.Comparator<T> comparator, int numReducers)

Do a global sort of an RDD using the given comparator.
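The method name suggests the common Spark trick of keying each element by itself so that a key sort yields a globally sorted RDD. A plain-Java model of that idea (the helper below is a local sketch, not the Spark implementation, which would distribute the sort across numReducers partitions):

```java
import java.util.AbstractMap;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SortByElementKeyDemo {
    /** Model of "use elements as keys": pair each element with itself, sort by key, drop keys. */
    static <T> List<T> sortUsingElementsAsKeys(List<T> elements, Comparator<T> comparator) {
        return elements.stream()
            .map(e -> new AbstractMap.SimpleEntry<>(e, e)) // keyBy(identity), as a sortByKey needs
            .sorted(Map.Entry.comparingByKey(comparator))
            .map(Map.Entry::getValue)
            .collect(Collectors.toList());
    }
}
```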
public static org.apache.spark.api.java.JavaRDD<GATKRead> putReadsWithTheSameNameInTheSamePartition(htsjdk.samtools.SAMFileHeader header, org.apache.spark.api.java.JavaRDD<GATKRead> reads, org.apache.spark.api.java.JavaSparkContext ctx)

Ensure all reads with the same name appear in the same partition of a queryname sorted RDD.

Throws:
GATKException
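In queryname-sorted data a name group can only straddle adjacent partition boundaries, so the fix-up amounts to pulling the leading same-name reads of each partition back into the previous one. A simplified, non-Spark model of that boundary repair, operating on lists of read names rather than JavaRDD<GATKRead>:

```java
import java.util.ArrayList;
import java.util.List;

public class SameNamePartitionDemo {
    /**
     * Given queryname-sorted "partitions" of read names, move any reads at the head of a
     * partition that share the last name of the previous partition into that partition,
     * so that no name group straddles a boundary.
     */
    static List<List<String>> fixBoundaries(List<List<String>> partitions) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> p : partitions) result.add(new ArrayList<>(p));
        for (int i = 1; i < result.size(); i++) {
            List<String> prev = result.get(i - 1);
            List<String> cur = result.get(i);
            while (!prev.isEmpty() && !cur.isEmpty()
                    && cur.get(0).equals(prev.get(prev.size() - 1))) {
                prev.add(cur.remove(0)); // pull the straddling read back into the earlier partition
            }
        }
        return result;
    }
}
```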
public static <K,V> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Iterable<V>> spanByKey(org.apache.spark.api.java.JavaPairRDD<K,V> rdd)

Like groupByKey, but assumes that values are already sorted by key, so no shuffle is needed, which is much faster.

Type Parameters:
- K - type of keys
- V - type of values

Parameters:
- rdd - the input RDD
public static <K,V> java.util.Iterator<scala.Tuple2<K,java.lang.Iterable<V>>> getSpanningIterator(java.util.Iterator<scala.Tuple2<K,V>> iterator)

An iterator that groups values having the same key into iterable collections.

Type Parameters:
- K - type of keys
- V - type of values

Parameters:
- iterator - an iterator over key-value pairs
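The grouping logic can be sketched in plain Java with a one-element lookahead; Map.Entry stands in for scala.Tuple2 here, and the class is a local illustration, not GATK's implementation. Only adjacent equal keys are grouped, which is why spanByKey requires input already sorted by key:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

public class SpanningIteratorDemo {
    /** Lazily group consecutive key-value pairs sharing a key into (key, values) entries. */
    static <K, V> Iterator<Map.Entry<K, List<V>>> spanningIterator(Iterator<Map.Entry<K, V>> input) {
        return new Iterator<>() {
            Map.Entry<K, V> pending = input.hasNext() ? input.next() : null;

            public boolean hasNext() { return pending != null; }

            public Map.Entry<K, List<V>> next() {
                if (pending == null) throw new NoSuchElementException();
                K key = pending.getKey();
                List<V> values = new ArrayList<>();
                values.add(pending.getValue());
                pending = null;
                while (input.hasNext()) {           // consume the run of equal keys
                    Map.Entry<K, V> e = input.next();
                    if (Objects.equals(e.getKey(), key)) {
                        values.add(e.getValue());
                    } else {
                        pending = e;                // first entry of the next group
                        break;
                    }
                }
                return new AbstractMap.SimpleEntry<>(key, values);
            }
        };
    }
}
```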