public final class SparkUtils
extends java.lang.Object
| Modifier and Type | Method and Description |
|---|---|
| static void | convertHeaderlessHadoopBamShardToBam(java.io.File bamShard, htsjdk.samtools.SAMFileHeader header, java.io.File destination) Converts a headerless Hadoop bam shard (e.g., a part0000, part0001, etc. file produced by ReadsSparkSink) into a readable bam file by adding a header and a BGZF terminator. |
| static <T> void | destroyBroadcast(org.apache.spark.broadcast.Broadcast<T> broadcast, java.lang.String whatBroadcast) Sometimes Spark has trouble destroying a broadcast variable, but we'd like the app to continue anyway. |
| static <K,V> java.util.Iterator<scala.Tuple2<K,java.lang.Iterable<V>>> | getSpanningIterator(java.util.Iterator<scala.Tuple2<K,V>> iterator) An iterator that groups values having the same key into iterable collections. |
| static boolean | pathExists(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.hadoop.fs.Path targetPath) Determine if the targetPath exists. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | putReadsWithTheSameNameInTheSamePartition(htsjdk.samtools.SAMFileHeader header, org.apache.spark.api.java.JavaRDD<GATKRead> reads, org.apache.spark.api.java.JavaSparkContext ctx) Ensure all reads with the same name appear in the same partition of a queryname sorted RDD. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | querynameSortReadsIfNecessary(org.apache.spark.api.java.JavaRDD<GATKRead> reads, int numReducers, htsjdk.samtools.SAMFileHeader header) Sort reads into queryname order if they are not already sorted. |
| static org.apache.spark.api.java.JavaRDD<GATKRead> | sortReadsAccordingToHeader(org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, int numReducers) Do a total sort of an RDD of GATKRead according to the sort order in the header. |
| static <T> org.apache.spark.api.java.JavaRDD<T> | sortUsingElementsAsKeys(org.apache.spark.api.java.JavaRDD<T> elements, java.util.Comparator<T> comparator, int numReducers) Do a global sort of an RDD using the given comparator. |
| static <K,V> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Iterable<V>> | spanByKey(org.apache.spark.api.java.JavaPairRDD<K,V> rdd) Like groupByKey, but assumes that values are already sorted by key, so no shuffle is needed, which is much faster. |
public static <T> void destroyBroadcast(org.apache.spark.broadcast.Broadcast<T> broadcast, java.lang.String whatBroadcast)

Sometimes Spark has trouble destroying a broadcast variable, but we'd like the app to continue anyway.
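The intent here, keep going even when Spark fails to tear down a broadcast, can be modeled without Spark as a small destroy-and-continue helper. The class and method names below are illustrative, not part of SparkUtils:

```java
public class DestroyQuietlyDemo {
    /** Run a cleanup action; warn and continue instead of propagating failures. */
    static void destroyQuietly(Runnable destroyAction, String whatBroadcast) {
        try {
            destroyAction.run();
        } catch (RuntimeException e) {
            // Mirrors the documented behavior: log the failure, let the app continue.
            System.err.println("Failed to destroy broadcast " + whatBroadcast + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        destroyQuietly(() -> { throw new IllegalStateException("executor lost"); }, "reference data");
        System.out.println("still running");
    }
}
```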
public static void convertHeaderlessHadoopBamShardToBam(java.io.File bamShard, htsjdk.samtools.SAMFileHeader header, java.io.File destination)

Converts a headerless Hadoop bam shard (e.g., a part0000, part0001, etc. file produced by ReadsSparkSink) into a readable bam file by adding a header and a BGZF terminator. This method is not intended for use with Hadoop bam shards that already have a header -- these shards are already readable using samtools. Currently ReadsSparkSink saves the "shards" with a header for the ReadsWriteFormat.SHARDED case, and without a header for the ReadsWriteFormat.SINGLE case.

Parameters:
- bamShard - the headerless Hadoop bam shard to convert
- header - the header for the BAM file to be created
- destination - the path to which to write the new BAM file
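Structurally the conversion is a concatenation: a BGZF-compressed header block, the raw shard bytes, then the standard 28-byte BGZF end-of-file marker from the SAM/BAM spec. A minimal sketch of that assembly, with hypothetical names and plain byte arrays standing in for the real header serialization and file I/O:

```java
import java.io.ByteArrayOutputStream;

public class ShardConcatDemo {
    // Standard 28-byte BGZF end-of-file marker (an empty BGZF block), per the SAM/BAM spec.
    static final byte[] BGZF_TERMINATOR = {
        0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, (byte) 0xff,
        0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,
        0x03, 0x00, 0, 0, 0, 0, 0, 0, 0, 0
    };

    /** Assemble a readable BAM: compressed header block + headerless shard bytes + EOF terminator. */
    static byte[] assembleBam(byte[] compressedHeaderBlock, byte[] shardBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(compressedHeaderBlock); // in SparkUtils this comes from the SAMFileHeader
        out.writeBytes(shardBytes);            // the raw part0000/part0001 shard contents
        out.writeBytes(BGZF_TERMINATOR);       // without this, readers report a truncated file
        return out.toByteArray();
    }
}
```

Shards written for the SHARDED case already start with their own header, which is why this concatenation only applies to the headerless SINGLE case.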
public static boolean pathExists(org.apache.spark.api.java.JavaSparkContext ctx, org.apache.hadoop.fs.Path targetPath)

Determine if the targetPath exists.

Parameters:
- ctx - the JavaSparkContext
- targetPath - the org.apache.hadoop.fs.Path object to check
public static org.apache.spark.api.java.JavaRDD<GATKRead> sortReadsAccordingToHeader(org.apache.spark.api.java.JavaRDD<GATKRead> reads, htsjdk.samtools.SAMFileHeader header, int numReducers)

Do a total sort of an RDD of GATKRead according to the sort order in the header.

Parameters:
- reads - a JavaRDD of reads which may or may not be sorted
- header - a header which specifies the desired new sort order. Only SAMFileHeader.SortOrder#coordinate and SAMFileHeader.SortOrder#queryname are supported; all others will result in a GATKException.
- numReducers - the number of reducers to use when sorting

public static <T> org.apache.spark.api.java.JavaRDD<T> sortUsingElementsAsKeys(org.apache.spark.api.java.JavaRDD<T> elements, java.util.Comparator<T> comparator, int numReducers)

Do a global sort of an RDD using the given comparator.
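The method name suggests the common Spark trick of keying each element by itself so that a key sort yields a globally sorted RDD. A plain-Java model of that idea (the helper below is a local sketch, not the Spark implementation, which would distribute the sort across numReducers partitions):

```java
import java.util.AbstractMap;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SortByElementKeyDemo {
    /** Model of "use elements as keys": pair each element with itself, sort by key, drop keys. */
    static <T> List<T> sortUsingElementsAsKeys(List<T> elements, Comparator<T> comparator) {
        return elements.stream()
            .map(e -> new AbstractMap.SimpleEntry<>(e, e)) // keyBy(identity), as a sortByKey needs
            .sorted(Map.Entry.comparingByKey(comparator))
            .map(Map.Entry::getValue)
            .collect(Collectors.toList());
    }
}
```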
public static org.apache.spark.api.java.JavaRDD<GATKRead> putReadsWithTheSameNameInTheSamePartition(htsjdk.samtools.SAMFileHeader header, org.apache.spark.api.java.JavaRDD<GATKRead> reads, org.apache.spark.api.java.JavaSparkContext ctx)

Ensure all reads with the same name appear in the same partition of a queryname sorted RDD.

Throws:
GATKException
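In queryname-sorted data a name group can only straddle adjacent partition boundaries, so the fix-up amounts to pulling the leading same-name reads of each partition back into the previous one. A simplified, non-Spark model of that boundary repair, operating on lists of read names rather than JavaRDD<GATKRead>:

```java
import java.util.ArrayList;
import java.util.List;

public class SameNamePartitionDemo {
    /**
     * Given queryname-sorted "partitions" of read names, move any reads at the head of a
     * partition that share the last name of the previous partition into that partition,
     * so that no name group straddles a boundary.
     */
    static List<List<String>> fixBoundaries(List<List<String>> partitions) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> p : partitions) result.add(new ArrayList<>(p));
        for (int i = 1; i < result.size(); i++) {
            List<String> prev = result.get(i - 1);
            List<String> cur = result.get(i);
            while (!prev.isEmpty() && !cur.isEmpty()
                    && cur.get(0).equals(prev.get(prev.size() - 1))) {
                prev.add(cur.remove(0)); // pull the straddling read back into the earlier partition
            }
        }
        return result;
    }
}
```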
public static <K,V> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Iterable<V>> spanByKey(org.apache.spark.api.java.JavaPairRDD<K,V> rdd)

Like groupByKey, but assumes that values are already sorted by key, so no shuffle is needed, which is much faster.

Type Parameters:
- K - type of keys
- V - type of values

Parameters:
- rdd - the input RDD
public static <K,V> java.util.Iterator<scala.Tuple2<K,java.lang.Iterable<V>>> getSpanningIterator(java.util.Iterator<scala.Tuple2<K,V>> iterator)

An iterator that groups values having the same key into iterable collections.

Type Parameters:
- K - type of keys
- V - type of values

Parameters:
- iterator - an iterator over key-value pairs
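The grouping logic can be sketched in plain Java with a one-element lookahead; Map.Entry stands in for scala.Tuple2 here, and the class is a local illustration, not GATK's implementation. Only adjacent equal keys are grouped, which is why spanByKey requires input already sorted by key:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Objects;

public class SpanningIteratorDemo {
    /** Lazily group consecutive key-value pairs sharing a key into (key, values) entries. */
    static <K, V> Iterator<Map.Entry<K, List<V>>> spanningIterator(Iterator<Map.Entry<K, V>> input) {
        return new Iterator<>() {
            Map.Entry<K, V> pending = input.hasNext() ? input.next() : null;

            public boolean hasNext() { return pending != null; }

            public Map.Entry<K, List<V>> next() {
                if (pending == null) throw new NoSuchElementException();
                K key = pending.getKey();
                List<V> values = new ArrayList<>();
                values.add(pending.getValue());
                pending = null;
                while (input.hasNext()) {           // consume the run of equal keys
                    Map.Entry<K, V> e = input.next();
                    if (Objects.equals(e.getKey(), key)) {
                        values.add(e.getValue());
                    } else {
                        pending = e;                // first entry of the next group
                        break;
                    }
                }
                return new AbstractMap.SimpleEntry<>(key, values);
            }
        };
    }
}
```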