Performs a region join between two RDDs (shuffle join).
This implementation is shuffle-based, so unlike BroadcastRegionJoin it does not require collecting one side of the join into memory. In effect, it performs a global sort of each RDD by genome position and then does a sort-merge join, similar to the chromsweep implementation in bedtools. More specifically, it first defines a set of bins across the genome, then assigns each object in the RDDs to every bin that it overlaps (replicating the object if necessary), performs the shuffle, and sorts the objects in each bin. Finally, each bin independently performs a chromsweep sort-merge join.
The 'left' side of the join
The 'right' side of the join
implicit type of leftRDD
implicit type of rightRDD
An RDD of pairs (x, y), where x is from leftRDD, y is from rightRDD, and the region corresponding to x overlaps the region corresponding to y.
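The binning and per-bin sweep described above can be sketched outside Spark in plain Python. This is only an illustrative sketch, not the actual implementation. It assumes regions are (contig, start, end) tuples with half-open coordinates and start < end, and that bins have a fixed width. The names bins_for, sweep_join, and shuffle_region_join are hypothetical. Because an object is replicated into every bin it overlaps, the same pair can be found in several bins. The sketch avoids duplicates by keeping a pair only in the bin that contains the start of the overlap, which is an assumption and not something the description above states.

```python
from collections import defaultdict

def bins_for(region, bin_size):
    """Return every (contig, bin index) key that a half-open region overlaps."""
    contig, start, end = region
    return [(contig, b) for b in range(start // bin_size, (end - 1) // bin_size + 1)]

def overlaps(x, y):
    """True if two half-open regions on the same contig share at least one base."""
    return x[0] == y[0] and x[1] < y[2] and y[1] < x[2]

def sweep_join(left, right):
    """Chromsweep-style merge of two lists already sorted by start position."""
    pairs = []
    active = []  # right regions that may still overlap the current or a later left region
    j = 0
    for x in left:
        # Admit right regions that start before the current left region ends.
        while j < len(right) and right[j][1] < x[2]:
            active.append(right[j])
            j += 1
        # Left starts never decrease, so a right region that ends at or before
        # this start cannot overlap any later left region either.
        active = [y for y in active if y[2] > x[1]]
        pairs.extend((x, y) for y in active if overlaps(x, y))
    return pairs

def shuffle_region_join(left, right, bin_size):
    """Bin both sides, sort within each bin, and sweep each bin independently."""
    left_bins, right_bins = defaultdict(list), defaultdict(list)
    for x in left:
        for key in bins_for(x, bin_size):
            left_bins[key].append(x)   # stands in for the shuffle
    for y in right:
        for key in bins_for(y, bin_size):
            right_bins[key].append(y)
    result = []
    for key in sorted(left_bins.keys() & right_bins.keys()):
        _, b = key
        lo, hi = b * bin_size, (b + 1) * bin_size
        xs = sorted(left_bins[key], key=lambda r: r[1])
        ys = sorted(right_bins[key], key=lambda r: r[1])
        for x, y in sweep_join(xs, ys):
            # Assumed de-duplication rule: emit a pair only in the bin that
            # contains the first base of the overlap.
            if lo <= max(x[1], y[1]) < hi:
                result.append((x, y))
    return result

left_regions = [("chr1", 0, 10), ("chr1", 95, 205), ("chr2", 5, 15)]
right_regions = [("chr1", 5, 20), ("chr1", 100, 110), ("chr1", 190, 300), ("chr2", 20, 30)]
for x, y in shuffle_region_join(left_regions, right_regions, bin_size=100):
    print(x, y)
```

In this sketch the dictionary grouping plays the role of the shuffle, and each bin is processed without reference to any other bin, which is what would let the per-bin joins run in parallel.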