Performs a region join between two RDDs (broadcast join).
This implementation first _collects_ the left-side RDD. If the left-side RDD is large, or is spatially idiosyncratic (i.e. the union of its regions covers a significant fraction of the genome), this implementation will likely perform poorly.
Once the left-side RDD is collected, its regions are merged into a set of distinct unions; these unions define the partitions over which the region join is computed.
Each left-side region is keyed by its corresponding partition (each such region should belong to exactly one partition). Each right-side region is also keyed by its corresponding partitions (there can be more than one, since a right-side region may cross the boundaries of the partitions defined by the left side).
Finally, within each separate partition, we essentially perform a cartesian-product-and-filter operation. The result is the region-join.
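The steps described above can be sketched in plain Python, without Spark, to make the data flow concrete. This is only an illustration under stated assumptions, not the actual implementation: `Region`, `overlaps`, `merge_to_unions` and `region_join` are hypothetical names, regions are assumed to be half-open intervals `[start, end)` on a named contig, and ordinary lists stand in for the two RDDs.

```python
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    # Hypothetical stand-in for a genomic region: half-open interval [start, end).
    contig: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same contig and share at least one base."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def merge_to_unions(regions):
    """Collapse regions into sorted, pairwise non-overlapping unions."""
    unions = []
    for r in sorted(regions):  # sorts by (contig, start, end)
        if unions and overlaps(unions[-1], r):
            last = unions[-1]
            unions[-1] = Region(last.contig, last.start, max(last.end, r.end))
        else:
            unions.append(r)
    return unions


def region_join(left, right):
    # Step 1: the "collected" left side is reduced to its distinct unions.
    unions = merge_to_unions(left)

    # Step 2: key each left region by the single union that contains it.
    left_by_part = defaultdict(list)
    for x in left:
        for i, u in enumerate(unions):
            if overlaps(u, x):
                left_by_part[i].append(x)
                break

    # Step 3: key each right region by every union it overlaps (possibly several).
    right_by_part = defaultdict(list)
    for y in right:
        for i, u in enumerate(unions):
            if overlaps(u, y):
                right_by_part[i].append(y)

    # Step 4: cartesian product and filter within each partition.
    return [
        (x, y)
        for i, xs in left_by_part.items()
        for x in xs
        for y in right_by_part.get(i, [])
        if overlaps(x, y)
    ]


left = [Region("chr1", 0, 10), Region("chr1", 5, 20), Region("chr1", 50, 60)]
right = [Region("chr1", 8, 55), Region("chr2", 0, 5)]
pairs = region_join(left, right)
```

In this example the left side merges into two unions, `chr1:[0, 20)` and `chr1:[50, 60)`. The right region `chr1:[8, 55)` crosses both, so it is keyed to both partitions, while the `chr2` region overlaps no union and is dropped. Because every left region belongs to exactly one partition, no `(x, y)` pair is emitted twice.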
The 'left' side of the join
The 'right' side of the join
implicit type of baseRDD
implicit type of joinedRDD
An RDD of pairs (x, y), where x is from baseRDD, y is from joinedRDD, and the region corresponding to x overlaps the region corresponding to y.
Extends the BroadcastRegionJoin trait to implement a right outer join.
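As a rough illustration of what a right outer region join returns, the following plain-Python sketch uses a naive nested loop rather than the broadcast strategy. It assumes standard right-outer-join semantics, where every right-side element appears in the output and is paired with an empty left value when nothing overlaps it; `right_outer_region_join` is a hypothetical name, `None` stands in for that empty value, and regions are simplified to `(start, end)` half-open intervals on a single contig.

```python
def right_outer_region_join(left, right):
    """Pair each right interval with every overlapping left interval.

    Right intervals with no overlapping left interval are kept and paired
    with None (assumed right-outer-join semantics).
    """
    def overlaps(a, b):
        # Half-open intervals [start, end) overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    joined = []
    for y in right:
        matches = [x for x in left if overlaps(x, y)]
        if matches:
            joined.extend((x, y) for x in matches)
        else:
            joined.append((None, y))  # unmatched right element is preserved
    return joined


result = right_outer_region_join([(0, 10)], [(5, 15), (20, 30)])
```

Here `(5, 15)` overlaps the only left interval and is paired with it, whereas `(20, 30)` has no match and is kept with an empty left value.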