This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for
shuffling rows instead of Java key-value pairs. Note that something like this should eventually
be implemented in Spark core, but that is blocked by some more general refactorings to shuffle
interfaces / internals.
This RDD takes a ShuffleDependency (dependency),
and an optional array of partition start indices as input arguments
(specifiedPartitionStartIndices).
The dependency has the parent RDD of this RDD, which represents the dataset before shuffle
(i.e. map output). Elements of this RDD are (partitionId, Row) pairs.
Partition ids should be in the range [0, numPartitions - 1].
dependency.partitioner is the original partitioner used to partition
map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions
(i.e. the number of partitions of the map output).
When specifiedPartitionStartIndices is defined, specifiedPartitionStartIndices.length
will be the number of post-shuffle partitions. For this case, the ith post-shuffle
partition includes specifiedPartitionStartIndices[i] to
specifiedPartitionStartIndices[i+1] - 1 (inclusive).
When specifiedPartitionStartIndices is not defined, there will be
dependency.partitioner.numPartitions post-shuffle partitions. For this case,
a post-shuffle partition is created for every pre-shuffle partition.
Linear Supertypes
RDD[InternalRow], Logging, Serializable, Serializable, AnyRef, Any
This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (
dependency
), and an optional array of partition start indices as input arguments (specifiedPartitionStartIndices
).The
dependency
has the parent RDD of this RDD, which represents the dataset before shuffle (i.e. map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1].dependency.partitioner
is the original partitioner used to partition map output, anddependency.partitioner.numPartitions
is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).When
specifiedPartitionStartIndices
is defined,specifiedPartitionStartIndices.length
will be the number of post-shuffle partitions. For this case, thei
th post-shuffle partition includesspecifiedPartitionStartIndices[i]
tospecifiedPartitionStartIndices[i+1] - 1
(inclusive).When
specifiedPartitionStartIndices
is not defined, there will bedependency.partitioner.numPartitions
post-shuffle partitions. For this case, a post-shuffle partition is created for every pre-shuffle partition.