This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (dependency) and an optional array of partition start indices (specifiedPartitionStartIndices) as input arguments.

The dependency has the parent RDD of this RDD, which represents the dataset before the shuffle (i.e., the map output). Elements of this RDD are (partitionId, Row) pairs; partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition the map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e., the number of partitions of the map output).

When specifiedPartitionStartIndices is defined, specifiedPartitionStartIndices.length is the number of post-shuffle partitions. In this case, the i-th post-shuffle partition includes pre-shuffle partitions specifiedPartitionStartIndices[i] to specifiedPartitionStartIndices[i+1] - 1 (inclusive).

When specifiedPartitionStartIndices is not defined, there are dependency.partitioner.numPartitions post-shuffle partitions, one for every pre-shuffle partition.
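The partition-mapping rules above can be sketched outside of Spark as plain index arithmetic. The following is a minimal, hypothetical Python illustration (the function name and signature are not part of the Spark API); it computes, for each post-shuffle partition, the inclusive range of pre-shuffle partition ids it covers, handling both the case where start indices are supplied and the default one-to-one case.

```python
def post_shuffle_partition_ranges(num_pre_shuffle_partitions,
                                  partition_start_indices=None):
    """Return a list of (start, end) pre-shuffle partition-id ranges,
    end inclusive, one tuple per post-shuffle partition."""
    if partition_start_indices is None:
        # Default case: one post-shuffle partition per pre-shuffle partition.
        return [(i, i) for i in range(num_pre_shuffle_partitions)]
    ranges = []
    for i, start in enumerate(partition_start_indices):
        # The i-th post-shuffle partition covers pre-shuffle partitions
        # partition_start_indices[i] .. partition_start_indices[i+1] - 1;
        # the last one extends through the final pre-shuffle partition.
        if i + 1 < len(partition_start_indices):
            end = partition_start_indices[i + 1] - 1
        else:
            end = num_pre_shuffle_partitions - 1
        ranges.append((start, end))
    return ranges

# With 5 pre-shuffle partitions and start indices [0, 2, 4]:
# post-shuffle partition 0 covers pre-shuffle partitions 0-1,
# partition 1 covers 2-3, and partition 2 covers 4.
print(post_shuffle_partition_ranges(5, [0, 2, 4]))  # → [(0, 1), (2, 3), (4, 4)]
print(post_shuffle_partition_ranges(3))             # → [(0, 0), (1, 1), (2, 2)]
```

This mirrors how coalescing adjacent pre-shuffle partitions into fewer post-shuffle partitions is expressed purely by the array of start indices.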