Package org.apache.spark.sql.execution.exchange

package exchange

Type Members

  1. case class BroadcastExchangeExec(mode: BroadcastMode, child: SparkPlan) extends Exchange with Product with Serializable

    A BroadcastExchangeExec collects, transforms and finally broadcasts the result of a transformed SparkPlan.
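
    For illustration only (this operator is internal and not constructed directly by users), a broadcast exchange typically appears in the physical plan when the broadcast hint marks the build side of a join; the DataFrames and column names below are hypothetical:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.broadcast

      val spark = SparkSession.builder()
        .appName("broadcast-exchange-example")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
      val small = Seq((1, "x"), (2, "y")).toDF("id", "label")

      // The broadcast hint steers the planner towards a broadcast hash join,
      // whose build side is wrapped in a broadcast exchange node.
      large.join(broadcast(small), "id").explain()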

  2. case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] with Product with Serializable

    Ensures that the Partitioning of input data meets the Distribution requirements for each operator by inserting ShuffleExchange operators where required. Also ensures that the input partition ordering requirements are met.
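
    For illustration, the effect of this rule can be observed in the physical plan of an aggregation whose input is not already hash-partitioned by the grouping key; the data and column names below are hypothetical:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("ensure-requirements-example")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "amount")

      // The final hash aggregate requires its input to be hash-partitioned by "key",
      // so EnsureRequirements inserts a shuffle exchange between the partial and
      // final aggregation when planning this query.
      df.groupBy("key").sum("amount").explain()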

  3. abstract class Exchange extends SparkPlan with UnaryExecNode

    Base class for operators that exchange data among multiple threads or processes.

    Exchanges are the key class of operators that enable parallelism. Although the implementation differs significantly, the concept is similar to the exchange operator described in "Volcano -- An Extensible and Parallel Query Evaluation System" by Goetz Graefe.

  4. class ExchangeCoordinator extends Logging

    A coordinator used to determine how we shuffle data between stages generated by Spark SQL. Right now, the work of this coordinator is to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or multiple stages.

    A coordinator is constructed with three parameters: numExchanges, targetPostShuffleInputSize, and minNumPostShufflePartitions.

    • numExchanges indicates how many ShuffleExchanges will be registered to this coordinator, so that when we start to do any actual work, we can check that we have the expected number of ShuffleExchanges.
    • targetPostShuffleInputSize is the targeted size of a post-shuffle partition's input data. With this parameter, we can estimate the number of post-shuffle partitions. It is configured through spark.sql.adaptive.shuffle.targetPostShuffleInputSize (a configuration sketch follows this list).
    • minNumPostShufflePartitions is an optional parameter. If it is defined, this coordinator will try to make sure that there are at least minNumPostShufflePartitions post-shuffle partitions.
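
    A minimal configuration sketch (assuming the Spark 2.x adaptive-execution configuration keys; key names and defaults may differ in other versions):

      import org.apache.spark.sql.SparkSession

      // spark.sql.adaptive.enabled turns on the use of ExchangeCoordinator; the
      // other two keys map to targetPostShuffleInputSize and
      // minNumPostShufflePartitions described above (key names assumed from Spark 2.x).
      val spark = SparkSession.builder()
        .appName("adaptive-shuffle-example")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "134217728") // 128 MB
        .config("spark.sql.adaptive.minNumPostShufflePartitions", "1")
        .getOrCreate()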

    The workflow of this coordinator is described as follows:

    • Before the execution of a SparkPlan, for a ShuffleExchange operator, if an ExchangeCoordinator is assigned to it, it registers itself with this coordinator. This happens in the doPrepare method.
    • Once we start to execute a physical plan, a ShuffleExchange registered to this coordinator will call postShuffleRDD to get its corresponding post-shuffle ShuffledRowRDD. If this coordinator has already decided how to shuffle data, the ShuffleExchange immediately gets its corresponding post-shuffle ShuffledRowRDD.
    • If this coordinator has not yet decided how to shuffle data, it will ask the registered ShuffleExchanges to submit their pre-shuffle stages. Then, based on the size statistics of the pre-shuffle partitions, this coordinator determines the number of post-shuffle partitions and, whenever necessary, packs multiple pre-shuffle partitions with continuous indices into a single post-shuffle partition.
    • Finally, this coordinator creates post-shuffle ShuffledRowRDDs for all registered ShuffleExchanges, so when a ShuffleExchange calls postShuffleRDD, this coordinator can look up the corresponding RDD.

    The strategy used to determine the number of post-shuffle partitions is as follows. We have a target input size for a post-shuffle partition. Once we have the size statistics of pre-shuffle partitions from the stages corresponding to the registered ShuffleExchanges, we make a pass over those statistics and pack pre-shuffle partitions with continuous indices into a single post-shuffle partition until the size of that post-shuffle partition is equal to or greater than the target size. For example, suppose we have two stages with the following pre-shuffle partition size statistics:

    • stage 1: [100 MB, 20 MB, 100 MB, 10 MB, 30 MB]
    • stage 2: [10 MB, 10 MB, 70 MB, 5 MB, 5 MB]

    Assuming the target input size is 128 MB, we will have three post-shuffle partitions (a small sketch of this packing logic follows the result list below), which are:

    • post-shuffle partition 0: pre-shuffle partition 0 and 1
    • post-shuffle partition 1: pre-shuffle partition 2
    • post-shuffle partition 2: pre-shuffle partition 3 and 4
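
    A small, self-contained sketch of this packing strategy (not the actual ExchangeCoordinator code; method and variable names are illustrative) that reproduces the example above:

      // Pack pre-shuffle partitions with continuous indices into post-shuffle
      // partitions: a new post-shuffle partition starts once the accumulated
      // input size of the current one has reached the target size.
      def estimatePartitionStartIndices(
          stageSizes: Seq[Array[Long]],   // bytes per pre-shuffle partition, one array per stage
          targetSize: Long): Array[Int] = {
        val numPreShufflePartitions = stageSizes.head.length
        val startIndices = scala.collection.mutable.ArrayBuffer(0)
        var postShuffleInputSize = 0L
        for (i <- 0 until numPreShufflePartitions) {
          // Total bytes the post-shuffle side reads for pre-shuffle index i, across all stages.
          val sizeAtIndex = stageSizes.map(_(i)).sum
          if (postShuffleInputSize >= targetSize) {
            startIndices += i             // current partition is full; start a new one at index i
            postShuffleInputSize = sizeAtIndex
          } else {
            postShuffleInputSize += sizeAtIndex
          }
        }
        startIndices.toArray
      }

      val mb = 1024L * 1024L
      val stage1 = Array(100L, 20L, 100L, 10L, 30L).map(_ * mb)
      val stage2 = Array(10L, 10L, 70L, 5L, 5L).map(_ * mb)
      // Prints "0, 2, 3": post-shuffle partitions {0, 1}, {2}, {3, 4}, matching the example.
      println(estimatePartitionStartIndices(Seq(stage1, stage2), 128 * mb).mkString(", "))
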
  5. case class ReuseExchange(conf: SQLConf) extends Rule[SparkPlan] with Product with Serializable

    Finds duplicated exchanges in the Spark plan and reuses a single exchange for all the references.

  6. case class ReusedExchangeExec(output: Seq[Attribute], child: Exchange) extends SparkPlan with LeafExecNode with Product with Serializable

    A wrapper that gives a reused exchange a different output. Two exchanges that produce logically identical output still have distinct sets of output attribute ids, so this wrapper preserves the original ids, which are what downstream operators expect.

  7. case class ShuffleExchange(newPartitioning: Partitioning, child: SparkPlan, coordinator: Option[ExchangeCoordinator]) extends Exchange with Product with Serializable

    Performs a shuffle that will result in the desired newPartitioning.
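
    For illustration, an explicit repartitioning by a column requests a hash partitioning that the planner implements with this operator (the data and column names below are hypothetical):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("shuffle-exchange-example")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

      // repartition(10, $"id") asks for hash partitioning on "id" into 10 partitions;
      // the physical plan shows a shuffle exchange node carrying that partitioning.
      df.repartition(10, $"id").explain()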

Value Members

  1. object BroadcastExchangeExec extends Serializable

  2. object ShuffleExchange extends Serializable
