package execution
The physical execution component of Spark SQL. Note that this is a private package: all classes in it are considered an internal API to Spark SQL and are subject to change between minor releases.
Type Members
-
class
AggregatingAccumulator extends AccumulatorV2[InternalRow, InternalRow]
Accumulator that computes a global aggregate.
-
case class
AppendColumnsExec(func: (Any) ⇒ Any, deserializer: Expression, serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Applies the given function to each input row, appending the encoded result at the end of the row.
-
case class
AppendColumnsWithObjectExec(func: (Any) ⇒ Any, inputSerializer: Seq[NamedExpression], newColumnsSerializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with Product with Serializable
An optimized version of AppendColumnsExec that can be executed directly on deserialized objects.
-
case class
ApplyColumnarRulesAndInsertTransitions(columnarRules: Seq[ColumnarRule], outputsColumnar: Boolean) extends Rule[SparkPlan] with Product with Serializable
Apply any user defined ColumnarRules and find the correct place to insert transitions to/from columnar formatted data.
- columnarRules
custom columnar rules
- outputsColumnar
whether or not the produced plan should output columnar format.
-
trait
BaseLimitExec extends SparkPlan with LimitExec with CodegenSupport
Helper trait which defines methods that are shared by both LocalLimitExec and GlobalLimitExec.
- trait BaseScriptTransformationExec extends SparkPlan with UnaryExecNode
- abstract class BaseScriptTransformationWriterThread extends Thread with Logging
-
abstract
class
BaseSubqueryExec extends SparkPlan
Parent class for different types of subquery plans
- trait BinaryExecNode extends SparkPlan with BinaryLike[SparkPlan]
-
trait
BlockingOperatorWithCodegen extends SparkPlan with CodegenSupport
A special kind of operators which support whole stage codegen. Blocking means these operators will consume all the inputs first, before producing output. Typical blocking operators are sort and aggregate.
-
abstract
class
BufferedRowIterator extends AnyRef
An iterator interface used to pull the output from generated function for multiple operators (whole stage codegen).
-
class
CacheManager extends Logging with AdaptiveSparkPlanHelper
Provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed. Data is cached using byte buffers stored in an InMemoryRelation. This relation is automatically substituted into query plans that return the sameResult as the originally cached query.
Internal to Spark SQL.
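As a quick illustration (a sketch assuming a SparkSession named spark; the column names are made up), caching a Dataset registers its plan with the CacheManager, and later queries over the same plan are rewritten to read the cached data:
val sales = spark.range(0, 1000).selectExpr("id % 10 AS store", "id AS amount")
sales.cache()                              // registers the plan with the CacheManager, backed by an InMemoryRelation
sales.groupBy("store").count().explain()   // the plan shows an InMemoryTableScan substituted for the cached subtree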
-
case class
CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation) extends Product with Serializable
Holds a cached logical plan and its data
-
case class
CoGroupExec(func: (Any, Iterator[Any], Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, leftDeserializer: Expression, rightDeserializer: Expression, leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], leftAttr: Seq[Attribute], rightAttr: Seq[Attribute], leftOrder: Seq[SortOrder], rightOrder: Seq[SortOrder], outputObjAttr: Attribute, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with ObjectProducerExec with Product with Serializable
Co-groups the data from left and right children, and calls the function with each group and 2 iterators containing all elements in the group from left and right side. The result of this function is flattened before being output.
-
class
CoGroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow], Iterator[InternalRow])]
Iterates over GroupedIterators and returns the cogrouped data, i.e. each record is a grouping key with its associated values from all GroupedIterators. Note: we assume the output of each GroupedIterator is ordered by the grouping key.
-
case class
CoalesceExec(numPartitions: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Physical plan for returning a new RDD that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can use ShuffleExchange instead. This will add a shuffle step, but it means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
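As a rough illustration (a sketch assuming a SparkSession named spark), coalesce is planned as this narrow-dependency node, while repartition inserts a shuffle:
val df = spark.range(0, 1000, 1, 1000)   // a Dataset with 1000 partitions
df.coalesce(100).explain()      // the physical plan shows a Coalesce node and no Exchange (no shuffle)
df.repartition(100).explain()   // the physical plan shows an Exchange (shuffle) instead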
- case class CoalescedMapperPartitionSpec(startMapIndex: Int, endMapIndex: Int, numReducers: Int) extends ShufflePartitionSpec with Product with Serializable
- case class CoalescedPartitionSpec(startReducerIndex: Int, endReducerIndex: Int, dataSize: Option[Long] = None) extends ShufflePartitionSpec with Product with Serializable
-
class
CoalescedPartitioner extends Partitioner
A Partitioner that might group together one or more partitions from the parent.
-
trait
CodegenSupport extends SparkPlan
An interface for those physical operators that support codegen.
-
case class
CollapseCodegenStages(codegenStageCounter: AtomicInteger = new AtomicInteger(0)) extends Rule[SparkPlan] with Product with Serializable
Find the chained plans that support codegen, and collapse them together as a WholeStageCodegen.
The codegenStageCounter generates IDs for codegen stages within a query plan. It does not affect equality, nor does it participate in destructuring pattern matching of WholeStageCodegenExec.
This ID is used to help differentiate between codegen stages. It is included as part of the explain output for physical plans, e.g.
== Physical Plan ==
*(5) SortMergeJoin [x#3L], [y#9L], Inner
:- *(2) Sort [x#3L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(x#3L, 200)
:     +- *(1) Project [(id#0L % 2) AS x#3L]
:        +- *(1) Filter isnotnull((id#0L % 2))
:           +- *(1) Range (0, 5, step=1, splits=8)
+- *(4) Sort [y#9L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(y#9L, 200)
      +- *(3) Project [(id#6L % 2) AS y#9L]
         +- *(3) Filter isnotnull((id#6L % 2))
            +- *(3) Range (0, 5, step=1, splits=8)
where the ID makes it obvious that not all adjacent codegen'd plan operators are of the same codegen stage.
The codegen stage ID is also optionally included in the name of the generated classes as a suffix, so that it's easier to associate a generated class back to the physical operator. This is controlled by SQLConf: spark.sql.codegen.useIdInClassName
The ID is also included in various log messages.
Within a query, a codegen stage in a plan starts counting from 1, in "insertion order". WholeStageCodegenExec operators are inserted into a plan in depth-first post-order. See CollapseCodegenStages.insertWholeStageCodegen for the definition of insertion order.
0 is reserved as a special ID value to indicate a temporary WholeStageCodegenExec object is created, e.g. for special fallback handling when an existing WholeStageCodegenExec failed to generate/compile code.
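A query of roughly the following shape reproduces the plan above (a sketch; it assumes broadcast joins are disabled so that a sort-merge join is planned, and the exact operator ordering may vary by Spark version):
import org.apache.spark.sql.functions.col
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   // force a sort-merge join
val left  = spark.range(0, 5, 1, 8).selectExpr("id % 2 AS x").where(col("x").isNotNull)
val right = spark.range(0, 5, 1, 8).selectExpr("id % 2 AS y").where(col("y").isNotNull)
left.join(right, col("x") === col("y")).explain()   // the *(n) prefixes are the codegen stage IDs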
-
case class
CollectLimitExec(limit: Int = -1, child: SparkPlan, offset: Int = 0) extends SparkPlan with LimitExec with Product with Serializable
Take the first limit elements, collect them to a single partition, and then drop the first offset elements.
This operator will be used when a logical Limit and/or Offset operation is the final operator in a logical plan, which happens when the user is collecting results back to the driver.
-
case class
CollectMetricsExec(name: String, metricExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Collect arbitrary (named) metrics from a SparkPlan.
-
case class
CollectTailExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable
Take the last limit elements and collect them to a single partition.
This operator will be used when a logical Tail operation is the final operator in a logical plan, which happens when the user is collecting results back to the driver.
-
class
ColumnarRule extends AnyRef
Holds a user defined rule that can be used to inject columnar implementations of various operators in the plan. The preColumnarTransitions Rule can be used to replace SparkPlan instances with versions that support a columnar implementation. After this Spark will insert any transitions necessary. This includes transitions from row to columnar RowToColumnarExec and from columnar to row ColumnarToRowExec. At this point the postColumnarTransitions Rule is called to allow replacing any of the implementations of the transitions or doing cleanup of the plan, like inserting stages to build larger batches for more efficient processing, or stages that transition the data to/from an accelerator's memory.
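A minimal sketch of how such a rule is typically registered through the public SparkSessionExtensions.injectColumnar hook (the identity rules below are placeholders; a real plugin would rewrite the plan, and the class name is hypothetical):
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

class MyColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar { _ =>
      new ColumnarRule {
        // Runs before Spark inserts RowToColumnarExec / ColumnarToRowExec transitions:
        // replace operators with columnar-capable implementations here.
        override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan   // no-op placeholder
        }
        // Runs after the transitions are inserted: clean up or fuse them here.
        override def postColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan   // no-op placeholder
        }
      }
    }
  }
}
// Enabled with e.g. spark.sql.extensions=com.example.MyColumnarExtensions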
-
case class
ColumnarToRowExec(child: SparkPlan) extends SparkPlan with ColumnarToRowTransition with CodegenSupport with Product with Serializable
Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow. This is inserted whenever such a transition is determined to be needed.
The implementation is based off of similar implementations in org.apache.spark.sql.execution.python.ArrowEvalPythonExec and MapPartitionsInRWithArrowExec. Eventually this should replace those implementations.
-
trait
ColumnarToRowTransition extends SparkPlan with UnaryExecNode
A trait that is used as a tag to indicate a transition from columns to rows. This allows plugins to replace the current ColumnarToRowExec with an optimized version and still have operations that walk a spark plan looking for this type of transition properly match it.
-
case class
CommandResultExec(output: Seq[Attribute], commandPhysicalPlan: SparkPlan, rows: Seq[InternalRow]) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for holding data from a command.
commandPhysicalPlan is just used to display the plan tree for EXPLAIN. rows may not be serializable and ideally we should not send rows to the executors; thus they are marked as transient.
- trait DataSourceScanExec extends SparkPlan with LeafExecNode
-
case class
DeserializeToObjectExec(deserializer: Expression, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with CodegenSupport with Product with Serializable
Takes the input row from child and turns it into an object using the given deserializer expression. The output of this operator is a single-field safe row containing the deserialized object.
-
abstract
class
ExecSubqueryExpression extends PlanExpression[BaseSubqueryExec]
The base class for subqueries that are used in SparkPlan.
-
case class
ExpandExec(projections: Seq[Seq[Expression]], output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Apply all of the GroupExpressions to every input row, hence we will get multiple output rows for an input row.
- projections
The group of expressions; all of the group expressions should output the same schema specified by the parameter output.
- output
The output Schema
- child
Child operator
- sealed trait ExplainMode extends AnyRef
-
case class
ExternalRDD[T](outputObjAttr: Attribute, rdd: RDD[T])(session: SparkSession) extends LogicalPlan with LeafNode with ObjectProducer with MultiInstanceRelation with Product with Serializable
Logical plan node for scanning data from an RDD.
-
case class
ExternalRDDScanExec[T](outputObjAttr: Attribute, rdd: RDD[T]) extends SparkPlan with LeafExecNode with ObjectProducerExec with Product with Serializable
Physical plan node for scanning data from an RDD.
-
trait
FileRelation extends AnyRef
An interface for relations that are backed by files. When a class implements this interface, the list of paths that it returns will be returned to a user who calls inputPaths on any DataFrame that queries this relation.
-
case class
FileSourceScanExec(relation: HadoopFsRelation, output: Seq[Attribute], requiredSchema: StructType, partitionFilters: Seq[Expression], optionalBucketSet: Option[BitSet], optionalNumCoalescedBuckets: Option[Int], dataFilters: Seq[Expression], tableIdentifier: Option[TableIdentifier], disableBucketedScan: Boolean = false) extends SparkPlan with FileSourceScanLike with Product with Serializable
Physical plan node for scanning data from HadoopFsRelations.
- relation
The file-based relation to scan.
- output
Output attributes of the scan, including data attributes and partition attributes.
- requiredSchema
Required schema of the underlying relation, excluding partition columns.
- partitionFilters
Predicates to use for partition pruning.
- optionalBucketSet
Bucket ids for bucket pruning.
- optionalNumCoalescedBuckets
Number of coalesced buckets.
- dataFilters
Filters on non-partition columns.
- tableIdentifier
Identifier for the table in the metastore.
- disableBucketedScan
Disable bucketed scan based on physical query plan, see rule DisableUnnecessaryBucketedScan for details.
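For example (a sketch with a hypothetical path and columns), the split between partitionFilters and dataFilters is visible in explain output:
// Assumes /data/events is a Parquet dataset partitioned by a dt column.
val events = spark.read.parquet("/data/events")
events.where("dt = '2024-01-01' AND status = 'ok'").explain()
// The FileScan node should list the dt predicate under PartitionFilters (pruning
// whole directories) and the status predicate under DataFilters/PushedFilters.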
-
trait
FileSourceScanLike extends SparkPlan with DataSourceScanExec
A base trait for file scans containing file listing and metrics code.
-
case class
FilterExec(condition: Expression, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with GeneratePredicateHelper with Product with Serializable
Physical plan for Filter.
-
case class
FlatMapGroupsInRExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, outputSchema: StructType, keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable
Groups the input rows together and calls the R function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.
-
case class
FlatMapGroupsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], keyDeserializer: Expression, groupingAttributes: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Similar to FlatMapGroupsInRExec, but serializes and deserializes input/output in Arrow format. This is also somewhat similar to org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.
-
case class
GenerateExec(generator: Generator, requiredChildOutput: Seq[Attribute], outer: Boolean, generatorOutput: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming, with one important additional feature: it allows the input rows to be joined with their output.
This operator supports whole stage code generation for generators that do not implement terminate().
- generator
the generator expression
- requiredChildOutput
required attributes from child's output
- outer
when true, each input row will be output at least once, even if the output of the given generator is empty.
- generatorOutput
the qualified output attributes of the generator of this node, which are constructed in the analysis phase and cannot be changed, as the parent node is already bound to them.
- trait GeneratePredicateHelper extends PredicateHelper
-
case class
GlobalLimitExec(limit: Int = -1, child: SparkPlan, offset: Int = 0) extends SparkPlan with BaseLimitExec with Product with Serializable
Take the first limit elements and then drop the first offset elements in the child's single output partition.
-
class
GroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow])]
Iterates over a presorted set of rows, chunking it up by the grouping expression. Each call to next will return a pair containing the current group and an iterator that will return all the elements of that group. Iterators for each group are lazily constructed by extracting rows from the input iterator. As such, full groups are never materialized by this class.
Example input:
Input: [a, 1], [b, 2], [b, 3]
Grouping: x#1
InputSchema: x#1, y#2
Result:
First call to next(): ([a], Iterator([a, 1]))
Second call to next(): ([b], Iterator([b, 2], [b, 3]))
Note, the class does not handle the case of an empty input for simplicity of implementation. Use the factory to construct a new instance.
-
case class
InSubqueryExec(child: Expression, plan: BaseSubqueryExec, exprId: ExprId, shouldBroadcast: Boolean = false, resultBroadcast: Broadcast[Array[Any]] = null, result: Array[Any] = null) extends ExecSubqueryExpression with UnaryLike[Expression] with Predicate with Product with Serializable
The physical node of in-subquery. When this is used for Dynamic Partition Pruning, as the pruning happens at the driver side, we don't broadcast subquery result.
-
case class
InputAdapter(child: SparkPlan) extends SparkPlan with UnaryExecNode with InputRDDCodegen with Product with Serializable
InputAdapter is used to hide a SparkPlan from a subtree that supports codegen.
This is the leaf node of a tree with WholeStageCodegen that is used to generate code that consumes an RDD iterator of InternalRow.
-
trait
InputRDDCodegen extends SparkPlan with CodegenSupport
Leaf codegen node reading from a single RDD.
- trait LeafExecNode extends SparkPlan with LeafLike[SparkPlan]
-
trait
LimitExec extends SparkPlan with UnaryExecNode
An operator that takes a limited number of elements from its child operator.
-
case class
LocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with BaseLimitExec with Product with Serializable
Take the first limit elements of each child partition, but do not collect or shuffle them.
-
case class
LocalTableScanExec(output: Seq[Attribute], rows: Seq[InternalRow]) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from a local collection.
Seq may not be serializable and ideally we should not send rows and unsafeRows to the executors; thus they are marked as transient.
-
case class
LogicalRDD(output: Seq[Attribute], rdd: RDD[InternalRow], outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil, isStreaming: Boolean = false)(session: SparkSession, originStats: Option[Statistics] = None, originConstraints: Option[ExpressionSet] = None) extends LogicalPlan with LeafNode with MultiInstanceRelation with Product with Serializable
Logical plan node for scanning data from an RDD of InternalRow.
It is advised to set the fields originStats and originConstraints if the RDD is built directly from a DataFrame, so that Spark can make better optimizations.
-
case class
MapElementsExec(func: AnyRef, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with CodegenSupport with Product with Serializable
Applies the given function to each input object. The output of its child must be a single-field row containing the input object.
This operator is a safe version of ProjectExec: since its output is a custom object, we need to use a safe row to contain it.
-
case class
MapGroupsExec(func: (Any, Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], dataOrder: Seq[SortOrder], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable
Groups the input rows together and calls the function with each group and an iterator containing all elements in the group. The iterator is sorted according to dataOrder if given. The result of this function is flattened before being output.
-
case class
MapPartitionsExec(func: (Iterator[Any]) ⇒ Iterator[Any], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with Product with Serializable
Applies the given function to input object iterator. The output of its child must be a single-field row containing the input object.
-
case class
MapPartitionsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Similar to MapPartitionsExec and org.apache.spark.sql.execution.r.MapPartitionsRWrapper, but serializes and deserializes input/output in Arrow format.
This is somewhat similar to org.apache.spark.sql.execution.python.ArrowEvalPythonExec.
-
trait
ObjectConsumerExec extends SparkPlan with UnaryExecNode with ReferenceAllColumns[SparkPlan]
Physical version of ObjectConsumer.
-
trait
ObjectProducerExec extends SparkPlan
Physical version of ObjectProducer.
-
case class
OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[LogicalPlan] with Product with Serializable
This rule optimizes the execution of queries that can be answered by looking only at partition-level metadata. This applies when all the columns scanned are partition columns, and the query has an aggregate operator that satisfies one of the following conditions:
1. the aggregate expression is a partition column, e.g. SELECT col FROM tbl GROUP BY col.
2. an aggregate function on partition columns with DISTINCT, e.g. SELECT col1, count(DISTINCT col2) FROM tbl GROUP BY col1.
3. an aggregate function on partition columns which has the same result with or without the DISTINCT keyword, e.g. SELECT col1, Max(col2) FROM tbl GROUP BY col1.
- trait OrderPreservingUnaryExecNode extends SparkPlan with UnaryExecNode with AliasAwareQueryOutputOrdering[SparkPlan]
- case class PartialMapperPartitionSpec(mapIndex: Int, startReducerIndex: Int, endReducerIndex: Int) extends ShufflePartitionSpec with Product with Serializable
- case class PartialReducerPartitionSpec(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int, dataSize: Long) extends ShufflePartitionSpec with Product with Serializable
-
trait
PartitioningPreservingUnaryExecNode extends SparkPlan with UnaryExecNode with AliasAwareOutputExpression
A trait that handles aliases in the outputExpressions to produce an outputPartitioning that satisfies distribution requirements.
- case class PlanLater(plan: LogicalPlan) extends SparkPlan with LeafExecNode with Product with Serializable
-
case class
PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] with Product with Serializable
Plans subqueries that are present in the given SparkPlan.
-
case class
ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with PartitioningPreservingUnaryExecNode with OrderPreservingUnaryExecNode with Product with Serializable
Physical plan for Project.
-
class
QueryExecution extends Logging
The primary workflow for executing relational queries using Spark. Designed to allow easy access to the intermediate phases of query execution for developers.
While this is not a public class, we should avoid changing the function names for the sake of changing them, because a lot of developers use the feature for debugging.
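For example, the intermediate phases can be inspected from any Dataset (a sketch assuming a SparkSession named spark):
val qe = spark.range(100).where("id > 5").queryExecution
qe.logical        // parsed logical plan
qe.analyzed       // analyzed logical plan
qe.optimizedPlan  // logical plan after the optimizer
qe.sparkPlan      // physical plan chosen by the planner
qe.executedPlan   // physical plan after preparation rules (e.g. WholeStageCodegen insertion)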
-
case class
RDDScanExec(output: Seq[Attribute], rdd: RDD[InternalRow], name: String, outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from an RDD of InternalRow.
-
case class
RangeExec(range: Range) extends SparkPlan with LeafExecNode with CodegenSupport with Product with Serializable
Physical plan for range (generating a range of 64 bit numbers).
- final class RecordBinaryComparator extends RecordComparator
-
case class
ReusedSubqueryExec(child: BaseSubqueryExec) extends BaseSubqueryExec with LeafExecNode with Product with Serializable
A wrapper for reused BaseSubqueryExec.
-
case class
RowDataSourceScanExec(output: Seq[Attribute], requiredSchema: StructType, filters: Set[Filter], handledFilters: Set[Filter], pushedDownOperators: PushedDownOperators, rdd: RDD[InternalRow], relation: BaseRelation, tableIdentifier: Option[TableIdentifier]) extends SparkPlan with DataSourceScanExec with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from a relation.
-
case class
RowToColumnarExec(child: SparkPlan) extends SparkPlan with RowToColumnarTransition with Product with Serializable
Provides a common executor to translate an RDD of InternalRow into an RDD of ColumnarBatch. This is inserted whenever such a transition is determined to be needed.
This is similar to some of the code in ArrowConverters.scala and org.apache.spark.sql.execution.arrow.ArrowWriter. That code is more specialized to convert InternalRow to Arrow formatted data, but in the future if we make OffHeapColumnVector internally Arrow formatted we may be able to replace much of that code.
This is also similar to org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate() and org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.toBatch(). toBatch is only ever called from tests and can probably be removed, but populate is used by both Orc and Parquet to initialize partition and missing columns. There is some chance that we could replace populate with RowToColumnConverter, but the performance requirements are different and it would only be to reduce code.
-
trait
RowToColumnarTransition extends SparkPlan with UnaryExecNode
A trait that is used as a tag to indicate a transition from rows to columns. This allows plugins to replace the current RowToColumnarExec with an optimized version and still have operations that walk a spark plan looking for this type of transition properly match it.
-
class
SQLExecutionRDD extends RDD[InternalRow]
It is just a wrapper over sqlRDD, which sets and makes effective all the configs from the captured SQLConf. Note that this means we may miss configurations set after the creation of this RDD and before its execution.
-
case class
SampleExec(lowerBound: Double, upperBound: Double, withReplacement: Boolean, seed: Long, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Physical plan for sampling the dataset.
- lowerBound
Lower-bound of the sampling probability (usually 0.0)
- upperBound
Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.
- withReplacement
Whether to sample with replacement.
- seed
the random seed
- child
the SparkPlan
-
case class
ScalarSubquery(plan: BaseSubqueryExec, exprId: ExprId) extends ExecSubqueryExpression with LeafLike[Expression] with SupportQueryContext with Product with Serializable
A subquery that will return only one row and one column.
This is the physical copy of ScalarSubquery to be used inside SparkPlan.
-
case class
ScriptTransformationIOSchema(inputRowFormat: Seq[(String, String)], outputRowFormat: Seq[(String, String)], inputSerdeClass: Option[String], outputSerdeClass: Option[String], inputSerdeProps: Seq[(String, String)], outputSerdeProps: Seq[(String, String)], recordReaderClass: Option[String], recordWriterClass: Option[String], schemaLess: Boolean) extends Serializable with Product
The wrapper class of input and output schema properties
-
case class
SerializeFromObjectExec(serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with CodegenSupport with Product with Serializable
Takes the input object from child and turns it into an unsafe row using the given serializer expression. The output of its child must be a single-field row containing the input object.
- sealed trait ShufflePartitionSpec extends AnyRef
-
class
ShuffledRowRDD extends RDD[InternalRow]
This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (dependency) and an array of ShufflePartitionSpec as input arguments.
The dependency has the parent RDD of this RDD, which represents the dataset before shuffle (i.e. map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).
-
case class
SortExec(sortOrder: Seq[SortOrder], global: Boolean, child: SparkPlan, testSpillFrequency: Int = 0) extends SparkPlan with UnaryExecNode with BlockingOperatorWithCodegen with Product with Serializable
Performs (external) sorting.
- global
when true performs a global sort of all partitions by shuffling the data first if necessary.
- testSpillFrequency
Method for configuring periodic spilling in unit tests. If set, will spill every frequency records.
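For instance (a sketch), both a global and a per-partition sort plan a SortExec; they differ in the global flag and in whether an exchange is required first:
val df = spark.range(0, 1000).selectExpr("id % 7 AS k")
df.sort("k").explain()                  // Sort [...], true  -- global sort, preceded by a range-partitioning Exchange
df.sortWithinPartitions("k").explain()  // Sort [...], false -- per-partition sort, no Exchange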
- class SparkOptimizer extends Optimizer
-
abstract
class
SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
The base class for physical operators.
The naming convention is that physical operators end with "Exec" suffix, e.g. ProjectExec.
-
class
SparkPlanInfo extends AnyRef
:: DeveloperApi :: Stores information about a SQL SparkPlan.
- Annotations
- @DeveloperApi()
- class SparkPlanner extends SparkStrategies with SQLConfHelper
-
case class
SparkScriptTransformationExec(script: String, output: Seq[Attribute], child: SparkPlan, ioschema: ScriptTransformationIOSchema) extends SparkPlan with BaseScriptTransformationExec with Product with Serializable
Transforms the input by forking and running the specified script.
- script
the command that should be executed.
- output
the attributes that are produced by the script.
- child
logical plan whose output is transformed.
- ioschema
the class set that defines how to handle input/output data.
- case class SparkScriptTransformationWriterThread(iter: Iterator[InternalRow], inputSchema: Seq[DataType], ioSchema: ScriptTransformationIOSchema, outputStream: OutputStream, proc: Process, stderrBuffer: CircularBuffer, taskContext: TaskContext, conf: Configuration) extends BaseScriptTransformationWriterThread with Product with Serializable
-
class
SparkSqlAstBuilder extends AstBuilder
Builder that converts an ANTLR ParseTree into a LogicalPlan/Expression/TableIdentifier.
-
class
SparkSqlParser extends AbstractSqlParser
Concrete parser for Spark SQL statements.
- abstract class SparkStrategies extends QueryPlanner[SparkPlan]
-
abstract
class
SparkStrategy extends GenericStrategy[SparkPlan]
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
-
case class
SubqueryAdaptiveBroadcastExec(name: String, index: Int, onlyInBroadcast: Boolean, buildPlan: LogicalPlan, buildKeys: Seq[Expression], child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Similar to SubqueryBroadcastExec, this node is used to store the initial physical plan of DPP subquery filters when enabling both AQE and DPP. It is an intermediate physical plan and is not executable. After the build side is executed, this node will be replaced with the SubqueryBroadcastExec and the child will be optimized with the ReusedExchange from the build side.
-
case class
SubqueryBroadcastExec(name: String, index: Int, buildKeys: Seq[Expression], child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Physical plan for a custom subquery that collects and transforms the broadcast key values. This subquery retrieves the partition key from the broadcast results based on the type of HashedRelation returned. If the key is packed inside a Long, we extract it through bitwise operations, otherwise we return it from the appropriate index of the UnsafeRow.
- index
the index of the join key in the list of keys from the build side
- buildKeys
the join keys from the build side of the join used
- child
the BroadcastExchange or the AdaptiveSparkPlan with BroadcastQueryStageExec from the build side of the join
-
case class
SubqueryExec(name: String, child: SparkPlan, maxNumRows: Option[Int] = None) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Physical plan for a subquery.
-
case class
TakeOrderedAndProjectExec(limit: Int, sortOrder: Seq[SortOrder], projectList: Seq[NamedExpression], child: SparkPlan, offset: Int = 0) extends SparkPlan with OrderPreservingUnaryExecNode with Product with Serializable
Take the first limit elements as defined by the sortOrder, then drop the first offset elements, and do projection if needed. This is logically equivalent to having a Limit and/or Offset operator after a SortExec operator, or having a ProjectExec operator between them. This could have been named TopK, but Spark's top operator does the opposite in ordering, so we name it TakeOrdered to avoid confusion.
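For example (a sketch), an orderBy followed by a small limit is typically planned as this node rather than a full sort:
import org.apache.spark.sql.functions.desc
val topTen = spark.range(0, 1000000).selectExpr("id % 97 AS score").orderBy(desc("score")).limit(10)
topTen.explain()  // the physical plan shows TakeOrderedAndProject(limit=10, orderBy=[score ... DESC], ...)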
- trait UnaryExecNode extends SparkPlan with UnaryLike[SparkPlan]
-
case class
UnionExec(children: Seq[SparkPlan]) extends SparkPlan with Product with Serializable
Physical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.
If we change how this is implemented physically, we'd need to update org.apache.spark.sql.catalyst.plans.logical.Union.maxRowsPerPartition.
- final class UnsafeExternalRowSorter extends AnyRef
-
final
class
UnsafeFixedWidthAggregationMap extends AnyRef
Unsafe-based HashMap for performing aggregations where the aggregated values are fixed-width.
This map supports a maximum of 2 billion keys.
-
final
class
UnsafeKVExternalSorter extends AnyRef
A class for performing external sorting on key-value records. Both key and value are UnsafeRows.
Note that this class allows optionally passing in a BytesToBytesMap directly in order to perform in-place sorting of records in the map.
-
class
UnsafeRowSerializer extends Serializer with Serializable
Serializer for serializing UnsafeRows during shuffle. Since UnsafeRows are already stored as bytes, this serializer simply copies those bytes to the underlying output stream. When deserializing a stream of rows, instances of this serializer mutate and return a single UnsafeRow instance that is backed by an on-heap byte array.
Note that this serializer implements only the Serializer methods that are used during shuffle, so certain SerializerInstance methods will throw UnsupportedOperationException.
-
case class
WholeStageCodegenExec(child: SparkPlan)(codegenStageId: Int) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
WholeStageCodegen compiles a subtree of plans that support codegen together into a single Java function.
Here is the call graph for generating the Java source (plan A supports codegen, but plan B does not):
WholeStageCodegen      Plan A               FakeInput        Plan B

-> execute()
    |
 doExecute() --------->  inputRDDs() -------> inputRDDs() ------> execute()
    |
    +----------------->  produce()
                            |
                         doProduce() -------> produce()
                                                 |
                                              doProduce()
                                                 |
                        doConsume() <--------- consume()
                            |
 doConsume() <---------  consume()
SparkPlan A should override doProduce() and doConsume().
doCodeGen() will create a CodeGenContext, which will hold a list of variables for input, used to generate code for BoundReference.
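The generated Java source for each codegen stage can be inspected from the public explain API, e.g. (a sketch):
val df = spark.range(0, 100).where("id % 2 = 0").selectExpr("id * 3 AS x")
df.explain("codegen")  // prints each WholeStageCodegen subtree together with its generated Java source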
Value Members
- object AggregatingAccumulator extends Serializable
- object BaseLimitExec extends Serializable
- object CoalesceExec extends Serializable
- object CoalescedPartitionSpec extends Serializable
-
object
CodegenMode extends ExplainMode with Product with Serializable
Codegen mode means that when printing explain for a DataFrame, if generated code is available, a physical plan and the generated code are expected to be printed to the console.
- object CollectMetricsExec extends AdaptiveSparkPlanHelper with Serializable
-
object
CommandExecutionMode extends Enumeration
SPARK-35378: Commands should be executed eagerly so that something like sql("INSERT ...") can trigger the table insertion immediately without a .collect(). To avoid endless recursion we should use NON_ROOT when recursively executing commands. Note that we can't execute a query plan with leaf command nodes, because many commands return GenericInternalRow and can't be put in a query plan directly; otherwise the query engine may cast GenericInternalRow to UnsafeRow and fail. When running EXPLAIN, or commands inside other commands, we should use SKIP to not eagerly trigger the command execution.
-
object
CostMode extends ExplainMode with Product with Serializable
Cost mode means that when printing explain for a DataFrame, if plan node statistics are available, a logical plan and the statistics are expected to be printed to the console.
- object ExecSubqueryExpression
- object ExplainMode
- object ExplainUtils extends AdaptiveSparkPlanHelper
-
object
ExtendedMode extends ExplainMode with Product with Serializable
Extended mode means that when printing explain for a DataFrame, both logical and physical plans are expected to be printed to the console.
- object ExternalRDD extends Serializable
-
object
FormattedMode extends ExplainMode with Product with Serializable
Formatted mode means that when printing explain for a DataFrame, explain output is expected to be split into two sections: a physical plan outline and node details.
- object GroupedIterator
-
object
HiveResult
Runs a query returning the result in Hive compatible form.
- object LogicalRDD extends Logging with Serializable
- object MapGroupsExec extends Serializable
-
object
ObjectOperator
Helper functions for physical operators that work with user defined objects.
- object PartitionedFileUtil
- object QueryExecution
-
object
RemoveRedundantProjects extends Rule[SparkPlan]
Remove redundant ProjectExec nodes from the spark plan. A ProjectExec node is redundant when:
- It has the same output attributes and ordering as its child's output, and the ordering of the attributes is required.
- It has the same output attributes as its child's output, when attribute output ordering is not required.
This rule needs to be a physical rule because project nodes are useful during logical optimization to prune data. During physical planning, redundant project nodes can be removed to simplify the query plan.
-
object
RemoveRedundantSorts extends Rule[SparkPlan]
Remove redundant SortExec node from the spark plan. A sort node is redundant when its child satisfies both its sort orders and its required child distribution. Note this rule differs from the Optimizer rule EliminateSorts in that this rule also checks if the child satisfies the required distribution so that it is safe to remove not only a local sort but also a global sort when its child already satisfies required sort orders.
-
object
ReplaceHashWithSortAgg extends Rule[SparkPlan]
Replace hash-based aggregate with sort aggregate in the spark plan if:
1. The plan is a pair of partial and final HashAggregateExec or ObjectHashAggregateExec, and the child of the partial aggregate satisfies the sort order of the corresponding SortAggregateExec; or
2. The plan is a HashAggregateExec or ObjectHashAggregateExec, and the child satisfies the sort order of the corresponding SortAggregateExec.
Examples: 1. aggregate after join:
HashAggregate(t1.i, SUM, final)
 |                                        SortAggregate(t1.i, SUM, complete)
HashAggregate(t1.i, SUM, partial)   =>     |
 |                                        SortMergeJoin(t1.i = t2.j)
SortMergeJoin(t1.i = t2.j)
2. aggregate after sort:
HashAggregate(t1.i, SUM, partial)        SortAggregate(t1.i, SUM, partial)
 |                                  =>    |
Sort(t1.i)                               Sort(t1.i)
Hash-based aggregate can be replaced when its child satisfies the sort order of the corresponding sort aggregate. Sort aggregate is faster in the sense that it does not have the hashing overhead of hash aggregate.
- object SQLExecution
- object ScriptTransformationIOSchema extends Serializable
-
object
SimpleMode extends ExplainMode with Product with Serializable
Simple mode means that when printing explain for a DataFrame, only a physical plan is expected to be printed to the console.
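Taken together, these ExplainMode objects correspond to the mode strings accepted by Dataset.explain (a sketch; the DataFrame is illustrative):
val df = spark.range(0, 100).where("id % 2 = 0")
df.explain("simple")     // SimpleMode: physical plan only
df.explain("extended")   // ExtendedMode: logical and physical plans
df.explain("codegen")    // CodegenMode: physical plan plus generated code, when available
df.explain("cost")       // CostMode: logical plan with statistics, when available
df.explain("formatted")  // FormattedMode: physical plan outline plus node details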
- object SortPrefixUtils
- object SparkPlan extends Serializable
- object SubqueryBroadcastExec extends Serializable
- object SubqueryExec extends Serializable
- object UnaryExecNode extends Serializable
- object WholeStageCodegenExec extends Serializable