package execution
The physical execution component of Spark SQL. Note that this is a private package: all classes in it are considered an internal API to Spark SQL and are subject to change between minor releases.
Type Members
-
class
AggregatingAccumulator extends AccumulatorV2[InternalRow, InternalRow]
Accumulator that computes a global aggregate.
-
case class
AppendColumnsExec(func: (Any) ⇒ Any, deserializer: Expression, serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Applies the given function to each input row, appending the encoded result at the end of the row.
-
case class
AppendColumnsWithObjectExec(func: (Any) ⇒ Any, inputSerializer: Seq[NamedExpression], newColumnsSerializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with Product with Serializable
An optimized version of AppendColumnsExec that can be executed directly on deserialized objects.
-
case class
ApplyColumnarRulesAndInsertTransitions(columnarRules: Seq[ColumnarRule], outputsColumnar: Boolean) extends Rule[SparkPlan] with Product with Serializable
Apply any user defined ColumnarRules and find the correct place to insert transitions to/from columnar formatted data.
- columnarRules
custom columnar rules
- outputsColumnar
whether or not the produced plan should output columnar format.
-
trait
BaseLimitExec extends SparkPlan with LimitExec with CodegenSupport
Helper trait which defines methods that are shared by both LocalLimitExec and GlobalLimitExec.
- trait BaseScriptTransformationExec extends SparkPlan with UnaryExecNode
- abstract class BaseScriptTransformationWriterThread extends Thread with Logging
-
abstract
class
BaseSubqueryExec extends SparkPlan
Parent class for different types of subquery plans
- trait BinaryExecNode extends SparkPlan with BinaryLike[SparkPlan]
-
trait
BlockingOperatorWithCodegen extends SparkPlan with CodegenSupport
A special kind of operators which support whole stage codegen. Blocking means these operators will consume all the inputs first, before producing output. Typical blocking operators are sort and aggregate.
-
abstract
class
BufferedRowIterator extends AnyRef
An iterator interface used to pull the output from generated function for multiple operators (whole stage codegen).
-
class
CacheManager extends Logging with AdaptiveSparkPlanHelper
Provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed. Data is cached using byte buffers stored in an InMemoryRelation. This relation is automatically substituted into query plans that return the sameResult as the originally cached query.
Internal to Spark SQL.
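As a quick illustration (a sketch assuming a SparkSession named spark; the column names are made up), caching a Dataset registers its plan with the CacheManager, and later queries over the same plan are rewritten to read the cached data:
val sales = spark.range(0, 1000).selectExpr("id % 10 AS store", "id AS amount")
sales.cache()                              // registers the plan with the CacheManager, backed by an InMemoryRelation
sales.groupBy("store").count().explain()   // the plan shows an InMemoryTableScan substituted for the cached subtree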
-
case class
CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation) extends Product with Serializable
Holds a cached logical plan and its data
-
case class
CoGroupExec(func: (Any, Iterator[Any], Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, leftDeserializer: Expression, rightDeserializer: Expression, leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], leftAttr: Seq[Attribute], rightAttr: Seq[Attribute], leftOrder: Seq[SortOrder], rightOrder: Seq[SortOrder], outputObjAttr: Attribute, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with ObjectProducerExec with Product with Serializable
Co-groups the data from left and right children, and calls the function with each group and 2 iterators containing all elements in the group from left and right side. The result of this function is flattened before being output.
-
class
CoGroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow], Iterator[InternalRow])]
Iterates over GroupedIterators and returns the cogrouped data, i.e. each record is a grouping key with its associated values from all GroupedIterators. Note: we assume the output of each GroupedIterator is ordered by the grouping key.
-
case class
CoalesceExec(numPartitions: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Physical plan for returning a new RDD that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can use ShuffleExchange instead. This will add a shuffle step, but it means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
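As a rough illustration (a sketch assuming a SparkSession named spark), coalesce is planned as this narrow-dependency node, while repartition inserts a shuffle:
val df = spark.range(0, 1000, 1, 1000)   // a Dataset with 1000 partitions
df.coalesce(100).explain()      // the physical plan shows a Coalesce node and no Exchange (no shuffle)
df.repartition(100).explain()   // the physical plan shows an Exchange (shuffle) instead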
- case class CoalescedMapperPartitionSpec(startMapIndex: Int, endMapIndex: Int, numReducers: Int) extends ShufflePartitionSpec with Product with Serializable
- case class CoalescedPartitionSpec(startReducerIndex: Int, endReducerIndex: Int, dataSize: Option[Long] = None) extends ShufflePartitionSpec with Product with Serializable
-
class
CoalescedPartitioner extends Partitioner
A Partitioner that might group together one or more partitions from the parent.
-
trait
CodegenSupport extends SparkPlan
An interface for those physical operators that support codegen.
-
case class
CollapseCodegenStages(codegenStageCounter: AtomicInteger = new AtomicInteger(0)) extends Rule[SparkPlan] with Product with Serializable
Find the chained plans that support codegen, and collapse them together as a WholeStageCodegen.
The codegenStageCounter generates IDs for codegen stages within a query plan. It does not affect equality, nor does it participate in destructuring pattern matching of WholeStageCodegenExec.
This ID is used to help differentiate between codegen stages. It is included as part of the explain output for physical plans, e.g.
== Physical Plan ==
*(5) SortMergeJoin [x#3L], [y#9L], Inner
:- *(2) Sort [x#3L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(x#3L, 200)
:     +- *(1) Project [(id#0L % 2) AS x#3L]
:        +- *(1) Filter isnotnull((id#0L % 2))
:           +- *(1) Range (0, 5, step=1, splits=8)
+- *(4) Sort [y#9L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(y#9L, 200)
      +- *(3) Project [(id#6L % 2) AS y#9L]
         +- *(3) Filter isnotnull((id#6L % 2))
            +- *(3) Range (0, 5, step=1, splits=8)
where the ID makes it obvious that not all adjacent codegen'd plan operators are of the same codegen stage.
The codegen stage ID is also optionally included in the name of the generated classes as a suffix, so that it's easier to associate a generated class back to the physical operator. This is controlled by SQLConf: spark.sql.codegen.useIdInClassName
The ID is also included in various log messages.
Within a query, a codegen stage in a plan starts counting from 1, in "insertion order". WholeStageCodegenExec operators are inserted into a plan in depth-first post-order. See CollapseCodegenStages.insertWholeStageCodegen for the definition of insertion order.
0 is reserved as a special ID value to indicate a temporary WholeStageCodegenExec object is created, e.g. for special fallback handling when an existing WholeStageCodegenExec failed to generate/compile code.
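A query of roughly the following shape reproduces the plan above (a sketch; it assumes broadcast joins are disabled so that a sort-merge join is planned, and the exact operator ordering may vary by Spark version):
import org.apache.spark.sql.functions.col
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   // force a sort-merge join
val left  = spark.range(0, 5, 1, 8).selectExpr("id % 2 AS x").where(col("x").isNotNull)
val right = spark.range(0, 5, 1, 8).selectExpr("id % 2 AS y").where(col("y").isNotNull)
left.join(right, col("x") === col("y")).explain()   // the *(n) prefixes are the codegen stage IDs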
-
case class
CollectLimitExec(limit: Int = -1, child: SparkPlan, offset: Int = 0) extends SparkPlan with LimitExec with Product with Serializable
Take the first limit elements, collect them to a single partition, and then drop the first offset elements.
This operator will be used when a logical Limit and/or Offset operation is the final operator in a logical plan, which happens when the user is collecting results back to the driver.
-
case class
CollectMetricsExec(name: String, metricExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Collect arbitrary (named) metrics from a SparkPlan.
-
case class
CollectTailExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable
Take the last limit elements and collect them to a single partition.
This operator will be used when a logical Tail operation is the final operator in a logical plan, which happens when the user is collecting results back to the driver.
-
class
ColumnarRule extends AnyRef
Holds a user defined rule that can be used to inject columnar implementations of various operators in the plan. The preColumnarTransitions Rule can be used to replace SparkPlan instances with versions that support a columnar implementation. After this Spark will insert any transitions necessary. This includes transitions from row to columnar RowToColumnarExec and from columnar to row ColumnarToRowExec. At this point the postColumnarTransitions Rule is called to allow replacing any of the implementations of the transitions or doing cleanup of the plan, like inserting stages to build larger batches for more efficient processing, or stages that transition the data to/from an accelerator's memory.
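A minimal sketch of how such a rule is typically registered through the public SparkSessionExtensions.injectColumnar hook (the identity rules below are placeholders; a real plugin would rewrite the plan, and the class name is hypothetical):
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

class MyColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar { _ =>
      new ColumnarRule {
        // Runs before Spark inserts RowToColumnarExec / ColumnarToRowExec transitions:
        // replace operators with columnar-capable implementations here.
        override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan   // no-op placeholder
        }
        // Runs after the transitions are inserted: clean up or fuse them here.
        override def postColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = plan   // no-op placeholder
        }
      }
    }
  }
}
// Enabled with e.g. spark.sql.extensions=com.example.MyColumnarExtensions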
-
case class
ColumnarToRowExec(child: SparkPlan) extends SparkPlan with ColumnarToRowTransition with CodegenSupport with Product with Serializable
Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow. This is inserted whenever such a transition is determined to be needed.
The implementation is based off of similar implementations in org.apache.spark.sql.execution.python.ArrowEvalPythonExec and MapPartitionsInRWithArrowExec. Eventually this should replace those implementations.
-
trait
ColumnarToRowTransition extends SparkPlan with UnaryExecNode
A trait that is used as a tag to indicate a transition from columns to rows. This allows plugins to replace the current ColumnarToRowExec with an optimized version and still have operations that walk a spark plan looking for this type of transition properly match it.
-
case class
CommandResultExec(output: Seq[Attribute], commandPhysicalPlan: SparkPlan, rows: Seq[InternalRow]) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for holding data from a command.
commandPhysicalPlan is just used to display the plan tree for EXPLAIN. rows may not be serializable and ideally we should not send rows to the executors; thus they are marked as transient.
- trait DataSourceScanExec extends SparkPlan with LeafExecNode
-
case class
DeserializeToObjectExec(deserializer: Expression, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with CodegenSupport with Product with Serializable
Takes the input row from child and turns it into an object using the given deserializer expression. The output of this operator is a single-field safe row containing the deserialized object.
-
abstract
class
ExecSubqueryExpression extends PlanExpression[BaseSubqueryExec]
The base class for subqueries that are used in SparkPlan.
-
case class
ExpandExec(projections: Seq[Seq[Expression]], output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Apply all of the GroupExpressions to every input row, hence we will get multiple output rows for an input row.
- projections
The group of expressions; all of the group expressions should output the same schema specified by the parameter output.
- output
The output Schema
- child
Child operator
- sealed trait ExplainMode extends AnyRef
-
case class
ExternalRDD[T](outputObjAttr: Attribute, rdd: RDD[T])(session: SparkSession) extends LogicalPlan with LeafNode with ObjectProducer with MultiInstanceRelation with Product with Serializable
Logical plan node for scanning data from an RDD.
-
case class
ExternalRDDScanExec[T](outputObjAttr: Attribute, rdd: RDD[T]) extends SparkPlan with LeafExecNode with ObjectProducerExec with Product with Serializable
Physical plan node for scanning data from an RDD.
-
trait
FileRelation extends AnyRef
An interface for relations that are backed by files. When a class implements this interface, the list of paths that it returns will be returned to a user who calls inputPaths on any DataFrame that queries this relation.
-
case class
FileSourceScanExec(relation: HadoopFsRelation, output: Seq[Attribute], requiredSchema: StructType, partitionFilters: Seq[Expression], optionalBucketSet: Option[BitSet], optionalNumCoalescedBuckets: Option[Int], dataFilters: Seq[Expression], tableIdentifier: Option[TableIdentifier], disableBucketedScan: Boolean = false) extends SparkPlan with FileSourceScanLike with Product with Serializable
Physical plan node for scanning data from HadoopFsRelations.
- relation
The file-based relation to scan.
- output
Output attributes of the scan, including data attributes and partition attributes.
- requiredSchema
Required schema of the underlying relation, excluding partition columns.
- partitionFilters
Predicates to use for partition pruning.
- optionalBucketSet
Bucket ids for bucket pruning.
- optionalNumCoalescedBuckets
Number of coalesced buckets.
- dataFilters
Filters on non-partition columns.
- tableIdentifier
Identifier for the table in the metastore.
- disableBucketedScan
Disable bucketed scan based on physical query plan, see rule DisableUnnecessaryBucketedScan for details.
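For example (a sketch with a hypothetical path and columns), the split between partitionFilters and dataFilters is visible in explain output:
// Assumes /data/events is a Parquet dataset partitioned by a dt column.
val events = spark.read.parquet("/data/events")
events.where("dt = '2024-01-01' AND status = 'ok'").explain()
// The FileScan node should list the dt predicate under PartitionFilters (pruning
// whole directories) and the status predicate under DataFilters/PushedFilters.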
-
trait
FileSourceScanLike extends SparkPlan with DataSourceScanExec
A base trait for file scans containing file listing and metrics code.
-
case class
FilterExec(condition: Expression, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with GeneratePredicateHelper with Product with Serializable
Physical plan for Filter.
-
case class
FlatMapGroupsInRExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, outputSchema: StructType, keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable
Groups the input rows together and calls the R function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.
-
case class
FlatMapGroupsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], keyDeserializer: Expression, groupingAttributes: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Similar to FlatMapGroupsInRExec, but serializes and deserializes input/output in Arrow format. This is also somewhat similar to org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.
-
case class
GenerateExec(generator: Generator, requiredChildOutput: Seq[Attribute], outer: Boolean, generatorOutput: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming, with one important additional feature: it allows the input rows to be joined with their output.
This operator supports whole stage code generation for generators that do not implement terminate().
- generator
the generator expression
- requiredChildOutput
required attributes from child's output
- outer
when true, each input row will be output at least once, even if the output of the given generator is empty.
- generatorOutput
the qualified output attributes of the generator of this node, which are constructed in the analysis phase and cannot be changed, as the parent node is already bound to them.
- trait GeneratePredicateHelper extends PredicateHelper
-
case class
GlobalLimitExec(limit: Int = -1, child: SparkPlan, offset: Int = 0) extends SparkPlan with BaseLimitExec with Product with Serializable
Take the first limit elements and then drop the first offset elements in the child's single output partition.
-
class
GroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow])]
Iterates over a presorted set of rows, chunking it up by the grouping expression. Each call to next will return a pair containing the current group and an iterator that will return all the elements of that group. Iterators for each group are lazily constructed by extracting rows from the input iterator. As such, full groups are never materialized by this class.
Example input:
Input: [a, 1], [b, 2], [b, 3]
Grouping: x#1
InputSchema: x#1, y#2
Result:
First call to next(): ([a], Iterator([a, 1]))
Second call to next(): ([b], Iterator([b, 2], [b, 3]))
Note, the class does not handle the case of an empty input for simplicity of implementation. Use the factory to construct a new instance.
-
case class
InSubqueryExec(child: Expression, plan: BaseSubqueryExec, exprId: ExprId, shouldBroadcast: Boolean = false, resultBroadcast: Broadcast[Array[Any]] = null, result: Array[Any] = null) extends ExecSubqueryExpression with UnaryLike[Expression] with Predicate with Product with Serializable
The physical node of in-subquery. When this is used for Dynamic Partition Pruning, as the pruning happens at the driver side, we don't broadcast subquery result.
-
case class
InputAdapter(child: SparkPlan) extends SparkPlan with UnaryExecNode with InputRDDCodegen with Product with Serializable
InputAdapter is used to hide a SparkPlan from a subtree that supports codegen.
This is the leaf node of a tree with WholeStageCodegen that is used to generate code that consumes an RDD iterator of InternalRow.
-
trait
InputRDDCodegen extends SparkPlan with CodegenSupport
Leaf codegen node reading from a single RDD.
- trait LeafExecNode extends SparkPlan with LeafLike[SparkPlan]
-
trait
LimitExec extends SparkPlan with UnaryExecNode
An operator that takes a limited number of elements from its child operator.
-
case class
LocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with BaseLimitExec with Product with Serializable
Take the first limit elements of each child partition, but do not collect or shuffle them.
-
case class
LocalTableScanExec(output: Seq[Attribute], rows: Seq[InternalRow]) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from a local collection.
Seq may not be serializable and ideally we should not send rows and unsafeRows to the executors; thus they are marked as transient.
-
case class
LogicalRDD(output: Seq[Attribute], rdd: RDD[InternalRow], outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil, isStreaming: Boolean = false)(session: SparkSession, originStats: Option[Statistics] = None, originConstraints: Option[ExpressionSet] = None) extends LogicalPlan with LeafNode with MultiInstanceRelation with Product with Serializable
Logical plan node for scanning data from an RDD of InternalRow.
It is advised to set the fields originStats and originConstraints if the RDD is built directly from a DataFrame, so that Spark can make better optimizations.
-
case class
MapElementsExec(func: AnyRef, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with CodegenSupport with Product with Serializable
Applies the given function to each input object. The output of its child must be a single-field row containing the input object.
This operator is a safe version of ProjectExec: since its output is a custom object, we need to use a safe row to contain it.
-
case class
MapGroupsExec(func: (Any, Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], dataOrder: Seq[SortOrder], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable
Groups the input rows together and calls the function with each group and an iterator containing all elements in the group. The iterator is sorted according to dataOrder if given. The result of this function is flattened before being output.
-
case class
MapPartitionsExec(func: (Iterator[Any]) ⇒ Iterator[Any], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with Product with Serializable
Applies the given function to input object iterator. The output of its child must be a single-field row containing the input object.
-
case class
MapPartitionsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable
Similar to MapPartitionsExec and org.apache.spark.sql.execution.r.MapPartitionsRWrapper, but serializes and deserializes input/output in Arrow format.
This is somewhat similar to org.apache.spark.sql.execution.python.ArrowEvalPythonExec.
-
trait
ObjectConsumerExec extends SparkPlan with UnaryExecNode with ReferenceAllColumns[SparkPlan]
Physical version of ObjectConsumer.
-
trait
ObjectProducerExec extends SparkPlan
Physical version of ObjectProducer.
-
case class
OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[LogicalPlan] with Product with Serializable
This rule optimizes the execution of queries that can be answered by looking only at partition-level metadata. This applies when all the columns scanned are partition columns, and the query has an aggregate operator that satisfies one of the following conditions:
1. the aggregate expression is a partition column, e.g. SELECT col FROM tbl GROUP BY col.
2. an aggregate function on partition columns with DISTINCT, e.g. SELECT col1, count(DISTINCT col2) FROM tbl GROUP BY col1.
3. an aggregate function on partition columns which has the same result with or without the DISTINCT keyword, e.g. SELECT col1, Max(col2) FROM tbl GROUP BY col1.
- trait OrderPreservingUnaryExecNode extends SparkPlan with UnaryExecNode with AliasAwareQueryOutputOrdering[SparkPlan]
- case class PartialMapperPartitionSpec(mapIndex: Int, startReducerIndex: Int, endReducerIndex: Int) extends ShufflePartitionSpec with Product with Serializable
- case class PartialReducerPartitionSpec(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int, dataSize: Long) extends ShufflePartitionSpec with Product with Serializable
-
trait
PartitioningPreservingUnaryExecNode extends SparkPlan with UnaryExecNode with AliasAwareOutputExpression
A trait that handles aliases in the outputExpressions to produce an outputPartitioning that satisfies distribution requirements.
- case class PlanLater(plan: LogicalPlan) extends SparkPlan with LeafExecNode with Product with Serializable
-
case class
PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] with Product with Serializable
Plans subqueries that are present in the given SparkPlan.
-
case class
ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with PartitioningPreservingUnaryExecNode with OrderPreservingUnaryExecNode with Product with Serializable
Physical plan for Project.
-
class
QueryExecution extends Logging
The primary workflow for executing relational queries using Spark. Designed to allow easy access to the intermediate phases of query execution for developers.
While this is not a public class, we should avoid changing the function names for the sake of changing them, because a lot of developers use the feature for debugging.
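For example, the intermediate phases can be inspected from any Dataset (a sketch assuming a SparkSession named spark):
val qe = spark.range(100).where("id > 5").queryExecution
qe.logical        // parsed logical plan
qe.analyzed       // analyzed logical plan
qe.optimizedPlan  // logical plan after the optimizer
qe.sparkPlan      // physical plan chosen by the planner
qe.executedPlan   // physical plan after preparation rules (e.g. WholeStageCodegen insertion)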
-
case class
RDDScanExec(output: Seq[Attribute], rdd: RDD[InternalRow], name: String, outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from an RDD of InternalRow.
-
case class
RangeExec(range: Range) extends SparkPlan with LeafExecNode with CodegenSupport with Product with Serializable
Physical plan for range (generating a range of 64 bit numbers).
- final class RecordBinaryComparator extends RecordComparator
-
case class
ReusedSubqueryExec(child: BaseSubqueryExec) extends BaseSubqueryExec with LeafExecNode with Product with Serializable
A wrapper for reused BaseSubqueryExec.
-
case class
RowDataSourceScanExec(output: Seq[Attribute], requiredSchema: StructType, filters: Set[Filter], handledFilters: Set[Filter], pushedDownOperators: PushedDownOperators, rdd: RDD[InternalRow], relation: BaseRelation, tableIdentifier: Option[TableIdentifier]) extends SparkPlan with DataSourceScanExec with InputRDDCodegen with Product with Serializable
Physical plan node for scanning data from a relation.
-
case class
RowToColumnarExec(child: SparkPlan) extends SparkPlan with RowToColumnarTransition with Product with Serializable
Provides a common executor to translate an RDD of InternalRow into an RDD of ColumnarBatch. This is inserted whenever such a transition is determined to be needed.
This is similar to some of the code in ArrowConverters.scala and org.apache.spark.sql.execution.arrow.ArrowWriter. That code is more specialized to convert InternalRow to Arrow formatted data, but in the future if we make OffHeapColumnVector internally Arrow formatted we may be able to replace much of that code.
This is also similar to org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate() and org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.toBatch(). toBatch is only ever called from tests and can probably be removed, but populate is used by both Orc and Parquet to initialize partition and missing columns. There is some chance that we could replace populate with RowToColumnConverter, but the performance requirements are different and it would only be to reduce code.
-
trait
RowToColumnarTransition extends SparkPlan with UnaryExecNode
A trait that is used as a tag to indicate a transition from rows to columns. This allows plugins to replace the current RowToColumnarExec with an optimized version and still have operations that walk a spark plan looking for this type of transition properly match it.
-
class
SQLExecutionRDD extends RDD[InternalRow]
It is just a wrapper over sqlRDD, which sets and makes effective all the configs from the captured SQLConf. Note that this means we may miss configurations set after the creation of this RDD and before its execution.
-
case class
SampleExec(lowerBound: Double, upperBound: Double, withReplacement: Boolean, seed: Long, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
Physical plan for sampling the dataset.
- lowerBound
Lower-bound of the sampling probability (usually 0.0)
- upperBound
Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.
- withReplacement
Whether to sample with replacement.
- seed
the random seed
- child
the SparkPlan
-
case class
ScalarSubquery(plan: BaseSubqueryExec, exprId: ExprId) extends ExecSubqueryExpression with LeafLike[Expression] with SupportQueryContext with Product with Serializable
A subquery that will return only one row and one column.
This is the physical copy of ScalarSubquery to be used inside SparkPlan.
-
case class
ScriptTransformationIOSchema(inputRowFormat: Seq[(String, String)], outputRowFormat: Seq[(String, String)], inputSerdeClass: Option[String], outputSerdeClass: Option[String], inputSerdeProps: Seq[(String, String)], outputSerdeProps: Seq[(String, String)], recordReaderClass: Option[String], recordWriterClass: Option[String], schemaLess: Boolean) extends Serializable with Product
The wrapper class of input and output schema properties
-
case class
SerializeFromObjectExec(serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with CodegenSupport with Product with Serializable
Takes the input object from child and turns it into an unsafe row using the given serializer expression. The output of its child must be a single-field row containing the input object.
- sealed trait ShufflePartitionSpec extends AnyRef
-
class
ShuffledRowRDD extends RDD[InternalRow]
This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (dependency) and an array of ShufflePartitionSpec as input arguments.
The dependency has the parent RDD of this RDD, which represents the dataset before shuffle (i.e. map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).
-
case class
SortExec(sortOrder: Seq[SortOrder], global: Boolean, child: SparkPlan, testSpillFrequency: Int = 0) extends SparkPlan with UnaryExecNode with BlockingOperatorWithCodegen with Product with Serializable
Performs (external) sorting.
- global
when true performs a global sort of all partitions by shuffling the data first if necessary.
- testSpillFrequency
Method for configuring periodic spilling in unit tests. If set, will spill every frequency records.
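For instance (a sketch), both a global and a per-partition sort plan a SortExec; they differ in the global flag and in whether an exchange is required first:
val df = spark.range(0, 1000).selectExpr("id % 7 AS k")
df.sort("k").explain()                  // Sort [...], true  -- global sort, preceded by a range-partitioning Exchange
df.sortWithinPartitions("k").explain()  // Sort [...], false -- per-partition sort, no Exchange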
- class SparkOptimizer extends Optimizer
-
abstract
class
SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
The base class for physical operators.
The naming convention is that physical operators end with "Exec" suffix, e.g. ProjectExec.
-
class
SparkPlanInfo extends AnyRef
:: DeveloperApi :: Stores information about a SQL SparkPlan.
- Annotations
- @DeveloperApi()
- class SparkPlanner extends SparkStrategies with SQLConfHelper
-
case class
SparkScriptTransformationExec(script: String, output: Seq[Attribute], child: SparkPlan, ioschema: ScriptTransformationIOSchema) extends SparkPlan with BaseScriptTransformationExec with Product with Serializable
Transforms the input by forking and running the specified script.
- script
the command that should be executed.
- output
the attributes that are produced by the script.
- child
logical plan whose output is transformed.
- ioschema
the class set that defines how to handle input/output data.
- case class SparkScriptTransformationWriterThread(iter: Iterator[InternalRow], inputSchema: Seq[DataType], ioSchema: ScriptTransformationIOSchema, outputStream: OutputStream, proc: Process, stderrBuffer: CircularBuffer, taskContext: TaskContext, conf: Configuration) extends BaseScriptTransformationWriterThread with Product with Serializable
-
class
SparkSqlAstBuilder extends AstBuilder
Builder that converts an ANTLR ParseTree into a LogicalPlan/Expression/TableIdentifier.
-
class
SparkSqlParser extends AbstractSqlParser
Concrete parser for Spark SQL statements.
- abstract class SparkStrategies extends QueryPlanner[SparkPlan]
-
abstract
class
SparkStrategy extends GenericStrategy[SparkPlan]
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
-
case class
SubqueryAdaptiveBroadcastExec(name: String, index: Int, onlyInBroadcast: Boolean, buildPlan: LogicalPlan, buildKeys: Seq[Expression], child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Similar to SubqueryBroadcastExec, this node is used to store the initial physical plan of DPP subquery filters when enabling both AQE and DPP. It is an intermediate physical plan and is not executable. After the build side is executed, this node will be replaced with the SubqueryBroadcastExec and the child will be optimized with the ReusedExchange from the build side.
-
case class
SubqueryBroadcastExec(name: String, index: Int, buildKeys: Seq[Expression], child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Physical plan for a custom subquery that collects and transforms the broadcast key values. This subquery retrieves the partition key from the broadcast results based on the type of HashedRelation returned. If the key is packed inside a Long, we extract it through bitwise operations, otherwise we return it from the appropriate index of the UnsafeRow.
- index
the index of the join key in the list of keys from the build side
- buildKeys
the join keys from the build side of the join used
- child
the BroadcastExchange or the AdaptiveSparkPlan with BroadcastQueryStageExec from the build side of the join
-
case class
SubqueryExec(name: String, child: SparkPlan, maxNumRows: Option[Int] = None) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable
Physical plan for a subquery.
-
case class
TakeOrderedAndProjectExec(limit: Int, sortOrder: Seq[SortOrder], projectList: Seq[NamedExpression], child: SparkPlan, offset: Int = 0) extends SparkPlan with OrderPreservingUnaryExecNode with Product with Serializable
Take the first limit elements as defined by the sortOrder, then drop the first offset elements, and do projection if needed. This is logically equivalent to having a Limit and/or Offset operator after a SortExec operator, or having a ProjectExec operator between them. This could have been named TopK, but Spark's top operator does the opposite in ordering, so we name it TakeOrdered to avoid confusion.
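For example (a sketch), an orderBy followed by a small limit is typically planned as this node rather than a full sort:
import org.apache.spark.sql.functions.desc
val topTen = spark.range(0, 1000000).selectExpr("id % 97 AS score").orderBy(desc("score")).limit(10)
topTen.explain()  // the physical plan shows TakeOrderedAndProject(limit=10, orderBy=[score ... DESC], ...)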
- trait UnaryExecNode extends SparkPlan with UnaryLike[SparkPlan]
-
case class
UnionExec(children: Seq[SparkPlan]) extends SparkPlan with Product with Serializable
Physical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.
If we change how this is implemented physically, we'd need to update org.apache.spark.sql.catalyst.plans.logical.Union.maxRowsPerPartition.
- final class UnsafeExternalRowSorter extends AnyRef
-
final
class
UnsafeFixedWidthAggregationMap extends AnyRef
Unsafe-based HashMap for performing aggregations where the aggregated values are fixed-width.
This map supports a maximum of 2 billion keys.
-
final
class
UnsafeKVExternalSorter extends AnyRef
A class for performing external sorting on key-value records. Both key and value are UnsafeRows.
Note that this class allows optionally passing in a BytesToBytesMap directly in order to perform in-place sorting of records in the map.
-
class
UnsafeRowSerializer extends Serializer with Serializable
Serializer for serializing UnsafeRows during shuffle. Since UnsafeRows are already stored as bytes, this serializer simply copies those bytes to the underlying output stream. When deserializing a stream of rows, instances of this serializer mutate and return a single UnsafeRow instance that is backed by an on-heap byte array.
Note that this serializer implements only the Serializer methods that are used during shuffle, so certain SerializerInstance methods will throw UnsupportedOperationException.
-
case class
WholeStageCodegenExec(child: SparkPlan)(codegenStageId: Int) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable
WholeStageCodegen compiles a subtree of plans that support codegen together into a single Java function.
Here is the call graph for generating the Java source (plan A supports codegen, but plan B does not):
WholeStageCodegen      Plan A               FakeInput        Plan B

-> execute()
    |
 doExecute() --------->  inputRDDs() -------> inputRDDs() ------> execute()
    |
    +----------------->  produce()
                            |
                         doProduce() -------> produce()
                                                 |
                                              doProduce()
                                                 |
                        doConsume() <--------- consume()
                            |
 doConsume() <---------  consume()
SparkPlan A should override doProduce() and doConsume().
doCodeGen() will create a CodeGenContext, which will hold a list of variables for input, used to generate code for BoundReference.
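The generated Java source for each codegen stage can be inspected from the public explain API, e.g. (a sketch):
val df = spark.range(0, 100).where("id % 2 = 0").selectExpr("id * 3 AS x")
df.explain("codegen")  // prints each WholeStageCodegen subtree together with its generated Java source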
Value Members
- object AggregatingAccumulator extends Serializable
- object BaseLimitExec extends Serializable
- object CoalesceExec extends Serializable
- object CoalescedPartitionSpec extends Serializable
-
object
CodegenMode extends ExplainMode with Product with Serializable
Codegen mode means that when printing explain for a DataFrame, if generated code is available, a physical plan and the generated code are expected to be printed to the console.
- object CollectMetricsExec extends AdaptiveSparkPlanHelper with Serializable
-
object
CommandExecutionMode extends Enumeration
SPARK-35378: Commands should be executed eagerly so that something like sql("INSERT ...") can trigger the table insertion immediately without a .collect(). To avoid endless recursion we should use NON_ROOT when recursively executing commands. Note that we can't execute a query plan with leaf command nodes, because many commands return GenericInternalRow and can't be put in a query plan directly; otherwise the query engine may cast GenericInternalRow to UnsafeRow and fail. When running EXPLAIN, or commands inside other commands, we should use SKIP to not eagerly trigger the command execution.
-
object
CostMode extends ExplainMode with Product with Serializable
Cost mode means that when printing explain for a DataFrame, if plan node statistics are available, a logical plan and the statistics are expected to be printed to the console.
- object ExecSubqueryExpression
- object ExplainMode
- object ExplainUtils extends AdaptiveSparkPlanHelper
-
object
ExtendedMode extends ExplainMode with Product with Serializable
Extended mode means that when printing explain for a DataFrame, both logical and physical plans are expected to be printed to the console.
- object ExternalRDD extends Serializable
-
object
FormattedMode extends ExplainMode with Product with Serializable
Formatted mode means that when printing explain for a DataFrame, explain output is expected to be split into two sections: a physical plan outline and node details.
- object GroupedIterator
-
object
HiveResult
Runs a query returning the result in Hive compatible form.
- object LogicalRDD extends Logging with Serializable
- object MapGroupsExec extends Serializable
-
object
ObjectOperator
Helper functions for physical operators that work with user defined objects.
- object PartitionedFileUtil
- object QueryExecution
-
object
RemoveRedundantProjects extends Rule[SparkPlan]
Remove redundant ProjectExec nodes from the spark plan. A ProjectExec node is redundant when:
- It has the same output attributes and ordering as its child's output, and the ordering of the attributes is required.
- It has the same output attributes as its child's output, when attribute output ordering is not required.
This rule needs to be a physical rule because project nodes are useful during logical optimization to prune data. During physical planning, redundant project nodes can be removed to simplify the query plan.
-
object
RemoveRedundantSorts extends Rule[SparkPlan]
Remove redundant SortExec node from the spark plan. A sort node is redundant when its child satisfies both its sort orders and its required child distribution. Note this rule differs from the Optimizer rule EliminateSorts in that this rule also checks if the child satisfies the required distribution so that it is safe to remove not only a local sort but also a global sort when its child already satisfies required sort orders.
-
object
ReplaceHashWithSortAgg extends Rule[SparkPlan]
Replace hash-based aggregate with sort aggregate in the spark plan if:
1. The plan is a pair of partial and final HashAggregateExec or ObjectHashAggregateExec, and the child of the partial aggregate satisfies the sort order of the corresponding SortAggregateExec; or
2. The plan is a HashAggregateExec or ObjectHashAggregateExec, and the child satisfies the sort order of the corresponding SortAggregateExec.
Examples: 1. aggregate after join:
HashAggregate(t1.i, SUM, final)
 |                                        SortAggregate(t1.i, SUM, complete)
HashAggregate(t1.i, SUM, partial)   =>     |
 |                                        SortMergeJoin(t1.i = t2.j)
SortMergeJoin(t1.i = t2.j)
2. aggregate after sort:
HashAggregate(t1.i, SUM, partial)        SortAggregate(t1.i, SUM, partial)
 |                                  =>    |
Sort(t1.i)                               Sort(t1.i)
Hash-based aggregate can be replaced when its child satisfies the sort order of the corresponding sort aggregate. Sort aggregate is faster in the sense that it does not have the hashing overhead of hash aggregate.
- object SQLExecution
- object ScriptTransformationIOSchema extends Serializable
-
object
SimpleMode extends ExplainMode with Product with Serializable
Simple mode means that when printing explain for a DataFrame, only a physical plan is expected to be printed to the console.
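Taken together, these ExplainMode objects correspond to the mode strings accepted by Dataset.explain (a sketch; the DataFrame is illustrative):
val df = spark.range(0, 100).where("id % 2 = 0")
df.explain("simple")     // SimpleMode: physical plan only
df.explain("extended")   // ExtendedMode: logical and physical plans
df.explain("codegen")    // CodegenMode: physical plan plus generated code, when available
df.explain("cost")       // CostMode: logical plan with statistics, when available
df.explain("formatted")  // FormattedMode: physical plan outline plus node details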
- object SortPrefixUtils
- object SparkPlan extends Serializable
- object SubqueryBroadcastExec extends Serializable
- object SubqueryExec extends Serializable
- object UnaryExecNode extends Serializable
- object WholeStageCodegenExec extends Serializable