Applies the given function to each input row, appending the encoded result at the end of the row.
An optimized version of AppendColumnsExec that can be executed directly on deserialized objects.
Helper trait which defines methods that are shared by both LocalLimitExec and GlobalLimitExec.
Provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed. Data is cached using byte buffers stored in an InMemoryRelation. This relation is automatically substituted into query plans that return the sameResult as the originally cached query.
Internal to Spark SQL.
Holds a cached logical plan and its data
Co-groups the data from left and right children, and calls the function with each group and 2 iterators containing all elements in the group from left and right side. The result of this function is flattened before being output.
Iterates over GroupedIterators and returns the cogrouped data, i.e. each record is a grouping key with its associated values from all GroupedIterators. Note: we assume the output of each GroupedIterator is ordered by the grouping key.
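The cogrouping described above amounts to a merge over inputs that are already ordered by grouping key. The following is a minimal sketch in plain Scala, not Spark's implementation: the name coGroup and the use of in-memory Seq values instead of GroupedIterators are assumptions made for illustration.

```scala
object CoGroupSketch {
  // Merges two key-sorted sequences of (key, values) groups into
  // (key, leftValues, rightValues) triples. A key that is missing on one
  // side yields an empty sequence for that side.
  // Hypothetical helper; it assumes both inputs are sorted by key.
  def coGroup[K, A, B](
      left: Seq[(K, Seq[A])],
      right: Seq[(K, Seq[B])])(implicit ord: Ordering[K]): Seq[(K, Seq[A], Seq[B])] = {
    val l = left.iterator.buffered
    val r = right.iterator.buffered
    val out = Seq.newBuilder[(K, Seq[A], Seq[B])]
    while (l.hasNext || r.hasNext) {
      if (!r.hasNext) {
        val (k, ls) = l.next()
        out += ((k, ls, Seq.empty[B]))
      } else if (!l.hasNext) {
        val (k, rs) = r.next()
        out += ((k, Seq.empty[A], rs))
      } else {
        val cmp = ord.compare(l.head._1, r.head._1)
        if (cmp < 0) {
          // Smaller key is only on the left side.
          val (k, ls) = l.next()
          out += ((k, ls, Seq.empty[B]))
        } else if (cmp > 0) {
          // Smaller key is only on the right side.
          val (k, rs) = r.next()
          out += ((k, Seq.empty[A], rs))
        } else {
          // Same key on both sides: emit one combined record.
          val (k, ls) = l.next()
          val (_, rs) = r.next()
          out += ((k, ls, rs))
        }
      }
    }
    out.result()
  }
}
```

Because both sides are consumed in key order, each input is read exactly once, which is why the ordering assumption in the note above matters.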
Physical plan for returning a new RDD that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, see ShuffleExchange. Using it will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
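The 1000-to-100 example can be made concrete with a small sketch of how new partitions might claim contiguous runs of current partitions. This is an illustration only: groupPartitions is a hypothetical helper, and the even, contiguous split is an assumption that ignores any locality-aware grouping.

```scala
object CoalesceSketch {
  // For each new partition, returns the range of current partition indices
  // it claims. Requesting more partitions than currently exist keeps the
  // current count, matching the behaviour described above.
  // Hypothetical helper; assumes an even, contiguous split.
  def groupPartitions(current: Int, requested: Int): Seq[Range] = {
    require(current >= 0 && requested > 0, "partition counts must be valid")
    val target = math.min(current, requested)
    (0 until target).map { i =>
      val start = (i.toLong * current / target).toInt
      val end = ((i + 1).toLong * current / target).toInt
      start until end
    }
  }
}
```

With this sketch, groupPartitions(1000, 100) yields 100 ranges of 10 indices each, and groupPartitions(4, 10) yields 4 single-index ranges.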
A Partitioner that might group together one or more partitions from the parent.
An interface for those physical operators that support codegen.
Find the chained plans that support codegen, collapse them together as WholeStageCodegen.
Take the first limit elements and collect them to a single partition.
This operator will be used when a logical Limit operation is the final operator in a logical plan, which happens when the user is collecting results back to the driver.
Takes the input row from child and turns it into an object using the given deserializer expression. The output of this operator is a single-field safe row containing the deserialized object.
The base class for subqueries that are used in a SparkPlan.
Apply all of the GroupExpressions to every input row, hence we will get multiple output rows for an input row.
The group of expressions; all of the group expressions should output the same schema, as specified by the parameter output.
The output schema.
Child operator.
Logical plan node for scanning data from an RDD.
Physical plan node for scanning data from an RDD.
An interface for relations that are backed by files. When a class implements this interface, the list of paths that it returns will be returned to a user who calls inputPaths on any DataFrame that queries this relation.
Physical plan node for scanning data from HadoopFsRelations.
The file-based relation to scan.
Output attributes of the scan.
Output schema of the scan.
Predicates to use for partition pruning.
Data source filters to use for filtering data within partitions.
Identifier for the table in the metastore.
Physical plan for Filter.
Groups the input rows together and calls the R function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.
Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming with one important additional feature, which allows the input rows to be joined with their output.
the generator expression
when true, each output row is implicitly joined with the input tuple that produced it.
when true, each input row will be output at least once, even if the output of the given generator is empty. outer has no effect when join is false.
the qualified output attributes of the generator of this node, which are constructed in the analysis phase and cannot be changed, as the parent node is already bound to them.
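The interaction of the join and outer flags described above can be sketched as a flatMap over plain Scala collections. This is not Spark's operator: rows are modelled as Seq[Any], the generator is an ordinary function, and the generatorWidth parameter and the use of null for missing generator columns are assumptions made for illustration.

```scala
object GenerateSketch {
  // `generator` stands in for a Generator expression: it maps one input row
  // to zero or more output rows.
  // Hypothetical helper; generatorWidth is the number of generator columns.
  def generate(
      input: Seq[Seq[Any]],
      generator: Seq[Any] => Seq[Seq[Any]],
      generatorWidth: Int,
      join: Boolean,
      outer: Boolean): Seq[Seq[Any]] =
    input.flatMap { row =>
      val generated = generator(row)
      if (!join) {
        // Without join, only the generator output is emitted; outer is ignored.
        generated
      } else if (generated.isEmpty && outer) {
        // With outer, keep the input row once, padded with nulls.
        Seq(row ++ Seq.fill(generatorWidth)(null))
      } else {
        // With join, prefix each generated row with the input row.
        generated.map(row ++ _)
      }
    }
}
```

For an input row ("b", empty list) and an explode-like generator, join = true with outer = true keeps the row as ("b", empty list, null), while outer = false drops it.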
Take the first limit elements of the child's single output partition.
Iterates over a presorted set of rows, chunking it up by the grouping expression. Each call to next will return a pair containing the current group and an iterator that will return all the elements of that group. Iterators for each group are lazily constructed by extracting rows from the input iterator. As such, full groups are never materialized by this class.
Example input:
Input: [a, 1], [b, 2], [b, 3]
Grouping: x#1
InputSchema: x#1, y#2
Result:
First call to next(): ([a], Iterator([a, 1]))
Second call to next(): ([b], Iterator([b, 2], [b, 3]))
Note, the class does not handle the case of an empty input for simplicity of implementation. Use the factory to construct a new instance.
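The lazy chunking described above can be sketched with a buffered iterator in plain Scala. This is a simplified illustration rather than the class itself: the grouped function and its key parameter are hypothetical, keys are compared with ==, and each group is assumed to be read before the outer iterator is advanced (unread rows of a group are skipped).

```scala
object GroupedSketch {
  // Chunks a presorted iterator into (key, group) pairs. Each group iterator
  // pulls rows lazily from the shared input, so no group is materialized.
  // Hypothetical helper; `key` plays the role of the grouping expression.
  def grouped[T, K](input: Iterator[T])(key: T => K): Iterator[(K, Iterator[T])] =
    new Iterator[(K, Iterator[T])] {
      private val in = input.buffered
      private var currentKey: Option[K] = None

      // Drops rows of the previous group that the caller did not read.
      private def skipCurrent(): Unit =
        currentKey.foreach(k => while (in.hasNext && key(in.head) == k) in.next())

      def hasNext: Boolean = { skipCurrent(); in.hasNext }

      def next(): (K, Iterator[T]) = {
        skipCurrent()
        val k = key(in.head) // throws NoSuchElementException on exhausted input
        currentKey = Some(k)
        val group = new Iterator[T] {
          def hasNext: Boolean = in.hasNext && key(in.head) == k
          def next(): T = in.next()
        }
        (k, group)
      }
    }
}
```

Applied to the example input with the first field as the key, the first call to next() gives the key a with a one-row group, and the second gives the key b with a two-row group.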
A subquery that checks whether the value of child is in the result of a query.
InputAdapter is used to hide a SparkPlan from a subtree that supports codegen.
This is the leaf node of a tree with WholeStageCodegen that is used to generate code that consumes an RDD iterator of InternalRow.
Take the first limit elements of each child partition, but do not collect or shuffle them.
Physical plan node for scanning data from a local collection.
Logical plan node for scanning data from an RDD of InternalRow.
Applies the given function to each input object. The output of its child must be a single-field row containing the input object.
This operator is a kind of safe version of ProjectExec: because its output is a custom object, we need to use a safe row to contain it.
Groups the input rows together and calls the function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.
Applies the given function to the input object iterator. The output of its child must be a single-field row containing the input object.
Physical version of ObjectConsumer.
Physical version of ObjectProducer.
This rule optimizes the execution of queries that can be answered by looking only at partition-level metadata. This applies when all the columns scanned are partition columns, and the query has an aggregate operator that satisfies the following conditions:
1. The aggregate expression is a partition column, e.g. SELECT col FROM tbl GROUP BY col.
2. An aggregate function on partition columns with DISTINCT, e.g. SELECT col1, count(DISTINCT col2) FROM tbl GROUP BY col1.
3. An aggregate function on partition columns that has the same result with or without the DISTINCT keyword, e.g. SELECT col1, Max(col2) FROM tbl GROUP BY col1.
A plan node that does nothing but lie about the output of its child. Used to splice a (hopefully structurally equivalent) tree from a different optimization sequence into an already resolved tree.
Plans scalar subqueries that are present in the given SparkPlan.
Physical plan for Project.
The primary workflow for executing relational queries using Spark. Designed to allow easy access to the intermediate phases of query execution for developers.
While this is not a public class, we should avoid changing the function names for the sake of changing them, because a lot of developers use the feature for debugging.
Physical plan node for scanning data from an RDD of InternalRow.
Physical plan for range (generating a range of 64 bit numbers).
Finds duplicated exchanges in the Spark plan, then uses the same exchange for all the references.
Physical plan node for scanning data from a relation.
An internal iterator interface which presents a more restrictive API than scala.collection.Iterator.
One major departure from the Scala iterator API is the fusing of the hasNext() and next() calls: Scala's iterator allows users to call hasNext() without immediately advancing the iterator to consume the next row, whereas RowIterator combines these calls into a single advanceNext() method.
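The fused call can be illustrated with a small stand-alone sketch. SimpleRowIterator, fromSeq and toScala are hypothetical names introduced here, and the generic row type and the getRow accessor are assumptions made so that the example runs without Spark.

```scala
object RowIteratorSketch {
  // A restricted iterator: advanceNext() both checks for and moves to the
  // next row; getRow returns the row at the current position.
  // Hypothetical stand-in, not the Spark class.
  abstract class SimpleRowIterator[T] {
    def advanceNext(): Boolean
    def getRow: T
  }

  // Builds a restricted iterator over an in-memory sequence.
  def fromSeq[T](rows: Seq[T]): SimpleRowIterator[T] = new SimpleRowIterator[T] {
    private var idx = -1
    def advanceNext(): Boolean = {
      if (idx < rows.length) idx += 1
      idx < rows.length
    }
    def getRow: T = rows(idx)
  }

  // Adapts the fused API back to a Scala Iterator by caching the result of
  // advanceNext(), so that repeated hasNext calls do not skip rows.
  def toScala[T](it: SimpleRowIterator[T]): Iterator[T] = new Iterator[T] {
    private var state = 0 // 0 = unknown, 1 = row ready, 2 = exhausted
    def hasNext: Boolean = {
      if (state == 0) state = if (it.advanceNext()) 1 else 2
      state == 1
    }
    def next(): T = {
      if (!hasNext) throw new NoSuchElementException("end of rows")
      state = 0
      it.getRow
    }
  }
}
```

The adapter shows the cost of the split API: supporting a hasNext() that does not advance requires extra state, which the fused advanceNext() avoids.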
Physical plan for sampling the dataset.
Lower-bound of the sampling probability (usually 0.0)
Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.
Whether to sample with replacement.
the random seed
the SparkPlan
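The relationship between the bounds and the expected fraction can be sketched for the case without replacement. This is an illustration under assumptions, not the operator's code: sample is a hypothetical helper that keeps a row when a uniform draw falls in [lb, ub), and sampling with replacement is not covered.

```scala
import scala.util.Random

object SampleSketch {
  // Keeps a row when a uniform draw in [0, 1) falls inside [lb, ub), so the
  // expected fraction kept is ub - lb.
  // Hypothetical helper; covers sampling without replacement only.
  def sample[T](rows: Seq[T], lb: Double, ub: Double, seed: Long): Seq[T] = {
    require(0.0 <= lb && lb <= ub && ub <= 1.0, "bounds must satisfy 0 <= lb <= ub <= 1")
    val rng = new Random(seed)
    rows.filter { _ =>
      val x = rng.nextDouble()
      x >= lb && x < ub
    }
  }
}
```

Under these assumptions, two calls with the same seed and adjacent bounds, such as [0.0, 0.3) and [0.3, 1.0), select disjoint sets of rows that together cover the input.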
A subquery that will return only one row and one column.
This is the physical copy of ScalarSubquery to be used inside SparkPlan.
Takes the input object from child and turns it into an unsafe row using the given serializer expression. The output of its child must be a single-field row containing the input object.
This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.
This RDD takes a ShuffleDependency (dependency), and an optional array of partition start indices as input arguments (specifiedPartitionStartIndices).
The dependency has the parent RDD of this RDD, which represents the dataset before shuffle (i.e. map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).
When specifiedPartitionStartIndices is defined, specifiedPartitionStartIndices.length will be the number of post-shuffle partitions. For this case, the ith post-shuffle partition includes specifiedPartitionStartIndices[i] to specifiedPartitionStartIndices[i+1] - 1 (inclusive).
When specifiedPartitionStartIndices is not defined, there will be dependency.partitioner.numPartitions post-shuffle partitions. For this case, a post-shuffle partition is created for every pre-shuffle partition.
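The mapping from start indices to pre-shuffle partition ranges can be written out as a short sketch. postShuffleRanges is a hypothetical helper, and it assumes that the last post-shuffle partition extends to the final pre-shuffle partition, which the description above does not state explicitly.

```scala
object ShuffledPartitionsSketch {
  // Returns, for each post-shuffle partition, the inclusive range
  // (first, last) of pre-shuffle partition ids it reads.
  // Hypothetical helper; assumes the last range ends at the final
  // pre-shuffle partition.
  def postShuffleRanges(
      numPreShufflePartitions: Int,
      specifiedPartitionStartIndices: Option[Array[Int]]): Seq[(Int, Int)] =
    specifiedPartitionStartIndices match {
      case Some(starts) =>
        starts.indices.map { i =>
          val endExclusive =
            if (i < starts.length - 1) starts(i + 1) else numPreShufflePartitions
          (starts(i), endExclusive - 1)
        }
      case None =>
        // One post-shuffle partition per pre-shuffle partition.
        (0 until numPreShufflePartitions).map(i => (i, i))
    }
}
```

For five pre-shuffle partitions and start indices 0, 2, 3, the sketch yields three post-shuffle partitions covering ids 0 to 1, 2 to 2, and 3 to 4.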
Performs (external) sorting.
when true performs a global sort of all partitions by shuffling the data first if necessary.
Method for configuring periodic spilling in unit tests. If set, will spill every frequency records.
The base class for physical operators.
The naming convention is that physical operators end with "Exec" suffix, e.g. ProjectExec.
:: DeveloperApi :: Stores information about a SQL SparkPlan.
Builder that converts an ANTLR ParseTree into a LogicalPlan/Expression/TableIdentifier.
Concrete parser for Spark SQL statements.
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
Physical plan for a subquery.
Take the first limit elements as defined by the sortOrder, and do projection if needed. This is logically equivalent to having a Limit operator after a SortExec operator, or having a ProjectExec operator between them. This could have been named TopK, but Spark's top operator does the opposite in ordering so we name it TakeOrdered to avoid confusion.
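The logical equivalence stated above (sort, then limit, then project) can be shown on an in-memory collection. This sketch only illustrates the semantics: takeOrderedAndProject is a hypothetical helper, and it sorts the whole input rather than using any bounded top-k strategy.

```scala
object TakeOrderedSketch {
  // Sorts by the given ordering, keeps the first `limit` rows, then applies
  // the projection.
  // Hypothetical helper; semantics only, not an efficient top-k.
  def takeOrderedAndProject[T, U](rows: Seq[T], limit: Int)(project: T => U)(
      implicit sortOrder: Ordering[T]): Seq[U] =
    rows.sorted(sortOrder).take(limit).map(project)
}
```

With the default ascending ordering the smallest rows are returned, which is the sense in which this differs from an operator that returns the largest elements.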
Physical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.
Serializer for serializing UnsafeRows during shuffle. Since UnsafeRows are already stored as bytes, this serializer simply copies those bytes to the underlying output stream. When deserializing a stream of rows, instances of this serializer mutate and return a single UnsafeRow instance that is backed by an on-heap byte array.
Note that this serializer implements only the Serializer methods that are used during shuffle, so certain SerializerInstance methods will throw UnsupportedOperationException.
WholeStageCodegen compiles a subtree of plans that support codegen together into a single Java function.
Here is the call graph for generating Java source (plan A supports codegen, but plan B does not):
  WholeStageCodegen       Plan A               FakeInput        Plan B

  -> execute()
      |
   doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
      |
      +----------------->   produce()
                              |
                           doProduce()  -------> produce()
                                                    |
                                                 doProduce()
                                                    |
                          doConsume() <--------- consume()
                              |
   doConsume()  <--------  consume()
SparkPlan A should override doProduce() and doConsume().
doCodeGen() will create a CodeGenContext, which will hold a list of variables for input, used to generate code for BoundReference.
Helper functions for physical operators that work with user defined objects.
Contains methods for debugging query execution.
Usage:
import org.apache.spark.sql.execution.debug._
sql("SELECT 1").debug()
sql("SELECT 1").debugCodegen()
Physical execution operators for join operations.
The physical execution component of Spark SQL. Note that this is a private package. All classes in catalyst are considered an internal API to Spark SQL and are subject to change between minor releases.