Packages

package org.apache.spark.sql.execution.python

Type Members

  1. case class AggregateInPandasExec(groupingExpressions: Seq[NamedExpression], udfExpressions: Seq[PythonUDF], resultExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Physical node for aggregation with group aggregate Pandas UDF.

    This plan works by sending the necessary (projected) input grouped data as Arrow record batches to the Python worker; the Python worker invokes the UDF and sends the results back to the executor; finally, the executor evaluates any post-aggregation expressions and joins the result with the grouped key.
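
    For reference, a minimal PySpark snippet that is planned as this node (illustrative; assumes an active SparkSession bound to spark):

      from pyspark.sql.functions import pandas_udf, PandasUDFType

      # Group aggregate Pandas UDF: reduces a pd.Series to one scalar per group.
      @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      def mean_udf(v):
          return v.mean()

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").agg(mean_udf(df["v"])).explain()  # plan shows AggregateInPandas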

  2. case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonExec with Product with Serializable

    A physical plan that evaluates a PythonUDF.
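
    For example, a scalar Pandas UDF is evaluated by this node, with data shipped to the worker as Arrow batches (illustrative; assumes an active SparkSession bound to spark):

      from pyspark.sql.functions import pandas_udf, col

      # Scalar Pandas UDF: receives and returns pd.Series batches.
      @pandas_udf("long")
      def plus_one(v):
          return v + 1

      spark.range(10).select(plus_one(col("id"))).explain()  # plan shows ArrowEvalPython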

  3. class ArrowPythonRunner extends BasePythonRunner[Iterator[InternalRow], ColumnarBatch] with PythonArrowOutput

    Similar to PythonUDFRunner, but exchanges data with the Python worker via an Arrow stream.

  4. case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonExec with Product with Serializable

    A physical plan that evaluates a PythonUDF.
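
    In contrast to ArrowEvalPythonExec, this node evaluates plain (pickled, row-at-a-time) Python UDFs. A minimal illustrative trigger (assumes an active SparkSession bound to spark):

      from pyspark.sql.functions import udf, col

      # A plain Python UDF: rows are pickled and sent to the worker in batches.
      @udf("long")
      def plus_one(x):
          return x + 1

      spark.range(10).select(plus_one(col("id"))).explain()  # plan shows BatchEvalPython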

  5. class CoGroupedArrowPythonRunner extends BasePythonRunner[(Iterator[InternalRow], Iterator[InternalRow]), ColumnarBatch] with PythonArrowOutput

    Python UDF Runner for cogrouped UDFs. It sends Arrow batches from two different DataFrames, groups them in Python, and receives the result back in the JVM as batches of a single DataFrame.

  6. trait EvalPythonExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDF, one partition of tuples at a time.

    Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the results from the Python process with the original rows.

    For each row we send to Python, we also put it in a queue first. For each output row from Python, we drain the queue to find the original input row. Note that if the Python process is far too slow, the queue could grow unbounded and spill to disk when it runs out of memory.

    Here is a diagram to show how this works:

               Downstream (for parent)
                /      \
               /     socket  (output of UDF)
              /         \
           RowQueue    Python
              \         /
               \     socket  (input of UDF)
                \     /
             upstream (from child)

    The rows sent to and received from Python are packed into batches (100 rows) and serialized, so there should always be some rows buffered in the socket or in the Python process; as a result, pulling from the RowQueue ALWAYS happens after pushing into it.
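
    A minimal, self-contained sketch of the queue pattern described above (plain Python; not Spark's actual implementation, and run_udf_batch stands in for the socket round-trip to the Python worker):

      from collections import deque
      from itertools import islice

      def eval_python(child_rows, run_udf_batch, batch_size=100):
          # Push each input row into a queue before sending its batch to
          # "Python", then drain the queue in order to join each UDF output
          # with its original input row.
          queue = deque()
          rows = iter(child_rows)
          while True:
              batch = list(islice(rows, batch_size))
              if not batch:
                  return
              queue.extend(batch)                     # push before the round-trip
              for result in run_udf_batch(batch):     # one result per input row
                  original = queue.popleft()          # pull always follows push
                  yield original + (result,)

      # Usage: rows are tuples; the stand-in "worker" squares the first column.
      out = eval_python([(1,), (2,), (3,)], lambda b: [r[0] ** 2 for r in b])
      print(list(out))  # [(1, 1), (2, 4), (3, 9)]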

  7. case class FlatMapCoGroupsInPandasExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. Since each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pandas.DataFrames, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: Both the Python worker and the Java executor need to have enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory). The memory on the Python side is used for holding the pandas.DataFrame. It's possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
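
    The user-facing entry point is DataFrame cogrouping in PySpark (illustrative; assumes an active SparkSession bound to spark):

      import pandas as pd

      df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
      df2 = spark.createDataFrame([(1, 10.0), (2, 20.0)], ("id", "v2"))

      def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
          # Each cogroup arrives as two pandas.DataFrames, one per side.
          return pd.merge(left, right, on="id")

      (df1.groupby("id")
          .cogroup(df2.groupby("id"))
          .applyInPandas(merge, schema="id long, v1 double, v2 double")
          .explain())  # plan shows FlatMapCoGroupsInPandas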

  8. case class FlatMapGroupsInPandasExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas

    Rows in each group are passed to the Python worker as an Arrow record batch. The Python worker turns the record batch into a pandas.DataFrame, invokes the user-defined function, and passes the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: Both the Python worker and the Java executor need to have enough memory to hold the largest group. The memory on the Java side is used to construct the record batch (off-heap memory). The memory on the Python side is used for holding the pandas.DataFrame. It's possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
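
    This node backs GroupedData.applyInPandas in PySpark (illustrative; assumes an active SparkSession bound to spark):

      import pandas as pd

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
          # The whole group arrives as a single pandas.DataFrame.
          return pdf.assign(v=pdf.v - pdf.v.mean())

      df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").explain()
      # plan shows FlatMapGroupsInPandas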

  9. case class MapInPandasExec(func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    A relation produced by applying a function that takes an iterator of pandas DataFrames and outputs an iterator of pandas DataFrames.

    This is somewhat similar to FlatMapGroupsInPandasExec and org.apache.spark.sql.catalyst.plans.logical.MapPartitionsInRWithArrow.
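
    This node backs DataFrame.mapInPandas in PySpark (illustrative; assumes an active SparkSession bound to spark):

      def filter_even(batches):
          # Receives an iterator of pandas.DataFrames, yields pandas.DataFrames.
          for pdf in batches:
              yield pdf[pdf.id % 2 == 0]

      spark.range(10).mapInPandas(filter_even, schema="id long").explain()
      # plan shows MapInPandas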

  10. class PythonForeachWriter extends ForeachWriter[UnsafeRow]
  11. class PythonUDFRunner extends BasePythonRunner[Array[Byte], Array[Byte]]

    A helper class to run Python UDFs in Spark.

  12. case class UserDefinedPythonFunction(name: String, func: PythonFunction, dataType: DataType, pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python function. This is used by the Python API.

  13. case class WindowInPandasExec(windowExpression: Seq[NamedExpression], partitionSpec: Seq[Expression], orderSpec: Seq[SortOrder], child: SparkPlan) extends SparkPlan with WindowExecBase with Product with Serializable

    This class calculates and outputs windowed aggregates over the rows in a single partition.

    This is similar to WindowExec. The main difference is that this node does not compute any window aggregation values itself. Instead, it computes the lower and upper bound for each window (i.e. the window bounds) and passes the data and indices to the Python worker, which performs the actual window aggregation.

    It currently materializes all data associated with the same partition key and passes it to the Python worker. This is not strictly necessary for sliding windows and could be improved (by slicing the data into overlapping chunks and stitching them together).

    This class groups window expressions by their window boundaries so that window expressions with the same window boundaries can share the same window bounds. The window bounds are prepended to the data passed to the Python worker.

    For example, if we have:

      avg(v) over specifiedwindowframe(RowFrame, -5, 5),
      avg(v) over specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing),
      avg(v) over specifiedwindowframe(RowFrame, -3, 3),
      max(v) over specifiedwindowframe(RowFrame, -3, 3)

    the Python input will look like:

      (lower_bound_w1, upper_bound_w1, lower_bound_w3, upper_bound_w3, v)

    where
      w1 is specifiedwindowframe(RowFrame, -5, 5),
      w2 is specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing), and
      w3 is specifiedwindowframe(RowFrame, -3, 3).

    Note that w2 doesn't have bound indices in the Python input: it is an unbounded window, so its bound indices would always be the same.

    Bounded and unbounded windows are evaluated differently in the Python worker: (1) a bounded window takes the window bound indices in addition to the input columns, while an unbounded window takes only the input columns; (2) a bounded window evaluates the UDF once per input row, while an unbounded window evaluates the UDF once per window partition. This is controlled by the Python runner conf "pandas_window_bound_types".

    The logic to compute window bounds is delegated to WindowFunctionFrame and shared with WindowExec.

    Note that this node doesn't support partial aggregation; all aggregation is computed from the entire window.
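
    A bounded-window example that is planned as this node (illustrative; assumes an active SparkSession bound to spark):

      from pyspark.sql.functions import pandas_udf, PandasUDFType
      from pyspark.sql.window import Window

      @pandas_udf("double", PandasUDFType.GROUPED_AGG)
      def mean_udf(v):
          return v.mean()

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      # A bounded row frame: the worker receives window bound indices plus v.
      w = Window.partitionBy("id").orderBy("v").rowsBetween(-1, 1)
      df.select(df.id, mean_udf(df.v).over(w)).explain()  # plan shows WindowInPandas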

Value Members

  1. object EvaluatePython
  2. object ExtractGroupingPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts the PythonUDFs in a logical Aggregate that are used in grouping keys, and evaluates them before the aggregate. This must be executed after the ExtractPythonUDFFromAggregate rule and before ExtractPythonUDFs.

  3. object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts all the Python UDFs in a logical Aggregate that depend on an aggregate expression or grouping key, or that don't depend on any of the above expressions, and evaluates them after the aggregate.

  4. object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper

    Extracts PythonUDFs from operators, rewriting the query plan so that the UDF can be evaluated alone in a batch.

    Only extracts the PythonUDFs that can be evaluated in Python (the single child is a PythonUDF, or all the children can be evaluated in the JVM).

    This has the limitation that the input to the Python UDF is not allowed to include attributes from multiple child operators.
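
    One illustrative way to observe the rewrite (assumes an active SparkSession bound to spark; exact plan output is version-dependent and abbreviated here):

      from pyspark.sql.functions import udf, col

      @udf("boolean")
      def is_even(x):
          return x % 2 == 0

      spark.range(10).filter(is_even(col("id"))).explain()
      # The UDF is extracted into its own evaluation node, and the Filter
      # references the resulting attribute, roughly:
      #   Filter pythonUDF0#...
      #   +- BatchEvalPython [is_even(id#...)], [pythonUDF0#...]
      #      +- Range (0, 10, ...)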

  5. object PythonForeachWriter extends Serializable
  6. object PythonUDFRunner
