python

Type Members

case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan) extends SparkPlan with Product with Serializable

A physical plan that evaluates a PythonUDF, one partition of tuples at a time.
Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process and combining the result from the Python process with the original row.
Each row sent to Python is first put in a queue. For each output row from Python, we drain the queue to find the original input row. Note that if the Python process is too slow, the queue can grow without bound and spill to disk when memory runs out.
Here is a diagram to show how this works:
            Downstream (for parent)
             /      \
            /     socket  (output of UDF)
           /         \
        RowQueue    Python
           \         /
            \     socket  (input of UDF)
             \     /
          upstream (from child)
The rows sent to and received from Python are packed into batches (100 rows) and serialized. Some rows should always be buffered in the socket or the Python process, so a pull from the RowQueue always happens after the corresponding push into it.
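The queue-and-batch pairing described above can be sketched in a few lines of plain Scala. This is a simplified illustration, not Spark's implementation: `RowQueueSketch`, `Row`, `fakePythonWorker`, and `evaluate` are hypothetical names, the "Python process" is simulated by a local function, and an in-memory queue stands in for the spillable RowQueue.

```scala
import scala.collection.mutable

object RowQueueSketch {
  // Hypothetical row type (id, text); Spark itself uses InternalRow.
  type Row = (Int, String)

  // Stand-in for the external Python worker: receives a batch of projected
  // inputs and returns one result per input, in the same order.
  def fakePythonWorker(batch: Seq[String]): Seq[Int] = batch.map(_.length)

  def evaluate(input: Iterator[Row], batchSize: Int = 100): Iterator[(Int, String, Int)] = {
    val queue = mutable.Queue.empty[Row]

    // Remember each original row in the queue before its projection is sent out.
    val projected: Iterator[String] = input.map { row =>
      queue += row
      row._2
    }

    // Inputs travel in batches; every result is joined with the row at the
    // head of the queue, which was enqueued before the batch was sent.
    projected.grouped(batchSize).flatMap { batch =>
      fakePythonWorker(batch).map { result =>
        val original = queue.dequeue()
        (original._1, original._2, result)
      }
    }
  }
}
```

Because the worker returns results in input order, the head of the queue is always the row that produced the next result, so no row identifiers need to cross the socket.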
case class PythonUDF(name: String, func: PythonFunction, dataType: DataType, children: Seq[Expression]) extends Expression with Unevaluable with NonSQLExpression with Product with Serializable

A serialized version of a Python lambda function.
case class UserDefinedPythonFunction(name: String, func: PythonFunction, dataType: DataType) extends Product with Serializable

A user-defined Python function. This is used by the Python API.

Value Members

object EvaluatePython
object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan]

Extracts all Python UDFs in a logical aggregate that depend on an aggregate expression or grouping key, and evaluates them after the aggregate.
object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper

Extracts PythonUDFs from operators, rewriting the query plan so that the UDF can be evaluated alone in a batch.
Only extracts the PythonUDFs that can be evaluated in Python: either the single child is itself a PythonUDF, or all of the children can be evaluated in the JVM.
A limitation is that the input to a Python UDF is not allowed to include attributes from multiple child operators.
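The eligibility condition above can be illustrated on a toy expression tree. This is a sketch under stated assumptions: `Expr`, `Attr`, `JvmCall`, `PyUdf`, and `ExtractSketch` are hypothetical stand-ins, not Catalyst classes, and the check only models the rule as the description states it.

```scala
// Hypothetical, minimal expression model (not Catalyst's Expression).
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class JvmCall(name: String, children: Seq[Expr]) extends Expr
case class PyUdf(name: String, children: Seq[Expr]) extends Expr

object ExtractSketch {
  // True if a Python UDF appears anywhere in the expression tree.
  def containsPyUdf(e: Expr): Boolean = e match {
    case _: PyUdf => true
    case other    => other.children.exists(containsPyUdf)
  }

  // A UDF is evaluable in Python if its single child is a Python UDF that is
  // itself evaluable (a chain), or if none of its children contain a Python
  // UDF, meaning every input can be computed in the JVM.
  def canEvaluateInPython(udf: PyUdf): Boolean = udf.children match {
    case Seq(inner: PyUdf) => canEvaluateInPython(inner)
    case children          => !children.exists(containsPyUdf)
  }
}
```

For example, in this model a UDF applied directly to another UDF passes the check, while a UDF whose input is a JVM function wrapping a UDF does not, because that input can be computed neither purely in the JVM nor as a direct Python chain.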
