Packages

package python

Type Members

  1. case class AggregateInPandasExec(groupingExpressions: Seq[NamedExpression], udfExpressions: Seq[PythonUDF], resultExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with PythonSQLMetrics with Product with Serializable

    Physical node for aggregation with group aggregate Pandas UDF.

    This plan works by sending the necessary (projected) input grouped data as Arrow record batches to the Python worker; the Python worker invokes the UDF and sends the results back to the executor; finally, the executor evaluates any post-aggregation expressions and joins the result with the grouped key.
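
    For illustration, here is a minimal PySpark sketch (assuming an existing SparkSession named spark) of a group aggregate Pandas UDF of the kind this node executes; the exact plan chosen can vary by Spark version and configuration.

      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:   # group aggregate UDF: Series -> scalar
          return v.mean()

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      df.groupBy("id").agg(mean_udf(df["v"])).show()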

  2. class ApplyInPandasWithStatePythonRunner extends BasePythonRunner[InType, OutType] with PythonArrowInput[InType] with PythonArrowOutput[OutType]

    A variant implementation of ArrowPythonRunner to serve the operation applyInPandasWithState.

    Unlike the normal ArrowPythonRunner, where both input and output (executor <-> Python worker) are InternalRow, applyInPandasWithState carries side data (state information) in both input and output along with the data, which requires a different structure for the Arrow RecordBatch.
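
    As a hedged sketch of the user-facing operation this runner serves (assuming Spark 3.4+, a streaming DataFrame df with columns id and value, and a SparkSession already created):

      import pandas as pd

      def count_fn(key, pdf_iter, state):
          # the state carries a single running count per key between micro-batches
          (count,) = state.get if state.exists else (0,)
          for pdf in pdf_iter:
              count += len(pdf)
          state.update((count,))
          yield pd.DataFrame({"id": [key[0]], "count": [count]})

      counts = df.groupBy("id").applyInPandasWithState(
          count_fn,
          outputStructType="id long, count long",
          stateStructType="count long",
          outputMode="Update",
          timeoutConf="NoTimeout")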

  3. class ApplyInPandasWithStateWriter extends AnyRef

    This class abstracts the complexity of constructing Arrow RecordBatches for data and state with bin-packing and chunking. The caller only needs to call the proper public methods of this class (startNewGroup, writeRow, finalizeGroup, finalizeData) and this class will write the data and state into Arrow RecordBatches, performing bin-packing and chunking internally.

    This class requires that the parameter root has been initialized with an Arrow schema like the one below:

      - data fields
      - state field
        - nested schema (refer to ApplyInPandasWithStateWriter.STATE_METADATA_SCHEMA)

    Please refer to the code comments in the implementation to see how the writes of data and state to the Arrow RecordBatch work, taking bin-packing and chunking into account.

  4. case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan, evalType: Int) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDF.

  5. class ArrowPythonRunner extends BasePythonRunner[Iterator[InternalRow], ColumnarBatch] with BasicPythonArrowInput with BasicPythonArrowOutput

    Similar to PythonUDFRunner, but exchanges data with the Python worker via an Arrow stream.
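
    A scalar (Series-to-Series) Pandas UDF such as the following is typically planned as ArrowEvalPythonExec and executed through this runner (a minimal sketch, assuming a SparkSession named spark):

      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("long")
      def plus_one(v: pd.Series) -> pd.Series:   # evaluated on Arrow-backed batches
          return v + 1

      spark.range(5).select(plus_one("id")).show()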

  6. case class AttachDistributedSequenceExec(sequenceAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    A physical plan that adds a new long column with sequenceAttr that increases one by one. This is for the 'distributed-sequence' default index in the pandas API on Spark.
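
    A minimal pandas-on-Spark sketch that exercises this index type (the option name is part of the pandas API on Spark; the plan details are version-dependent):

      import pyspark.pandas as ps

      ps.set_option("compute.default_index_type", "distributed-sequence")
      psdf = ps.range(10)   # the implicit sequential index is generated distributively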

  7. case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan) extends SparkPlan with EvalPythonExec with PythonSQLMetrics with Product with Serializable

    A physical plan that evaluates a PythonUDF.
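
    By contrast with the Arrow-based nodes above, a plain (pickled, row-at-a-time) Python UDF like the following is typically planned as this node (a minimal sketch, assuming a SparkSession named spark):

      from pyspark.sql.functions import udf

      @udf("long")
      def plus_one(x):   # row-at-a-time, serialized with pickle rather than Arrow
          return x + 1

      spark.range(5).select(plus_one("id")).show()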

  8. class CoGroupedArrowPythonRunner extends BasePythonRunner[(Iterator[InternalRow], Iterator[InternalRow]), ColumnarBatch] with BasicPythonArrowOutput

    Python UDF Runner for cogrouped UDFs. It sends Arrow batches from two different DataFrames, groups them in Python, and receives them back in the JVM as batches of a single DataFrame.

  9. trait EvalPythonExec extends SparkPlan with UnaryExecNode

    A physical plan that evaluates a PythonUDF, one partition of tuples at a time.

    Python evaluation works by sending the necessary (projected) input data via a socket to an external Python process, and combining the result from the Python process with the original row.

    For each row we send to Python, we also put it in a queue first. For each output row from Python, we drain the queue to find the original input row. Note that if the Python process is way too slow, this could lead to the queue growing unbounded and spilling to disk when it runs out of memory.

    Here is a diagram to show how this works:

              Downstream (for parent)
               /      \
              /     socket  (output of UDF)
             /         \
          RowQueue    Python
             \         /
              \     socket  (input of UDF)
               \       /
            upstream (from child)

    The rows sent to and received from Python are packed into batches (100 rows) and serialized; there should always be some rows buffered in the socket or the Python process, so pulling from the RowQueue always happens after pushing into it.
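
    An illustrative-only Python sketch of the queue discipline described above (not Spark source; send and receive are hypothetical stand-ins for the socket to the Python worker):

      from collections import deque

      def evaluate(input_rows, send, receive):
          queue = deque()
          for row in input_rows:
              queue.append(row)             # remember the original row first
              send(row)                     # then ship the projected input to Python
          for udf_output in receive():      # results come back in input order,
              original = queue.popleft()    # so draining the queue recovers the input row
              yield original + (udf_output,)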

  10. case class FlatMapCoGroupsInPandasExec(leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], func: Expression, output: Seq[Attribute], left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with PythonSQLMetrics with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas

    The input DataFrames are first cogrouped. Rows from each side of the cogroup are passed to the Python worker via Arrow. As each side of the cogroup may have a different schema, we send every group in its own Arrow stream. The Python worker turns the resulting record batches into pandas.DataFrames, invokes the user-defined function, and sends the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need to have enough memory to hold the largest cogroup. The memory on the Java side is used to construct the record batches (off-heap memory). The memory on the Python side is used for holding the pandas.DataFrame. It's possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
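
    A minimal PySpark sketch of the cogrouped operation this node executes (assuming a SparkSession named spark):

      import pandas as pd

      df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
      df2 = spark.createDataFrame([(1, "a"), (2, "b")], ("id", "v2"))

      def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
          return pd.merge(left, right, on="id")

      df1.groupBy("id").cogroup(df2.groupBy("id")) \
          .applyInPandas(merge, schema="id long, v1 double, v2 string").show()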

  11. case class FlatMapGroupsInPandasExec(groupingAttributes: Seq[Attribute], func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with PythonSQLMetrics with Product with Serializable

    Physical node for org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas

    Rows in each group are passed to the Python worker as an Arrow record batch. The Python worker turns the record batch into a pandas.DataFrame, invokes the user-defined function, and sends the resulting pandas.DataFrame back as an Arrow record batch. Finally, each record batch is turned into an Iterator[InternalRow] using ColumnarBatch.

    Note on memory usage: both the Python worker and the Java executor need to have enough memory to hold the largest group. The memory on the Java side is used to construct the record batch (off-heap memory). The memory on the Python side is used for holding the pandas.DataFrame. It's possible to further split one group into multiple record batches to reduce the memory footprint on the Java side; this is left as future work.
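
    A minimal PySpark sketch of the grouped-map operation this node executes (assuming a SparkSession named spark):

      import pandas as pd

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

      def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
          return pdf.assign(v=pdf.v - pdf.v.mean())

      df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()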

  12. case class FlatMapGroupsInPandasWithStateExec(functionExpr: Expression, groupingAttributes: Seq[Attribute], outAttributes: Seq[Attribute], stateType: StructType, stateInfo: Option[StatefulOperatorStateInfo], stateFormatVersion: Int, outputMode: OutputMode, timeoutConf: GroupStateTimeout, batchTimestampMs: Option[Long], eventTimeWatermarkForLateEvents: Option[Long], eventTimeWatermarkForEviction: Option[Long], child: SparkPlan) extends SparkPlan with UnaryExecNode with FlatMapGroupsWithStateExecBase with Product with Serializable

    Physical operator for executing org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandasWithState

    functionExpr: function called on each group
    groupingAttributes: used to group the data
    outAttributes: used to define the output rows
    stateType: used to serialize/deserialize state before calling functionExpr
    stateInfo: StatefulOperatorStateInfo to identify the state store for a given operator
    stateFormatVersion: the version of the state format
    outputMode: the output mode of functionExpr
    timeoutConf: used to time out groups that have not received data in a while
    batchTimestampMs: processing timestamp of the current batch
    eventTimeWatermarkForLateEvents: event time watermark for filtering late events
    eventTimeWatermarkForEviction: event time watermark for state eviction
    child: logical plan of the underlying data

  13. trait MapInBatchExec extends SparkPlan with UnaryExecNode with PythonSQLMetrics

    A relation produced by applying a function that takes an iterator of batches, such as pandas DataFrames or PyArrow's record batches, and outputs an iterator of them.

    This is somewhat similar to FlatMapGroupsInPandasExec and org.apache.spark.sql.catalyst.plans.logical.MapPartitionsInRWithArrow.
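
    A minimal PySpark sketch of the iterator-of-batches style this trait supports, here via mapInPandas (assuming a SparkSession named spark):

      import pandas as pd

      def filter_batches(batches):
          for pdf in batches:        # each element is a pandas.DataFrame batch
              yield pdf[pdf.id > 2]

      spark.range(5).mapInPandas(filter_batches, schema="id long").show()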

  14. case class MapInPandasExec(func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of pandas DataFrames and outputs an iterator of pandas DataFrames.

  15. class PythonForeachWriter extends ForeachWriter[UnsafeRow]
  16. case class PythonMapInArrowExec(func: Expression, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with MapInBatchExec with Product with Serializable

    A relation produced by applying a function that takes an iterator of PyArrow's record batches and outputs an iterator of PyArrow's record batches.

  17. class PythonUDFRunner extends BasePythonRunner[Array[Byte], Array[Byte]]

    A helper class to run Python UDFs in Spark.

  18. case class UserDefinedPythonFunction(name: String, func: PythonFunction, dataType: DataType, pythonEvalType: Int, udfDeterministic: Boolean) extends Product with Serializable

    A user-defined Python function. This is used by the Python API.

  19. case class WindowInPandasExec(windowExpression: Seq[NamedExpression], partitionSpec: Seq[Expression], orderSpec: Seq[SortOrder], child: SparkPlan) extends SparkPlan with WindowExecBase with PythonSQLMetrics with Product with Serializable

    This class calculates and outputs windowed aggregates over the rows in a single partition.

    This is similar to WindowExec. The main difference is that this node does not compute any window aggregation values. Instead, it computes the lower and upper bound for each window (i.e. the window bounds) and passes the data and indices to the Python worker to do the actual window aggregation.

    It currently materializes all data associated with the same partition key and passes it to the Python worker. This is not strictly necessary for sliding windows and can be improved (by possibly slicing data into overlapping chunks and stitching them together).

    This class groups window expressions by their window boundaries so that window expressions with the same window boundaries can share the same window bounds. The window bounds are prepended to the data passed to the Python worker.

    For example, if we have:
      avg(v) over specifiedwindowframe(RowFrame, -5, 5),
      avg(v) over specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing),
      avg(v) over specifiedwindowframe(RowFrame, -3, 3),
      max(v) over specifiedwindowframe(RowFrame, -3, 3)

    the Python input will look like:
      (lower_bound_w1, upper_bound_w1, lower_bound_w3, upper_bound_w3, v)

    where
      w1 is specifiedwindowframe(RowFrame, -5, 5),
      w2 is specifiedwindowframe(RowFrame, UnboundedPreceding, UnboundedFollowing),
      w3 is specifiedwindowframe(RowFrame, -3, 3).

    Note that w2 doesn't have bound indices in the Python input because it is an unbounded window, so its bound indices will always be the same.

    Bounded and unbounded windows are evaluated differently in the Python worker: (1) a bounded window takes the window bound indices in addition to the input columns, while an unbounded window takes only the input columns; (2) a bounded window evaluates the UDF once per input row, while an unbounded window evaluates the UDF once per window partition. This is controlled by the Python runner conf "pandas_window_bound_types".

    The logic to compute window bounds is delegated to WindowFunctionFrame and shared with WindowExec.

    Note that this doesn't support partial aggregation; all aggregation is computed from the entire window.
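
    A minimal PySpark sketch of a bounded window evaluated by this node, using a group aggregate Pandas UDF as a window function (assuming a SparkSession named spark):

      import pandas as pd
      from pyspark.sql.functions import pandas_udf
      from pyspark.sql.window import Window

      @pandas_udf("double")
      def mean_udf(v: pd.Series) -> float:
          return v.mean()

      df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
      w = Window.partitionBy("id").orderBy("v").rowsBetween(-1, 1)   # bounded row frame
      df.withColumn("mean_v", mean_udf("v").over(w)).show()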

Value Members

  1. object ApplyInPandasWithStatePythonRunner
  2. object ApplyInPandasWithStateWriter
  3. object EvaluatePython
  4. object ExtractGroupingPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts the PythonUDFs in a logical aggregate that are used in grouping keys and evaluates them before the aggregate. This must be executed after the ExtractPythonUDFFromAggregate rule and before ExtractPythonUDFs.

  5. object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan]

    Extracts all the Python UDFs in a logical aggregate that depend on an aggregate expression or a grouping key, or don't depend on any of the above expressions, and evaluates them after the aggregate.

  6. object ExtractPythonUDFs extends Rule[LogicalPlan]

    Extracts PythonUDFs from operators, rewriting the query plan so that the UDF can be evaluated alone in a batch.

    Only extracts the PythonUDFs that could be evaluated in Python (the single child is PythonUDFs or all the children could be evaluated in JVM).

    This has the limitation that the input to the Python UDF is not allowed to include attributes from multiple child operators.

  7. object PythonForeachWriter extends Serializable
  8. object PythonUDFRunner
