io.smartdatalake.workflow.action
input DataObject
output DataObject
optional custom transformation to apply
optional list of transformations to apply before historization. See sparktransformer for a list of included Transformers. The transformations are applied according to the list's ordering.
Remove all columns on blacklist from dataframe
Keep only columns on whitelist in dataframe
optional tuples of [column name, spark sql expression] to be added as additional columns to the dataframe. The spark sql expressions are evaluated against an instance of DefaultExpressionData.
Filter of data to be processed by historization. It can be used to exclude historical data not needed to create new history, for performance reasons. Note that filterClause is only applied if mergeModeEnable=false. Use mergeModeAdditionalJoinPredicate if mergeModeEnable=true to achieve a similar performance tuning.
optional list of columns to ignore when comparing two records in historization. Cannot be used together with historizeWhitelist.
optional final list of columns to use when comparing two records in historization. Cannot be used together with historizeBlacklist.
if true, remove columns that no longer exist during schema evolution
if true, remove columns that no longer exist from nested data types during schema evolution. Keeping deleted columns in complex data types has a performance impact, as all future data has to be converted by a complex function.
Set to true to use saveMode.Merge for much better performance. Output DataObject must implement CanMergeDataFrame if enabled (default = false).
To optimize performance it can be beneficial to limit the records read from the existing table data, e.g. it might be sufficient to use only the last 7 days. Specify a condition to select existing data to be used in the transformation as a Spark SQL expression. Use table alias 'existing' to reference columns of the existing table data.
optional execution mode for this Action
optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise skipped. See Condition for details.
optional spark sql expression evaluated as where-clause against dataframe of metrics. Available columns are dataObjectId, key, value. If there are any rows passing the where clause, a MetricCheckFailed exception is thrown.
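Tying these parameters together, a minimal HistorizeAction sketch in SDLB's HOCON configuration format. The ids histCustomer, stg_customer and hist_customer are hypothetical, and the filterClause assumes SDLB's technical capture-timestamp column dl_ts_captured for illustration:

  actions {
    histCustomer {
      type = HistorizeAction
      inputId = stg_customer
      outputId = hist_customer
      # only evaluated because mergeModeEnable is false (the default)
      filterClause = "dl_ts_captured > date_sub(current_date(), 30)"
    }
  }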
Adds a runtime event for this Action
Adds a runtime metric for this Action
Applies the executionMode and stores result in executionModeResult variable
apply transformer to SubFeed
apply transformer to partition values
Stop propagating input DataFrame through action and instead get a new DataFrame from DataObject. This can help to save memory and performance if the input DataFrame includes many transformations from previous Actions. The new DataFrame will be initialized according to the SubFeed's partitionValues.
Enriches SparkSubFeed with DataFrame if not existing
input data object.
input SubFeed.
current execution phase
true if this input is a recursive input
Action.exec implementation
SparkSubFeed's to be processed
processed SparkSubFeed's
optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise skipped. See Condition for details.
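A hedged HOCON sketch of such a condition, assuming a hypothetical input DataObject stg_customer and that SubFeedsExpressionData exposes the input SubFeeds by name with an isSkipped flag:

  executionCondition {
    expression = "!inputSubFeeds.stg_customer.isSkipped"
    description = "run only if the input was not skipped"
  }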
optional execution mode for this Action
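As an illustration, a minimal HOCON sketch selecting PartitionDiffMode, one of the execution modes included with SDLB, which processes only partitions missing in the output:

  executionMode = { type = PartitionDiffMode }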
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
the factory (object) for this class.
Filter DataFrame with given partition values
DataFrame to filter
partition values to use as filter condition
filter expression to apply
filtered DataFrame
Get potential state of input DataObjects when executionMode is DataObjectStateIncrementalMode.
Get latest runtime state
Get summarized runtime information for a given ExecutionId.
ExecutionId to get runtime information for. If empty, runtime information for the last ExecutionId is returned.
Get the latest metrics for all DataObjects and a given SDLExecutionId.
ExecutionId to get metrics for. If empty, metrics for the last ExecutionId are returned.
applies all the transformations above
A unique identifier for this instance.
if true, remove columns that no longer exist during schema evolution
if true, remove columns that no longer exist from nested data types during schema evolution. Keeping deleted columns in complex data types has a performance impact, as all future data has to be converted by a complex function.
Action.init implementation
SparkSubFeed's to be processed
processed SparkSubFeed's
Input DataObject implementing CanCreateDataFrame
input DataObject
Input DataObjects. To be implemented by subclasses.
If this Action should be run as asynchronous streaming process
To optimize performance it can be beneficial to limit the records read from the existing table data, e.g. it might be sufficient to use only the last 7 days. Specify a condition to select existing data to be used in the transformation as a Spark SQL expression. Use table alias 'existing' to reference columns of the existing table data.
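For instance, a HOCON sketch enabling merge mode and limiting the existing data joined to the last 7 days; the technical column dl_ts_captured is an assumption for illustration:

  mergeModeEnable = true
  mergeModeAdditionalJoinPredicate = "existing.dl_ts_captured > date_sub(current_date(), 7)"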
Set to true to use saveMode.Merge for much better performance. Output DataObject must implement CanMergeDataFrame if enabled (default = false).
Additional metadata for the Action
optional spark sql expression evaluated as where-clause against dataframe of metrics. Available columns are dataObjectId, key, value. If there are any rows passing the where clause, a MetricCheckFailed exception is thrown.
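A sketch of such a check in HOCON, failing the run when nothing was written to a hypothetical output hist_customer; the metric key records_written is assumed to be reported by the output DataObject:

  metricsFailCondition = "dataObjectId = 'hist_customer' and key = 'records_written' and value = 0"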
provide an implementation of the DAG node id
Output DataObject implementing CanWriteDataFrame
output DataObject
Output DataObjects. To be implemented by subclasses.
Force persisting input DataFrames on disk. This improves performance if a DataFrame is used multiple times in the transformation and can serve as a recovery point in case a task gets lost. Note that DataFrames are persisted automatically by the previous Action if later Actions need the same data. To avoid this behaviour set breakDataFrameLineage=false.
Executes operations needed after executing an action. In this step, any tasks on input or output DataObjects needed after the main task are executed, e.g. JdbcTableDataObject's postWriteSql or CopyAction's deleteInputData.
Executes operations needed to clean up after executing an action failed.
Executes operations needed before executing an action. In this step, any tasks on input or output DataObjects needed before the main task are executed, e.g. JdbcTableDataObject's preWriteSql.
Checks before initialization of an Action. In this step the execution condition is evaluated, and the Action's init is skipped if the result is false.
Prepare DataObjects prerequisites. In this step preconditions are prepared & tested: connections can be created, and needed structures exist, e.g. a Kafka topic or JDBC table.
This runs during the "prepare" phase of the DAG.
Applies changes to a SubFeed from a previous action in order to be used as input for this action's transformation.
Recursive Inputs cannot be set by configuration for SparkSubFeedActions, but they are implicitly used in DeduplicateAction and HistorizeAction for existing data. Default is empty.
Override and parametrize saveMode in output DataObject configurations when writing to DataObjects.
Sets the Spark job description for better traceability in the Spark UI
Note: This sets Spark local properties, which are propagated to the respective executor tasks. We rely on this to match metrics back to Actions and DataObjects. As writing to a DataObject on the Driver happens uninterrupted in the same exclusive thread, this is suitable.
phase description (be short...)
This is displayed in ASCII graph visualization
Transform a SparkSubFeed. To be implemented by subclasses.
SparkSubFeed to be transformed
SparkSubFeed to be enriched with transformed result
transformed output SparkSubFeed
Transform partition values
Map of input to output partition values. This allows mapping partition values forward and backward, which is needed in execution modes.
optional list of transformations to apply before historization. See sparktransformer for a list of included Transformers. The transformations are applied according to the list's ordering.
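For example, a HOCON sketch of a transformers list with a single SQLDfTransformer applied before historization; table and column names are illustrative:

  transformers = [{
    type = SQLDfTransformer
    code = "select id, name, upper(country) as country from stg_customer"
  }]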
The transformed DataFrame is validated to have the output's partition columns included; partition columns are moved to the end and the SubFeed's partition values are updated.
output DataObject
SubFeed with transformed DataFrame
validated and updated SubFeed
Validate that DataFrame contains a given list of columns, throwing an exception otherwise.
DataFrame to validate
Columns that must exist in DataFrame
name to mention in exception
writes the SubFeed to the output respecting the given execution mode
true if no data was transferred, otherwise false
optional tuples of [column name, spark sql expression] to be added as additional columns to the dataframe. The spark sql expressions are evaluated against an instance of DefaultExpressionData.
(Since version 2.0.5) Use transformers instead.
Remove all columns on blacklist from dataframe
(Since version 2.0.5) Use transformers instead.
Keep only columns on whitelist in dataframe
(Since version 2.0.5) Use transformers instead.
Filter of data to be processed by historization. It can be used to exclude historical data not needed to create new history, for performance reasons. Note that filterClause is only applied if mergeModeEnable=false. Use mergeModeAdditionalJoinPredicate if mergeModeEnable=true to achieve a similar performance tuning.
(Since version 2.0.5) Use transformers instead.
optional list of columns to ignore when comparing two records in historization. Cannot be used together with historizeWhitelist.
(Since version 2.0.5) Use transformers instead.
optional final list of columns to use when comparing two records in historization. Cannot be used together with historizeBlacklist.
(Since version 2.0.5) Use transformers instead.
(Since version 2.0.5) Use transformers instead.
optional custom transformation to apply
(Since version 2.0.5) Use transformers instead.
Action to historize a subfeed. Historization creates a technical history of data by creating valid-from/to columns. It needs a transactional table as output with defined primary keys.
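A minimal HOCON sketch of a matching transactional output, here a DeltaLakeTableDataObject with a primary key defined; ids, path and table names are hypothetical:

  dataObjects {
    hist_customer {
      type = DeltaLakeTableDataObject
      path = "/data/hist_customer"
      table = {
        db = default
        name = hist_customer
        primaryKey = [id]
      }
    }
  }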