Package io.smartdatalake.workflow.action

package action

Type Members

  1. case class ActionMetadata(name: Option[String] = None, description: Option[String] = None, feed: Option[String] = None, tags: Seq[String] = Seq()) extends Product with Serializable

    Additional metadata for an Action.

    name: readable name of the Action
    description: description of the content of the Action
    feed: name of the feed this Action belongs to
    tags: optional custom tags for this object
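
    A minimal construction sketch with hypothetical values; the metadata is attached to an Action via its optional metadata parameter:

      val meta = ActionMetadata(
        name = Some("copy customer data"),
        description = Some("Copies customer data from staging to the integration layer"),
        feed = Some("customer"),
        tags = Seq("nightly", "customer")
      )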

  2. case class CopyAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, deleteDataAfterRead: Boolean = false, transformer: Option[CustomDfTransformerConfig] = None, transformers: Seq[ParsableDfTransformer] = Seq(), columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, additionalColumns: Option[Map[String, String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, breakDataFrameLineage: Boolean = false, persist: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, saveModeOptions: Option[SaveModeOptions] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedAction with Product with Serializable

    Action to copy files (i.e. from stage to integration).

    inputId: input DataObject
    outputId: output DataObject
    deleteDataAfterRead: flag to enable deletion of input partitions after copying
    transformer: optional custom transformation to apply
    transformers: optional list of transformations to apply. See sparktransformer for a list of included Transformers. The transformations are applied according to the list's ordering.
    columnBlacklist: remove all columns on the blacklist from the dataframe
    columnWhitelist: keep only columns on the whitelist in the dataframe
    additionalColumns: optional tuples of [column name, spark sql expression] to be added as additional columns to the dataframe. The spark sql expressions are evaluated against an instance of DefaultExpressionData.
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
    saveModeOptions: override and parametrize the saveMode set in the output DataObject's configuration when writing to it
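
    A minimal sketch of constructing this Action programmatically (in practice Actions are usually parsed from the SDL configuration via CopyAction's FromConfigFactory). The ids are hypothetical, and the implicit InstanceRegistry is assumed to be provided by the surrounding setup and to already contain the referenced DataObjects, since Actions resolve their inputs and outputs on construction:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}

      // registry populated with "stg-customer" and "int-customer" by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // copy the staging DataObject to the integration layer, keeping only selected
      // columns and deleting the input partitions after a successful copy
      val copy = CopyAction(
        id = ActionId("copy-customer-stg-to-int"),
        inputId = DataObjectId("stg-customer"),
        outputId = DataObjectId("int-customer"),
        columnWhitelist = Some(Seq("customer_id", "name", "updated_at")),
        deleteDataAfterRead = true
      )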

  3. case class CustomFileAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, transformer: CustomFileTransformerConfig, deleteDataAfterRead: Boolean = false, filesPerPartition: Int = 10, breakFileRefLineage: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileSubFeedAction with SmartDataLakeLogger with Product with Serializable

    Action to transform files between two Hadoop DataObjects. The transformation is executed in distributed mode on the Spark executors. A custom file transformer must be given, which reads a file from Hadoop and writes it back to Hadoop.

    inputId: input DataObject
    outputId: output DataObject
    transformer: a custom file transformer, which reads a file from a HadoopFileDataObject and writes it back to another HadoopFileDataObject
    deleteDataAfterRead: whether the input files should be deleted after successful processing
    filesPerPartition: number of files per Spark partition
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
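
    A hedged sketch of a programmatic setup; "com.example.DecompressTransformer" is a hypothetical custom file transformer implementation, and CustomFileTransformerConfig is assumed to accept a className pointing to it. The implicit InstanceRegistry is assumed to already contain the referenced DataObjects:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}
      import io.smartdatalake.workflow.action.customlogic.CustomFileTransformerConfig

      // registry containing the referenced DataObjects, provided by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // decompress raw files on the executors, processing 20 files per Spark partition
      val decompress = CustomFileAction(
        id = ActionId("decompress-raw-files"),
        inputId = DataObjectId("raw-files"),
        outputId = DataObjectId("decompressed-files"),
        transformer = CustomFileTransformerConfig(className = Some("com.example.DecompressTransformer")),
        filesPerPartition = 20
      )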

  4. case class CustomSparkAction(id: ActionId, inputIds: Seq[DataObjectId], outputIds: Seq[DataObjectId], transformer: Option[CustomDfsTransformerConfig] = None, transformers: Seq[ParsableDfsTransformer] = Seq(), breakDataFrameLineage: Boolean = false, persist: Boolean = false, mainInputId: Option[DataObjectId] = None, mainOutputId: Option[DataObjectId] = None, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None, recursiveInputIds: Seq[DataObjectId] = Seq(), inputIdsToIgnoreFilter: Seq[DataObjectId] = Seq())(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedsAction with Product with Serializable

    Action to transform data according to a custom transformer. Allows transforming multiple input and output dataframes.

    inputIds: input DataObjects
    outputIds: output DataObjects
    transformer: custom transformation applied to multiple dataframes
    mainInputId: optional selection of the main inputId, used for execution mode and partition values propagation. Only needed if there are multiple input DataObjects.
    mainOutputId: optional selection of the main outputId, used for execution mode and partition values propagation. Only needed if there are multiple output DataObjects.
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
    recursiveInputIds: outputs of this action that are also used as inputs of the same action
    inputIdsToIgnoreFilter: optional list of input ids for which filters (partition values & filter clause) are ignored
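
    A hedged sketch of a many-to-one transformation; "com.example.JoinOrdersTransformer" is a hypothetical CustomDfsTransformer implementation, and CustomDfsTransformerConfig is assumed to accept a className pointing to it. The implicit InstanceRegistry is assumed to already contain the referenced DataObjects:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}
      import io.smartdatalake.workflow.action.customlogic.CustomDfsTransformerConfig

      // registry containing the referenced DataObjects, provided by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // join two inputs into one enriched output; the main input drives execution mode
      // and partition values propagation
      val join = CustomSparkAction(
        id = ActionId("join-orders-customers"),
        inputIds = Seq(DataObjectId("int-orders"), DataObjectId("int-customers")),
        outputIds = Seq(DataObjectId("btl-orders-enriched")),
        transformer = Some(CustomDfsTransformerConfig(className = Some("com.example.JoinOrdersTransformer"))),
        mainInputId = Some(DataObjectId("int-orders"))
      )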

  5. case class DeduplicateAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, transformers: Seq[ParsableDfTransformer] = Seq(), columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, additionalColumns: Option[Map[String, String]] = None, filterClause: Option[String] = None, standardizeDatatypes: Boolean = false, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, updateCapturedColumnOnlyWhenChanged: Boolean = false, mergeModeEnable: Boolean = false, mergeModeAdditionalJoinPredicate: Option[String] = None, breakDataFrameLineage: Boolean = false, persist: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedAction with Product with Serializable

    Action to deduplicate a subfeed. Deduplication keeps the last record for every key, even after it has been deleted in the source. DeduplicateAction adds an additional column, TechnicalTableColumn.captured, containing the timestamp of the last occurrence of the record in the source. Updating this column on every execution creates many updates; especially when using saveMode.Merge it is better to set TechnicalTableColumn.captured to the last change of the record in the source. Use updateCapturedColumnOnlyWhenChanged = true to enable this optimization.

    DeduplicateAction needs a transactional table (e.g. TransactionalSparkTableDataObject) as output with defined primary keys. If the output implements CanMergeDataFrame, saveMode.Merge can be enabled by setting mergeModeEnable = true, which allows for much better performance.

    inputId: input DataObject
    outputId: output DataObject
    transformer: optional custom transformation to apply
    transformers: optional list of transformations to apply before deduplication. See sparktransformer for a list of included Transformers. The transformations are applied according to the list's ordering.
    columnBlacklist: remove all columns on the blacklist from the dataframe
    columnWhitelist: keep only columns on the whitelist in the dataframe
    additionalColumns: optional tuples of [column name, spark sql expression] to be added as additional columns to the dataframe. The spark sql expressions are evaluated against an instance of io.smartdatalake.util.misc.DefaultExpressionData.
    ignoreOldDeletedColumns: if true, remove columns that no longer exist during schema evolution
    ignoreOldDeletedNestedColumns: if true, remove columns that no longer exist from nested data types during schema evolution. Keeping deleted columns in complex data types has a performance impact, as all future data has to be converted by a complex function.
    updateCapturedColumnOnlyWhenChanged: set to true to update the column TechnicalTableColumn.captured only if the record has changed in the source, instead of updating it with every execution (default = false). This results in far fewer records being updated with saveMode.Merge.
    mergeModeEnable: set to true to use saveMode.Merge for much better performance. The output DataObject must implement CanMergeDataFrame if enabled (default = false).
    mergeModeAdditionalJoinPredicate: to optimize performance it might be interesting to limit the records read from the existing table data, e.g. it might be sufficient to use only the last 7 days. Specify a condition selecting the existing data to be used in the transformation as a Spark SQL expression. Use the table alias 'existing' to reference columns of the existing table data.
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
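
    A minimal sketch with hypothetical ids; the output is assumed to be a transactional table DataObject with a defined primary key that implements CanMergeDataFrame, and the implicit InstanceRegistry is assumed to already contain both DataObjects:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}

      // registry containing the referenced DataObjects, provided by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // deduplicate the customer feed using merge mode, and only touch the captured
      // timestamp when a record actually changed in the source
      val dedup = DeduplicateAction(
        id = ActionId("dedup-customer"),
        inputId = DataObjectId("stg-customer"),
        outputId = DataObjectId("int-customer"),
        mergeModeEnable = true,
        updateCapturedColumnOnlyWhenChanged = true
      )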

  6. abstract class FileSubFeedAction extends Action

  7. case class FileTransferAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, deleteDataAfterRead: Boolean = false, overwrite: Boolean = true, breakFileRefLineage: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends FileSubFeedAction with Product with Serializable

    Action to transfer files between SFTP, Hadoop and the local filesystem.

    inputId: input DataObject
    outputId: output DataObject
    deleteDataAfterRead: whether the input files should be deleted after successful processing
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
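
    A minimal sketch with hypothetical ids, assuming an SFTP input DataObject and a Hadoop output DataObject are already registered in the implicit InstanceRegistry:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}

      // registry containing the referenced DataObjects, provided by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // download invoice files from SFTP into a Hadoop landing zone and delete them
      // from the source after a successful transfer
      val download = FileTransferAction(
        id = ActionId("download-sftp-invoices"),
        inputId = DataObjectId("sftp-invoices"),
        outputId = DataObjectId("landing-invoices"),
        deleteDataAfterRead = true
      )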

  8. case class HistorizeAction(id: ActionId, inputId: DataObjectId, outputId: DataObjectId, transformer: Option[CustomDfTransformerConfig] = None, transformers: Seq[ParsableDfTransformer] = Seq(), columnBlacklist: Option[Seq[String]] = None, columnWhitelist: Option[Seq[String]] = None, additionalColumns: Option[Map[String, String]] = None, standardizeDatatypes: Boolean = false, filterClause: Option[String] = None, historizeBlacklist: Option[Seq[String]] = None, historizeWhitelist: Option[Seq[String]] = None, ignoreOldDeletedColumns: Boolean = false, ignoreOldDeletedNestedColumns: Boolean = true, mergeModeEnable: Boolean = false, mergeModeAdditionalJoinPredicate: Option[String] = None, breakDataFrameLineage: Boolean = false, persist: Boolean = false, executionMode: Option[ExecutionMode] = None, executionCondition: Option[Condition] = None, metricsFailCondition: Option[String] = None, metadata: Option[ActionMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends SparkSubFeedAction with Product with Serializable

    Action to historize a subfeed. Historization creates a technical history of data by adding valid-from/valid-to columns. It needs a transactional table as output with defined primary keys.

    inputId: input DataObject
    outputId: output DataObject
    transformer: optional custom transformation to apply
    transformers: optional list of transformations to apply before historization. See sparktransformer for a list of included Transformers. The transformations are applied according to the list's ordering.
    columnBlacklist: remove all columns on the blacklist from the dataframe
    columnWhitelist: keep only columns on the whitelist in the dataframe
    additionalColumns: optional tuples of [column name, spark sql expression] to be added as additional columns to the dataframe. The spark sql expressions are evaluated against an instance of DefaultExpressionData.
    filterClause: filter for the data to be processed by historization. It can be used to exclude historical data not needed to create new history, for performance reasons. Note that filterClause is only applied if mergeModeEnable = false; use mergeModeAdditionalJoinPredicate if mergeModeEnable = true to achieve a similar performance tuning.
    historizeBlacklist: optional list of columns to ignore when comparing two records in historization. Cannot be used together with historizeWhitelist.
    historizeWhitelist: optional final list of columns to use when comparing two records in historization. Cannot be used together with historizeBlacklist.
    ignoreOldDeletedColumns: if true, remove columns that no longer exist during schema evolution
    ignoreOldDeletedNestedColumns: if true, remove columns that no longer exist from nested data types during schema evolution. Keeping deleted columns in complex data types has a performance impact, as all future data has to be converted by a complex function.
    mergeModeEnable: set to true to use saveMode.Merge for much better performance. The output DataObject must implement CanMergeDataFrame if enabled (default = false).
    mergeModeAdditionalJoinPredicate: to optimize performance it might be interesting to limit the records read from the existing table data, e.g. it might be sufficient to use only the last 7 days. Specify a condition selecting the existing data to be used in the transformation as a Spark SQL expression. Use the table alias 'existing' to reference columns of the existing table data.
    executionMode: optional execution mode for this Action
    executionCondition: optional spark sql expression evaluated against SubFeedsExpressionData. If true, the Action is executed, otherwise it is skipped. See Condition for details.
    metricsFailCondition: optional spark sql expression evaluated as a where-clause against the dataframe of metrics. Available columns are dataObjectId, key, value. If any rows pass the where-clause, a MetricCheckFailed exception is thrown.
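
    A minimal sketch with hypothetical ids and column names; the output is assumed to be a transactional table DataObject with a defined primary key that implements CanMergeDataFrame, and the implicit InstanceRegistry is assumed to already contain both DataObjects:

      import io.smartdatalake.config.InstanceRegistry
      import io.smartdatalake.config.SdlConfigObject.{ActionId, DataObjectId}

      // registry containing the referenced DataObjects, provided by the surrounding setup
      implicit val instanceRegistry: InstanceRegistry = ???

      // historize the product table; ignore the technical load_ts column when comparing
      // records, and in merge mode only read existing history that changed recently
      // ("last_updated" is a hypothetical column of the existing table)
      val historize = HistorizeAction(
        id = ActionId("historize-product"),
        inputId = DataObjectId("int-product"),
        outputId = DataObjectId("btl-product-history"),
        historizeBlacklist = Some(Seq("load_ts")),
        mergeModeEnable = true,
        mergeModeAdditionalJoinPredicate = Some("existing.last_updated >= date_sub(current_date(), 7)")
      )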

  9. case class Metric(dataObjectId: String, key: Option[String], value: Option[String]) extends Product with Serializable

  10. case class NoDataToProcessWarning(actionId: NodeId, msg: String, results: Option[Seq[SubFeed]] = None) extends TaskSkippedDontStopWarning[SubFeed] with Product with Serializable

    Execution modes can throw this exception to indicate that there is no data to process.

    results: SDL might add fake results to this exception to allow further execution of the DAG. When creating the exception, results should be set to None.

    Annotations
    @DeveloperApi()
  11. case class RuntimeInfo(executionId: ExecutionId, state: RuntimeEventState, startTstmp: Option[LocalDateTime] = None, duration: Option[Duration] = None, msg: Option[String] = None, results: Seq[ResultRuntimeInfo] = Seq(), dataObjectsState: Seq[DataObjectState] = Seq()) extends Product with Serializable

    Summarized runtime information

  12. case class SDLExecutionId(runId: Int, attemptId: Int = 1) extends ExecutionId with Product with Serializable

    Standard execution id for actions that are executed synchronously by SDL.

  13. case class SparkStreamingExecutionId(batchId: Long) extends ExecutionId with Product with Serializable

    Execution id for Spark streaming jobs. They need a different execution id as they are executed asynchronously.

  14. abstract class SparkSubFeedAction extends SparkAction

  15. abstract class SparkSubFeedsAction extends SparkAction

  16. case class SubFeedExpressionData(partitionValues: Seq[Map[String, String]], isDAGStart: Boolean, isSkipped: Boolean) extends Product with Serializable

  17. case class SubFeedsExpressionData(inputSubFeeds: Map[String, SubFeedExpressionData]) extends Product with Serializable
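
    A small sketch of the data structure that executionCondition expressions are evaluated against (the ids and values are hypothetical); each input subfeed is keyed by its DataObject id and exposes partitionValues, isDAGStart and isSkipped:

      // state of a single input subfeed for the hypothetical DataObject "stg-customer"
      val exprData = SubFeedsExpressionData(Map(
        "stg-customer" -> SubFeedExpressionData(
          partitionValues = Seq(Map("dt" -> "2021-01-01")),
          isDAGStart = true,
          isSkipped = false
        )
      ))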

Value Members

  1. object CopyAction extends FromConfigFactory[Action] with Serializable

  2. object CustomFileAction extends FromConfigFactory[Action] with Serializable

  3. object CustomSparkAction extends FromConfigFactory[Action] with Serializable

  4. object DeduplicateAction extends FromConfigFactory[Action] with Serializable

  5. object FileTransferAction extends FromConfigFactory[Action] with Serializable

  6. object HistorizeAction extends FromConfigFactory[Action] with Serializable

  7. object SDLExecutionId extends Serializable

  8. object SubFeedsExpressionData extends Serializable

  9. package customlogic

  10. package sparktransformer
