io.smartdatalake.workflow.dataobject
unique name of this data object
DDL statement to be executed in the prepare phase, using the output jdbc connection. Note that it is also possible to let Spark create the table in the init phase. See jdbcOptions to customize column data types for the auto-created DDL statement.
SQL statement to be executed in the exec phase before reading the input table, using the input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
SQL statement to be executed in the exec phase after reading the input table and before the action is finished, using the input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
SQL statement to be executed in the exec phase before writing the output table, using the output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
SQL statement to be executed in the exec phase after writing the output table, using the output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
The jdbc table to be read
Number of rows to be fetched together by the Jdbc driver
SDLSaveMode to use when writing table, default is "Overwrite". Only "Append" and "Overwrite" supported.
If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.
Id of JdbcConnection configuration
Any jdbc options according to https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. Note that some options above set and override some of these options explicitly. Use "createTableOptions" and "createTableColumnTypes" to control the automatic creation of database tables.
Virtual partition columns. Note that these do not need to be the same as the database partition columns for this table, but it is important that there is an index on these columns to efficiently list existing "partitions".
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.
Connection defines driver, url and db in central location
Id of JdbcConnection configuration
Creates the read schema based on a given write schema. Normally this is the same, but some DataObjects can remove and add columns on read (e.g. KafkaTopicDataObject, SparkFileDataObject). In these cases we have to break the DataFrame lineage and create a dummy DataFrame in the init phase.
DDL statement to be executed in the prepare phase, using the output jdbc connection. Note that it is also possible to let Spark create the table in the init phase. See jdbcOptions to customize column data types for the auto-created DDL statement.
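For illustration, a possible createTableDdl value; table, columns and syntax are hypothetical placeholders and must be adapted to the target database's SQL dialect:

// Hypothetical createTableDdl, executed in the prepare phase via the output jdbc connection.
// Table and column names are placeholders; adapt the syntax to your database dialect.
val createTableDdl: String =
  """CREATE TABLE IF NOT EXISTS myschema.mytable (
    |  id BIGINT,
    |  name VARCHAR(256),
    |  dt VARCHAR(8),
    |  PRIMARY KEY (id)
    |)""".stripMargin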
Delete virtual partitions by "delete from" statement
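As an illustration, deleting one virtual partition boils down to a statement of roughly this shape; table and partition column names are hypothetical, and the actual SQL is built by SDL from the given PartitionValues:

// Hypothetical shape of the delete statement for a single value of partition column "dt".
val deletePartitionSql = "DELETE FROM myschema.mytable WHERE dt = '20240101'"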
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
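A sketch of such an expression, assuming the PartitionValues instance exposes its column/value pairs as a map named elements; check the SDL documentation for the attributes actually available:

// Hypothetical expectedPartitionsCondition: only partitions newer than a cutoff date are expected to exist.
// "elements" and the partition column "dt" are assumptions for this illustration.
val expectedPartitionsCondition = "elements['dt'] > '20200101'"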
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
the factory (object) for this class.
Handle class cast exception when getting objects from instance registry
Configure a housekeeping mode to e.g. cleanup, archive and compact partitions. Default is None.
unique name of this data object
Called during init phase for checks and initialization. If possible, don't change the system until the execution phase.
Number of rows to be fetched together by the Jdbc driver
Any jdbc options according to https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html. Note that some options above set and override some of these options explicitly. Use "createTableOptions" and "createTableColumnTypes" to control the automatic creation of database tables.
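For example, the following map uses two documented Spark JDBC data source options to influence the table Spark creates; the values are illustrative only:

// Illustrative jdbcOptions. "createTableColumnTypes" and "createTableOptions" are standard
// Spark JDBC options; the values below are placeholders (ENGINE=InnoDB is MySQL-specific).
val jdbcOptions = Map(
  "createTableColumnTypes" -> "name VARCHAR(1024), comment CLOB",
  "createTableOptions" -> "ENGINE=InnoDB"
)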
Listing virtual partitions by a "select distinct partition-columns" query
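For hypothetical virtual partition columns dt and region, the listing query has roughly this shape, which is why an index on these columns matters:

// Hypothetical shape of the partition listing query; table and column names are placeholders.
val listPartitionsSql = "SELECT DISTINCT dt, region FROM myschema.mytable"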
Merges DataFrame with existing table data by writing the DataFrame to a temp-table and using an SQL merge statement. Table.primaryKey is used as the condition to check whether a record is matched or not. If it is matched it gets updated (or deleted), otherwise it is inserted. This is all done in one transaction.
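The generated statement is roughly of the following shape. This is a hedged sketch only: the exact SQL, the temp-table name and the dialect details are determined by SDL, and the primary key is assumed here to be a single column id:

// Hypothetical shape of the merge from the temp-table into the target table.
val mergeSql =
  """MERGE INTO myschema.mytable tgt
    |USING myschema.mytable_tmp src
    |ON tgt.id = src.id
    |WHEN MATCHED THEN UPDATE SET tgt.name = src.name, tgt.dt = src.dt
    |WHEN NOT MATCHED THEN INSERT (id, name, dt) VALUES (src.id, src.name, src.dt)""".stripMargin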
Additional metadata for the DataObject
Definition of partition columns
Runs operations after reading from DataObject
SQL statement to be executed in the exec phase after reading the input table and before the action is finished, using the input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
Runs operations after writing to DataObject
SQL statement to be executed in the exec phase after writing the output table, using the output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
Runs operations before reading from DataObject
SQL statement to be executed in the exec phase before reading the input table, using the input jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
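A hedged example of such a statement; runId is assumed here to be one of the attributes provided by DefaultExpressionData (consult the SDL documentation for the attributes actually available), and the table names are placeholders:

// Hypothetical preReadSql: the %{...} token is replaced with the result of the Spark SQL
// expression evaluated against DefaultExpressionData before the input table is read.
val preReadSql = "UPDATE staging.load_log SET last_run_id = %{runId} WHERE table_name = 'mytable'"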
Runs operations before writing to DataObject. Note: As the transformed SubFeed doesn't yet exist in Action.preWrite, no partition values can be passed as parameters as in preRead.
SQL statement to be executed in the exec phase before writing the output table, using the output jdbc connection. Use tokens with syntax %{<spark sql expression>} to substitute values from DefaultExpressionData.
Prepare & test DataObject's prerequisites
This runs during the "prepare" operation of the DAG.
SDLSaveMode to use when writing the table, default is "Overwrite". Only "Append" and "Overwrite" are supported.
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
The jdbc table to be read
Validate the schema of a given Spark DataFrame df against a given expected schema.
The data frame to validate.
The expected schema to validate against.
role used in exception message. Set to read or write.
SchemaViolationException if the schemaMin does not validate.
Validate the schema of a given Spark DataFrame df: check that it contains the specified partition columns.
The data frame to validate.
role used in exception message. Set to read or write.
SchemaViolationException if the partition columns are not included.
Validate the schema of a given Spark DataFrame df: check that it contains the specified primary key columns.
The data frame to validate.
role used in exception message. Set to read or write.
SchemaViolationException if the primary key columns are not included.
Validate the schema of a given Spark DataFrame df against schemaMin.
The data frame to validate.
role used in exception message. Set to read or write.
SchemaViolationException if the schemaMin does not validate.
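Conceptually, the check verifies that every column of schemaMin exists in the DataFrame with a compatible type. A simplified Scala sketch of this idea (not SDL's actual implementation, which throws SchemaViolationException and handles type compatibility in more detail):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Simplified sketch of a minimal-schema check: every field of schemaMin must exist in df
// with the same data type. SDL's real validation may differ (e.g. nullability handling).
def validateSchemaMinSketch(df: DataFrame, schemaMin: StructType, role: String): Unit = {
  val missing = schemaMin.fields.filterNot { expected =>
    df.schema.fields.exists(f => f.name == expected.name && f.dataType == expected.dataType)
  }
  if (missing.nonEmpty)
    throw new RuntimeException(s"($role) schemaMin does not validate: missing or mismatched columns ${missing.map(_.name).mkString(", ")}")
}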
Virtual partition columns. Note that these do not need to be the same as the database partition columns for this table, but it is important that there is an index on these columns to efficiently list existing "partitions".
Write DataFrame to DataObject
the DataFrame to write
partition values included in the DataFrame's data
if DataFrame needs this DataObject as input - special treatment might be needed in this case.
Write Spark structured streaming DataFrame. The default implementation uses foreachBatch and this trait's writeDataFrame method to write the DataFrame. Some DataObjects will override this with specific implementations (Kafka).
The Streaming DataFrame to write
Trigger frequency for stream
location for checkpoints of streaming query
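As a sketch of this default behaviour using plain Spark APIs (writeBatchDf is a placeholder for this trait's writeDataFrame method; all names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Sketch of the default streaming write: each micro-batch is handed to the batch write logic
// via foreachBatch, with the given trigger and checkpoint location.
def writeStreamingDataFrameSketch(df: DataFrame, trigger: Trigger, checkpointLocation: String)
                                 (writeBatchDf: DataFrame => Unit): Unit = {
  val handleBatch: (DataFrame, Long) => Unit = (batchDf, _) => writeBatchDf(batchDf)
  df.writeStream
    .trigger(trigger)
    .option("checkpointLocation", checkpointLocation)
    .foreachBatch(handleBatch)
    .start()
}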
DataObject of type JDBC. Provides details for an action to access tables in a database through JDBC.