io.smartdatalake.workflow.dataobject
unique name of this data object
Hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.
partition columns for this data object
Options for Delta Lake tables, see https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
DeltaLake table to be written by this output
SDLSaveMode to use when writing files, default is "overwrite". Overwrite, Append and Merge are supported for now.
If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.
Optional Delta Lake retention threshold in hours. Files required by the table for reading versions younger than retentionPeriod will be preserved; the rest will be deleted.
override connection permissions for files created in the table's Hadoop directory with this connection
optional id of io.smartdatalake.workflow.connection.HiveTableConnection
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
meta data
override connection permissions for files created in the table's Hadoop directory with this connection
If set to true, schema evolution will automatically occur when writing to this DataObject with a different schema; otherwise SDL will stop with an error.
Check if the input files exist. Throws IllegalArgumentException if failIfFilesMissing = true and no files are found at path.
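A minimal sketch of such an existence check using the Hadoop FileSystem API; the method name, parameters and error message below are illustrative assumptions, not SDL's actual implementation.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: fail when no files are found under the path and
// failIfFilesMissing is enabled (not SDL's actual implementation).
def checkFilesExisting(fs: FileSystem, path: Path, failIfFilesMissing: Boolean): Boolean = {
  val filesExist = fs.exists(path) && fs.listStatus(path).nonEmpty
  if (!filesExist && failIfFilesMissing)
    throw new IllegalArgumentException(s"No files found at $path.")
  filesExist
}
```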
optional id of io.smartdatalake.workflow.connection.HiveTableConnection
Note that we do not delete the whole partition but only the data of the partition, because Delta Lake keeps history.
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
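For illustration only, the snippet below evaluates such a boolean Spark SQL expression against partition values; the map column named elements and the partition column dt are assumptions for this sketch, not SDL's exact PartitionValues representation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[1]").appName("expectedPartitionsDemo").getOrCreate()
import spark.implicits._

// Partition values modelled as a map column named "elements" (an assumption for this sketch)
val partitionValues = Seq(Map("dt" -> "2024-01-01"), Map("dt" -> "2021-06-30")).toDF("elements")

// Keep only the partitions that are expected to exist
partitionValues.filter(expr("elements['dt'] >= '2022-01-01'")).show()
```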
Configure whether io.smartdatalake.workflow.action.Actions should fail if the input file(s) are missing on the file system. Default is false.
Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.
unique name of this data object
List partitions. Note that we need a Spark SQL statement as there might be partition directories with no current data inside.
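For illustration, a Spark SQL query along these lines returns only partitions that currently hold data; the table and partition column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder table and partition column names
def listCurrentPartitions(spark: SparkSession): Seq[String] =
  spark.sql("SELECT DISTINCT dt FROM my_db.my_delta_table")
    .collect()
    .map(_.getString(0))
    .toSeq
```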
Merges DataFrame with existing table data by using a DeltaLake upsert statement. Table.primaryKey is used as the condition to check whether a record is matched or not. If it is matched it gets updated (or deleted), otherwise it is inserted. All of this is done in one transaction.
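A minimal sketch of this upsert pattern using the Delta Lake API directly; the table path and the primary key column id are assumptions for illustration.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical upsert: primary key column "id" and table path are placeholders
def upsertByPrimaryKey(spark: SparkSession, updates: DataFrame, path: String): Unit = {
  DeltaTable.forPath(spark, path).as("existing")
    .merge(updates.as("updates"), "existing.id = updates.id") // primary key as match condition
    .whenMatched().updateAll()    // matched records are updated
    .whenNotMatched().insertAll() // unmatched records are inserted
    .execute()                    // executed as a single transaction
}
```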
meta data
Options for Delta Lake tables, see https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions
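As a sketch of how such options take effect on a Delta write (not SDL configuration syntax), mergeSchema and replaceWhere are documented Delta Lake batch options; the path and the replaceWhere predicate are placeholders.

```scala
import org.apache.spark.sql.DataFrame

// Example Delta Lake writer options from https://docs.delta.io/latest/delta-batch.html
def writeWithDeltaOptions(df: DataFrame, path: String): Unit = {
  df.write
    .format("delta")
    .option("mergeSchema", "true")                // allow additional columns on write
    .option("replaceWhere", "dt >= '2024-01-01'") // selectively overwrite matching data
    .mode("overwrite")
    .save(path)
}
```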
partition columns for this data object
Hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.
Optional Delta Lake retention threshold in hours. Files required by the table for reading versions younger than retentionPeriod will be preserved; the rest will be deleted.
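Retention on Delta tables is enforced by vacuuming old files; a minimal sketch with an example threshold of 168 hours (the path is a placeholder).

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

// Remove files no longer needed for versions younger than 168 hours (example value)
def vacuumTable(spark: SparkSession, path: String): Unit = {
  DeltaTable.forPath(spark, path).vacuum(168) // returns an empty DataFrame, ignored here
}
```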
SDLSaveMode to use when writing files, default is "overwrite". Overwrite, Append and Merge are supported for now.
An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.
DeltaLake table to be written by this output
Writes DataFrame to HDFS/Parquet and creates a DeltaLake table. DataFrames are repartitioned in order not to write too many small files or only a few HDFS files that are too large.
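Illustrative only: repartitioning before the write to control the number of output files; the target of 16 partitions and the path are arbitrary placeholders.

```scala
import org.apache.spark.sql.DataFrame

// Repartition to avoid many tiny files or a few oversized ones (16 is an example value)
def writeRepartitioned(df: DataFrame, path: String): Unit = {
  df.repartition(16)
    .write
    .format("delta")
    .mode("append")
    .save(path)
}
```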
DataObject of type DeltaLakeTableDataObject. Provides details to access Tables in delta format to an Action. Note that in Spark 2.x the Catalog for DeltaTable is not supported. This means that table db/name are not used; it's the path that identifies the table.
Delta format maintains a transaction log in a separate _delta_log subfolder. The schema is registered in Metastore by DeltaLakeTableDataObject.
The following anomalies might occur:
- table is registered in metastore but path does not exist -> table is dropped from metastore
- table is registered in metastore but path is empty -> error is thrown. Delete the path to clean up
- table is registered and path contains parquet files, but _delta_log subfolder is missing -> path is converted to delta format
- table is not registered but path contains parquet files and _delta_log subfolder -> table is registered
- table is not registered but path contains parquet files without _delta_log subfolder -> path is converted to delta format and table is registered
- table is not registered and path does not exist -> table is created on write
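Where the list above says "path is converted to delta format", this corresponds to Delta Lake's convert operation; a minimal sketch using the public Delta API, with placeholder paths and an assumed partition column dt.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StringType, StructType}

def convertExistingParquetPath(spark: SparkSession): Unit = {
  // Unpartitioned parquet path (placeholder) converted in place to delta format
  DeltaTable.convertToDelta(spark, "parquet.`/tmp/my_table`")

  // Partitioned variant: the partition schema must be given explicitly
  DeltaTable.convertToDelta(
    spark,
    "parquet.`/tmp/my_partitioned_table`",
    StructType(Seq(StructField("dt", StringType)))
  )
}
```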