Package io.smartdatalake.workflow.dataobject

package dataobject

Type Members

  1. class DeltaLakeModulePlugin extends ModulePlugin

  2. case class DeltaLakeTableDataObject(id: DataObjectId, path: Option[String], partitions: Seq[String] = Seq(), options: Option[Map[String, String]] = None, schemaMin: Option[StructType] = None, table: Table, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, retentionPeriod: Option[Int] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanMergeDataFrame with CanEvolveSchema with CanHandlePartitions with HasHadoopStandardFilestore with Product with Serializable


    DataObject of type DeltaLakeTableDataObject. Provides details to access tables in Delta format to an Action. Note that in Spark 2.x the catalog for DeltaTable is not supported; this means that the table's db/name are not used, and it is the path that identifies the table.

    Delta format maintains a transaction log in a separate _delta_log subfolder. The schema is registered in the Metastore by DeltaLakeTableDataObject.

    The following anomalies might occur:
      - Table is registered in the metastore but the path does not exist -> the table is dropped from the metastore.
      - Table is registered in the metastore but the path is empty -> an error is thrown; delete the path to clean up.
      - Table is registered and the path contains parquet files, but the _delta_log subfolder is missing -> the path is converted to delta format.
      - Table is not registered but the path contains parquet files and a _delta_log subfolder -> the table is registered.
      - Table is not registered but the path contains parquet files without a _delta_log subfolder -> the path is converted to delta format and the table is registered.
      - Table is not registered and the path does not exist -> the table is created on write.
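    As a sketch of how such a DataObject is typically declared in a Smart Data Lake Builder HOCON configuration (the object key, database and table names below are illustrative assumptions, not taken from this page):

    ```hocon
    dataObjects {
      btl-delta-table {
        type = DeltaLakeTableDataObject
        path = "~{id}"            # relative path; the connection's pathPrefix is applied
        table {
          db = "default"
          name = "btl_delta_table"
        }
      }
    }
    ```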

    id

    unique name of this data object

    path

    hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.

    partitions

    partition columns for this data object

    options

    Options for Delta Lake tables; see https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions

    schemaMin

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    table

    DeltaLake table to be written by this output

    saveMode

    SDLSaveMode to use when writing files; default is "overwrite". Overwrite, Append and Merge are currently supported.

    allowSchemaEvolution

    If set to true, schema evolution is applied automatically when writing to this DataObject with a different schema; otherwise SDL stops with an error.

    retentionPeriod

    Optional delta lake retention threshold in hours. Files required by the table for reading versions younger than retentionPeriod will be preserved and the rest of them will be deleted.

    acl

    override connection permissions for files created in this table's hadoop directory with this connection

    connectionId

    optional id of io.smartdatalake.workflow.connection.HiveTableConnection

    expectedPartitionsCondition

    Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.

    housekeepingMode

    Optional definition of a housekeeping mode applied after every write. E.g. it can be used to cleanup, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.

    metadata

    metadata of this DataObject
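    Putting several of the parameters above together, a fuller configuration might look as follows. This is a hedged sketch: the object key, database, table name, primary key, partition column and the partition condition are all illustrative assumptions, not values from this page.

    ```hocon
    dataObjects {
      btl-sales-delta {
        type = DeltaLakeTableDataObject
        path = "~{id}"
        partitions = [dt]
        table {
          db = "btl"
          name = "sales"
          primaryKey = [id]        # assumed here; a merge needs a key to match records on
        }
        saveMode = Merge
        allowSchemaEvolution = true
        retentionPeriod = 168      # hours, i.e. keep files needed to read the last 7 days of versions
        # evaluated against a PartitionValues instance; syntax assumed for illustration
        expectedPartitionsCondition = "elements['dt'] > '20200101'"
      }
    }
    ```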

Value Members

  1. object DeltaLakeTableDataObject extends FromConfigFactory[DataObject] with Serializable

