io.smartdatalake.workflow.dataobject

DeltaLakeTableDataObject

Related Docs: object DeltaLakeTableDataObject | package dataobject

case class DeltaLakeTableDataObject(id: DataObjectId, path: Option[String], partitions: Seq[String] = Seq(), options: Option[Map[String, String]] = None, schemaMin: Option[StructType] = None, table: Table, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, retentionPeriod: Option[Int] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends TransactionalSparkTableDataObject with CanMergeDataFrame with CanEvolveSchema with CanHandlePartitions with HasHadoopStandardFilestore with Product with Serializable

DataObject of type DeltaLakeTableDataObject. Provides details to access tables in Delta format to an Action. Note that in Spark 2.x the catalog is not supported for DeltaTable; this means that table db/name are not used, it's the path that matters.

The Delta format maintains a transaction log in a separate _delta_log subfolder. The schema is registered in the Metastore by DeltaLakeTableDataObject.

The following anomalies might occur:
- Table is registered in the Metastore but the path does not exist -> the table is dropped from the Metastore.
- Table is registered in the Metastore but the path is empty -> an error is thrown; delete the path to clean up.
- Table is registered and the path contains Parquet files, but the _delta_log subfolder is missing -> the path is converted to Delta format.
- Table is not registered but the path contains Parquet files and a _delta_log subfolder -> the table is registered.
- Table is not registered but the path contains Parquet files without a _delta_log subfolder -> the path is converted to Delta format and the table is registered.
- Table is not registered and the path does not exist -> the table is created on write.
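
For illustration, a minimal construction sketch in Scala is shown below. The import paths, database, table and path names are assumptions and may vary between SDL versions:

    import io.smartdatalake.config.InstanceRegistry
    import io.smartdatalake.config.SdlConfigObject.DataObjectId
    import io.smartdatalake.definitions.SDLSaveMode
    import io.smartdatalake.workflow.dataobject.{DeltaLakeTableDataObject, Table}

    // the constructor requires an implicit InstanceRegistry
    implicit val instanceRegistry: InstanceRegistry = new InstanceRegistry()

    // hypothetical table: partitioned by "dt", upserted by primary key "id" on write
    val deltaTable = DeltaLakeTableDataObject(
      id = DataObjectId("myDeltaTable"),
      path = Some("/data/my_delta_table"),
      partitions = Seq("dt"),
      table = Table(db = Some("default"), name = "my_delta_table", primaryKey = Some(Seq("id"))),
      saveMode = SDLSaveMode.Merge,
      allowSchemaEvolution = true
    )

In practice such data objects are typically declared in the SDL configuration rather than constructed in code; the sketch only illustrates the parameters documented below.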

id

unique name of this data object

path

hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.

partitions

partition columns for this data object

options

Options for Delta Lake tables; see https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions.

schemaMin

An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

table

DeltaLake table to be written by this output

saveMode

SDLSaveMode to use when writing files; default is Overwrite. Overwrite, Append and Merge are currently supported.

allowSchemaEvolution

If set to true, schema evolution is applied automatically when writing to this DataObject with a different schema; otherwise SDL stops with an error.

retentionPeriod

Optional Delta Lake retention threshold in hours. Files required to read table versions younger than retentionPeriod are preserved; all other files are deleted.

acl

override connection permissions for files created in this table's Hadoop directory with this connection

connectionId

optional id of io.smartdatalake.workflow.connection.HiveTableConnection

expectedPartitionsCondition

Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.

housekeepingMode

Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.

metadata

metadata

Linear Supertypes
Serializable, Serializable, Product, Equals, HasHadoopStandardFilestore, CanHandlePartitions, CanEvolveSchema, CanMergeDataFrame, TransactionalSparkTableDataObject, CanWriteDataFrame, TableDataObject, SchemaValidation, CanCreateDataFrame, DataObject, AtlasExportable, SmartDataLakeLogger, ParsableFromConfig[DataObject], SdlConfigObject, AnyRef, Any

Instance Constructors

  1. new DeltaLakeTableDataObject(id: DataObjectId, path: Option[String], partitions: Seq[String] = Seq(), options: Option[Map[String, String]] = None, schemaMin: Option[StructType] = None, table: Table, saveMode: SDLSaveMode = SDLSaveMode.Overwrite, allowSchemaEvolution: Boolean = false, retentionPeriod: Option[Int] = None, acl: Option[AclDef] = None, connectionId: Option[ConnectionId] = None, expectedPartitionsCondition: Option[String] = None, housekeepingMode: Option[HousekeepingMode] = None, metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. val acl: Option[AclDef]

    override connection permissions for files created in this table's Hadoop directory with this connection

  5. def addFieldIfNotExisting(writeSchema: StructType, colName: String, dataType: DataType): StructType

    Attributes
    protected
    Definition Classes
    CanCreateDataFrame
  6. val allowSchemaEvolution: Boolean

    If set to true, schema evolution is applied automatically when writing to this DataObject with a different schema; otherwise SDL stops with an error.

    Definition Classes
    DeltaLakeTableDataObject → CanEvolveSchema
  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def atlasName: String

    Definition Classes
    TableDataObject → DataObject → AtlasExportable
  9. def atlasQualifiedName(prefix: String): String

    Definition Classes
    TableDataObject → AtlasExportable
  10. def checkFilesExisting(implicit session: SparkSession): Boolean

    Check if the input files exist.

    Attributes
    protected
    Exceptions thrown

    IllegalArgumentException if failIfFilesMissing = true and no files found at path.

  11. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  12. val connectionId: Option[ConnectionId]

    optional id of io.smartdatalake.workflow.connection.HiveTableConnection

  13. def createReadSchema(writeSchema: StructType)(implicit session: SparkSession): StructType

    Definition Classes
    CanCreateDataFrame
  14. def deletePartitions(partitionValues: Seq[PartitionValues])(implicit session: SparkSession): Unit

    Note that this does not delete the whole partition but only the partition's data, because Delta Lake keeps history.

    Definition Classes
    DeltaLakeTableDataObject → CanHandlePartitions
  15. def dropTable(implicit session: SparkSession): Unit

    Definition Classes
    DeltaLakeTableDataObject → TableDataObject
  16. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. val expectedPartitionsCondition: Option[String]

    Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.

    Definition Classes
    DeltaLakeTableDataObject → CanHandlePartitions
  18. def factory: FromConfigFactory[DataObject]

    Definition Classes
    DeltaLakeTableDataObject → ParsableFromConfig
  19. def failIfFilesMissing: Boolean

    Configure whether io.smartdatalake.workflow.action.Actions should fail if the input file(s) are missing on the file system.

    Default is false.

  20. val filetype: String

  21. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  22. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  23. def getConnection[T <: Connection](connectionId: ConnectionId)(implicit registry: InstanceRegistry, ct: ClassTag[T], tt: scala.reflect.api.JavaUniverse.TypeTag[T]): T

    Attributes
    protected
    Definition Classes
    DataObject
  24. def getConnectionReg[T <: Connection](connectionId: ConnectionId, registry: InstanceRegistry)(implicit ct: ClassTag[T], tt: scala.reflect.api.JavaUniverse.TypeTag[T]): T

    Attributes
    protected
    Definition Classes
    DataObject
  25. def getDataFrame(partitionValues: Seq[PartitionValues] = Seq())(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    Definition Classes
    DeltaLakeTableDataObject → CanCreateDataFrame
  26. def getPKduplicates(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    Definition Classes
    TableDataObject
  27. def getPKnulls(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    Definition Classes
    TableDataObject
  28. def getPKviolators(implicit session: SparkSession, context: ActionPipelineContext): DataFrame

    Definition Classes
    TableDataObject
  29. def hadoopPath(implicit session: SparkSession): Path

    Definition Classes
    DeltaLakeTableDataObject → HasHadoopStandardFilestore
  30. val housekeepingMode: Option[HousekeepingMode]

    Optional definition of a housekeeping mode applied after every write. E.g. it can be used to clean up, archive and compact partitions. See HousekeepingMode for available implementations. Default is None.

    Definition Classes
    DeltaLakeTableDataObject → DataObject
  31. val id: DataObjectId

    unique name of this data object

    Definition Classes
    DeltaLakeTableDataObject → DataObject → SdlConfigObject
  32. def init(df: DataFrame, partitionValues: Seq[PartitionValues], saveModeOptions: Option[SaveModeOptions] = None)(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Definition Classes
    DeltaLakeTableDataObject → CanWriteDataFrame
  33. implicit val instanceRegistry: InstanceRegistry

  34. def isDbExisting(implicit session: SparkSession): Boolean

    Definition Classes
    DeltaLakeTableDataObject → TableDataObject
  35. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  36. def isPKcandidateKey(implicit session: SparkSession, context: ActionPipelineContext): Boolean

    Definition Classes
    TableDataObject
  37. def isTableExisting(implicit session: SparkSession): Boolean

    Definition Classes
    DeltaLakeTableDataObject → TableDataObject
  38. def listPartitions(implicit session: SparkSession, context: ActionPipelineContext): Seq[PartitionValues]

    List partitions. Note that a Spark SQL statement is needed, as there might be partition directories without current data inside.

    Definition Classes
    DeltaLakeTableDataObject → CanHandlePartitions
  39. lazy val logger: Logger

    Attributes
    protected
    Definition Classes
    SmartDataLakeLogger
  40. def mergeDataFrameByPrimaryKey(df: DataFrame, saveModeOptions: SaveModeMergeOptions)(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Merges the DataFrame with existing table data by using a DeltaLake upsert (merge) statement.

    Table.primaryKey is used as the condition to check whether a record is matched. If it is matched it is updated (or deleted), otherwise it is inserted. This is all done in one transaction (see the merge sketch after this member list).

  41. val metadata: Option[DataObjectMetadata]

    metadata

    Definition Classes
    DeltaLakeTableDataObject → DataObject
  42. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  43. final def notify(): Unit

    Definition Classes
    AnyRef
  44. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  45. val options: Option[Map[String, String]]

    Options for Delta Lake tables; see https://docs.delta.io/latest/delta-batch.html and org.apache.spark.sql.delta.DeltaOptions.

  46. def partitionLayout(): Option[String]

    Definition Classes
    HasHadoopStandardFilestore
  47. val partitions: Seq[String]

    partition columns for this data object

    Definition Classes
    DeltaLakeTableDataObject → CanHandlePartitions
  48. val path: Option[String]

    hadoop directory for this table. If it doesn't contain scheme and authority, the connection's pathPrefix is applied. If pathPrefix is not defined or doesn't define scheme and authority, the default scheme and authority are applied.

  49. def preWrite(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Definition Classes
    DeltaLakeTableDataObject → DataObject
  50. def prepare(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Definition Classes
    DeltaLakeTableDataObject → DataObject
  51. val retentionPeriod: Option[Int]

    Optional Delta Lake retention threshold in hours. Files required to read table versions younger than retentionPeriod are preserved; all other files are deleted (see the vacuum sketch after this member list).

  52. val saveMode: SDLSaveMode

    SDLSaveMode to use when writing files; default is Overwrite. Overwrite, Append and Merge are currently supported.

  53. val schemaMin: Option[StructType]

    An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    Definition Classes
    DeltaLakeTableDataObject → SchemaValidation
  54. val separator: Char

    Attributes
    protected
  55. def streamingOptions: Map[String, String]

    Definition Classes
    CanWriteDataFrame
  56. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  57. var table: Table

    DeltaLake table to be written by this output

    Definition Classes
    DeltaLakeTableDataObject → TableDataObject
  58. var tableSchema: StructType

    Definition Classes
    TableDataObject
  59. def toStringShort: String

    Definition Classes
    DataObject
  60. def vacuum(implicit session: SparkSession): Unit

  61. def validateSchema(df: DataFrame, schemaExpected: StructType, role: String): Unit

    Definition Classes
    SchemaValidation
  62. def validateSchemaHasPartitionCols(df: DataFrame, role: String): Unit

    Definition Classes
    CanHandlePartitions
  63. def validateSchemaHasPrimaryKeyCols(df: DataFrame, primaryKeyCols: Seq[String], role: String): Unit

    Definition Classes
    CanHandlePartitions
  64. def validateSchemaMin(df: DataFrame, role: String): Unit

    Definition Classes
    SchemaValidation
  65. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  66. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  67. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  68. def writeDataFrame(df: DataFrame, createTableOnly: Boolean, partitionValues: Seq[PartitionValues], saveModeOptions: Option[SaveModeOptions])(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Writes the DataFrame to HDFS/Parquet and creates the DeltaLake table. DataFrames are repartitioned in order not to write too many small files or only a few HDFS files that are too large.

  69. def writeDataFrame(df: DataFrame, partitionValues: Seq[PartitionValues] = Seq(), isRecursiveInput: Boolean = false, saveModeOptions: Option[SaveModeOptions] = None)(implicit session: SparkSession, context: ActionPipelineContext): Unit

    Definition Classes
    DeltaLakeTableDataObject → CanWriteDataFrame
  70. def writeStreamingDataFrame(df: DataFrame, trigger: Trigger, options: Map[String, String], checkpointLocation: String, queryName: String, outputMode: OutputMode, saveModeOptions: Option[SaveModeOptions])(implicit session: SparkSession, context: ActionPipelineContext): StreamingQuery

    Definition Classes
    CanWriteDataFrame
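
The merge semantics described for mergeDataFrameByPrimaryKey above can be pictured with the following rough Delta Lake sketch; this is an illustration, not SDL's actual implementation, and the table path and primary key column "id" are assumptions:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // upsert newData into an existing Delta table, matching records on the primary key column "id"
    def upsertByPrimaryKey(spark: SparkSession, tablePath: String, newData: DataFrame): Unit = {
      DeltaTable.forPath(spark, tablePath).as("existing")
        .merge(newData.as("updates"), "existing.id = updates.id")
        .whenMatched().updateAll()     // matched records are updated
        .whenNotMatched().insertAll()  // unmatched records are inserted
        .execute()                     // executed as a single transaction
    }

Similarly, retentionPeriod corresponds to Delta Lake's vacuum retention threshold in hours; the cleanup performed by the vacuum member can be sketched roughly as:

    // remove files that are no longer required to read versions younger than retentionHours
    def vacuumTable(spark: SparkSession, tablePath: String, retentionHours: Double): Unit = {
      // DeltaTable.vacuum returns an (empty) DataFrame; only the side effect matters here
      DeltaTable.forPath(spark, tablePath).vacuum(retentionHours)
    }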
