Package io.smartdatalake.workflow

package dataobject

Type Members

  1. case class DatePartitionColumnDef(colName: String, timeFormat: String = "yyyyMMdd", timeUnit: String = "days", timeZone: Option[String] = None, includeCurrentPartition: Boolean = false) extends Product with Serializable

    Definition of a date partition column used to extract a formatted timestamp into a column (see the instantiation sketch after this list).

    colName: Name of the date partition column into which the formatted time is extracted on batch read.

    timeFormat: Time format for the timestamp in the date partition column, defined according to java DateTimeFormatter. Default is "yyyyMMdd".

    timeUnit: Time unit for the timestamp in the date partition column, defined according to java ChronoUnit. Default is "days".

    timeZone: Time zone used for date logic. If not specified, the Java system default is used.

    includeCurrentPartition: Whether the current partition should be included. Default is to list only completed partitions. Attention: including the current partition might result in data loss if more data arrives later, but it can be useful for exporting all data before scheduled maintenance.

  2. case class KafkaTopicDataObject(id: DataObjectId, topicName: String, connectionId: ConnectionId, keyType: KafkaColumnType = KafkaColumnType.String, valueType: KafkaColumnType = KafkaColumnType.String, schemaMin: Option[StructType] = None, selectCols: Seq[String] = Seq("key", "value"), datePartitionCol: Option[DatePartitionColumnDef] = None, batchReadConsecutivePartitionsAsRanges: Boolean = false, batchReadMaxOffsetsPerTask: Option[Int] = None, options: Map[String, String] = Map(), metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with CanCreateStreamingDataFrame with CanWriteDataFrame with CanHandlePartitions with SchemaValidation with Product with Serializable

    DataObject of type KafkaTopic. Provides details for an action to read from Kafka topics using either org.apache.spark.sql.DataFrameReader or org.apache.spark.sql.streaming.DataStreamReader (see the usage sketch after this list).

    topicName: The name of the topic to read.

    keyType: Optional type the key column should be converted to. If none is given, it remains a byte array / binary.

    valueType: Optional type the value column should be converted to. If none is given, it remains a byte array / binary.

    schemaMin: An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    selectCols: Columns to be selected when reading the DataFrame. Available columns are key, value, topic, partition, offset, timestamp, timestampType. If keyType/valueType is AvroSchemaRegistry, the key/value column is converted to a complex type according to the Avro schema; to expand it, select "value.*". Default is to select key and value.

    datePartitionCol: Definition of a date partition column to extract a formatted timestamp into a column. This is used to list existing partitions and is added as an additional column on batch read.

    batchReadConsecutivePartitionsAsRanges: Set to true if consecutive partitions should be combined as one range of offsets when batch reading from the topic. This results in fewer tasks but can be a performance problem when reading many partitions (default = false).

    batchReadMaxOffsetsPerTask: Maximum number of offsets per Spark task when batch reading from the topic.

    options: Options for the Kafka stream reader (see https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html). These options override connection.options.
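
The following is a minimal instantiation sketch of DatePartitionColumnDef based on the constructor shown above; the column name "dt" and the UTC time zone are illustrative choices, not defaults.

  import io.smartdatalake.workflow.dataobject.DatePartitionColumnDef

  // Extract the Kafka message timestamp into a daily partition column "dt",
  // formatted as yyyyMMdd; only completed partitions are listed (default).
  val dailyPartition = DatePartitionColumnDef(
    colName = "dt",                  // illustrative column name
    timeFormat = "yyyyMMdd",         // java DateTimeFormatter pattern (default)
    timeUnit = "days",               // java ChronoUnit name (default)
    timeZone = Some("UTC"),          // JVM default time zone is used if None
    includeCurrentPartition = false  // default: list only completed partitions
  )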
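
The following is a hedged usage sketch of KafkaTopicDataObject built from the constructor signature above. The id, topic name, connection id and options are hypothetical, and the import paths for DataObjectId, ConnectionId and InstanceRegistry, as well as the no-arg InstanceRegistry constructor, are assumptions that may differ between SmartDataLake versions.

  import io.smartdatalake.config.InstanceRegistry
  import io.smartdatalake.config.SdlConfigObject.{ConnectionId, DataObjectId}
  import io.smartdatalake.workflow.dataobject.{DatePartitionColumnDef, KafkaColumnType, KafkaTopicDataObject}

  // DataObjects take an implicit InstanceRegistry (assumed no-arg constructor).
  implicit val registry: InstanceRegistry = new InstanceRegistry()

  val ordersTopic = KafkaTopicDataObject(
    id = DataObjectId("kafkaOrders"),              // hypothetical DataObject id
    topicName = "orders",                          // hypothetical topic name
    connectionId = ConnectionId("kafkaCon"),       // refers to a Kafka connection defined elsewhere
    keyType = KafkaColumnType.String,              // convert key from binary to string
    valueType = KafkaColumnType.String,            // convert value from binary to string
    selectCols = Seq("key", "value", "timestamp"), // columns selected on read
    datePartitionCol = Some(DatePartitionColumnDef(colName = "dt")),
    batchReadMaxOffsetsPerTask = Some(100000),     // cap offsets per Spark task on batch read
    options = Map("kafka.security.protocol" -> "SASL_SSL") // passed to the Kafka reader, overrides connection.options
  )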

Value Members

  1. object KafkaColumnType extends Enumeration

  2. object KafkaTopicDataObject extends FromConfigFactory[DataObject] with Serializable
