Package io.smartdatalake.workflow

package dataobject

Type Members

  1. case class DatePartitionColumnDef(colName: String, timeFormat: String = "yyyyMMdd", timeUnit: String = "days", timeZone: Option[String] = None, includeCurrentPartition: Boolean = false) extends Product with Serializable

    Definition of a date partition column used to extract a formatted timestamp into a column (see the instantiation sketch after this list).

    colName: Name of the date partition column into which the formatted time is extracted on batch read.

    timeFormat: Time format for the timestamp in the date partition column, defined according to java DateTimeFormatter. Default is "yyyyMMdd".

    timeUnit: Time unit for the timestamp in the date partition column, defined according to java ChronoUnit. Default is "days".

    timeZone: Time zone used for date logic. If not specified, the Java system default is used.

    includeCurrentPartition: Whether the current partition should be included. Default is to list only completed partitions. Attention: including the current partition might result in data loss if more data arrives later, but it can be useful for exporting all data before scheduled maintenance.

  2. case class KafkaTopicDataObject(id: DataObjectId, topicName: String, connectionId: ConnectionId, keyType: KafkaColumnType = KafkaColumnType.String, valueType: KafkaColumnType = KafkaColumnType.String, schemaMin: Option[StructType] = None, selectCols: Seq[String] = Seq("key", "value"), datePartitionCol: Option[DatePartitionColumnDef] = None, batchReadConsecutivePartitionsAsRanges: Boolean = false, batchReadMaxOffsetsPerTask: Option[Int] = None, options: Map[String, String] = Map(), metadata: Option[DataObjectMetadata] = None)(implicit instanceRegistry: InstanceRegistry) extends DataObject with CanCreateDataFrame with CanCreateStreamingDataFrame with CanWriteDataFrame with CanHandlePartitions with SchemaValidation with Product with Serializable

    DataObject of type KafkaTopic. Provides details for an action to read from Kafka topics using either org.apache.spark.sql.DataFrameReader or org.apache.spark.sql.streaming.DataStreamReader (see the usage sketch after this list).

    topicName: The name of the topic to read.

    keyType: Optional type the key column should be converted to. If none is given, it remains a byte array / binary.

    valueType: Optional type the value column should be converted to. If none is given, it remains a byte array / binary.

    schemaMin: An optional, minimal schema that this DataObject must have to pass schema validation on reading and writing.

    selectCols: Columns to be selected when reading the DataFrame. Available columns are key, value, topic, partition, offset, timestamp, timestampType. If keyType/valueType is AvroSchemaRegistry, the key/value column is converted to a complex type according to the Avro schema; to expand it, select "value.*". Default is to select key and value.

    datePartitionCol: Definition of a date partition column to extract a formatted timestamp into a column. This is used to list existing partitions and is added as an additional column on batch read.

    batchReadConsecutivePartitionsAsRanges: Set to true if consecutive partitions should be combined as one range of offsets when batch reading from the topic. This results in fewer tasks but can be a performance problem when reading many partitions (default = false).

    batchReadMaxOffsetsPerTask: Maximum number of offsets per Spark task when batch reading from the topic.

    options: Options for the Kafka stream reader (see https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html). These options override connection.options.
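
The following is a minimal instantiation sketch of DatePartitionColumnDef based on the constructor shown above; the column name "dt" and the UTC time zone are illustrative choices, not defaults.

  import io.smartdatalake.workflow.dataobject.DatePartitionColumnDef

  // Extract the Kafka message timestamp into a daily partition column "dt",
  // formatted as yyyyMMdd; only completed partitions are listed (default).
  val dailyPartition = DatePartitionColumnDef(
    colName = "dt",                  // illustrative column name
    timeFormat = "yyyyMMdd",         // java DateTimeFormatter pattern (default)
    timeUnit = "days",               // java ChronoUnit name (default)
    timeZone = Some("UTC"),          // JVM default time zone is used if None
    includeCurrentPartition = false  // default: list only completed partitions
  )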
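
The following is a hedged usage sketch of KafkaTopicDataObject built from the constructor signature above. The id, topic name, connection id and options are hypothetical, and the import paths for DataObjectId, ConnectionId and InstanceRegistry, as well as the no-arg InstanceRegistry constructor, are assumptions that may differ between SmartDataLake versions.

  import io.smartdatalake.config.InstanceRegistry
  import io.smartdatalake.config.SdlConfigObject.{ConnectionId, DataObjectId}
  import io.smartdatalake.workflow.dataobject.{DatePartitionColumnDef, KafkaColumnType, KafkaTopicDataObject}

  // DataObjects take an implicit InstanceRegistry (assumed no-arg constructor).
  implicit val registry: InstanceRegistry = new InstanceRegistry()

  val ordersTopic = KafkaTopicDataObject(
    id = DataObjectId("kafkaOrders"),              // hypothetical DataObject id
    topicName = "orders",                          // hypothetical topic name
    connectionId = ConnectionId("kafkaCon"),       // refers to a Kafka connection defined elsewhere
    keyType = KafkaColumnType.String,              // convert key from binary to string
    valueType = KafkaColumnType.String,            // convert value from binary to string
    selectCols = Seq("key", "value", "timestamp"), // columns selected on read
    datePartitionCol = Some(DatePartitionColumnDef(colName = "dt")),
    batchReadMaxOffsetsPerTask = Some(100000),     // cap offsets per Spark task on batch read
    options = Map("kafka.security.protocol" -> "SASL_SSL") // passed to the Kafka reader, overrides connection.options
  )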

Value Members

  1. object KafkaColumnType extends Enumeration

  2. object KafkaTopicDataObject extends FromConfigFactory[DataObject] with Serializable
