io.smartdatalake.workflow.dataobject
The partition layout defines how partition values are extracted from the path. Use "%<colname>%" as a token to extract the value for a partition column. With "%<colname:regex>%", a regex can be given to limit the search. This is especially useful if no character delimits the last token from the rest of the path, or two tokens from each other.
Overwrite or Append new data.
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
Create an OutputStream for a given path, that the Action can use to write data into.
Delete all data. This is used to implement SaveMode.Overwrite.
Delete given files. This is used to clean up files after they are processed.
This is called after all output streams have been written. It is used, for example, to make sure empty partitions are created as well.
Optional definition of partitions expected to exist. Define a Spark SQL expression that is evaluated against a PartitionValues instance and returns true or false. Default is to expect all partitions to exist.
Extract partition values from a given file path
Returns the factory that can parse this type (that is, type CO).
Typically, implementations of this method should return the companion object of the implementing class. The companion object in turn should implement FromConfigFactory.
the factory (object) for this class.
Definition of fileName. Default is an asterisk to match everything. This is concatenated with the partition layout to search for files.
Handle class cast exception when getting objects from instance registry
List files for given partition values
List of partition values to be filtered. If empty, all files in the root path of the DataObject are listed.
List of FileRefs
Get partition values formatted by the partition layout.
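As a rough illustration of formatting partition values by a layout, the sketch below substitutes each token with its value. `PartitionFormatSketch` and `format` are hypothetical names, partition values are simplified to a `Map[String, String]`, and falling back to `*` for a column without a value is an assumption rather than documented behavior of this DataObject.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the library API.
object PartitionFormatSketch {
  // Matches %colname% and %colname:regex%; group 1 is the column name.
  private val token: Regex = "%([A-Za-z0-9_]+)(?::[^%]+)?%".r

  /** Replace every token in the layout with the value of its column.
    * Assumption: columns without a value become the wildcard "*". */
  def format(layout: String, values: Map[String, String]): String =
    token.replaceAllIn(
      layout,
      m => Regex.quoteReplacement(values.getOrElse(m.group(1), "*"))
    )
}
```

With the layout `%year%/%month%` and only `year` set to `2023`, this sketch yields `2023/*`.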
Method for subclasses to override the base path for this DataObject. This is needed, for instance, if pathPrefix is defined in a connection.
Prepare paths to be searched.
Configure a housekeeping mode to, for example, clean up, archive and compact partitions. Default is None.
A unique identifier for this instance.
List partitions on data object's root path
Additional metadata for the DataObject
The partition layout defines how partition values are extracted from the path. Use "%<colname>%" as a token to extract the value for a partition column. With "%<colname:regex>%", a regex can be given to limit the search. This is especially useful if no character delimits the last token from the rest of the path, or two tokens from each other.
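The token syntax described above can be sketched as a translation of the layout into a regular expression with one capturing group per token. This is only a minimal illustration under stated assumptions: `PartitionLayoutSketch`, `compile` and `extract` are hypothetical names, a token without a regex is assumed to match up to the next `/`, and a custom regex is assumed to contain neither `%` nor its own capturing groups. The library's actual parsing may differ.

```scala
import java.util.regex.Pattern
import scala.util.matching.Regex

// Hypothetical helper, not part of the library API.
object PartitionLayoutSketch {
  // Matches %colname% and %colname:regex%; group 1 = column, group 2 = optional regex.
  private val token: Regex = "%([A-Za-z0-9_]+)(?::([^%]+))?%".r

  /** Turn a layout into a regex plus the column names in token order. */
  def compile(layout: String): (Regex, Seq[String]) = {
    val cols = token.findAllMatchIn(layout).map(_.group(1)).toList
    val sb = new StringBuilder
    var last = 0
    for (m <- token.findAllMatchIn(layout)) {
      // Literal text between tokens is quoted so it matches verbatim.
      sb.append(Pattern.quote(layout.substring(last, m.start)))
      // Assumption: without an explicit regex, a value runs up to the next '/'.
      val valueRegex = Option(m.group(2)).getOrElse("[^/]+")
      sb.append("(").append(valueRegex).append(")")
      last = m.end
    }
    sb.append(Pattern.quote(layout.substring(last)))
    (sb.result().r, cols)
  }

  /** Extract partition values from the start of a path relative to the root. */
  def extract(layout: String, path: String): Option[Map[String, String]] = {
    val (regex, cols) = compile(layout)
    regex.findPrefixMatchOf(path).map { m =>
      cols.zipWithIndex.map { case (c, i) => c -> m.group(i + 1) }.toMap
    }
  }
}
```

The second form matters when nothing separates two tokens: with the layout `%year:\d{4}%%month:\d{2}%`, the path `202307/data.csv` can only be split into `year` and `month` because each regex fixes the length of its value.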
Definition of partition columns
The root path of the files that are handled by this DataObject.
Prepare and test the DataObject's prerequisites.
This runs during the "prepare" operation of the DAG.
Make a given path relative to this DataObject's base path.
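One way to picture this is as stripping the base path prefix from a full path. The sketch below is an assumption-laden illustration: `RelativizeSketch` and `relativize` are hypothetical names, `/` is assumed as the separator, and paths outside the base path are returned unchanged, which may not match the library's behavior.

```scala
// Hypothetical helper, not part of the library API.
object RelativizeSketch {
  /** Strip the base path (and one separator) from the front of a path.
    * Assumption: a path outside the base path is returned as-is. */
  def relativize(basePath: String, path: String, separator: String = "/"): String = {
    val base = basePath.stripSuffix(separator)
    if (path == base) ""
    else if (path.startsWith(base + separator)) path.substring(base.length + separator.length)
    else path
  }
}
```

For a base path of `/data/in`, the full path `/data/in/2023/a.csv` becomes `2023/a.csv`, which is the form a partition layout can then be matched against.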
Overwrite or Append new data.
default separator for paths
This is called before any output stream is created to initialize writing. It is used to apply SaveMode, for example by deleting existing partitions.
Given some FileRef for another DataObject, translate the paths to the root path of this DataObject
Validate the schema of a given Spark DataFrame df: check that it contains the specified partition columns.
The data frame to validate.
role used in exception message. Set to read or write.
SchemaViolationException
if the partition columns are not included.
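The check described above amounts to comparing the DataFrame's column names with the required columns. The following sketch shows that logic without a Spark dependency; `SchemaCheckSketch`, `validateHasColumns` and the local `SchemaViolationException` class are hypothetical stand-ins, the schema is reduced to a list of column names, and the comparison is case-sensitive, which is an assumption.

```scala
// Hypothetical helper, not part of the library API.
object SchemaCheckSketch {
  // Local stand-in for the library's exception type.
  final class SchemaViolationException(msg: String) extends RuntimeException(msg)

  /** Throw if any required column is missing from the schema's column names.
    * `role` is only used in the message, e.g. "read" or "write". */
  def validateHasColumns(schemaColumns: Seq[String], required: Seq[String], role: String): Unit = {
    val missing = required.filterNot(schemaColumns.contains)
    if (missing.nonEmpty)
      throw new SchemaViolationException(
        s"($role) schema is missing required columns: ${missing.mkString(", ")}"
      )
  }
}
```

With Spark available, the column names would typically come from `df.columns`.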
Validate the schema of a given Spark DataFrame df: check that it contains the specified primary key columns.
The data frame to validate.
role used in exception message. Set to read or write.
SchemaViolationException
if the primary key columns are not included.
Connects to SFTP files. Requires the Java library "com.hierynomus % sshj % 0.21.1". The following authentication mechanisms are supported:
-> public/private key: the private key must be saved in ~/.ssh, and the public key must be registered on the server.
-> user/password authentication: user and password are taken from two variables set as parameters. These variables can come from clear text (CLEAR), a file (FILE) or an environment variable (ENV).