Class PostgresExtractor

Package com.coxautodata.waimak.rdbm.ingestion

class PostgresExtractor extends RDBMExtractor

Created by Vicky Avison on 27/04/18.

Linear Supertypes
RDBMExtractor, Logging, AnyRef, Any

Instance Constructors

  1. new PostgresExtractor(sparkSession: SparkSession, connectionDetails: PostgresConnectionDetails, extraConnectionProperties: Properties = new Properties(), transformTableNameForRead: (String) ⇒ String = identity)

    transformTableNameForRead

    How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them all to be written to the same target table.
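The transformTableNameForRead parameter is a plain String ⇒ String function. A minimal hypothetical sketch (the table names and suffix convention are invented for illustration, not from the library):

```scala
// Hypothetical sketch: map a logical target table name onto the
// differently-named database table it should be read from.
val transformTableNameForRead: String => String =
  targetName => if (targetName == "orders") "orders_eu" else targetName

transformTableNameForRead("orders")    // "orders_eu"
transformTableNameForRead("customers") // "customers"
```

Passing `identity` (the default) reads each target table from a database table of the same name.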

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. lazy val allTablePKs: Map[String, String]
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. val connectionDetails: PostgresConnectionDetails

    Definition Classes
    PostgresExtractor → RDBMExtractor
  8. def connectionProperties: Properties

    Attributes
    protected
    Definition Classes
    RDBMExtractor
  9. def driverClass: String

    The JDBC driver to use for this RDBM.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  10. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  12. def escapeKeyword(identifier: String): String

    Escape a keyword (for use in a query), e.g. SQLServer uses [], Postgres uses "".

    returns

    the escaped keyword

    Definition Classes
    PostgresExtractor → RDBMExtractor
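A minimal sketch of the two escaping styles mentioned above, assuming the escaping is a simple wrap of the identifier (the actual implementation may handle edge cases such as embedded quotes differently):

```scala
// Hypothetical sketches of per-dialect keyword escaping.
def escapePostgres(identifier: String): String = "\"" + identifier + "\""
def escapeSqlServer(identifier: String): String = s"[$identifier]"

escapePostgres("user")  // "user" becomes "\"user\""
escapeSqlServer("user") // "user" becomes "[user]"
```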
  13. val extraConnectionProperties: Properties

    JDBC connection properties.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  14. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def fromQueryPart(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp]): String

    Attributes
    protected
    Definition Classes
    RDBMExtractor
  16. def generateSplitPredicates(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], maxRowsPerPartition: Int): Option[Array[String]]

    Generates predicates which are used to form the partitions of the read Dataset. Queries the table to work out the primary key boundary points to use, so that each partition will contain at most maxRowsPerPartition rows.

    tableMetadata

    the table metadata

    lastUpdated

    the last updated timestamp from which we wish to read data

    maxRowsPerPartition

    the maximum number of rows we want in each partition

    returns

    If the Dataset will have fewer rows than maxRowsPerPartition then None, otherwise predicates to use in order to create the partitions, e.g. "id >= 5 and id < 7"

    Definition Classes
    RDBMExtractor
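The shape of the generated predicates can be illustrated with a Spark-free sketch: given sorted primary-key boundary points (invented here), emit half-open range predicates like the "id >= 5 and id < 7" example above. This is an assumption about the output format, not the library's actual implementation:

```scala
// Hypothetical sketch: turn sorted boundary points into partition predicates.
// Boundaries (5, 7, 9) over pk column "id" yield four half-open ranges,
// together covering every row exactly once.
def splitPredicates(pk: String, boundaries: Seq[Long]): Array[String] = {
  val lower = boundaries.head
  val upper = boundaries.last
  val middle = boundaries.sliding(2).collect {
    case Seq(lo, hi) => s"$pk >= $lo and $pk < $hi"
  }.toArray
  (s"$pk < $lower" +: middle) :+ s"$pk >= $upper"
}

splitPredicates("id", Seq(5L, 7L, 9L))
// Array("id < 5", "id >= 5 and id < 7", "id >= 7 and id < 9", "id >= 9")
```

Each predicate string becomes one partition of the JDBC read, so no partition spans more rows than lie between two adjacent boundary points.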
  17. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  18. final def getTableDataset(meta: Map[String, String], lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int] = None, forceFullLoad: Boolean = false): Dataset[_]

    Creates a Dataset for the given table containing data which was updated on or after the provided timestamp.

    meta

    the table metadata

    lastUpdated

    the last updated for the table (if None, then we read everything)

    maxRowsPerPartition

    Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.

    forceFullLoad

    If set to true, ignore the last updated and read everything

    returns

    a Dataset for the given table

    Definition Classes
    RDBMExtractor
  19. def getTableMetadata(dbSchemaName: String, tableName: String, primaryKeys: Option[Seq[String]], lastUpdatedColumn: Option[String], retainStorageHistory: (Option[String]) ⇒ Boolean): Try[AuditTableInfo]

    Subclasses of RDBMExtractor must implement this method, which tries to get whatever metadata information it can from the database, and uses the optional provided values for pks and lastUpdated if it cannot get them from the database.

    This differs from the other getTableMetadata overload in the retainStorageHistory parameter: it takes a function which, given an optional lastUpdated column, returns whether or not to retain storage history for this table. Implementations should call this function to get the value needed by the AuditTableInfo.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  20. def getTableMetadata(dbSchemaName: String, tableName: String, primaryKeys: Option[Seq[String]], lastUpdatedColumn: Option[String], forceRetainStorageHistory: Option[Boolean]): Try[AuditTableInfo]

    Tries to get whatever metadata information it can from the database. Uses the optional provided values for pks and lastUpdated if it cannot get them from the database.

    dbSchemaName

    the database schema name

    tableName

    the table name

    primaryKeys

    Optionally, the primary keys for this table (not needed if this extractor can discover primary keys itself)

    lastUpdatedColumn

    Optionally, the last updated column for this table (not needed if this extractor can discover last updated columns itself). If this is undefined and this extractor does not discover a last updated column for the table, then this table will be extracted in full every time

    forceRetainStorageHistory

    Optionally specify whether to retain history for this table in the storage layer. Setting this to anything other than None will override the default behaviour, which is:

    • if there is a lastUpdated column found or specified, retain all history for this table
    • if there is no lastUpdated column, don't retain history for this table (history is removed when the table is compacted). This default was chosen because, without a lastUpdated column, the table is extracted in full every time extraction is performed, causing the size of the data in storage to grow uncontrollably
    returns

    Success[AuditTableInfo] if all required metadata was either found or provided by the user; Failure if required metadata was neither found nor provided by the user, or if the metadata provided differed from the metadata found in the database

    Definition Classes
    RDBMExtractor
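The default/override behaviour described for forceRetainStorageHistory reduces to a small pure function. A sketch (the function name is illustrative, not from the library):

```scala
// Hypothetical sketch of the retain-storage-history decision:
// an explicit Some(...) override wins; otherwise retain history
// only when a lastUpdated column was found or specified.
def retainStorageHistory(lastUpdatedColumn: Option[String],
                         forceRetainStorageHistory: Option[Boolean]): Boolean =
  forceRetainStorageHistory.getOrElse(lastUpdatedColumn.isDefined)

retainStorageHistory(Some("last_updated"), None) // true: incremental extracts, keep history
retainStorageHistory(None, None)                 // false: full extract each time
retainStorageHistory(None, Some(true))           // true: explicit override
```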
  21. def getTablePKs(dbSchemaName: String, tableName: String): Option[Seq[String]]
  22. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  23. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  24. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  25. def loadDataset(meta: Map[String, String], lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int]): (Dataset[_], Column)

    Creates a Dataset for the given table containing data which was updated on or after the provided timestamp. Override this if required, or if you wish to use a different metadata class than TableExtractionMetadata.

    meta

    the table metadata

    lastUpdated

    the last updated timestamp from which we wish to read data (if None, then we read everything)

    maxRowsPerPartition

    Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.

    returns

    (Dataset for the given table, Column to use as the last updated)

    Definition Classes
    PostgresExtractor → RDBMExtractor
  26. def logAndReturn[A](a: A, msg: String, level: Level): A

    Definition Classes
    Logging
  27. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A

    Definition Classes
    Logging
  28. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  33. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  34. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  35. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  36. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  37. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  38. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  39. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  40. final def notify(): Unit

    Definition Classes
    AnyRef
  41. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  42. val pkQuery: String
  43. def rdbmRecordLastUpdatedColumn: String

    The name given to the column used as the last updated in the output DataFrames (in some cases this will come from the provided last updated column, in others it will be the system timestamp).

    Definition Classes
    RDBMExtractor
  44. def resolveLastUpdatedColumn(tableMetadata: ExtractionMetadata, sparkSession: SparkSession): Column

    Definition Classes
    RDBMExtractor
  45. def selectQuery(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], explicitColumnSelects: Seq[String]): String

    Generate a query to select from the given table.

    tableMetadata

    the metadata for the table

    lastUpdated

    the last updated timestamp from which we wish to read data

    explicitColumnSelects

    any additional columns which need to be specified on read (which won't be picked up by select *), e.g. HIDDEN fields

    returns

    a query which selects from the given table

    Definition Classes
    RDBMExtractor
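A Spark-free sketch of what such a query might look like, assuming a simple select * plus the explicit columns and an optional incremental filter. The column names, timestamp format, and helper name are invented for illustration; the real query building lives in RDBMExtractor:

```scala
// Hypothetical sketch: build a select query with explicit extra columns
// (not covered by select *) and an optional last-updated filter.
def selectQuery(table: String,
                lastUpdatedCol: Option[String],
                lastUpdated: Option[String],
                explicitColumnSelects: Seq[String]): String = {
  val extra = explicitColumnSelects.map(c => s", $c").mkString
  // Only filter when both a last-updated column and a timestamp exist.
  val where = (for (col <- lastUpdatedCol; ts <- lastUpdated)
    yield s" where $col > '$ts'").getOrElse("")
  s"select *$extra from $table$where"
}

selectQuery("meta", Some("modified"), Some("2018-04-27 00:00:00"), Seq("hidden_col"))
// select *, hidden_col from meta where modified > '2018-04-27 00:00:00'
```

When lastUpdated is None, the where clause is omitted and the table is read in full, matching the full-extraction behaviour described above.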
  46. def sourceDBSystemTimestampFunction: String

    The function to use to get the system timestamp in the database.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  47. def sparkLoad(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int], explicitColumnSelects: Seq[String] = Seq.empty): Dataset[_]

    Creates a Spark Dataset for the table.

    returns

    a Spark Dataset for the table

    Definition Classes
    RDBMExtractor
  48. val sparkSession: SparkSession

    Definition Classes
    PostgresExtractor → RDBMExtractor
  49. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  50. def systemTimestampColumnName: String

    The name given to the column containing the system timestamp in the output DataFrames.

    Definition Classes
    RDBMExtractor
  51. def toString(): String

    Definition Classes
    AnyRef → Any
  52. val transformTableNameForRead: (String) ⇒ String

    How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them all to be written to the same target table.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  53. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  54. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  55. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
