Class PostgresExtractor

Package com.coxautodata.waimak.rdbm.ingestion

class PostgresExtractor extends RDBMExtractor

Created by Vicky Avison on 27/04/18.

Linear Supertypes
RDBMExtractor, Logging, AnyRef, Any

Instance Constructors

  1. new PostgresExtractor(sparkSession: SparkSession, connectionDetails: PostgresConnectionDetails, extraConnectionProperties: Properties = new Properties(), transformTableNameForRead: (String) ⇒ String = identity)

    transformTableNameForRead

    How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them all to be written to the same target table.
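The transformTableNameForRead parameter is a plain String ⇒ String function. A minimal hypothetical sketch (the table names and suffix convention are invented for illustration, not from the library):

```scala
// Hypothetical sketch: map a logical target table name onto the
// differently-named database table it should be read from.
val transformTableNameForRead: String => String =
  targetName => if (targetName == "orders") "orders_eu" else targetName

transformTableNameForRead("orders")    // "orders_eu"
transformTableNameForRead("customers") // "customers"
```

Passing `identity` (the default) reads each target table from a database table of the same name.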

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. lazy val allTablePKs: Map[String, String]
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  7. val connectionDetails: PostgresConnectionDetails

    Definition Classes
    PostgresExtractor → RDBMExtractor
  8. def connectionProperties: Properties

    Attributes
    protected
    Definition Classes
    RDBMExtractor
  9. def driverClass: String

    The JDBC driver to use for this RDBM.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  10. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  12. def escapeKeyword(identifier: String): String

    Escape a keyword (for use in a query), e.g. SQLServer uses [], Postgres uses "".

    returns

    the escaped keyword

    Definition Classes
    PostgresExtractor → RDBMExtractor
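A minimal sketch of the two escaping styles mentioned above, assuming the escaping is a simple wrap of the identifier (the actual implementation may handle edge cases such as embedded quotes differently):

```scala
// Hypothetical sketches of per-dialect keyword escaping.
def escapePostgres(identifier: String): String = "\"" + identifier + "\""
def escapeSqlServer(identifier: String): String = s"[$identifier]"

escapePostgres("user")  // "user" becomes "\"user\""
escapeSqlServer("user") // "user" becomes "[user]"
```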
  13. val extraConnectionProperties: Properties

    JDBC connection properties.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  14. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def fromQueryPart(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp]): String

    Attributes
    protected
    Definition Classes
    RDBMExtractor
  16. def generateSplitPredicates(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], maxRowsPerPartition: Int): Option[Array[String]]

    Generates predicates which are used to form the partitions of the read Dataset. Queries the table to work out the primary key boundary points to use, so that each partition will contain at most maxRowsPerPartition rows.

    tableMetadata

    the table metadata

    lastUpdated

    the last updated timestamp from which we wish to read data

    maxRowsPerPartition

    the maximum number of rows we want in each partition

    returns

    If the Dataset will have fewer rows than maxRowsPerPartition then None, otherwise predicates to use in order to create the partitions, e.g. "id >= 5 and id < 7"

    Definition Classes
    RDBMExtractor
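The shape of the generated predicates can be illustrated with a Spark-free sketch: given sorted primary-key boundary points (invented here), emit half-open range predicates like the "id >= 5 and id < 7" example above. This is an assumption about the output format, not the library's actual implementation:

```scala
// Hypothetical sketch: turn sorted boundary points into partition predicates.
// Boundaries (5, 7, 9) over pk column "id" yield four half-open ranges,
// together covering every row exactly once.
def splitPredicates(pk: String, boundaries: Seq[Long]): Array[String] = {
  val lower = boundaries.head
  val upper = boundaries.last
  val middle = boundaries.sliding(2).collect {
    case Seq(lo, hi) => s"$pk >= $lo and $pk < $hi"
  }.toArray
  (s"$pk < $lower" +: middle) :+ s"$pk >= $upper"
}

splitPredicates("id", Seq(5L, 7L, 9L))
// Array("id < 5", "id >= 5 and id < 7", "id >= 7 and id < 9", "id >= 9")
```

Each predicate string becomes one partition of the JDBC read, so no partition spans more rows than lie between two adjacent boundary points.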
  17. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  18. final def getTableDataset(meta: Map[String, String], lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int] = None, forceFullLoad: Boolean = false): Dataset[_]

    Creates a Dataset for the given table containing data which was updated on or after the provided timestamp.

    meta

    the table metadata

    lastUpdated

    the last updated for the table (if None, then we read everything)

    maxRowsPerPartition

    Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.

    forceFullLoad

    If set to true, ignore the last updated and read everything

    returns

    a Dataset for the given table

    Definition Classes
    RDBMExtractor
  19. def getTableMetadata(dbSchemaName: String, tableName: String, primaryKeys: Option[Seq[String]], lastUpdatedColumn: Option[String], retainStorageHistory: (Option[String]) ⇒ Boolean): Try[AuditTableInfo]

    Subclasses of RDBMExtractor must implement this method, which tries to get whatever metadata information it can from the database, and uses the optional provided values for pks and lastUpdated if it cannot get them from the database.

    This differs from the other getTableMetadata overload in the retainStorageHistory parameter: it takes a function which, given an optional lastUpdated column, returns whether or not to retain storage history for this table. Implementations should call this function to get the value needed by the AuditTableInfo.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  20. def getTableMetadata(dbSchemaName: String, tableName: String, primaryKeys: Option[Seq[String]], lastUpdatedColumn: Option[String], forceRetainStorageHistory: Option[Boolean]): Try[AuditTableInfo]

    Tries to get whatever metadata information it can from the database. Uses the optional provided values for pks and lastUpdated if it cannot get them from the database.

    dbSchemaName

    the database schema name

    tableName

    the table name

    primaryKeys

    Optionally, the primary keys for this table (not needed if this extractor can discover primary keys itself)

    lastUpdatedColumn

    Optionally, the last updated column for this table (not needed if this extractor can discover last updated columns itself). If this is undefined and this extractor does not discover a last updated column for the table, then this table will be extracted in full every time

    forceRetainStorageHistory

    Optionally specify whether to retain history for this table in the storage layer. Setting this to anything other than None will override the default behaviour, which is:

    • if there is a lastUpdated column found or specified, retain all history for this table
    • if there is no lastUpdated column, don't retain history for this table (history is removed when the table is compacted). This default was chosen because, without a lastUpdated column, the table is extracted in full every time extraction is performed, causing the size of the data in storage to grow uncontrollably
    returns

    Success[AuditTableInfo] if all required metadata was either found or provided by the user; Failure if required metadata was neither found nor provided by the user, or if the metadata provided differed from the metadata found in the database

    Definition Classes
    RDBMExtractor
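The default/override behaviour described for forceRetainStorageHistory reduces to a small pure function. A sketch (the function name is illustrative, not from the library):

```scala
// Hypothetical sketch of the retain-storage-history decision:
// an explicit Some(...) override wins; otherwise retain history
// only when a lastUpdated column was found or specified.
def retainStorageHistory(lastUpdatedColumn: Option[String],
                         forceRetainStorageHistory: Option[Boolean]): Boolean =
  forceRetainStorageHistory.getOrElse(lastUpdatedColumn.isDefined)

retainStorageHistory(Some("last_updated"), None) // true: incremental extracts, keep history
retainStorageHistory(None, None)                 // false: full extract each time
retainStorageHistory(None, Some(true))           // true: explicit override
```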
  21. def getTablePKs(dbSchemaName: String, tableName: String): Option[Seq[String]]
  22. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  23. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  24. def isTraceEnabled(): Boolean

    Attributes
    protected
    Definition Classes
    Logging
  25. def loadDataset(meta: Map[String, String], lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int]): (Dataset[_], Column)

    Creates a Dataset for the given table containing data which was updated on or after the provided timestamp. Override this if required, or if you wish to use a different metadata class than TableExtractionMetadata.

    meta

    the table metadata

    lastUpdated

    the last updated timestamp from which we wish to read data (if None, then we read everything)

    maxRowsPerPartition

    Optionally, the maximum number of rows to be read per Dataset partition for this table. This number will be used to generate predicates to be passed to org.apache.spark.sql.SparkSession.read.jdbc. If this is not set, the DataFrame will only have one partition, which could result in memory issues when extracting large tables. Be careful not to create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. You can also control the maximum number of JDBC connections to open by limiting the number of executors for your application.

    returns

    (Dataset for the given table, Column to use as the last updated)

    Definition Classes
    PostgresExtractor → RDBMExtractor
  26. def logAndReturn[A](a: A, msg: String, level: Level): A

    Definition Classes
    Logging
  27. def logAndReturn[A](a: A, message: (A) ⇒ String, level: Level): A

    Definition Classes
    Logging
  28. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  29. def logDebug(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  30. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  31. def logError(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  32. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  33. def logInfo(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  34. def logName: String

    Attributes
    protected
    Definition Classes
    Logging
  35. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  36. def logTrace(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  37. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Attributes
    protected
    Definition Classes
    Logging
  38. def logWarning(msg: ⇒ String): Unit

    Attributes
    protected
    Definition Classes
    Logging
  39. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  40. final def notify(): Unit

    Definition Classes
    AnyRef
  41. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  42. val pkQuery: String
  43. def rdbmRecordLastUpdatedColumn: String

    The name given to the column used as the last updated in the output DataFrames (in some cases this will come from the provided last updated column, in others it will be the system timestamp).

    Definition Classes
    RDBMExtractor
  44. def resolveLastUpdatedColumn(tableMetadata: ExtractionMetadata, sparkSession: SparkSession): Column

    Definition Classes
    RDBMExtractor
  45. def selectQuery(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], explicitColumnSelects: Seq[String]): String

    Generate a query to select from the given table.

    tableMetadata

    the metadata for the table

    lastUpdated

    the last updated timestamp from which we wish to read data

    explicitColumnSelects

    any additional columns which need to be specified on read (which won't be picked up by select *), e.g. HIDDEN fields

    returns

    a query which selects from the given table

    Definition Classes
    RDBMExtractor
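A Spark-free sketch of what such a query might look like, assuming a simple select * plus the explicit columns and an optional incremental filter. The column names, timestamp format, and helper name are invented for illustration; the real query building lives in RDBMExtractor:

```scala
// Hypothetical sketch: build a select query with explicit extra columns
// (not covered by select *) and an optional last-updated filter.
def selectQuery(table: String,
                lastUpdatedCol: Option[String],
                lastUpdated: Option[String],
                explicitColumnSelects: Seq[String]): String = {
  val extra = explicitColumnSelects.map(c => s", $c").mkString
  // Only filter when both a last-updated column and a timestamp exist.
  val where = (for (col <- lastUpdatedCol; ts <- lastUpdated)
    yield s" where $col > '$ts'").getOrElse("")
  s"select *$extra from $table$where"
}

selectQuery("meta", Some("modified"), Some("2018-04-27 00:00:00"), Seq("hidden_col"))
// select *, hidden_col from meta where modified > '2018-04-27 00:00:00'
```

When lastUpdated is None, the where clause is omitted and the table is read in full, matching the full-extraction behaviour described above.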
  46. def sourceDBSystemTimestampFunction: String

    The function to use to get the system timestamp in the database.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  47. def sparkLoad(tableMetadata: ExtractionMetadata, lastUpdated: Option[Timestamp], maxRowsPerPartition: Option[Int], explicitColumnSelects: Seq[String] = Seq.empty): Dataset[_]

    Creates a Spark Dataset for the table.

    returns

    a Spark Dataset for the table

    Definition Classes
    RDBMExtractor
  48. val sparkSession: SparkSession

    Definition Classes
    PostgresExtractor → RDBMExtractor
  49. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  50. def systemTimestampColumnName: String

    The name given to the column containing the system timestamp in the output DataFrames.

    Definition Classes
    RDBMExtractor
  51. def toString(): String

    Definition Classes
    AnyRef → Any
  52. val transformTableNameForRead: (String) ⇒ String

    How to transform the target table name into the table name in the database if the two are different. Useful if you have multiple tables representing the same thing but with different names, and you wish them all to be written to the same target table.

    Definition Classes
    PostgresExtractor → RDBMExtractor
  53. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  54. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  55. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
