trait DataHelpers extends LazyLogging

Helper utilities for reading/writing data from/to different data sources.

Linear Supertypes
LazyLogging, AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def appendTrailer(pathInputData: String, pathInputTrailer: String, pathOutputConcatenated: String, configuration: Configuration): Unit

    Appends trailer data to every single file in the data directory. A single trailer file in the pathInputTrailer directory should correspond to a single data file in the pathInputData directory.

    If a trailer for a given file does not exist, the file is moved as-is to the output directory.

    pathInputData

    Input data files directory

    pathInputTrailer

    Input trailer files directory

    pathOutputConcatenated

    Output concatenated files directory

    configuration

    Hadoop configuration (preferably sparkSession.sparkContext.hadoopConfiguration)
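
    Example (a minimal sketch; the paths are hypothetical placeholders and the call assumes an object mixing in DataHelpers):

      import org.apache.spark.sql.SparkSession

      object helpers extends DataHelpers  // hypothetical mixin to bring the trait's members into scope

      val spark = SparkSession.builder().getOrCreate()
      // Append each trailer file to its matching data file; files without a
      // trailer are moved to the output directory unchanged.
      helpers.appendTrailer(
        pathInputData = "/landing/data",
        pathInputTrailer = "/landing/trailers",
        pathOutputConcatenated = "/landing/concatenated",
        configuration = spark.sparkContext.hadoopConfiguration
      )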

  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native() @HotSpotIntrinsicCandidate()
  7. def concatenate(sources: Seq[String], destination: String, compressToGZip: Boolean = false): Unit

    Method to get data from multiple source paths and combine it into a single destination path.

    sources

    multiple source paths from which to merge the data.

    destination

    destination path to combine all data to.

    compressToGZip

    flag to compress final output file into gzip format
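
    Example (a minimal sketch; source and destination paths are hypothetical, and the call assumes a DataHelpers mixin in scope):

      // Merge two staged part files into a single gzip-compressed output file.
      concatenate(
        sources = Seq("/stage/part-000", "/stage/part-001"),
        destination = "/final/combined.gz",
        compressToGZip = true
      )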

  8. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  10. def executeNonSelectSQLQueries(sqlList: Seq[String], dbConnection: Connection): Unit
  11. def ftpTo(remoteHost: String, userName: String, password: String, sourceFile: String, destFile: String, retryFailures: Boolean, retryCount: Int, retryPauseSecs: Int, mode: String, psCmd: String): (Boolean, Boolean, String, String)
  12. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  13. def getEmptyLogDataFrame(sparkSession: SparkSession): DataFrame

    Method to get an empty DataFrame with the Ab Initio log schema below.

    record string("|") node, timestamp, component, subcomponent, event_type; string("|\n") event_text; end
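
    Example (a minimal sketch; the StructType shown is an assumed Spark equivalent of the record format above):

      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      // Assumed mapping: all six pipe-delimited fields become string columns.
      val logSchema = StructType(Seq(
        StructField("node", StringType),
        StructField("timestamp", StringType),
        StructField("component", StringType),
        StructField("subcomponent", StringType),
        StructField("event_type", StringType),
        StructField("event_text", StringType)
      ))

      val emptyLogs = getEmptyLogDataFrame(spark)  // assumes a DataHelpers mixin and a SparkSession named spark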

  14. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  15. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  16. def loadBinaryFileAsBinaryDataFrame(filePath: String, lineDelimiter: String = "\n", minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame
  17. def loadBinaryFileAsStringDataFrame(filePath: String, lineDelimiter: String = "\n", charSetEncoding: String = "Cp1047", minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame
  18. def loadFixedWindowBinaryFileAsDataFrame(filePath: String, lineLength: Int, minPartition: Int = 1, rowName: String = "line", spark: SparkSession): DataFrame
  19. lazy val logger: Logger
    Attributes
    protected
    Definition Classes
    LazyLogging
    Annotations
    @transient()
  20. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  21. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  22. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  23. def readHiveTable(spark: SparkSession, database: String, table: String, partition: String = ""): DataFrame

    Method to read data from a Hive table.

    spark

    spark session

    database

    hive database

    table

    hive table.

    partition

    hive table partition to read data from, if provided.

    returns

    dataframe with data read from Hive Table.
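
    Example (a minimal sketch; the database, table, and partition values are hypothetical, and the "key=value" partition format is an assumption):

      // Read a single partition of a Hive table; pass partition = "" (the default)
      // to read the whole table.
      val sales = readHiveTable(
        spark = spark,
        database = "retail_db",
        table = "daily_sales",
        partition = "sale_date=2024-01-01"
      )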

  24. def readHiveTableInChunks(spark: SparkSession, database: String, table: String, partitionKey: String, partitionValue: String): DataFrame

    Reads a full Hive table partition by reading every subpartition separately and performing a union on all the resulting DataFrames.

    This function is meant to temporarily work around the Hive metastore crashing when too many partitions are queried at the same time.

    spark

    spark session

    database

    hive database name

    table

    hive table name

    partitionKey

    top-level partition's key

    partitionValue

    top-level partition's value

    returns

    A complete DataFrame with the selected hive table partition
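
    Example (a minimal sketch; database, table, and partition names are hypothetical):

      // Read the run_date=2024-01-01 top-level partition one subpartition at a
      // time and union the results, avoiding a single oversized metastore query.
      val events = readHiveTableInChunks(
        spark = spark,
        database = "analytics_db",
        table = "events",
        partitionKey = "run_date",
        partitionValue = "2024-01-01"
      )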

  25. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  26. def toString(): String
    Definition Classes
    AnyRef → Any
  27. def unionAll(df: DataFrame*): DataFrame

    Method to take the union of all passed dataframes.

    df

    dataframes to take the union of.

    returns

    union of all passed input dataframes.
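
    Example (a minimal sketch; the input DataFrames are hypothetical and assumed to share the same schema):

      // Combine monthly extracts into a single DataFrame.
      val fullYear = unionAll(janDf, febDf, marDf)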

  28. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  29. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  30. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  31. def writeDataFrame(df: DataFrame, path: String, spark: SparkSession, props: Map[String, String], format: String, partitionColumns: List[String] = Nil, bucketColumns: List[String] = Nil, numBuckets: Option[Int] = None, sortColumns: List[String] = Nil, tableName: Option[String] = None, databaseName: Option[String] = None): Unit

    Method to write the data passed in a dataframe in a specific file format.

    df

    dataframe containing data.

    path

    path to write data to.

    spark

    spark session.

    props

    underlying data source specific properties.

    format

    file format in which to persist data. Supported file formats are csv, text, json, parquet, orc

    partitionColumns

    columns to be used for partitioning.

    bucketColumns

    used to bucket the output by the given columns. If specified, the output is laid out on the file-system similar to Hive's bucketing scheme.

    numBuckets

    number of buckets to be used.

    sortColumns

    columns on which to order data while persisting.

    tableName

    table name for persisting data.

    databaseName

    database name for persisting data.
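
    Example (a minimal sketch; the path, column, table, and database names are hypothetical, and the call assumes a DataHelpers mixin in scope):

      // Persist a DataFrame as Parquet, partitioned by order_date and bucketed
      // by customer_id, and register it as sales_db.orders.
      writeDataFrame(
        df = ordersDf,
        path = "/warehouse/sales_db/orders",
        spark = spark,
        props = Map.empty[String, String],
        format = "parquet",
        partitionColumns = List("order_date"),
        bucketColumns = List("customer_id"),
        numBuckets = Some(8),
        sortColumns = List("customer_id"),
        tableName = Some("orders"),
        databaseName = Some("sales_db")
      )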

  32. lazy val write_to_log: UserDefinedFunction

    UDF to write logging parameters to log port.

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] ) @Deprecated
    Deprecated
