implicit class ExtendedDataFrameGlobal extends ExtendedDataFrame
Value Members
- final def !=(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- final def ##(): Int
  - Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- final def asInstanceOf[T0]: T0
  - Definition Classes: Any
- def breakAndWriteDataFrameForOutputFile(outputColumns: Seq[String], fileColumnName: String, format: String, delimiter: Option[String] = None): Unit
  Breaks the input dataframe into multiple dataframes, one per unique value of the fileColumnName column, and persists each dataframe to its corresponding output file. A sketch of the pattern follows below.
  - Definition Classes: ExtendedDataFrame
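A minimal sketch of this break-and-write pattern; the output path and the "delimiter" option key are assumptions for illustration, since the page does not document the actual destinations:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: write one output per distinct value of fileColumnName.
// The path "/tmp/output" and the "delimiter" option key are assumptions.
def breakAndWrite(df: DataFrame, outputColumns: Seq[String], fileColumnName: String,
                  format: String, delimiter: Option[String]): Unit =
  df.select(fileColumnName).distinct().collect().map(_.get(0)).foreach { name =>
    val writer = df.filter(col(fileColumnName) === name)
      .select(outputColumns.map(col): _*)
      .write.mode("overwrite")
    delimiter.fold(writer)(d => writer.option("delimiter", d))
      .format(format)
      .save(s"/tmp/output/$name")
  }
```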
- def clone(): AnyRef
  - Attributes: protected[lang]
  - Definition Classes: AnyRef
  - Annotations: @throws( ... ) @native() @HotSpotIntrinsicCandidate()
- def collectDataFrameColumnsToApplyFilter(columnList: List[String], filterSourceDataFrame: DataFrame): DataFrame
  Collects the values of the columnList columns from filterSourceDataFrame and uses them to filter the caller DataFrame, as sketched below.
  - Definition Classes: ExtendedDataFrame
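A hedged sketch of this collect-and-filter shape, assuming an isin-based filter per column (the real method's semantics across multiple columns may differ):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: gather the distinct values of each listed column from the
// filter source, then keep only the caller's rows whose values appear there.
def collectAndFilter(caller: DataFrame, columnList: List[String],
                     filterSource: DataFrame): DataFrame =
  columnList.foldLeft(caller) { (df, c) =>
    val values = filterSource.select(c).distinct().collect().map(_.get(0))
    df.filter(col(c).isin(values: _*))
  }
```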
- def compareRecords(otherDataFrame: DataFrame, componentName: String, limit: Int, spark: SparkSession): DataFrame
  Implements the logic of the Compare Records Ab Initio component. Its functioning is as follows (see the sketch after this list):
  1. It adds an incremental sequence number to both input dataframes and joins them on that number.
  2. It compares all records of both input dataframes and counts the mismatching records.
  3. If the mismatch count exceeds limit, it throws an error to terminate workflow execution; otherwise it returns a dataframe with the mismatch-count report.
  - Definition Classes: ExtendedDataFrame
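A minimal sketch of steps 1 and 2, assuming row_number over a global window for the sequence and a null-safe column comparison; the "_seq" name and the report format are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: number the rows on both sides, join on that sequence
// number, and count rows where any column differs (null-safe).
def countMismatches(left: DataFrame, right: DataFrame): Long = {
  val w = Window.orderBy(monotonically_increasing_id())
  val l = left.withColumn("_seq", row_number().over(w))
  val r = right.withColumn("_seq", row_number().over(w))
  val joined = l.as("l").join(r.as("r"), "_seq")
  val anyColumnDiffers = left.columns
    .map(c => not(col(s"l.$c") <=> col(s"r.$c")))
    .reduce(_ || _)
  joined.filter(anyColumnDiffers).count()
}
```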
- val dataFrame: DataFrame
  - Definition Classes: ExtendedDataFrame
- def deduplicate(typeToKeep: String, groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1))): DataFrame
  Deduplicate operation that keeps either the first, last, or unique-only rows within each group. It first groups by all passed groupByColumns and then branches on the typeToKeep value.
  For both first and last, it adds a temporary row_number column giving each row's number within its group. To find the first records it keeps the rows with row_number 1; to find the last records it also computes each group's row count and keeps the rows whose row_number equals that count.
  For unique-only, it adds a temporary count column holding the number of rows in each window partition and keeps the rows where that count is 1. A sketch of this pattern follows below.
  - typeToKeep: which rows to keep; possible values are first, last, and unique-only
  - groupByColumns: columns used to group input records
  - returns: DataFrame with the first, last, or unique-only records in each grouping of input records
  - Definition Classes: ExtendedDataFrame
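A minimal sketch of the window-function pattern described above; the temporary column names "_rn" and "_cnt" are assumptions:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: row_number and a per-group count drive all three modes.
def dedup(df: DataFrame, typeToKeep: String,
          groupBy: List[Column], orderBy: List[Column]): DataFrame = {
  val w = Window.partitionBy(groupBy: _*).orderBy(orderBy: _*)
  val counted = df
    .withColumn("_rn", row_number().over(w))
    .withColumn("_cnt", count(lit(1)).over(Window.partitionBy(groupBy: _*)))
  val kept = typeToKeep match {
    case "first"       => counted.filter(col("_rn") === 1)
    case "last"        => counted.filter(col("_rn") === col("_cnt"))
    case "unique-only" => counted.filter(col("_cnt") === 1)
  }
  kept.drop("_rn", "_cnt")
}
```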
- def deduplicateFromColumnNames(typeToKeep: String, groupByColumns: ArrayList[String]): DataFrame
  - Definition Classes: ExtendedDataFrame
- def denormalizeSorted(groupByColumns: List[Column] = List(lit(1)), orderByColumns: List[Column] = List(lit(1)), denormalizeRecordExpression: Column, finalizeExpressionMap: Map[String, Column], inputFilter: Option[Column] = None, outputFilter: Option[Column] = None, denormColumnName: String, countColumnName: String = "count"): DataFrame
  - Definition Classes: ExtendedDataFrame
- final def eq(arg0: AnyRef): Boolean
  - Definition Classes: AnyRef
- def equals(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- def generateLogOutput(componentName: String, subComponentName: String = "", perRowEventTypes: Option[Column] = None, perRowEventTexts: Option[Column] = None, inputRowCount: Long = 0, outputRowCount: Option[Long] = Some(0), finalLogEventType: Option[Column] = None, finalLogEventText: Option[Column] = None, finalEventExtraColumnMap: Map[String, Column] = Map(), sparkSession: SparkSession): DataFrame
  Generates Ab Initio log output for any component. It takes the array of non-standard events emitted by a workflow component and serializes each event into a separate row. It also adds start and finish events, attaching count information to the finish event.
  - Definition Classes: ExtendedDataFrame
- def generateSurrogateKeys(keyDF: DataFrame, naturalKeys: List[String], surrogateKey: String, overrideSurrogateKeys: Option[String], computeOldPortOutput: Boolean = false, spark: SparkSession): (DataFrame, DataFrame, DataFrame)
  - Definition Classes: ExtendedDataFrame
- final def getClass(): Class[_]
  - Definition Classes: AnyRef → Any
  - Annotations: @native() @HotSpotIntrinsicCandidate()
- def grouped(windowSize: Int): DataFrame
  - Definition Classes: ExtendedDataFrame
- def hashCode(): Int
  - Definition Classes: AnyRef → Any
  - Annotations: @native() @HotSpotIntrinsicCandidate()
- def interim(subgraph: String, component: String, port: String)(implicit interimOutput: InterimOutput): DataFrame
  - Definition Classes: ExtendedDataFrame
  - Annotations: @Py4JWhitelist()
- def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, detailedStats: Boolean = false)(implicit interimOutput: InterimOutput): DataFrame
  - Definition Classes: ExtendedDataFrame
  - Annotations: @Py4JWhitelist()
- def interim(subgraph: String, component: String, port: String, subPath: String, numRows: Int, interimOutput: InterimOutput, detailedStats: Boolean): DataFrame
  - Definition Classes: ExtendedDataFrame
  - Annotations: @Py4JWhitelist()
- final def isInstanceOf[T0]: Boolean
  - Definition Classes: Any
- def mergeMultipleFileContentInDataFrame(fileNameDF: DataFrame, spark: SparkSession, abinitioSchema: String, delimiter: String, readFormat: String, joinWithInputDataframe: Boolean): DataFrame
  Reads the content of every file whose name is listed in the fileNameDF dataframe. It also merges the fileName column and a unique sequence id into the generated file-content dataframe for all passed file names.
  Finally, it joins the file-content dataframe with the input dataframe and returns the joined dataframe. A sketch of this shape follows below.
  - Definition Classes: ExtendedDataFrame
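A hedged sketch of the read-and-join shape; the "fileName" column name is an assumption, and input_file_name() returns a fully qualified URI that may need normalization before it matches the listed paths:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Sketch only: load every file listed in fileNameDF, tag each row with its
// source file, and join the content back to the input dataframe.
def mergeFileContents(fileNameDF: DataFrame, spark: SparkSession,
                      readFormat: String, delimiter: String): DataFrame = {
  val paths = fileNameDF.select("fileName").distinct().collect().map(_.getString(0))
  val content = spark.read.format(readFormat).option("delimiter", delimiter)
    .load(paths: _*)
    .withColumn("fileName", input_file_name()) // may need path normalization
  fileNameDF.join(content, "fileName")
}
```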
- def mergeMultipleFileContentInDataFrame(fileNameDF: DataFrame, spark: SparkSession, outputSchema: StructType, delimiter: String, readFormat: String, joinWithInputDataframe: Boolean, ffSchema: Option[FFSchemaRecord]): DataFrame
  Reads the content of every file whose name is listed in the fileNameDF dataframe. It also merges the fileName column and a unique sequence id into the generated file-content dataframe for all passed file names.
  Finally, it joins the file-content dataframe with the input dataframe and returns the joined dataframe.
  - Definition Classes: ExtendedDataFrame
- def metaPivot(pivotColumns: Seq[String], nameField: String, valueField: String, sparkSession: SparkSession): DataFrame
  Pivots on the passed pivot columns. This method splits records by the pivot columns, converting each input record into a series of separate output records: one output record for each data field of the original input record that is not in the pivot list. Each output record contains the name and value of a single data field from the original input record, along with the pivot columns. A sketch follows below.
  - Definition Classes: ExtendedDataFrame
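A minimal sketch of this unpivot shape using explode over name/value structs; casting values to string and the "_nv" helper name are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: each non-pivot column becomes its own output row carrying
// the column's name and (stringified) value alongside the pivot columns.
def metaPivotSketch(df: DataFrame, pivotColumns: Seq[String],
                    nameField: String, valueField: String): DataFrame = {
  val dataCols = df.columns.filterNot(pivotColumns.contains)
  val nameValuePairs = explode(array(dataCols.map { c =>
    struct(lit(c).as(nameField), col(c).cast("string").as(valueField))
  }: _*))
  df.select(pivotColumns.map(col) :+ nameValuePairs.as("_nv"): _*)
    .select(pivotColumns.map(col)
      :+ col(s"_nv.$nameField").as(nameField)
      :+ col(s"_nv.$valueField").as(valueField): _*)
}
```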
- final def ne(arg0: AnyRef): Boolean
  - Definition Classes: AnyRef
- def normalize(lengthExpression: Option[Column], finishedExpression: Option[Column], finishedCondition: Option[Column], alias: String, colsToSelect: List[Column], tempWindowExpr: Map[String, Column], lengthRelatedGlobalExpressions: Map[String, Column] = Map()): DataFrame
  Implements the Ab Initio normalize functionality. It first replicates each input row multiple times, depending on the passed lengthExpression or finishedExpression. lengthExpression evaluates to a number, and each input row is replicated that many times.
  finishedExpression and finishedCondition apply a filter condition on the input data and use the condition's result to duplicate each input row multiple times.
  tempWindowExpr evaluates temporary variables for the Normalize-with-Temp case using window functions. These expressions are then used to compute the final values of the normalize output. A sketch of the length-based replication follows below.
  - lengthExpression: expression which evaluates to an integer value, used to duplicate input records
  - finishedExpression: expression used in the filter condition during its evaluation for duplication of records
  - finishedCondition: condition used to duplicate input records until the condition result is false
  - alias: used to rename finishedExpressions
  - colsToSelect: columns to be selected after the normalize operations
  - tempWindowExpr: window expressions to compute the values of temporary variables
  - returns: final normalize output for both the with-Temp and without-Temp cases
  - Definition Classes: ExtendedDataFrame
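A minimal sketch of the length-based replication only, assuming Spark 2.4+ for sequence(); the "_idx" column name is an assumption:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Sketch only: repeat each input row lengthExpression times, with an index
// column identifying each copy.
def replicateByLength(df: DataFrame, lengthExpression: Column): DataFrame =
  df.filter(lengthExpression > 0) // rows that normalize to zero records are dropped
    .withColumn("_idx", explode(sequence(lit(0), lengthExpression.cast("int") - 1)))
```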
- final def notify(): Unit
  - Definition Classes: AnyRef
  - Annotations: @native() @HotSpotIntrinsicCandidate()
- final def notifyAll(): Unit
  - Definition Classes: AnyRef
  - Annotations: @native() @HotSpotIntrinsicCandidate()
- def readSeparatedValues(inputColumn: Column, outputSchemaColumns: List[String], recordSeparator: String, fieldSeparator: String): DataFrame
  Reads textual data from inputColumn, splits it into multiple records via recordSeparator, further splits each record into multiple columns via fieldSeparator, and finally maps the resultant data to the passed output columns. A sketch follows below.
  - Definition Classes: ExtendedDataFrame
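A minimal sketch of this split-and-map shape; note that Spark's split() treats the separators as regexes, so the real method may quote them (e.g. with Pattern.quote):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Sketch only: text -> records -> fields -> named output columns.
def readSeparated(df: DataFrame, inputColumn: Column, outputSchemaColumns: List[String],
                  recordSeparator: String, fieldSeparator: String): DataFrame = {
  val records = df.select(explode(split(inputColumn, recordSeparator)).as("_record"))
  val fields = records.select(split(col("_record"), fieldSeparator).as("_fields"))
  fields.select(outputSchemaColumns.zipWithIndex.map {
    case (name, i) => col("_fields").getItem(i).as(name)
  }: _*)
}
```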
- def syncDataFrameColumnsWithSchema(columnNames: Seq[String]): DataFrame
  Syncs the dataframe's column names with the column names passed as input.
  - Definition Classes: ExtendedDataFrame
- final def synchronized[T0](arg0: ⇒ T0): T0
  - Definition Classes: AnyRef
- def toString(): String
  - Definition Classes: AnyRef → Any
- def unionWithSchema(otherDataFrame: DataFrame): DataFrame
  Takes the union of the current dataframe with the passed otherDataFrame. This method also rearranges the columns of otherDataFrame into the same order as the current dataframe's columns, as sketched below.
  - Definition Classes: ExtendedDataFrame
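A minimal sketch of the column-aligned union, assuming both dataframes share the same column names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: reorder the other dataframe's columns to match the caller's
// order before the positional union.
def unionAligned(df: DataFrame, other: DataFrame): DataFrame =
  df.union(other.select(df.columns.map(col): _*))
```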
- lazy val vectorUDF: UserDefinedFunction
  - Definition Classes: ExtendedDataFrame
- final def wait(arg0: Long, arg1: Int): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... ) @native()
- final def wait(): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... )
- def withColumnOptional(name: String, value: Column): DataFrame
  Adds a column with the defined value if it doesn't exist already, as sketched below.
  - name: the column's name
  - value: the new column's value
  - returns: DataFrame with the new column if it didn't exist already
  - Definition Classes: ExtendedDataFrame
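A one-step sketch of the optional add:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Sketch only: attach the column only when it is absent.
def withColumnOptional(df: DataFrame, name: String, value: Column): DataFrame =
  if (df.columns.contains(name)) df else df.withColumn(name, value)
```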
- def zipWithIndex(startValue: Long = 0L, incrementBy: Long = 1L, indexColName: String, sparkSession: SparkSession): DataFrame
  Adds a new unique sequence column to the dataframe, where the sequence starts with startValue and each row's value is incremented by incrementBy. See the sketch below.
  - Definition Classes: ExtendedDataFrame
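A hedged sketch of one common way to build such a gap-free sequence, via RDD zipWithIndex; whether the real method takes this route is an assumption:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

// Sketch only: RDD zipWithIndex yields consecutive indices, which are then
// scaled into the requested sequence and appended as a new column.
def zipWithIndexSketch(df: DataFrame, startValue: Long, incrementBy: Long,
                       indexColName: String, spark: SparkSession): DataFrame = {
  val rows = df.rdd.zipWithIndex.map { case (row, i) =>
    Row.fromSeq(row.toSeq :+ (startValue + i * incrementBy))
  }
  val schema = df.schema.add(StructField(indexColName, LongType, nullable = false))
  spark.createDataFrame(rows, schema)
}
```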
Deprecated Value Members
- def finalize(): Unit
  - Attributes: protected[lang]
  - Definition Classes: AnyRef
  - Annotations: @throws( classOf[java.lang.Throwable] ) @Deprecated
  - Deprecated