Extension methods for arrays
Extension methods for any kind of column
Casting methods
INTEGRAL OPERATIONS
LONG OPERATIONS
Extension methods for Map Columns
NUM WITH DECIMALS OPERATIONS
GENERIC NUMERIC OPERATIONS
Unique column operations
Aggregate function: returns the AND value for a boolean column
Aggregate function: returns the approximate number of distinct items in a group.
Creates a new array column. The input columns must all have the same data type.
scaladoc link (issue #135)
org.apache.spark.sql.functions.array
Aggregate function: returns the average of the values in a group.
Returns the first column that is not null, or null if all inputs are null.
For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.
the DoricColumns to coalesce
the first column that is not null, or null if all inputs are null.
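The selection rule above can be sketched in plain Scala. This is only a model of the described semantics, not doric or Spark code: `Option[A]` stands in for a nullable column value, and the object name `CoalesceModel` is hypothetical.

```scala
// Plain-Scala model of the coalesce rule (illustrative only).
// None plays the role of a null column value.
object CoalesceModel {
  // Returns the first defined value, or None when every input is None.
  def coalesce[A](values: Option[A]*): Option[A] =
    values.collectFirst { case Some(v) => v }
}
```

With inputs a = None, b = Some("b"), c = Some("c"), the model picks b, matching the a/b/c walk-through in the description.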
Retrieves a column with the provided name and the provided type.
the expected type of the column
the name of the column to find.
error location.
the column reference
Retrieves a column with the provided name expecting it to be of array of T type.
the type of the elements of the array.
the name of the column to find.
error location.
the array of T column reference.
Retrieves a column with the provided name expecting it to be of array of integers type.
the name of the column to find.
error location.
the array of integers column reference.
Retrieves a column with the provided name expecting it to be of array of string type.
the name of the column to find.
error location.
the array of string column reference.
Retrieves a column with the provided name expecting it to be of array of bytes type.
the name of the column to find.
error location.
the binary column reference.
Retrieves a column with the provided name expecting it to be of boolean type.
the name of the column to find.
error location.
the boolean column reference
Retrieves a column with the provided name expecting it to be of Date type.
the name of the column to find.
error location.
the Date column reference
Retrieves a column with the provided name expecting it to be of double type.
the name of the column to find.
error location.
the double column reference
Retrieves a column with the provided name expecting it to be of float type.
the name of the column to find.
error location.
the float column reference
Retrieves a column of the provided dataframe. Useful to prevent column ambiguity errors.
the type of the doric column.
the name of the column to find.
the dataframe the column must come from.
error location.
the column of type T column reference.
Retrieves a column with the provided name expecting it to be of instant type.
the name of the column to find.
error location.
the instant column reference
Retrieves a column with the provided name expecting it to be of integer type.
the name of the column to find.
error location.
the integer column reference
Retrieves a column with the provided name expecting it to be of LocalDate type.
the name of the column to find.
error location.
the LocalDate column reference
Retrieves a column with the provided name expecting it to be of long type.
the name of the column to find.
error location.
the long column reference
Retrieves a column with the provided name expecting it to be of map type.
the type of the keys of the map.
the type of the values of the map.
the name of the column to find.
error location.
the map column reference.
Retrieves a column with the provided name expecting it to be of map type.
the type of the values of the map.
the name of the column to find.
error location.
the map column reference.
Retrieves a column with the provided name expecting it to be of null type.
the name of the column to find.
error location.
the null column reference
Retrieves a column with the provided name expecting it to be of string type.
the name of the column to find.
error location.
the string column reference
Retrieves a column with the provided name expecting it to be of struct type.
the name of the column to find.
error location.
the struct column reference.
Retrieves a column with the provided name expecting it to be of Timestamp type.
the name of the column to find.
error location.
the Timestamp column reference
Aggregate function: returns a list of objects with duplicates.
The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
Aggregate function: returns a set of objects with duplicate elements eliminated.
The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
Concatenate string columns to form a single one
the String DoricColumns to concatenate
a reference to a single DoricColumn with all strings concatenated. Returns null if at least one of the inputs is null.
Concatenates multiple array columns together into a single column.
The type of the elements of the arrays.
the array columns, must be Arrays of the same type.
Doric Column with the concatenation.
Concatenates multiple binary columns together into a single column.
the first binary column
the binary columns
Doric Column with the concatenation.
Returns the union of all the given maps.
Concatenates multiple input string columns together into a single string column, using the given separator.
Example:
    df.withColumn("res", concatWs("-".lit, col("col1"), col("col2")))
      .show(false)
    +----+----+----+
    |col1|col2| res|
    +----+----+----+
    |   1|   1| 1-1|
    |null|   2|   2|
    |   3|null|   3|
    |null|null|    |
    +----+----+----+
Note: even if cols contain null columns, it prints the remaining string columns (or an empty string).
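The null handling shown in the table can be modelled in plain Scala. This is a sketch of the described behaviour only, not the doric implementation: `Option[String]` stands in for a nullable string column value, and `ConcatWsModel` is a hypothetical name.

```scala
// Plain-Scala model of the separator-based concatenation (illustrative only).
// None plays the role of a null column value.
object ConcatWsModel {
  // Null inputs are skipped; the remaining values are joined with the separator.
  // When every input is null the result is the empty string, not null.
  def concatWs(sep: String, cols: Option[String]*): String =
    cols.flatten.mkString(sep)
}
```

Each assertion one might write against this model corresponds to one row of the table: ("1", "1") gives "1-1", (null, "2") gives "2", and (null, null) gives the empty string.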
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
Aggregate function: returns the number of items in a group.
Aggregate function: returns the number of distinct items in a group.
Aggregate function: returns the population covariance for two columns.
Aggregate function: returns the sample covariance for two columns.
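The difference between the population and sample covariance is only the divisor (n versus n - 1). A minimal plain-Scala sketch of the two textbook formulas follows; it is not the Spark implementation, and the names `CovarianceModel`, `covarPop` and `covarSamp` are chosen here for illustration.

```scala
// Textbook population and sample covariance over two equally long sequences
// (illustrative only; null handling of the real aggregates is not modelled).
object CovarianceModel {
  // Sum of products of deviations from the respective means.
  private def coMoment(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "need two non-empty sequences of equal length")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  }

  // Population covariance: divide by n.
  def covarPop(xs: Seq[Double], ys: Seq[Double]): Double =
    coMoment(xs, ys) / xs.length

  // Sample covariance: divide by n - 1, so at least two pairs are required.
  def covarSamp(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length >= 2, "sample covariance needs at least two pairs")
    coMoment(xs, ys) / (xs.length - 1)
  }
}
```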
Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value.
Returns the current date at the start of query evaluation as a date column typed with the provided T. All calls of current_date within the same query return the same value.
Returns the current timestamp at the start of query evaluation as a timestamp column. All calls of current_timestamp within the same query return the same value.
Aggregate function: returns the first value in a group.
By default the function returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.
Formats the arguments in printf-style and returns the result as a string column.
Printf format
the String DoricColumns to format
Formats the arguments in printf-style and returns the result as a string column.
Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
skips null values
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
Aggregate function: returns the level of grouping, equal to
(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)
The list of columns should match the grouping columns exactly, or be empty (meaning all the grouping columns).
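The formula packs one bit per grouping column, with c1 as the most significant bit. A small plain-Scala sketch of that arithmetic follows; it only evaluates the expression above and is not Spark code. The name `GroupingIdModel` and the representation of the flags as a `Seq[Int]` are assumptions made for the example.

```scala
// Evaluates (grouping(c1) << (n-1)) + ... + grouping(cn) for given flags
// (illustrative only). flags(i) is the grouping(...) value of column c(i+1):
// 1 when the column is aggregated, 0 when it is not.
object GroupingIdModel {
  def groupingId(flags: Seq[Int]): Long = {
    require(flags.forall(f => f == 0 || f == 1), "each flag must be 0 or 1")
    // Shifting the accumulator left once per column gives c1 the weight 2^(n-1).
    flags.foldLeft(0L)((acc, flag) => (acc << 1) + flag)
  }
}
```

For three grouping columns with flags 1, 0, 1 the expression is (1 << 2) + (0 << 1) + 1, which is 5.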
Calculates the hash code of given columns, and returns the result as an integer column.
Creates a string column for the file name of the current Spark task.
Aggregate function: returns the kurtosis of the values in a group.
Aggregate function: returns the last value in a group.
By default the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.
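The effect of ignoreNulls on the first-value and last-value aggregates can be modelled in plain Scala over an ordered group. This is a sketch of the described semantics only, not doric or Spark code: `Option[A]` stands in for a nullable value, the `Seq` fixes a row order that a real shuffle would not guarantee, and `FirstLastModel` is a hypothetical name.

```scala
// Plain-Scala model of first/last with the ignoreNulls switch (illustrative only).
// None plays the role of a null value; the Seq order plays the role of row order.
object FirstLastModel {
  // Without ignoreNulls the very first row wins, even if it is null.
  def first[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.flatten.headOption else values.headOption.flatten

  // Without ignoreNulls the very last row wins, even if it is null.
  def last[A](values: Seq[Option[A]], ignoreNulls: Boolean = false): Option[A] =
    if (ignoreNulls) values.flatten.lastOption else values.lastOption.flatten
}
```

Because the result depends entirely on the order of the sequence, reordering the rows changes the answer, which is the source of the non-determinism noted above.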
Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
skips null values
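The null-skipping rule described for the least value (and, symmetrically, for the greatest value) can be sketched in plain Scala. This is a model of the stated semantics, not the doric API: `Option[A]` stands in for a nullable column value, and `GreatestLeastModel` is a hypothetical name. The two mandatory parameters mirror the "at least 2 parameters" requirement.

```scala
// Plain-Scala model of greatest/least with null skipping (illustrative only).
// None plays the role of a null column value.
object GreatestLeastModel {
  // Largest defined value; None only when every input is None.
  def greatest[A: Ordering](first: Option[A], second: Option[A], rest: Option[A]*): Option[A] = {
    val present = (first +: second +: rest).flatten
    if (present.isEmpty) None else Some(present.max)
  }

  // Smallest defined value; None only when every input is None.
  def least[A: Ordering](first: Option[A], second: Option[A], rest: Option[A]*): Option[A] = {
    val present = (first +: second +: rest).flatten
    if (present.isEmpty) None else Some(present.min)
  }
}
```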
Creates a new list column. The input columns must all have the same data type.
scaladoc link (issue #135)
org.apache.spark.sql.functions.array
Creates a literal with the provided value.
The type of the literal.
the element to create as a literal.
A doric column that represents the literal value and has the same type as the value.
Creates a new map column. The input is formed by tuples of key and the corresponding value.
the type of the keys of the Map
the type of the values of the Map
a pair of key value DoricColumns
the rest of pairs of key and corresponding Values.
the DoricColumn of the corresponding Map type
Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. No element of the key array may be null.
the type of the Array elements of the keys.
the type of the Array elements of the value.
the array to create the keys.
the array to create the values.
a DoricColumn of type Map of the keys and values.
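Building a map from a key array and a value array amounts to pairing elements by position. A plain-Scala sketch of that idea follows; it is not doric or Spark code. The name `MapFromArraysModel` is hypothetical, the equal-length check is an assumption of this sketch, and duplicate keys simply follow `toMap` (last one wins), which may differ from the real function.

```scala
// Plain-Scala model of building a map from two parallel arrays (illustrative only).
object MapFromArraysModel {
  def mapFromArrays[K, V](keys: Seq[K], values: Seq[V]): Map[K, V] = {
    // Assumption of this sketch: both arrays must have the same length.
    require(keys.length == values.length, "keys and values must have the same length")
    // Mirrors the documented rule that key elements must not be null.
    require(keys.forall(_ != null), "keys must not contain null")
    // Pair the i-th key with the i-th value.
    keys.zip(values).toMap
  }
}
```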
Aggregate function: returns the maximum value of the expression in a group.
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.
For example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
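The bit layout described above (partition ID in the upper bits, record number in the lower 33 bits) can be reproduced with a shift and an OR. The following plain-Scala sketch only illustrates that arithmetic and is not Spark code; `MonotonicIdModel` and its method names are hypothetical.

```scala
// Plain-Scala model of the ID layout: partition ID shifted left by 33 bits,
// combined with the record number inside the partition (illustrative only).
object MonotonicIdModel {
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) | recordNumber

  // IDs for a data set described by the number of records in each partition.
  def ids(recordsPerPartition: Seq[Int]): Seq[Long] =
    recordsPerPartition.zipWithIndex.flatMap { case (count, partition) =>
      (0 until count).map(n => id(partition, n.toLong))
    }
}
```

Calling `MonotonicIdModel.ids(Seq(3, 3))` models the two-partition, three-record case from the description and yields 0, 1, 2, 8589934592, 8589934593, 8589934594.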
Inversion of boolean expression, i.e. NOT.
Aggregate function: returns the OR value for a boolean column
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
The function is non-deterministic in the general case.
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
The function is non-deterministic in the general case.
The object row stands for the top-level row of the DataFrame.
Aggregate function: returns the skewness of the values in a group.
Partition ID.
This is non-deterministic because it depends on data partitioning and task scheduling.
Creates a string column for the file name of the current Spark task.
Aggregate function: alias for stddev_samp.
Aggregate function: returns the population standard deviation of the expression in a group.
Aggregate function: returns the sample standard deviation of the expression in a group.
Creates a struct with the columns
the columns that will form the struct
A DStruct DoricColumn.
Aggregate function: returns the sum of all values in the expression.
Aggregate function: returns the sum of distinct values in the expression.
Returns the current Unix timestamp (in seconds) as a long.
All calls of unix_timestamp within the same query return the same value (i.e. the current timestamp is calculated at the start of query evaluation).
Aggregate function: returns the population variance of the values in a group.
Aggregate function: returns the unbiased variance of the values in a group.
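The population variance divides the sum of squared deviations by n, while the unbiased (sample) variance divides by n - 1. A minimal plain-Scala sketch of both textbook formulas follows; it is not the Spark implementation, and `VarianceModel`, `varPop` and `varSamp` are names chosen here for illustration.

```scala
// Textbook population and unbiased (sample) variance
// (illustrative only; null handling of the real aggregates is not modelled).
object VarianceModel {
  // Sum of squared deviations from the mean.
  private def sumOfSquares(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one value")
    val mean = xs.sum / xs.length
    xs.map(x => (x - mean) * (x - mean)).sum
  }

  // Population variance: divide by n.
  def varPop(xs: Seq[Double]): Double =
    sumOfSquares(xs) / xs.length

  // Unbiased variance: divide by n - 1, so at least two values are required.
  def varSamp(xs: Seq[Double]): Double = {
    require(xs.length >= 2, "sample variance needs at least two values")
    sumOfSquares(xs) / (xs.length - 1)
  }
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the unbiased variance is 32 / 7.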
Aggregate function: alias for var_samp.
Initialize a when builder
the type of the returned DoricColumn
WhenBuilder instance to add the required logic.