Returns a TypedColumn of type A, given its name.
tf('id)
It is statically checked that a column with this name exists and has type A.
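For example (a sketch with hypothetical names), given a case class Person(name: String, age: Int) and a TypedDataset[Person] called people:
case class Person(name: String, age: Int)
// assuming people: TypedDataset[Person] is already in scope
val age: TypedColumn[Person, Int] = people('age)  // compiles: Person has a field age of type Int
// people('salary) would not compile: Person has no field named salary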
Returns a new TypedDataset where each record has been mapped onto the specified type.
Persist this TypedDataset with the default storage level (MEMORY_AND_DISK).
apache/spark
Returns a new TypedDataset that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
apache/spark
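A minimal sketch, reusing the hypothetical people: TypedDataset[Person] from above:
// narrow dependency: no shuffle when reducing the number of partitions
val consolidated: TypedDataset[Person] = people.coalesce(10)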
Returns a TypedColumn of type A, given its name.
tf.col('id)
It is statically checked that a column with this name exists and has type A.
Returns a Seq that contains all the elements in this TypedDataset.
Running this Job requires moving all the data into the application's driver process, and doing so on a very large TypedDataset can crash the driver process with OutOfMemoryError.
Differs from Dataset#collect by wrapping its result into a Job.
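For example (hypothetical names as above), the action only executes when run() is called on the Job:
val everyone: frameless.Job[Seq[Person]] = people.collect()
val rows: Seq[Person] = everyone.run()  // the Spark action happens here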
Returns the number of elements in the TypedDataset.
Differs from Dataset#count by wrapping its result into a Job.
Returns a new TypedDataset that contains only the unique elements of this TypedDataset.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
Returns a new Dataset containing rows in this Dataset but not in another Dataset.
This is equivalent to EXCEPT in SQL.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
Prints the plans (logical and physical) to the console for debugging purposes.
apache/spark
Returns a new frameless.TypedDataset that only contains elements where column is true.
Differs from TypedDatasetForwarded#filter by taking a TypedColumn[T, Boolean] instead of a T => Boolean. Using a column expression instead of a regular function saves one Spark → Scala deserialization, which leads to better performance.
Returns a new TypedDataset that only contains elements where func returns true.
apache/spark
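A sketch contrasting the two filter overloads (hypothetical Person/people as above):
// column expression: evaluated by Catalyst, no per-row deserialization to Person
val adas: TypedDataset[Person] = people.filter(people('name) === "Ada")
// function predicate: each row is deserialized to a Person before the predicate runs
val alsoAdas: TypedDataset[Person] = people.filter((p: Person) => p.name == "Ada")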
Optionally returns the first element in this TypedDataset.
Differs from Dataset#first by wrapping its result into an Option and a Job.
Returns a new TypedDataset by first applying a function to all elements of this TypedDataset, and then flattening the results.
apache/spark
Runs func on each element of this TypedDataset.
Differs from Dataset#foreach by wrapping its result into a Job.
Runs func on each partition of this TypedDataset.
Differs from Dataset#foreachPartition by wrapping its result into a Job.
Returns a new TypedDataset that contains only the elements of this TypedDataset that are also present in other.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
apache/spark
Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution), while limit returns a new Dataset.
apache/spark
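For example (hypothetical names as above), limit is a lazy transformation while take runs a Job:
val firstHundred: TypedDataset[Person] = people.limit(100)   // nothing executed yet
val materialized: Seq[Person] = people.take(100).run()       // triggers query execution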
Takes a function from (A1, A2, A3, A4, A5) => R and converts it to a UDF for (TypedColumn[T, A1], TypedColumn[T, A2], TypedColumn[T, A3], TypedColumn[T, A4], TypedColumn[T, A5]) => TypedColumn[T, R].
Takes a function from (A1, A2, A3, A4) => R and converts it to a UDF for (TypedColumn[T, A1], TypedColumn[T, A2], TypedColumn[T, A3], TypedColumn[T, A4]) => TypedColumn[T, R].
Takes a function from (A1, A2, A3) => R and converts it to a UDF for (TypedColumn[T, A1], TypedColumn[T, A2], TypedColumn[T, A3]) => TypedColumn[T, R].
Takes a function from (A1, A2) => R and converts it to a UDF for (TypedColumn[T, A1], TypedColumn[T, A2]) => TypedColumn[T, R].
Takes a function from A => R and converts it to a UDF for TypedColumn[T, A] => TypedColumn[T, R].
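A minimal sketch of a one-argument UDF (hypothetical Person/people as above):
// lift an ordinary Scala function into a typed column transformation
val initial = people.makeUDF((name: String) => name.take(1))
val withInitials: TypedDataset[(String, Int)] =
  people.select(initial(people('name)), people('age))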
Returns a new TypedDataset that contains the result of applying func to each element.
apache/spark
Returns a new TypedDataset that contains the result of applying func to each partition.
apache/spark
Persist this TypedDataset with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
apache/spark
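For example (hypothetical names as above):
import org.apache.spark.storage.StorageLevel
val cached: TypedDataset[Person] = people.persist(StorageLevel.MEMORY_ONLY)
// ... reuse `cached` across several actions ...
cached.unpersist()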
Prints the schema of the underlying Dataset to the console in a nice tree format.
apache/spark
Returns a new TypedDataset where each record has been mapped onto the specified type.
Unlike as, the projection U may include a subset of the columns of T, and the column names and types must agree.
case class Foo(i: Int, j: String)
case class Bar(j: String)

val t: TypedDataset[Foo] = ...
val b: TypedDataset[Bar] = t.project[Bar]

case class BarErr(e: String)
// The following does not compile because `Foo` doesn't have a field with name `e`
val e: TypedDataset[BarErr] = t.project[BarErr]
Converts this TypedDataset to an RDD.
apache/spark
Optionally reduces the elements of this TypedDataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
Differs from Dataset#reduce by wrapping its result into an Option and a Job.
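A sketch (hypothetical Person/people as above) that keeps the person with the greatest age; the result is None when the dataset is empty:
val oldest: Option[Person] =
  people.reduceOption((a, b) => if (a.age >= b.age) a else b).run()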
Returns a new TypedDataset that has exactly numPartitions partitions.
apache/spark
Returns a new TypedDataset by sampling a fraction of records.
apache/spark
Returns the schema of this Dataset.
apache/spark
Type-safe projection from type T to Tuple10[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple9[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple8[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple7[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple6[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple5[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple4[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple3[A,B,...]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple2[A,B]
d.select( d('a), d('a)+d('b), ... )
Type-safe projection from type T to Tuple1[A]
d.select( d('a), d('a)+d('b), ... )
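For example (hypothetical names as above), selecting two columns yields a TypedDataset of the corresponding tuple type:
val namesAndAges: TypedDataset[(String, Int)] = people.select(people('name), people('age))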
Displays the content of this TypedDataset in a tabular form. Strings of more than 20 characters will be truncated, and all cells will be aligned right. For example:
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
numRows: Number of rows to show
truncate: Whether to truncate long strings. If true, strings of more than 20 characters will be truncated and all cells will be aligned right.
Differs from Dataset#show by wrapping its result into a Job.
apache/spark
Returns the first num elements of this TypedDataset as a Seq.
Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
Differs from Dataset#take by wrapping its result into a Job.
apache/spark
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
apache/spark
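A sketch (hypothetical names as above) of dropping down to the untyped DataFrame API:
val df: org.apache.spark.sql.DataFrame = people.toDF()
df.createOrReplaceTempView("people")  // untyped from here on: fields accessed by name or ordinal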
Concise syntax for chaining custom transformations.
apache/spark
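For example (hypothetical names as above), reusable pipeline stages can be written as functions and chained with transform:
def onlyAdults(ds: TypedDataset[Person]): TypedDataset[Person] =
  ds.filter((p: Person) => p.age >= 18)
def firstHundred(ds: TypedDataset[Person]): TypedDataset[Person] =
  ds.limit(100)
val sampled: TypedDataset[Person] = people.transform(onlyAdults).transform(firstHundred)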
Returns a new TypedDataset that contains the elements of both this and the other TypedDataset combined.
Note that this function is not a typical set union operation, in that it does not eliminate duplicate items. As such, it is analogous to UNION ALL in SQL.
apache/spark
Mark the TypedDataset as non-persistent, and remove all blocks for it from memory and disk.
blocking: Whether to block until all blocks are deleted.
apache/spark
TypedDataset is a safer interface for working with Dataset.
NOTE: Prefer TypedDataset.create over new TypedDataset unless you know what you are doing.
Documentation marked "apache/spark" is thanks to apache/spark Contributors at https://github.com/apache/spark, licensed under Apache v2.0 available at http://www.apache.org/licenses/LICENSE-2.0
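A minimal end-to-end sketch (assuming a local SparkSession; in some frameless versions the implicit required by TypedDataset.create is an SQLContext rather than a SparkSession):
import frameless.TypedDataset
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("typeddataset-example").getOrCreate()

val people: TypedDataset[Person] =
  TypedDataset.create(Seq(Person("Ada", 36), Person("Alan", 41)))

people.show().run()   // show is also wrapped in a Job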