Use a zero element and Scala functions (reduce, merge, and finish) to aggregate the key-value Dataset's values for each key.
scala> val zero = 1.0
scala> val reduce = (x: Double, y: Int) => x / y
scala> val merge = (x: Double, y: Double) => x + y
scala> val finish = (x: Double) => x * 10
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3)).toDS.aggByKey(zero, reduce, merge, finish).show
+-----+------------------+
|value|       anon$1(int)|
+-----+------------------+
|    1|18.333333333333332|
+-----+------------------+
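The value 18.33 in the output above is larger than a single left-to-right fold would give. One reading consistent with it is that each row was reduced in its own partition and the partial results were then merged into a further zero-initialised buffer. The sketch below is a plain-Scala model of that two-phase evaluation for a single key, not the library's implementation. PartitionedAggModel and aggregate are hypothetical names, and the partition layout is an assumption.

```scala
// Plain-Scala model of a zero/reduce/merge/finish aggregation for ONE key.
// PartitionedAggModel and aggregate are hypothetical names, not library API.
object PartitionedAggModel {
  def aggregate[V, B, O](
      partitions: Seq[Seq[V]], // the key's values, grouped by (assumed) partition
      zero: B,
      reduce: (B, V) => B,
      merge: (B, B) => B,
      finish: B => O
  ): O = {
    // Phase 1: each partition folds its values into its own zero-initialised buffer.
    val partials = partitions.map(_.foldLeft(zero)(reduce))
    // Phase 2 (assumption): partial buffers are merged into one more zero-initialised buffer.
    val merged = partials.foldLeft(zero)(merge)
    // Final step: convert the merged buffer into the output value.
    finish(merged)
  }
}
```

Under this model, calling aggregate with the functions from the example gives approximately 18.33 when the values 2 and 3 sit in separate partitions (Seq(Seq(2), Seq(3))) and approximately 11.67 when they share one (Seq(Seq(2, 3))). A zero that is neutral for merge, such as 0.0 for addition, avoids this dependence on the layout.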
Use four Scala functions (zero, reduce, merge, and finish) to aggregate the key-value Dataset's values for each key.
scala> val zero = () => 1.0
scala> val reduce = (x: Double, y: Int) => x / y
scala> val merge = (x: Double, y: Double) => x + y
scala> val finish = (x: Double) => x * 10
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3)).toDS.aggByKey(zero, reduce, merge, finish).show
+-----+------------------+
|value|       anon$1(int)|
+-----+------------------+
|    1|18.333333333333332|
+-----+------------------+
Use four TypedColumns to aggregate the key-value Dataset's values for each key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> import org.apache.spark.sql.expressions.scalalang.typed
scala> val agg = typed.sum((x: Int) => x)
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(agg, agg, agg, agg).show
+-----+-------------------+-------------------+-------------------+-------------------+
|value|TypedSumDouble(int)|TypedSumDouble(int)|TypedSumDouble(int)|TypedSumDouble(int)|
+-----+-------------------+-------------------+-------------------+-------------------+
|    2|                4.0|                4.0|                4.0|                4.0|
|    1|                5.0|                5.0|                5.0|                5.0|
+-----+-------------------+-------------------+-------------------+-------------------+
Use three TypedColumns to aggregate the key-value Dataset's values for each key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> import org.apache.spark.sql.expressions.scalalang.typed
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2), typed.sum(x => x), typed.avg(x => x - 2)).show
+-----+-----------------+-------------------+-----------------+
|value|TypedAverage(int)|TypedSumDouble(int)|TypedAverage(int)|
+-----+-----------------+-------------------+-----------------+
|    1|              4.5|                5.0|              0.5|
|    2|              6.0|                4.0|              2.0|
+-----+-----------------+-------------------+-----------------+
Use two TypedColumns to aggregate the key-value Dataset's values for each key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> import org.apache.spark.sql.expressions.scalalang.typed
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2), typed.sum(x => x)).show
+-----+-----------------+-------------------+
|value|TypedAverage(int)|TypedSumDouble(int)|
+-----+-----------------+-------------------+
|    1|              4.5|                5.0|
|    2|              6.0|                4.0|
+-----+-----------------+-------------------+
Use a TypedColumn to aggregate the key-value Dataset's values for each key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> import org.apache.spark.sql.expressions.scalalang.typed
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2)).show
+-----+-----------------+
|value|TypedAverage(int)|
+-----+-----------------+
|    1|              4.5|
|    2|              6.0|
+-----+-----------------+
Count the number of rows in the key-value Dataset for each key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.countByKey.show
+-----+--------+
|value|count(1)|
+-----+--------+
|    1|       2|
|    2|       1|
+-----+--------+
Flat-map the key-value Dataset's values for each key, with the provided function.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2)).toDS.flatMapValues{ case x => List(x, x + 1) }.show
+---+---+
| _1| _2|
+---+---+
|  1|  2|
|  1|  3|
+---+---+
Full outer join with another key-value Dataset on their keys.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.fullOuterJoinByKey(Seq((1, 4), (1, 5)).toDS).show
+---+--------+
| _1|      _2|
+---+--------+
|  1|   [2,4]|
|  1|   [2,5]|
|  1|   [3,4]|
|  1|   [3,5]|
|  2|[4,null]|
+---+--------+
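The rows in the output above follow from pairing every left value with every right value under the same key, and keeping unmatched rows with the missing side empty. The following is a minimal plain-Scala sketch of that behaviour on ordinary sequences, not the library's implementation. JoinModel is a hypothetical name, and representing a missing side as None (shown as null in the output above) is an assumption.

```scala
// Plain-Scala model of a full outer join on keys. JoinModel is a hypothetical
// name; the real method operates on Spark Datasets, not local sequences.
object JoinModel {
  def fullOuterJoinByKey[K, V, W](
      left: Seq[(K, V)],
      right: Seq[(K, W)]
  ): Seq[(K, (Option[V], Option[W]))] = {
    // Every key that appears on either side, in first-seen order.
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.flatMap { k =>
      val ls = left.filter(_._1 == k).map(_._2)
      val rs = right.filter(_._1 == k).map(_._2)
      // A side with no values for this key contributes a single None.
      val lo: Seq[Option[V]] = if (ls.isEmpty) Seq(None) else ls.map(v => Some(v): Option[V])
      val ro: Seq[Option[W]] = if (rs.isEmpty) Seq(None) else rs.map(w => Some(w): Option[W])
      // Cross product of the two sides for this key.
      for (l <- lo; r <- ro) yield (k, (l, r))
    }
  }
}
```

In this model, the other joins are filters over the same result: a left outer join keeps rows whose left side is defined, a right outer join keeps rows whose right side is defined, and an inner join keeps rows where both are defined.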
Inner join with another key-value Dataset on their keys.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.joinByKey(Seq((1,4)).toDS).show
+---+-----+
| _1|   _2|
+---+-----+
|  1|[2,4]|
|  1|[3,4]|
+---+-----+
Discard the key-value Dataset's values, leaving only the keys.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.keys.show
+-----+
|value|
+-----+
|    1|
|    1|
|    2|
+-----+
Left outer join with another key-value Dataset on their keys.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.leftOuterJoinByKey(Seq((1,4)).toDS).show
+---+--------+
| _1|      _2|
+---+--------+
|  1|   [2,4]|
|  1|   [3,4]|
|  2|[4,null]|
+---+--------+
Apply a provided function to the values of the key-value Dataset.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3)).toDS.mapValues(_ + 2).show
+---+---+
| _1| _2|
+---+---+
|  1|  4|
|  1|  5|
+---+---+
Partition the key-value Dataset by key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> val ds = Seq((1, 2), (1, 3), (2, 4)).toDS
scala> ds.rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@20fe, org.apache.spark.rdd.ParallelCollectionPartition@20ff, org.apache.spark.rdd.ParallelCollectionPartition@2100)
scala> ds.partitionByKey.rdd.partitions
res2: Array[org.apache.spark.Partition] = Array(org.apache.spark.sql.execution.ShuffledRowRDDPartition@0, org.apache.spark.sql.execution.ShuffledRowRDDPartition@1, org.apache.spark.sql.execution.ShuffledRowRDDPartition@2, org.apache.spark.sql.execution.ShuffledRowRDDPartition@3, org.apache.spark.sql.execution.ShuffledRowRDDPartition@4, org.apache.spark.sql.execution.ShuffledRowRDDPartition@5, org.apache.spark.sql.execution.ShuffledRowRDDPartition@6, org.apache.spark.sql.execution.ShuffledRowRDDPartition@7)
Partition the key-value Dataset by key, up to the maximum given number of partitions.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> val ds = Seq((1, 2), (1, 3), (2, 4)).toDS
scala> ds.rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@20fe, org.apache.spark.rdd.ParallelCollectionPartition@20ff, org.apache.spark.rdd.ParallelCollectionPartition@2100)
scala> ds.partitionByKey(1).rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.sql.execution.ShuffledRowRDDPartition@0)
Reduce the key-value Dataset's values for each key, with the provided function.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3)).toDS.reduceByKey(_ + _).show
+-----+---------------------+
|value|ReduceAggregator(int)|
+-----+---------------------+
|    1|                    5|
+-----+---------------------+
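As a rough local analogue of the reduction above, the sketch below groups pairs by key and folds each group's values with the supplied function. It is a plain-Scala illustration on ordinary sequences, not the library's implementation, and ReduceModel is a hypothetical name.

```scala
// Plain-Scala model of reducing values per key. ReduceModel is a hypothetical
// name; the real method operates on Spark Datasets, not local sequences.
object ReduceModel {
  def reduceByKey[K, V](rows: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    rows
      .groupBy(_._1) // collect all pairs that share a key
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) } // combine their values
}
```

For example, ReduceModel.reduceByKey(Seq((1, 2), (1, 3)))(_ + _) yields Map(1 -> 5). On a distributed Dataset the order in which values are combined is not fixed, so the function should be associative and commutative. Counting rows per key can be expressed in the same model by first replacing every value with 1L.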
Right outer join with another key-value Dataset on their keys.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.rightOuterJoinByKey(Seq((1,4)).toDS).show
+---+-----+
| _1|   _2|
+---+-----+
|  1|[3,4]|
|  1|[2,4]|
+---+-----+
Sort the key-value Dataset within partitions by key.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> val ds = Seq((1, 2), (3, 1), (2, 2), (2, 6), (1, 1)).toDS
scala> ds.partitionByKey(2).rdd.glom.map(_.map(_._1).toSeq).collect
res56: Array[Seq[Int]] = Array(WrappedArray(2, 2), WrappedArray(1, 3, 1))
scala> ds.partitionByKey(2).sortWithinPartitionsByKey.rdd.glom.map(_.map(_._1).toSeq).collect
res57: Array[Seq[Int]] = Array(WrappedArray(2, 2), WrappedArray(1, 1, 3))
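The session above shows two separate effects: partitioning places all rows with the same key in one partition, and the sort then orders rows only inside each partition, with no ordering across partitions. The sketch below models both steps on local collections. PartitionModel is a hypothetical name, and hashing with hashCode modulo the partition count is an assumption made for illustration; Spark uses its own hash function, so actual placement may differ.

```scala
// Plain-Scala model of key partitioning followed by a per-partition sort.
// PartitionModel is a hypothetical name, and the hash used here is an
// illustrative stand-in for Spark's own partitioning hash.
object PartitionModel {
  // Assign each pair to a partition index derived from its key's hash code.
  def partitionByKey[K, V](rows: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val byIndex = rows.groupBy { case (k, _) =>
      java.lang.Math.floorMod(k.hashCode, numPartitions)
    }
    // Materialise every partition, including empty ones, in index order.
    Vector.tabulate(numPartitions)(i => byIndex.getOrElse(i, Seq.empty[(K, V)]))
  }

  // Sort rows by key inside each partition; partitions themselves stay in place.
  def sortWithinPartitionsByKey[K: Ordering, V](
      parts: Vector[Seq[(K, V)]]
  ): Vector[Seq[(K, V)]] =
    parts.map(_.sortBy(_._1))
}
```

Applied to the five pairs from the example with two partitions, the model keeps both rows with key 2 together and sorts the remaining partition's keys from 1, 3, 1 to 1, 1, 3.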
Discard the key-value Dataset's keys, leaving only the values.
scala> import com.tresata.spark.datasetops.RichPairDataset
scala> Seq((1, 2), (1, 3), (2, 4)).toDS.values.show
+-----+
|value|
+-----+
|    2|
|    3|
|    4|
+-----+