Class

com.tresata.spark.datasetops

RichPairDataset


implicit final class RichPairDataset[K, V] extends AnyVal

Linear Supertypes
AnyVal, Any

Instance Constructors

  1. new RichPairDataset(ds: Dataset[(K, V)])

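    The class is not normally constructed by hand: importing it brings the implicit conversion into scope, after which any Dataset of pairs picks up the methods below. A minimal spark-shell sketch (the SparkSession and its implicits are assumed here, as in every example on this page):

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS
    scala> ds.reduceByKey(_ + _).show  // two rows: ("a", 3) and ("b", 3)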

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  2. final def ##(): Int

    Definition Classes
    Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  4. def aggByKey[B, U](zero: B, reduce: (B, V) ⇒ B, merge: (B, B) ⇒ B, finish: (B) ⇒ U)(implicit envK: Encoder[K], envV: Encoder[V], encB: Encoder[B], encU: Encoder[U]): Dataset[(K, U)]

    Use a zero element and Scala functions (reduce, merge, and finish) to aggregate the key-value Dataset's values for each key.

    (In the run below the two input values evidently landed in separate partitions; the zero element seeds each partition's buffer as well as the final merge buffer, so the result is 10 × (1.0 + 1.0/2 + 1.0/3) ≈ 18.33.)

    scala> val zero = 1.0
    scala> val reduce = (x: Double, y: Int) => x / y
    scala> val merge = (x: Double, y: Double) => x + y
    scala> val finish = (x: Double) => x * 10
    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3)).toDS.aggByKey(zero, reduce, merge, finish).show
    +-----+------------------+
    |value|       anon$1(int)|
    +-----+------------------+
    |    1|18.333333333333332|
    +-----+------------------+
  5. def aggByKey[B, U](zero: () ⇒ B, reduce: (B, V) ⇒ B, merge: (B, B) ⇒ B, finish: (B) ⇒ U)(implicit envK: Encoder[K], envV: Encoder[V], encB: Encoder[B], encU: Encoder[U]): Dataset[(K, U)]

    Use a zero-element function and Scala functions (reduce, merge, and finish) to aggregate the key-value Dataset's values for each key.

    scala> val zero = () => 1.0
    scala> val reduce = (x: Double, y: Int) => x / y
    scala> val merge = (x: Double, y: Double) => x + y
    scala> val finish = (x: Double) => x * 10
    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3)).toDS.aggByKey(zero, reduce, merge, finish).show
    +-----+------------------+
    |value|       anon$1(int)|
    +-----+------------------+
    |    1|18.333333333333332|
    +-----+------------------+
  6. def aggByKey[U1, U2, U3, U4](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4])(implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, U1, U2, U3, U4)]

    Use four TypedColumns to aggregate the key-value Dataset's values for each key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> import org.apache.spark.sql.expressions.scalalang.typed
    scala> val agg = typed.sum((x: Int) => x)
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(agg, agg, agg, agg).show
    +-----+-------------------+-------------------+-------------------+-------------------+
    |value|TypedSumDouble(int)|TypedSumDouble(int)|TypedSumDouble(int)|TypedSumDouble(int)|
    +-----+-------------------+-------------------+-------------------+-------------------+
    |    2|                4.0|                4.0|                4.0|                4.0|
    |    1|                5.0|                5.0|                5.0|                5.0|
    +-----+-------------------+-------------------+-------------------+-------------------+
  7. def aggByKey[U1, U2, U3](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3])(implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, U1, U2, U3)]

    Use three TypedColumns to aggregate the key-value Dataset's values for each key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> import org.apache.spark.sql.expressions.scalalang.typed
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2), typed.sum(x => x), typed.avg(x => x - 2)).show
    +-----+-----------------+-------------------+-----------------+
    |value|TypedAverage(int)|TypedSumDouble(int)|TypedAverage(int)|
    +-----+-----------------+-------------------+-----------------+
    |    1|              4.5|                5.0|              0.5|
    |    2|              6.0|                4.0|              2.0|
    +-----+-----------------+-------------------+-----------------+
  8. def aggByKey[U1, U2](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2])(implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, U1, U2)]

    Use two TypedColumns to aggregate the key-value Dataset's values for each key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> import org.apache.spark.sql.expressions.scalalang.typed
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2), typed.sum(x => x)).show
    +-----+-----------------+-------------------+
    |value|TypedAverage(int)|TypedSumDouble(int)|
    +-----+-----------------+-------------------+
    |    1|              4.5|                5.0|
    |    2|              6.0|                4.0|
    +-----+-----------------+-------------------+
  9. def aggByKey[U1](col1: TypedColumn[V, U1])(implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, U1)]

    Use a TypedColumn to aggregate the key-value Dataset's values for each key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> import org.apache.spark.sql.expressions.scalalang.typed
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(typed.avg(x => x + 2)).show
    +-----+-----------------+
    |value|TypedAverage(int)|
    +-----+-----------------+
    |    1|              4.5|
    |    2|              6.0|
    +-----+-----------------+
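
    Any TypedColumn works here, not just the helpers in org.apache.spark.sql.expressions.scalalang.typed; in particular, a custom org.apache.spark.sql.expressions.Aggregator converted with toColumn fits. A minimal sketch (the product aggregator below is illustrative, not part of the library):

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> import org.apache.spark.sql.{Encoder, Encoders, TypedColumn}
    scala> import org.apache.spark.sql.expressions.Aggregator
    scala> val product: TypedColumn[Int, Int] = new Aggregator[Int, Int, Int] {
         |   def zero: Int = 1                           // neutral element for multiplication
         |   def reduce(b: Int, a: Int): Int = b * a     // fold one value into the buffer
         |   def merge(b1: Int, b2: Int): Int = b1 * b2  // combine buffers across partitions
         |   def finish(b: Int): Int = b
         |   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
         |   def outputEncoder: Encoder[Int] = Encoders.scalaInt
         | }.toColumn
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.aggByKey(product).show  // key 1 -> 2*3 = 6, key 2 -> 4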
  10. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  11. def countByKey()(implicit encK: Encoder[K]): Dataset[(K, Long)]

    Count the number of rows in the key-value Dataset for each key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.countByKey.show
    +-----+--------+
    |value|count(1)|
    +-----+--------+
    |    1|       2|
    |    2|       1|
    +-----+--------+
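
    For illustration, equivalent per-key counts can be assembled from the other methods on this page (a less direct formulation, shown only to relate the operations):

    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.mapValues(_ => 1L).reduceByKey(_ + _).show  // key 1 -> 2, key 2 -> 1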
  12. val ds: Dataset[(K, V)]

    The underlying key-value Dataset being enriched.
  13. def flatMapValues[U](f: (V) ⇒ TraversableOnce[U])(implicit encKU: Encoder[(K, U)]): Dataset[(K, U)]

    Flat-map the key-value Dataset's values for each key, with the provided function.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2)).toDS.flatMapValues(x => List(x, x + 1)).show
    +---+---+
    | _1| _2|
    +---+---+
    |  1|  2|
    |  1|  3|
    +---+---+
  14. def fullOuterJoinOnKey[V1](other: Dataset[(K, V1)])(implicit encKV: Encoder[(K, V)], encKV1: Encoder[(K, V1)], encKVOptV1: Encoder[(K, (Option[V], Option[V1]))]): Dataset[(K, (Option[V], Option[V1]))]

    Full outer join with another key-value Dataset on their keys.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.fullOuterJoinOnKey(Seq((1, 4), (1, 5)).toDS).show
    +---+--------+
    | _1|      _2|
    +---+--------+
    |  1|   [2,4]|
    |  1|   [2,5]|
    |  1|   [3,4]|
    |  1|   [3,5]|
    |  2|[4,null]|
    +---+--------+
  15. def getClass(): Class[_ <: AnyVal]

    Definition Classes
    AnyVal → Any
  16. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  17. def joinOnKey[V1](other: Dataset[(K, V1)])(implicit encKV: Encoder[(K, V)], encKV1: Encoder[(K, V1)], encKVV1: Encoder[(K, (V, V1))]): Dataset[(K, (V, V1))]

    Inner join with another key-value Dataset on their keys.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.joinOnKey(Seq((1, 4)).toDS).show
    +---+-----+
    | _1|   _2|
    +---+-----+
    |  1|[2,4]|
    |  1|[3,4]|
    +---+-----+
  18. def keys(implicit encK: Encoder[K]): Dataset[K]

    Discard the key-value Dataset's values, leaving only the keys.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.keys.show
    +-----+
    |value|
    +-----+
    |    1|
    |    1|
    |    2|
    +-----+
  19. def leftOuterJoinOnKey[V1](other: Dataset[(K, V1)])(implicit encKV: Encoder[(K, V)], encKV1: Encoder[(K, V1)], encKVOptV1: Encoder[(K, (V, Option[V1]))]): Dataset[(K, (V, Option[V1]))]

    Left outer join with another key-value Dataset on their keys.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.leftOuterJoinOnKey(Seq((1, 4)).toDS).show
    +---+--------+
    | _1|      _2|
    +---+--------+
    |  1|   [2,4]|
    |  1|   [3,4]|
    |  2|[4,null]|
    +---+--------+
  20. def mapValues[U](f: (V) ⇒ U)(implicit encKU: Encoder[(K, U)]): Dataset[(K, U)]

    Apply a provided function to the values of the key-value Dataset.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3)).toDS.mapValues(_ + 2).show
    +---+---+
    | _1| _2|
    +---+---+
    |  1|  4|
    |  1|  5|
    +---+---+
  21. def partitionByKey(implicit encKV: Encoder[(K, V)]): Dataset[(K, V)]

    Partition the key-value Dataset by key. The number of partitions is taken from spark.sql.shuffle.partitions (evidently configured to 8 in the session below).

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> val ds = Seq((1, 2), (1, 3), (2, 4)).toDS
    scala> ds.rdd.partitions
    res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@20fe, org.apache.spark.rdd.ParallelCollectionPartition@20ff, org.apache.spark.rdd.ParallelCollectionPartition@2100)
    
    scala> ds.partitionByKey.rdd.partitions
    res2: Array[org.apache.spark.Partition] = Array(org.apache.spark.sql.execution.ShuffledRowRDDPartition@0, org.apache.spark.sql.execution.ShuffledRowRDDPartition@1, org.apache.spark.sql.execution.ShuffledRowRDDPartition@2, org.apache.spark.sql.execution.ShuffledRowRDDPartition@3, org.apache.spark.sql.execution.ShuffledRowRDDPartition@4, org.apache.spark.sql.execution.ShuffledRowRDDPartition@5, org.apache.spark.sql.execution.ShuffledRowRDDPartition@6, org.apache.spark.sql.execution.ShuffledRowRDDPartition@7)
  22. def partitionByKey(numPartitions: Int)(implicit encKV: Encoder[(K, V)]): Dataset[(K, V)]

    Partition the key-value Dataset by key into the given number of partitions.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> val ds = Seq((1, 2), (1, 3), (2, 4)).toDS
    scala> ds.rdd.partitions
    res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@20fe, org.apache.spark.rdd.ParallelCollectionPartition@20ff, org.apache.spark.rdd.ParallelCollectionPartition@2100)
    
    scala> ds.partitionByKey(1).rdd.partitions
    res2: Array[org.apache.spark.Partition] = Array(org.apache.spark.sql.execution.ShuffledRowRDDPartition@0)
  23. def reduceByKey(f: (V, V) ⇒ V)(implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, V)]

    Reduce the key-value Dataset's values for each key, with the provided function.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3)).toDS.reduceByKey(_ + _).show
    +-----+---------------------+
    |value|ReduceAggregator(int)|
    +-----+---------------------+
    |    1|                    5|
    +-----+---------------------+
  24. def rightOuterJoinOnKey[V1](other: Dataset[(K, V1)])(implicit encKV: Encoder[(K, V)], encKV1: Encoder[(K, V1)], encKVOptV1: Encoder[(K, (Option[V], V1))]): Dataset[(K, (Option[V], V1))]

    Right outer join with another key-value Dataset on their keys.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.rightOuterJoinOnKey(Seq((1, 4)).toDS).show
    +---+-----+
    | _1|   _2|
    +---+-----+
    |  1|[3,4]|
    |  1|[2,4]|
    +---+-----+
  25. def sortWithinPartitionsByKey(implicit encKV: Encoder[(K, V)]): Dataset[(K, V)]

    Sort the key-value Dataset within partitions by key.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> val ds = Seq((1, 2), (3, 1), (2, 2), (2, 6), (1, 1)).toDS
    scala> ds.partitionByKey(2).rdd.glom.map(_.map(_._1).toSeq).collect
    res56: Array[Seq[Int]] = Array(WrappedArray(2, 2), WrappedArray(1, 3, 1))
    
    scala> ds.partitionByKey(2).sortWithinPartitionsByKey.rdd.glom.map(_.map(_._1).toSeq).collect
    res57: Array[Seq[Int]] = Array(WrappedArray(2, 2), WrappedArray(1, 1, 3))
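
    Combined with partitionByKey this yields the classic secondary-sort layout: all rows for a key become contiguous within a single partition, so one pass with mapPartitions can handle a key's rows as a group without buffering them. A sketch (the keep-first-row-per-key filter is purely illustrative):

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> val ds = Seq((1, 2), (3, 1), (2, 2), (2, 6), (1, 1)).toDS
    scala> ds.partitionByKey(2).sortWithinPartitionsByKey.mapPartitions { it =>
         |   var prev: Option[Int] = None  // per-partition state; sorted input keeps equal keys adjacent
         |   it.filter { case (k, _) => val first = !prev.contains(k); prev = Some(k); first }
         | }.show  // keeps the first row seen for each key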
  26. def toString(): String

    Definition Classes
    Any
  27. def values(implicit encV: Encoder[V]): Dataset[V]

    Discard the key-value Dataset's keys, leaving only the values.

    scala> import com.tresata.spark.datasetops.RichPairDataset
    scala> Seq((1, 2), (1, 3), (2, 4)).toDS.values.show
    +-----+
    |value|
    +-----+
    |    2|
    |    3|
    |    4|
    +-----+

Inherited from AnyVal

Inherited from Any
