GroupByKey (Google Cloud Dataflow SDK API)

java.lang.Object
- com.google.cloud.dataflow.sdk.transforms.PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>
- - com.google.cloud.dataflow.sdk.transforms.GroupByKey<K,V>

Type Parameters:
K - the type of the keys of the input and output PCollections
V - the type of the values of the input PCollection and the elements of the Iterables in the output PCollection

All Implemented Interfaces:

java.io.Serializable
```
public class GroupByKey<K,V>
extends PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>
```
GroupByKey<K, V> takes a PCollection<KV<K, V>>, groups the values by key and window, and returns a PCollection<KV<K, Iterable<V>>> representing a map from each distinct key and window of the input PCollection to an Iterable over all the values associated with that key in that window. Each key in the output PCollection is unique within each window.
GroupByKey is analogous to converting a multi-map into a uni-map, and related to GROUP BY in SQL. It corresponds to the "shuffle" step between the Mapper and the Reducer in the MapReduce framework.
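The multi-map to uni-map analogy can be sketched in plain Java, without the Dataflow SDK and without windowing. The class name MultiMapToUniMap and the method group are hypothetical names chosen for this illustration; the sketch runs on a single machine and says nothing about how a runner distributes the work.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; hypothetical names, no Dataflow SDK involved.
class MultiMapToUniMap {
    // Collapses a list of (key, value) pairs into one entry per distinct key,
    // where each entry holds every value that was paired with that key.
    static <K, V> Map<K, List<V>> group(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleImmutableEntry<>("a.com", "doc1"));
        pairs.add(new SimpleImmutableEntry<>("b.com", "doc2"));
        pairs.add(new SimpleImmutableEntry<>("a.com", "doc3"));
        // Each distinct key appears once, mapped to all of its values.
        System.out.println(group(pairs));
    }
}
```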
Two keys of type K are compared for equality not by regular Java Object.equals(java.lang.Object), but instead by first encoding each of the keys using the Coder of the keys of the input PCollection, and then comparing the encoded bytes. This admits efficient parallel evaluation. Note that this requires that the Coder of the keys be deterministic (see Coder.verifyDeterministic()). If the key Coder is not deterministic, an exception is thrown at runtime.
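To make the encoded-bytes comparison concrete, here is a minimal plain-Java sketch. It assumes a hypothetical key type UrlKey that does not override equals, and a hypothetical encode method standing in for a deterministic key Coder; neither is part of the SDK. Two key objects that Object.equals treats as different still land in the same group because their encoded bytes match.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; UrlKey and encode are hypothetical.
class EncodedKeyGrouping {
    // Key type that does NOT override equals/hashCode, so equals() is identity.
    static final class UrlKey {
        final String url;
        UrlKey(String url) { this.url = url; }
    }

    // Stand-in for a deterministic key Coder: the same url always
    // produces the same bytes.
    static byte[] encode(UrlKey key) {
        return key.url.getBytes(StandardCharsets.UTF_8);
    }

    // Groups values by the encoded bytes of their keys, not by equals().
    static Map<ByteBuffer, List<String>> groupByEncodedKey(
            List<UrlKey> keys, List<String> values) {
        Map<ByteBuffer, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            // ByteBuffer compares and hashes by content, unlike byte[].
            ByteBuffer encoded = ByteBuffer.wrap(encode(keys.get(i)));
            groups.computeIfAbsent(encoded, k -> new ArrayList<>())
                  .add(values.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        UrlKey first = new UrlKey("a.com");
        UrlKey second = new UrlKey("a.com");
        System.out.println("equals(): " + first.equals(second));
        Map<ByteBuffer, List<String>> groups = groupByEncodedKey(
                Arrays.asList(first, second), Arrays.asList("doc1", "doc2"));
        System.out.println("groups: " + groups.size());
    }
}
```

If encode could return different bytes for the same logical key, equal keys would be split across groups, which is why the key Coder must be deterministic.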
By default, the Coder of the keys of the output PCollection is the same as that of the keys of the input, and the Coder of the elements of the Iterable values of the output PCollection is the same as the Coder of the values of the input.
Example of use:
```
// Input: one (url, doc) pair per document.
PCollection<KV<String, Doc>> urlDocPairs = ...;
// Group all docs that share a url (and window).
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
    urlDocPairs.apply(GroupByKey.<String, Doc>create());
// Process each url together with all of its docs.
PCollection<R> results =
    urlToDocs.apply(ParDo.of(new DoFn<KV<String, Iterable<Doc>>, R>() {
      @Override
      public void processElement(ProcessContext c) {
        String url = c.element().getKey();
        Iterable<Doc> docsWithThatUrl = c.element().getValue();
        ... process all docs having that url ...
      }}));
```
GroupByKey is a core primitive in data-parallel processing, since it is the main way to efficiently bring associated data together into one location. It is also a major determinant of the performance of a data-parallel pipeline.
See CoGroupByKey for a way to group multiple input PCollections by a common key at once.
See Combine.PerKey for a common pattern of GroupByKey followed by Combine.GroupedValues.
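The group-then-combine pattern can be illustrated in plain Java. This is a sketch under stated assumptions, not the SDK's implementation: the class GroupThenCombine and its methods are hypothetical, and summation stands in for an arbitrary combining function. The two-step version materializes every per-key list before reducing it, while the fused version folds each value into a running result per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; hypothetical names, no Dataflow SDK involved.
class GroupThenCombine {
    // Step 1: gather all values per key (the role GroupByKey plays).
    static Map<String, List<Integer>> group(List<String> keys, List<Integer> values) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            grouped.computeIfAbsent(keys.get(i), k -> new ArrayList<>())
                   .add(values.get(i));
        }
        return grouped;
    }

    // Step 2: reduce each group to one value (the role of combining grouped values).
    static Map<String, Integer> sumGroups(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            sums.put(e.getKey(), total);
        }
        return sums;
    }

    // Fused variant: same result without building the per-key lists.
    static Map<String, Integer> sumPerKey(List<String> keys, List<Integer> values) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            sums.merge(keys.get(i), values.get(i), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a");
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumGroups(group(keys, values)));
        System.out.println(sumPerKey(keys, values));
    }
}
```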
When grouping, windows that can be merged according to the WindowFn of the input PCollection will be merged together, and a group corresponding to the new, merged window will be emitted. The timestamp for each group is the upper bound of its window, i.e., the latest timestamp that can be assigned into the window, and each group is placed in the window that it corresponds to. The output PCollection will have the same WindowFn as the input.
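A minimal plain-Java sketch of merging windows and deriving the group timestamp follows. It assumes windows are half-open intervals [start, end) measured in milliseconds, that overlapping intervals are the ones that merge, and that the upper bound is therefore end - 1; the Interval class and the merge method are hypothetical and do not reproduce any particular WindowFn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; Interval and merge are hypothetical.
class WindowMergeSketch {
    // Half-open interval [start, end) in milliseconds.
    static final class Interval {
        final long start;
        final long end;
        Interval(long start, long end) { this.start = start; this.end = end; }

        // Latest timestamp still inside [start, end), assuming 1 ms granularity.
        long maxTimestamp() { return end - 1; }

        boolean intersects(Interval other) {
            return start < other.end && other.start < end;
        }

        Interval span(Interval other) {
            return new Interval(Math.min(start, other.start), Math.max(end, other.end));
        }
    }

    // Merges overlapping intervals; each result corresponds to one output group.
    static List<Interval> merge(List<Interval> windows) {
        List<Interval> sorted = new ArrayList<>(windows);
        sorted.sort((Interval a, Interval b) -> Long.compare(a.start, b.start));
        List<Interval> merged = new ArrayList<>();
        for (Interval w : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).intersects(w)) {
                // Overlaps the previous merged window: widen it.
                merged.set(last, merged.get(last).span(w));
            } else {
                merged.add(w);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Interval> merged = merge(Arrays.asList(
                new Interval(0, 10), new Interval(5, 15), new Interval(20, 30)));
        for (Interval w : merged) {
            System.out.println("[" + w.start + ", " + w.end + ") timestamp=" + w.maxTimestamp());
        }
    }
}
```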
If the input PCollection contains late data (see PubsubIO.Read.Bound.timestampLabel for an example of how this can occur), then there may be multiple elements output by a GroupByKey that correspond to the same key and window.
If the WindowFn of the input requires merging, it is not valid to apply another GroupByKey without first applying a new WindowFn.
See Also:
Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`GroupByKey.GroupAlsoByWindow<K,V>` Helper transform that takes a collection of timestamp-ordered values associated with each key, groups the values by window, combines windows as needed, and for each window in each key, outputs a collection of key/value-list pairs implicitly assigned to the window and with the timestamp derived from that window.
`static class`	`GroupByKey.GroupByKeyOnly<K,V>` Primitive helper transform that groups by key only, ignoring any window assignments.
`static class`	`GroupByKey.ReifyTimestampsAndWindows<K,V>` Helper transform that makes timestamps and window assignments explicit in the value part of each key/value pair.
`static class`	`GroupByKey.SortValuesByTimestamp<K,V>` Helper transform that sorts the values associated with each key by timestamp.

Field Summary
- Fields inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
  name

Constructor Summary

Constructors
Constructor and Description
`GroupByKey()`

Method Summary

Methods
Modifier and Type	Method and Description
`PCollection<KV<K,java.lang.Iterable<V>>>`	`apply(PCollection<KV<K,V>> input)` Applies this `PTransform` on the given `Input`, and returns its `Output`.
`PCollection<KV<K,java.lang.Iterable<V>>>`	`applyHelper(PCollection<KV<K,V>> input, boolean isStreaming, boolean runnerSortsByTimestamp)`
`static <K,V> GroupByKey<K,V>`	`create()` Returns a `GroupByKey<K, V>` `PTransform`.

Methods inherited from class com.google.cloud.dataflow.sdk.transforms.PTransform
finishSpecifying, getCoderRegistry, getDefaultName, getDefaultOutputCoder, getDefaultOutputCoder, getInput, getKindString, getName, getOutput, getPipeline, setName, setPipeline, toString, withName

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - GroupByKey
```
public GroupByKey()
```
- Method Detail
  - create
```
public static <K,V> GroupByKey<K,V> create()
```
    Returns a GroupByKey<K, V> PTransform.
    
    Type Parameters:
    K - the type of the keys of the input and output PCollections
    V - the type of the values of the input PCollection and the elements of the Iterables in the output PCollection
  - apply
```
public PCollection<KV<K,java.lang.Iterable<V>>> apply(PCollection<KV<K,V>> input)
```
    Description copied from class: PTransform
    
    Applies this PTransform on the given Input, and returns its Output.
    Composite transforms, which are defined in terms of other transforms, should return the output of one of the composed transforms. Non-composite transforms, which do not apply any transforms internally, should return a new unbound output and register evaluators (via backend-specific registration methods).
    The default implementation throws an exception. A derived class must either implement apply, or else each runner must supply a custom implementation via PipelineRunner.apply(com.google.cloud.dataflow.sdk.transforms.PTransform<Input, Output>, Input).
    
    Overrides:
    
    apply in class PTransform<PCollection<KV<K,V>>,PCollection<KV<K,java.lang.Iterable<V>>>>
  - applyHelper
```
public PCollection<KV<K,java.lang.Iterable<V>>> applyHelper(PCollection<KV<K,V>> input,
                                                   boolean isStreaming,
                                                   boolean runnerSortsByTimestamp)
```
