Class Group
- java.lang.Object
-
- org.apache.beam.sdk.schemas.transforms.Group
-
@Experimental(SCHEMAS) public class Group extends java.lang.Object
A generic grouping transform for schemaPCollection
s.When used without a combiner, this transforms simply acts as a
GroupByKey
but without the need for the user to explicitly extract the keys. For example, consider the following input type:@DefaultSchema(JavaFieldSchema.class) public class UserPurchase { public String userId; public String country; public long cost; public double transactionDuration; } PCollection<UserPurchase> purchases = readUserPurchases();
You can group all purchases by user and country as follows:
@DefaultSchema(JavaFieldSchema.class) PCollection<Row> byUser = purchases.apply(Group.byFieldNames("userId', "country"));
However often an aggregation of some form is desired. The builder methods inside the Group class allows building up separate aggregations for every field (or set of fields) on the input schema, and generating an output schema based on these aggregations. For example:
PCollection<Row> aggregated = purchases .apply(Group.byFieldNames("userId', "country") .aggregateField("cost", Sum.ofLongs(), "total_cost") .aggregateField("cost", Top.<Long>largestLongsFn(10), "top_purchases") .aggregateField("cost", ApproximateQuantilesCombineFn.create(21), Field.of("transactionDurations", FieldType.array(FieldType.INT64)));
The result will be a new row schema containing the fields total_cost, top_purchases, and transactionDurations, containing the sum of all purchases costs (for that user and country), the top ten purchases, and a histogram of transaction durations. The schema will also contain a key field, which will be a row containing userId and country.
Note that usually the field type can be automatically inferred from the
Combine.CombineFn
passed in. However sometimes it cannot be inferred, due to Java type erasure, in which case aSchema.Field
object containing the field type must be passed in. This is currently the case for ApproximateQuantilesCombineFn in the above example.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
Group.AggregateCombiner<InputT>
aPTransform
that does a combine using an aggregation built up by calls to aggregateField and aggregateFields.static class
Group.ByFields<InputT>
aPTransform
that groups schema elements based on the given fields.static class
Group.CombineFieldsByFields<InputT>
aPTransform
that does a per-key combine using an aggregation built up by calls to aggregateField and aggregateFields.static class
Group.CombineFieldsGlobally<InputT>
aPTransform
that does a global combine using an aggregation built up by calls to aggregateField and aggregateFields.static class
Group.CombineGlobally<InputT,OutputT>
aPTransform
that does a global combine using a providerCombine.CombineFn
.static class
Group.Global<InputT>
APTransform
for doing global aggregations on schema PCollections.
-
Constructor Summary
Constructors Constructor Description Group()
-
Method Summary
All Methods Static Methods Concrete Methods Modifier and Type Method Description static <T> Group.ByFields<T>
byFieldAccessDescriptor(FieldAccessDescriptor fieldAccess)
Returns a transform that groups all elements in the inputPCollection
keyed by the fields specified.static <T> Group.ByFields<T>
byFieldIds(java.lang.Integer... fieldIds)
Returns a transform that groups all elements in the inputPCollection
keyed by the list of fields specified.static <T> Group.ByFields<T>
byFieldIds(java.lang.Iterable<java.lang.Integer> fieldIds)
Same asbyFieldIds(Integer...)
.static <T> Group.ByFields<T>
byFieldNames(java.lang.Iterable<java.lang.String> fieldNames)
Same asbyFieldNames(String...)
.static <T> Group.ByFields<T>
byFieldNames(java.lang.String... fieldNames)
Returns a transform that groups all elements in the inputPCollection
keyed by the list of fields specified.static <T> Group.Global<T>
globally()
Returns a transform that groups all elements in the inputPCollection
.
-
-
-
Method Detail
-
globally
public static <T> Group.Global<T> globally()
Returns a transform that groups all elements in the inputPCollection
. The returned transform contains further builder methods to control how the grouping is done.
-
byFieldNames
public static <T> Group.ByFields<T> byFieldNames(java.lang.String... fieldNames)
Returns a transform that groups all elements in the inputPCollection
keyed by the list of fields specified. The output of this transform will be aKV
keyed by aRow
containing the specified extracted fields. The returned transform contains further builder methods to control how the grouping is done.
-
byFieldNames
public static <T> Group.ByFields<T> byFieldNames(java.lang.Iterable<java.lang.String> fieldNames)
Same asbyFieldNames(String...)
.
-
byFieldIds
public static <T> Group.ByFields<T> byFieldIds(java.lang.Integer... fieldIds)
Returns a transform that groups all elements in the inputPCollection
keyed by the list of fields specified. The output of this transform will have a key field of typeRow
containing the specified extracted fields. It will also have a value field of typeRow
containing the specified extracted fields. The returned transform contains further builder methods to control how the grouping is done.
-
byFieldIds
public static <T> Group.ByFields<T> byFieldIds(java.lang.Iterable<java.lang.Integer> fieldIds)
Same asbyFieldIds(Integer...)
.
-
byFieldAccessDescriptor
public static <T> Group.ByFields<T> byFieldAccessDescriptor(FieldAccessDescriptor fieldAccess)
Returns a transform that groups all elements in the inputPCollection
keyed by the fields specified. The output of this transform will have a key field of typeRow
containing the specified extracted fields. It will also have a value field of typeRow
containing the specified extracted fields. The returned transform contains further builder methods to control how the grouping is done.
-
-