This rule rewrites an aggregate query with distinct aggregations into an expanded double
aggregation in which the regular aggregation expressions and every distinct clause are
aggregated in separate groups. The results are then combined in a second aggregate.
For example (in Scala):

val data = Seq(
    ("a", "ca1", "cb1", 10),
    ("a", "ca1", "cb2", 5),
    ("b", "ca1", "cb1", 13))
  .toDF("key", "cat1", "cat2", "value")
data.createOrReplaceTempView("data")

val agg = data.groupBy($"key")
  .agg(
    countDistinct($"cat1").as("cat1_cnt"),
    countDistinct($"cat2").as("cat2_cnt"),
    sum($"value").as("total"))
This translates to the following (pseudo) logical plan:

Aggregate(
   key = ['key]
   functions = [COUNT(DISTINCT 'cat1), COUNT(DISTINCT 'cat2), sum('value)]
   output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  LocalTableScan [...]
The rule does the following things here:
1. Expand the data. There are three aggregation groups in this query:
   i. the non-distinct group;
   ii. the distinct 'cat1 group;
   iii. the distinct 'cat2 group.
   An expand operator is inserted to expand the child data for each group. The expand will null
   out all unused columns for the given group; this must be done in order to ensure correctness
   later on. Groups can be identified by a group id (gid) column added by the expand operator.
2. De-duplicate the distinct paths and aggregate the non-distinct path. The group by clause of
   this aggregate consists of the original group by clause, all the requested distinct columns
   and the group id. Both the de-duplication of the distinct columns and the aggregation of the
   non-distinct group take advantage of the fact that we group by the group id (gid) and that we
   have nulled out all non-relevant columns for the given group.
3. Aggregate the distinct groups and combine them with the results of the non-distinct
   aggregation. In this step we use the group id to filter the inputs for the aggregate
   functions. The results of the non-distinct group are 'aggregated' by using the first operator;
   it might be more elegant to use the native UDAF merge mechanism for this in the future.
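Put together, the three steps rewrite the example's plan into roughly the following (pseudo) logical plan. This is a reconstruction sketched from the steps above, not the rule's verbatim output:

```
Aggregate(
   key = ['key]
   functions = [count(if ('gid = 1) 'cat1 else null),
                count(if ('gid = 2) 'cat2 else null),
                first(if ('gid = 0) 'total else null) ignore nulls]
   output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  Aggregate(
     key = ['key, 'cat1, 'cat2, 'gid]
     functions = [sum('value)]
     output = ['key, 'cat1, 'cat2, 'gid, 'total])
    Expand(
       projections = [('key, null, null, 0, 'value),
                      ('key, 'cat1, null, 1, null),
                      ('key, null, 'cat2, 2, null)]
       output = ['key, 'cat1, 'cat2, 'gid, 'value])
      LocalTableScan [...]
```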
This rule duplicates the input data two or more times (# distinct groups + an optional
non-distinct group), which puts quite a bit of memory pressure on the aggregate and
exchange operators involved. Keeping the number of distinct groups as low as possible should be
a priority; we could improve this in the current rule by applying more advanced expression
canonicalization techniques.
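The three steps can be sketched end-to-end with plain Scala collections. This is a simulation only, with no Spark dependency; all names here (the object, `run`, the tuple layouts) are illustrative and not the rule's actual API:

```scala
// Simulates the rewrite on the example data: an Expand that appends a gid column,
// a first aggregate keyed by (key, cat1, cat2, gid), and a second aggregate that
// routes rows to each aggregate function by gid.
object DistinctRewriteSketch {
  // (key, cat1, cat2, value)
  val data = Seq(("a", "ca1", "cb1", 10), ("a", "ca1", "cb2", 5), ("b", "ca1", "cb1", 13))

  // Returns key -> (cat1_cnt, cat2_cnt, total).
  def run(rows: Seq[(String, String, String, Int)]): Map[String, (Int, Int, Long)] = {
    // Step 1: expand each row into one row per group, nulling out unused columns.
    val expanded: Seq[(String, String, String, Int, Integer)] = rows.flatMap {
      case (key, cat1, cat2, value) => Seq[(String, String, String, Int, Integer)](
        (key, null, null, 0, Integer.valueOf(value)), // gid 0: non-distinct group
        (key, cat1, null, 1, null),                   // gid 1: distinct 'cat1 group
        (key, null, cat2, 2, null))                   // gid 2: distinct 'cat2 group
    }
    // Step 2: group by (key, cat1, cat2, gid); this de-duplicates the distinct
    // groups and pre-aggregates sum('value) for the non-distinct group.
    val firstAgg: Map[(String, String, String, Int), Long] =
      expanded.groupBy(r => (r._1, r._2, r._3, r._4)).map { case (g, rs) =>
        g -> rs.map(r => Option(r._5).fold(0L)(_.longValue)).sum
      }
    // Step 3: group by key alone, using gid to filter inputs per aggregate function.
    firstAgg.toSeq.groupBy(_._1._1).map { case (key, entries) =>
      val cat1Cnt = entries.count(_._1._4 == 1)            // count(distinct 'cat1)
      val cat2Cnt = entries.count(_._1._4 == 2)            // count(distinct 'cat2)
      val total = entries.collect { case ((_, _, _, 0), s) => s }.sum // 'first' on gid 0
      key -> (cat1Cnt, cat2Cnt, total)
    }
  }
}
```

Running `DistinctRewriteSketch.run(DistinctRewriteSketch.data)` reproduces the example's expected aggregates, which illustrates why the nulled-out columns are essential: they make each group's rows collapse correctly in step 2 without interfering with the other groups.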