Rewrite AGG(IF()) to AGG() FILTER() by yuanzhanhku · Pull Request #16534 · prestodb/presto

yuanzhanhku · 2021-07-29T01:40:28Z

Add a rule to rewrite

AGG(IF(condition, expr))
to
AGG(expr) FILTER (WHERE condition).

The latter plan is more efficient because:

- the filter can be pushed down to the scan node
- the rows not matching the condition are not aggregated
- the IF() expression wrapper is removed.
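As a rough illustration of why the two forms agree for a NULL-skipping aggregate such as SUM, here is a minimal sketch in plain Java. It is a toy model, not Presto code: the `Row` type, the condition `x = 1`, and all method names are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from the Presto codebase.
final class AggIfVersusFilterSketch {
    // One input row: x feeds the condition, y is the aggregated value. Both may be null.
    static final class Row {
        final Integer x;
        final Long y;

        Row(Integer x, Long y) {
            this.x = x;
            this.y = y;
        }
    }

    // The condition x = 1 under SQL semantics: a NULL x does not match.
    static boolean matches(Row row) {
        return row.x != null && row.x == 1;
    }

    // SQL-style SUM: skips NULL inputs and yields NULL (null here) if nothing was added.
    static Long sum(List<Long> inputs) {
        Long total = null;
        for (Long value : inputs) {
            if (value == null) {
                continue;
            }
            if (total == null) {
                total = value;
            }
            else {
                total = total + value;
            }
        }
        return total;
    }

    // SUM(IF(x = 1, y)): every row reaches the aggregate; non-matching rows contribute NULL.
    static Long sumIf(List<Row> rows) {
        List<Long> inputs = new ArrayList<>();
        for (Row row : rows) {
            inputs.add(matches(row) ? row.y : null);
        }
        return sum(inputs);
    }

    // SUM(y) FILTER (WHERE x = 1): non-matching rows are dropped before the aggregate.
    static Long sumFilter(List<Row> rows) {
        List<Long> inputs = new ArrayList<>();
        for (Row row : rows) {
            if (matches(row)) {
                inputs.add(row.y);
            }
        }
        return sum(inputs);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, 2L), new Row(null, 20L), new Row(1, null), new Row(1, 5L));
        // Both forms produce the same sum for this input.
        System.out.println(sumIf(rows) + " " + sumFilter(rows));
    }
}
```

In this model the filtered form hands fewer values to the aggregate, which is the second point in the list above. The equivalence relies on the aggregate ignoring NULL inputs.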

Test plan
Added unit tests for the rule in TestRewriteAggIfToAggFilter.java.
Added query plan tests in TestFilteredAggregations.java and covered all existing test cases for AGG() FILTER.

== RELEASE NOTES ==

General Changes
* Introduce a new config ``optimizer.aggregation-if-to-filter-rewrite-enabled`` and its corresponding session property ``aggregation_if_to_filter_rewrite_enabled`` to enable or disable an optimizer rule to improve the query performance of ``IF`` expressions inside aggregation functions.

highker

still reviewing; but quick nit: let's change all "agg" into "aggregation" in this PR...

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java

highker

more comments on logic

...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteAggIfToAggFilter.java

yuanzhanhku · 2021-07-30T22:43:06Z

The order of the aggregations in the Aggregation node does not seem to be deterministic, so I made a few changes to generate the new expressions in the order of the VariableReferenceExpression names.
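One way to get a stable order is sketched below. This is only an illustration under assumptions: plain strings stand in for the VariableReferenceExpression names and the aggregation calls, and the class and method names are hypothetical. The thread does not show how the PR itself implements the ordering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: strings stand in for output variables and aggregation calls.
final class DeterministicOrderSketch {
    // Visits the aggregations sorted by output variable name, so the generated
    // expressions come out in the same order regardless of the map's iteration order.
    static List<String> rewriteInNameOrder(Map<String, String> aggregationsByOutputName) {
        List<String> rewritten = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(aggregationsByOutputName).entrySet()) {
            rewritten.add(entry.getKey() + " := " + entry.getValue());
        }
        return rewritten;
    }

    public static void main(String[] args) {
        Map<String, String> aggregations = new HashMap<>();
        aggregations.put("sum_b", "sum(b)");
        aggregations.put("count_c", "count(c)");
        aggregations.put("sum_a", "sum(a)");
        System.out.println(rewriteInNameOrder(aggregations));
    }
}
```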

highker · 2021-08-02T06:07:38Z

Quick nit: could you add the "release note" section to this GitHub page as well? Check existing merged PRs with release notes as examples. This should be "general changes" with something like "introduce a new config and session property to blah blah blah". Note that a release note should be extremely user-facing, meaning that we could use sentences like "optimize if expressions in aggregation functions to improve performance".

presto-expressions/src/main/java/com/facebook/presto/expressions/LogicalRowExpressions.java

presto-main/src/test/java/com/facebook/presto/sql/query/TestFilteredAggregations.java

...c/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteAggregationIfToFilter.java

kaikalur

So SET_AGG works differently for these two cases:


presto:di> select set_agg(if(x=1,y)) from (select 1 x, 2 y union all select null x, 20 y union all select 1 x, null y) group by y;
 _col0
--------
 [2]
 [null]
 [null]
(3 rows)

Query 20210805_003013_03168_fbgd3, FINISHED, 195 nodes
Splits: 3,139 total, 3,139 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

presto:di> select set_agg( y) filter (where x=1) from (select 1 x, 2 y union all select null x, 20 y union all select 1 x, null y) group by y;
 _col0
--------
 NULL
 [null]
 [2]

So you may want to see if there are aggregation properties needed for this to work properly.
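The difference in the two outputs above can be reproduced with a toy model. This is a sketch under stated assumptions, not Presto code: `setAgg` here is a hypothetical stand-in that keeps NULL inputs and returns NULL when it receives no rows, which is the behavior the query results suggest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

// Toy model only: an aggregate that is invoked on NULL inputs.
final class CalledOnNullInputSketch {
    static final class Row {
        final Integer x;
        final Integer y;

        Row(Integer x, Integer y) {
            this.x = x;
            this.y = y;
        }
    }

    // The condition x = 1 under SQL semantics: a NULL x does not match.
    static boolean xEqualsOne(Row row) {
        return row.x != null && row.x == 1;
    }

    // Assumed set_agg behavior: NULL inputs are kept as elements,
    // and a group that feeds no rows at all yields NULL (null here).
    static List<Integer> setAgg(List<Integer> inputs) {
        if (inputs.isEmpty()) {
            return null;
        }
        return new ArrayList<>(new LinkedHashSet<>(inputs));
    }

    // set_agg(if(x = 1, y)): non-matching rows still reach the aggregate as NULL.
    static List<Integer> setAggIf(List<Row> group) {
        List<Integer> inputs = new ArrayList<>();
        for (Row row : group) {
            inputs.add(xEqualsOne(row) ? row.y : null);
        }
        return setAgg(inputs);
    }

    // set_agg(y) filter (where x = 1): non-matching rows never reach the aggregate.
    static List<Integer> setAggFilter(List<Row> group) {
        List<Integer> inputs = new ArrayList<>();
        for (Row row : group) {
            if (xEqualsOne(row)) {
                inputs.add(row.y);
            }
        }
        return setAgg(inputs);
    }

    public static void main(String[] args) {
        // The three single-row groups produced by GROUP BY y in the queries above.
        List<List<Row>> groups = Arrays.asList(
                Collections.singletonList(new Row(1, 2)),
                Collections.singletonList(new Row(null, 20)),
                Collections.singletonList(new Row(1, null)));
        for (List<Row> group : groups) {
            System.out.println(setAggIf(group) + " vs " + setAggFilter(group));
        }
    }
}
```

Only the group with `x = null, y = 20` differs: the IF form feeds one NULL to the aggregate, while the FILTER form feeds it nothing.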

kaikalur · 2021-08-05T00:31:19Z

...c/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteAggregationIfToFilter.java

+        }
+        SpecialFormExpression expression = (SpecialFormExpression) sourceExpression;
+        // Only rewrite the aggregation if the else branch is not present.
+        return expression.getForm() == IF && Expressions.isNull(expression.getArguments().get(2));


You could also handle the case where the else part is the null literal, like IF(x, y, null).

yuanzhanhku · 2021-08-05T00:57:38Z

Thanks for catching this. I will make a fix and add the tests.

kaikalur · 2021-08-05T01:28:14Z

In fact, I think we should do this only for numerical aggs (or an allowlist to start with SUM/COUNT/MIN/MAX)

yuanzhanhku · 2021-08-05T03:23:12Z

> In fact, I think we should do this only for numerical aggs (or an allowlist to start with SUM/COUNT/MIN/MAX)

Right. I am planning to only do this rewrite for numerical aggs. In addition to SUM/COUNT/MIN/MAX, I see there are many use cases for approx_distinct/variance/approx_percentile as well. It would be good to enable the rewrite for these too.

kaikalur · 2021-08-05T14:11:17Z

Actually, there is more:

- Make sure the NULL behavior of the agg is "not called on null"
- Make sure the condition and the then part of the IF expression are both deterministic

Also, while you are at it, keep applying the simplification logic iteratively so things like IF(p1, IF(p2, x)) are also handled (I have seen those in tool-generated code).

Use the RowExpressionInterpreter to simplify the IF expression before checking if the else part is null. It does some more constant expression evaluation, so it is good to be comprehensive.

kaikalur · 2021-08-05T14:13:21Z

Just to be clear, if you do those things, there should be no need to special-case.
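The checks suggested in this thread can be summarized as one eligibility predicate. The sketch below is hypothetical: `Aggregate`, `IfArgument`, and `canRewrite` are invented for illustration and do not mirror the rule's real types, and the `calledOnNullInput` flags used in `main` are assumptions based on the discussion above.

```java
// Toy model only: these types are stand-ins, not Presto classes.
final class RewriteEligibilitySketch {
    // Hypothetical description of an aggregate function.
    static final class Aggregate {
        final String name;
        final boolean calledOnNullInput;

        Aggregate(String name, boolean calledOnNullInput) {
            this.name = name;
            this.calledOnNullInput = calledOnNullInput;
        }
    }

    // Hypothetical description of IF(condition, then[, else]) used as the aggregate's argument.
    static final class IfArgument {
        final boolean conditionDeterministic;
        final boolean thenDeterministic;
        final boolean elseAbsentOrNullLiteral;

        IfArgument(boolean conditionDeterministic, boolean thenDeterministic, boolean elseAbsentOrNullLiteral) {
            this.conditionDeterministic = conditionDeterministic;
            this.thenDeterministic = thenDeterministic;
            this.elseAbsentOrNullLiteral = elseAbsentOrNullLiteral;
        }
    }

    // Rewrite only when dropping non-matching rows cannot change what the aggregate sees:
    // the else branch is NULL, the aggregate skips NULL inputs, and both parts are deterministic.
    static boolean canRewrite(Aggregate aggregate, IfArgument argument) {
        return argument.elseAbsentOrNullLiteral
                && !aggregate.calledOnNullInput
                && argument.conditionDeterministic
                && argument.thenDeterministic;
    }

    public static void main(String[] args) {
        IfArgument plainIf = new IfArgument(true, true, true);
        // Assumed flags: sum skips NULL inputs, set_agg is called on them.
        System.out.println(canRewrite(new Aggregate("sum", false), plainIf));
        System.out.println(canRewrite(new Aggregate("set_agg", true), plainIf));
    }
}
```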

kaikalur · 2021-08-05T15:38:17Z

...c/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteAggregationIfToFilter.java

+        RowExpression predicate = TRUE_CONSTANT;
+        if (!aggregationNode.hasNonEmptyGroupingSet() && aggregationsToRewrite.size() == aggregationNode.getAggregations().size()) {
+            // All aggregations are rewritten by this rule. We can add a filter with all the masks to make the query more efficient.
+            predicate = or(masks.build());


This or(...) might actually cause more of a slowdown than it helps, because all masks have to be evaluated for every row anyway. I don't think this helps.

yuanzhanhku

I think it helps because:

- The filter could be pushed down to the scan node, where it might be evaluated very efficiently. For example, if it is on partition columns that use dictionary encoding, it only needs to be evaluated once per dictionary item.
- The filter can be used to prune partitions/splits based on column stats.

Besides, the AGG() FILTER implementation also adds this predicate. It is better to keep the same behavior:

presto/presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/ImplementFilteredAggregations.java

Line 121 in 04138bc

predicate = combineDisjunctsWithDefault(maskSymbols.build(), TRUE_LITERAL);
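To make the idea concrete, here is a minimal sketch of pre-filtering rows with the OR of all masks for an aggregation without GROUP BY. It is a toy model with invented names (`maskedSum`, `keepRowsMatchingAnyMask`), not the rule's implementation. It only shows that a row rejected by every mask cannot affect any masked aggregate, so dropping it early leaves the results unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: a global aggregation where every aggregate has a mask.
final class MaskDisjunctionSketch {
    static final class Row {
        final int a;
        final int b;
        final long v;

        Row(int a, int b, long v) {
            this.a = a;
            this.b = b;
            this.v = v;
        }
    }

    // SUM(v) FILTER (WHERE mask) without GROUP BY: null when no row passes the mask.
    static Long maskedSum(List<Row> rows, Predicate<Row> mask) {
        Long total = null;
        for (Row row : rows) {
            if (mask.test(row)) {
                total = (total == null ? 0L : total) + row.v;
            }
        }
        return total;
    }

    // Keeps only rows accepted by at least one mask, i.e. the OR of all masks.
    static List<Row> keepRowsMatchingAnyMask(List<Row> rows, List<Predicate<Row>> masks) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            for (Predicate<Row> mask : masks) {
                if (mask.test(row)) {
                    kept.add(row);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row(1, 0, 10), new Row(0, 2, 20), new Row(0, 0, 40));
        Predicate<Row> aIsOne = row -> row.a == 1;
        Predicate<Row> bIsTwo = row -> row.b == 2;
        List<Row> filtered = keepRowsMatchingAnyMask(rows, Arrays.asList(aIsOne, bIsTwo));
        // The third row matches neither mask, so removing it changes neither sum.
        System.out.println(maskedSum(rows, aIsOne) + " " + maskedSum(filtered, aIsOne));
        System.out.println(maskedSum(rows, bIsTwo) + " " + maskedSum(filtered, bIsTwo));
    }
}
```

The sketch covers only the no-grouping case guarded by the condition in the diff. With GROUP BY, removing rows could remove whole groups from the output, which is presumably why the predicate is not added there.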

yuanzhanhku · 2021-08-05T16:45:44Z

> Actually, there is more:
>
> Make sure the NULL behavior of the agg is "not called on null"

Great point on checking isCalledOnNullInput(). It is indeed the root cause of the behavior change, as this rewrite filters out the NULL values.

> Make sure the condition and the then part of the IF expression are both deterministic

Could you please elaborate on why non-deterministic functions matter in this case? Do you have some examples of when this could cause issues?

> Also, while you are at it, keep applying the simplification logic iteratively so things like IF(p1, IF(p2, x)) are also handled (I have seen those in tool-generated code).

Yes, this is a nice additional optimization we can do. Basically, we can rewrite AGG(IF(p1, IF(p2, x))) to AGG(x) FILTER(WHERE p1 AND p2).
Actually, it might be better to have a separate rewrite to inline the IF expressions in this case, i.e., rewrite IF(p1, IF(p2, x)) to IF(p1 AND p2, x). This rewrite can be applied to non-agg functions as well. It might improve performance because it simplifies the function?

> Use the RowExpressionInterpreter to simplify the IF expression before checking if the else part is null. It does some more constant expression evaluation, so it is good to be comprehensive.

The SimplifyRowExpressions rule simplifies all expressions. Might it be better to avoid doing the duplicate optimization here?
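The proposed inlining of nested IF expressions can be checked against SQL's three-valued logic with a small sketch. The helpers `sqlIf` and `sqlAnd` are hypothetical models written for this example, with Java `null` standing in for SQL NULL. They are not Presto functions.

```java
import java.util.Objects;

// Toy model only: Java null stands in for SQL NULL.
final class NestedIfSketch {
    // IF(cond, value) with no else branch: the value is returned only when cond is TRUE;
    // FALSE and NULL conditions both yield NULL.
    static <T> T sqlIf(Boolean cond, T value) {
        return Boolean.TRUE.equals(cond) ? value : null;
    }

    // AND under three-valued logic: FALSE wins, otherwise NULL wins, otherwise TRUE.
    static Boolean sqlAnd(Boolean left, Boolean right) {
        if (Boolean.FALSE.equals(left) || Boolean.FALSE.equals(right)) {
            return Boolean.FALSE;
        }
        if (left == null || right == null) {
            return null;
        }
        return Boolean.TRUE;
    }

    // Compares IF(p1, IF(p2, x)) with IF(p1 AND p2, x) for all nine combinations of p1 and p2.
    static boolean equivalentForAllInputs() {
        Boolean[] truthValues = {Boolean.TRUE, Boolean.FALSE, null};
        for (Boolean p1 : truthValues) {
            for (Boolean p2 : truthValues) {
                Integer nested = sqlIf(p1, sqlIf(p2, 42));
                Integer flattened = sqlIf(sqlAnd(p1, p2), 42);
                if (!Objects.equals(nested, flattened)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(equivalentForAllInputs());
    }
}
```

This model evaluates values eagerly and says nothing about non-deterministic expressions or evaluation order, so it supports the equivalence only for deterministic inputs.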

kaikalur · 2021-08-05T17:03:25Z

> > Actually, there is more:
> >
> > Make sure the NULL behavior of the agg is "not called on null"
>
> Great point on checking isCalledOnNullInput(). It is indeed the root cause of the behavior change, as this rewrite filters out the NULL values.
>
> > Make sure the condition and the then part of the IF expression are both deterministic
>
> Could you please elaborate on why non-deterministic functions matter in this case? Do you have some examples of when this could cause issues?

It's one of those things: for example, random() is called once per row, I think, but if you put it in the then part, it might behave differently. Not sure, so let's be conservative (for corner cases).

> > Also, while you are at it, keep applying the simplification logic iteratively so things like IF(p1, IF(p2, x)) are also handled (I have seen those in tool-generated code).
>
> Yes, this is a nice additional optimization we can do. Basically, we can rewrite AGG(IF(p1, IF(p2, x))) to AGG(x) FILTER(WHERE p1 AND p2).
> Actually, it might be better to have a separate rewrite to inline the IF expressions in this case, i.e., rewrite IF(p1, IF(p2, x)) to IF(p1 AND p2, x). This rewrite can be applied to non-agg functions as well. It might improve performance because it simplifies the function?

I thought about it. Sure, you can do that in a separate PR.

> > Use the RowExpressionInterpreter to simplify the IF expression before checking if the else part is null. It does some more constant expression evaluation, so it is good to be comprehensive.
>
> The SimplifyRowExpressions rule simplifies all expressions. Might it be better to avoid doing the duplicate optimization here?

That's fine. It's good to make sure it's simplified to get max benefit.

yuanzhanhku · 2021-08-05T18:06:59Z

Thanks Sreeni for all the suggestions. Made the changes in #16566.

yuanzhanhku requested a review from highker July 29, 2021 01:40

yuanzhanhku force-pushed the master branch 2 times, most recently from 03ac1d7 to b2a2c2a Compare July 29, 2021 01:50

yuanzhanhku linked an issue Jul 29, 2021 that may be closed by this pull request

Add a rule to rewrite AGG(IF()) to AGG() WITH FILTER #16535

Closed

yuanzhanhku force-pushed the master branch from b2a2c2a to d7c1a54 Compare July 29, 2021 16:51

highker requested review from kaikalur and shixuan-fan July 29, 2021 17:09

Fix aggregation mask matcher

8270c04

The filter for the aggregation with mask was incorrect.

yuanzhanhku force-pushed the master branch from d7c1a54 to d054ee0 Compare July 29, 2021 20:26

highker reviewed Jul 30, 2021

yuanzhanhku force-pushed the master branch from 8594514 to 8c40e88 Compare July 30, 2021 20:25

yuanzhanhku requested a review from highker July 30, 2021 20:25

yuanzhanhku force-pushed the master branch from 8c40e88 to 401a3dc Compare July 30, 2021 22:39

yuanzhanhku force-pushed the master branch from 401a3dc to 3fee58a Compare July 30, 2021 22:57

highker approved these changes Aug 2, 2021

highker self-assigned this Aug 2, 2021

yuanzhanhku added 3 commits August 2, 2021 09:31

Allow and/or to take subclasses of RowExpression

2dca81f

Add session property and config to control AGG IF rewrite

5b679e6

The rule rewriting AGG(IF()) to AGG() FILTER is enabled by default. To disable the rule, run SET SESSION aggregation_if_to_filter_rewrite_enabled=false; or set the config optimizer.aggregation-if-to-filter-rewrite-enabled to false.

yuanzhanhku force-pushed the master branch from 3fee58a to 5b679e6 Compare August 2, 2021 16:34

yuanzhanhku requested a review from highker August 2, 2021 16:35

highker merged commit 39cc942 into prestodb:master Aug 3, 2021

kaikalur reviewed Aug 5, 2021

varungajjala mentioned this pull request Aug 16, 2021

Add release notes for 0.260 #16619

Merged

3 tasks

kaikalur mentioned this pull request Aug 28, 2021

Rewrite COUNT_IF to COUNT(1) FILTER WHERE #16662

Open
