Optimize filter condition with switch case by maswin · Pull Request #17065 · prestodb/presto

maswin · 2021-12-03T12:09:57Z

Test plan - Unit tests + verifier run on internal queries

== RELEASE NOTES ==

General Changes
* Added a new Optimizer SimplifySwitchExpression which simplifies switch cases in filter condition

DB visualization tools such as Looker and DbVisualizer generate suboptimal queries when a selected column is a CASE expression and a filter is applied on top of it. For instance, they generate queries such as:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end from students where (case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end) = 'good';

But this query can be simplified into:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end from students where behavior_code = 1;

Without this simplification, the query misses out on many other optimizations (e.g., ORC indexes) and becomes extremely slow.

Since these queries are generated by the UI, the user doesn't have much control over them.
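The equivalence claimed above can be checked on a few sample values. This is only a minimal Python sketch, not Presto code: the row values are made up, None stands in for SQL NULL, and the function names are hypothetical.

```python
# Hypothetical sample values for behavior_code; None stands in for SQL NULL.
ROWS = [1, 2, 3, None]

def label(code):
    """Mimics CASE behavior_code WHEN 1 THEN 'good' WHEN 2 THEN 'bad' ELSE 'neutral' END."""
    if code == 1:
        return "good"
    if code == 2:
        return "bad"
    return "neutral"  # a NULL operand matches no WHEN and falls through to ELSE

def original_filter(code):
    # WHERE (CASE ...) = 'good'
    return label(code) == "good"

def simplified_filter(code):
    # WHERE behavior_code = 1; for NULL the comparison is not TRUE, so the row is dropped
    return code is not None and code == 1

kept_original = [c for c in ROWS if original_filter(c)]
kept_simplified = [c for c in ROWS if simplified_filter(c)]
```

Both filters keep only the row with behavior_code = 1 in this sample, which is why the simplified predicate can be handed to the connector instead of the CASE expression.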

linux-foundation-easycla · 2021-12-03T12:09:59Z

The committers listed above are authorized under a signed CLA.

✅ login: maswin / name: Alagappan Maruthappan (f085fc9eef45da04d6952ae35467a17f5859edfa)

linux-foundation-easycla · 2021-12-03T12:12:24Z

The committers are authorized under a signed CLA.

✅ Alagappan Maruthappan (47ea97a23a8eed19039358a6ca8fa784584308d9)

kaikalur · 2021-12-14T22:26:11Z

An interesting way to do this would be to flip WHEN/THEN (by appropriately adding equals to the THEN) and call RowExpressionInterpreter. For example, take the query above and create a new expression:

case when 'good' = 'good' then behavior_code = 1 when 'good' = 'bad' then behavior_code = 2 when 'good' = 'neutral' then not (behavior_code = 1 OR behavior_code = 2) end

and call RowExpressionInterpreter on it. That should be more robust.

kaikalur · 2021-12-14T22:29:00Z

@rongrong Check it out

kaikalur · 2021-12-14T23:40:15Z

In fact, this can be generalized to any relational expression where one side is a CASE: just push the relation into the THEN parts, flip them, and evaluate. The ELSE part is simply the NOT of the OR of all the new THEN parts.

maswin · 2021-12-17T02:13:35Z

What about the case where two WHEN clauses match the THEN part?

(CASE
  WHEN column_val = 1 THEN 'a'   -- matching
  WHEN column_val = 2 THEN 'a'   -- matching
  WHEN column_val = 3 THEN 'b'
ELSE 'c' END) = 'a'

Flipping this gives:

CASE
  WHEN 'a'='a' THEN column_val = 1
  WHEN 'a'='a' THEN column_val = 2
  WHEN 'a'='b' THEN column_val = 3
  WHEN 'a'='c' THEN NOT (column_val = 1 OR column_val = 2 OR column_val = 3)
ELSE false END

Since a CASE expression uses the first WHEN that matches and stops there, RowExpressionInterpreter will simplify this to:
column_val = 1
But the right simplification is:
column_val = 1 OR column_val = 2
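The mismatch can be reproduced with a small model. The sketch below is Python, not Presto code; it assumes first-match CASE semantics, non-NULL integer inputs, and hypothetical helper names.

```python
TARGET = "a"
# (WHEN predicate, THEN value) pairs from the example above.
BRANCHES = [
    (lambda v: v == 1, "a"),
    (lambda v: v == 2, "a"),
    (lambda v: v == 3, "b"),
]
ELSE_VALUE = "c"

def original(v):
    """(CASE ... END) = 'a': the first WHEN that holds decides the result."""
    for when, then in BRANCHES:
        if when(v):
            return then == TARGET
    return ELSE_VALUE == TARGET

def flipped_case(v):
    """Flipped CASE: the first branch whose THEN equals TARGET wins, later ones are ignored."""
    for when, then in BRANCHES:
        if then == TARGET:
            return when(v)
    if ELSE_VALUE == TARGET:
        return not any(when(v) for when, _ in BRANCHES)
    return False

# Values on which the flipped CASE disagrees with the original predicate.
mismatches = [v for v in range(1, 5) if original(v) != flipped_case(v)]
```

Only column_val = 2 disagrees: the original predicate accepts it, while the flipped CASE stops at the first 'a' = 'a' branch and tests column_val = 1.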

maswin · 2021-12-17T03:49:42Z

It needs to be converted to a series of OR clauses:

(CASE
  WHEN column_val = 1 THEN 'a'   -- matching
  WHEN column_val = 2 THEN 'a'   -- matching
  WHEN column_val = 3 THEN 'b'
ELSE 'c' END) = 'a'

The above expression gets transformed to:

('a'='a' AND column_val = 1) OR
('a'='a' AND column_val = 2) OR
('b'='a' AND column_val = 3) OR
('c'='a' AND NOT (column_val = 1 OR column_val = 2 OR column_val = 3))

RowExpressionInterpreter can transform this into a simpler expression:

column_val = 1 OR column_val = 2

This handles a lot of other edge cases too, e.g. when one of the THEN clauses and the ELSE clause both match:

(CASE
  WHEN column_val = 1 THEN 'a'   -- matching
  WHEN column_val = 2 THEN 'b'
ELSE 'a'      -- matching
END) = 'a'

gets simplified to

column_val != 2
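As a rough check of the OR form, the sketch below compares it with first-match CASE evaluation for both examples. It is a Python model under stated assumptions (mutually exclusive WHEN conditions, non-NULL integer inputs, hypothetical function names), not the rule's implementation.

```python
def case_equals(branches, else_value, target, v):
    """First-match CASE semantics, then compare the result with target."""
    for when, then in branches:
        if when(v):
            return then == target
    return else_value == target

def or_rewrite(branches, else_value, target, v):
    """One disjunct per branch, (THEN = target AND WHEN); the ELSE disjunct negates every WHEN."""
    disjuncts = [then == target and when(v) for when, then in branches]
    disjuncts.append(else_value == target and not any(when(v) for when, _ in branches))
    return any(disjuncts)

# (branches, ELSE value) for the two examples above.
FIRST = ([(lambda v: v == 1, "a"), (lambda v: v == 2, "a"), (lambda v: v == 3, "b")], "c")
SECOND = ([(lambda v: v == 1, "a"), (lambda v: v == 2, "b")], "a")

VALUES = range(0, 6)
# First example should behave like: column_val = 1 OR column_val = 2
first_ok = all(
    or_rewrite(*FIRST, "a", v) == case_equals(*FIRST, "a", v) == (v in (1, 2)) for v in VALUES
)
# Second example should behave like: column_val != 2
second_ok = all(
    or_rewrite(*SECOND, "a", v) == case_equals(*SECOND, "a", v) == (v != 2) for v in VALUES
)
```

On these sample values the OR form, the original CASE predicate, and the simplified expressions all agree.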

kaikalur · 2021-12-17T05:44:55Z

Hmm, yeah, I'm usually not a fan of OR expressions because they are harder to push down. If we are going to stick to doing it only for cases where all THEN parts are constants/literals, we can just build a map THEN -> WHEN[] and then generate an OR for each.
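The THEN -> WHEN[] map could look roughly like the following. This is a Python sketch with conditions kept as plain SQL text; the helper names are hypothetical, the ELSE branch is deliberately left out, and it assumes every THEN value is a constant.

```python
def group_by_then(branches):
    """Map each constant THEN value to the list of WHEN conditions that produce it."""
    groups = {}
    for when, then in branches:
        groups.setdefault(then, []).append(when)
    return groups

def predicate_for(branches, target):
    """OR together the WHEN conditions whose THEN value equals target."""
    whens = group_by_then(branches).get(target, [])
    return " OR ".join(whens) if whens else "FALSE"

BRANCHES = [
    ("column_val = 1", "a"),
    ("column_val = 2", "a"),
    ("column_val = 3", "b"),
]

groups = group_by_then(BRANCHES)
predicate_a = predicate_for(BRANCHES, "a")
predicate_z = predicate_for(BRANCHES, "z")
```

Here predicate_a is "column_val = 1 OR column_val = 2", and a target that no THEN produces collapses to FALSE.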

maswin · 2021-12-17T06:31:14Z

Yeah, it is difficult to push down, but at least OR seems easier to deal with than CASE expressions for further optimizations. I was assuming we would do this optimization as long as no non-deterministic function is involved in the expression, since the conversion from CASE to OR doesn't change the final result of the predicate, and scenarios like concat('ab', 'cd') = 'abcd' would also be handled.

kaikalur · 2021-12-17T06:45:01Z

Actually, CASE is sequential and OR is not, but I guess it won't matter here if we are checking for constants only. The issue with OR is the whole NULL handling. So if we can do CASE, we should.

Also like I said see if you can generalize for all relational operators.
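One way to see the NULL concern is to model SQL three-valued logic by hand. The Python sketch below reuses the ELSE example from earlier in the thread; None stands in for NULL, and the helper names are hypothetical.

```python
def sql_eq(a, b):
    """SQL '=': the result is NULL (None) when either side is NULL."""
    return None if a is None or b is None else a == b

def sql_ne(a, b):
    """SQL '!=': the result is NULL (None) when either side is NULL."""
    return None if a is None or b is None else a != b

def is_true(x):
    """A filter keeps a row only when its predicate is TRUE, not NULL or FALSE."""
    return x is True

def case_filter(v):
    # (CASE WHEN column_val = 1 THEN 'a' WHEN column_val = 2 THEN 'b' ELSE 'a' END) = 'a'
    if is_true(sql_eq(v, 1)):
        result = "a"
    elif is_true(sql_eq(v, 2)):
        result = "b"
    else:
        result = "a"  # a NULL column_val matches no WHEN and lands here
    return is_true(sql_eq(result, "a"))

def rewritten_filter(v):
    # column_val != 2
    return is_true(sql_ne(v, 2))

agree_on_non_null = all(case_filter(v) == rewritten_filter(v) for v in (1, 2, 3))
null_case = case_filter(None)
null_rewritten = rewritten_filter(None)
```

For non-NULL values the two filters agree, but a NULL column_val passes the CASE filter (through ELSE) and is dropped by column_val != 2, so the rewrite would need extra NULL handling in this situation.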

kaikalur

Also please add some explicit end-to-end tests in hive connector to see the impact of this optimization.

kaikalur · 2021-12-21T16:19:01Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

I think it's best to check that otherExpression is actually a constant so it is clearly beneficial without too many issues with non-determinism etc

We have scenarios such as

(CASE WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1' WHEN (replace(column_name, 'prefix_string')) = 'dept2' THEN 'V2' ELSE (replace(column_name, 'prefix_string')) END ) = UPPER('v1');

In this case, checking only for a constant expression would be problematic.

One problem we see with a CASE expression is that the column domain isn't extracted from it while picking the table layout. Switching it to AND/OR clauses helps achieve that, even if there is a column reference on the other side of the equals:

(CASE WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1' WHEN (replace(column_name, 'prefix_string')) = 'dept2' THEN 'V2' ELSE 'V3' END ) = column_name_2;

The above expression gets converted into a cumbersome AND/OR expression: since the column's equality with the THEN values cannot be evaluated, the expression won't be simplified further. But it still helps RowExpressionInterpreter/optimizers extract the column domain for the table from that expression, i.e., the column_name_2 domain is set to [V1, V2, V3].
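To illustrate the domain claim, here is a minimal Python sketch. It is not Presto's domain extraction; it only assumes that each disjunct of the rewritten predicate pairs one result constant with an opaque condition, and the names are hypothetical.

```python
# Each disjunct of the rewritten predicate has the shape
#   (<constant> = column_name_2 AND <condition on column_name>)
# The conditions are kept as opaque text because they are not needed for the domain.
DISJUNCTS = [
    ("V1", "replace(column_name, 'prefix_string') = 'dept1'"),
    ("V2", "replace(column_name, 'prefix_string') = 'dept2'"),
    ("V3", "neither WHEN condition holds"),
]

def candidate_domain(disjuncts):
    """Union of the constants that column_name_2 is compared against across all disjuncts."""
    return sorted({constant for constant, _ in disjuncts})

domain = candidate_domain(DISJUNCTS)
```

Even though none of the disjuncts can be folded away, every row that passes must have column_name_2 equal to one of the three constants, which is the information a table layout can use.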

These things can lead to subtle bugs. And also if the other side is a function call pushing it into the CASE can make it evaluate many times. We should not need to do random stuff like upper(..). But maybe extend it to just constant or a variable/field reference/input expression.

kaikalur · 2021-12-21T16:22:12Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

Do you really need this class?

Simplified without the class.

kaikalur · 2021-12-21T16:24:55Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

Why do you need to check for this?

SearchedCase and SimpleCase expressions have the same format in RowExpression. If it's a SearchedCase expression, the RowExpression has ConstantExpression(true) in place of the case operand. So I am performing the check to avoid generating true = (WHEN condition).
Similar to this:

presto/presto-main/src/main/java/com/facebook/presto/sql/gen/SwitchCodeGenerator.java

Line 83 in 123c2ab

boolean searchedCase = (value instanceof ConstantExpression && ((ConstantExpression) value).getType() == BOOLEAN &&
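The effect of that check can be sketched as follows. This is a Python model over SQL text, not the actual rule: the literal operand "true" stands in for a boolean constant operand, which is what the searchedCase test above looks for, and the function name is hypothetical.

```python
def when_predicate(operand, when_value):
    """Build the condition under which one WHEN branch fires.

    Both arguments are SQL text. In this sketch the literal operand "true"
    marks a searched CASE, so the WHEN value is already a full condition.
    """
    if operand.strip().lower() == "true":
        return when_value  # avoid producing: true = (column_val = 1)
    return f"{operand} = {when_value}"

# Simple CASE: CASE behavior_code WHEN 1 THEN ...
simple = when_predicate("behavior_code", "1")
# Searched CASE: CASE WHEN column_val = 1 THEN ...
searched = when_predicate("true", "column_val = 1")
```

The simple form yields "behavior_code = 1", while the searched form passes the WHEN condition through unchanged.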

kaikalur · 2021-12-21T17:33:07Z

In fact, we want to check either all of the THEN and ELSE are constant or the otherExpression is constant

kaikalur · 2021-12-21T17:43:31Z

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

And we normally start by disabling the feature by default

maswin · 2021-12-23T12:05:15Z

> Also please add some explicit end-to-end tests in hive connector to see the impact of this optimization.

Can you please point out where these tests exist? Internally, we saw queries that ran for over 10 minutes finish in less than 10 seconds.

kaikalur

We also need more end-to-end tests, not just unit tests. Look at AbstractTestQueries: adding tests there will exercise multiple connectors, which will be good.

kaikalur · 2021-12-24T15:55:13Z

presto-expressions/src/main/java/com/facebook/presto/expressions/LogicalRowExpressions.java

equals is a confusing name! Make it createEquals

kaikalur · 2021-12-24T16:00:34Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

Add a helper method for this conditional so the logic is clear.

Refactored the part.

kaikalur · 2021-12-24T16:16:30Z

Also fix the test failures

maswin · 2022-01-04T08:48:55Z

> We also need more end-to-end tests, not just unit tests. Look at AbstractTestQueries: adding tests there will exercise multiple connectors, which will be good.

Let me know if more cases need to be handled in AbstractTestQueries.

kaikalur · 2022-04-21T17:09:40Z

@maswin where are we at with this PR? This could be useful for one of our use cases, so just wanted to check if there is anything else missing. Thanks!

maswin · 2022-04-22T07:15:32Z

We have handled the case where a CAST is applied on top of the expression, since most of the time the expressions are cast to a common supertype so that the LHS and RHS types match. All other review comments have been addressed.

kaikalur · 2022-04-22T13:34:54Z

Thank you! Added @rschlussel for further review and hopefully merging.

rschlussel

I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet

rschlussel · 2022-04-27T15:44:11Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

It would be good to add a note about why this limitation exists. Am I correct that it's because CASE expressions are ordered, so you need the conditions to be disjoint (only one can be true) to replace it with an OR?

I have rewritten the description with the reasoning for it.

rschlussel · 2022-04-27T15:55:45Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

why are there restrictions for this and the rhs expressions beyond just requiring a deterministic function for the value? Is it just less useful in those cases? also why doesn't the rhs restriction also allow column references?

Would we be able to tell at query planning time whether the conditions are disjoint if there are column references on the RHS?

Also, most of the time it is a data visualization tool that generates such queries, usually with a column on the LHS and a constant on the RHS. We have handled the most common case.

rschlussel · 2022-04-27T15:57:42Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

should there also be some restriction on the number of cases?

Internally we saw many CASE expressions with ~30-40 WHEN conditions. Most of the time one of the WHEN return values matches the final equals value, and most of the OR clauses, especially the one for the final ELSE, get removed from the final expression. So I am not sure this restriction is needed.

rschlussel · 2022-04-28T15:07:20Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

If the LHS are the same, why do the RHS need to be unique? Wouldn't it just be a duplicate?

If we do the rewrite without this check (i.e., when the RHS is not unique),

(case when col1=1 then 'a' when col1=1 then 'b' else 'c' end) = 'b'

will get rewritten to

(col1=1 AND 'a'='b') OR (col1=1 AND 'b'='b') OR ('c'='b' AND col1<>1 AND col1<>1)

The 'a'='b' and 'c'='b' conditions evaluate to false, so this gets simplified to
col1=1
But this is a wrong simplification. If col1=1, the CASE expression would have taken the first WHEN clause and returned 'a', which is not equal to 'b'. Our simplified condition lets rows with col1=1 pass when they should not.
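The same counterexample can be run through a small model. This is a Python sketch assuming first-match CASE semantics and non-NULL inputs; the function names are hypothetical and the naive rewrite deliberately skips the uniqueness check.

```python
TARGET = "b"
# Two WHEN clauses with the same condition but different THEN values.
BRANCHES = [
    (lambda col1: col1 == 1, "a"),
    (lambda col1: col1 == 1, "b"),
]
ELSE_VALUE = "c"

def case_filter(col1):
    """(CASE ... END) = 'b' with first-match semantics."""
    for when, then in BRANCHES:
        if when(col1):
            return then == TARGET
    return ELSE_VALUE == TARGET

def naive_or_rewrite(col1):
    """OR rewrite applied without checking that the WHEN conditions are distinct."""
    disjuncts = [then == TARGET and when(col1) for when, then in BRANCHES]
    disjuncts.append(ELSE_VALUE == TARGET and not any(when(col1) for when, _ in BRANCHES))
    return any(disjuncts)

# Values on which the naive rewrite disagrees with the original CASE predicate.
wrong_rows = [v for v in (1, 2, 3) if case_filter(v) != naive_or_rewrite(v)]
```

Only col1 = 1 is affected: the CASE predicate rejects it because the first branch returns 'a', while the naive rewrite accepts it through the second branch.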

rschlussel · 2022-04-28T15:29:20Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

use an immutableListBuilder

rschlussel · 2022-04-28T16:15:18Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

always use immutableCollections

rschlussel · 2022-04-28T16:33:31Z

presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

you don't need to compare against caseExpressionRewriteDisabled. assertQuery compares the results against another DB. it ignores the session completely for creating the expected results

Didn't realize that. Fixed.

rschlussel · 2022-04-28T16:33:50Z

presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueryFramework.java

this shouldn't be necessary

rschlussel · 2022-04-28T16:49:47Z

.../java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteCaseExpressionPredicate.java

what is going on with the second half of this test here?

The second half of the test checks the optimized version of the expression. Most of the time the result = value condition evaluates to false and a simplified final expression is produced. That simplification is not done by the rewriter, and testing it here goes against unit testing principles, but it helped in understanding and developing the rewriter, so I retained it in the final commit.

I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.

rschlussel · 2022-04-28T16:51:11Z

.../java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteCaseExpressionPredicate.java

love the thorough test cases!

maswin · 2022-05-02T09:00:45Z

> I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet

We wanted the setting to be disabled initially, and couldn't find a way to disable a rule in ExpressionRewriteRuleSet/RowExpressionRewriteRuleSet. But yeah, that might help extend it to all types of nodes.

maswin · 2022-05-02T20:34:44Z

I have now made an additional commit that adds the ability for RowExpressionRewriteRuleSet to be disabled if required, and modified RewriteCaseExpressionPredicate to extend RowExpressionRewriteRuleSet. This enables it to optimize filter expressions in JOIN conditions too.

rschlussel · 2022-05-06T17:50:51Z

...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java

add a checkArgument that expression is a cast expression and its argument is a case expression

rschlussel · 2022-05-06T17:59:14Z

.../java/com/facebook/presto/sql/planner/iterative/rule/TestRewriteCaseExpressionPredicate.java

I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.

rschlussel

Thank you!

rschlussel · 2022-05-09T14:15:10Z

@kaikalur can you review/approve? Merging is blocked since you are still marked as requesting changes.

kaikalur · 2022-05-12T14:20:32Z

Are we ready to merge this?

rschlussel · 2022-05-12T15:23:13Z

yup. I'll merge now!

maswin force-pushed the switch_case_optimizer branch from c3e19d0 to 47ea97a Compare December 3, 2021 12:12

maswin force-pushed the switch_case_optimizer branch 3 times, most recently from 4ee1ee7 to 6d88ef5 Compare December 8, 2021 21:26

maswin force-pushed the switch_case_optimizer branch 3 times, most recently from aabdf49 to bfaf8c1 Compare December 14, 2021 00:21

maswin force-pushed the switch_case_optimizer branch 2 times, most recently from 477c4d8 to afc1fcb Compare December 21, 2021 04:40

kaikalur requested changes Dec 21, 2021

kaikalur reviewed Dec 21, 2021

maswin force-pushed the switch_case_optimizer branch from afc1fcb to 1cf2ef0 Compare December 23, 2021 11:46

maswin force-pushed the switch_case_optimizer branch 2 times, most recently from ca85bbb to af5f5f4 Compare December 23, 2021 13:39

kaikalur requested changes Dec 24, 2021

maswin force-pushed the switch_case_optimizer branch from af5f5f4 to ce2de0e Compare January 4, 2022 08:45

maswin force-pushed the switch_case_optimizer branch 3 times, most recently from 0dcf5ab to 549ab27 Compare January 13, 2022 05:46

maswin force-pushed the switch_case_optimizer branch from 549ab27 to 5a24efe Compare January 25, 2022 01:02

maswin force-pushed the switch_case_optimizer branch from 5a24efe to f085fc9 Compare April 22, 2022 07:10

kaikalur requested a review from rschlussel April 22, 2022 13:34

rschlussel requested changes Apr 28, 2022

maswin force-pushed the switch_case_optimizer branch from f085fc9 to d26b4c1 Compare May 2, 2022 07:19

maswin requested a review from a team as a code owner May 2, 2022 07:19

maswin force-pushed the switch_case_optimizer branch 2 times, most recently from 41eb3bc to ee1d93c Compare May 2, 2022 20:32

rschlussel reviewed May 6, 2022

maswin added 2 commits May 6, 2022 13:54

Allow RowExpressionRewriteRuleSet to be enabled or disabled

083baaf

Optimize filter condition with CASE predicate

8c11aa4

maswin force-pushed the switch_case_optimizer branch from ee1d93c to 8c11aa4 Compare May 6, 2022 21:31

rschlussel approved these changes May 9, 2022

kaikalur approved these changes May 12, 2022

rschlussel merged commit 67c8e6d into prestodb:master May 12, 2022

maswin mentioned this pull request May 16, 2022

Optimize filter condition with case expression predicate trinodb/trino#10580

Closed

highker mentioned this pull request Jul 6, 2022

Add release notes for 0.274 #17987

Closed

7 tasks

This was referenced Aug 25, 2022

Optimize filter condition with CASE predicate - handle null cases #18228

Closed

Optimize filter condition with CASE predicate - handle null cases #18231

Open
