Optimize filter condition with switch case #17065

Merged
rschlussel merged 2 commits into prestodb:master from maswin:switch_case_optimizer
May 12, 2022

@maswin
Contributor

@maswin maswin commented Dec 3, 2021

Test plan - Unit tests + verifier run on internal queries

== RELEASE NOTES ==

General Changes
* Added a new Optimizer SimplifySwitchExpression which simplifies switch cases in filter condition

DB visualization tools such as Looker and DbVisualizer generate sub-optimal queries when a selected column is a CASE expression and a filter is applied on top of it. For instance, they generate queries such as:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end from students where (case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end) = 'good';

But this query can be simplified to:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' end from students where behavior_code = 1;

Without this simplification, the query misses out on many other optimizations (e.g., ORC indexes) and becomes extremely slow.

Since these queries are generated from the UI, the user doesn't have much control over them.
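
As an illustration of why the two filters above select the same rows, here is a minimal Python sketch. It models the CASE expression as a plain function and is not Presto code; the sample `codes` list and the name `behavior_label` are assumptions made up for the example, and `None` stands in for SQL NULL.

```python
def behavior_label(code):
    """Model of: CASE behavior_code WHEN 1 THEN 'good' WHEN 2 THEN 'bad' ELSE 'neutral' END."""
    if code == 1:
        return "good"
    if code == 2:
        return "bad"
    return "neutral"


# Hypothetical sample of behavior_code values; None stands in for SQL NULL.
codes = [1, 2, 3, None, 1]

# Original filter: (CASE ... END) = 'good'
original = [c for c in codes if behavior_label(c) == "good"]

# Simplified filter: behavior_code = 1
simplified = [c for c in codes if c == 1]
```

Both lists contain the same rows, but only the simplified form exposes a direct comparison on `behavior_code` that a storage layer could use for pruning.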

@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 3, 2021

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: maswin / name: Alagappan Maruthappan (f085fc9eef45da04d6952ae35467a17f5859edfa)

@maswin maswin force-pushed the switch_case_optimizer branch from c3e19d0 to 47ea97a Compare December 3, 2021 12:12
@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 3, 2021

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ Alagappan Maruthappan (47ea97a23a8eed19039358a6ca8fa784584308d9)

@maswin maswin force-pushed the switch_case_optimizer branch 3 times, most recently from 4ee1ee7 to 6d88ef5 Compare December 8, 2021 21:26
@maswin maswin force-pushed the switch_case_optimizer branch 3 times, most recently from aabdf49 to bfaf8c1 Compare December 14, 2021 00:21
@kaikalur
Contributor

kaikalur commented Dec 14, 2021

An interesting way to do this would be to flip WHEN/THEN (by appropriately adding equals to the THEN) and call RowExpressionInterpreter. For example, take the query above and create a new expression:

case when 'good' = 'good' then behavior_code = 1 when 'good' = 'bad' then behavior_code = 2 when 'good' = 'neutral' then not(behavior_code = 1 OR behavior_code = 2) end

and call RowExpressionInterpreter on it. That should be more robust.

@kaikalur
Contributor

@rongrong Check it out

@kaikalur
Contributor

In fact, this can be generalized to any relational expression where one side is case - just push relation to the THEN parts, flip them and evaluate it. ELSE part is simply NOT of OR of all the new THEN parts.

@maswin
Contributor Author

maswin commented Dec 17, 2021

What about the case where two WHEN clauses match the THEN part?

(CASE 
  WHEN column_val = 1 THEN 'a'   -- Matching
  WHEN column_val = 2 THEN 'a'   -- Matching
  WHEN column_val = 3 THEN 'b'
ELSE 'c' END) = 'a'

Flipping this gives:

CASE 
  WHEN 'a'='a' THEN column_val = 1
  WHEN 'a'='a' THEN column_val = 2
  WHEN 'a'='b' THEN column_val = 3
  WHEN 'a'='c' THEN NOT (column_val = 1 OR column_val = 2 OR column_val = 3)
ELSE false END

Since a CASE expression uses the first matching condition and stops there, RowExpressionInterpreter will simplify this to:
column_val = 1
But the correct simplification is:
column_val = 1 OR column_val = 2
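
The mismatch can be reproduced with a small Python model. This is only a sketch of first-match CASE semantics, not the Presto interpreter; the function names `original_predicate` and `flipped_predicate` are hypothetical.

```python
def original_predicate(column_val):
    """(CASE WHEN v=1 THEN 'a' WHEN v=2 THEN 'a' WHEN v=3 THEN 'b' ELSE 'c' END) = 'a'."""
    if column_val == 1:
        result = "a"
    elif column_val == 2:
        result = "a"
    elif column_val == 3:
        result = "b"
    else:
        result = "c"
    return result == "a"


def flipped_predicate(column_val):
    """The flipped CASE: the first WHEN that is true decides the result."""
    branches = [
        ("a" == "a", column_val == 1),
        ("a" == "a", column_val == 2),
        ("a" == "b", column_val == 3),
        ("a" == "c", column_val not in (1, 2, 3)),
    ]
    for when, then in branches:
        if when:  # first match wins, later branches are never reached
            return then
    return False


# Values of column_val on which the two predicates disagree.
disagreements = [v for v in range(5) if original_predicate(v) != flipped_predicate(v)]
```

Under this model the flipped form always stops at the first `'a'='a'` branch, so it loses the row with `column_val = 2`.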

@maswin
Contributor Author

maswin commented Dec 17, 2021

It needs to be converted to a series of OR clauses:

(CASE 
  WHEN column_val = 1 THEN 'a'   -- Matching
  WHEN column_val = 2 THEN 'a'   -- Matching
  WHEN column_val = 3 THEN 'b'
ELSE 'c' END) = 'a'

The above expression gets transformed to:

('a'='a' AND column_val = 1) OR 
('a'='a' AND column_val = 2) OR 
('b'='a' AND column_val = 3) OR 
('c'='a' AND NOT (column_val = 1 OR column_val = 2 OR column_val = 3))

RowExpressionInterpreter can transform this into a simpler expression:

column_val = 1 OR column_val = 2

This handles a lot of other edge cases too, e.g., when both one of the THEN clauses and the ELSE clause match:

(CASE 
  WHEN column_val = 1 THEN 'a'   -- Matching
  WHEN column_val = 2 THEN 'b' 
ELSE 'a'      -- Matching
END) = 'a' 

gets simplified to

column_val != 2
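
A rough Python sketch of this OR rewrite, checked against first-match CASE evaluation on both examples. It assumes the WHEN conditions are mutually exclusive and that the column is never NULL; `case_equals` and `or_rewrite` are hypothetical helper names, not part of the PR.

```python
def case_equals(branches, else_value, target, row):
    """Evaluate (CASE WHEN ... THEN ... ELSE ... END) = target with first-match semantics."""
    for condition, value in branches:
        if condition(row):
            return value == target
    return else_value == target


def or_rewrite(branches, else_value, target, row):
    """Evaluate the OR form: one (THEN = target AND WHEN) term per branch, plus the ELSE term."""
    terms = [value == target and condition(row) for condition, value in branches]
    no_branch_matched = not any(condition(row) for condition, _ in branches)
    terms.append(else_value == target and no_branch_matched)
    return any(terms)


# First example: two WHEN clauses return 'a'.
first = ([(lambda v: v == 1, "a"), (lambda v: v == 2, "a"), (lambda v: v == 3, "b")], "c")
# Second example: one WHEN clause and the ELSE clause return 'a'.
second = ([(lambda v: v == 1, "a"), (lambda v: v == 2, "b")], "a")

rows = range(5)
first_matches = [v for v in rows if or_rewrite(*first, "a", v)]
second_matches = [v for v in rows if or_rewrite(*second, "a", v)]
```

On these inputs the first example keeps exactly the rows with `column_val` 1 or 2, and the second keeps every row except `column_val = 2`.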

@kaikalur
Contributor


Hmm, yeah, I'm usually not a fan of OR expressions because they are harder to push down. If we are going to stick to doing it only for cases where all THEN parts are constants/literals, we can just build a map THEN -> WHEN[] and then generate an OR for each.
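
One way to picture the THEN -> WHEN[] map is the following Python sketch. It works on condition strings rather than Presto row expressions, ignores the ELSE branch, and uses the hypothetical helper names `group_whens_by_then` and `predicate_for`.

```python
from collections import defaultdict


def group_whens_by_then(branches):
    """Build a map from each constant THEN value to the WHEN conditions that produce it."""
    grouped = defaultdict(list)
    for when, then in branches:
        grouped[then].append(when)
    return dict(grouped)


def predicate_for(grouped, target):
    """Return the OR of all WHEN conditions whose THEN equals the target, or 'false' if none."""
    whens = grouped.get(target, [])
    return " OR ".join(whens) if whens else "false"


# (WHEN condition, THEN constant) pairs from the example in this thread.
branches = [
    ("column_val = 1", "a"),
    ("column_val = 2", "a"),
    ("column_val = 3", "b"),
]
grouped = group_whens_by_then(branches)
```

With this map, a filter `= 'a'` becomes `predicate_for(grouped, "a")`, and a value that no THEN produces collapses to `false`. A real implementation would also need to handle the ELSE branch.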

@maswin
Contributor Author

maswin commented Dec 17, 2021

Yeah, it is difficult to push down, but at least OR seems easier to deal with than CASE expressions for further optimizations. I was assuming we would do this optimization as long as no non-deterministic function is involved in the expression, since the conversion from CASE to OR doesn't change the final evaluation result of the predicate, and scenarios like concat('ab', 'cd') = 'abcd' will also be handled.

@kaikalur
Contributor

Actually, CASE is sequential and OR is not, but I guess it won't matter here if we are checking for constants only. The issue with OR is the whole NULL handling. So if we can do CASE, we should.

Also, like I said, see if you can generalize this to all relational operators.
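
One possible reading of the NULL concern, sketched in Python with SQL three-valued logic (`None` stands in for NULL). This is an interpretation, not something spelled out in the thread: it reuses the earlier example where a WHEN branch and the ELSE branch both return 'a', and the helper names are hypothetical.

```python
def sql_eq(a, b):
    """SQL '=': NULL if either side is NULL."""
    return None if a is None or b is None else a == b


def sql_not(x):
    return None if x is None else not x


def sql_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def case_predicate(column_val):
    """(CASE WHEN v = 1 THEN 'a' WHEN v = 2 THEN 'b' ELSE 'a' END) = 'a'."""
    if sql_eq(column_val, 1) is True:
        result = "a"
    elif sql_eq(column_val, 2) is True:
        result = "b"
    else:  # a NULL condition does not match, so NULL input falls through to ELSE
        result = "a"
    return sql_eq(result, "a")


def or_predicate(column_val):
    """('a'='a' AND v=1) OR ('b'='a' AND v=2) OR ('a'='a' AND NOT (v=1 OR v=2))."""
    first = sql_eq(column_val, 1)
    second = sql_eq(column_val, 2)
    else_term = sql_not(sql_or(first, second))
    return sql_or(
        sql_and(sql_eq("a", "a"), first),
        sql_or(sql_and(sql_eq("b", "a"), second), sql_and(sql_eq("a", "a"), else_term)),
    )
```

For a NULL input, `case_predicate` is true because the row falls through to ELSE, while `or_predicate` is NULL, so a filter using the OR form would drop that row.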

@maswin maswin force-pushed the switch_case_optimizer branch 2 times, most recently from 477c4d8 to afc1fcb Compare December 21, 2021 04:40
Contributor

@kaikalur kaikalur left a comment

Also please add some explicit end-to-end tests in hive connector to see the impact of this optimization.

Contributor

@kaikalur kaikalur Dec 21, 2021

I think it's best to check that otherExpression is actually a constant so it is clearly beneficial without too many issues with non-determinism etc

Contributor Author

We have scenarios such as

(CASE
    WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1'
    WHEN (replace(column_name, 'prefix_string')) = 'dept2'  THEN 'V2'
    ELSE (replace(column_name, 'prefix_string'))
    END
) = UPPER('v1');

In this case checking if it is just Constant Expression would be problematic.

Contributor Author

One problem we see with CASE expressions is that the column domain is not extracted while picking the table layout. Converting the expression to AND/OR clauses helps achieve that, even if there is a column reference on the other side of the equals:

(CASE
    WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1'
    WHEN (replace(column_name, 'prefix_string')) = 'dept2'  THEN 'V2'
    ELSE 'V3'
    END
) = column_name_2;

The above expression gets converted into a cumbersome AND/OR expression, since the column's equality with each THEN value cannot be evaluated and the expression won't be simplified further. But it still helps RowExpressionInterpreter/optimizers extract the column domain for the table from that expression, i.e., the column_name_2 domain is set to [V1, V2, V3].

Contributor

These things can lead to subtle bugs. And also if the other side is a function call pushing it into the CASE can make it evaluate many times. We should not need to do random stuff like upper(..). But maybe extend it to just constant or a variable/field reference/input expression.

Contributor

Do you really need this class?

Contributor Author

Simplified without the class.

Contributor

Why do you need to check for this?

Contributor Author

SearchedCase and SimpleCase expressions both have the same format in RowExpression. If it's a SearchedCase expression, RowExpression has ConstantExpression(true) in place of the case operand. So I am performing the check to avoid generating true = <when condition>.
Similar to this:

boolean searchedCase = (value instanceof ConstantExpression && ((ConstantExpression) value).getType() == BOOLEAN &&

@kaikalur
Contributor

In fact, we want to check either all of the THEN and ELSE are constant or the otherExpression is constant

Contributor

And we normally start by disabling the feature by default

Contributor Author

Done.

@maswin maswin force-pushed the switch_case_optimizer branch from afc1fcb to 1cf2ef0 Compare December 23, 2021 11:46
@maswin
Contributor Author

maswin commented Dec 23, 2021

> Also please add some explicit end-to-end tests in hive connector to see the impact of this optimization.

Can you please point out where these tests exist? Internally, we saw queries that ran for over 10 minutes finish in less than 10 seconds.

@maswin maswin force-pushed the switch_case_optimizer branch 2 times, most recently from ca85bbb to af5f5f4 Compare December 23, 2021 13:39
Contributor

@kaikalur kaikalur left a comment

We also need more end-to-end tests, not just unit tests. Look at AbstractTestQueries; adding tests there will exercise multiple connectors, which will be good.

Contributor

equals is a confusing name! Make it createEquals

Contributor Author

Done

Contributor

Add a helper method for this conditional so the logic is clear.

Contributor Author

Refactored the part.

@kaikalur
Contributor

Also fix the test failures

@maswin maswin force-pushed the switch_case_optimizer branch from af5f5f4 to ce2de0e Compare January 4, 2022 08:45
@maswin
Contributor Author

maswin commented Jan 4, 2022

> We also need more end to end tests, not just unit test. Look at AbstractTestQueries - adding there will test with multiple connectors which will be good.

Let me know if more cases need to be handled in AbstractTestQueries.

@maswin maswin force-pushed the switch_case_optimizer branch 3 times, most recently from 0dcf5ab to 549ab27 Compare January 13, 2022 05:46
@maswin maswin force-pushed the switch_case_optimizer branch from 549ab27 to 5a24efe Compare January 25, 2022 01:02
@kaikalur
Contributor

@maswin where are we at with this PR? This could be useful for one of our use cases, so I just wanted to check if there is anything else missing. Thanks!

@maswin maswin force-pushed the switch_case_optimizer branch from 5a24efe to f085fc9 Compare April 22, 2022 07:10
@maswin
Contributor Author

maswin commented Apr 22, 2022

> @maswin where are we at with this PR? This could be useful for one of our usecases so just wanted to check if there is anything else missing. Thanks!

We have handled the case where a CAST is applied on top of the CASE expression, since most of the time the expressions are cast to a common supertype to match the LHS and RHS types. All the other review comments have been addressed and the PR is updated.

@kaikalur kaikalur requested a review from rschlussel April 22, 2022 13:34
@kaikalur
Contributor

Thank you! Added @rschlussel for further review and hopefully merging.

Contributor

@rschlussel rschlussel left a comment

I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet

Contributor

It would be good to add a note about why this limitation exists. (Am I correct that it's because case expressions are ordered, so you need the expressions to be disjoint (only one can be true) to replace it with an OR?)

Contributor Author

I have rewritten the description with the reasoning for it.

Contributor

why are there restrictions for this and the rhs expressions beyond just requiring a deterministic function for the value? Is it just less useful in those cases? also why doesn't the rhs restriction also allow column references?

Contributor Author

Will we be able to evaluate if the conditions are disjunct at query planning time if there are column references at RHS?

And most of the time it is the data visualization tool that generates such query, which usually is a column at LHS and constant at RHS. We have handled the most common case.

Contributor

should there also be some restriction on the number of cases?

Contributor Author

Internally we saw many case statements with ~30-40 WHEN conditions and most of the time some WHEN return value gets matched to the final equals value and most of the OR clause gets removed in final expression, especially the last else statement. So I am not sure if this is a needed restriction.

Contributor

If the LHS are the same, why do the RHS need to be unique? Wouldn't it just be a duplicate?

Contributor Author

If we do the rewrite without this check (i.e., when the RHS is not unique),

(case 
when col1=1 then 'a'
when col1=1 then 'b'
else 'c' end) = 'b'

will get rewritten to

(col1=1 AND 'a'='b') OR
(col1=1 AND 'b'='b') OR
('c'='b' AND col1<>1)

The 'a'='b' and 'c'='b' conditions evaluate to false, so the expression gets simplified to
col1=1
But this is a wrong simplification. If col1=1, the CASE statement takes the first WHEN clause and returns 'a', which is not equal to 'b'. Our simplified condition lets rows with col1=1 pass, but they should not.
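
The counterexample can be checked with a short Python model. This is a sketch of the semantics only; `case_predicate` and `naive_or_rewrite` are hypothetical names, and the column is assumed to be non-NULL.

```python
def case_predicate(col1):
    """(CASE WHEN col1 = 1 THEN 'a' WHEN col1 = 1 THEN 'b' ELSE 'c' END) = 'b'."""
    if col1 == 1:
        result = "a"
    elif col1 == 1:  # duplicate condition: never reached because the first WHEN wins
        result = "b"
    else:
        result = "c"
    return result == "b"


def naive_or_rewrite(col1):
    """OR rewrite applied without checking that the WHEN conditions are distinct."""
    return (
        (col1 == 1 and "a" == "b")
        or (col1 == 1 and "b" == "b")
        or ("c" == "b" and col1 != 1)
    )


# Rows that the naive rewrite lets through but the original CASE filter rejects.
wrongly_passed = [v for v in range(4) if naive_or_rewrite(v) and not case_predicate(v)]
```

The row with `col1 = 1` passes the naive rewrite but not the original filter, which is why the rewrite is skipped when WHEN conditions repeat.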

Contributor

use an immutableListBuilder

Contributor Author

Done.

Contributor

always use immutableCollections

Contributor Author

Done.

Contributor

you don't need to compare against caseExpressionRewriteDisabled. assertQuery compares the results against another DB. it ignores the session completely for creating the expected results

Contributor Author

Didn't realize that. Fixed.

Contributor

this shouldn't be necessary

Contributor Author

Done.

Contributor

what is going on with the second half of this test here?

Contributor Author

The second half of the test checks the Optimized version of the expression. Most of the times the result=value condition gets evaluated to false and a simplified final expression would be produced. That simplification is not done by the rewriter and it is against unit testing principle to test it here, but it helped in understanding and developing the Rewriter better so retained them in the final commit.

Contributor

I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.

Contributor

love the thorough test cases!

@maswin maswin force-pushed the switch_case_optimizer branch from f085fc9 to d26b4c1 Compare May 2, 2022 07:19
@maswin maswin requested a review from a team as a code owner May 2, 2022 07:19
@maswin
Contributor Author

maswin commented May 2, 2022

> I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet

We wanted the setting to be disabled initially, couldn't find disabling in ExpressionRewriteRuleSet/RowExpressionRewriteRuleSet. But yeah, that might help it extending to all types of Nodes.

@maswin maswin force-pushed the switch_case_optimizer branch 2 times, most recently from 41eb3bc to ee1d93c Compare May 2, 2022 20:32
@maswin
Contributor Author

maswin commented May 2, 2022

> I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet

> We wanted the setting to be disabled initially, couldn't find disabling in ExpressionRewriteRuleSet/RowExpressionRewriteRuleSet. But yeah, that might help it extending to all types of Nodes.

I have now made an additional commit that adds the ability to disable a RowExpressionRewriteRuleSet if required, and modified RewriteCaseExpressionPredicate to extend RowExpressionRewriteRuleSet. This enables it to optimize filter expressions in JOIN conditions too.

Contributor

add a checkArgument that expression is a cast expression and its argument is a case expression

Contributor Author

Done.

Contributor

I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.

@maswin maswin force-pushed the switch_case_optimizer branch from ee1d93c to 8c11aa4 Compare May 6, 2022 21:31
Contributor

@rschlussel rschlussel left a comment

Thank you!

@rschlussel
Contributor

@kaikalur can you review/approve? Merging is blocked since you are still marked as requesting changes.

@kaikalur
Contributor

Are we ready to merge this?

@rschlussel
Contributor

yup. I'll merge now!
