Optimize filter condition with switch case#17065
Conversation
c3e19d0 to
47ea97a
Compare
4ee1ee7 to
6d88ef5
Compare
aabdf49 to
bfaf8c1
Compare
|
An interesting way to do this would be to flip WHEN/THEN (by appropriately adding equals to the THEN) and calling RowExpresionInterpreter. Like take the example and create a new expression: and call the RowExpressionInterpter on it. That should be more robust. |
|
@rongrong Check it out |
|
In fact, this can be generalized to any relational expression where one side is case - just push relation to the THEN parts, flip them and evaluate it. ELSE part is simply NOT of OR of all the new THEN parts. |
|
What about the case when 2 WHEN clauses match the THEN part flipping this will make it: Since in switch case when first condition is matched we use it and break of, RowExpressionInterpter will simplify this to: |
|
It needs to be converted to a series of OR clauses: above expression gets transformed to: RowExpressionInterpter can transform this to a simpler expression
This handles lot of other extreme cases too, get simplified to |
Hmm, yeah I'm usually not a fan of OR expressions because they are harder to pushdown. If we are going to stick to doing it only for cases when all THEN parts are constant/literal, we can just build a map THEN -> WHEN[] and the generate an OR for each. |
Yeah, it is difficult to pushdown, but atleast OR seems better to deal with than CASE expressions for further Optimizations. Was assuming if not a non deterministic function is involved in the expression to do this optimization. Since conversion from CASE to OR doesn't change the final evaluation result of the predicate and scenarios like concat('ab'. 'cd') = 'abcd' will also be handled |
Actually CASE is sequential and OR is not so but I guess it won't matter here if we are checking for constants only. Issue with OR is the whole NULL handling. So if we can do CASE, we should. Also like I said see if you can generalize for all relational operators. |
477c4d8 to
afc1fcb
Compare
kaikalur
left a comment
There was a problem hiding this comment.
Also please add some explicit end-to-end tests in hive connector to see the impact of this optimization.
...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I think it's best to check that otherExpression is actually a constant so it is clearly beneficial without too many issues with non-determinism etc
There was a problem hiding this comment.
We have scenarios such as
(CASE
WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1'
WHEN (replace(column_name, 'prefix_string')) = 'dept2' THEN 'V2
ELSE (replace(column_name, 'prefix_string'))
END
) = UPPER('v1');
In this case checking if it is just Constant Expression would be problematic.
There was a problem hiding this comment.
One problem we see with CASE expression is that it doesn't extract out the Column Domain while picking table layout. Switching it to AND/OR cases helps achieve it. Even if there is a column reference present in the equals side
(CASE
WHEN (replace(column_name, 'prefix_string')) = 'dept1' THEN 'V1'
WHEN (replace(column_name, 'prefix_string')) = 'dept2' THEN 'V2
ELSE 'V3'
END
) = column_name_2;
Above expression gets converted into a cumbersome AND/OR expression since column value's equality with THEN value cannot be evaluated and the expression won't be further simplified, but still it helps RowExpressionInterpretter/Optimizers in extracting column domain for the table from that expression. i.e, column_name_2 domain is set as [ V1, V2, V3].
There was a problem hiding this comment.
These things can lead to subtle bugs. And also if the other side is a function call pushing it into the CASE can make it evaluate many times. We should not need to do random stuff like upper(..). But maybe extend it to just constant or a variable/field reference/input expression.
There was a problem hiding this comment.
Do you really need this class?
There was a problem hiding this comment.
Simplified without the class.
There was a problem hiding this comment.
Why do you need to check for this?
There was a problem hiding this comment.
Both SearchedCase and SimpleCase expression both have the same format in RowExpresion, if its a SimpleCase Expression, in RowExpresion in place of the case operand we would have ConstantExpression(true). So I am performing the check to avoid doing true = when condition.
Similar to this:
|
In fact, we want to check either all of the THEN and ELSE are constant or the otherExpression is constant |
There was a problem hiding this comment.
And we normally start by disabling the feature by default
afc1fcb to
1cf2ef0
Compare
Can you please point out where these tests exists? Internally we were able to see queries that ran for over 10 minutes to run in less than 10 seconds. |
ca85bbb to
af5f5f4
Compare
kaikalur
left a comment
There was a problem hiding this comment.
We also need more end to end tests, not just unit test. Look at AbstractTestQueries - adding there will test with mutliple connectors which will be good.
There was a problem hiding this comment.
equals is a confusing name! Make it createEquals
presto-expressions/src/main/java/com/facebook/presto/expressions/LogicalRowExpressions.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Add a helper method for this conditional so the logic is clear.
|
Also fix the test failures |
af5f5f4 to
ce2de0e
Compare
Let me know if more cases needs to be handled in AbstractTestQueries |
0dcf5ab to
549ab27
Compare
549ab27 to
5a24efe
Compare
|
@maswin where are we at with this PR? This could be useful for one of our usecases so just wanted to check if there is anything else missing. Thanks! |
5a24efe to
f085fc9
Compare
We have handled the case where there is a Cast done on top of the expression since most of the time to match the lhs and rhs expression type expressions are cast with a super data type. All other mentioned review comments are fixed and updated. |
|
Thank you! Added @rschlussel for further review and hopefully merging. |
rschlussel
left a comment
There was a problem hiding this comment.
I added a bunch of comments to RewriteCaseExpressionPredicate, but I actually wonder if it would be better to do this with the other expression rewrites at the beginning of optimization using an ExpressionRewriteRuleSet
There was a problem hiding this comment.
would be good to add a note about why this limitation exists (am I correct that it's because case expressions are ordered, so you need the expressions to be disjunct (can only have one be true) to replace it with an or?
There was a problem hiding this comment.
I have rewritten the description with the reasoning for it.
There was a problem hiding this comment.
why are there restrictions for this and the rhs expressions beyond just requiring a deterministic function for the value? Is it just less useful in those cases? also why doesn't the rhs restriction also allow column references?
There was a problem hiding this comment.
Will we be able to evaluate if the conditions are disjunct at query planning time if there are column references at RHS?
And most of the time it is the data visualization tool that generates such query, which usually is a column at LHS and constant at RHS. We have handled the most common case.
There was a problem hiding this comment.
should there also be some restriction on the number of cases?
There was a problem hiding this comment.
Internally we saw many case statements with ~30-40 WHEN conditions and most of the time some WHEN return value gets matched to the final equals value and most of the OR clause gets removed in final expression, especially the last else statement. So I am not sure if this is a needed restriction.
There was a problem hiding this comment.
If the LHS are the same, why do the RHS need to be unique? Wouldn't it just be a duplicate?
There was a problem hiding this comment.
If we do the rewrite without this check (the RHS is not unique)
(case
when col1=1 return 'a'
when col1=1 return 'b'
else 'c') = 'b'
will get simplified to
(col1=1 AND 'a'='b') OR
(col1=1 AND 'b'='b') OR
('c' = 'b' AND col1<>1 AND col2<>2)
'a'='b' condition would evaluate to false, and will get simplified to
col1=1
But this is a wrong simplification. If col1=1, it would have gone to first WHEN clause in CASE statement and returned 'a' which wouldn't be equal to 'b'. Our simplified condition allows rows with col1=1 to pass but in actual case it should not pass.
There was a problem hiding this comment.
use an immutableListBuilder
There was a problem hiding this comment.
always use immutableCollections
There was a problem hiding this comment.
you don't need to compare against caseExpressionRewriteDisabled. assertQuery compares the results against another DB. it ignores the session completely for creating the expected results
There was a problem hiding this comment.
Didn't realize that. Fixed.
There was a problem hiding this comment.
this shouldn't be necessary
There was a problem hiding this comment.
what is going on with the second half of this test here?
There was a problem hiding this comment.
The second half of the test checks the Optimized version of the expression. Most of the times the result=value condition gets evaluated to false and a simplified final expression would be produced. That simplification is not done by the rewriter and it is against unit testing principle to test it here, but it helped in understanding and developing the Rewriter better so retained them in the final commit.
There was a problem hiding this comment.
I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.
There was a problem hiding this comment.
love the thorough test cases!
f085fc9 to
d26b4c1
Compare
We wanted the setting to be disabled initially, couldn't find disabling in ExpressionRewriteRuleSet/RowExpressionRewriteRuleSet. But yeah, that might help it extending to all types of Nodes. |
41eb3bc to
ee1d93c
Compare
I have now made an additional commit that adds the ability for RowExpressionRewriterRuleSet to be disabled if required and modified the RewriteCaseExpressionPredicate to extend RowExpressionRewriteRuleSet. This enables it to optimize filter expression in JOIN condition too. |
There was a problem hiding this comment.
add a checkArgument that expression is a cast expression and its argument is a case expression
There was a problem hiding this comment.
I think it would be better to remove this second half of the test. I think it makes the test more complex/confusing and doesn't add much value to the code coverage.
ee1d93c to
8c11aa4
Compare
|
@kaikalur can you review/approve? Merging is blocked since you are still marked as requesting changes. |
|
Are we ready to merge this? |
|
yup. I'll merge now! |
Test plan - Unit tests + verifier run on internal queries
The DB visualization tools such as looker & DbVisualizer generates sub-optimal queries when a selected column has switch case and a filter is applied on top of it. For instance it generates queries such as:
SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where (case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral') = 'good';But this query can be simplified into:
SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where behavior_code = 1;Without this simplification, the query misses out a lot of other optimizations (i.e, ORC indexes) and becomes extremely slow.
Since these are generated queries based on the UI, the user doesn't have much control on them.