Optimize filter condition with case expression predicate#10580
Optimize filter condition with case expression predicate#10580maswin wants to merge 2 commits intotrinodb:masterfrom
Conversation
|
@kasiafi, can you take a look at this PR? |
c80f64a to
36d1dcc
Compare
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
|
@cla-bot check |
|
The cla-bot has been summoned, and re-checked this pull request! |
kasiafi
left a comment
There was a problem hiding this comment.
Thanks for working on this optimization. Here are some initial comments on the logic and code style. I haven't gone through the tests yet.
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This is not correct. We cannot replace any operator with "is null", when one operand is null.
It should rather be:
Expression default = defaultValue
.map(value -> getCastExpression(castExpression, value))
.orElse(<cast null to the expected type>);
Expression elseResult = new ComparisonExpression(operator, default, otherExpression);
There was a problem hiding this comment.
Made it to return NullLiteral in case of missing else clause since figuring out the expected type from Optimizer is difficult and the Simplify rules applied later adds a CAST if necessary. Now the results are similar to how trino handles other NULL expressions involved comparison function (i.e, NULL > 'string')
I have disabled the unit test with respect to this change - testRewriterOnCaseExpressionWithoutElseClause.
The reason is, the test was failing because of the difference in Cast object created.
When the query is parsed by PlanBuilder.expression, typeOnly is set to false.
public Cast(NodeLocation location, Expression expression, DataType type, boolean safe)
{
this(Optional.of(location), expression, type, safe, false);
}
But when the expression is optimized using ExpressionInterpreter, typeOnly is set to true.
// Code block from LiteralEncoder.toExpression
if (object == null) {
if (type.equals(UNKNOWN)) {
return new NullLiteral();
}
return new Cast(new NullLiteral(), toSqlType(type), false, true);
}
Not sure if this is an existing bug.
There was a problem hiding this comment.
Not sure if this is an existing bug.
Setting typeOnly to false, and not revisiting it later for the expected expression, should be considered a simplification of the testing process rather than a bug. I agree, however, that it might get in the way. This issue was already encountered and addressed here:
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
36d1dcc to
7eae8ae
Compare
It is not correct to insert an expression with mismatching type (here: |
kasiafi
left a comment
There was a problem hiding this comment.
Please make sure that there are tests with coercions. E.g. different case results have different types and the compared value type is different. We should check both cases: when the whole Case expression is wrapped in cast, and when the compared value is wrapped in cast.
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
7eae8ae to
7a9b7dc
Compare
I have a test unit case "testRewriterOnCaseExpressionWithCastFunction" which tests the whole case expression being wrapped and compared value wrapped in cast cases. |
7a9b7dc to
a091f95
Compare
kasiafi
left a comment
There was a problem hiding this comment.
Thanks for updating and sorry for the delay.
Here's another batch from me. This time I went through the tests too.
First of all, we need to address the correctness issue with literal uniqueness.
There was a problem hiding this comment.
This doc would be clearer if it mentioned explicitly that the WHEN-clauses are mutually exclusive. Please either follow my suggestion here #10580 (comment) or write something along those lines.
There was a problem hiding this comment.
It does not guarantee that the literals are semantically unique. See #10580 (comment). You might need to evaluate the literals to make sure.
There was a problem hiding this comment.
Done. I have also added an unit test for this case.
There was a problem hiding this comment.
Now after using isEffectivelyLiteral, you only accept constants as "value". IIRC, before it was constants AND symbols. So the javadoc says.
There was a problem hiding this comment.
We need to get it tested somehow. Have you tried to follow my suggestion #10580 (comment)?
There was a problem hiding this comment.
Please move all the helper methods below the tests, and reformat to either have all arguments on the same line or all arguments on separate lines. Also, please do not abbreviate. I think that inputExpression, expectedRewritten, expectedOptimized are sufficiently clear names.
There was a problem hiding this comment.
Removed and simplified the test.
There was a problem hiding this comment.
In fact, a unit test for the rule should be limited to what the rule does. So, the "getOptimized" part does not belong here. Moreover, this optimization does not directly reflect what happens in Trino's optimization process.
If you want to test the final form of the expression, you can do this in TestLogicalPlanner.
a091f95 to
5936a79
Compare
|
Sorry for the delay in addressing the comments. Let me know if there are any more changes required and if more test cases are needed. |
5936a79 to
884d490
Compare
kasiafi
left a comment
There was a problem hiding this comment.
@maswin after another glance I noticed that the proposed transformation is not correct. I'm sorry for not catching it earlier. This is because of the semantics of null values.
Consider this simple example:
trino:tiny> SELECT
-> (
-> CASE
-> WHEN col = 1 THEN 'a'
-> WHEN col = 2 THEN 'b'
-> ELSE 'c'
-> END
-> ) = 'c'
-> FROM (VALUES null) t(col);
_col0
-------
true
(1 row)
After the transformation, the result is null. This is because of the "inverted conditions" added to the else branch which evaluate to null.
Please ignore the following comments until we find an alternative approach without this correctness issue.
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
...o-main/src/main/java/io/trino/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This transformation is not correct in case when any of expression=constant2, expression=constant3, expression=constant1 returns null.
There was a problem hiding this comment.
Added a check such that constant cannot be NULL or NullLiteral (Added unit test for that case).
expression being null is handled in the new form of rewrite.
core/trino-main/src/main/java/io/trino/sql/planner/PlanOptimizers.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I'm not sure we want it behind a toggle. This change seems to generally beneficial.
There was a problem hiding this comment.
Transformed expression at times maybe big or could not be simplified, maybe better to have a toggle.
core/trino-main/src/main/java/io/trino/sql/planner/CaseExpressionPredicateRewriter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/planner/CaseExpressionPredicateRewriter.java
Outdated
Show resolved
Hide resolved
884d490 to
7e786d2
Compare
Nice catch. Missed handling NULL cases. We do this optimization only for when clauses of type expression=constant, so whenever there is null, it will end up in else block. So I think its enough we add a check that the expression is not NULL. This is how the new conversion would look (mostly once all result1=value evaluation happens it would be simplified to a simple expression) To avoid returning NULL in all when cases - check if expression IS NOT NULL, so that expression=constant will be valid and won't return NULL. I have added additional tests in AbstractTestQueries for this case. |
e2c3376 to
32ccf8a
Compare
There was a problem hiding this comment.
(result1 = value AND expression IS NOT NULL AND expression=constant1) OR
(result2 = value AND expression IS NOT NULL AND expression=constant2) OR
(result3 = value AND expression IS NOT NULL AND expression=constant3) OR
(elseResult = value AND ((expression IS NULL) OR (!(expression=constant1) AND !(expression=constant2) AND !(expression=constant3)))
Adding expression IS NOT NULL in the branches 1, 2 and 3 is redundant. They will fail anyway for expression being null because expression=constant returns null. The only thing we really need is the additional condition in the last branch.
There was a problem hiding this comment.
evaluateConstantExpression never returns an Expression, so you can remove || constantExpression instanceof NullLiteral.
There was a problem hiding this comment.
I'd remove expression instanceof NullLiteral. Note that it doesn't catch expressions like cast(null as int), and anyway it should be caught below (if (constantExpression == null ...).
There was a problem hiding this comment.
put each arg on a separate line (or all on the same line).
There was a problem hiding this comment.
The type should be passed to this method as an argument. The caller is responsible for correctness.
There was a problem hiding this comment.
In this test, col1 and col2 should be the same.
Anyway, I didn't notice we had the condition you're testing here.
There was a problem hiding this comment.
These are not "invalid" case expressions. Please rename the method.
There was a problem hiding this comment.
Test all LHS expressions the same but not deterministic.
There was a problem hiding this comment.
BTW, the results (case1, case2, default) should be deterministic too. I think that you can just add a check in the rewriter that the whole rewritten expression is deterministic, so you don't have to check the component expressions.
There was a problem hiding this comment.
These examples don't test coercions in the rewrite. The expressions are compared regardless of the types you put here. To test where the casts really are, you'd have to pass explicit Expressions.
There was a problem hiding this comment.
(result1 = value AND expression IS NOT NULL AND expression=constant1) OR (result2 = value AND expression IS NOT NULL AND expression=constant2) OR (result3 = value AND expression IS NOT NULL AND expression=constant3) OR (elseResult = value AND ((expression IS NULL) OR (!(expression=constant1) AND !(expression=constant2) AND !(expression=constant3)))Adding
expression IS NOT NULLin the branches 1, 2 and 3 is redundant. They will fail anyway forexpressionbeingnullbecauseexpression=constantreturns null. The only thing we really need is the additional condition in the last branch.
@kasiafi : The idea of this optimization is most of the 'result=value' expression will evaluate to false during the expression simplification and most of the time only one branch will be present in the end.
The problem with adding IS NULL check only in last branch is, the last branch might be discarded by ExpressionInterpreter during simplification and NULL check will also get discarded. So we need the NULL check in all branches.
For example, if we add NULL check only to last branch
(CASE
WHEN col = 1 THEN 'a'
WHEN col = 2 THEN 'b'
ELSE 'c'
END) = 'a'
will be converted to
('a'='a' AND col = 1) OR
('b'='a' AND col = 2) OR
('c'='a' AND ((col IS NULL) OR ((col != 1) AND (col != 2)))
after simplification will become
col = 1
which will result in NULL=1 and return NULL, but we need 'FALSE' to be returned. By adding IS NOT NULL check, the simplification would be
col IS NOT NULL and col=1
which will make sure FALSE is returned instead of NULL.
There was a problem hiding this comment.
Please add expected results to the tests. You might want to use the region or nation tables, as they're small.
There was a problem hiding this comment.
Modified to use region and nation table.
I want the results to be same with and without the optimizer enabled. assertQuery seems to compare the results against another DB and ignores the session completely for creating the expected results.
32ccf8a to
d72fbba
Compare
d72fbba to
b1ff3b1
Compare
|
@kasiafi : Hi, please let me know if more changes are required. |
|
Closing this one out due to inactivity, but please reopen if you would like to pick this back up. |
|
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
|
Closing this PR due to inactivity. Anyone interested please feel free to pick it back up and continue work on it here or in a new PR. |
The DB visualization tools such as looker & DbVisualizer generates sub-optimal queries when a selected column has switch case and a filter is applied on top of it. For instance it generates queries such as:
SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where (case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral') = 'good';But this query can be simplified into:
SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where behavior_code = 1;Without this simplification, the query misses out a lot of other optimizations since the Predicate is not converted to ColumnDomain and not pushed into readers (i.e, ORC indexes) and becomes extremely slow.
Since these are generated queries, the user doesn't have much control on them to convert it manually.
Internally inside our company we saw lot of such queries running for a long time. So we have come up with this Optimizer to rewrite such case predicates on filter condition. We saw a lot of performance gain from them.
The case expression of format
can be converted to a series of AND/OR cases as below:
but this involves a lot of re-evaluation. If we can ensure all the conditions are mutually exclusive then
condition2 AND NOT(condition1)can be converted to justcondition2since former being true will mean the later NOT condition is also true since their domains are mutually exclusive. With this the above expression can be further simplified toSo under the following conditions, we convert case expression predicates to above form:
SimplifyExpressions can discard all conditions in which
result != valueand DomainTranslator can construct a Domain for the column.