Skip to content

Optimize filter condition with case expression predicate#10580

Closed
maswin wants to merge 2 commits intotrinodb:masterfrom
maswin:switch_case_optimizer
Closed

Optimize filter condition with case expression predicate#10580
maswin wants to merge 2 commits intotrinodb:masterfrom
maswin:switch_case_optimizer

Conversation

@maswin
Copy link
Copy Markdown
Member

@maswin maswin commented Jan 13, 2022

The DB visualization tools such as looker & DbVisualizer generates sub-optimal queries when a selected column has switch case and a filter is applied on top of it. For instance it generates queries such as:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where (case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral') = 'good';

But this query can be simplified into:

SELECT case behavior_code when 1 then 'good' when 2 then 'bad' else 'neutral' from students where behavior_code = 1;

Without this simplification, the query misses out a lot of other optimizations since the Predicate is not converted to ColumnDomain and not pushed into readers (i.e, ORC indexes) and becomes extremely slow.

Since these are generated queries, the user doesn't have much control on them to convert it manually.

Internally inside our company we saw lot of such queries running for a long time. So we have come up with this Optimizer to rewrite such case predicates on filter condition. We saw a lot of performance gain from them.

The case expression of format

(case 
    when condition1 then result1
    when condition2 then result2
    when condition3 then result3
else elseResult) = value

can be converted to a series of AND/OR cases as below:

(result1 = value AND condition1)
OR
(result2 = value AND condition2 AND NOT(condition1))
OR
(result3 = value AND condition3 AND NOT(condition1) AND NOT(condition2))
OR
(elseResult = value AND NOT(condition1) AND NOT(condition2) AND NOT(condition3))

but this involves a lot of re-evaluation. If we can ensure all the conditions are mutually exclusive then condition2 AND NOT(condition1) can be converted to just condition2 since former being true will mean the later NOT condition is also true since their domains are mutually exclusive. With this the above expression can be further simplified to

(result1 = value AND condition1)
OR
(result2 = value AND condition2)
OR
(result3 = value AND condition3)
OR
(elseResult = value AND NOT(condition1) AND NOT(condition2) AND NOT(condition3))

So under the following conditions, we convert case expression predicates to above form:

  1. Value is a simple expression(constant/column reference/input reference) and not any function call since it could be re-evaluated.
  2. The LHS expression in all WHEN clauses are the same. For example, if one WHEN clause has a condition on col1 and another on col2, it will not work
  3. The relational operator in the WHEN clause condition is equals. With other operators it is harder to check for exclusivity.
  4. All the RHS expressions are a constant and unique

SimplifyExpressions can discard all conditions in which result != value and DomainTranslator can construct a Domain for the column.

@martint martint requested review from kasiafi and martint January 13, 2022 01:59
@martint
Copy link
Copy Markdown
Member

martint commented Jan 13, 2022

@kasiafi, can you take a look at this PR?

@maswin maswin force-pushed the switch_case_optimizer branch from c80f64a to 36d1dcc Compare January 13, 2022 03:48
@cla-bot
Copy link
Copy Markdown

cla-bot bot commented Jan 13, 2022

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla.

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jan 14, 2022

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jan 14, 2022
@cla-bot
Copy link
Copy Markdown

cla-bot bot commented Jan 14, 2022

The cla-bot has been summoned, and re-checked this pull request!

Copy link
Copy Markdown
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this optimization. Here are some initial comments on the logic and code style. I haven't gone through the tests yet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. We cannot replace any operator with "is null", when one operand is null.
It should rather be:

Expression default = defaultValue
        .map(value -> getCastExpression(castExpression, value))
        .orElse(<cast null to the expected type>);

Expression elseResult = new ComparisonExpression(operator, default, otherExpression);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it to return NullLiteral in case of missing else clause since figuring out the expected type from Optimizer is difficult and the Simplify rules applied later adds a CAST if necessary. Now the results are similar to how trino handles other NULL expressions involved comparison function (i.e, NULL > 'string')

I have disabled the unit test with respect to this change - testRewriterOnCaseExpressionWithoutElseClause.
The reason is, the test was failing because of the difference in Cast object created.

When the query is parsed by PlanBuilder.expression, typeOnly is set to false.

public Cast(NodeLocation location, Expression expression, DataType type, boolean safe)
{
    this(Optional.of(location), expression, type, safe, false);
}

But when the expression is optimized using ExpressionInterpreter, typeOnly is set to true.

// Code block from LiteralEncoder.toExpression
if (object == null) {
    if (type.equals(UNKNOWN)) {
        return new NullLiteral();
    }
    return new Cast(new NullLiteral(), toSqlType(type), false, true);
}

Not sure if this is an existing bug.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is an existing bug.

Setting typeOnly to false, and not revisiting it later for the expected expression, should be considered a simplification of the testing process rather than a bug. I agree, however, that it might get in the way. This issue was already encountered and addressed here:

@Override
public Expression rewriteCast(Cast node, Void context, ExpressionTreeRewriter<Void> treeRewriter)
{
// the `expected` Cast expression comes out of the AstBuilder with the `typeOnly` flag set to false.
// always set the `typeOnly` flag to false so that it does not break the comparison.
return new Cast(node.getExpression(), node.getType(), node.isSafe(), false);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@maswin maswin force-pushed the switch_case_optimizer branch from 36d1dcc to 7eae8ae Compare January 25, 2022 00:22
@kasiafi
Copy link
Copy Markdown
Member

kasiafi commented Jan 25, 2022

Made it to return NullLiteral in case of missing else clause since figuring out the expected type from Optimizer is difficult and the Simplify rules applied later adds a CAST if necessary.

It is not correct to insert an expression with mismatching type (here: UNKNOWN). We expect that both sides of comparison are of the same type. To determine the type in case when there is no Cast, you can use the TypeAnalyzer.getType() method. The TypeAnalyzer is available for the Optimizer rules in PlanOptimizers.

Copy link
Copy Markdown
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that there are tests with coercions. E.g. different case results have different types and the compared value type is different. We should check both cases: when the whole Case expression is wrapped in cast, and when the compared value is wrapped in cast.

@maswin maswin force-pushed the switch_case_optimizer branch from 7eae8ae to 7a9b7dc Compare January 28, 2022 08:03
@maswin
Copy link
Copy Markdown
Member Author

maswin commented Jan 28, 2022

Please make sure that there are tests with coercions. E.g. different case results have different types and the compared value type is different. We should check both cases: when the whole Case expression is wrapped in cast, and when the compared value is wrapped in cast.

I have a test unit case "testRewriterOnCaseExpressionWithCastFunction" which tests the whole case expression being wrapped and compared value wrapped in cast cases.
Additionally I see that before the optimizer is reached the common super type of the expression is identified and all other values are wrapped with CAST to bring it all to uniform type.
Let me know if more test cases are required.

@maswin maswin force-pushed the switch_case_optimizer branch from 7a9b7dc to a091f95 Compare January 28, 2022 08:35
Copy link
Copy Markdown
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating and sorry for the delay.
Here's another batch from me. This time I went through the tests too.
First of all, we need to address the correctness issue with literal uniqueness.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc would be clearer if it mentioned explicitly that the WHEN-clauses are mutually exclusive. Please either follow my suggestion here #10580 (comment) or write something along those lines.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 88 to 89
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireNonNull

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 242 to 244
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not guarantee that the literals are semantically unique. See #10580 (comment). You might need to evaluate the literals to make sure.

Copy link
Copy Markdown
Member Author

@maswin maswin May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have also added an unit test for this case.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now after using isEffectivelyLiteral, you only accept constants as "value". IIRC, before it was constants AND symbols. So the javadoc says.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import and, or.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to get it tested somehow. Have you tried to follow my suggestion #10580 (comment)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move all the helper methods below the tests, and reformat to either have all arguments on the same line or all arguments on separate lines. Also, please do not abbreviate. I think that inputExpression, expectedRewritten, expectedOptimized are sufficiently clear names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is expectedQuerySession?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and simplified the test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, a unit test for the rule should be limited to what the rule does. So, the "getOptimized" part does not belong here. Moreover, this optimization does not directly reflect what happens in Trino's optimization process.

If you want to test the final form of the expression, you can do this in TestLogicalPlanner.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@maswin maswin force-pushed the switch_case_optimizer branch from a091f95 to 5936a79 Compare May 16, 2022 06:24
@maswin
Copy link
Copy Markdown
Member Author

maswin commented May 16, 2022

Sorry for the delay in addressing the comments. Let me know if there are any more changes required and if more test cases are needed.
I raised a PR for this optimizer in Presto and got it merged recently: prestodb/presto#17065

@maswin maswin force-pushed the switch_case_optimizer branch from 5936a79 to 884d490 Compare May 16, 2022 15:34
Copy link
Copy Markdown
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maswin after another glance I noticed that the proposed transformation is not correct. I'm sorry for not catching it earlier. This is because of the semantics of null values.
Consider this simple example:

trino:tiny> SELECT
         -> (
         ->  CASE
         ->     WHEN col = 1 THEN 'a'
         ->     WHEN col = 2 THEN 'b'
         ->  ELSE 'c'
         ->  END
         -> ) = 'c'
         -> FROM (VALUES null) t(col);
 _col0
-------
 true
(1 row)

After the transformation, the result is null. This is because of the "inverted conditions" added to the else branch which evaluate to null.
Please ignore the following comments until we find an alternative approach without this correctness issue.

Comment on lines 42 to 45
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This transformation is not correct in case when any of expression=constant2, expression=constant3, expression=constant1 returns null.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check such that constant cannot be NULL or NullLiteral (Added unit test for that case).
expression being null is handled in the new form of rewrite.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want it behind a toggle. This change seems to generally beneficial.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transformed expression at times maybe big or could not be simplified, maybe better to have a toggle.

@maswin maswin force-pushed the switch_case_optimizer branch from 884d490 to 7e786d2 Compare May 25, 2022 06:20
@maswin
Copy link
Copy Markdown
Member Author

maswin commented May 25, 2022

@maswin after another glance I noticed that the proposed transformation is not correct. I'm sorry for not catching it earlier. This is because of the semantics of null values. Consider this simple example:

trino:tiny> SELECT
         -> (
         ->  CASE
         ->     WHEN col = 1 THEN 'a'
         ->     WHEN col = 2 THEN 'b'
         ->  ELSE 'c'
         ->  END
         -> ) = 'c'
         -> FROM (VALUES null) t(col);
 _col0
-------
 true
(1 row)

After the transformation, the result is null. This is because of the "inverted conditions" added to the else branch which evaluate to null. Please ignore the following comments until we find an alternative approach without this correctness issue.

Nice catch. Missed handling NULL cases. We do this optimization only for when clauses of type expression=constant, so whenever there is null, it will end up in else block. So I think its enough we add a check that the expression is not NULL. This is how the new conversion would look (mostly once all result1=value evaluation happens it would be simplified to a simple expression)

(result1 = value AND expression IS NOT NULL AND expression=constant1) OR
(result2 = value AND expression IS NOT NULL AND expression=constant2) OR
(result3 = value AND expression IS NOT NULL AND expression=constant3) OR
(elseResult = value AND ((expression IS NULL) OR (!(expression=constant1) AND !(expression=constant2) AND !(expression=constant3)))

To avoid returning NULL in all when cases - check if expression IS NOT NULL, so that expression=constant will be valid and won't return NULL.
To avoid returning NULL in else case - check if expression IS NULL, so that we skip evaluating all NOT(expression=constant)

I have added additional tests in AbstractTestQueries for this case.

@maswin maswin force-pushed the switch_case_optimizer branch 2 times, most recently from e2c3376 to 32ccf8a Compare May 25, 2022 17:54
Copy link
Copy Markdown
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maswin

(result1 = value AND expression IS NOT NULL AND expression=constant1) OR
(result2 = value AND expression IS NOT NULL AND expression=constant2) OR
(result3 = value AND expression IS NOT NULL AND expression=constant3) OR
(elseResult = value AND ((expression IS NULL) OR (!(expression=constant1) AND !(expression=constant2) AND !(expression=constant3)))

Adding expression IS NOT NULL in the branches 1, 2 and 3 is redundant. They will fail anyway for expression being null because expression=constant returns null. The only thing we really need is the additional condition in the last branch.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

evaluateConstantExpression never returns an Expression, so you can remove || constantExpression instanceof NullLiteral.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove expression instanceof NullLiteral. Note that it doesn't catch expressions like cast(null as int), and anyway it should be caught below (if (constantExpression == null ...).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put each arg on a separate line (or all on the same line).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type should be passed to this method as an argument. The caller is responsible for correctness.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get() -> orElseThrow()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 43 to 58
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test, col1 and col2 should be the same.
Anyway, I didn't notice we had the condition you're testing here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not "invalid" case expressions. Please rename the method.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test all LHS expressions the same but not deterministic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the results (case1, case2, default) should be deterministic too. I think that you can just add a check in the rewriter that the whole rewritten expression is deterministic, so you don't have to check the component expressions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples don't test coercions in the rewrite. The expressions are compared regardless of the types you put here. To test where the casts really are, you'd have to pass explicit Expressions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maswin

(result1 = value AND expression IS NOT NULL AND expression=constant1) OR
(result2 = value AND expression IS NOT NULL AND expression=constant2) OR
(result3 = value AND expression IS NOT NULL AND expression=constant3) OR
(elseResult = value AND ((expression IS NULL) OR (!(expression=constant1) AND !(expression=constant2) AND !(expression=constant3)))

Adding expression IS NOT NULL in the branches 1, 2 and 3 is redundant. They will fail anyway for expression being null because expression=constant returns null. The only thing we really need is the additional condition in the last branch.

@kasiafi : The idea of this optimization is most of the 'result=value' expression will evaluate to false during the expression simplification and most of the time only one branch will be present in the end.
The problem with adding IS NULL check only in last branch is, the last branch might be discarded by ExpressionInterpreter during simplification and NULL check will also get discarded. So we need the NULL check in all branches.

For example, if we add NULL check only to last branch

(CASE
  WHEN col = 1 THEN 'a'
  WHEN col = 2 THEN 'b'
  ELSE 'c'
END) = 'a'

will be converted to

('a'='a' AND col = 1) OR
('b'='a' AND col = 2) OR
('c'='a' AND ((col IS NULL) OR ((col != 1) AND (col != 2)))

after simplification will become
col = 1

which will result in NULL=1 and return NULL, but we need 'FALSE' to be returned. By adding IS NOT NULL check, the simplification would be
col IS NOT NULL and col=1
which will make sure FALSE is returned instead of NULL.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add expected results to the tests. You might want to use the region or nation tables, as they're small.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified to use region and nation table.
I want the results to be same with and without the optimizer enabled. assertQuery seems to compare the results against another DB and ignores the session completely for creating the expected results.

@maswin maswin force-pushed the switch_case_optimizer branch from 32ccf8a to d72fbba Compare August 24, 2022 18:43
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Nov 3, 2022

👋 @maswin and @kasiafi .. I assume this PR is still active and you are working towards merge. If that is not the case, please close the PR.

@maswin maswin force-pushed the switch_case_optimizer branch from d72fbba to b1ff3b1 Compare January 10, 2023 00:02
@github-actions github-actions bot added the docs label Jan 10, 2023
@maswin
Copy link
Copy Markdown
Member Author

maswin commented Jan 10, 2023

@kasiafi : Hi, please let me know if more changes are required.

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Apr 25, 2023

@kasiafi and @maswin please let me know if it is a good time to review the docs yet or if this PR still ongoing for a while

@mosabua
Copy link
Copy Markdown
Member

mosabua commented Jan 15, 2024

Closing this one out due to inactivity, but please reopen if you would like to pick this back up.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 5, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Sep 5, 2024
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Sep 6, 2024

Closing this PR due to inactivity. Anyone interested please feel free to pick it back up and continue work on it here or in a new PR.

@mosabua mosabua closed this Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants