Optimize filter condition with CASE predicate - handle null cases by maswin · Pull Request #18231 · prestodb/presto

maswin · 2022-08-25T19:19:59Z

This is an extension to the following PR: #17065

The following conditions were not handled previously:

NULL case:

(CASE
WHEN col = 1 THEN 'a'
WHEN col = 2 THEN 'b'
ELSE 'c' END) = 'c'

When col is NULL, the expression should evaluate to TRUE: NULL matches none of the WHEN clauses, so evaluation reaches the ELSE branch, which returns 'c', and 'c' = 'c'. The previous simplification, however, results in:

('c'='a' AND col = 1) OR
('c'='b' AND col = 2) OR
('c'='c' AND (col != 1 AND col !=2))

which is simplified to

('c'='c' AND (col != 1 AND col !=2)) -> col != 1 AND col !=2

but for a NULL col this does not evaluate to TRUE. Because of the way SQL handles NULL, both comparisons yield NULL, so the whole predicate is NULL.
So we have added NULL checks to the rewrite, and the new rewrite has the form:

('c'='a' AND col IS NOT NULL AND col = 1) OR
('c'='b' AND col IS NOT NULL AND col = 2) OR
('c'='c' AND ((col IS NULL) OR (col != 1 AND col !=2)))

which is simplified to
('c'='c' AND ((col IS NULL) OR (col != 1 AND col !=2))) -> (col IS NULL) OR (col != 1 AND col !=2)
This correctly evaluates to TRUE when col is NULL.
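
As a quick sanity check of the NULL behaviour described above, here is a minimal sketch that runs the original CASE predicate, the old simplification, and the new null-aware simplification side by side. It uses Python's built-in sqlite3 module rather than Presto, on the assumption that SQLite's three-valued logic for =, !=, AND and OR matches standard SQL here; the table name t and its contents are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table; the last row is the NULL case from the description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (None,)])

QUERY = """
SELECT col,
       (CASE WHEN col = 1 THEN 'a' WHEN col = 2 THEN 'b' ELSE 'c' END) = 'c' AS original,
       (col != 1 AND col != 2) AS old_rewrite,
       ((col IS NULL) OR (col != 1 AND col != 2)) AS new_rewrite
FROM t
ORDER BY rowid
"""

# SQLite reports TRUE/FALSE/NULL as 1/0/None.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

For the NULL row, the old rewrite comes back as NULL (None) while the original predicate and the new rewrite both come back as TRUE (1); for the non-NULL rows all three columns agree.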

The case where the compared value matches the result of one of the WHEN clauses is also handled. Tests for these cases are written in AbstractTestQueries. For example:

(CASE
WHEN col = 1 THEN 'a'
WHEN col = 2 THEN 'b'
ELSE 'c' END) = 'b'

A NULL value in col returns the ELSE result 'c', which is != 'b', so the expression evaluates to FALSE. The expression above is rewritten as

('b'='a' AND col IS NOT NULL AND col = 1) OR
('b'='b' AND col IS NOT NULL AND col = 2) OR
('b'='c' AND ((col IS NULL) OR (col != 1 AND col !=2)))

simplified as

'b'='b' AND col IS NOT NULL AND col = 2 -> col IS NOT NULL AND col = 2

which evaluates to FALSE when col is NULL.
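
The general shape of the null-aware rewrite can be sketched as a small generator that builds the OR-of-ANDs predicate and compares it against the original CASE expression. This is only an illustration of the form shown above, not the code in RewriteCaseExpressionPredicate: the function name rewrite_case_equals, the helper sql_literal, and the table t are hypothetical, the check runs on SQLite via Python's sqlite3 module rather than on Presto, and it assumes the WHEN values are unique.

```python
import sqlite3


def sql_literal(value):
    """Render a Python int or str as a SQL literal (hypothetical helper)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def rewrite_case_equals(column, when_pairs, else_result, target):
    """Build the null-aware predicate for
    (CASE WHEN column = v1 THEN r1 ... ELSE else_result END) = target."""
    t = sql_literal(target)
    branches = [
        f"(({t} = {sql_literal(result)}) AND ({column} IS NOT NULL)"
        f" AND ({column} = {sql_literal(value)}))"
        for value, result in when_pairs
    ]
    none_match = " AND ".join(
        f"({column} != {sql_literal(value)})" for value, _ in when_pairs
    )
    # ELSE branch: reached when the column is NULL or matches no WHEN value.
    branches.append(
        f"(({t} = {sql_literal(else_result)})"
        f" AND (({column} IS NULL) OR ({none_match})))"
    )
    return " OR ".join(branches)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (None,)])

CASE_SQL = "(CASE WHEN col = 1 THEN 'a' WHEN col = 2 THEN 'b' ELSE 'c' END)"
mismatches = []
for target in ["a", "b", "c", "z"]:
    predicate = rewrite_case_equals("col", [(1, "a"), (2, "b")], "c", target)
    sql = f"SELECT col, {CASE_SQL} = ?, ({predicate}) FROM t"
    for col, original, rewritten in conn.execute(sql, (target,)):
        if original != rewritten:
            mismatches.append((target, col, original, rewritten))

print(mismatches)
```

An empty mismatches list means the rewritten predicate agreed with the CASE predicate for every target and every row, including the NULL row.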

Constant expression evaluation:
If the expression is of the form
(case when col1=CAST(1 as SMALLINT) then 'case1' when col1=CAST(1 as TINYINT) then 'case2' else 'default' end) = 'case1'

the rewrite should not happen, since the RHS values in the WHEN operands are not unique (both casts evaluate to 1). Previously the rewrite still happened, because the check only tested whether the ConstantExpression objects were the same. It now evaluates the expressions first and then performs the check.
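
The difference between comparing operands as written and comparing them after evaluation can be shown with a toy model. Everything here is hypothetical (the Cast class and the function names are invented for illustration and do not correspond to Presto's RowExpression classes); it only shows why CAST(1 as SMALLINT) and CAST(1 as TINYINT) look distinct syntactically but collide once evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cast:
    """Toy stand-in for an unevaluated CAST(value AS to_type) expression."""
    value: int
    to_type: str


def evaluate(expr):
    """Evaluate a toy expression to the plain value it denotes."""
    return expr.value if isinstance(expr, Cast) else expr


def unique_as_written(operands):
    # Compares the expressions themselves, so differing cast types look distinct.
    return len(set(operands)) == len(operands)


def unique_after_evaluation(operands):
    # Compares the evaluated values, so both casts collapse to the same 1.
    values = [evaluate(e) for e in operands]
    return len(set(values)) == len(values)


operands = [Cast(1, "SMALLINT"), Cast(1, "TINYINT")]
print(unique_as_written(operands))        # True: the rewrite would wrongly proceed
print(unique_after_evaluation(operands))  # False: the rewrite is skipped
```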

Also added documentation for the setting.

kaikalur · 2022-08-25T19:21:07Z

Like I said - just coalesce instead of IS NULL OR etc. COALESCE is better.

r0hini · 2022-08-25T20:16:27Z

User facing documentation is too technical with internal details. Need to rework that.

maswin · 2022-08-25T20:36:06Z

Like I said - just coalesce instead of IS NULL OR etc. COALESCE is better.

Do you mean rewriting it in this form:

('c'='a' AND COALESCE(col = 1, FALSE)) OR
('c'='b' AND COALESCE(col = 2, FALSE)) OR
('c'='c' AND COALESCE((col != 1 AND col !=2), TRUE))

The main reason we want this rewrite is so that Presto can construct column domains, which makes execution faster. Currently, column domains are constructed only for simple AND/OR predicates. RowExpressionDomainTranslator won't be able to construct a column domain if any function (such as COALESCE) is used.

When I use COALESCE as mentioned above, the ScanFilter looks like this:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{}]'}, filterPredicate = COALESCE((c) = (INTEGER'2'), BOOLEAN'false')] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

No column domains in the table scan layout.

Rewriting with IS NULL instead, we get:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{domains={c=[ [["2"]] ]}}]'}, filterPredicate = (c) = (INTEGER'2')] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{domains={c=[ [["2"]] ]}}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

With IS NULL, we get the column domain {domains={c=[ [["2"]] ]}}.

The example above is for a result that matches one of the WHEN clauses. Even when the result matches the ELSE clause, we still get a domain:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}]'}, filterPredicate = (not(IN(c, INTEGER'1', INTEGER'2'))) OR (IS_NULL(c))] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

Column domain: {domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}
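
To illustrate why a function call blocks domain construction while plain AND/OR/IS NULL predicates do not, here is a deliberately tiny model of a domain extractor. It is a sketch under strong assumptions and not how RowExpressionDomainTranslator is implemented: predicates are nested tuples over a single column, the value space is a made-up four-element universe, and the extractor simply gives up (returns None) on any node it does not recognise.

```python
# Hypothetical tiny value space for one column; None stands for SQL NULL.
UNIVERSE = frozenset({1, 2, 3, None})


def extract_domain(expr):
    """Return the set of column values that can satisfy expr,
    or None when no domain can be derived (toy model)."""
    kind = expr[0]
    if kind == "eq":
        return frozenset({expr[1]})
    if kind == "ne":
        # A comparison is never TRUE for NULL, so NULL is excluded.
        return frozenset(v for v in UNIVERSE if v is not None and v != expr[1])
    if kind == "is_null":
        return frozenset({None})
    if kind in ("and", "or"):
        left, right = extract_domain(expr[1]), extract_domain(expr[2])
        if left is None or right is None:
            return None  # conservative: give up if either side is opaque
        return left & right if kind == "and" else left | right
    return None  # any function call, e.g. COALESCE, is opaque to this model


coalesce_pred = ("call", "coalesce", ("eq", 2), False)
eq_pred = ("eq", 2)
else_pred = ("or", ("is_null",), ("and", ("ne", 1), ("ne", 2)))

print(extract_domain(coalesce_pred))  # None: no domain reaches the table layout
print(extract_domain(eq_pred))
print(extract_domain(else_pred))
```

In this model the COALESCE form yields no domain, the col = 2 form yields the single value 2, and the ELSE form yields NULL plus every value other than 1 and 2, which mirrors the three plans above.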

kaikalur · 2022-08-25T20:47:48Z

Wow! that's not good. OK sounds good.

maswin · 2022-08-26T21:05:06Z

User facing documentation is too technical with internal details. Need to rework that.

Updated.

r0hini · 2022-08-26T21:56:27Z

presto-docs/src/main/sphinx/admin/properties.rst

Can you link https://cloud.google.com/looker/docs/reference/param-field-case here? Also, can you make it lowercase (case instead of CASE)? LookML is all lowercase, so they only refer to it as case.

steveburnett

This PR is old.

Local doc build failed with

"python3 -m sphinx -b html -n -d target/doctrees -j auto src/main/sphinx target/html
Running Sphinx v4.4.0

Sphinx version error:
The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
make: *** [html] Error 2"

git merge master - the usual solution to this build problem related to PR #21708 - failed:

"Auto-merging presto-docs/src/main/sphinx/admin/properties.rst
Auto-merging presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
Auto-merging presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
CONFLICT (content): Merge conflict in presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Auto-merging presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java
Automatic merge failed; fix conflicts and then commit the result."

Please update the PR, then re-request my review of the docs.

maswin requested a review from a team as a code owner August 25, 2022 19:20

maswin requested a review from presto-oss August 25, 2022 19:20

maswin mentioned this pull request Aug 25, 2022

Optimize filter condition with CASE predicate - handle null cases #18228

Closed

maswin force-pushed the switch_case_optimizer branch from 20aa571 to 419c620 Compare August 26, 2022 21:04

r0hini reviewed Aug 26, 2022


maswin force-pushed the switch_case_optimizer branch from 419c620 to 413f85a Compare August 27, 2022 07:05

Optimize filter condition with CASE predicate - handle null cases

1569ea5

maswin force-pushed the switch_case_optimizer branch from 413f85a to 1569ea5 Compare August 27, 2022 07:07

wanglinsong requested review from feilong-liu, jaystarshot and steveburnett as code owners July 6, 2024 04:32

steveburnett requested changes Jul 9, 2024


maswin wants to merge 1 commit into prestodb:master from maswin:switch_case_optimizer


4 participants
