Optimize filter condition with CASE predicate - handle null cases#18231

Open
maswin wants to merge 1 commit into prestodb:master from maswin:switch_case_optimizer

@maswin maswin commented Aug 25, 2022

This is an extension to the following PR: #17065

The following conditions were not handled previously:

  1. NULL case:
(CASE
WHEN col = 1 THEN 'a'
WHEN col = 2 THEN 'b'
ELSE 'c' END) = 'c'

When col is NULL, the expression should evaluate to TRUE: NULL matches none of the WHEN clauses, so evaluation reaches the ELSE branch, which returns 'c', and 'c' = 'c'. But the previous simplification results in:

('c'='a' AND col = 1) OR
('c'='b' AND col = 2) OR
('c'='c' AND (col != 1 AND col !=2))

which will be simplified to

('c'='c' AND (col != 1 AND col !=2)) -> col != 1 AND col !=2

but this does not evaluate to TRUE when col is NULL. Because of the way SQL handles NULL, both comparisons yield NULL, so the whole predicate is NULL.
So we have added NULL checks to the rewrite, and the new rewrite has the form:

('c'='a' AND col IS NOT NULL AND col = 1) OR
('c'='b' AND col IS NOT NULL AND col = 2) OR
('c'='c' AND ((col IS NULL) OR (col != 1 AND col !=2)))

which will be simplified to

('c'='c' AND ((col IS NULL) OR (col != 1 AND col !=2))) -> (col IS NULL) OR (col != 1 AND col !=2)

This correctly evaluates to TRUE when col is NULL.
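The NULL behaviour described above can be checked outside Presto. The following is a minimal sketch that runs the three predicates through Python's built-in sqlite3 module, on the assumption that SQLite's three-valued logic matches Presto's for these simple comparisons. The table name `t` and the sample values are made up for illustration; SQLite reports TRUE/FALSE/NULL as 1/0/None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")  # hypothetical table for illustration
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (None,)])

PREDICATES = {
    # Original filter from the PR description.
    "case": "(CASE WHEN col = 1 THEN 'a' WHEN col = 2 THEN 'b' ELSE 'c' END) = 'c'",
    # Simplified form produced by the previous rewrite.
    "old_rewrite": "col != 1 AND col != 2",
    # Simplified form produced by the NULL-aware rewrite.
    "new_rewrite": "(col IS NULL) OR (col != 1 AND col != 2)",
}


def evaluate(predicate):
    """Return {col value: predicate result}; None stands for SQL NULL."""
    rows = conn.execute(f"SELECT col, {predicate} FROM t").fetchall()
    return {col: result for col, result in rows}


results = {name: evaluate(sql) for name, sql in PREDICATES.items()}
for name, by_value in results.items():
    print(name, by_value)
```

In this sketch only the row with a NULL col separates the old rewrite from the CASE predicate; for non-NULL values all three forms agree.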

The case where the compared value matches the result of one of the WHEN clauses is also handled. Tests for these cases are in AbstractTestQueries. For example:

(CASE
WHEN col = 1 THEN 'a'
WHEN col = 2 THEN 'b'
ELSE 'c' END) = 'b'

A NULL value in col returns the ELSE result 'c', which is != 'b', so the expression evaluates to FALSE. The expression above will be rewritten as

('b'='a' AND col IS NOT NULL AND col = 1) OR
('b'='b' AND col IS NOT NULL AND col = 2) OR
('b'='c' AND ((col IS NULL) OR (col != 1 AND col !=2)))

simplified as

'b'='b' AND col IS NOT NULL AND col = 2 -> col IS NOT NULL AND col = 2

which will evaluate to FALSE when col is NULL.
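To see that the NULL-aware form agrees with the CASE predicate for every comparison target, not just 'b' and 'c', here is a rough sketch that builds both predicates as strings and compares them as filters in SQLite. The helper names (`case_predicate`, `rewritten_predicate`, `passes_filter`) are hypothetical and are not taken from RewriteCaseExpressionPredicate; the sketch also assumes SQLite's NULL handling matches Presto's for these comparisons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

WHEN_CLAUSES = [(1, "a"), (2, "b")]  # (operand value, THEN result)
ELSE_RESULT = "c"


def case_predicate(target):
    """Build the original '(CASE ... END) = target' filter."""
    whens = " ".join(f"WHEN col = {v} THEN '{r}'" for v, r in WHEN_CLAUSES)
    return f"(CASE {whens} ELSE '{ELSE_RESULT}' END) = '{target}'"


def rewritten_predicate(target):
    """Build the NULL-aware OR-of-ANDs form described in the PR."""
    branches = [
        f"('{target}' = '{r}' AND col IS NOT NULL AND col = {v})"
        for v, r in WHEN_CLAUSES
    ]
    none_match = " AND ".join(f"col != {v}" for v, _ in WHEN_CLAUSES)
    branches.append(
        f"('{target}' = '{ELSE_RESULT}' AND ((col IS NULL) OR ({none_match})))"
    )
    return " OR ".join(branches)


def passes_filter(predicate, value):
    """A WHERE clause keeps a row only when the predicate is TRUE."""
    sql = f"SELECT COUNT(*) FROM (SELECT ? AS col) WHERE {predicate}"
    return conn.execute(sql, (value,)).fetchone()[0] == 1


for target in ["a", "b", "c", "d"]:
    for value in [1, 2, 3, None]:
        original = passes_filter(case_predicate(target), value)
        rewritten = passes_filter(rewritten_predicate(target), value)
        print(target, value, original, rewritten)
```

The target 'd' is included to cover a comparison value that matches no branch at all, where both forms should reject every row.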

  2. Constant expression evaluation:
    If the expression is of the form
    (case when col1=CAST(1 as SMALLINT) then 'case1' when col1=CAST(1 as TINYINT) then 'case2' else 'default' end) = 'case1'

the rewrite should not happen, because the right-hand-side values in the WHEN operands are not unique. Currently it does happen, because we were only checking whether the ConstantExpression objects were the same. The check now evaluates each expression first and then compares the results.
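A small sketch of why duplicate operands matter, again using SQLite as a stand-in for Presto. Note one deliberate change from the example above: the comparison target here is 'case2' rather than 'case1', because in this sketch that is the target where a branch-wise rewrite diverges from the CASE expression (the first matching WHEN wins, so the 'case2' branch is unreachable). `NAIVE_REWRITE` and `has_duplicate_operands` are hypothetical names, and the uniqueness check only illustrates the "evaluate, then compare" idea; it is not the PR's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

OPERANDS = ["CAST(1 AS SMALLINT)", "CAST(1 AS TINYINT)"]

CASE_PREDICATE = """
(CASE WHEN col1 = CAST(1 AS SMALLINT) THEN 'case1'
      WHEN col1 = CAST(1 AS TINYINT)  THEN 'case2'
      ELSE 'default' END) = 'case2'
"""

# What a branch-wise rewrite would reduce to if the duplicate went unnoticed.
NAIVE_REWRITE = "col1 IS NOT NULL AND col1 = CAST(1 AS TINYINT)"


def evaluate(predicate, value):
    """Evaluate a predicate for a single col1 value (1/0/None in SQLite)."""
    sql = f"SELECT {predicate} FROM (SELECT ? AS col1)"
    return conn.execute(sql, (value,)).fetchone()[0]


def has_duplicate_operands(operand_sqls):
    """Evaluate each constant operand first, then compare the values."""
    values = [conn.execute(f"SELECT {sql}").fetchone()[0] for sql in operand_sqls]
    return len(set(values)) != len(values)


print("CASE predicate for col1 = 1:", evaluate(CASE_PREDICATE, 1))
print("naive rewrite for col1 = 1:", evaluate(NAIVE_REWRITE, 1))
print("distinct as text:", len(set(OPERANDS)) == len(OPERANDS))
print("duplicates after evaluation:", has_duplicate_operands(OPERANDS))
```

Comparing the operands as text treats them as distinct, while comparing their evaluated values exposes the duplicate, which is the signal to skip the rewrite.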

Also added documentation for the setting.

@kaikalur
Contributor

Like I said - just coalesce instead of IS NULL OR etc. COALESCE is better.

@r0hini

r0hini commented Aug 25, 2022

User facing documentation is too technical with internal details. Need to rework that.

@maswin
Contributor Author

maswin commented Aug 25, 2022

Like I said - just coalesce instead of IS NULL OR etc. COALESCE is better.

Do you mean to rewrite in this form:

('c'='a' AND COALESCE(col = 1, FALSE)) OR
('c'='b' AND COALESCE(col = 2, FALSE)) OR
('c'='c' AND COALESCE((col != 1 AND col !=2), TRUE))

The main reason we want this rewrite is so that Presto can construct column domains, which makes execution faster. Currently, column domains are constructed only for simple AND/OR predicates. RowExpressionDomainTranslator won't be able to construct a column domain if any function (such as COALESCE) is used.

When I use COALESCE as mentioned above, the ScanFilter looks like this:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{}]'}, filterPredicate = COALESCE((c) = (INTEGER'2'), BOOLEAN'false')] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

No column domains in the table scan layout.

By rewriting with IS NULL we get:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{domains={c=[ [["2"]] ]}}]'}, filterPredicate = (c) = (INTEGER'2')] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{domains={c=[ [["2"]] ]}}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

With IS NULL, the column domain is {domains={c=[ [["2"]] ]}}

The example above is for when the result matches one of the WHEN clauses. Even if it matches the ELSE clause, we would get a domain like:

- ScanFilter[table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=default, tableName=sample_orc, analyzePartitionValues=Optional.empty}', layout='Optional[default.sample_orc{domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}]'}, filterPredicate = (not(IN(c, INTEGER'1', INTEGER'2'))) OR (IS_NULL(c))] => [a:integer, b:varchar, c:integer, d:varchar]
                Estimates: {rows: 16 (1.11kB), cpu: 1136.00, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: 2272.00, memory: 0.00, network: 0.00}
                LAYOUT: default.sample_orc{domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}
                c := c:int:2:REGULAR (1:23)
                d := d:string:-13:PARTITION_KEY (1:23)
                    :: [["a"], ["b"], ["c"]]
                a := a:int:0:REGULAR (1:23)

Column domain: {domains={c=[ NULL, [(<min>, "1"), ("1", "2"), ("2", <max>)] ]}}
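As a sanity check on the two forms compared in this comment, the sketch below confirms that the COALESCE form and the IS NULL form keep the same rows when used as filters. It runs in SQLite via Python's sqlite3 module, with a made-up `sample` table and with 0/1 standing in for FALSE/TRUE. It says nothing about Presto's domain translation, which is the actual reason given above for preferring IS NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (c INTEGER)")  # hypothetical stand-in table
conn.executemany("INSERT INTO sample VALUES (?)", [(1,), (2,), (3,), (None,)])

# (COALESCE form, IS NULL form) for a WHEN branch and for the ELSE branch.
FORMS = {
    "when_branch": ("COALESCE(c = 2, 0)", "c IS NOT NULL AND c = 2"),
    "else_branch": (
        "COALESCE(c != 1 AND c != 2, 1)",
        "(c IS NULL) OR (c != 1 AND c != 2)",
    ),
}


def kept_rows(predicate):
    """Return the values of c that survive the filter (NULL sorts first)."""
    rows = conn.execute(
        f"SELECT c FROM sample WHERE {predicate} ORDER BY c"
    ).fetchall()
    return [c for (c,) in rows]


for name, (coalesce_form, is_null_form) in FORMS.items():
    print(name, kept_rows(coalesce_form), kept_rows(is_null_form))
```

Since both forms select the same rows here, the choice between them in the PR comes down to which one the planner can turn into column domains.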

@kaikalur
Contributor

Wow! that's not good. OK sounds good.

@maswin maswin force-pushed the switch_case_optimizer branch from 20aa571 to 419c620 on August 26, 2022 21:04
@maswin
Contributor Author

maswin commented Aug 26, 2022

User facing documentation is too technical with internal details. Need to rework that.

Updated.

@r0hini r0hini Aug 26, 2022

Can you link https://cloud.google.com/looker/docs/reference/param-field-case here? Also, can you make it lowercase (case instead of CASE)? LookML is all lowercase, so they only refer to it as case.

Contributor Author

Done.

@maswin maswin force-pushed the switch_case_optimizer branch from 419c620 to 413f85a on August 27, 2022 07:05
Contributor

@steveburnett steveburnett left a comment

This PR is old.

Local doc build failed with

"python3 -m sphinx -b html -n -d target/doctrees -j auto src/main/sphinx target/html
Running Sphinx v4.4.0

Sphinx version error:
The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
make: *** [html] Error 2"

git merge master - the usual solution to this build problem related to PR #21708 - failed:

"Auto-merging presto-docs/src/main/sphinx/admin/properties.rst
Auto-merging presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
Auto-merging presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
CONFLICT (content): Merge conflict in presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RewriteCaseExpressionPredicate.java
Auto-merging presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java
Automatic merge failed; fix conflicts and then commit the result."

Please update the PR, then re-request my review of the docs.
