optimize APPROX_DISTINCT operations on constant values #25262
hdikeman wants to merge 1 commit into prestodb:master from
This pull request was exported from Phabricator. Differential Revision: D76161617
Summary:
`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT('abcd')`) are more expensive than, and functionally equivalent to, `ARBITRARY(1)`.
Adding an optimizer rule to replace any `APPROX_DISTINCT` operation on a constant value with an equivalent call to `ARBITRARY`.
Right now I inline constants into the aggregation itself, though I could do this with another optimizer rule.
Differential Revision: D76161617
7f3cedb to 22421bd
On thinking more, I think we can simply replace this with 1; we don't even need arbitrary. I was just concerned about messing with aggs, but I think we have good enough optimizations now to figure out the constant 1 and hoist it.
presto-main-base/src/main/java/com/facebook/presto/SystemSessionProperties.java
Actually, `approx_distinct(null)` is 0, but any other constant should be 1.
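The NULL caveat can be seen with a toy model. The sketch below is only an illustration: it uses an exact, NULL-skipping distinct count as a stand-in for `APPROX_DISTINCT` (the real function is approximate), and the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: an exact distinct count that skips NULLs, standing in for
// APPROX_DISTINCT. Names here are hypothetical, not Presto code.
final class DistinctOverConstant {
    static long distinctIgnoringNulls(List<String> column) {
        Set<String> seen = new HashSet<>();
        for (String value : column) {
            if (value != null) {   // NULL inputs do not contribute to the count
                seen.add(value);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // A non-null constant repeated on every row collapses to one distinct value.
        System.out.println(distinctIgnoringNulls(Arrays.asList("abcd", "abcd", "abcd"))); // prints 1
        // A NULL constant contributes nothing, so the count stays at zero.
        System.out.println(distinctIgnoringNulls(Arrays.asList((String) null, null, null))); // prints 0
    }
}
```

This is why a rewrite to a plain constant 1 would only be safe for non-null constants.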
22421bd to cccdda3
cccdda3 to 56f3bdb
Saved that user @hdikeman is from Meta
Also look at: that's the most general way to do this. And actually easier!
56f3bdb to 57bdee4
Gotcha. Wonder if we could defer that or open a separate task. Might be finicky to try to eliminate the aggregation entirely
The way I got around this, @kaikalur, was by just ignoring NULLs and only replacing true constants (I check [...]). I could have handled it, but I have another task for optimization of functions with NULL inputs where I could handle it more generally. By the way, thanks for the review! Sorry I'm just seeing this; I dumped this PR into draft mode and forgot about it while I worked out a couple of bugs.
This is non-trivial. Too many subtle semantics. So your two choices:
@kaikalur and I talked offline. For the case which motivated this change ([...]) [...]. So I will remove the handling for the [...].
Yeah - let's make sure to run this rule after constant pull up happens |
@hdikeman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
If this PR doesn't need a release note, please add [...]. Thanks!
…restodb#25262)
Summary:
Pull Request resolved: prestodb#25262
`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than, and functionally equivalent to, `ARBITRARY(IF(expr, 1, 0))`.
Adding an optimizer rule to replace any `APPROX_DISTINCT` operation on a constant conditional value with an equivalent call to `ARBITRARY`.
This comes up in some automated queries.
Differential Revision: D76161617
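The shape of the rewrite described in this summary can be sketched on a toy expression tree. Everything below is hypothetical: the `Expr`, `Literal`, `Column`, `If` and `Aggregate` classes are invented for illustration and are not Presto's planner API, and the choice to match only an `IF` with no else branch and a non-null literal is an assumption, not something the commit message states.

```java
import java.util.Optional;

// Hypothetical mini-AST for illustration; these are not Presto's planner classes.
final class ApproxDistinctRewriteSketch {
    interface Expr { String sql(); }

    static final class Literal implements Expr {
        final String text;                       // SQL text, e.g. "'abcd'" or "NULL"
        Literal(String text) { this.text = text; }
        boolean isNull() { return text.equalsIgnoreCase("NULL"); }
        public String sql() { return text; }
    }

    static final class Column implements Expr {
        final String name;
        Column(String name) { this.name = name; }
        public String sql() { return name; }
    }

    static final class If implements Expr {
        final Expr condition;
        final Expr thenValue;
        final Expr elseValue;                    // null means "no ELSE branch"
        If(Expr condition, Expr thenValue, Expr elseValue) {
            this.condition = condition;
            this.thenValue = thenValue;
            this.elseValue = elseValue;
        }
        public String sql() {
            String tail = elseValue == null ? "" : ", " + elseValue.sql();
            return "IF(" + condition.sql() + ", " + thenValue.sql() + tail + ")";
        }
    }

    static final class Aggregate implements Expr {
        final String function;
        final Expr argument;
        Aggregate(String function, Expr argument) {
            this.function = function;
            this.argument = argument;
        }
        public String sql() { return function + "(" + argument.sql() + ")"; }
    }

    // Rewrites APPROX_DISTINCT(IF(cond, <non-null literal>)) to ARBITRARY(IF(cond, 1, 0)).
    // Returns empty when the pattern does not match, leaving the aggregate untouched.
    static Optional<Aggregate> rewrite(Aggregate agg) {
        if (!agg.function.equalsIgnoreCase("APPROX_DISTINCT") || !(agg.argument instanceof If)) {
            return Optional.empty();
        }
        If branch = (If) agg.argument;
        if (branch.elseValue != null || !(branch.thenValue instanceof Literal)) {
            return Optional.empty();
        }
        if (((Literal) branch.thenValue).isNull()) {
            return Optional.empty();             // NULL constants are skipped (assumption)
        }
        If indicator = new If(branch.condition, new Literal("1"), new Literal("0"));
        return Optional.of(new Aggregate("ARBITRARY", indicator));
    }

    public static void main(String[] args) {
        Aggregate original = new Aggregate("APPROX_DISTINCT",
                new If(new Column("expr"), new Literal("'abcd'"), null));
        // prints: APPROX_DISTINCT(IF(expr, 'abcd')) -> ARBITRARY(IF(expr, 1, 0))
        System.out.println(original.sql() + " -> " + rewrite(original).get().sql());
    }
}
```

Returning an empty `Optional` for anything that does not match keeps the sketch conservative: aggregates over columns or NULL literals pass through unchanged.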
6593c5c to 2c3bab7
...in/java/com/facebook/presto/sql/planner/iterative/rule/ReplaceConditionalApproxDistinct.java
presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java
...ase/src/test/java/com/facebook/presto/sql/planner/assertions/AggregationFunctionMatcher.java
presto-spi/src/main/java/com/facebook/presto/spi/function/StandardFunctionResolution.java
...ava/com/facebook/presto/sql/planner/iterative/rule/TestReplaceConditionalApproxDistinct.java
We talked about this comment (#25262 (comment)) more offline:
Turns out this indeterminism is also an issue even for the "typical" case of [...], i.e. the function [...]. However, this has the added complication that [...]. Something else to consider: we could switch from ARBITRARY to MAX and handle this as [...].
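One reading of the indeterminism concern, offered here as an assumption rather than a restatement of the offline discussion, is that `ARBITRARY` may return any non-null per-row value, so `ARBITRARY(IF(cond, 1, 0))` can yield 0 for a group in which some rows do match. The toy model below (hypothetical class and method names, non-empty groups only) enumerates the possible results and compares them with a `MAX` over the same indicator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; each Boolean is the value of the IF condition on one row of a group.
final class ArbitraryVersusMax {
    // Reference answer: distinct count of IF(cond, 'abcd'), i.e. 1 if any row matches, else 0.
    static int distinctOfConditionalConstant(List<Boolean> conditions) {
        return conditions.contains(Boolean.TRUE) ? 1 : 0;
    }

    // Every value ARBITRARY(IF(cond, 1, 0)) could return: any row's indicator value.
    static Set<Integer> possibleArbitraryResults(List<Boolean> conditions) {
        Set<Integer> results = new TreeSet<>();
        for (boolean matched : conditions) {
            results.add(matched ? 1 : 0);
        }
        return results;
    }

    // MAX(IF(cond, 1, 0)) over a non-empty group (empty groups are out of scope here).
    static int maxOfIndicator(List<Boolean> conditions) {
        int max = 0;
        for (boolean matched : conditions) {
            max = Math.max(max, matched ? 1 : 0);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Boolean> mixed = Arrays.asList(false, true, false);
        System.out.println(distinctOfConditionalConstant(mixed)); // prints 1
        System.out.println(possibleArbitraryResults(mixed));      // prints [0, 1]
        System.out.println(maxOfIndicator(mixed));                // prints 1
    }
}
```

In this model the `MAX` form always agrees with the reference answer, while the `ARBITRARY` form agrees only when every row in the group has the same condition value.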
2c3bab7 to 4e8b933
feilong-liu left a comment
Thanks for the contribution.
Overall lgtm.
Can you follow the instructions here to run some tests?
4e8b933 to 30b2823
I don't know how to resolve Sreeni's change request. He's out on vacation. I was told the easiest way around this is to close and reopen the PR. New one is here: #25428
Description
`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT(IF(..., const))`) are more expensive than, and functionally equivalent to, `ARBITRARY(IF(..., 1, 0))`.
Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditionals with equivalent calls to `ARBITRARY`.
Motivation and Context
Some autogenerated queries use this pattern, which is inefficient and causes OOM errors for complex queries
Impact
Queries which use this APPROX_DISTINCT pattern will consume less memory
Test Plan
Adding test coverage, E2E and unit tests
Also did some E2E testing manually to make sure the substitution was occurring:
All unit tests were run on latest revision. Also, a verifier run was performed:
failed queries were due to load
Contributor checklist
Release Notes