-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-33971][SQL] Eliminate distinct from more aggregates #30999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…/tanelk/spark into SPARK-33971_eliminate_distinct
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Outdated
Show resolved
Hide resolved
|
Looks making sense otherwise. cc @gengliangwang and @cloud-fan FYI |
|
ok to test |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #133615 has finished for PR 30999 at commit
|
...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinctSuite.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #133870 has finished for PR 30999 at commit
|
|
Test build #133872 has finished for PR 30999 at commit
|
| case _: Min => true | ||
| case _: BitAndAgg => true | ||
| case _: BitOrAgg => true | ||
| case _: First => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First and Last are non-deterministic, does this matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In another PR I try to argue that they should be deterministic: #29810
TLDR: An analogous aggregator would be sum on float and double datatype - its result does depend on the order of its inputs, but is deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we finally make First and Last deterministic, then I guess they need to be removed from EliminateDistinct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's exclude First/Last here before #29810 is merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
@tanelk ping. |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #134514 has finished for PR 30999 at commit
|
|
retest this please |
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
|
Test build #135424 has finished for PR 30999 at commit
|
|
Thanks! Merged to master. |
What changes were proposed in this pull request?
Add more aggregate expressions to
EliminateDistinctrule.Why are the changes needed?
Distinct aggregation can add a significant overhead. It's better to remove distinct whenever possible.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT