[SPARK-9241] [SQL] [WIP] Supporting multiple DISTINCT columns #9280
Jenkins, this is ok to test.
Test build #44363 has finished for PR 9280 at commit
Test build #44376 has finished for PR 9280 at commit
I'll remove this in the next iteration...
Removed....
Test build #44430 has finished for PR 9280 at commit
@hvanhovell I think the approach we want to take is to either use the aggregate expansion (#9406), or use joins, and not this one. Do you mind closing this one? (We already have a record of it from JIRA in case we need to reference it).
Closing PR. Some of the code in here will probably re-emerge if we ever want distincts in window functions.
…g Rule

The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](#9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an `OpenHashSet` allocating too much memory.

However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc yhuai - Could you also tell me where to add tests for this?

Author: Herman van Hovell <[email protected]>

Closes #9406 from hvanhovell/SPARK-9241-rewriter.

(cherry picked from commit 6d0ead3)
Signed-off-by: Michael Armbrust <[email protected]>
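The Expand-Aggregate-Aggregate rewrite described in the commit message above can be illustrated with a small plain-Python sketch. This is not the Spark rule itself: the query shape (`SELECT key, COUNT(DISTINCT a), COUNT(DISTINCT b), SUM(c) ... GROUP BY key`), the `gid` tag, and all function names are assumptions made for illustration. It shows why the input grows by a factor of "number of distinct clauses plus one", and how the first aggregate deduplicates values so that the second one only needs to count.

```python
from collections import defaultdict

def expand(rows):
    """Emit one copy of each row per DISTINCT column, plus one for regular aggregates.

    Each copy carries a hypothetical group id (gid) and nulls out the columns
    that do not belong to that group.
    """
    for key, a, b, c in rows:
        yield (key, None, None, 0, c)     # gid 0: input for SUM(c)
        yield (key, a, None, 1, None)     # gid 1: input for COUNT(DISTINCT a)
        yield (key, None, b, 2, None)     # gid 2: input for COUNT(DISTINCT b)

def first_aggregate(expanded):
    """Group by (key, a, b, gid): duplicates collapse, c is partially summed."""
    partial = defaultdict(int)
    for key, a, b, gid, c in expanded:
        partial[(key, a, b, gid)] += c if c is not None else 0
    return partial

def second_aggregate(partial):
    """Group by key only: count surviving distinct values and merge partial sums."""
    result = defaultdict(lambda: {"distinct_a": 0, "distinct_b": 0, "sum_c": 0})
    for (key, a, b, gid), c_sum in partial.items():
        out = result[key]
        if gid == 0:
            out["sum_c"] += c_sum
        elif gid == 1 and a is not None:  # COUNT ignores NULLs
            out["distinct_a"] += 1
        elif gid == 2 and b is not None:
            out["distinct_b"] += 1
    return dict(result)

# Made-up sample input: (key, a, b, c)
sample_rows = [("x", 1, 10, 5), ("x", 1, 20, 5), ("x", 2, 20, 5), ("y", 3, 30, 1)]
expanded_rows = list(expand(sample_rows))
final = second_aggregate(first_aggregate(expanded_rows))
```

With two distinct clauses, the four sample rows become twelve expanded rows, which matches the row multiplication the commit message warns about. Neither aggregate keeps a per-group set, which is the property that lets the rewritten plan use an ordinary hash aggregation.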
This PR adds support for multiple distinct columns to the new aggregation code path.
The implementation uses the `OpenHashSet` class and set expressions. As a result, we can only use the slower sort-based aggregation code path. This also means the code will probably be slower than the old hash aggregation.
The PR is currently in the proof-of-concept phase, and I have submitted it to get some feedback to see if I am headed in the right direction. I'll add more tests if this is considered to be the way to go.
An example using the new code path:
cc @yhuai
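For comparison, the set-based idea this PR description refers to can be sketched in plain Python. This is not the code from the PR: a built-in `set` stands in for `OpenHashSet`, and the function name and row layout are hypothetical. The point is only that each group keeps one set per DISTINCT column, so memory grows with the number of distinct values seen per group.

```python
from collections import defaultdict

def count_distinct_with_sets(rows):
    """Single pass over (key, a, b) rows, keeping one set per distinct column per group."""
    # Python's set is a stand-in for OpenHashSet (assumption for illustration).
    seen = defaultdict(lambda: (set(), set()))
    for key, a, b in rows:
        seen_a, seen_b = seen[key]
        if a is not None:          # COUNT(DISTINCT ...) ignores NULLs
            seen_a.add(a)
        if b is not None:
            seen_b.add(b)
    # The distinct count of each column is the size of its set.
    return {key: (len(sa), len(sb)) for key, (sa, sb) in seen.items()}

# Made-up sample input: (key, a, b)
set_sample = [("x", 1, 10), ("x", 1, 20), ("x", 2, 20), ("y", 3, None)]
set_counts = count_distinct_with_sets(set_sample)
```

Unlike the expansion approach, this reads each input row once, but a group with many distinct values needs a correspondingly large set, which is the OOM risk raised in the discussion.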