ARROW-11733: [Rust][DataFusion] Implement hash partitioning #9548
Conversation
Codecov Report
@@            Coverage Diff            @@
##            master    #9548   +/-   ##
========================================
  Coverage    82.28%   82.29%
========================================
  Files          244      244
  Lines        55616    55659      +43
========================================
+ Hits         45766    45804      +38
- Misses        9850     9855       +5
Continue to review full report at Codecov.
andygrove left a comment
This looks great @Dandandan. I agree that it makes sense to get the functionality working first and optimize later.
The integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717
alamb left a comment
Looks good -- nice work @Dandandan
let total_rows: usize = output_partitions.iter().map(|x| x.len()).sum();
assert_eq!(8, output_partitions.len());
Would it make sense here also to assert on the distribution of rows (e.g. ensure that each batch has ~50*3 rows)?
Makes sense, but I'm not sure how to do that currently, as it depends on the random hash state (in a very rare case all of them could end up on the same hash / partition).
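(For reference, a minimal sketch of a check that stays robust under the random hash state, assuming `output_partitions: Vec<Vec<RecordBatch>>` as in the test above; the helper name and `expected_total_rows` are hypothetical. The idea is to assert on the total row count rather than on per-partition counts.)

```rust
use arrow::record_batch::RecordBatch;

// Sketch only: per-partition row counts depend on the random hash state,
// so a robust test asserts that all input rows are accounted for in total.
fn assert_total_rows(output_partitions: &[Vec<RecordBatch>], expected_total_rows: usize) {
    let total_rows: usize = output_partitions
        .iter()
        .flat_map(|partition| partition.iter())
        .map(|batch| batch.num_rows())
        .sum();
    assert_eq!(expected_total_rows, total_rows);
}
```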
@alamb from my side the PR is good to go
Thanks @Dandandan -- looks great.
This PR implements a first version of hash repartitioning.
This can be used to (further) parallelize hash joins / aggregates, or to implement distributed algorithms such as ShuffleHashJoin (https://github.com/ballista-compute/ballista/issues/595).
I didn't optimize for speed yet, as I think it makes sense to implement the functionality first and look for improvements later.
FYI @andygrove
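(Conceptually, hash repartitioning routes each row to an output partition based on the hash of its key column(s), so rows with equal keys always meet in the same partition and each partition can then be joined or aggregated independently. Below is a stand-alone, std-only sketch of that idea with illustrative types; it is not the DataFusion implementation from this PR.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign each (key, value) row to one of `num_partitions` output partitions
/// based on the hash of its key. Illustrative sketch, not the DataFusion code.
fn hash_partition<K: Hash, V>(rows: Vec<(K, V)>, num_partitions: usize) -> Vec<Vec<(K, V)>> {
    let mut partitions: Vec<Vec<(K, V)>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for (key, value) in rows {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        // Equal keys hash to the same partition, which is what allows a hash
        // join / aggregate to run independently on each output partition.
        let partition = (hasher.finish() as usize) % num_partitions;
        partitions[partition].push((key, value));
    }
    partitions
}

fn main() {
    let rows = vec![(1u64, "a"), (2, "b"), (1, "c"), (3, "d")];
    let partitions = hash_partition(rows, 4);
    // Both rows with key 1 end up in the same output partition.
    println!("{:?}", partitions);
}
```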