Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Feb 22, 2021

This PR implements a first version of hash repartition.
This can used to (further) parallelize hash joins / aggregates or to implement distributed algorithms like ShuffleHashJoin(https://github.com/ballista-compute/ballista/issues/595 )

I didn't yet optimize for speed, as I think it makes sense to implement it and look for improvements later.

FYI @andygrove

@github-actions
Copy link

@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition ARROW-11733: [Rust][DataFusion] Hash repartition [WIP] Feb 22, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition [WIP] ARROW-11733: [Rust][DataFusion] Hash repartition Feb 23, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Hash repartition ARROW-11733: [Rust][DataFusion] Implement hash repartition Feb 23, 2021
@Dandandan Dandandan changed the title ARROW-11733: [Rust][DataFusion] Implement hash repartition ARROW-11733: [Rust][DataFusion] Implement hash partitioning Feb 23, 2021
@codecov-io
Copy link

Codecov Report

Merging #9548 (bbba43f) into master (39b23b7) will increase coverage by 0.00%.
The diff coverage is 81.13%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #9548   +/-   ##
=======================================
  Coverage   82.28%   82.29%           
=======================================
  Files         244      244           
  Lines       55616    55659   +43     
=======================================
+ Hits        45766    45804   +38     
- Misses       9850     9855    +5     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/repartition.rs 77.52% <80.76%> (+2.71%) ⬆️
rust/datafusion/src/physical_plan/hash_join.rs 83.52% <100.00%> (ø)
rust/datafusion/src/physical_plan/mod.rs 88.00% <0.00%> (+2.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 39b23b7...bbba43f. Read the comment docs.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @Dandandan. I agree that it makes sense to get the functionality working first and optimize later.

@alamb
Copy link
Contributor

alamb commented Feb 23, 2021

The integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- nice work @Dandandan


let total_rows: usize = output_partitions.iter().map(|x| x.len()).sum();

assert_eq!(8, output_partitions.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense here also to assert on the distribution of rows (e.g. ensure that each batch has ~ 50*3 rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, but not sure how to do that currently, as it depends on random state (it could happen that all of them end up on same hash / partition in a very rare case).

@Dandandan
Copy link
Contributor Author

@alamb from my side the PR is good to go

@alamb alamb closed this in d731c91 Feb 26, 2021
@alamb
Copy link
Contributor

alamb commented Feb 26, 2021

Thanks @Dandandan -- looks great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants