ARROW-11733: [Rust][DataFusion] Implement hash partitioning #9548
Conversation
Codecov Report
@@            Coverage Diff            @@
##            master    #9548   +/-   ##
========================================
  Coverage    82.28%   82.29%
========================================
  Files          244      244
  Lines        55616    55659      +43
========================================
+ Hits         45766    45804      +38
- Misses        9850     9855       +5
Continue to review full report at Codecov.
andygrove left a comment
This looks great @Dandandan. I agree that it makes sense to get the functionality working first and optimize later.
The integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717
alamb left a comment
Looks good -- nice work @Dandandan
let total_rows: usize = output_partitions.iter().map(|x| x.len()).sum();
assert_eq!(8, output_partitions.len());
Would it make sense here also to assert on the distribution of rows (e.g. ensure that each batch has ~50*3 rows)?
Makes sense, but I'm not sure how to do that currently, as it depends on the random hash state (in a very rare case all of them could end up on the same hash / partition).
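(For reference, a minimal sketch of a check that stays robust under the random hash state, assuming `output_partitions: Vec<Vec<RecordBatch>>` as in the test above; the helper name and `expected_total_rows` are hypothetical. The idea is to assert on the total row count rather than on per-partition counts.)

```rust
use arrow::record_batch::RecordBatch;

// Sketch only: per-partition row counts depend on the random hash state,
// so a robust test asserts that all input rows are accounted for in total.
fn assert_total_rows(output_partitions: &[Vec<RecordBatch>], expected_total_rows: usize) {
    let total_rows: usize = output_partitions
        .iter()
        .flat_map(|partition| partition.iter())
        .map(|batch| batch.num_rows())
        .sum();
    assert_eq!(expected_total_rows, total_rows);
}
```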
@alamb from my side the PR is good to go
Thanks @Dandandan -- looks great.
This PR implements a first version of hash repartitioning.
This can be used to (further) parallelize hash joins / aggregates, or to implement distributed algorithms such as ShuffleHashJoin (https://github.com/ballista-compute/ballista/issues/595).
I didn't optimize for speed yet, as I think it makes sense to implement the functionality first and look for improvements later.
FYI @andygrove
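(Conceptually, hash repartitioning routes each row to an output partition based on the hash of its key column(s), so rows with equal keys always meet in the same partition and each partition can then be joined or aggregated independently. Below is a stand-alone, std-only sketch of that idea with illustrative types; it is not the DataFusion implementation from this PR.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign each (key, value) row to one of `num_partitions` output partitions
/// based on the hash of its key. Illustrative sketch, not the DataFusion code.
fn hash_partition<K: Hash, V>(rows: Vec<(K, V)>, num_partitions: usize) -> Vec<Vec<(K, V)>> {
    let mut partitions: Vec<Vec<(K, V)>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for (key, value) in rows {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        // Equal keys hash to the same partition, which is what allows a hash
        // join / aggregate to run independently on each output partition.
        let partition = (hasher.finish() as usize) % num_partitions;
        partitions[partition].push((key, value));
    }
    partitions
}

fn main() {
    let rows = vec![(1u64, "a"), (2, "b"), (1, "c"), (3, "d")];
    let partitions = hash_partition(rows, 4);
    // Both rows with key 1 end up in the same output partition.
    println!("{:?}", partitions);
}
```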