Speed up hash partitioning #6822

Open
Dandandan opened this issue Jul 2, 2023 · 1 comment
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Contributor

Dandandan commented Jul 2, 2023

Is your feature request related to a problem or challenge?

Also see the related request in arrow-rs: apache/arrow-rs#4476

In DataFusion, a common operation is to repartition a RecordBatch by hashing one or more columns and splitting its rows into one record batch per partition, using the formula hash % num_partitions.

The current approach is to compute, for each partition, the indices of the rows that belong to it, and then use those indices to take the rows from the individual arrays (see BatchPartitioner in DataFusion).

However, this is relatively expensive: we visit each array num_partitions times, at scattered positions, which makes the operator cache-inefficient (especially when the number of partitions is high).
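To make the access pattern concrete, here is a minimal sketch of the index-then-take approach using plain `Vec`s instead of Arrow arrays. It is not the actual BatchPartitioner code: the function names `partition_indices`, `take`, and `repartition` are hypothetical, all columns are assumed to be `i64`, and the row hashes are assumed to be precomputed.

```rust
/// For each partition, collect the indices of the rows whose
/// `hash % num_partitions` equals that partition.
/// (Hypothetical helper; hashes are assumed to be precomputed.)
fn partition_indices(hashes: &[u64], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut indices: Vec<Vec<usize>> = vec![Vec::new(); num_partitions];
    for (row, hash) in hashes.iter().enumerate() {
        let partition = (*hash % num_partitions as u64) as usize;
        indices[partition].push(row);
    }
    indices
}

/// Gather ("take") the rows at `indices` from a single column.
fn take(column: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| column[i]).collect()
}

/// Repartition a batch given as a slice of columns.
/// Every column is read once per partition, each time at scattered
/// positions, which is the cache-unfriendly pattern described above.
fn repartition(
    columns: &[Vec<i64>],
    hashes: &[u64],
    num_partitions: usize,
) -> Vec<Vec<Vec<i64>>> {
    partition_indices(hashes, num_partitions)
        .iter()
        .map(|idx| columns.iter().map(|col| take(col, idx)).collect())
        .collect()
}
```

With `num_partitions` partitions and `c` columns this performs `num_partitions * c` separate gathers, each producing a small output from reads spread over the whole input column.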

Describe the solution you'd like

A faster hash-partitioning implementation.

Describe alternatives you've considered

No response

Additional context

No response

@Dandandan Dandandan added the enhancement New feature or request label Jul 2, 2023
@Dandandan Dandandan changed the title Speed up partitioning operator Speed up hash partitioning operator Jul 2, 2023
@Dandandan Dandandan changed the title Speed up hash partitioning operator Speed up hash partitioning Jul 2, 2023
@alamb
Contributor

alamb commented Jul 3, 2023

I recommend we look into implementing selection vectors / bitmasks; repartitioning could then become a calculation of such filters / bitmasks.
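One possible reading of this suggestion, as a rough sketch only: compute one mask per partition in a single pass over the hashes, and let downstream consumers read the original column through the mask instead of copying rows. The names `partition_masks` and `selected` are hypothetical, and `Vec<bool>` stands in for a packed bitmap; none of this is an existing DataFusion or arrow-rs API.

```rust
/// Build one boolean mask per partition in a single pass over the hashes.
/// `masks[p][row]` is true when `hash % num_partitions == p`.
/// (Hypothetical helper; a real implementation would likely use a packed bitmap.)
fn partition_masks(hashes: &[u64], num_partitions: usize) -> Vec<Vec<bool>> {
    let mut masks = vec![vec![false; hashes.len()]; num_partitions];
    for (row, hash) in hashes.iter().enumerate() {
        let partition = (*hash % num_partitions as u64) as usize;
        masks[partition][row] = true;
    }
    masks
}

/// Iterate over the rows of `column` selected by `mask` without
/// materializing a new column.
fn selected<'a, T>(column: &'a [T], mask: &'a [bool]) -> impl Iterator<Item = &'a T> + 'a {
    column
        .iter()
        .zip(mask.iter())
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value)
}
```

In this sketch the input columns are shared by all partitions and no rows are copied during repartitioning; the trade-off is that every consumer has to be able to honor a mask.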
