ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices #9595

Dandandan · 2021-02-27T12:20:47Z

This PR implements two optimizations:

- Change the way we create the array of indices for an inner join so that no null bitmap is generated. Arrow does not currently seem to offer an ergonomic way to do this without resorting to an iterator (which would be hard to do here). This makes a difference of around 3%.
- Allow create_hashes to reuse allocations when possible. This is around 2% faster.
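To illustrate the first optimization, here is a minimal std-only sketch. It is not the code from this PR and it does not use the Arrow API: `IndexArray`, `build_with_bitmap` and `build_without_bitmap` are hypothetical names made up for this example. The only point is that every index produced by an inner join is valid, so the validity bitmap never needs to be built.

```rust
/// Hypothetical stand-in for an Arrow primitive array: a values buffer plus
/// an optional validity bitmap (`None` means "there are no nulls").
#[derive(Debug, PartialEq)]
struct IndexArray {
    values: Vec<u64>,
    validity: Option<Vec<u8>>, // packed bits, 1 = valid
}

/// Builder-style path: sets a validity bit for every appended value,
/// even though none of the values is ever null.
fn build_with_bitmap(indices: &[u64]) -> IndexArray {
    let mut values = Vec::with_capacity(indices.len());
    let mut validity = vec![0u8; (indices.len() + 7) / 8];
    for (i, &idx) in indices.iter().enumerate() {
        values.push(idx);
        validity[i / 8] |= 1u8 << (i % 8);
    }
    IndexArray { values, validity: Some(validity) }
}

/// Inner-join path: all indices are valid, so the values buffer is taken
/// as-is and no bitmap is allocated or written.
fn build_without_bitmap(indices: Vec<u64>) -> IndexArray {
    IndexArray { values: indices, validity: None }
}
```

Both functions produce the same values; the second one simply skips the per-element bitmap work and the extra allocation.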

In total, this gives a small (about 5%) speedup on query 5:

This PR:

Query 5 iteration 0 took 169.3 ms
Query 5 iteration 1 took 156.0 ms
Query 5 iteration 2 took 157.5 ms
Query 5 iteration 3 took 158.0 ms
Query 5 iteration 4 took 157.3 ms
Query 5 iteration 5 took 163.4 ms
Query 5 iteration 6 took 167.6 ms
Query 5 iteration 7 took 171.5 ms
Query 5 iteration 8 took 167.4 ms
Query 5 iteration 9 took 164.5 ms
Query 5 avg time: 163.26 ms

Master:

Query 5 iteration 0 took 177.6 ms
Query 5 iteration 1 took 169.6 ms
Query 5 iteration 2 took 171.8 ms
Query 5 iteration 3 took 175.1 ms
Query 5 iteration 4 took 167.2 ms
Query 5 iteration 5 took 171.1 ms
Query 5 iteration 6 took 174.2 ms
Query 5 iteration 7 took 178.1 ms
Query 5 iteration 8 took 167.9 ms
Query 5 iteration 9 took 172.0 ms
Query 5 avg time: 172.46 ms
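As a quick check of the "about 5%" figure, the averages and the relative speedup can be recomputed from the timings listed above. This is only a sketch: `mean` and `relative_speedup` are helper names invented here, and the inputs are the rounded per-iteration values, so the PR average comes out near 163.25 ms rather than exactly the reported 163.26 ms.

```rust
/// Per-iteration timings in milliseconds, copied from the two runs above.
const PR_MS: [f64; 10] = [
    169.3, 156.0, 157.5, 158.0, 157.3, 163.4, 167.6, 171.5, 167.4, 164.5,
];
const MASTER_MS: [f64; 10] = [
    177.6, 169.6, 171.8, 175.1, 167.2, 171.1, 174.2, 178.1, 167.9, 172.0,
];

/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Fraction of the baseline time saved by the candidate.
fn relative_speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (baseline_ms - candidate_ms) / baseline_ms
}
```

With these inputs, `relative_speedup(mean(&MASTER_MS), mean(&PR_MS))` lands between 5% and 6%, which matches the sum of the two individual improvements quoted in the description.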

github-actions · 2021-02-27T12:21:07Z

https://issues.apache.org/jira/browse/ARROW-11806

andygrove

LGTM. Thanks @Dandandan

alamb · 2021-03-03T13:29:48Z

The integration test failure in https://github.com/apache/arrow/pull/9595/checks?check_run_id=1998235390 seems to be the same as was fixed in #9593

I also pulled this branch locally and re-ran the tests, and everything looks good to me

alamb

The code makes sense to me -- thank you @Dandandan

I would like to suggest we rename the hashes_buffer, hash_buff and hashes parameters consistently, as I think they mean the same thing. I don't have a particular preference as to which name we pick, but I do think using the same one everywhere would help readability a lot

alamb · 2021-03-03T13:33:54Z

rust/datafusion/src/physical_plan/hash_join.rs

    hash: &mut JoinHashMap,
    offset: usize,
    random_state: &RandomState,
+    hashes_buffer: &mut Vec<u64>,


This is effectively allowing hashes_buffer to be reused, right?

It may eventually make sense to make some struct that holds all the relevant state (on, random_state, hash_buf, etc).

Indeed, this change is for reusing the allocated Vec.
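For readers skimming the thread, the reuse pattern looks roughly like the sketch below. It is a simplified stand-in, not the actual DataFusion function: this `create_hashes` takes a plain `&[i64]` key column, and it uses the standard library's `RandomState` so that the example is self-contained. Both of those are assumptions of this example.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Simplified, hypothetical version of `create_hashes`: instead of returning
/// a freshly allocated Vec per batch, it refills a caller-owned buffer.
fn create_hashes(keys: &[i64], random_state: &RandomState, hashes_buffer: &mut Vec<u64>) {
    // `clear` drops the old contents but keeps the allocation.
    hashes_buffer.clear();
    hashes_buffer.extend(keys.iter().map(|key| {
        let mut hasher = random_state.build_hasher();
        key.hash(&mut hasher);
        hasher.finish()
    }));
}
```

A caller creates the `Vec<u64>` once and passes `&mut` to it for every batch; as long as a later batch is no larger than an earlier one, no new allocation is needed.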

Yes, it makes sense to group them in a struct. There are some opportunities for this in other functions as well (build_join_indexes, build_batch, etc.). Not sure whether they should all receive the same struct, or whether each should receive a subset of the most commonly needed parts 🤔

🤔 definitely not for this PR
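Purely as an illustration of the follow-up idea discussed above (and explicitly not part of this PR), a state struct could look something like the sketch below. `JoinHashState`, its fields and `hash_keys` are hypothetical names, the key column is simplified to `&[i64]`, and the standard library's `RandomState` is used only to keep the example self-contained.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical grouping of the state that is currently passed around as
/// separate parameters (join columns, random state, reusable hash buffer).
struct JoinHashState {
    on: Vec<String>,
    random_state: RandomState,
    hashes_buffer: Vec<u64>,
}

impl JoinHashState {
    fn new(on: Vec<String>) -> Self {
        JoinHashState {
            on,
            random_state: RandomState::new(),
            hashes_buffer: Vec::new(),
        }
    }

    /// Hashes one batch of keys into the internal buffer and returns a view
    /// of it; the buffer's allocation is reused across calls.
    fn hash_keys(&mut self, keys: &[i64]) -> &[u64] {
        self.hashes_buffer.clear();
        for key in keys {
            let mut hasher = self.random_state.build_hasher();
            key.hash(&mut hasher);
            self.hashes_buffer.push(hasher.finish());
        }
        &self.hashes_buffer
    }
}
```

Whether every function should take the whole struct or only the parts it needs is the open question raised in the thread; this sketch does not try to answer it.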


Dandandan · 2021-03-03T18:43:59Z

Thanks @alamb, resolved the inconsistent naming.

alamb

Looks nice to me. Thanks @Dandandan

codecov-io · 2021-03-03T20:07:52Z

Codecov Report

Merging #9595 (2869040) into master (0f64726) will increase coverage by 0.04%.
The diff coverage is 78.72%.

@@            Coverage Diff             @@
##           master    #9595      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%     
==========================================
  Files         245      245              
  Lines       56407    57134     +727     
==========================================
+ Hits        46443    47068     +625     
- Misses       9964    10066     +102

Impacted Files	Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs	`84.16% <77.77%> (+0.64%)`	⬆️
rust/datafusion/src/physical_plan/repartition.rs	`81.21% <100.00%> (-0.14%)`	⬇️
...datafusion/src/physical_plan/string_expressions.rs	`73.38% <0.00%> (-3.62%)`	⬇️
rust/arrow/src/array/equal/utils.rs	`75.49% <0.00%> (-0.99%)`	⬇️
rust/arrow/src/datatypes/field.rs	`55.47% <0.00%> (-0.66%)`	⬇️
rust/datafusion/src/physical_plan/parquet.rs	`87.83% <0.00%> (-0.22%)`	⬇️
rust/benchmarks/src/bin/tpch.rs	`38.33% <0.00%> (ø)`
...datafusion/src/physical_plan/crypto_expressions.rs	`52.45% <0.00%> (ø)`
...integration-testing/src/flight_server_scenarios.rs	`0.00% <0.00%> (ø)`
...-testing/src/flight_server_scenarios/middleware.rs	`0.00% <0.00%> (ø)`
... and 10 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f64726...2869040. Read the comment docs.

alamb · 2021-03-03T23:17:10Z

@Dandandan oh no! It now seems to have failed rust fmt linting

Dandandan · 2021-03-04T07:58:21Z

@alamb ha, thanks, fixed!

Dandandan added 2 commits February 27, 2021 12:48

Optimize builder usage

df826f1

Cleanup

a713f0e

github-actions bot added Component: Rust - DataFusion Component: Rust labels Feb 27, 2021

Dandandan changed the title ~~ARROW-11806: Optimize inner join creation of indices~~ ARROW-11806: Optimize join / inner join creation of indices Feb 28, 2021

Save allocations in calculating hashes

5e0f91c

andygrove approved these changes Feb 28, 2021


alamb changed the title ~~ARROW-11806: Optimize join / inner join creation of indices~~ ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices Mar 3, 2021

alamb reviewed Mar 3, 2021


Rename to hashes_buffer

2869040

alamb approved these changes Mar 3, 2021


Fmt

be37882

alamb closed this in ec5934a Mar 4, 2021

asfimport mentioned this pull request Mar 4, 2021

[Rust][DataFusion] Optimize inner join creation of indices #27657

Closed
