Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Feb 27, 2021

This PR implements two optimizations

  • Change the way we create an array of indices for an inner join to avoid generating a null bit map. It seems currently not really ergonomic to do this with Arrow without resorting to an iterator (which would be hard to do here). This is around 3% difference
  • Allow to reuse allocations in create_hashes when possible. This is around 2% faster.

In total this gives a small (5%) speedup to query 5:

This PR:

Query 5 iteration 0 took 169.3 ms
Query 5 iteration 1 took 156.0 ms
Query 5 iteration 2 took 157.5 ms
Query 5 iteration 3 took 158.0 ms
Query 5 iteration 4 took 157.3 ms
Query 5 iteration 5 took 163.4 ms
Query 5 iteration 6 took 167.6 ms
Query 5 iteration 7 took 171.5 ms
Query 5 iteration 8 took 167.4 ms
Query 5 iteration 9 took 164.5 ms
Query 5 avg time: 163.26 ms

Master:

Query 5 iteration 0 took 177.6 ms
Query 5 iteration 1 took 169.6 ms
Query 5 iteration 2 took 171.8 ms
Query 5 iteration 3 took 175.1 ms
Query 5 iteration 4 took 167.2 ms
Query 5 iteration 5 took 171.1 ms
Query 5 iteration 6 took 174.2 ms
Query 5 iteration 7 took 178.1 ms
Query 5 iteration 8 took 167.9 ms
Query 5 iteration 9 took 172.0 ms
Query 5 avg time: 172.46 ms

@github-actions
Copy link

@Dandandan Dandandan changed the title ARROW-11806: Optimize inner join creation of indices ARROW-11806: Optimize join / inner join creation of indices Feb 28, 2021
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Dandandan

@alamb alamb changed the title ARROW-11806: Optimize join / inner join creation of indices ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices Mar 3, 2021
@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

The integration test failure in https://github.com/apache/arrow/pull/9595/checks?check_run_id=1998235390 seems to be the same as was fixed in #9593

I also pulled this branch locally, and re-ran the tests and everythings looks good to me

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code makes sense to me -- thank you @Dandandan

I would like to suggest we rename the hashes_buffer, hash_buff and hashes parameters consistently as I think they mean the same thing. I don't have a particular preference as to which, but I do think it would help readability a lot to use the same name

hash: &mut JoinHashMap,
offset: usize,
random_state: &RandomState,
hashes_buffer: &mut Vec<u64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively allowing hashes_buffer to be reused, right?

It may eventually make sense to make some struct that holds all the relevant state (on, random_state, hash_buf, etc).

Copy link
Contributor Author

@Dandandan Dandandan Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this change is for reusing the allocated Vec.

Yes, makes sense to group them in a struct. There are some opportunities in other functions build_join_indexes build_batch, etc. for this as well. Not sure if it makes sense they all receive the same struct, or maybe all of them a subset of the most commonly needed parts 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 definitely not for this PR

@Dandandan
Copy link
Contributor Author

Thanks @alamb resolved the incosistent naming.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice to me. Thanks @Dandandan

@codecov-io
Copy link

Codecov Report

Merging #9595 (2869040) into master (0f64726) will increase coverage by 0.04%.
The diff coverage is 78.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9595      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%     
==========================================
  Files         245      245              
  Lines       56407    57134     +727     
==========================================
+ Hits        46443    47068     +625     
- Misses       9964    10066     +102     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/hash_join.rs 84.16% <77.77%> (+0.64%) ⬆️
rust/datafusion/src/physical_plan/repartition.rs 81.21% <100.00%> (-0.14%) ⬇️
...datafusion/src/physical_plan/string_expressions.rs 73.38% <0.00%> (-3.62%) ⬇️
rust/arrow/src/array/equal/utils.rs 75.49% <0.00%> (-0.99%) ⬇️
rust/arrow/src/datatypes/field.rs 55.47% <0.00%> (-0.66%) ⬇️
rust/datafusion/src/physical_plan/parquet.rs 87.83% <0.00%> (-0.22%) ⬇️
rust/benchmarks/src/bin/tpch.rs 38.33% <0.00%> (ø)
...datafusion/src/physical_plan/crypto_expressions.rs 52.45% <0.00%> (ø)
...integration-testing/src/flight_server_scenarios.rs 0.00% <0.00%> (ø)
...-testing/src/flight_server_scenarios/middleware.rs 0.00% <0.00%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f64726...2869040. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented Mar 3, 2021

@Dandandan on no! It now seems to have failed rust fmt linting

@Dandandan
Copy link
Contributor Author

@alamb ha, thanks, fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants