ARROW-10807: [Rust][DataFusion] Avoid double hashing #8832

Dandandan · 2020-12-04T11:58:58Z

This PR addresses one area for improvement in the hash join and aggregates. Currently the key is hashed twice: first when looking it up, and then again when inserting it (the insert hashes the same key again) or mutating the value. This is particularly expensive with many distinct keys (i.e. most joins or aggregates on a high-cardinality identifier).
Using the unstable `hash_raw_entry` feature (or the equivalent API in hashbrown) we can avoid this, for a speedup of roughly 50-100 ms (mostly in the hash join) on TPC-H query 12.
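
To make the cost concrete, here is a minimal sketch of the double-hashing pattern next to a single-hash alternative. It is not the code from this PR: it uses only the stable `entry` API from `std` as a stand-in for the raw entry API, and `CountingState`, `CountMap`, `new_map`, `count_two_step` and `count_one_step` are hypothetical names introduced for illustration. The counter records how often the map starts hashing a key.

```rust
use std::cell::Cell;
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasher;
use std::rc::Rc;

/// Hypothetical helper: a `BuildHasher` that counts how many times the map
/// begins hashing a key (one `build_hasher` call per hash computation).
#[derive(Clone, Default)]
struct CountingState {
    calls: Rc<Cell<usize>>,
}

impl BuildHasher for CountingState {
    type Hasher = DefaultHasher;

    fn build_hasher(&self) -> DefaultHasher {
        self.calls.set(self.calls.get() + 1);
        DefaultHasher::new()
    }
}

type CountMap = HashMap<u64, u64, CountingState>;

fn new_map(state: &CountingState) -> CountMap {
    // Pre-size the table so that no resize (which rehashes the stored keys)
    // distorts the count. 1024 is an arbitrary size for this sketch.
    HashMap::with_capacity_and_hasher(1024, state.clone())
}

/// Look up, then insert on a miss: a new key is hashed by the lookup
/// and hashed again by `insert`.
fn count_two_step(keys: &[u64]) -> (CountMap, usize) {
    let state = CountingState::default();
    let mut map = new_map(&state);
    for &k in keys {
        if let Some(v) = map.get_mut(&k) {
            *v += 1;
        } else {
            map.insert(k, 1);
        }
    }
    (map, state.calls.get())
}

/// Entry-based update: the key is hashed once, and the resulting slot is
/// reused for either the insert or the in-place mutation.
fn count_one_step(keys: &[u64]) -> (CountMap, usize) {
    let state = CountingState::default();
    let mut map = new_map(&state);
    for &k in keys {
        *map.entry(k).or_insert(0) += 1;
    }
    (map, state.calls.get())
}
```

Calling both functions on a slice of all-distinct keys should yield equal maps, with the two-step version reporting more hash computations than the entry-based one. That matches the high-cardinality case described above, where nearly every key is new.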

~~We could also use the hashbrown crate (which the Rust standard library also uses) instead, to avoid needing an unstable feature.~~ Did that.

This brings the TPC-H query 12 run time down from more than 1500 ms locally to:

Query 12 iteration 0 took 1399 ms
Query 12 iteration 1 took 1398 ms
Query 12 iteration 2 took 1402 ms
Query 12 iteration 3 took 1407 ms
Query 12 iteration 4 took 1409 ms
Query 12 iteration 5 took 1406 ms
Query 12 iteration 6 took 1414 ms
Query 12 iteration 7 took 1416 ms
Query 12 iteration 8 took 1423 ms
Query 12 iteration 9 took 1430 ms

FYI @jorgecarleitao @andygrove

github-actions · 2020-12-04T12:13:37Z

https://issues.apache.org/jira/browse/ARROW-10807

Dandandan · 2020-12-04T17:30:08Z

This is ready for review now @jorgecarleitao @andygrove

jorgecarleitao

LGTM. Another cool improvement! :)

jorgecarleitao · 2020-12-07T05:54:46Z

@alamb @andygrove, this introduces a new dependency to DataFusion. Is that OK with you?

Dandandan · 2020-12-07T18:09:24Z

Some additional context: in the future, when the feature is stabilized, the hashbrown dependency can be dropped again. I think the raw entry API will be useful for future optimizations and hash join algorithms as well; for example, it also lets you supply your own keys instead of ones derived from a value.

alamb

This looks reasonable to me -- thank you @Dandandan

Dandandan added 2 commits December 4, 2020 12:33

Avoid double hashing using raw_entry api

d855815

Avoid double hashing using raw_entry api

475946f

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 4, 2020

fmt

379fc6c

Dandandan added 2 commits December 4, 2020 17:50

Use hashbrown to avoid unstable feature

32c84a2

Return error as before

a5bb8a3

Map to same error

94e1202

jorgecarleitao approved these changes Dec 4, 2020

Dandandan mentioned this pull request Dec 5, 2020

ARROW-10810: [Rust] Speed up comparison kernels #8837

Closed

andygrove approved these changes Dec 7, 2020

jorgecarleitao closed this in 3cbc482 Dec 7, 2020

alamb reviewed Dec 7, 2020

asfimport mentioned this pull request Dec 7, 2020

[Rust][DataFusion] Avoid double hashing #26745

Closed
