-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10807: [Rust][DataFusion] Avoid double hashing #8832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Is ready for review now @jorgecarleitao @andygrove |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Another cool improvement! :)
|
@alamb @andygrove , this introduces a new dependency to DataFusion. Is that ok for you? |
|
Some additional context: in the future, when the feature is stabilized, the hashbrown dependency can be dropped again. I think the raw entry api will be useful for future optimizations / hash join algorithms as well, for example it also allows for putting your own keys instead of based on a value. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me -- thank you @Dandandan
This PR shows one area for improvement in the hash join and aggregates. Currently the key is hashed twice by first looking up the key, and then inserting (which hashes the same key again) or mutating the value. This particularly is expensive with lots of distinct keys (ie most joins or aggregates on some high cardinality identifier).
Using the unstable
hash_raw_entry(or api in hashbrown) api we can avoid this, and get some speedup somewhere between 50-100ms (mostly in the hash join) for the tpch query 12.We could also use the hashbrown crate (which rust std lib also uses) instead to avoid needing a unstable feature.did thatThis brings the query tpc-h 12 times down from > 1500ms locally to:
FYI @jorgecarleitao @andygrove