ARROW-10837: [Rust][DataFusion] Use Vec<u8> for hash keys
#8863
As a next step, I think it would be interesting to look at calculating the hashes column by column instead, to benefit from the columnar data layout. Some material I found on this: https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ (simple explanation). Please add to this if you know of more or newer material!
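To make the column-wise idea concrete, here is a minimal sketch, not DataFusion code. It assumes each key column is available as a plain slice and folds one column at a time into a buffer of per-row hashes. The names `hash_one` and `combine_column` are hypothetical, and the multiply-and-add combiner is only illustrative, not a tuned hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single value with the standard library's default hasher.
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Fold one key column into the running per-row hashes.
/// (Hypothetical helper: the inner loop walks a single column,
/// so it stays on one contiguous slice at a time.)
pub fn combine_column<T: Hash>(column: &[T], hashes: &mut [u64]) {
    assert_eq!(column.len(), hashes.len());
    for (h, value) in hashes.iter_mut().zip(column) {
        // Illustrative combiner only; a real implementation would pick
        // a stronger mixing function.
        *h = h.wrapping_mul(31).wrapping_add(hash_one(value));
    }
}
```

Calling `combine_column` once per key column on the same `hashes` buffer yields one hash per row. Rows with equal key values in every column end up with equal hashes, but unequal rows can still collide, so a join built on these hashes would still have to compare the actual key values.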
Codecov Report
@@ Coverage Diff @@
## master #8863 +/- ##
==========================================
- Coverage 77.03% 77.01% -0.03%
==========================================
Files 173 173
Lines 40090 40101 +11
==========================================
Hits 30885 30885
- Misses 9205 9216 +11
I just extended the change to hash aggregates as well. It turns out to give a good speedup for hash aggregate queries too! [This version] [Master]
andygrove
left a comment
Nice improvement!
jorgecarleitao
left a comment
Good work and optimization!
This PR is a follow-up to #8765. Here, the values that make up the hash map key are converted to a Vec<u8>, which is used as the key instead.
This is a bit faster, as both hashing and cloning are faster. It also uses less additional memory than the earlier use of the dynamic GroupByScalar values (for hash join).
[This PR]
[Master]
[This PR]
[Master]
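As an illustration of the approach in the description, the following is a minimal sketch of byte-encoded hash keys, not the code in this PR. `KeyValue` is a hypothetical stand-in for a dynamically typed scalar such as GroupByScalar, and `append_key_bytes` and `build_index` are assumed names. The little-endian encoding and the length prefix for strings are assumptions of the sketch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a dynamically typed key value.
#[derive(Debug, Clone, PartialEq)]
pub enum KeyValue {
    Int32(i32),
    Int64(i64),
    Utf8(String),
}

/// Append the byte encoding of one key value to `buf`.
pub fn append_key_bytes(value: &KeyValue, buf: &mut Vec<u8>) {
    match value {
        KeyValue::Int32(v) => buf.extend_from_slice(&v.to_le_bytes()),
        KeyValue::Int64(v) => buf.extend_from_slice(&v.to_le_bytes()),
        KeyValue::Utf8(s) => {
            // Length prefix (an assumption of this sketch) so that
            // ("ab", "c") and ("a", "bc") encode to different keys.
            buf.extend_from_slice(&(s.len() as u64).to_le_bytes());
            buf.extend_from_slice(s.as_bytes());
        }
    }
}

/// Map each encoded key to the indices of the rows that carry it,
/// similar in spirit to the build side of a hash join.
pub fn build_index(rows: &[Vec<KeyValue>]) -> HashMap<Vec<u8>, Vec<usize>> {
    let mut index: HashMap<Vec<u8>, Vec<usize>> = HashMap::new();
    let mut key: Vec<u8> = Vec::new();
    for (row, values) in rows.iter().enumerate() {
        // Reuse one buffer for every row instead of allocating per row.
        key.clear();
        for value in values {
            append_key_bytes(value, &mut key);
        }
        // Look up by borrowed slice; clone the key only when it is new.
        if let Some(indices) = index.get_mut(key.as_slice()) {
            indices.push(row);
        } else {
            index.insert(key.clone(), vec![row]);
        }
    }
    index
}
```

Because the key is a flat byte buffer, hashing is a single pass over contiguous memory and cloning is a single copy, which is the effect the description attributes to the change. Note that the sketch does not tag values with their type, so it relies on every row having the same column types in the same order.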
FWIW, micro benchmark results for aggregate queries:
FYI @jorgecarleitao