ARROW-11806: [Rust][DataFusion] Optimize join / inner join creation of indices #9595
andygrove
left a comment
LGTM. Thanks @Dandandan
The integration test failure in https://github.com/apache/arrow/pull/9595/checks?check_run_id=1998235390 seems to be the same one that was fixed in #9593. I also pulled this branch locally and re-ran the tests, and everything looks good to me.
alamb
left a comment
The code makes sense to me -- thank you @Dandandan
I would like to suggest we rename the hashes_buffer, hash_buff and hashes parameters consistently, as I think they mean the same thing. I don't have a particular preference as to which name, but I do think using the same one everywhere would help readability a lot.
    hash: &mut JoinHashMap,
    offset: usize,
    random_state: &RandomState,
    hashes_buffer: &mut Vec<u64>,
This is effectively allowing hashes_buffer to be reused, right?
It may eventually make sense to make some struct that holds all the relevant state (on, random_state, hash_buf, etc).
Indeed, this change is for reusing the allocated Vec.
Yes, it makes sense to group them in a struct. There are some opportunities for this in other functions as well (build_join_indexes, build_batch, etc.). Not sure whether it makes sense for all of them to receive the same struct, or for each to receive a subset of the most commonly needed parts 🤔
🤔 definitely not for this PR
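For reference, the buffer reuse and the struct idea discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `HashState` and `hash_batch` are hypothetical names, the keys are plain `i64` values rather than Arrow arrays, and the standard library's `RandomState` stands in for whichever hasher the join code actually uses.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical grouping of the state shared by the hashing helpers.
/// The name and fields are illustrative, not taken from DataFusion.
struct HashState {
    random_state: RandomState,
    hashes_buffer: Vec<u64>,
}

impl HashState {
    fn new() -> Self {
        Self {
            random_state: RandomState::new(),
            hashes_buffer: Vec::new(),
        }
    }

    /// Hash one batch of keys into the reused buffer.
    /// `clear` drops the old values but keeps the capacity, so a batch
    /// that is no larger than an earlier one needs no new allocation.
    fn hash_batch(&mut self, keys: &[i64]) -> &[u64] {
        self.hashes_buffer.clear();
        for key in keys {
            // Same RandomState for every key, so equal keys hash equally.
            let mut hasher = self.random_state.build_hasher();
            key.hash(&mut hasher);
            self.hashes_buffer.push(hasher.finish());
        }
        &self.hashes_buffer
    }
}
```

Passing `&mut Vec<u64>` as a parameter, as this PR does, gives the same reuse without a struct; the struct would only save threading several arguments through each call.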
Thanks @alamb, resolved the inconsistent naming.
alamb
left a comment
Looks nice to me. Thanks @Dandandan
Codecov Report
@@            Coverage Diff             @@
##           master    #9595      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%
==========================================
  Files         245      245
  Lines       56407    57134     +727
==========================================
+ Hits        46443    47068     +625
- Misses       9964    10066     +102
@Dandandan oh no! It now seems to have failed
@alamb ha, thanks, fixed!
This PR implements two optimizations
create_hashes when possible. This is around 2% faster.
In total this gives a small (5%) speedup to query 5:
This PR:
Master: