ARROW-4468: [Rust] Implement BitAnd/BitOr for &Buffer (with SIMD) #3571
Conversation
@paddyhoran it looks great! I'm glad that you've started to implement the compute module with SIMD! However, providing operator kernels for buffers seems a bit odd to me. How about implementing the `std::ops` traits (`BitAnd`/`BitOr`) instead?
Yeah, I started to do this but decided not to. I plan for these functions to evolve over time as I add the rest of the kernels. One situation that came to mind was that when implementing kernels on arrays I will likely want to process the values buffer and bitmap buffer in the same loop, which would be made difficult by that approach. As SIMD is being introduced, I wanted to submit this PR because it's smaller in scope and isolated to buffers, so that people could comment on the SIMD approach. Most of what will be in the `boolean_kernels` file will be kernels for arrays, and these functions may just become "helper" functions. In short, I do plan to add this.
I just want to understand the approach here. Do we do the following?
SIMD_AND(left data, right data) -> result as bitmask
SIMD_AND(left validity, right validity) -> result as bitmask
SIMD_AND(result1, result2)
The first SIMD_AND will have performed the operation on garbage data as well (where the validity bit is 0), but that is fine because we avoid a branch and finally use the validity to get the end result. I am not sure how exactly we use the validity in the code below.
Yes, that's correct.
Our BooleanArray (bit packed) and Bitmask are both backed by a Buffer. For other kernels like ADD, etc. we will reuse the SIMD_AND on the Bitmask.
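To make the branch-free pattern above concrete, here is a minimal scalar sketch (a hypothetical helper, not code from this PR): the values of two bit-packed boolean columns are ANDed unconditionally, and the validity bitmaps are ANDed separately, so any "garbage" result bits at null slots are masked out by the result validity.

```rust
// Hypothetical sketch of the branch-free AND-with-validity pattern.
// Each u64 word holds 64 packed bits; both columns are assumed to be
// the same length and word-aligned.
fn and_with_validity(
    left_values: &[u64],
    left_validity: &[u64],
    right_values: &[u64],
    right_validity: &[u64],
) -> (Vec<u64>, Vec<u64>) {
    // Values are ANDed unconditionally; bits at null slots may be garbage,
    // which is harmless because the result validity masks them out.
    let values: Vec<u64> = left_values
        .iter()
        .zip(right_values)
        .map(|(l, r)| l & r)
        .collect();
    // A slot in the result is valid only where both inputs are valid.
    let validity: Vec<u64> = left_validity
        .iter()
        .zip(right_validity)
        .map(|(l, r)| l & r)
        .collect();
    (values, validity)
}
```

Because both loops are straight bitwise operations over contiguous words with no branches, the compiler is free to vectorise them.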
Thanks.
@paddyhoran, this is great. Are we adding SIMD kernels in a generic manner? In other words, can we leverage the SIMD acceleration done here for similar operations on Arrow vectors in other languages (C++, Java, etc.)?
This is one of the great value propositions of Arrow: writing computation in one language and reusing it across many. Right now, I plan to update all our compute in Rust to use SIMD by the 0.13.0 release. Once up and running we could talk about how we could expose it to other implementations. I really want to be able to write high-performance code in Rust and expose it to other languages.
LGTM - this can be merged if you are not planning to add more changes. Another question though: from my prior experience with SIMD (in C/C++ land), I leveraged Intel compiler intrinsics (platform dependent) to work with the underlying SIMD instructions. How are these instructions made available in Rust? Secondly, I see that in the code we check whether the target architecture (x86/x86_64) has SIMD support. I am not sure where we are checking whether to use SSE, AVX2, or AVX-512?
Actually, I do plan to refactor this a little tonight; @kszucs got me thinking. Intrinsics are in the standard library. We can look at adding runtime detection of instructions, but I plan to add this later behind a feature flag, as it's not always what you want for maximum performance. For instance, right now if I don't ask for any specific instructions it defaults to SSE. I have AVX2 available on my dev machine, but there's no difference, as the computation is completely memory bound. As others will likely be building higher-level libraries on Arrow, I thought it best to give all the options to developers (i.e. static detection plus optional runtime detection behind a feature flag).
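As an illustration of the two detection styles discussed above (this is a sketch, not the PR's code): static detection fixes the instruction set at compile time via `RUSTFLAGS="-C target-feature=+avx2"`, while runtime detection queries the running CPU with `std`'s `is_x86_feature_detected!` macro. The dispatch function and kernel names below are hypothetical, and the portable loop stands in for a real AVX2-specialised kernel.

```rust
// Portable scalar kernel; LLVM can auto-vectorise this loop, and it also
// serves as the fallback when no special instructions are detected.
fn and_chunks(left: &[u8], right: &[u8]) -> Vec<u8> {
    left.iter().zip(right).map(|(l, r)| l & r).collect()
}

// Hypothetical runtime-dispatch wrapper (the kind of thing that could sit
// behind a feature flag, as described above).
fn bitwise_and_dispatch(left: &[u8], right: &[u8]) -> Vec<u8> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would call an AVX2-specialised kernel
            // here (e.g. via core::arch intrinsics); this sketch falls
            // through to the portable loop either way.
            return and_chunks(left, right);
        }
    }
    and_chunks(left, right)
}
```

Note that runtime detection costs a branch per call, which is one reason to keep it optional when the caller can guarantee the target features at compile time.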
@kszucs @siddharthteotia @andygrove @sunchao could you please give this a review when you have a chance. @kszucs I re-worked the code to use traits instead, while extracting what I could for reuse in other kernel implementations (where I will want to operate on the values buffer and bitmap buffer in the same loop), and removed what was no longer needed. I left the benchmark in place, renamed, to illustrate the speed-up, but this will be replaced by benchmarks on the array kernels in time.
rust/arrow/src/buffer.rs
I should have made this generic over the operators (`&` and `|`) to clean this up; I'll follow up when I get a chance. Please ignore this duplication of code when reviewing.
rust/arrow/benches/bitwise_ops.rs
I wonder where the similar functions for bitwise OR are?
Eventually I plan to benchmark the array kernels, not the buffer-level operations. The current benchmark was just to demonstrate that the PR was in fact speeding things up, so I added only the AND version, almost as an example. I'll add the OR version also.
Got it.
(force-pushed from db732c1 to a8534b1)
Rebased.
LGTM. @paddyhoran, are you planning to add more changes, or is this good to go?
This is good to go, IMHO. I'd like to get it merged soon, as I'm almost ready to submit another PR that builds on it.
Merged. Thanks @paddyhoran |
This PR lays the groundwork for future PRs adding explicit SIMD to what was called "array_ops". My plan is to migrate "ops" into the compute sub-module as I add explicit SIMD.
This PR includes the following:
- the `packed_simd` crate
- `bitwise_and` and `bitwise_or` functions for `Buffer`s

I'm adding these first as they are needed when updating the bitmap in all array kernels. The functions above include compile-time SIMD, i.e. you have to use `RUSTFLAGS="-C target-feature=+avx2"` or similar. However, even without this you should see a speed-up, as most modern processors will use certain instructions even if you do not explicitly ask for them (in my case the computation is completely memory bound, and using larger registers would not speed it up further).
I have included a benchmark to illustrate the speed-up. However, as the compute sub-module evolves I expect that the benchmarks will be for array kernels and not buffer kernels (at which point `bitwise_bin_op_default` and `bitwise_bin_op_simd` will be made private again).
If interested, please also see this discussion with the maintainer of `packed_simd` for some background.
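As a rough sketch of what a generic scalar fallback could look like (the signature is assumed here, not taken from the actual Arrow source), a helper like `bitwise_bin_op_default` can be written once over a closure so that `&` and `|` share one implementation, addressing the duplication noted in review; LLVM will usually auto-vectorise the loop even without explicit SIMD.

```rust
// Assumed sketch of a generic scalar bitwise kernel over raw buffer bytes.
// The closure supplies the operator, so AND and OR share one loop.
fn bitwise_bin_op_default<F>(left: &[u8], right: &[u8], op: F) -> Vec<u8>
where
    F: Fn(u8, u8) -> u8,
{
    assert_eq!(left.len(), right.len(), "buffers must be the same length");
    left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect()
}

fn bitwise_and(left: &[u8], right: &[u8]) -> Vec<u8> {
    bitwise_bin_op_default(left, right, |l, r| l & r)
}

fn bitwise_or(left: &[u8], right: &[u8]) -> Vec<u8> {
    bitwise_bin_op_default(left, right, |l, r| l | r)
}
```

To opt in to wider registers at compile time, as the description suggests, the crate can be built or benchmarked with `RUSTFLAGS="-C target-feature=+avx2" cargo bench`.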