Conversation

@paddyhoran
Contributor

This PR lays the groundwork for future PRs adding explicit SIMD to what was called "array_ops". My plan is to migrate "ops" into the compute sub-module as I add explicit SIMD.

This PR includes the following:

  • introduces the packed_simd crate
  • introduces the compute sub-module
  • adds bitwise_and and bitwise_or functions for Buffers

I'm adding these first as they are needed when updating the bitmap in all array kernels. The functions above use compile-time SIMD, i.e. you have to build with RUSTFLAGS="-C target-feature=+avx2" or similar. However, even without this you should see a speed-up, as the compiler will still emit some SIMD instructions by default (SSE on most modern x86 processors) even if you do not explicitly ask for wider ones (in my case the computation is completely memory bound, so using larger registers would not speed it up further).
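For context, the buffer-level kernels boil down to something like the sketch below. This is illustrative only, not the PR's actual code: the function name `simd_and` and the slice-based signature are placeholders, and the real functions operate on Buffers.

```rust
// Illustrative sketch (placeholder names, not the PR's code): a
// compile-time-SIMD AND over two equal-length byte slices using
// packed_simd's u8x64. Building with RUSTFLAGS="-C target-feature=+avx2"
// (or wider) lets the compiler emit wider instructions for the same source.
use packed_simd::u8x64;

fn simd_and(left: &[u8], right: &[u8], out: &mut [u8]) {
    let lanes = u8x64::lanes(); // 64 bytes per vector
    let chunks = left.len() / lanes;
    for i in 0..chunks {
        let offset = i * lanes;
        let l = u8x64::from_slice_unaligned(&left[offset..]);
        let r = u8x64::from_slice_unaligned(&right[offset..]);
        (l & r).write_to_slice_unaligned(&mut out[offset..]);
    }
    // Scalar tail for lengths that are not a multiple of 64 bytes.
    for i in chunks * lanes..left.len() {
        out[i] = left[i] & right[i];
    }
}
```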

I have included a benchmark to illustrate the speed-up. However, as the compute sub-module evolves, I expect the benchmarks will be for array kernels rather than buffer kernels (at which point bitwise_bin_op_default and bitwise_bin_op_simd will be made private again).

If interested, please also see this discussion with the maintainer of packed_simd for some background.

@kszucs
Member

kszucs commented Feb 6, 2019

@paddyhoran it looks great! I'm glad that you've started to implement the compute module with SIMD!

However, providing operator kernels for buffers seems a bit odd to me. How about implementing core::ops::BitAnd and core::ops::BitOr on arrow::bitmap::Bitmap instead?

@paddyhoran
Contributor Author

How about implementing core::ops::BitAnd and core::ops::BitOr on arrow::bitmap::Bitmap instead?

Yeah, I started to do this but decided not to. I plan for these functions to evolve over time as I add the rest of the kernels. One situation that came to mind: when implementing kernels on arrays, I will likely want to process the values buffer and the bitmap buffer in the same loop, which would be made difficult by using BitAnd and BitOr.

As SIMD is being introduced, I wanted to submit this PR while it's smaller in scope and isolated to buffers, so that people could comment on the SIMD approach. Most of what will be in the "boolean_kernels" file will be kernels for arrays, and these functions may just become "helper" functions.

In short, I do plan to add BitAnd/BitOr but didn't want to back myself into a corner. I plan to have all this cleaned up for the 0.13.0 release.
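For illustration, implementing the operator trait directly (the direction the PR ends up taking, per the later retitle to BitAnd/BitOr for &Buffer) might look roughly like this from inside the arrow crate. This is a hedged sketch: the Buffer accessors and the panic-on-length-mismatch are assumptions for brevity, not the PR's actual code.

```rust
// Hedged sketch only: how a BitAnd impl for &Buffer could look from inside
// the arrow crate (coherence rules mean it has to live next to Buffer).
// Buffer::data() / Buffer::from(Vec<u8>) and the assert are assumptions;
// the scalar zip below is where the SIMD path would plug in.
use std::ops::BitAnd;

impl<'a, 'b> BitAnd<&'b Buffer> for &'a Buffer {
    type Output = Buffer;

    fn bitand(self, rhs: &'b Buffer) -> Buffer {
        assert_eq!(self.len(), rhs.len(), "buffers must have equal length");
        let bytes: Vec<u8> = self
            .data()
            .iter()
            .zip(rhs.data().iter())
            .map(|(l, r)| l & r)
            .collect();
        Buffer::from(bytes)
    }
}
```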

Contributor


I just want to understand the approach here. Do we do the following:

SIMD_AND(left data, right data) -> result1 as bitmask
SIMD_AND(left validity, right validity) -> result2 as bitmask

The first SIMD_AND would have performed the operation on garbage data as well (where the validity bit is 0), but that is fine as we avoid a branch and finally use the validity to get the end result:

SIMD_AND(result1, result2)

I am not sure exactly how we use validity in the code below.

Contributor Author


Yes, that's correct.

Our BooleanArray (bit-packed) and Bitmap are both backed by a Buffer. For other kernels (ADD, etc.) we will reuse the SIMD_AND on the Bitmap.
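A rough sketch of that reuse, assuming the &Buffer BitAnd sketched earlier in the thread (the helper name and the Option-based null-bitmap handling are illustrative, not the crate's API):

```rust
// Hedged sketch: a binary array kernel (e.g. ADD) reusing the buffer-level
// AND to combine the two validity bitmaps. Assumes Buffer is cheaply
// cloneable and that `l & r` is the BitAnd sketched earlier in the thread.
fn combine_validity(left: Option<&Buffer>, right: Option<&Buffer>) -> Option<Buffer> {
    match (left, right) {
        // Neither side has nulls, so the result needs no null bitmap.
        (None, None) => None,
        // Only one side has nulls; its bitmap carries over unchanged.
        (Some(l), None) => Some(l.clone()),
        (None, Some(r)) => Some(r.clone()),
        // Both sides have nulls: a slot is valid only if both inputs are.
        (Some(l), Some(r)) => Some(l & r),
    }
}
```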

Contributor


Thanks.

@siddharthteotia
Contributor

@paddyhoran, this is great.

Are we adding SIMD kernels in a generic manner? In other words, can we leverage the SIMD acceleration done here for similar operations on Arrow vectors in other languages (C++, Java, etc.)?

@paddyhoran
Contributor Author

Are we adding SIMD kernels in a generic manner? In other words, can we leverage the SIMD acceleration done here for similar operations on Arrow vectors in other languages (C++, Java, etc.)?

This is one of the great value propositions of Arrow: writing computation in one language and re-using it across many. Right now, I plan to update all our compute in Rust to use SIMD by the 0.13.0 release. Once that's up and running, we could talk about how to expose it to other implementations.

I really want to be able to write high-performance code in Rust and expose it to pyarrow, but I have not had a chance to look at it yet...

@siddharthteotia
Contributor

LGTM - This can be merged if you are not planning to add more changes.

Another question, though: from my prior experience with SIMD (in C/C++ land), I leveraged Intel compiler intrinsics (platform-dependent) to work with the underlying SIMD instructions. How are these instructions available in Rust?

Secondly, I see that in the code we check whether the target architecture (x86/x86_64) has support for SIMD. I am not sure where we are checking whether to use SSE, AVX2, or AVX-512?

@paddyhoran
Contributor Author

Actually, I do plan to refactor this a little tonight; @kszucs got me thinking...

The intrinsics are in stdsimd in the Rust standard library. There is no runtime detection in this PR; you have to choose at compile time if you want to use AVX-512, for example. However, packed_simd is quite nice in that the u8x64 type is portable: on systems without 512-bit registers it will use two 256-bit registers or similar.

We can look at adding runtime detection of instructions, but I plan to add this later behind a feature flag, as it's not always what you want for maximum performance. For instance, right now if I don't ask for any specific instructions it defaults to SSE. I have AVX2 available on my dev machine, but there's no difference as the computation is completely memory bound.

As others will likely be building higher-level libraries on Arrow, I thought it best to give developers all the options (i.e. static detection plus optional runtime detection behind a feature flag).
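To make that plan concrete, optional runtime detection could eventually look something like the sketch below. This is not part of this PR, and the kernel names are placeholders; only the standard-library `is_x86_feature_detected!` / `#[target_feature]` pattern is real.

```rust
// Hedged sketch (placeholder names): dispatch to an AVX2-specialized
// function when the CPU supports it, falling back to the default otherwise.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn bitwise_and_dispatch(left: &[u8], right: &[u8], out: &mut [u8]) {
    if is_x86_feature_detected!("avx2") {
        // Safe to call because we just verified AVX2 is available.
        unsafe { bitwise_and_avx2(left, right, out) }
    } else {
        bitwise_and_default(left, right, out);
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn bitwise_and_avx2(left: &[u8], right: &[u8], out: &mut [u8]) {
    // With AVX2 enabled for this function, the compiler can vectorize
    // this loop with 256-bit instructions.
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l & r;
    }
}

fn bitwise_and_default(left: &[u8], right: &[u8], out: &mut [u8]) {
    // Portable fallback compiled with the crate's default target features.
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l & r;
    }
}
```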

@paddyhoran
Contributor Author

@kszucs @siddharthteotia @andygrove @sunchao could you please give this a review when you have a chance.

@kszucs I re-worked the code to use traits instead while extracting what I could for re-use in other kernel implementations (where I will want to operate on the values buffer and bitmap buffer in the same loop).

I removed the compute sub module (which I will re-introduce as part of ARROW-4490).

I left the benchmark in place, but renamed it, to illustrate the speed-up; it will be replaced by benchmarks on the array kernels in time.

@paddyhoran paddyhoran changed the title from "ARROW-4468: [Rust] Implement AND/OR kernels for Buffer (with SIMD)" to "ARROW-4468: [Rust] Implement BitAnd/BitOr for &Buffer (with SIMD)" on Feb 8, 2019
Contributor Author

@paddyhoran paddyhoran Feb 8, 2019


I should have made this generic over the operators (& and |) to clean this up :( I'll follow up when I get a chance... Please ignore this duplication of code when reviewing.
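For reference, a hedged sketch of that follow-up: one helper generic over the lane-wise operator so AND and OR share a body. The name bitwise_bin_op_simd comes from the PR description, but this generic signature is hypothetical, and for brevity it assumes 64-byte-padded inputs rather than handling a scalar tail.

```rust
// Hedged sketch (hypothetical signature): make the SIMD helper generic over
// the vector operator so the & and | kernels stop duplicating the loop.
use packed_simd::u8x64;

fn bitwise_bin_op_simd<F>(left: &[u8], right: &[u8], out: &mut [u8], op: F)
where
    F: Fn(u8x64, u8x64) -> u8x64,
{
    // Assumes inputs padded to a multiple of 64 bytes; a real version would
    // also handle the remainder.
    assert_eq!(left.len() % u8x64::lanes(), 0);
    for offset in (0..left.len()).step_by(u8x64::lanes()) {
        let l = u8x64::from_slice_unaligned(&left[offset..]);
        let r = u8x64::from_slice_unaligned(&right[offset..]);
        op(l, r).write_to_slice_unaligned(&mut out[offset..]);
    }
}

// AND and OR then become thin wrappers over the same helper.
fn simd_and(l: &[u8], r: &[u8], out: &mut [u8]) {
    bitwise_bin_op_simd(l, r, out, |a, b| a & b);
}

fn simd_or(l: &[u8], r: &[u8], out: &mut [u8]) {
    bitwise_bin_op_simd(l, r, out, |a, b| a | b);
}
```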

Contributor


I wonder: where are the similar functions for bitwise OR?

Contributor Author


Eventually I plan to benchmark the array kernels, not the buffer-level operations. The current benchmark was just to demonstrate that the PR does in fact speed things up, so I only added the AND version, almost as an example. I'll add the OR version as well.
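The OR version would presumably just mirror the AND one, along these lines. This is a hedged sketch: the bench name, buffer sizes, and the use of criterion are assumptions about the existing harness, not the PR's literal bench code.

```rust
// Hedged sketch of an OR benchmark mirroring the AND one; names, sizes, and
// the criterion harness are assumptions, not the PR's actual bench code.
use arrow::buffer::Buffer;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_buffer_or(c: &mut Criterion) {
    // Two 4096-byte buffers with fixed bit patterns as benchmark inputs.
    let left = Buffer::from(vec![0b01010101u8; 4096]);
    let right = Buffer::from(vec![0b00110011u8; 4096]);
    c.bench_function("bitwise OR over 4096-byte buffers", move |b| {
        b.iter(|| &left | &right)
    });
}

criterion_group!(benches, bench_buffer_or);
criterion_main!(benches);
```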

Contributor


Got it.

@xhochy xhochy force-pushed the boolean-buffer-kernels branch from db732c1 to a8534b1 on February 8, 2019 21:30
@xhochy
Member

xhochy commented Feb 8, 2019

Rebased.

@siddharthteotia
Contributor

LGTM. @paddyhoran, are you planning to add more changes, or is this good to go?

@paddyhoran
Contributor Author

This is good to go IMHO. I'd like to get it merged soon, as I'm almost ready to submit another PR that builds on it.

@siddharthteotia siddharthteotia merged commit 3ceef46 into apache:master Feb 12, 2019
@siddharthteotia
Contributor

Merged. Thanks @paddyhoran

@paddyhoran paddyhoran deleted the boolean-buffer-kernels branch February 13, 2019 20:33