-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10589: [Rust] Implement AVX-512 bit and operation #8665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10589: [Rust] Implement AVX-512 bit and operation #8665
Conversation
cd07e3b to
324bdce
Compare
|
Let's make the nightly update a separate PR, because there's a few other changes that we need to make for CI to pass. I'll work on that today |
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but we should reverse the nightly changes in favour of #8666
|
I will do that in a min. edit: @nevi-me updated. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this being tested on the CI? I did not see any changes to the CI to test with that feature.
324bdce to
06a7d64
Compare
The main thing we have to be wary of, which I'm fine if we test locally and confirm; is whether anything here breaks the arrow crate building on stable. I haven't tested this yet. Do the GHA machines support avx512? We might have to rely on those with AVX512-capable Intels to check when new PRs that use AVX512 are submitted |
I tested this before pushing this pr and that was my concern before implementing this. I build it with the latest stable and tested too before opening this one. p.s. linker failed on windows no disk space left. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this would be a great addition, but IMO we must add it to the CI.
If this is not under the CI, then every PR must be run locally to confirm that the PR is not breaking the code under this feature gate. This implies that developers will be unable to independently develop as they rely on someone being available to run this on their own computer before merging.
Alternatively, we risk having master with un-compilable code (on that feature gate).
IMO both options would be an anti-pattern and would set a bad precedence.
IMO we either support a feature and we have it under CI, or we do not support it and if we can't find a machine to test this under CI, then IMO we should not support it.
I have seem more than once code compiling and running with default features and not in the CI because the feature SIMD broke.
|
feature |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I agree with @jorgecarleitao . I really appreciate all your work @vertexclique and the performance changes / improvements you have been proposing are really cool.
As Jorge mentions, the key concern I have is "how will we ensure someone doesn't break your great work in this PR accidentally in the future"? The only real way I know how to do so is via automated testing (aka CI), which is why adding support for some environment / target is about more than just the initial code to provide that support (which is also required), it is also about avoiding it getting broken in the long term
|
@vertexclique here. The timings are important: e.g. we decided to not place coverage on every PR because it was taking too long (it is under a cron job). |
|
I'm running a CI job at https://github.com/nevi-me/arrow/runs/1402635214, which includes We do test "simd" in one of the CI tasks. Perhaps I didn't word my previous comment correctly. |
|
@vertexclique from my reviewing the code, it looked like @jorgecarleitao @alamb CI fails because the AVX512 intrinsics aren't found (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1537), so my concern seems valid that we might be running CI on CPUs that don't support AVX512, and thus make it a challenge to enable the feature. |
Yes, they shouldn't coexist since they are conflicting implementations I would like to document that when everything settles down. Another option would make all of them mutually exclusive but that is going to be problematic. So the person who enables |
Okay, great. For tracking purposes, once we find a resolution to the CI question; we should open an umbrella JIRA to implement I'm trying CI again at https://github.com/nevi-me/arrow/runs/1402675547 with simd and avx512 running separately: # run unit tests with non-default features on
pushd arrow
cargo test --features "simd"
cargo test --features "avx512"
popd |
Yes, I am going to open PRs one by one on top of each other and by opening an umbrella ticket for each one of them. About CI resolution, AWS offers avx512 machines. That can be a solution to the CI problem. |
|
Reran the tests, failure now confirmed: https://github.com/nevi-me/arrow/runs/1402675504#step:8:2084 The C++ implementation has avx512 support, so maybe @kszucs or someone else deeply familiar with our CI knows what a probable solution is. Otherwise this might be a matter for the mailing list. My position is that I'm happy to still proceed with merging this, because this doesn't break CI for stable and nightly. Else, we could open a separate branch where we can merge this into, to avoid @vertexclique having to pile up PRs on top of each other. @jorgecarleitao what's your opinion on us testing # run unit tests, excluding arrow
cargo test --exclude arrow
# run unit tests on arrow separately
pushd arrow
# run arrow unit tests on stable
cargo +stable test
# run arrow unit tests with features, separate test for mutually exclusive oens
cargo test --features "simd"
cargo test --features "avx512"
popd |
|
I can see both sides of the argument here but I would be supportive of merging this PR as long as we have a JIRA filed to follow up on the CI support (which I agree is really important). My understanding is that these changes will not break anything for users who are using Arrow without this new features enabled, and that is the configuration that we are currently certifying in CI. We have a similar situation already with DataFusion, where we have to run benchmarks locally before merging some PRs because we don't have those set up in CI yet, and we have seen performance regressions as a result, so I don't see this as being particulary different. |
|
Running the CI on separate feature fits exactly what I was thinking about. IMO we could then benefit from some clarity over how we merge PRs from here on: do we wait for someone with a machine with avx512 to run the PR locally before merging? Or do we accept breaking that feature set? Note that any PR can break a feature (e.g. removing a |
|
Nice performance improvement! I'm a bit surprised by that since the packed_simd version also uses 512-bit wide types ( |
|
@andygrove @nevi-me https://issues.apache.org/jira/browse/ARROW-10612 Umbrella issue for AVX-512. Includes CI support follow up subtask. I will create a subtask for every operation that I will implement. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am approving this as per discussion on the PR. 👍
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am convinced too
|
I'm looking at contributing an AVX-512-capable machine to run occasional builds on Buildkite, I'd guess we're looking at 2-3 month time frame for that though. Note that anyone can hook up an AVX-512 machine to Buildkite and arrange for builds to be triggered on it |
|
@vertexclique I was actually wondering whether AVX-512 is really faster than compiling the kernel with "simd" and the right rustflags , i.e. |
|
For some operations, it will use wider registers and their transfers with avx512, but not all algorithms are expressable using other simd sets. Main idea was instead of creating ordinary instructions of both feature sets, creating fast operations on a specific collection of data. packed_simd or stdsimd doesn't generate the optimal ordering as a code—neither some intrinsics in the language's core. Useful intrinsics are bound to llvm procedures. Not compiler optimized ones. AVX512 set is one of those sets. |
Implements bit and on avx512.