
Conversation

@vertexclique
Contributor

Implements bitwise AND on AVX-512.

AVX2
=============

buffer_bit_ops and      time:   [729.17 ns 729.31 ns 729.49 ns]                                
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

AVX512
==============
buffer_bit_ops and      time:   [332.39 ns 332.55 ns 332.71 ns]                               
                        change: [-54.427% -54.390% -54.355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
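
For reference, below is a minimal sketch of the kind of AVX-512 bitwise-AND kernel this PR adds. The function name and structure are illustrative rather than the PR's actual code, and at the time the AVX-512 intrinsics in std::arch still required a nightly toolchain:

#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
unsafe fn bitand_avx512(left: &[u8], right: &[u8], out: &mut [u8]) {
    use std::arch::x86_64::*;
    const LANES: usize = 64; // 512 bits / 8 bits per byte
    let chunks = left.len() / LANES;
    for i in 0..chunks {
        let off = i * LANES;
        // Unaligned loads/stores, since buffer data need not be 64-byte aligned.
        let a = _mm512_loadu_si512(left.as_ptr().add(off) as *const _);
        let b = _mm512_loadu_si512(right.as_ptr().add(off) as *const _);
        let r = _mm512_and_si512(a, b);
        _mm512_storeu_si512(out.as_mut_ptr().add(off) as *mut _, r);
    }
    // Scalar tail for bytes that don't fill a whole 512-bit register.
    for i in chunks * LANES..left.len() {
        out[i] = left[i] & right[i];
    }
}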

@vertexclique force-pushed the ARROW-10589-implement-avx-512-bitand branch from cd07e3b to 324bdce on November 15, 2020 05:20
@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Let's make the nightly update a separate PR, because there are a few other changes that we need to make for CI to pass. I'll work on that today.

Contributor

@nevi-me left a comment


LGTM, but we should revert the nightly changes in favour of #8666.

@vertexclique
Contributor Author

vertexclique commented Nov 15, 2020

I will do that in a minute. Edit: @nevi-me, updated.

Member

@jorgecarleitao left a comment


Is this being tested on the CI? I did not see any changes to the CI to test with that feature.

@vertexclique force-pushed the ARROW-10589-implement-avx-512-bitand branch from 324bdce to 06a7d64 on November 15, 2020 12:33
@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Is this being tested on the CI? I did not see any changes to the CI to test with that feature.

The main thing we have to be wary of (and I'm fine if we test locally and confirm) is whether anything here breaks the arrow crate building on stable. I haven't tested this yet.

Do the GHA machines support AVX-512? We might have to rely on those of us with AVX-512-capable Intel CPUs to check when new PRs that use AVX-512 are submitted.

@vertexclique
Contributor Author

The main thing we have to be wary of (and I'm fine if we test locally and confirm) is whether anything here breaks the arrow crate building on stable. I haven't tested this yet.

I tested this before pushing this PR, and that was my concern before implementing it. I built it with the latest stable and tested it before opening this one.

P.S. The linker failed on Windows because there was no disk space left.

Member

@jorgecarleitao left a comment


I think that this would be a great addition, but IMO we must add it to the CI.

If this is not under the CI, then every PR must be run locally to confirm that the PR is not breaking the code under this feature gate. This implies that developers will be unable to independently develop as they rely on someone being available to run this on their own computer before merging.

Alternatively, we risk having master with un-compilable code (on that feature gate).

IMO both options would be an anti-pattern and would set a bad precedent.

IMO we either support a feature and have it under CI, or we do not support it; if we can't find a machine to test this under CI, then we should not support it.

I have seen, more than once, code that compiles and runs with default features while the simd feature is broken, because that feature was not covered in the CI.

@vertexclique
Contributor Author

The simd feature is not tested on this CI either. Am I missing something?

Contributor

@alamb left a comment


In general I agree with @jorgecarleitao. I really appreciate all your work, @vertexclique, and the performance changes / improvements you have been proposing are really cool.

As Jorge mentions, the key concern I have is: how will we ensure someone doesn't accidentally break your great work in this PR in the future? The only real way I know to do so is via automated testing (aka CI), which is why adding support for some environment / target is about more than just the initial code to provide that support (which is also required); it is also about avoiding it getting broken in the long term.

@jorgecarleitao
Member

@vertexclique here. The timings are important: e.g. we decided not to run coverage on every PR because it was taking too long (it is under a cron job).

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

I'm running a CI job at https://github.com/nevi-me/arrow/runs/1402635214, which includes "simd avx512" features.

We do test "simd" in one of the CI tasks. Perhaps I didn't word my previous comment correctly.
If GH CI does support AVX512 (I might be going on outdated assumptions that not every Intel CPU supports AVX512), then we can add it to the tests in ci/scripts/rust_test.sh.

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

@vertexclique from my review of the code, it looked like simd and avx512 were able to coexist, but I'm getting some errors on the macOS CI (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1511).
Should I disable one for the other?

@jorgecarleitao @alamb CI fails because the AVX512 intrinsics aren't found (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1537), so my concern that we might be running CI on CPUs that don't support AVX-512 seems valid, which would make it a challenge to enable the feature.
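
As a side note, one way to confirm what a given CI runner's CPU actually supports is runtime feature detection; a small sketch, not part of this PR, and the set of features printed is just an example:

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Prints whether the current CPU exposes the relevant instruction sets.
        println!("avx2:    {}", is_x86_feature_detected!("avx2"));
        println!("avx512f: {}", is_x86_feature_detected!("avx512f"));
    }
}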

@vertexclique
Contributor Author

@vertexclique from my review of the code, it looked like simd and avx512 were able to coexist, but I'm getting some errors on the macOS CI (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1511).
Should I disable one for the other?

Yes, they shouldn't coexist, since they are conflicting implementations; I would like to document that when everything settles down. Another option would be to make all of them mutually exclusive, but that is going to be problematic. So whoever enables avx512 should use the avx512 feature or fall back to autovectorization. There is no need to bring in packed_simd or something else to do the fallback.
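
To illustrate the arrangement described above, here is a sketch; the function names and exact gating are hypothetical, not this PR's code. The two features are kept mutually exclusive, and anything else falls back to a plain loop that LLVM is free to auto-vectorize:

// Refuse to build if both conflicting implementations are enabled at once.
#[cfg(all(feature = "simd", feature = "avx512"))]
compile_error!("`simd` and `avx512` are mutually exclusive; enable only one of them");

#[cfg(feature = "avx512")]
fn buffer_bitand(left: &[u8], right: &[u8], out: &mut [u8]) {
    // Hypothetical AVX-512 kernel, e.g. the bitand_avx512 sketch shown earlier.
    unsafe { bitand_avx512(left, right, out) }
}

#[cfg(not(feature = "avx512"))]
fn buffer_bitand(left: &[u8], right: &[u8], out: &mut [u8]) {
    // Autovectorized fallback: a straightforward byte-wise AND.
    for ((o, &l), &r) in out.iter_mut().zip(left.iter()).zip(right.iter()) {
        *o = l & r;
    }
}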

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

So whoever enables avx512 should use the avx512 feature or fall back to autovectorization. There is no need to bring in packed_simd or something else to do the fallback.

Okay, great. For tracking purposes, once we find a resolution to the CI question, we should open an umbrella JIRA to implement avx512 for the other simd equivalents. I'm assuming that's your plan?

I'm trying CI again at https://github.com/nevi-me/arrow/runs/1402675547 with simd and avx512 running separately:

# run unit tests with non-default features on
pushd arrow
cargo test --features "simd"
cargo test --features "avx512"
popd

@vertexclique
Contributor Author

Okay, great. For tracking purposes, once we find a resolution to the CI question, we should open an umbrella JIRA to implement avx512 for the other simd equivalents. I'm assuming that's your plan?

Yes, I am going to open PRs one by one on top of each other, opening a ticket under an umbrella issue for each one of them. Regarding the CI resolution, AWS offers AVX-512 machines; that could be a solution to the CI problem.

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Reran the tests, failure now confirmed: https://github.com/nevi-me/arrow/runs/1402675504#step:8:2084

The C++ implementation has avx512 support, so maybe @kszucs or someone else deeply familiar with our CI knows what a probable solution is. Otherwise this might be a matter for the mailing list.

My position is that I'm happy to still proceed with merging this, because it doesn't break CI for stable and nightly. Otherwise, we could open a separate branch to merge this into, to avoid @vertexclique having to pile up PRs on top of each other.
It gets very irritating after a while, from my experience with the parquet writer.
This separate branch could also temporarily house ARMv7, as we don't yet have CI for it.


@jorgecarleitao what's your opinion on us testing arrow separately on stable, to ensure that we don't regress? Something like:

# run unit tests, excluding arrow
cargo test --workspace --exclude arrow
# run unit tests on arrow separately
pushd arrow
# run arrow unit tests on stable
cargo +stable test
# run arrow unit tests with features, separate runs for the mutually exclusive ones
cargo test --features "simd"
cargo test --features "avx512"
popd

@andygrove
Member

I can see both sides of the argument here, but I would be supportive of merging this PR as long as we have a JIRA filed to follow up on the CI support (which I agree is really important). My understanding is that these changes will not break anything for users who are using Arrow without this new feature enabled, and that is the configuration that we are currently certifying in CI.

We have a similar situation already with DataFusion, where we have to run benchmarks locally before merging some PRs because we don't have those set up in CI yet, and we have seen performance regressions as a result, so I don't see this as being particularly different.

@jorgecarleitao
Member

Running the CI on a separate feature fits exactly what I was thinking about.

IMO we could then benefit from some clarity over how we merge PRs from here on: do we wait for someone with an AVX-512 machine to run the PR locally before merging? Or do we accept breaking that feature set? Note that any PR can break a feature (e.g. removing a use from the top of a module is often sufficient, as just happened to me recently on #8670).

@jhorstmann
Contributor

Nice performance improvement! I'm a bit surprised by that, since the packed_simd version also uses 512-bit wide types (u8x64), which should then generate AVX-512 instructions when targeting a machine that supports them.
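
For comparison, a sketch of what a packed_simd-based path with 512-bit wide types looks like; the function name is made up and this is not the crate's actual code:

use packed_simd::u8x64;

fn bitand_packed_simd(left: &[u8], right: &[u8], out: &mut [u8]) {
    const LANES: usize = 64;
    let chunks = left.len() / LANES;
    for i in 0..chunks {
        let off = i * LANES;
        let a = u8x64::from_slice_unaligned(&left[off..off + LANES]);
        let b = u8x64::from_slice_unaligned(&right[off..off + LANES]);
        // LLVM can lower this 512-bit wide AND to AVX-512 when the target CPU allows it.
        (a & b).write_to_slice_unaligned(&mut out[off..off + LANES]);
    }
    // Scalar tail.
    for i in chunks * LANES..left.len() {
        out[i] = left[i] & right[i];
    }
}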

@vertexclique
Contributor Author

@andygrove @nevi-me https://issues.apache.org/jira/browse/ARROW-10612 is the umbrella issue for AVX-512. It includes a CI support follow-up subtask. I will create a subtask for every operation that I implement.

Member

@jorgecarleitao left a comment


I am approving this as per discussion on the PR. 👍

Contributor

@alamb left a comment


I am convinced too

@alamb closed this in ca9783b Nov 16, 2020
@vertexclique deleted the ARROW-10589-implement-avx-512-bitand branch November 16, 2020 21:13
@wesm
Member

wesm commented Nov 17, 2020

I'm looking at contributing an AVX-512-capable machine to run occasional builds on Buildkite; I'd guess we're looking at a 2-3 month time frame for that, though. Note that anyone can hook up an AVX-512 machine to Buildkite and arrange for builds to be triggered on it.

@Dandandan
Contributor

Dandandan commented Jan 31, 2021

@vertexclique I was actually wondering whether AVX-512 is really faster than compiling the kernel with "simd" and the right rustflags, i.e. RUSTFLAGS='-C target-cpu=native'?
I don't have an AVX-512-enabled CPU to try this out, but I would have the same intuition as @jhorstmann that packed_simd should generate similar AVX-512 instructions.
It would be nice to revisit this at some point to see whether we need to maintain two different implementations, especially when it's going to be in std / stable.

@vertexclique
Contributor Author

For some operations, it will use wider registers and wider transfers with AVX-512, but not all algorithms are expressible using other SIMD sets. The main idea was, instead of generating ordinary instructions for both feature sets, to create fast operations on a specific collection of data. packed_simd and stdsimd don't generate the optimal ordering in the generated code, and neither do some intrinsics in the language's core. The useful intrinsics are bound to LLVM procedures, not compiler-optimized ones. The AVX-512 set is one of those.
