
Conversation

@vertexclique
Contributor

Implements bitwise AND on AVX-512.

AVX2
=============

buffer_bit_ops and      time:   [729.17 ns 729.31 ns 729.49 ns]                                
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

AVX512
==============
buffer_bit_ops and      time:   [332.39 ns 332.55 ns 332.71 ns]                               
                        change: [-54.427% -54.390% -54.355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
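
For reference, below is a minimal sketch of the kind of AVX-512 bitwise-AND kernel this PR adds. The function name and structure are illustrative rather than the PR's actual code, and at the time the AVX-512 intrinsics in std::arch still required a nightly toolchain:

#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
unsafe fn bitand_avx512(left: &[u8], right: &[u8], out: &mut [u8]) {
    use std::arch::x86_64::*;
    const LANES: usize = 64; // 512 bits / 8 bits per byte
    let chunks = left.len() / LANES;
    for i in 0..chunks {
        let off = i * LANES;
        // Unaligned loads/stores, since buffer data need not be 64-byte aligned.
        let a = _mm512_loadu_si512(left.as_ptr().add(off) as *const _);
        let b = _mm512_loadu_si512(right.as_ptr().add(off) as *const _);
        let r = _mm512_and_si512(a, b);
        _mm512_storeu_si512(out.as_mut_ptr().add(off) as *mut _, r);
    }
    // Scalar tail for bytes that don't fill a whole 512-bit register.
    for i in chunks * LANES..left.len() {
        out[i] = left[i] & right[i];
    }
}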

@vertexclique force-pushed the ARROW-10589-implement-avx-512-bitand branch from cd07e3b to 324bdce on November 15, 2020 05:20
@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Let's make the nightly update a separate PR, because there are a few other changes that we need to make for CI to pass. I'll work on that today.

Contributor

@nevi-me left a comment


LGTM, but we should revert the nightly changes in favour of #8666.

@vertexclique
Contributor Author

vertexclique commented Nov 15, 2020

I will do that in a minute. Edit: @nevi-me, updated.

Member

@jorgecarleitao left a comment


Is this being tested on the CI? I did not see any changes to the CI to test with that feature.

@vertexclique force-pushed the ARROW-10589-implement-avx-512-bitand branch from 324bdce to 06a7d64 on November 15, 2020 12:33
@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Is this being tested on the CI? I did not see any changes to the CI to test with that feature.

The main thing we have to be wary of (and I'm fine if we test locally and confirm) is whether anything here breaks the arrow crate building on stable. I haven't tested this yet.

Do the GHA machines support AVX-512? We might have to rely on those of us with AVX-512-capable Intel CPUs to check when new PRs that use AVX-512 are submitted.

@vertexclique
Contributor Author

The main thing we have to be wary of (and I'm fine if we test locally and confirm) is whether anything here breaks the arrow crate building on stable. I haven't tested this yet.

I tested this before pushing this PR, and that was my concern before implementing it. I built it with the latest stable and tested it before opening this one.

P.S. The linker failed on Windows because there was no disk space left.

Member

@jorgecarleitao left a comment


I think that this would be a great addition, but IMO we must add it to the CI.

If this is not under the CI, then every PR must be run locally to confirm that the PR is not breaking the code under this feature gate. This implies that developers will be unable to independently develop as they rely on someone being available to run this on their own computer before merging.

Alternatively, we risk having master with un-compilable code (on that feature gate).

IMO both options would be an anti-pattern and would set a bad precedent.

IMO we either support a feature and have it under CI, or we do not support it; if we can't find a machine to test this under CI, then we should not support it.

I have seen, more than once, code that compiles and runs with default features while the simd feature is broken, because that feature was not covered in the CI.

@vertexclique
Contributor Author

The simd feature is not tested on this CI either. Am I missing something?

Contributor

@alamb left a comment


In general I agree with @jorgecarleitao. I really appreciate all your work, @vertexclique, and the performance changes / improvements you have been proposing are really cool.

As Jorge mentions, the key concern I have is: how will we ensure someone doesn't accidentally break your great work in this PR in the future? The only real way I know to do so is via automated testing (aka CI), which is why adding support for some environment / target is about more than just the initial code to provide that support (which is also required); it is also about avoiding it getting broken in the long term.

@jorgecarleitao
Member

@vertexclique here. The timings are important: e.g. we decided not to run coverage on every PR because it was taking too long (it is under a cron job).

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

I'm running a CI job at https://github.com/nevi-me/arrow/runs/1402635214, which includes "simd avx512" features.

We do test "simd" in one of the CI tasks. Perhaps I didn't word my previous comment correctly.
If GH CI does support AVX512 (I might be going on outdated assumptions that not every Intel CPU supports AVX512), then we can add it to the tests in ci/scripts/rust_test.sh.

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

@vertexclique from my review of the code, it looked like simd and avx512 were able to coexist, but I'm getting some errors on the macOS CI (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1511).
Should I disable one for the other?

@jorgecarleitao @alamb CI fails because the AVX512 intrinsics aren't found (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1537), so my concern that we might be running CI on CPUs that don't support AVX-512 seems valid, which would make it a challenge to enable the feature.
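
As a side note, one way to confirm what a given CI runner's CPU actually supports is runtime feature detection; a small sketch, not part of this PR, and the set of features printed is just an example:

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Prints whether the current CPU exposes the relevant instruction sets.
        println!("avx2:    {}", is_x86_feature_detected!("avx2"));
        println!("avx512f: {}", is_x86_feature_detected!("avx512f"));
    }
}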

@vertexclique
Contributor Author

@vertexclique from my review of the code, it looked like simd and avx512 were able to coexist, but I'm getting some errors on the macOS CI (https://github.com/nevi-me/arrow/runs/1402635245#step:8:1511).
Should I disable one for the other?

Yes, they shouldn't coexist, since they are conflicting implementations; I would like to document that when everything settles down. Another option would be to make all of them mutually exclusive, but that is going to be problematic. So whoever enables avx512 should use the avx512 feature or fall back to autovectorization. There is no need to bring in packed_simd or something else to do the fallback.
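
To illustrate the arrangement described above, here is a sketch; the function names and exact gating are hypothetical, not this PR's code. The two features are kept mutually exclusive, and anything else falls back to a plain loop that LLVM is free to auto-vectorize:

// Refuse to build if both conflicting implementations are enabled at once.
#[cfg(all(feature = "simd", feature = "avx512"))]
compile_error!("`simd` and `avx512` are mutually exclusive; enable only one of them");

#[cfg(feature = "avx512")]
fn buffer_bitand(left: &[u8], right: &[u8], out: &mut [u8]) {
    // Hypothetical AVX-512 kernel, e.g. the bitand_avx512 sketch shown earlier.
    unsafe { bitand_avx512(left, right, out) }
}

#[cfg(not(feature = "avx512"))]
fn buffer_bitand(left: &[u8], right: &[u8], out: &mut [u8]) {
    // Autovectorized fallback: a straightforward byte-wise AND.
    for ((o, &l), &r) in out.iter_mut().zip(left.iter()).zip(right.iter()) {
        *o = l & r;
    }
}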

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

So whoever enables avx512 should use the avx512 feature or fall back to autovectorization. There is no need to bring in packed_simd or something else to do the fallback.

Okay, great. For tracking purposes, once we find a resolution to the CI question, we should open an umbrella JIRA to implement avx512 for the other simd equivalents. I'm assuming that's your plan?

I'm trying CI again at https://github.com/nevi-me/arrow/runs/1402675547 with simd and avx512 running separately:

# run unit tests with non-default features on
pushd arrow
cargo test --features "simd"
cargo test --features "avx512"
popd

@vertexclique
Contributor Author

Okay, great. For tracking purposes, once we find a resolution to the CI question, we should open an umbrella JIRA to implement avx512 for the other simd equivalents. I'm assuming that's your plan?

Yes, I am going to open PRs one by one on top of each other, opening a ticket under an umbrella issue for each one of them. Regarding the CI resolution, AWS offers AVX-512 machines; that could be a solution to the CI problem.

@nevi-me
Contributor

nevi-me commented Nov 15, 2020

Reran the tests, failure now confirmed: https://github.com/nevi-me/arrow/runs/1402675504#step:8:2084

The C++ implementation has avx512 support, so maybe @kszucs or someone else deeply familiar with our CI knows what a probable solution is. Otherwise this might be a matter for the mailing list.

My position is that I'm happy to still proceed with merging this, because it doesn't break CI for stable and nightly. Otherwise, we could open a separate branch to merge this into, to avoid @vertexclique having to pile up PRs on top of each other.
It gets very irritating after a while, from my experience with the parquet writer.
This separate branch could also temporarily house ARMv7, as we don't yet have CI for it.


@jorgecarleitao what's your opinion on us testing arrow separately on stable, to ensure that we don't regress? Something like:

# run unit tests, excluding arrow
cargo test --workspace --exclude arrow
# run unit tests on arrow separately
pushd arrow
# run arrow unit tests on stable
cargo +stable test
# run arrow unit tests with features, separate runs for the mutually exclusive ones
cargo test --features "simd"
cargo test --features "avx512"
popd

@andygrove
Member

I can see both sides of the argument here, but I would be supportive of merging this PR as long as we have a JIRA filed to follow up on the CI support (which I agree is really important). My understanding is that these changes will not break anything for users who are using Arrow without this new feature enabled, and that is the configuration that we are currently certifying in CI.

We have a similar situation already with DataFusion, where we have to run benchmarks locally before merging some PRs because we don't have those set up in CI yet, and we have seen performance regressions as a result, so I don't see this as being particularly different.

@jorgecarleitao
Member

Running the CI on a separate feature fits exactly what I was thinking about.

IMO we could then benefit from some clarity over how we merge PRs from here on: do we wait for someone with an AVX-512 machine to run the PR locally before merging? Or do we accept breaking that feature set? Note that any PR can break a feature (e.g. removing a use from the top of a module is often sufficient, as just happened to me recently on #8670).

@jhorstmann
Contributor

Nice performance improvement! I'm a bit surprised by that, since the packed_simd version also uses 512-bit wide types (u8x64), which should then generate AVX-512 instructions when targeting a machine that supports them.
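
For comparison, a sketch of what a packed_simd-based path with 512-bit wide types looks like; the function name is made up and this is not the crate's actual code:

use packed_simd::u8x64;

fn bitand_packed_simd(left: &[u8], right: &[u8], out: &mut [u8]) {
    const LANES: usize = 64;
    let chunks = left.len() / LANES;
    for i in 0..chunks {
        let off = i * LANES;
        let a = u8x64::from_slice_unaligned(&left[off..off + LANES]);
        let b = u8x64::from_slice_unaligned(&right[off..off + LANES]);
        // LLVM can lower this 512-bit wide AND to AVX-512 when the target CPU allows it.
        (a & b).write_to_slice_unaligned(&mut out[off..off + LANES]);
    }
    // Scalar tail.
    for i in chunks * LANES..left.len() {
        out[i] = left[i] & right[i];
    }
}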

@vertexclique
Contributor Author

@andygrove @nevi-me https://issues.apache.org/jira/browse/ARROW-10612 is the umbrella issue for AVX-512. It includes a CI support follow-up subtask. I will create a subtask for every operation that I implement.

Member

@jorgecarleitao left a comment


I am approving this as per discussion on the PR. 👍

Contributor

@alamb left a comment


I am convinced too

@alamb closed this in ca9783b Nov 16, 2020
@vertexclique deleted the ARROW-10589-implement-avx-512-bitand branch November 16, 2020 21:13
@wesm
Member

wesm commented Nov 17, 2020

I'm looking at contributing an AVX-512-capable machine to run occasional builds on Buildkite; I'd guess we're looking at a 2-3 month time frame for that, though. Note that anyone can hook up an AVX-512 machine to Buildkite and arrange for builds to be triggered on it.

@Dandandan
Contributor

Dandandan commented Jan 31, 2021

@vertexclique I was actually wondering whether AVX-512 is really faster than compiling the kernel with "simd" and the right rustflags, i.e. RUSTFLAGS='-C target-cpu=native'?
I don't have an AVX-512-enabled CPU to try this out, but I would have the same intuition as @jhorstmann that packed_simd should generate similar AVX-512 instructions.
It would be nice to revisit this at some point to see whether we need to maintain two different implementations, especially when it's going to be in std / stable.

@vertexclique
Contributor Author

For some operations, it will use wider registers and wider transfers with AVX-512, but not all algorithms are expressible using other SIMD sets. The main idea was, instead of generating ordinary instructions for both feature sets, to create fast operations on a specific collection of data. packed_simd and stdsimd don't generate the optimal ordering in the generated code, and neither do some intrinsics in the language's core. The useful intrinsics are bound to LLVM procedures, not compiler-optimized ones. The AVX-512 set is one of those.
