Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement avx512 masked load and store intrinsics #1254

Merged

Conversation

jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Nov 15, 2021

Implement avx512 masked load and store intrinsics using inline assembly.

  • aligned/unaligned masked/zero-masked loads
  • aligned/unaligned stores
  • avx512vl (_mm256 and _mm) variants of the above
  • avx512bw (byte and word) variants of the above
  • formatting
  • tests for all of the above
  • updated avx512f.md and avx512bw.md

The same approach also works for masked gather/scatter and compress/expand intrinsics. Probably makes sense to split these into their own PR.

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Amanieu (or someone else) soon.

Please see the contribution instructions for more information.

@Amanieu
Copy link
Member

Amanieu commented Nov 16, 2021

LGTM!

Just a small style nit: please indent the contents of the asm! macro, like rustfmt does for function calls.

@Amanieu
Copy link
Member

Amanieu commented Nov 20, 2021

I believe in the past we avoided defining functions inside macros because it interacts poorly with our intrinsic checking tools.

@jhorstmann
Copy link
Contributor Author

I believe in the past we avoided defining functions inside macros because it interacts poorly with our intrinsic checking tools.

Was just about to update the description to mention this. I saw the assert_instr tests are running, but are probably not very useful since the same parameter gets used for the assertion as in the inline asm. Tests for all functions are still on my todo list.

I think the macro approach is worth it since it reduces code for the load intrinsics by about 30x and reduces chances of copy-paste mistakes.

@Amanieu
Copy link
Member

Amanieu commented Nov 20, 2021

Specifically the stdarch_verify crate will parse every .rs file to find the names and signatures of all intrinsic functions. This parsing does not perform macro expansion (that would require running the full rust compiler).

I think it would be better to avoid using macros for now. The ARM code avoids this issue by using a code generator, but it is probably not worth the effort in this case since AVX512 is mostly complete already.

@Amanieu
Copy link
Member

Amanieu commented Nov 21, 2021

Could you also mark the intrinsics as implemented in crates/core_arch/avx512f.md. We should be able to start stabilizing avx512 once it is complete.

@jhorstmann
Copy link
Contributor Author

Should be ready for review now. The github diff view looks confusing, individual commits might be clearer.

I ended up using a "poor man's" code generator by expanding the macros from the earlier commit and postprocessing the output with some small regular expressions. It's a bit manual and probably not worth checking in. More time was spent in writing all the tests.

The avx512vl functions required adding avx to target_feature in order to use ymm registers. It seems the CI run for i586 also required the sse feature to use xmm registers. I could not reproduce that locally. Would be nice if avx512vl already allowed using those registers.

@jhorstmann jhorstmann marked this pull request as ready for review November 28, 2021 23:03
@Amanieu
Copy link
Member

Amanieu commented Dec 1, 2021

LGTM! I'm just waiting on rust-lang/rust#91381 which is causing the Android CI to fail.

@Amanieu Amanieu merged commit 59df818 into rust-lang:master Dec 4, 2021
@luojia65
Copy link
Contributor

luojia65 commented Dec 9, 2021

Did this commit break rollup merge? :)
Ref: rust-lang/rust#91548 (comment)

let mut dst: __m512i = src;
asm!(
"vmovdqu32 {2}{{{1}}}, [{0}]",
in(reg) mem_addr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiling std v0.0.0 (/checkout/library/std)
error: formatting may not be suitable for sub-register argument
     --> library/core/src/../../stdarch/crates/core_arch/src/x86/avx512f.rs:30336:34
      |
30336 |          "vmovdqu32 {2}{{{1}}}, [{0}]",
      |                                  ^^^
30337 |          in(reg) mem_addr,
      |                  -------- for this argument
      |
      = note: `-D asm-sub-register` implied by `-D warnings`
      = help: use the `e` modifier to have the register formatted as `eax`
      = help: or use the `r` modifier to keep the default formatting of `rax`

(from rollup CI Result)

@Amanieu
Copy link
Member

Amanieu commented Dec 9, 2021

Yes. The issue is that x32 (x86_64 with 32-bit pointers) the address operand is inserted into the asm as rax instead of eax. The fix is to use the :e modifier on x86 and x32 (but not x86_64). Have a look at bt.rs for a similar case.

@jhorstmann
Copy link
Contributor Author

Oh, my bad. I'll keep this in mind when I start working on remaining intrinsics.

@luojia65
Copy link
Contributor

luojia65 commented Dec 9, 2021

@jhorstmann I submitted fix at: #1264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants