Miscompilation under target-cpu >= haswell #63791
Comments
It would help tremendously if you could reduce that example further. |
You can use C-Reduce to achieve that. |
I'll look into using bugpoint and -opt-bisect-limit to narrow this down. Any other suggestions welcome. @mati865 Thanks for the suggestion, I didn't know C-Reduce worked for Rust! My suspicion is in llvm's optimization passes, i.e. LLVM IR -> assembly, so I'll give llvm's bugpoint a go first. |
It works for pretty much any language. It just does changes like removing a line or changing it a bit, so it often generates invalid C as well. Only a few passes assume C, but those will just not reduce the test case for Rust. See https://embed.cs.utah.edu/creduce/pldi12_talk.pdf |
Unfortunately, C-Reduce proved to be too clever when I ran it. It replaced:
if output != encrypted {
    std::process::abort();
}
with:
std::process::abort();
While this is technically a 'minimization', it's not a very useful one. |
Not sure if it helps, but |
The problem appears to be related to the The earliest point at which this occurs appears to be the call Unfortunately, the problem disappeared when I made a standalone project with |
I decided to give it a try and managed to simplify the code:

fn main() {
    let x = [0, 0, 0, 0, 0, 0, 0, 1];
    let x = unslice(x).0;
    assert_eq!(x, 1 << 31);
}

#[inline(never)] // can't reproduce without inline(never)
fn unslice(x: [u16; 8]) -> (u32, u32, u32) {
    let a = deconstruct(x, 0);
    let b = deconstruct(x, 1); // needs two calls with different parameter
    (a, b, 0) // can't reproduce without the third value in the tuple
}

#[inline(never)] // can't reproduce without inline(never)
fn deconstruct(x: [u16; 8], bit: u32) -> u32 {
    // this part needs to get vectorized
    pb(x[0], bit + 1, 16) | pb(x[1], bit + 1, 17)
    | pb(x[2], bit + 1, 18) | pb(x[3], bit + 1, 19)
    | pb(x[4], bit + 1, 20) | pb(x[5], bit + 1, 21)
    | pb(x[6], bit + 1, 22) | pb(x[7], bit + 1, 23)
    | pb(x[0], bit, 24) | pb(x[1], bit, 25)
    | pb(x[2], bit, 26) | pb(x[3], bit, 27)
    | pb(x[4], bit, 28) | pb(x[5], bit, 29)
    | pb(x[6], bit, 30) | pb(x[7], bit, 31)
}

fn pb(x: u16, bit: u32, shift: u32) -> u32 {
    u32::from((x >> bit) & 1) << shift // can't reproduce with different order (e.g. mask then shift)
}

Assembly output from 1.33.0 and 1.34.0 is identical, except for this block of AVX instructions:

1.33.0
vpunpckhwd ymm4, ymm3, ymm1
vpunpckhwd ymm0, ymm0, ymm3
vpsrlvd ymm0, ymm4, ymm0
vpsrld ymm0, ymm0, 16
vpunpcklwd ymm1, ymm3, ymm1
vpsrld ymm1, ymm1, 1
vpsrld ymm1, ymm1, 16
vpackusdw ymm0, ymm1, ymm0
1.34.0
vpunpckhwd ymm1, ymm3, ymm1
vpunpckhwd ymm0, ymm0, ymm3
vpsrlvd ymm0, ymm1, ymm0

Looks like some of them got incorrectly optimized away. I'm not familiar with how LLVM vectorizes and optimizes things, so sadly this is as far as I can get on my own. Edit: Just to confirm, running with |
I managed to produce a smaller but slightly different reduction. I'm not sure if it also triggers the exact same issue reported here, but a similar asm diff is observed, see the diff in https://gcc.godbolt.org/z/KZVz_O:

pub fn deconstruct(x: [u16; 8], bit: u32) -> u32 {
    fn pb(x: u16, bit: u32, shift: u32) -> u32 {
        // can't reproduce with different order (e.g. mask then shift)
        u32::from((x >> bit) & 1) << shift
    }
    // this part needs to get vectorized
    pb(x[0], bit + 1, 16) | pb(x[1], bit + 1, 17)
    | pb(x[2], bit + 1, 18) | pb(x[3], bit + 1, 19)
    | pb(x[4], bit + 1, 20) | pb(x[5], bit + 1, 21)
    | pb(x[6], bit + 1, 22) | pb(x[7], bit + 1, 23)
    | pb(x[0], bit, 24) | pb(x[1], bit, 25)
    | pb(x[2], bit, 26) | pb(x[3], bit, 27)
    | pb(x[4], bit, 28) | pb(x[5], bit, 29)
    | pb(x[6], bit, 30) | pb(x[7], bit, 31)
} |
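For readers checking the expected value by hand, here is a minimal loop-based sketch of what the reduced `deconstruct` computes. The name `deconstruct_ref` is hypothetical and not part of the thread. It assumes `bit + 1 <= 15` so the `u16` shifts stay in range. The compiler may vectorize this form differently, so it is only a reference for the arithmetic and not a reproduction of the bug.

```rust
/// Hypothetical loop-based reference for the reduced `deconstruct`.
/// For each word i: bit `bit + 1` goes to position 16 + i,
/// and bit `bit` goes to position 24 + i.
fn deconstruct_ref(x: [u16; 8], bit: u32) -> u32 {
    let mut out = 0u32;
    for (i, &word) in x.iter().enumerate() {
        let i = i as u32;
        out |= u32::from((word >> (bit + 1)) & 1) << (16 + i);
        out |= u32::from((word >> bit) & 1) << (24 + i);
    }
    out
}

fn main() {
    let x = [0, 0, 0, 0, 0, 0, 0, 1];
    // Only bit 0 of x[7] is set, so only position 24 + 7 = 31 is set.
    assert_eq!(deconstruct_ref(x, 0), 1 << 31);
    // With bit = 1, neither bit 1 nor bit 2 of any word is set.
    assert_eq!(deconstruct_ref(x, 1), 0);
    println!("ok");
}
```

This is why the first reduction asserts `1 << 31`: a correct build can only set bit 31 for that input.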
Check-in from the compiler triage meeting: Summarizing the previous comments, it seems we have minimized the example and bisected to a commit that upgrades LLVM. This looks likely to be an LLVM bug. I'm going to cc @nikic and @nagisa in the hopes that one of them can help in pushing this to resolution. I'm marking it as P-high. |
The IR for the functions is the same, so this is an issue in the backend, not the SLP vectorizer. It looks like LLVM correctly determines that one of the vpsrld's is dead based on demanded elements and replaces the packus with a pshufb. A little later we see:
Which doesn't look right... pshuflw -28 is a no-op, but the previous pshufb is not. This combine is performed by matchUnaryPermuteShuffle(), but I haven't looked further than that yet. |
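To see why an immediate of -28 makes pshuflw a no-op, it helps to decode it: -28 as an unsigned byte is 0xE4 = 0b11_10_01_00, and each 2-bit field selects the source word for one of the four low destination words. The sketch below models a single 128-bit lane. The helper names `decode_imm8` and `pshuflw` are illustrative and are not LLVM functions.

```rust
/// Split an 8-bit shuffle immediate into four 2-bit source indices,
/// lowest destination word first.
fn decode_imm8(imm: i8) -> [u8; 4] {
    let bits = imm as u8;
    [bits & 3, (bits >> 2) & 3, (bits >> 4) & 3, (bits >> 6) & 3]
}

/// Model of pshuflw on one 128-bit lane: the low four words are permuted
/// according to the immediate, the high four words pass through unchanged.
fn pshuflw(src: [u16; 8], imm: i8) -> [u16; 8] {
    let idx = decode_imm8(imm);
    let mut dst = src;
    for i in 0..4 {
        dst[i] = src[idx[i] as usize];
    }
    dst
}

fn main() {
    // 0xE4 selects words 0, 1, 2, 3 in order: the identity permutation.
    assert_eq!(decode_imm8(-28), [0, 1, 2, 3]);
    let v = [10, 11, 12, 13, 14, 15, 16, 17];
    assert_eq!(pshuflw(v, -28), v);
    println!("ok");
}
```

Replacing a non-trivial pshufb with this identity shuffle therefore drops the permutation entirely, which matches the observation that the combine is wrong.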
Fixed & backported to LLVM 9. |
This example fails when compiled with target-cpu "haswell" or more recent:

I bumped into this when compiling with target-cpu=native and assumed it was related to #54688, but after minimising the testcase I don't think it is. My next guess is an LLVM bug, but I thought I'd make an issue here in case anyone else bumps into it or wants to help investigate.

I've created a self-contained godbolt: https://rust.godbolt.org/z/ChdU6w
The two uses of unsafe I believe are sound. I think I've ruled out target-features being the cause, because this succeeds:
Which should enable identical features to haswell given https://github.com/llvm-mirror/llvm/blob/release_90/lib/Target/X86/X86.td#L542-L568.
This occurs on nightly and stables back to 1.34.0, though 1.33.0 and prior seem to work correctly.
Tested on Broadwell (Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz) and Skylake (Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz).
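The differing instructions in this thread (such as vpsrlvd) are AVX2 instructions, and Haswell is the first Intel microarchitecture with AVX2. As a rough sketch, and not part of the original report, a machine can be checked for AVX2 before running a reproduction built with target-cpu=haswell. The function name `host_has_avx2` is hypothetical. `is_x86_feature_detected!` is the standard library's runtime detection macro.

```rust
/// Runtime check for AVX2 on x86/x86_64 hosts.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn host_has_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// Other architectures never have AVX2.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn host_has_avx2() -> bool {
    false
}

fn main() {
    // Compile-time view: true only if this binary was built with AVX2 enabled,
    // for example through a target-cpu that implies it.
    let compiled_with_avx2 = cfg!(target_feature = "avx2");
    println!("host supports AVX2: {}", host_has_avx2());
    println!("compiled with AVX2: {}", compiled_with_avx2);
}
```

A binary built with AVX2 enabled that runs on a host without it would fault on the first AVX2 instruction, so a reproduction attempt is only meaningful when the first line prints `true`.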
Same issue filed against the aes_soft crate: RustCrypto/block-ciphers#51