-
Notifications
You must be signed in to change notification settings - Fork 12.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong optimization: instability of x87 floating-point results leads to nonsense #44218
Comments
And instability of integers then easily taints surrounding code: #include <stdio.h>
__attribute__((optnone)) // imagine it in a separate TU
static int opaque(int i) { return i; }
static void f(int a)
{
int z = opaque(0);
printf("z = %d\n", z);
if (z == a && a == 1)
printf("z = %d\n", z);
}
int main()
{
f(opaque(1) + 0x1p-60 == 1);
} $ clang -std=c11 -pedantic -Wall -Wextra -m32 -march=i686 -O3 test.c && ./a.out
z = 0
z = 1
gcc bug -- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93681 |
|
Yeah, 1 + 0x1p-60 is the simplest expression I could think of where both operands fit into If you force the expression to honest |
This issue in particular should be resolvable by either always storing the result to memory before performing a comparison, or by setting the x87 FPU's precision to the destination's precision. |
I'm trying to fix the contradictory control flow here by promoting f32 and f64 ops to intel_fp80 on the x87. This wouldn't make the numerical results conform to IEEE 754, but it would at least make them consistent. However, I'm running into trouble because it appears that LLVM doesn't actually properly support promotion for floating-point operations. I tried removing the f32/f64 registers from the x87, setting the operations to Promote, and then fiddling with this part of llvm-project/llvm/lib/CodeGen/TargetLoweringBase.cpp Lines 1414 to 1419 in 14bc11a
But now I'm getting errors because it's trying to promote constants to intel_fp80, and I don't know how to tell it that the x87 can load those from memory as a widening conversion. Can anyone point me in the right direction? Pinging @d0k @topperc since they were the last to touch that part of the code. |
It is sufficient to read from a global variable to reproduce this (godbolt): #include <stdio.h>
float c = 1.0f;
static void f(int z) {
printf("z = %d\n", z);
if (z)
printf("z = %d\n", z);
}
int main(void) {
f(c + 0x1p-52f == c);
} This version should also reproduce correctly on Windows as well. |
I'm looking at this and realising that that can avoid the instability, but cannot generate correct results. For f32, it can work but for f64, it cannot. Because |
Yes, I'm only trying to fix the control flow issue for now. Actually getting correct results will involve a performance hit, and my previous inquiries suggest that this would only be merged if it was behind a compiler flag. Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...) |
I do not see how it's possible to fix even that without a performance hit, in general, but I can imagine that there is a possibility that this performance hit will be lower than that of any alternative.
For addition, first consider round to nearest, because it is the most complicated rounding mode. For round to nearest, in order for double rounding to happen, and to produce wrong results, you need the round to extended precision to be inexact, and produce a value which is exactly at a halfway point for the reduced precision which then gets rounded the wrong way in the second round operation. But producing a value which is exactly at a halfway point requires the most significant extended precision bit to be set after the round to extended precision, which requires either that or the next bit to be set in the infinitely precise original result. Meanwhile, having the round to extended precision be inexact requires a bit after the extended precision bits to be set in the infinitely precise result. These cannot both be true when performing an addition if the extended precision adds sufficient bits. I'm not sure how many extra bits of precision are needed exactly, but it should be little more than twice the original precision. For the other rounding modes, it's much simpler, it's either round to zero or round away from zero (round up and round down are one of those, depending on the sign), round to zero is just discarding the extra precision bits, and round away from zero, if inexact, ensures that at least one bit will be set in the extra precision bits so that the second conversion rounds correctly. I don't believe double rounding should ever be an issue with fp32 additions promoted to fp80, in any rounding mode supported by x87. Happy to be corrected if wrong, of course. That said, this is kind of distracting from the point, which is not how much precision is enough for addition to be safe in extended precision, it's that fp80 is definitely not precise enough for fp64 operations to be safely promoted to fp80. The only way I think we can get correct results is if we let LLVM alter the x87 control word to set the precision to that needed by operations. And I actually suspect that if done right, that will have a lower impact on performance in code which does not mix different FP types, because it means LLVM can just set the precision at the start of a function, perform all the operations it needs, and then restore the precision at the end, whereas leaving x87 in its default precision would require extra instructions to ensure correct rounding after every operation. Code which does mix different FP types would need the control word to be changed throughout the function though. |
I agree that a true fix would be immensely preferable. However, I want to add that it's not sufficient to set the control word to the right precision, because that doesn't modify the exponent range of the x87 registers. You still need to do extra ops to ensure correct handling of underflow, subnormals, and overflow. Also, yes, there is a performance cost to promoting all ops to fp80. Since changing the control word plays havoc with pipelining, I suspect this is cheaper, but only profiling will tell. (Also, indeed, double rounding is never a problem for directed rounding modes.) |
Thanks for that addition. You are correct, we need that as well. |
Yeah a naive translation will give wrong results. But getting the exact results seems to be possible; Java is supposedly precisely emulating fp64 on the x87. See here and here for details.
There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document |
Fantastic, thank you for this. Reading through it, the extended format needs to have 2p + 2 digits of precision to guarantee innocuous double rounding for all of (+, -, *, /, sqrt). I've also now got a little test case to demonstrate that double rounding doesn't work for emulating f64 addition using f80 addition: https://godbolt.org/z/j11xbqzr3 I've also come up with a spill-free testcase which miscompiles (godbolt). The following program should always exit with status code 0, but when compiled with int main(void) {
volatile float c = 1.0f;
float d = c + 0x1p-52f;
c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
return c == d ? 0 : 127;
} Accordingly, the following program should never segfault, but it does when compiled with int* p = 0;
int sane(void) {
volatile float c = 1.0f;
float d = c + 0x1p-52f;
c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
return c == d;
}
int main(void) {
if (!sane()) {
*p = 1;
}
} These aren't quite the same bug as initially reported: there's no contradictory control flow here. In fact, I don't think it's possible for contradictory control flow to happen unless there is a spill: if there are no spills, then all variables are still in registers, so repeating a comparison will give the same result as it did the first time. These testcases also show that promoting all ops to I think, in fact, that might be the crux of the matter: an extended precision value, once created, may either be narrowed, or used otherwise, but not both on the same path. If this is not the case, then there may be conflicts, where one part of the code sees the extended version of that value, and another part sees the narrowed version. |
The emulation works by using the precision control set to 53 bits (and using essentially
In the interests of pedantry, you also have need the exponent range to be slightly larger in the second type (two extra bits is sufficient I believe). It might be worth having a helper method in |
Yeah I was assuming this would have to be done in the backend. But still it's probably not easy; I just wanted to raise the possibility. |
If LLVM were to set the precision control to 53-bits for operations on double values, I think it'll need to restore the mode before every external function call, as well as before executing any operation on an fp80 value. And even that only enables the implementation of the complex code sequences listed in https://open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn326.pdf which IMO are themselves well beyond what's worthwhile to implement. Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly. |
I was going to argue against this, but on further reflection, the main value-add you would get from LLVM would be supporting optimizations on the FP control word in conjunction with an inlineable soft-float library, such that operations can be bucketed into those that require the same FP control word bits, and then ordered so as to minimize flag transitions. Such optimizations I think could be useful in other situations (do GPU libm libraries have flag-twiddling operations that might be omittable with such optimizations?), but there is still a lot of work needed to enable such optimizations, and if x87 is the only thing that benefits from it... it's not the highest priority thing to solve. |
In a certain sense, those code sequences are exactly what an x87-accelerated softfloat library would be. That said, it's not clear to me that this is too complex to put into LLVM. Each sequence is only 6-7 instructions. What's so bad about that? |
We found another possibly-related miscompilation at rust-lang/rust#114479 (comment). This one triggers a segfault from a safe Rust program by exploiting the fact that signaling NaNs get quieted upon being loaded into an x87 register. Furthermore, it does not use bitcasting, and it also does not include any NaNs per se. Instead, LLVM merrily loads an integer with an SNaN bit pattern into a floating-point register. |
This sounds like a different issue altogether |
… to IEEE-754 (#102140) Fixes #60942: IEEE semantics is likely what many frontends want (it definitely is what Rust wants), and it is what LLVM passes already assume when they use APFloat to propagate float operations. This does not reflect what happens on x87, but what happens there is just plain unsound (#89885, #44218); there is no coherent specification that will describe this behavior correctly -- the backend in combination with standard LLVM passes is just fundamentally buggy in a hard-to-fix-way. There's also the questions around flushing subnormals to zero, but [this discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378) seems to indicate a general stance of: this is specific non-standard hardware behavior, and generally needs LLVM to be told that basic float ops do not return the standard result. Just naively running LLVM-compiled code on hardware configured to flush subnormals will lead to #89885-like issues. AFAIK this is also what Alive2 implements (@nunoplopes please correct me if I am wrong).
… to IEEE-754 (llvm#102140) Fixes llvm#60942: IEEE semantics is likely what many frontends want (it definitely is what Rust wants), and it is what LLVM passes already assume when they use APFloat to propagate float operations. This does not reflect what happens on x87, but what happens there is just plain unsound (llvm#89885, llvm#44218); there is no coherent specification that will describe this behavior correctly -- the backend in combination with standard LLVM passes is just fundamentally buggy in a hard-to-fix-way. There's also the questions around flushing subnormals to zero, but [this discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378) seems to indicate a general stance of: this is specific non-standard hardware behavior, and generally needs LLVM to be told that basic float ops do not return the standard result. Just naively running LLVM-compiled code on hardware configured to flush subnormals will lead to llvm#89885-like issues. AFAIK this is also what Alive2 implements (@nunoplopes please correct me if I am wrong).
… to IEEE-754 (llvm#102140) Fixes llvm#60942: IEEE semantics is likely what many frontends want (it definitely is what Rust wants), and it is what LLVM passes already assume when they use APFloat to propagate float operations. This does not reflect what happens on x87, but what happens there is just plain unsound (llvm#89885, llvm#44218); there is no coherent specification that will describe this behavior correctly -- the backend in combination with standard LLVM passes is just fundamentally buggy in a hard-to-fix-way. There's also the questions around flushing subnormals to zero, but [this discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378) seems to indicate a general stance of: this is specific non-standard hardware behavior, and generally needs LLVM to be told that basic float ops do not return the standard result. Just naively running LLVM-compiled code on hardware configured to flush subnormals will lead to llvm#89885-like issues. AFAIK this is also what Alive2 implements (@nunoplopes please correct me if I am wrong).
Upstream Rust always requires SSE2 for x86. But back in 2017[^1][^2] we patched lang/rust to disable SSE2 for i386. At the time, it was reported that some people were still using non-SSE2 capable hardware. More recently, LLVM bugs have been discovered[^3][^4] that can result in rounding bugs and reduced accuracy when using f64 on non-SSE hardware. In weird cases, they can even cause wilder unpredictable behavior, like segfaults. Revert our change for the sake of Pentium 4 (and later) users. But add an SSE2 option. Disabling it will allow the port to be used on Pentium 3 and older CPUs. [^1]: d65b288 [^2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223415 [^3]: rust-lang/rust#114479 [^4]: llvm/llvm-project#44218 Reviewed by: emaste Differential Revision: https://reviews.freebsd.org/D47227
Extended Description
x87 floating-point results are effectively unstable due to possible excess precision. Without extra care, this instability can taint everything around and lead to nonsense.
Instability is not limited to FP numbers, it extends to integers too:
The text was updated successfully, but these errors were encountered: