Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong optimization: instability of x87 floating-point results leads to nonsense #44218

Open
llvmbot opened this issue Feb 11, 2020 · 20 comments
Open
Labels
backend:X86 bugzilla Issues migrated from bugzilla confirmed Verified by a second party floating-point Floating-point math

Comments

@llvmbot
Copy link
Member

llvmbot commented Feb 11, 2020

Bugzilla Link 44873
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @topperc,@LebedevRI,@RKSimon,@rotateright,@vinc17fr

Extended Description

x87 floating-point results are effectively unstable due to possible excess precision. Without extra care, this instability can taint everything around and lead to nonsense.

Instability is not limited to FP numbers, it extends to integers too:

#include <stdio.h>

__attribute__((noipa,optnone)) // imagine it in a separate TU
static int opaque(int i) { return i; }

int main()
{
    int z = opaque(1) + 0x1p-60 == 1;

    printf("z = %d\n", z);
    if (z) 
        puts("z is one");
}
$ clang -std=c11 -pedantic -Wall -Wextra -Wno-unknown-attributes -m32 -march=i686 -O3 test.c && ./a.out
z = 0
z is one
----------------------------------------------------------------------
clang x86-64 version: clang version 11.0.0 (https://github.com/llvm/llvm-project 14ecbd7b8ded18af6c95f6a9957da541d1ec0e80)
----------------------------------------------------------------------
@llvmbot
Copy link
Member Author

llvmbot commented Feb 11, 2020

And instability of integers then easily taints surrounding code:

#include <stdio.h>

__attribute__((optnone)) // imagine it in a separate TU
static int opaque(int i) { return i; }

static void f(int a)
{
    int z = opaque(0);

    printf("z = %d\n", z);
    if (z == a && a == 1)
        printf("z = %d\n", z);
}

int main()
{
    f(opaque(1) + 0x1p-60 == 1);
}
$ clang -std=c11 -pedantic -Wall -Wextra -m32 -march=i686 -O3 test.c && ./a.out
z = 0
z = 1
----------------------------------------------------------------------
clang x86-64 version: clang version 11.0.0 (https://github.com/llvm/llvm-project 14ecbd7b8ded18af6c95f6a9957da541d1ec0e80)
----------------------------------------------------------------------

gcc bug -- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93681

@LebedevRI
Copy link
Member

LebedevRI commented Feb 11, 2020

int z = opaque(1) + 0x1p-60 == 1;

0x1p-60 there is just a double
https://godbolt.org/z/mJS3RN
It's as expected if it is long double: https://godbolt.org/z/NVvJht

@llvmbot
Copy link
Member Author

llvmbot commented Feb 11, 2020

int z = opaque(1) + 0x1p-60 == 1;

0x1p-60 there is just a double
https://godbolt.org/z/mJS3RN
It's as expected if it is long double: https://godbolt.org/z/NVvJht

Yeah, 1 + 0x1p-60 is the simplest expression I could think of where both operands fit into double exactly but the result doesn't.
x87 evaluates it as long double and gets the exact result. But if the result is spilled into memory (as double) it's rounded to just 1.

If you force the expression to honest long double (e.g., by casting any of the operands to it) then everything becomes stable -- the result both in a register and in memory has the same type and the same value. The same happens with SSE -- the result is stored as double everywhere and there are no discrepancies.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@Muon
Copy link

Muon commented Feb 16, 2023

This issue in particular should be resolvable by either always storing the result to memory before performing a comparison, or by setting the x87 FPU's precision to the destination's precision.

@Muon
Copy link

Muon commented Oct 21, 2023

I'm trying to fix the contradictory control flow here by promoting f32 and f64 ops to intel_fp80 on the x87. This wouldn't make the numerical results conform to IEEE 754, but it would at least make them consistent. However, I'm running into trouble because it appears that LLVM doesn't actually properly support promotion for floating-point operations. I tried removing the f32/f64 registers from the x87, setting the operations to Promote, and then fiddling with this part of TargetLoweringBase.cpp:

if (!isTypeLegal(MVT::f32)) {
NumRegistersForVT[MVT::f32] = NumRegistersForVT[MVT::i32];
RegisterTypeForVT[MVT::f32] = RegisterTypeForVT[MVT::i32];
TransformToType[MVT::f32] = MVT::i32;
ValueTypeActions.setTypeAction(MVT::f32, TypeSoftenFloat);
}

But now I'm getting errors because it's trying to promote constants to intel_fp80, and I don't know how to tell it that the x87 can load those from memory as a widening conversion.

Can anyone point me in the right direction? Pinging @d0k @topperc since they were the last to touch that part of the code.

@Muon
Copy link

Muon commented Nov 23, 2023

It is sufficient to read from a global variable to reproduce this (godbolt):

#include <stdio.h>

float c = 1.0f;

static void f(int z) {
    printf("z = %d\n", z);
    if (z) 
        printf("z = %d\n", z);
}

int main(void) {
    f(c + 0x1p-52f == c);
}

This version should also reproduce correctly on Windows as well.

@Endilll Endilll added confirmed Verified by a second party floating-point Floating-point math labels Nov 23, 2023
@hvdijk
Copy link
Contributor

hvdijk commented Nov 28, 2023

I'm trying to fix the contradictory control flow here by promoting f32 and f64 ops to intel_fp80 on the x87.

I'm looking at this and realising that that can avoid the instability, but cannot generate correct results. For f32, it can work but for f64, it cannot. Because intel_fp80 has less than double the precision of f64, it is possible to get double rounding errors where adding two f64 values gives a different result than promoting both to intel_fp80, performing the addition there, and then rounding back to f64. It would still be an improvement, but the work needed to get that done might be work that would need to be thrown out again later.

@Muon
Copy link

Muon commented Nov 28, 2023

Yes, I'm only trying to fix the control flow issue for now. Actually getting correct results will involve a performance hit, and my previous inquiries suggest that this would only be merged if it was behind a compiler flag.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

@hvdijk
Copy link
Contributor

hvdijk commented Nov 29, 2023

Yes, I'm only trying to fix the control flow issue for now. Actually getting correct results will involve a performance hit, and my previous inquiries suggest that this would only be merged if it was behind a compiler flag.

I do not see how it's possible to fix even that without a performance hit, in general, but I can imagine that there is a possibility that this performance hit will be lower than that of any alternative.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

For addition, first consider round to nearest, because it is the most complicated rounding mode. For round to nearest, in order for double rounding to happen, and to produce wrong results, you need the round to extended precision to be inexact, and produce a value which is exactly at a halfway point for the reduced precision which then gets rounded the wrong way in the second round operation. But producing a value which is exactly at a halfway point requires the most significant extended precision bit to be set after the round to extended precision, which requires either that or the next bit to be set in the infinitely precise original result. Meanwhile, having the round to extended precision be inexact requires a bit after the extended precision bits to be set in the infinitely precise result. These cannot both be true when performing an addition if the extended precision adds sufficient bits. I'm not sure how many extra bits of precision are needed exactly, but it should be little more than twice the original precision. For the other rounding modes, it's much simpler, it's either round to zero or round away from zero (round up and round down are one of those, depending on the sign), round to zero is just discarding the extra precision bits, and round away from zero, if inexact, ensures that at least one bit will be set in the extra precision bits so that the second conversion rounds correctly. I don't believe double rounding should ever be an issue with fp32 additions promoted to fp80, in any rounding mode supported by x87. Happy to be corrected if wrong, of course.

That said, this is kind of distracting from the point, which is not how much precision is enough for addition to be safe in extended precision, it's that fp80 is definitely not precise enough for fp64 operations to be safely promoted to fp80.

The only way I think we can get correct results is if we let LLVM alter the x87 control word to set the precision to that needed by operations. And I actually suspect that if done right, that will have a lower impact on performance in code which does not mix different FP types, because it means LLVM can just set the precision at the start of a function, perform all the operations it needs, and then restore the precision at the end, whereas leaving x87 in its default precision would require extra instructions to ensure correct rounding after every operation. Code which does mix different FP types would need the control word to be changed throughout the function though.

@Muon
Copy link

Muon commented Nov 29, 2023

I agree that a true fix would be immensely preferable. However, I want to add that it's not sufficient to set the control word to the right precision, because that doesn't modify the exponent range of the x87 registers. You still need to do extra ops to ensure correct handling of underflow, subnormals, and overflow.

Also, yes, there is a performance cost to promoting all ops to fp80. Since changing the control word plays havoc with pipelining, I suspect this is cheaper, but only profiling will tell.

(Also, indeed, double rounding is never a problem for directed rounding modes.)

@hvdijk
Copy link
Contributor

hvdijk commented Nov 29, 2023

However, I want to add that it's not sufficient to set the control word to the right precision, because that doesn't modify the exponent range of the x87 registers.

Thanks for that addition. You are correct, we need that as well.

@RalfJung
Copy link
Contributor

I'm looking at this and realising that that can avoid the instability, but cannot generate correct results. For f32, it can work but for f64, it cannot. Because intel_fp80 has less than double the precision of f64, it is possible to get double rounding errors where adding two f64 values gives a different result than promoting both to intel_fp80, performing the addition there, and then rounding back to f64. It would still be an improvement, but the work needed to get that done might be work that would need to be thrown out again later.

Yeah a naive translation will give wrong results. But getting the exact results seems to be possible; Java is supposedly precisely emulating fp64 on the x87. See here and here for details.

Also, are you thinking of products? Twice the precision is enough to represent products exactly. However, I'm almost certain you need much more than twice the precision to make double rounding give correct results for addition, if it's possible at all. (I don't have a specific counterexample on hand, but I can probably come up with something if I fiddle with the exponents and rounding breakpoints...)

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

@Muon
Copy link

Muon commented Nov 30, 2023

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

Fantastic, thank you for this. Reading through it, the extended format needs to have 2p + 2 digits of precision to guarantee innocuous double rounding for all of (+, -, *, /, sqrt). I've also now got a little test case to demonstrate that double rounding doesn't work for emulating f64 addition using f80 addition: https://godbolt.org/z/j11xbqzr3

I've also come up with a spill-free testcase which miscompiles (godbolt). The following program should always exit with status code 0, but when compiled with -m32 -mno-sse and -O1 or higher, it exits with code 127:

int main(void) {
    volatile float c = 1.0f;
    float d = c + 0x1p-52f;
    c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
    return c == d ? 0 : 127;
}

Accordingly, the following program should never segfault, but it does when compiled with -m32 -mno-sse and -O1 or higher (godbolt):

int* p = 0;

int sane(void) {
    volatile float c = 1.0f;
    float d = c + 0x1p-52f;
    c = d; // ensure that, regardless of the FP semantics, we should have c == d after this
    return c == d;
}


int main(void) {
    if (!sane()) {
        *p = 1;
    }
}

These aren't quite the same bug as initially reported: there's no contradictory control flow here. In fact, I don't think it's possible for contradictory control flow to happen unless there is a spill: if there are no spills, then all variables are still in registers, so repeating a comparison will give the same result as it did the first time.

These testcases also show that promoting all ops to intel_fp80 isn't necessarily a fix either, unless extra care is taken to reload values after a narrowing store to memory. If these promotions mean that d retains excess precision until the comparison (after all, it hasn't changed since its initial assignment), then c == d will still be false, because c = d forces a narrowing store to memory since c is volatile.

I think, in fact, that might be the crux of the matter: an extended precision value, once created, may either be narrowed, or used otherwise, but not both on the same path. If this is not the case, then there may be conflicts, where one part of the code sees the extended version of that value, and another part sees the narrowed version.

@jcranmer-intel
Copy link
Contributor

Yeah a naive translation will give wrong results. But getting the exact results seems to be possible; Java is supposedly precisely emulating fp64 on the x87. See here and here for details.

The emulation works by using the precision control set to 53 bits (and using essentially fptrunc to clamp the resulting exponent). Which does make it trickier if you want to also support x86_fp80 at the same time, since now you have to juggle the FCW changes interspersed with FP instructions, and that requires strictfp if you're doing it at the level of LLVM IR.

There's a general theorem that twice the precision is good enough, AFAIK. Here's the paper for that: https://hal.science/hal-01091186/document

In the interests of pedantry, you also have need the exponent range to be slightly larger in the second type (two extra bits is sufficient I believe). It might be worth having a helper method in fltSemantics to return if the type can emulate another type without double rounding.

@RalfJung
Copy link
Contributor

RalfJung commented Dec 5, 2023

The emulation works by using the precision control set to 53 bits (and using essentially fptrunc to clamp the resulting exponent). Which does make it trickier if you want to also support x86_fp80 at the same time, since now you have to juggle the FCW changes interspersed with FP instructions, and that requires strictfp if you're doing it at the level of LLVM IR.

Yeah I was assuming this would have to be done in the backend. But still it's probably not easy; I just wanted to raise the possibility.

@jyknight
Copy link
Member

jyknight commented Dec 8, 2023

If LLVM were to set the precision control to 53-bits for operations on double values, I think it'll need to restore the mode before every external function call, as well as before executing any operation on an fp80 value.

And even that only enables the implementation of the complex code sequences listed in https://open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn326.pdf which IMO are themselves well beyond what's worthwhile to implement. Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

@jcranmer-intel
Copy link
Contributor

Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

I was going to argue against this, but on further reflection, the main value-add you would get from LLVM would be supporting optimizations on the FP control word in conjunction with an inlineable soft-float library, such that operations can be bucketed into those that require the same FP control word bits, and then ordered so as to minimize flag transitions. Such optimizations I think could be useful in other situations (do GPU libm libraries have flag-twiddling operations that might be omittable with such optimizations?), but there is still a lot of work needed to enable such optimizations, and if x87 is the only thing that benefits from it... it's not the highest priority thing to solve.

@Muon
Copy link

Muon commented Dec 21, 2023

And even that only enables the implementation of the complex code sequences listed in https://open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn326.pdf which IMO are themselves well beyond what's worthwhile to implement. Anyone really wanting that might be better off implementing a correctly-rounded soft-float library which uses x87 instructions to accelerate its emulation, instead of putting that into LLVM directly.

In a certain sense, those code sequences are exactly what an x87-accelerated softfloat library would be. That said, it's not clear to me that this is too complex to put into LLVM. Each sequence is only 6-7 instructions. What's so bad about that?

@Muon
Copy link

Muon commented Apr 23, 2024

We found another possibly-related miscompilation at rust-lang/rust#114479 (comment). This one triggers a segfault from a safe Rust program by exploiting the fact that signaling NaNs get quieted upon being loaded into an x87 register. Furthermore, it does not use bitcasting, and it also does not include any NaNs per se. Instead, LLVM merrily loads an integer with an SNaN bit pattern into a floating-point register.

@arsenm
Copy link
Contributor

arsenm commented Apr 23, 2024

LLVM merrily loads an integer with an SNaN bit pattern into a floating-point register.

This sounds like a different issue altogether

arsenm pushed a commit that referenced this issue Oct 11, 2024
… to IEEE-754 (#102140)

Fixes #60942: IEEE semantics
is likely what many frontends want (it definitely is what Rust wants),
and it is what LLVM passes already assume when they use APFloat to
propagate float operations.

This does not reflect what happens on x87, but what happens there is
just plain unsound (#89885,
#44218); there is no coherent
specification that will describe this behavior correctly -- the backend
in combination with standard LLVM passes is just fundamentally buggy in
a hard-to-fix-way.

There's also the questions around flushing subnormals to zero, but [this
discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378)
seems to indicate a general stance of: this is specific non-standard
hardware behavior, and generally needs LLVM to be told that basic float
ops do not return the standard result. Just naively running
LLVM-compiled code on hardware configured to flush subnormals will lead
to #89885-like issues.

AFAIK this is also what Alive2 implements (@nunoplopes please correct me
if I am wrong).
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this issue Oct 16, 2024
… to IEEE-754 (llvm#102140)

Fixes llvm#60942: IEEE semantics
is likely what many frontends want (it definitely is what Rust wants),
and it is what LLVM passes already assume when they use APFloat to
propagate float operations.

This does not reflect what happens on x87, but what happens there is
just plain unsound (llvm#89885,
llvm#44218); there is no coherent
specification that will describe this behavior correctly -- the backend
in combination with standard LLVM passes is just fundamentally buggy in
a hard-to-fix-way.

There's also the questions around flushing subnormals to zero, but [this
discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378)
seems to indicate a general stance of: this is specific non-standard
hardware behavior, and generally needs LLVM to be told that basic float
ops do not return the standard result. Just naively running
LLVM-compiled code on hardware configured to flush subnormals will lead
to llvm#89885-like issues.

AFAIK this is also what Alive2 implements (@nunoplopes please correct me
if I am wrong).
bricknerb pushed a commit to bricknerb/llvm-project that referenced this issue Oct 17, 2024
… to IEEE-754 (llvm#102140)

Fixes llvm#60942: IEEE semantics
is likely what many frontends want (it definitely is what Rust wants),
and it is what LLVM passes already assume when they use APFloat to
propagate float operations.

This does not reflect what happens on x87, but what happens there is
just plain unsound (llvm#89885,
llvm#44218); there is no coherent
specification that will describe this behavior correctly -- the backend
in combination with standard LLVM passes is just fundamentally buggy in
a hard-to-fix-way.

There's also the questions around flushing subnormals to zero, but [this
discussion](https://discourse.llvm.org/t/questions-about-llvm-canonicalize/79378)
seems to indicate a general stance of: this is specific non-standard
hardware behavior, and generally needs LLVM to be told that basic float
ops do not return the standard result. Just naively running
LLVM-compiled code on hardware configured to flush subnormals will lead
to llvm#89885-like issues.

AFAIK this is also what Alive2 implements (@nunoplopes please correct me
if I am wrong).
freebsd-git pushed a commit to freebsd/freebsd-ports that referenced this issue Nov 9, 2024
Upstream Rust always requires SSE2 for x86.  But back in 2017[^1][^2] we
patched lang/rust to disable SSE2 for i386.  At the time, it was
reported that some people were still using non-SSE2 capable hardware.
More recently, LLVM bugs have been discovered[^3][^4] that can result in
rounding bugs and reduced accuracy when using f64 on non-SSE hardware.
In weird cases, they can even cause wilder unpredictable behavior, like
segfaults.

Revert our change for the sake of Pentium 4 (and later) users.  But add
an SSE2 option.  Disabling it will allow the port to be used on Pentium
3 and older CPUs.

[^1]: d65b288
[^2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223415
[^3]: rust-lang/rust#114479
[^4]: llvm/llvm-project#44218

Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D47227
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla confirmed Verified by a second party floating-point Floating-point math
Projects
None yet
Development

No branches or pull requests

9 participants