Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recalculate lerp if we got infinity. Eliminates some overflows. #1918

Merged
merged 36 commits into from
Jun 20, 2022

Conversation

fsb4000
Copy link
Contributor

@fsb4000 fsb4000 commented May 11, 2021

Fixes #1917

@fsb4000 fsb4000 requested a review from a team as a code owner May 11, 2021 05:14
@fsb4000
Copy link
Contributor Author

fsb4000 commented May 12, 2021

@statementreply

It breaks monotonicity.

Thank you for finding some issues. I fixed some but I have no idea how to restore monotonicity for lerp :(

@statementreply
Copy link
Contributor

statementreply commented May 12, 2021

I fixed some but I have no idea how to restore monotonicity for lerp :(

Never mind. It can be tricky to implement floating point functionalities when there are requirements on precision, handling of extreme values, and/or preservation of math properties.

I'm going to open a PR (maybe on the weekend) with an alternative approach to fix the false overflow. We need to investigate whether there's a simple-ish fix for the original issue.

@abolz
Copy link
Contributor

abolz commented May 12, 2021

FWIW, this could be fixed with a fused multiply-add

--const auto _Candidate = _ArgA + _ArgT * (_ArgB - _ArgA);
++const auto _Candidate = std::fma(_ArgT, _ArgB - _ArgA, _ArgA);

But fma is not constexpr.

@StephanTLavavej StephanTLavavej added the bug Something isn't working label May 12, 2021
@fsb4000
Copy link
Contributor Author

fsb4000 commented May 13, 2021

@statementreply

Well, at least we can reuse new test code :(

What do you think about fma? It seems it doesn't overflow but we can get different results for runtime/compiletime invocation of lerp...

https://gcc.godbolt.org/z/T564z58qo

@MattStephanson
Copy link
Contributor

The condition we have to handle here is limited to |t*(b-a)| ≤ 2*MEOW_MAX, right? Otherwise we'd have a true, not spurious, overflow. If that's the case, maybe we can simply scale the calculation by a fixed factor.

auto _Smaller     = _ArgT;
auto _Larger      = _ArgB - _ArgA;
auto _Abs_smaller = _Float_abs(_Smaller);
auto _Abs_larger  = _Float_abs(_Larger);
if (_Abs_larger < _Abs_smaller) {
    _STD swap(_Smaller, _Larger);
    _STD swap(_Abs_smaller, _Abs_larger);
}

if (_Abs_larger > _STD sqrt(numeric_limits<_Ty>::max()) && _Abs_smaller > 1) {
    return 2 * (_Ty{0.5} * _ArgA + _Smaller * (_Ty{0.5} * _Larger));
} else {
    return _ArgA + _Smaller * _Larger;
}

_Larger is too large to be subnormal, so scaling by 0.5 is exact, and the product _Smaller * _Larger is large enough that if _ArgA is subnormal, it will be too small to contribute anyway. So I think the two expressions should be equivalent, apart from internal overflow, so there won't be a loss of monotonicity when switching from one to the other.

@statementreply
Copy link
Contributor

statementreply commented May 14, 2021

What do you think about fma? It seems it doesn't overflow but we can get different results for runtime/compiletime invocation of lerp...

  • Accuracy

    fma(t, b - a, a) is (on average) more accurate than a + t * (b - a), and doesn't suffer from t * (b - a) overflowing as long as the final result doesn't overflow. (1 - t) * a + t * b in the other branch could also be fma(t, b, fma(-t, a, a)).

  • Performance

    Recent CPUs (AMD since ~2012, Intel since ~2014, and all ARM that Windows arm64 supports) have hardware fma instructions. Older CPUs don't support hardware fma , so std::fma needs to be implemented in software.

    On the current version of MSVC, if not compiled with /arch:AVX2 or alike, std::fma isn't inlined and generates a call into UCRT (https://godbolt.org/z/q4vbe7xqn). It then checks whether the CPU supports fma. If so, it calls the CPU instruction directly (however the function call overhead is non-trivial). If not, UCRT uses another algorithm to compute the result, which is unfortunately neither accurate (DevCom-242309) nor fast (hundreds of nanoseconds for some input values) as of ucrtbase.dll 10.0.19041.789.

  • Different results between compile time and runtime

    There are already situations where results could differ.

    <cmath> functions (exp, sin, etc.) use fma powered implementation on CPUs with hardware fma, and a different implementation without fma on CPUs without hardware fma. Code using these math functions could generate different results on different CPUs.

    GCC performs constant folding (not limited to constexpr) on math functions. The compile time results could be different from the results from the runtime math library. (https://godbolt.org/z/r49WEax6E)

    Also, the compiler is allowed to compile x * y + z into fma(x, y, z) (https://godbolt.org/z/zaY6veW9j), which may also lead to different results between compile time and runtime.

    However, should the results differ between compile time and runtime, it's desirable that the compile time results are correctly rounded (as if computed with infinite precision and then rounded), or at least more accurate than the runtime results.

@statementreply
Copy link
Contributor

statementreply commented May 14, 2021

The condition we have to handle here is limited to |t*(b-a)| ≤ 2*MEOW_MAX, right? Otherwise we'd have a true, not spurious, overflow. If that's the case, maybe we can simply scale the calculation by a fixed factor.

auto _Smaller     = _ArgT;
auto _Larger      = _ArgB - _ArgA;
auto _Abs_smaller = _Float_abs(_Smaller);
auto _Abs_larger  = _Float_abs(_Larger);
if (_Abs_larger < _Abs_smaller) {
    _STD swap(_Smaller, _Larger);
    _STD swap(_Abs_smaller, _Abs_larger);
}

if (_Abs_larger > _STD sqrt(numeric_limits<_Ty>::max()) && _Abs_smaller > 1) {
    return 2 * (_Ty{0.5} * _ArgA + _Smaller * (_Ty{0.5} * _Larger));
} else {
    return _ArgA + _Smaller * _Larger;
}

_Larger is too large to be subnormal, so scaling by 0.5 is exact, and the product _Smaller * _Larger is large enough that if _ArgA is subnormal, it will be too small to contribute anyway. So I think the two expressions should be equivalent, apart from internal overflow, so there won't be a loss of monotonicity when switching from one to the other.

Note that we have |b - a| <= MEOW_MAX in this branch, so it is limited to |t| > 1. I think we could do the following to save a few branches (untested, needs benchmark):

// precondition: T{0.5} * t is exact
T linear_for_lerp(T intercept, T slope, T t) {
    T half_prod = slope * (T{0.5} * t);
    if (abs(half_prod) <= numeric_limits<T>::max() * T{0.5}) {
        return intercept + slope * t;
    } else {
        return T{2} * (T{0.5} * intercept + half_prod);
    }
}

T lerp(T a, T b, T t) {
    // [...]
    {
        if (abs(t) < T{1}) {
            T candidate = a + t * (b - a);
            // [...] fix monotonicity
        }

        if (t >= T{1}) {
            return linear_for_lerp(b, b - a, t - T{1});
        }

        // t <= -1
        return linear_for_lerp(a, b - a, t);
    }
    // [[...]]
}

@StephanTLavavej StephanTLavavej removed their assignment Jun 13, 2022
@StephanTLavavej StephanTLavavej self-assigned this Jun 15, 2022
@StephanTLavavej StephanTLavavej removed their assignment Jun 16, 2022
Co-authored-by: Stephan T. Lavavej <[email protected]>
Co-authored-by: Matt Stephanson <[email protected]>
@StephanTLavavej StephanTLavavej self-assigned this Jun 19, 2022
@StephanTLavavej
Copy link
Member

I'm speculatively mirroring this to the MSVC-internal repo. Further changes can be pushed, but please notify me.


return _STD fma(_ArgT, _ArgB - _ArgA, _ArgA);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find that Twitter thread mocking our huge implementation of lerp to let them know we've added another 25 lines. (No change requested.)

@StephanTLavavej StephanTLavavej merged commit 8e5d0aa into microsoft:main Jun 20, 2022
@StephanTLavavej
Copy link
Member

Thanks for infinitely improving this implementation! ♾️ 🚀 😹

@fsb4000 fsb4000 deleted the fix1917 branch June 20, 2022 04:07
fsb4000 added a commit to fsb4000/STL that referenced this pull request Aug 13, 2022
…osoft#1918)

Co-authored-by: Adam Bucior <[email protected]>
Co-authored-by: Alexander Bolz <[email protected]>
Co-authored-by: Curtis J Bezault <[email protected]>
Co-authored-by: Matt Stephanson <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: statementreply <[email protected]>
Co-authored-by: Stephan T. Lavavej <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<cmath>: lerp(1e+308, 5e+307, 4.0) spuriously overflows