Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROTOTYPE] Adding a libm library for use by the JIT and VM #51439

Closed
wants to merge 2 commits into from

Conversation

tannergooding
Copy link
Member

@tannergooding tannergooding commented Apr 17, 2021

This adds a new libm library that can be used by the VM (for Math internal calls) and by the JIT (for ValueNum constant folding).

The implementations are copied nearly verbatim from https://github.com/amd/aocl-libm-ose (as per the license headers), with the only fixups being minor changes to get it building in CoreCLR and running on ARM.
This means picking up fixes or perf improvements is effectively as straightforward as copying the files from aocl-libm-ose into the corresponding folder in the runtime.

This change would allow us to provide performant and deterministic implementation of System.Math across all our platforms without taking on the burden of creating or maintaining our own implementations.

CC. @jkotas, @jeffhandley, @danmoseley. I'd appreciate feedback here on if this approach (or a variation of it) is believed to be feasible.

I did some minor initial profiling on x64 Windows (via our Math Benchmarks in dotnet/performance) and for the methods that have switched we are on par or faster than in all cases.
For reference, x64 Windows uses what is effectively https://github.com/amd/win-libm, https://github.com/amd/aocl-libm-ose is a modernized and cross-platform variant of the same and so has some perf improvements (such as to Acosh) that we don't currently have.

The functions that aren't ported to in this PR (such as cos) either only exist as assembly and would need a C reference port (which would be contributed to amd/aocl-libm-ose before being taken on our end) or have multiple implementations available based on what CPU features are available and so aren't yet fully portable (the same issue of ensuring a stable reference implementation exists in amd/aocl-libm-ose would apply before taking it in dotnet/runtime).

@jkotas
Copy link
Member

jkotas commented Apr 17, 2021

performant and deterministic implementation of System.Math across all our platforms

If it is accross all our platforms, it should include Mono. src/native would be a better place for this library in that case.


if (xnan)
{
#ifdef WINDOWS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many places that are ifdefed for Windows like this. Why is that? Is it a problem for consistency accross platforms?

Copy link
Member Author

@tannergooding tannergooding Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This namely represents a difference in what IEEE exception is raised and what errno is set for failure cases (cases that return NaN) as POSIX sometimes puts additional requirements on the methods that Windows doesn't conform to.

For managed code, these don't actually matter because we don't support (IEEE exceptions) or surface (errno) for these methods at all.

If we plan on using these with any native code, then we'd likely want to ensure UNIX platforms stay POSIX compliant in their error reporting; otherwise we can just pick one (I chose Windows for the purpose of the draft, since it does less) and it won't impact what is exposed via System.Math.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we do not care about errno or IEEE exceptions, can we just turn this error handling into no-op or almost no-op?

If we plan on using these with any native code,

I do not think we have any native code that would need this. And if we ever had native code like that, we can fix it up as necessary to not expect errno, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we do not care about errno or IEEE exceptions, can we just turn this error handling into no-op or almost no-op?

Yes, I have it commented out in the files that are part of this PR. I plan on submitting an issue/PR to amd/aocl-libm-ose to allow the same via an #ifdef, so we don't need to maintain any custom logic on our end.

Copy link
Member Author

@tannergooding tannergooding Apr 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, it's the logic that was in here: https://github.com/dotnet/runtime/pull/51439/files#diff-19a08ddd64deeddea26b16f7e96a1ce59bb6bf1e163ae53067bca6600ae73f04R32-R53

It was commented out because it triggered the exception via inline x86 assembly, which isn't compatible with ARM.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

@tannergooding
Copy link
Member Author

If it is accross all our platforms, it should include Mono. src/native would be a better place for this library in that case.

That's a good point. I wasn't quite sure on where it should live or the correct mechanism to get it linked in and so this draft PR is more about showing a working minimal prototype on Windows to see if we believe this is feasible and a good direction.

I expect there are a likely a few changes required to the build parameters, how this is linked or referenced by the various components, and to determine how this interacts with the PAL layer.

If we do believe this is worthwhile and that it addresses the maintainability concerns (given changes would be contributed back to and pulled from amd/aocl-libm-ose), then I'd make the appropriate changes here to make it fully functional across all our platforms.

@jkotas
Copy link
Member

jkotas commented Apr 17, 2021

determine how this interacts with the PAL layer.

I expect that we would delete all floating point wrappers from the CoreCLR PAL layer with this change.

@jkotas
Copy link
Member

jkotas commented Apr 17, 2021

The functions that aren't ported to in this PR (such as cos)

Yeah, it would be nice to sort these out. It would look odd to have this fancy logic for advanced functions, but missing it for basic functions such as cos.

It looks reasonable to me otherwise.

@ghost
Copy link

ghost commented Apr 17, 2021

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

Issue Details

This adds a new libm library that can be used by the VM (for Math internal calls) and by the JIT (for ValueNum constant folding).

The implementations are copied nearly verbatim from https://github.com/amd/aocl-libm-ose (as per the license headers), with the only fixups being minor changes to get it building in CoreCLR and running on ARM.
This means picking up fixes or perf improvements is effectively as straightforward as copying the files from aocl-libm-ose into the corresponding folder in the runtime.

This change would allow us to provide performant and deterministic implementation of System.Math across all our platforms without taking on the burden of creating or maintaining our own implementations.

CC. @jkotas, @jeffhandley, @danmoseley. I'd appreciate feedback here on if this approach (or a variation of it) is believed to be feasible.

I did some minor initial profiling on x64 Windows (via our Math Benchmarks in dotnet/performance) and for the methods that have switched we are on par or faster than in all cases.
For reference, x64 Windows uses what is effectively https://github.com/amd/win-libm, https://github.com/amd/aocl-libm-ose is a modernized and cross-platform variant of the same and so has some perf improvements (such as to Acosh) that we don't currently have.

The functions that aren't ported to in this PR (such as cos) either only exist as assembly and would need a C reference port (which would be contributed to amd/aocl-libm-ose before being taken on our end) or have multiple implementations available based on what CPU features are available and so aren't yet fully portable (the same issue of ensuring a stable reference implementation exists in amd/aocl-libm-ose would apply before taking it in dotnet/runtime).

Author: tannergooding
Assignees: -
Labels:

area-System.Numerics, area-VM-coreclr

Milestone: -

@tannergooding
Copy link
Member Author

tannergooding commented Apr 17, 2021

It looks reasonable to me otherwise.

Based on your suggestion of placing it in src/native, I have a few questions:

  • Do we just want src/native/libm?
  • Is statically linking it into both mscoree and clrjit, like this PR is doing, ok? Or do we want this to be a proper DLL that lives SxS with the JIT?
    • Having it statically linked may improve codegen in COMSingle/COMDouble but increases overall size of the distribution
    • Having it be a DLL means reduced size and easier consumption from other tools (say if libgdiplus wanted to use the same implementation), but power users could replace it with their own "libm" and that might be "problematic"
  • Who are the right mono folks to loop in to determine how they would want this integrated on their end?

Yeah, it would be nice to sort these out

For sin and cos in particular, they have a startup switch to pick the "best" implementation: https://github.com/amd/aocl-libm-ose/blob/master/src/iface/sin.c

In practice, this roughly boils down to picking an "optmized" implementation on most platforms/setups: ALM_PROTO_OPT(sin) (which compiles down to the same as FN_PROTOTYPE_OPT(sin))
and potentially optimized assembly variants on older setups: FN_PROTOTYPE_BAS64(sin). There also exists FN_PROTOTYPE_FMA3(sin), but it doesn't appear to be used anymore.

So, these ones should be fairly straightforward to use since the variant used on most setups is pure C and we don't need the micro-architecture specific optimizations.

  • Noting that there may be some consideration if the C variants here are less efficient on non-AVX machines, since that may impact perf on machines older than 10 years, but given its all C, I wouldn't expect anything here to be major.

There are notably a couple functions, such as log, which don't have an optimized C replacement yet (they do have a variant, but its not been enabled as of today). However, the float version (logf) has been ported to C already: https://github.com/amd/aocl-libm-ose/blob/master/src/iface/log.c#L74


I'm going to submit a PR to amd/aocl-libm-ose to fix the only blocker for consuming the library from ARM (which is how they call sqrt internally right now).
I'll likewise create issues and submit fixes for any other changes that are needed here so we aren't deviating from the official sources 😄

@jkotas
Copy link
Member

jkotas commented Apr 17, 2021

Do we just want src/native/libm?

I think so.

Is statically linking it into both mscoree and clrjit, like this PR is doing, ok? O

Yes, it is small enough. No big deal with statically linking it multiple times.

Who are the right mono folks to loop in to determine how they would want this integrated on their end?

@marek-safar Who should work with Tanner on this on Mono side?

@tannergooding
Copy link
Member Author

I moved the files into src\native\clrmath (not libm to avoid conflicts with the actual libm on Unix). I also brought along a few of the opt methods (cosf, exp, expf, logf, pow, powf, sinf, and tanf) and removed the corresponding PAL entries.

sin and tan (the double variants) do have C implementations, but depend on __amd_remainder_piby2 which is only in assembly today.

That leaves: cbrt, cbrtf, cos, fabs, fabsf, fma, fmaf, fmod, fmodf, log, log2, log2f, log10, log10f, sin, sincos, sincosf, sqrt, sqrtf, and tan that aren't using aocl-libm-ose.
Notably, fabs, fabsf, sqrt, and sqrft don't matter too much since they are always intrinsic and compiled down to a CPU instruction. fma, fmaf are also already IEEE compliant and intrinsic on modern hardware.

@tannergooding
Copy link
Member Author

(I fully expect the build will fail here and there will be iteration here to get it fully working on macOS, Unix, and ARM).


#include "clrmath.h"

float clrmath_cbrtf(float x)
Copy link
Member

@jkotas jkotas Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the C/C++ compiler able to optimize out these forwarders in all situations, and to always pick the local implementation?

It feels fragile. It may be better if the C method names for the local implementations included a prefix.

Copy link
Member Author

@tannergooding tannergooding Apr 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These forwarders only exist for the methods that are still coming from the C runtime (because they are implemented in assembly for aocl-libm-ose).
Since it's all statically linked, COMSingle::Cbrt is compiled down to call cbrtf directly (for release mode), rather than going to clrmath_cbrtf and then jumping to cbrtf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It assumes good link-time code generation that is not given, especially on some platforms targeted by Mono.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update these to just be `#define clrmath_cbrtf cbrtf.

That being said, I would be very surprised if LLVM failed to do the basic inlining optimization here.

@danmoseley
Copy link
Member

Does it matter that https://github.com/amd/aocl-libm-ose does not have its own tests? Would AMD be willing to add them?

@marek-safar
Copy link
Contributor

Who should work with Tanner on this on Mono side?

@vargaz could you please work with Tanner on MonoVM side?

/cc @lambdageek for awareness

@tannergooding
Copy link
Member Author

Does it matter that https://github.com/amd/aocl-libm-ose does not have its own tests? Would AMD be willing to add them?

The repo has a couple branches. master is not currently the latest, aocl-3.0 is and represents v3.7.

This branch does contain many of the tests and more are being added: https://github.com/amd/aocl-libm-ose/tree/aocl-3.0/gtests

@vargaz
Copy link
Contributor

vargaz commented Apr 19, 2021

Aren't these versions going to be slower than the platform default versions, i.e. those are probably optimized with assembly etc ?

@tannergooding
Copy link
Member Author

tannergooding commented Apr 19, 2021

Aren't these versions going to be slower than the platform default versions, i.e. those are probably optimized with assembly etc ?

Not all of the platform versions are implemented in assembly. On x64 Windows, we use what is effectively https://github.com/amd/win-libm in MSVCRT (aocl-libm-ose is basically the newer cross-platform version of the same library).
Likewise, on Linux the glibc implementations are most often used and are almost all in C for portability across all the platforms that they need to target.

Most of the functions aren't overly complex and really just rely on the native compiler efficiently reinterpret casting between float and uint32_t (or double and uint64_t) and so modern C/C++ compilers (particularly MSVC, GCC, and Clang) are more than able to produce quality assembly code for these implementations and running our benchmarks in dotnet/performance shows that. The JIT is likewise able to do a good job in these cases, were they to be ported to C# (even from an assembly implementation): https://gist.github.com/tannergooding/103be726d48dc3d0b09e890bad0b892f#file-perf-md

@lambdageek
Copy link
Member

I'm worried about the size impact on WebAssembly and Android if we use this in Mono. (also... we don't use LTO on most platforms - i'm unclear under what circumstances this library falls back to using the system math functions and whether its going to inhibit optmizations by calling non-intrinsic versions of them).

@lambdageek
Copy link
Member

This change would allow us to provide performant and deterministic implementation of System.Math across all our platforms without taking on the burden of creating or maintaining our own implementations.

@tannergooding I'm not sure I understand the motivation for this PR. The goal is that .NET will have a consistent behavior across platforms, even if that behavior differs between .NET and the platform math library? Shouldn't that be a new API surface, and not System.Math ? System.Math.Portable or something.

@tannergooding
Copy link
Member Author

i'm unclear under what circumstances this library falls back to using the system math functions and whether its going to inhibit optmizations by calling non-intrinsic versions of them

This only falls back to the C Runtime if a corresponding portable implementation doesn't exist in amd/aocl-libm-ose.

I'm not sure I understand the motivation for this PR. The goal is that .NET will have a consistent behavior across platforms, even if that behavior differs between .NET and the platform math library?

Yes, the intent is for .NET to provide performant and deterministic math on all platforms to help avoid issues that frequently arise from them differing between Windows, MacOS, Linux, x64, x86, ARM32, ARM64 and all valid variations there-in. Today, these differences can result in significant issues for projects like ML.NET or other math heavy libraries.

We provide no guarantee that System.Math will forward down to the C runtime and almost no guarantee on the respective accuracy or performance of these methods. The exceptions are trivial functions like FusedMultiplyAdd, Ceiling, Floor, Truncate, and some Round overloads, and Sqrt, all of which have direct hardware support on most supported platforms and which have well-known IEEE compliant implementations otherwise. For other functions, like Cos, we provide minimal guarantees on certain well-defined inputs/outputs that are explicitly required by the IEEE spec.

Shouldn't that be a new API surface, and not System.Math ? System.Math.Portable or something.

I don't believe so. The existing functions already are portable, this is just normalizing the cross-platform behavior to help improve program consistency across platforms.

whether its going to inhibit optmizations by calling non-intrinsic versions of them

This shouldn't inhibit any optimizations. Both Mono and RyuJIT need to explicitly recognize and handle functions like Math.Sqrt or Math.Round for them to be treated as intirnsics today, this PR won't change that. This will only impact functions like Cos, Sin, and Tan which don't have dedicated hardware instructions and which are calls into the C Runtime.

The implementation being normalized on is that from AMD, which is already the basis for the x64 Windows CRT implementation. The implementation is more performant in many cases than the existing x86 Windows implementation.

For Linux and Mac, they each have their own math libraries, with Linux coming from glibc (libm in particular) and being written largely in C. There are some well-known cases where the Linux implementation is slower than the Windows implementation and so this will help improve perf there.

@lambdageek
Copy link
Member

This will only impact functions like Cos, Sin, and Tan which don't have dedicated hardware instructions and which are calls into the C Runtime.

yea that's what I'm talking about.

Let's take a look at three examples: sincos, cosf and sinh.

  • clrmath_sinccosf forwards to the platform math library sincos. There's an extra indirection on every call. the forwarding function is in a separate translation unit and (since we're not using LTCG) won't be inlined.
  • cosf has an optimized implementation. May disagree with platform sincosf, but otherwise good.
  • sinh -> reference implementation. Using some table lookups. Will be a code size increase compared to calling the system libm sinh.

I don't see how this is a benefit to Android or WebAssembly users.

@tannergooding
Copy link
Member Author

clrmath_sinccosf forwards to the platform math library sincos. There's an extra indirection on every call. the forwarding function is in a separate translation unit and (since we're not using LTCG) won't be inlined.

It already may not have agreed with Sin or Cos due to differences in how standalone Sin/Cos are implemented as compared to SinCos. It is generally expected they are similar, but not all platforms work like one might expect.
APIs like sincos are also only going to forward to the CRT for another release or so. amd/aocl-libm-ose has a sincos implementation, but its in assembly. As with most of the other assembly functions, it will eventually be ported to C instead (just as Sin and Cos have been in recent releases).

Additionally, for non-Windows, we already have an indirection due to the PAL layer today. Having consistent implementations across platforms actually allows us to remove that indirection and call the functions more directly.
If not doing LTO is a concern here, then we should likely look at enabling it specifically for comfloat_wks and the corresponding mono equivalent, to ensure that the Math APIs are performant, regardless of this change.

sinh -> reference implementation. Using some table lookups. Will be a code size increase compared to calling the system libm sinh.

Yes, there will be some additional size increase to effectively duplicate lookup tables that the C Runtime may already have declared.
The benefit is that all platforms (Windows, Linux, MacOS, Android, iOS, etc) will now agree on the output of Sin(x) for a given release of .NET
All platforms will likewise have similar performance characteristics and will see the same bug fixes. If there are particular platforms where this is showing to not be optimal, then we should likely log issues on amd/aocl-libm-ose and we could then look at alternatives, as appropriate.

@danmoseley
Copy link
Member

The repo has a couple branches. master is not currently the latest, aocl-3.0 is and represents v3.7.
This branch does contain many of the tests and more are being added: https://github.com/amd/aocl-libm-ose/tree/aocl-3.0/gtests

What are AMD's plans for this code? Are they planning to maintain it, have CI, take contributions, etc? Right now the glibc etc we depend on is maintained, of course. It would be good if there was a public roadmap there.

@tannergooding
Copy link
Member Author

What are AMD's plans for this code? Are they planning to maintain it, have CI, take contributions, etc? Right now the glibc etc we depend on is maintained, of course. It would be good if there was a public roadmap there.

I can provide more details offline.

@ghost
Copy link

ghost commented May 21, 2021

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 20, 2021
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants