
Improve Math.Round, Math.ILogB, and do some minor cleanup of Half, Single, and Double #98040

Merged
merged 7 commits into from
Feb 7, 2024

Conversation

tannergooding
Member

As per the title this improves the software fallback used for Math.Round and Math.ILogB to be significantly simpler/faster.

It additionally does some cleanup of functions in Half, Single, and Double to ensure they remain performant and consistent with each other where possible.
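For context, ILogB returns the unbiased base-2 exponent of a finite, nonzero value; for a normal IEEE 754 double it can be read directly out of the exponent field, which is why a bit-based fallback can be so much simpler than a loop-based one. A minimal sketch of the idea in Python (illustrative only, not the runtime's actual implementation; infinity/NaN handling is omitted):

```python
import math
import struct

def ilogb(d: float) -> int:
    # Sketch for finite, nonzero inputs: reinterpret the double's bytes
    # as a 64-bit integer and extract the raw 11-bit exponent field.
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    biased = (bits >> 52) & 0x7FF
    if biased == 0:
        # Subnormal: the exponent field is zero, so normalize via frexp
        # instead of counting leading mantissa zeros by hand.
        return math.frexp(d)[1] - 1
    return biased - 1023  # remove the IEEE 754 double exponent bias

assert ilogb(1.0) == 0
assert ilogb(8.0) == 3
assert ilogb(0.5) == -1
assert ilogb(5e-324) == -1074  # smallest subnormal double
```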

@ghost

ghost commented Feb 6, 2024

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

As per the title this improves the software fallback used for Math.Round and Math.ILogB to be significantly simpler/faster.

It additionally does some cleanup of functions in Half, Single, and Double to ensure they remain performant and consistent with each other where possible.

Author: tannergooding
Assignees: tannergooding
Labels:

area-System.Numerics

Milestone: -

@stephentoub
Member

improves the software fallback

I still see references to Math.Round being an intrinsic. Is that only in support of cases where the argument is known constant, or are there other cases where this managed implementation isn't used?

@filipnavara
Member

Is that only in support of cases where the argument is known constant, or are there other cases where this managed implementation isn't used?

ARM64 implements it using a HW instruction.

@MichalPetryka
Contributor

Is that only in support of cases where the argument is known constant, or are there other cases where this managed implementation isn't used?

ARM64 implements it using a HW instruction.

And X86 does too for all but AwayFromZero.

@tannergooding
Member Author

Most of our platforms use an intrinsic where possible. The exact mapping of that is here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L6840

For Arm64 we can always accelerate, while for x86/x64 we can only trivially accelerate on SSE4.1-capable hardware (which is pretty much everything); we could also accelerate on SSE2 hardware via cvtss2si followed by cvtsi2ss, but it's not worth the complexity.

The software fallback is still used for indirect invocation, for certain cases involving tail calls that prevent us from optimizing it as an intrinsic, and on other platforms outside what RyuJIT supports.
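For readers unfamiliar with the midpoint modes involved: the default for Math.Round is round-half-to-even ("banker's rounding"), which is what the hardware paths and the managed fallback must agree on, while MidpointRounding.AwayFromZero is the mode x86 cannot handle with a single rounding instruction. A quick illustration of the difference (shown in Python, whose built-in round() also uses half-to-even; the away-from-zero helper is a naive sketch and can misround near-midpoint values like 0.49999999999999994):

```python
import math

# Round-half-to-even: ties go to the nearest even integer.
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2  # not 3

def round_away_from_zero(x: float) -> float:
    # Naive sketch of MidpointRounding.AwayFromZero: ties move
    # away from zero regardless of parity.
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

assert round_away_from_zero(2.5) == 3
assert round_away_from_zero(-2.5) == -3
```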

@ryujit-bot

Diff results for #98040

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch +0.01%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Details here


@ryujit-bot

Diff results for #98040

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.linux.arm64.checked.mch -0.01%

Throughput diffs for windows/arm64 ran on linux/x64

MinOpts (-0.01% to +0.00%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch -0.01%

Details here


// This is probably not worth inlining, it has branches and should be rarely called
public static unsafe bool IsSubnormal(double d)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsZero(double d)
Member

Do we need this helper if all it's doing is == 0?

Member Author

It allows the code to be consistent with Half (where == 0 is less efficient today), with Generic Math (where a user really needs to use the exposed IsZero API), and is something we should probably expose publicly anyways since it's part of the IEEE 754 Required Operations (under 5.7.2 General Operations).
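A detail worth spelling out: IEEE 754 has two zero encodings, +0.0 and -0.0, which compare equal as floating-point values but have different bit patterns, so a bit-level IsZero must mask off the sign bit rather than compare raw bits against zero. A sketch of that check in Python (illustrative, not the runtime's code):

```python
import struct

def to_bits(d: float) -> int:
    # Reinterpret a double's bytes as a 64-bit unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", d))[0]

def is_zero(d: float) -> bool:
    # Clear the sign bit, then test for an all-zero encoding.
    return (to_bits(d) & 0x7FFF_FFFF_FFFF_FFFF) == 0

assert to_bits(0.0) != to_bits(-0.0)    # the bit patterns differ...
assert is_zero(0.0) and is_zero(-0.0)   # ...but both are zero
assert not is_zero(5e-324)              # smallest subnormal is not zero
```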

Comment on lines +1493 to +1497
bits += 1;
}
else
{
bits -= 1;
Member

Nit: Curious about the reason for preferring this over ++ and -- ?

Member Author

Just my preferred style. I personally don't like x++ or x-- as a standalone expression (in loops it's fine), as I find it makes the code much harder to read and reason about. x += 1 and x -= 1 instead make it very clear what's going on and that an assignment/mutation is happening here.

int bits = BitConverter.SingleToInt32Bits(f);
return (bits & 0x7FFFFFFF) < 0x7F800000;
uint bits = BitConverter.SingleToUInt32Bits(f);
return (~bits & PositiveInfinityBits) != 0;
Member

Is this more efficient, or it's just clearer what it's doing?

Member Author
@tannergooding tannergooding Feb 7, 2024

It's a bit more efficient (both perf- and space-wise) for 64-bit, while allowing smaller code in hoistable loops for 32-bit. It also allows branching code to be a bit more efficient: instead of and; cmp; jcc, we just do andn; jcc.

Basically the old pattern was:

; 32-bit
vmovd eax, xmm0                     ;  4-bytes, 3-cycles
and   eax, 0x7FFF_FFFF              ;  5-bytes, 1-cycles
cmp   eax, 0x7F80_0000              ;  5-bytes, 1-cycles
setl  al                            ;  3-bytes, 1-cycles
movzx eax, al                       ;  3-bytes, 1-cycles
                                    ; 20-bytes, 7-cycles

; 64-bit
vmovd rax, xmm0                     ;  5-bytes, 3-cycles
mov   rcx, 0x7FFF_FFFF_FFFF_FFFF    ; 10-bytes, 1-cycles
and   rax, rcx                      ;  3-bytes, 1-cycles
mov   rcx, 0x7FF0_0000_0000_0000    ; 10-bytes, 1-cycles
cmp   rax, rcx                      ;  3-bytes, 1-cycles
setl  al                            ;  3-bytes, 1-cycles
movzx eax, al                       ;  3-bytes, 1-cycles
                                    ; 37-bytes, 9-cycles

While the new pattern can be much smaller:

; 32-bit
vmovd eax, xmm0                     ;  4-bytes, 3-cycles
mov   ecx, 0x7F80_0000              ;  5-bytes, 1-cycles
andn  eax, eax, ecx                 ;  5-bytes, 1-cycles
setne al                            ;  3-bytes, 1-cycles
movzx rax, al                       ;  3-bytes, 1-cycles
                                    ; 20-bytes, 7-cycles

; 64-bit
vmovd rax, xmm0                     ;  5-bytes, 3-cycles
mov   rcx, 0x7FF0_0000_0000_0000    ; 10-bytes, 1-cycles
andn  rax, rax, rcx                 ;  5-bytes, 1-cycles
setne al                            ;  3-bytes, 1-cycles
movzx rax, al                       ;  3-bytes, 1-cycles
                                    ; 26-bytes, 7-cycles
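To see why the two formulations agree: a float is finite iff its 8-bit exponent field is not all ones, and since the exponent bits are contiguous in the infinity mask, "(bits & 0x7FFFFFFF) < 0x7F800000" and "(~bits & 0x7F800000) != 0" both reduce to "some exponent bit is clear". A quick check over representative bit patterns, sketched in Python:

```python
# Old IsFinite check: strip the sign, compare against the infinity encoding.
old = lambda bits: (bits & 0x7FFF_FFFF) < 0x7F80_0000
# New IsFinite check: some bit of the exponent field must be clear.
new = lambda bits: (~bits & 0x7F80_0000) != 0

# Compare the two predicates across every exponent field, both signs,
# and boundary mantissas (covers zeros, subnormals, normals, inf, NaN).
for exp in range(0x100):
    for mantissa in (0, 1, 0x7F_FFFF):
        for sign in (0, 1):
            bits = (sign << 31) | (exp << 23) | mantissa
            assert old(bits) == new(bits), hex(bits)
```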

@tannergooding
Member Author

@dotnet-policy-service rerun

@tannergooding
Member Author

@dotnet-policy-bot rerun

@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2024