Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure various scalar cross platform helper APIs are handled directly as intrinsic #80789

Merged
merged 3 commits into from
Jan 27, 2023

Conversation

tannergooding
Copy link
Member

Much like with APIs directly exposed on Vector64/128/256/512, several of the APIs exposed on BitOperations are "cross platform helper APIs" and are used in various perf critical code paths.

However, unlike the vector APIs, the bit operation APIs were not directly handled as intrinsic and only ever executed a software fallback which included manual dispatch to the relevant underlying hardware intrinsics. This works well most of the time, but it does have the side effect of reducing/impacting JIT throughput, being subject to inlining heuristics, and was not able to participate in constant folding.

This PR updates those APIs to be directly imported as the relevant hardware intrinsic, when supported, and to more generally support constant folding.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 18, 2023
@ghost ghost assigned tannergooding Jan 18, 2023
@ghost
Copy link

ghost commented Jan 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Much like with APIs directly exposed on Vector64/128/256/512, several of the APIs exposed on BitOperations are "cross platform helper APIs" and are used in various perf critical code paths.

However, unlike the vector APIs, the bit operation APIs were not directly handled as intrinsic and only ever executed a software fallback which included manual dispatch to the relevant underlying hardware intrinsics. This works well most of the time, but it does have the side effect of reducing/impacting JIT throughput, being subject to inlining heuristics, and was not able to participate in constant folding.

This PR updates those APIs to be directly imported as the relevant hardware intrinsic, when supported, and to more generally support constant folding.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines 4313 to 4314
// Signed needs to throw for negative inputs, so fallback to software impl
return nullptr;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could handle this specially with a check for negative values, but it is a more complex change and so I decided to push it out to a later PR.

Comment on lines 4355 to 4482
if (compOpportunisticallyDependsOn(InstructionSet_POPCNT))
{
// Pop the value from the stack
impPopStack();

hwintrinsic = varTypeIsLong(retType) ? NI_POPCNT_X64_PopCount : NI_POPCNT_PopCount;
return gtNewScalarHWIntrinsicNode(retType, op1, hwintrinsic);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there isn't an "always available" instruction for x86/x64, we should probably import this as GenTreeIntrinsic much as happens for various Math APIs like Sin, Cos, and other APIs.

Doing so would allow us to still perform post import constant folding and then transform this back into a GT_CALL during rationalization on older hardware.

However, given it is a more complex change I opted to push it out to a later PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Arm64, we could do the same or we could add basic SIMD constant folding support for PopCount, AddAcross, and ToScalar. CreateScalarNode will already generate a GT_CNS_VEC where applicable, including post import.

@tannergooding tannergooding marked this pull request as ready for review January 20, 2023 19:11
@tannergooding
Copy link
Member Author

tannergooding commented Jan 20, 2023

Not a perfect diff due to 40 missed contexts, but still good overall and showing some TP improvements + positive diffs.

There is notably a very small +0.01% TP regression for Arm64 minopts

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib, this is ready for review.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants