Ensure various scalar cross platform helper APIs are handled directly as intrinsic #80789
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
src/coreclr/jit/importercalls.cpp
// Signed needs to throw for negative inputs, so fallback to software impl
return nullptr;
We could handle this specially with a check for negative values, but it is a more complex change and so I decided to push it out to a later PR.
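As a rough illustration of the "check for negative values" idea, here is a minimal C++ sketch. It is not the JIT change itself: the comment does not name the affected API, so a floor-log2 helper is assumed purely for illustration, and `FloorLog2Unsigned` and `FloorLog2Signed` are hypothetical names. The signed entry point validates its input and then reuses the unsigned fast path.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical unsigned fast path; stands in for a helper that could be
// backed by a hardware intrinsic. Returns 0 for an input of 0 (an assumption).
inline uint32_t FloorLog2Unsigned(uint32_t value)
{
    uint32_t result = 0;
    while (value >>= 1)
    {
        ++result;
    }
    return result;
}

// Hypothetical signed wrapper: reject negative inputs first, then reuse the
// unsigned path. The thrown exception type is an assumption for this sketch.
inline uint32_t FloorLog2Signed(int32_t value)
{
    if (value < 0)
    {
        throw std::invalid_argument("value must be non-negative");
    }
    return FloorLog2Unsigned(static_cast<uint32_t>(value));
}
```

In the JIT, the equivalent would presumably be an explicit compare-and-throw emitted ahead of the intrinsic node, which is the extra complexity being deferred.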
src/coreclr/jit/importercalls.cpp
if (compOpportunisticallyDependsOn(InstructionSet_POPCNT))
{
    // Pop the value from the stack
    impPopStack();

    hwintrinsic = varTypeIsLong(retType) ? NI_POPCNT_X64_PopCount : NI_POPCNT_PopCount;
    return gtNewScalarHWIntrinsicNode(retType, op1, hwintrinsic);
}
Since there isn't an "always available" instruction for x86/x64, we should probably import this as GenTreeIntrinsic, much as happens for various Math APIs like Sin, Cos, and other APIs.
Doing so would allow us to still perform post import constant folding and then transform this back into a GT_CALL during rationalization on older hardware.
However, given it is a more complex change I opted to push it out to a later PR.
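The flow described above (import as a generic intrinsic, fold constants after import, then pick a hardware instruction or a call late) can be sketched with a toy model. This is only an illustration under stated assumptions: `Node`, `NodeKind`, `ImportPopCount`, `FoldConstants`, and `Rationalize` are hypothetical names and do not correspond to the JIT's actual GenTree classes or phases.

```cpp
#include <cstdint>

// Toy model only: these types are hypothetical, not the JIT's GenTree classes.
enum class NodeKind { Constant, GenericIntrinsic, HardwareIntrinsic, Call };

struct Node
{
    NodeKind kind;
    bool     operandIsConstant; // true when the operand value is known at JIT time
    uint32_t operand;           // operand value (meaningful only when constant)
    uint32_t value;             // folded result (meaningful only for Constant nodes)
};

// Import: always produce a generic intrinsic node, regardless of hardware support.
Node ImportPopCount(bool operandIsConstant, uint32_t operand)
{
    return Node{NodeKind::GenericIntrinsic, operandIsConstant, operand, 0};
}

// Post-import folding: replace the intrinsic with a constant when the operand is known.
Node FoldConstants(Node node)
{
    if ((node.kind != NodeKind::GenericIntrinsic) || !node.operandIsConstant)
    {
        return node;
    }

    uint32_t count = 0;
    for (uint32_t bits = node.operand; bits != 0; bits &= (bits - 1))
    {
        count++; // each iteration clears the lowest set bit
    }
    return Node{NodeKind::Constant, true, node.operand, count};
}

// Late phase: anything still generic becomes a hardware intrinsic or a call.
Node Rationalize(Node node, bool hasPopCountInstruction)
{
    if (node.kind == NodeKind::GenericIntrinsic)
    {
        node.kind = hasPopCountInstruction ? NodeKind::HardwareIntrinsic : NodeKind::Call;
    }
    return node;
}
```

The point of the design is that the folding step never needs to know whether the target supports the instruction; only the late phase does.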
For Arm64, we could do the same or we could add basic SIMD constant folding support for PopCount, AddAcross, and ToScalar. CreateScalarNode will already generate a GT_CNS_VEC where applicable, including post import.
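To make the Arm64 option concrete, the sketch below emulates in plain C++ what folding a constant through the PopCount, AddAcross, and ToScalar sequence would have to compute. It is an illustration only: `Vec64` and the four helper functions are hypothetical models of the 64-bit vector operations, not JIT or runtime code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 64-bit vector viewed as eight bytes.
using Vec64 = std::array<uint8_t, 8>;

// Models CreateScalar: value in the low lanes, remaining lanes zeroed.
inline Vec64 CreateScalarModel(uint32_t value)
{
    Vec64 result{};
    for (std::size_t i = 0; i < 4; i++)
    {
        result[i] = static_cast<uint8_t>((value >> (8 * i)) & 0xFFu);
    }
    return result;
}

// Models a per-byte population count.
inline Vec64 PopCountModel(const Vec64& input)
{
    Vec64 result{};
    for (std::size_t i = 0; i < input.size(); i++)
    {
        for (uint8_t bits = input[i]; bits != 0; bits = static_cast<uint8_t>(bits & (bits - 1)))
        {
            result[i]++;
        }
    }
    return result;
}

// Models AddAcross: the sum of all lanes lands in lane 0.
inline Vec64 AddAcrossModel(const Vec64& input)
{
    Vec64 result{};
    for (uint8_t lane : input)
    {
        result[0] = static_cast<uint8_t>(result[0] + lane);
    }
    return result;
}

// Models ToScalar: extract lane 0.
inline uint32_t ToScalarModel(const Vec64& input)
{
    return input[0];
}

// What a constant folder would evaluate for a known 32-bit input.
inline uint32_t FoldPopCountArm64Style(uint32_t value)
{
    return ToScalarModel(AddAcrossModel(PopCountModel(CreateScalarModel(value))));
}
```

Folding each of the three operations independently on constant vectors gives the same final scalar, which is why basic SIMD constant folding would be enough here.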
Not a perfect diff due to 40 missed contexts, but still good overall and showing some TP improvements + positive diffs. There is notably a very small
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop
Azure Pipelines successfully started running 3 pipeline(s).
CC. @dotnet/jit-contrib, this is ready for review.
Much like with APIs directly exposed on Vector64/128/256/512, several of the APIs exposed on BitOperations are "cross platform helper APIs" and are used in various perf critical code paths.
However, unlike the vector APIs, the bit operation APIs were not directly handled as intrinsic and only ever executed a software fallback which included manual dispatch to the relevant underlying hardware intrinsics. This works well most of the time, but it has the side effects of reducing JIT throughput, being subject to inlining heuristics, and not being able to participate in constant folding.
This PR updates those APIs to be directly imported as the relevant hardware intrinsic, when supported, and to more generally support constant folding.
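As a loose C++ analogy for why constant folding matters here (this is not the runtime's implementation, and `SoftwarePopCount` is a hypothetical name), a population count written so the compiler can evaluate it ahead of time costs nothing at run time when its input is a constant:

```cpp
#include <cstdint>

// Hypothetical portable fallback: clears the lowest set bit once per recursion
// step. Written as a single-expression constexpr function so a compiler can
// evaluate it entirely at compile time for constant inputs.
constexpr uint32_t SoftwarePopCount(uint32_t value)
{
    return (value == 0) ? 0u : 1u + SoftwarePopCount(value & (value - 1u));
}

// With a constant input the result is itself a constant: no call, no loop,
// and no instruction selection is left for run time.
static_assert(SoftwarePopCount(0xFFu) == 8u, "popcount of 0xFF folds to 8");
static_assert(SoftwarePopCount(0u) == 0u, "popcount of 0 folds to 0");
```

A managed fallback that dispatches manually to hardware intrinsics has to be inlined before the JIT can see through it, whereas a call recognized directly as an intrinsic can be folded wherever its operand is known.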