Skip to content

Conversation

saucecontrol
Copy link
Member

@saucecontrol saucecontrol commented Jun 19, 2025

This is the follow-up to #114597. It addresses the invalid removal of intermediate double casts in long->double->float chains and does more cleanup/normalization in fgMorphExpandCast, removing more of the platform-specific logic.

It also adds some more acceleration to x86. With this, all integral<->floating casts are accelerated with AVX-512, and all 32-bit integral<->floating casts are accelerated with the baseline ISAs.

Example AVX-512 implementation of floating->long/ulong casts:

 G_M38981_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       sub      esp, 8
-       vzeroupper 
-						;; size=6 bbWeight=1 PerfScore 1.25
+						;; size=0 bbWeight=1 PerfScore 0.00
 G_M38981_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
-       vmovsd   xmm0, qword ptr [esp+0x0C]
-       sub      esp, 8
-       ; npt arg push 0
-       ; npt arg push 1
-       vmovsd   qword ptr [esp], xmm0
-       call     CORINFO_HELP_DBL2LNG
-       ; gcr arg pop 2
-						;; size=19 bbWeight=1 PerfScore 6.25
+       vmovsd   xmm0, qword ptr [esp+0x04]
+       vcmpsd   k1, xmm0, xmm0, 7
+       vcvttpd2qq xmm1 {k1}{z}, xmm0
+       vcmpsd   k1, xmm0, qword ptr [@RWD00], 29
+       vpcmpeqd xmm0, xmm0, xmm0
+       vpsrlq   xmm1 {k1}, xmm0, 1
+       vmovd    eax, xmm1
+       vpextrd  edx, xmm1, 1
+						;; size=51 bbWeight=1 PerfScore 18.50
 G_M38981_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      esp, 8
        ret      8
-						;; size=6 bbWeight=1 PerfScore 2.25
+						;; size=3 bbWeight=1 PerfScore 2.00
+RWD00  	dq	43E0000000000000h

Example SSE2/SSE4.1 implementation of uint->floating casts:

 G_M24329_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=3 bbWeight=1 PerfScore 0.25
 G_M24329_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {ecx}, byref
        ; byrRegs +[ecx]
-       push     0
-       ; npt arg push 0
-       push     dword ptr [ecx]
-       ; npt arg push 1
-       call     [CORINFO_HELP_LNG2DBL]
-       ; byrRegs -[ecx]
-       ; gcr arg pop 2
-       fstp     qword ptr [esp]
-       movsd    xmm0, qword ptr [esp]
+       mov      eax, dword ptr [ecx]
+       xorps    xmm0, xmm0
+       cvtsi2sd xmm0, eax
+       movaps   xmm1, xmm0
+       addsd    xmm1, qword ptr [reloc @RWD00]
+       movaps   xmm2, xmm0
+       blendvpd xmm0, xmm1
+       movsd    qword ptr [esp], xmm0
        fld      qword ptr [esp]
-						;; size=21 bbWeight=1 PerfScore 10.00
+						;; size=36 bbWeight=1 PerfScore 15.83
 G_M24329_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      esp, 8
        ret      
 						;; size=4 bbWeight=1 PerfScore 1.25
+RWD00  	dq	41F0000000000000h

Full diffs

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 19, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 19, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@saucecontrol saucecontrol marked this pull request as ready for review June 19, 2025 23:16
@saucecontrol
Copy link
Member Author

cc @tannergooding

@saucecontrol
Copy link
Member Author

The codegen for AVX-512 floating->long regressed after merging in main (likely something with #116983). I'll give it another look.

@saucecontrol
Copy link
Member Author

#117485 resolved the regression. Diffs look good again.

CorInfoType srcBaseType = CORINFO_TYPE_UNDEF;
CorInfoType dstBaseType = CORINFO_TYPE_UNDEF;

if (varTypeIsFloating(srcType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if the compiler is CSEing this check with the above check as expected?

-- Asking since manually caching might be a way to win some throughput back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for sure, but I generally assume C++ compilers will handle 'obvious' ones like this. It should be noted, the throughput hit to x86 directly correlates with the number of casts that are now inlined.

i.e. the only significant throughput hit is on the coreclr_tests collection

image

which is also the one that had the most casts in it

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍. The biggest concern is the TP hit to minopts. It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #117512 will reduce the hit a bit.

It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

This is interesting, because the same argument could apply to the complicated saturating logic that we have for x64 as well. #97529 introduced a similar throughput regression, and although it was done for correctness instead of perf, the throughput hit could have been avoided by using the helper in minopts there too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the throughput hit could have been avoided by using the helper in minopts there too.

AFAIR, the JIT throughput hit there ended up being very minimal (and often an improvement). It was the perf score and code output size that regressed, which was expected.

If there was a significant perf score hit to minopts, then yes the same would apply here and it would likely be beneficial to ensure that is doing the "better" thing as well.

@tannergooding
Copy link
Member

@saucecontrol there's some merge conflicts on this one and it is also dependent on #117571 going in first, correct?

@saucecontrol
Copy link
Member Author

My plan was to trim this down to just the new acceleration for x86, but yeah, that would build on the part split into #117571. I'll set this to draft in the meantime.

@saucecontrol saucecontrol marked this pull request as draft August 8, 2025 16:37
Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants