Update constant prop to only consider certain hwintrinsics #97616
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This updates constant prop of a

Before

; Method Program:Test():System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)
G_M000_IG01: ;; offset=0x0000
vzeroupper
G_M000_IG02: ;; offset=0x0003
vmovups xmm0, xmmword ptr [reloc @RWD00]
vpalignr xmm0, xmm0, xmmword ptr [reloc @RWD00], 12
vpor xmm0, xmm0, xmmword ptr [reloc @RWD00]
vmovups xmm1, xmmword ptr [reloc @RWD00]
vpalignr xmm1, xmm1, xmmword ptr [reloc @RWD00], 8
vpor xmm0, xmm0, xmm1
vmovups xmm1, xmmword ptr [reloc @RWD00]
vpalignr xmm1, xmm1, xmmword ptr [reloc @RWD00], 4
vpor xmm0, xmm0, xmm1
vmovups xmmword ptr [rcx], xmm0
mov rax, rcx
G_M000_IG03: ;; offset=0x0050
ret
RWD00 dq FFFFFFFF0C080400h, FFFFFFFFFFFFFFFFh
; Total bytes of code: 81

After

; Method Program:Test():System.Runtime.Intrinsics.Vector128`1[ubyte] (FullOpts)
G_M18664_IG01: ;; offset=0x0000
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M18664_IG02: ;; offset=0x0003
vmovups xmm0, xmmword ptr [reloc @RWD00]
vpalignr xmm1, xmm0, xmm0, 12
vpalignr xmm2, xmm0, xmm0, 8
vmovaps xmm3, xmm0
vpternlogd xmm3, xmm1, xmm2, -2
vpalignr xmm0, xmm0, xmm0, 4
vpor xmm0, xmm3, xmm0
vmovups xmmword ptr [rcx], xmm0
mov rax, rcx
;; size=48 bbWeight=1 PerfScore 9.33
G_M18664_IG03: ;; offset=0x0033
ret
;; size=1 bbWeight=1 PerfScore 1.00
RWD00 dq FFFFFFFF0C080400h, FFFFFFFFFFFFFFFFh
; Total bytes of code: 52
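
The vpternlogd with immediate -2 in the After listing is what replaces two of the vpor instructions. As a rough illustration (a scalar model written for this note, not JIT code; the helper name ternlog and the two constants are made up here), the immediate acts as an 8-entry truth table indexed by the three input bits, so -2 (0xFE) is a three-way OR and the -106 (0x96) that appears in the x64 diffs later in this thread is a three-way XOR:

```cpp
#include <cstdint>

// Scalar model of the VPTERNLOG* semantics for one 32-bit lane.
// For each bit position, the three input bits (a, b, c) form the index
// (a << 2) | (b << 1) | c into the 8-bit immediate, and the selected
// immediate bit becomes the result bit.
inline uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit)
    {
        uint32_t idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                       ((c >> bit) & 1u);
        result |= ((static_cast<uint32_t>(imm) >> idx) & 1u) << bit;
    }
    return result;
}

// The disassembly prints the immediate as a signed byte.
constexpr uint8_t kOr3  = static_cast<uint8_t>(-2);   // 0xFE: a | b | c
constexpr uint8_t kXor3 = static_cast<uint8_t>(-106); // 0x96: a ^ b ^ c
```

With the constant held in a register, the three-way form can take register operands directly, which is part of why the After listing is shorter.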
This also allows a workaround for cases like #76067, as the user can manually hoist the constant into a local:

var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);
var zero = Vector256<short>.Zero;
var low = Avx2.UnpackLow(byteVector, zero.AsByte());
var high = Avx2.UnpackHigh(byteVector, zero.AsByte());
var added = Avx2.Add(low.AsInt16(), high.AsInt16());
added = Avx2.HorizontalAdd(added, zero);
added = Avx2.HorizontalAdd(added, zero);
return Avx2.HorizontalAdd(added, zero);

which generates:

; Method Program:Test2(byref):System.Runtime.Intrinsics.Vector256`1[short] (FullOpts)
G_M36983_IG01: ;; offset=0x0000
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M36983_IG02: ;; offset=0x0003
vxorps ymm0, ymm0, ymm0
vmovups ymm1, ymmword ptr [rdx]
vpunpcklbw ymm2, ymm1, ymm0
vpunpckhbw ymm1, ymm1, ymm0
vpaddw ymm1, ymm2, ymm1
vphaddw ymm1, ymm1, ymm0
vphaddw ymm1, ymm1, ymm0
vphaddw ymm0, ymm1, ymm0
vmovups ymmword ptr [rcx], ymm0
mov rax, rcx
;; size=42 bbWeight=1 PerfScore 15.92
G_M36983_IG03: ;; offset=0x002D
vzeroupper
ret
;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 49

It may also open the opportunity to decide to CSE
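
For readers less familiar with these instructions, here is a scalar model of what the Test2 snippet computes. It is only a sketch written for this note: the type aliases and function names are invented, and it assumes the usual per-128-bit-lane behavior of vpunpcklbw, vpunpckhbw and vphaddw when the second operand is zero.

```cpp
#include <array>
#include <cstdint>

using Bytes = std::array<uint8_t, 32>;  // models Vector256<byte>
using Words = std::array<uint16_t, 16>; // models the bits of Vector256<short>

// vpunpcklbw / vpunpckhbw against zero: per 128-bit lane, zero-extend the
// low (or high) 8 bytes of that lane to 16-bit elements.
inline Words unpackWithZero(const Bytes& v, bool high)
{
    Words r{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 8; ++i)
            r[lane * 8 + i] = v[lane * 16 + (high ? 8 : 0) + i];
    return r;
}

// vphaddw against zero: per lane, the low four words are pairwise sums of
// the first operand and the high four words (from the zero operand) are 0.
inline Words haddWithZero(const Words& a)
{
    Words r{};
    for (int lane = 0; lane < 2; ++lane)
        for (int i = 0; i < 4; ++i)
            r[lane * 8 + i] = static_cast<uint16_t>(a[lane * 8 + 2 * i] + a[lane * 8 + 2 * i + 1]);
    return r;
}

// Mirrors the statement order of the C# snippet.
inline Words test2Model(const Bytes& v)
{
    Words low  = unpackWithZero(v, false);
    Words high = unpackWithZero(v, true);
    Words added{};
    for (int i = 0; i < 16; ++i)
        added[i] = static_cast<uint16_t>(low[i] + high[i]);
    added = haddWithZero(added);
    added = haddWithZero(added);
    return haddWithZero(added);
}
```

Under this model, element 0 of each 128-bit lane ends up holding the sum of that lane's 16 bytes and every other element is zero. The zero vector is an operand of five of the six intrinsic calls, which is why keeping it in one register matters.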
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,259,628 contexts (1,008,044 MinOpts, 1,251,584 FullOpts); MISSED contexts: 1 (0.00%). Overall (+46,644 bytes), FullOpts (+46,644 bytes)
linux/x64 ran on windows/x64: based on 2,249,837 contexts (981,298 MinOpts, 1,268,539 FullOpts). Overall (+12,283 bytes), FullOpts (+12,283 bytes)
osx/arm64 ran on windows/x64: based on 2,029,495 contexts (927,368 MinOpts, 1,102,127 FullOpts). Overall (+39,960 bytes), FullOpts (+39,960 bytes)
windows/arm64 ran on windows/x64: based on 2,070,988 contexts (937,853 MinOpts, 1,133,135 FullOpts); MISSED contexts: 1 (0.00%). Overall (+37,060 bytes), FullOpts (+37,060 bytes)
windows/x64 ran on windows/x64: based on 2,098,663 contexts (926,221 MinOpts, 1,172,442 FullOpts); MISSED contexts: 1 (0.00%). Overall (+28,074 bytes), FullOpts (+28,074 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.01% to +0.02%), MinOpts (-0.01% to +0.00%), FullOpts (+0.02%)
linux/x64 ran on windows/x64: Overall (+0.01% to +0.02%), FullOpts (+0.02%)
osx/arm64 ran on windows/x64: Overall (+0.01% to +0.03%), FullOpts (+0.02% to +0.03%)
windows/arm64 ran on windows/x64: Overall (+0.01% to +0.02%), FullOpts (+0.02%)
windows/x64 ran on windows/x64: Overall (+0.01% to +0.03%), FullOpts (+0.02% to +0.04%)
linux/arm64 ran on linux/x64: Overall (+0.00% to +0.01%), FullOpts (+0.01%)
linux/x64 ran on linux/x64: Overall (+0.00% to +0.01%), FullOpts (+0.01%)
windows/x86 ran on windows/x86: Overall (+0.01%), FullOpts (+0.01%)
Diff results for #97616

Assembly diffs

windows/x86 ran on windows/x86: based on 2,291,563 contexts (838,165 MinOpts, 1,453,398 FullOpts). Overall (+5,650 bytes), FullOpts (+5,650 bytes)
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,259,628 contexts (1,008,044 MinOpts, 1,251,584 FullOpts); MISSED contexts: 1 (0.00%). Overall (+46,644 bytes), FullOpts (+46,644 bytes)
linux/x64 ran on windows/x64: based on 2,249,837 contexts (981,298 MinOpts, 1,268,539 FullOpts). Overall (+12,283 bytes), FullOpts (+12,283 bytes)
osx/arm64 ran on windows/x64: based on 2,029,495 contexts (927,368 MinOpts, 1,102,127 FullOpts). Overall (+39,960 bytes), FullOpts (+39,960 bytes)
windows/arm64 ran on windows/x64: based on 2,070,988 contexts (937,853 MinOpts, 1,133,135 FullOpts); MISSED contexts: 1 (0.00%). Overall (+37,060 bytes), FullOpts (+37,060 bytes)
windows/x64 ran on windows/x64: based on 2,098,663 contexts (926,221 MinOpts, 1,172,442 FullOpts); MISSED contexts: 1 (0.00%). Overall (+28,074 bytes), FullOpts (+28,074 bytes)
windows/x86 ran on windows/x86: based on 2,291,563 contexts (838,165 MinOpts, 1,453,398 FullOpts). Overall (+5,650 bytes), FullOpts (+5,650 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
linux/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
osx/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.07%)
windows/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
windows/x64 ran on windows/x64: Overall (+0.04% to +0.07%), FullOpts (+0.06% to +0.08%)
linux/arm64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/x64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/arm ran on windows/x86: Overall (+0.00% to +0.01%), FullOpts (+0.00% to +0.01%)
windows/x86 ran on windows/x86: Overall (+0.03% to +0.04%), FullOpts (+0.03% to +0.04%)
Diff results for #97616

Assembly diffs

osx/arm64 ran on linux/x64: based on 2,029,495 contexts (927,368 MinOpts, 1,102,127 FullOpts). Overall (+5,328 bytes), FullOpts (+5,328 bytes)
windows/arm64 ran on linux/x64: based on 2,070,988 contexts (937,853 MinOpts, 1,133,135 FullOpts); MISSED contexts: 1 (0.00%). Overall (+5,088 bytes), FullOpts (+5,088 bytes)
windows/x64 ran on linux/x64: based on 2,098,663 contexts (926,221 MinOpts, 1,172,442 FullOpts); MISSED contexts: 1 (0.00%). Overall (+1,740 bytes), FullOpts (+1,740 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
linux/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
osx/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
windows/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
windows/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
linux/arm ran on linux/x86: Overall (+0.00% to +0.01%), FullOpts (+0.00% to +0.01%)
windows/x86 ran on linux/x86: Overall (+0.03% to +0.04%), FullOpts (+0.03% to +0.04%)
CC. @dotnet/jit-contrib

Diffs show a good number of wins, especially in the hot parts of code, due to the reused constants being hoisted. There are, however, some regressions as well. These appear to be caused mostly by different register selection, which produces small size regressions that are still cheaper execution-wise. I think overall this is a net win for users in typical SIMD code (except when a call exists), and the remaining issues are known general cases that exist for all SIMD code.

Improvements

For example on Arm64:

- sub v20.16b, v16.16b, v20.16b
- ldr q21, [@RWD00]
- sub v21.16b, v17.16b, v21.16b
- ldr q22, [@RWD00]
- sub v22.16b, v18.16b, v22.16b
- ldr q23, [@RWD00]
- sub v23.16b, v19.16b, v23.16b
- ldr q24, [@RWD16]
- cmgt v20.16b, v24.16b, v20.16b
- ldr q24, [@RWD16]
- cmgt v21.16b, v24.16b, v21.16b
- ldr q24, [@RWD16]
- cmgt v22.16b, v24.16b, v22.16b
- ldr q24, [@RWD16]
- cmgt v23.16b, v24.16b, v23.16b
- ldr q24, [@RWD32]
- and v20.16b, v20.16b, v24.16b
- ldr q24, [@RWD32]
- and v21.16b, v21.16b, v24.16b
- ldr q24, [@RWD32]
- and v22.16b, v22.16b, v24.16b
- ldr q24, [@RWD32]
- and v23.16b, v23.16b, v24.16b
- eor v16.16b, v16.16b, v20.16b
- eor v17.16b, v17.16b, v21.16b
- eor v18.16b, v18.16b, v22.16b
- eor v19.16b, v19.16b, v23.16b
+ ldr q21, [@RWD16]
+ ldr q22, [@RWD32]
+ sub v23.16b, v16.16b, v20.16b
+ sub v24.16b, v17.16b, v20.16b
+ sub v25.16b, v18.16b, v20.16b
+ sub v20.16b, v19.16b, v20.16b
+ cmgt v23.16b, v21.16b, v23.16b
+ cmgt v24.16b, v21.16b, v24.16b
+ cmgt v25.16b, v21.16b, v25.16b
+ cmgt v20.16b, v21.16b, v20.16b
+ and v21.16b, v23.16b, v22.16b
+ and v23.16b, v24.16b, v22.16b
+ and v24.16b, v25.16b, v22.16b
+ and v20.16b, v20.16b, v22.16b
+ eor v16.16b, v16.16b, v21.16b
+ eor v17.16b, v17.16b, v23.16b
+ eor v18.16b, v18.16b, v24.16b
+ eor v19.16b, v19.16b, v20.16b

or on x64:

- vpclmulqdq xmm4, xmm0, xmmword ptr [reloc @RWD16], 17
- vpclmulqdq xmm0, xmm0, xmmword ptr [reloc @RWD16], 0
- vpternlogq xmm1, xmm4, xmm0, -106
+ vmovups xmm4, xmmword ptr [reloc @RWD16]
+ vpclmulqdq xmm5, xmm0, xmm4, 17
+ vpclmulqdq xmm0, xmm0, xmm4, 0
+ vpternlogq xmm1, xmm5, xmm0, -106
vmovaps xmm0, xmm1
- vpclmulqdq xmm1, xmm0, xmmword ptr [reloc @RWD16], 17
- vpclmulqdq xmm0, xmm0, xmmword ptr [reloc @RWD16], 0
+ vpclmulqdq xmm1, xmm0, xmm4, 17
+ vpclmulqdq xmm0, xmm0, xmm4, 0
vpternlogq xmm2, xmm1, xmm0, -106
vmovaps xmm0, xmm2
- vpclmulqdq xmm1, xmm0, xmmword ptr [reloc @RWD16], 17
- vpclmulqdq xmm0, xmm0, xmmword ptr [reloc @RWD16], 0
+ vpclmulqdq xmm1, xmm0, xmm4, 17
+ vpclmulqdq xmm0, xmm0, xmm4, 0

Regressions

For example, on Arm64 the register selection causes us to shuffle data around more (note the move sequence at the bottom):

- umin v0.8h, v1.8h, v0.8h
- ldr q1, [@RWD00]
+ umin v1.8h, v1.8h, v0.8h
ldr q2, [fp, #0x30] // [V27 tmp25]
- umin v1.8h, v2.8h, v1.8h
- ldr q2, [@RWD00]
+ umin v0.8h, v2.8h, v0.8h
+ mov v2.16b, v0.16b
+ ldr q0, [@RWD00]
ldr q3, [fp, #0x20] // [V28 tmp26]
- umin v2.8h, v3.8h, v2.8h
- ldr q3, [@RWD00]
+ umin v3.8h, v3.8h, v0.8h
ldr q16, [fp, #0x10] // [V29 tmp27]
- umin v3.8h, v16.8h, v3.8h
+ umin v0.8h, v16.8h, v0.8h
+ mov v16.16b, v0.16b
+ mov v0.16b, v1.16b
+ mov v1.16b, v2.16b
+ mov v2.16b, v3.16b
+ mov v3.16b, v16.16b

Similarly on x64 we have cases where we'll avoid propagating because we don't take into account

+ vextractf128 xmm7, ymm6, 1
call [<unknown method>]
; gcr arg pop 0
- vmovups ymm0, ymmword ptr [rsp+0x40]
- vaddpd ymm0, ymm0, qword ptr [reloc @RWD08] {1to4}
- vmovups ymm1, ymmword ptr [reloc @RWD32]
- vdivpd ymm0, ymm1, ymm0
+ vinsertf128 ymm6, ymm6, xmm7, 1
+ vaddpd ymm0, ymm6, ymmword ptr [rsp+0x40]
+ vdivpd ymm0, ymm6, ymm0
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,259,628 contexts (1,008,044 MinOpts, 1,251,584 FullOpts); MISSED contexts: 1 (0.00%). Overall (+12,248 bytes), FullOpts (+12,248 bytes)
linux/x64 ran on windows/x64: based on 2,249,837 contexts (981,298 MinOpts, 1,268,539 FullOpts). Overall (-6,269 bytes), FullOpts (-6,269 bytes)
windows/x86 ran on windows/x86: based on 2,291,563 contexts (838,165 MinOpts, 1,453,398 FullOpts). Overall (-5,513 bytes), FullOpts (-5,513 bytes)

Throughput diffs

linux/arm64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/x64 ran on linux/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
osx/arm64 ran on linux/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
windows/arm64 ran on linux/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
windows/x64 ran on linux/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,259,628 contexts (1,008,044 MinOpts, 1,251,584 FullOpts); MISSED contexts: 1 (0.00%). Overall (+12,248 bytes), FullOpts (+12,248 bytes)
linux/x64 ran on windows/x64: based on 2,249,837 contexts (981,298 MinOpts, 1,268,539 FullOpts). Overall (-6,269 bytes), FullOpts (-6,269 bytes)
osx/arm64 ran on windows/x64: based on 2,029,495 contexts (927,368 MinOpts, 1,102,127 FullOpts). Overall (+5,328 bytes), FullOpts (+5,328 bytes)
windows/arm64 ran on windows/x64: based on 2,070,988 contexts (937,853 MinOpts, 1,133,135 FullOpts); MISSED contexts: 1 (0.00%). Overall (+5,088 bytes), FullOpts (+5,088 bytes)
windows/x64 ran on windows/x64: based on 2,098,663 contexts (926,221 MinOpts, 1,172,442 FullOpts); MISSED contexts: 1 (0.00%). Overall (+1,740 bytes), FullOpts (+1,740 bytes)
windows/x86 ran on windows/x86: based on 2,291,563 contexts (838,165 MinOpts, 1,453,398 FullOpts). Overall (-5,513 bytes), FullOpts (-5,513 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
linux/arm ran on windows/x86: Overall (+0.00% to +0.01%), FullOpts (+0.00% to +0.01%)
windows/x86 ran on windows/x86: Overall (+0.03% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/arm64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/x64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,259,470 contexts (1,008,044 MinOpts, 1,251,426 FullOpts); MISSED contexts: 159 (0.01%). Overall (+12,248 bytes), FullOpts (+12,248 bytes)
linux/x64 ran on windows/x64: based on 2,249,703 contexts (981,298 MinOpts, 1,268,405 FullOpts); MISSED contexts: 134 (0.01%). Overall (-6,226 bytes), FullOpts (-6,226 bytes)
osx/arm64 ran on windows/x64: based on 2,029,386 contexts (927,368 MinOpts, 1,102,018 FullOpts); MISSED contexts: 109 (0.01%). Overall (+5,328 bytes), FullOpts (+5,328 bytes)
windows/arm64 ran on windows/x64: based on 2,070,850 contexts (937,853 MinOpts, 1,132,997 FullOpts); MISSED contexts: 139 (0.01%). Overall (+5,088 bytes), FullOpts (+5,088 bytes)
windows/x64 ran on windows/x64: based on 2,098,526 contexts (926,221 MinOpts, 1,172,305 FullOpts); MISSED contexts: 138 (0.01%). Overall (+1,740 bytes), FullOpts (+1,740 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
linux/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
osx/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
windows/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
windows/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
linux/arm ran on windows/x86: Overall (+0.00% to +0.01%), FullOpts (+0.00% to +0.01%)
windows/x86 ran on windows/x86: Overall (+0.03% to +0.04%), FullOpts (+0.03% to +0.04%)
Diff results for #97616

Assembly diffs

windows/x86 ran on windows/x86: based on 2,290,755 contexts (838,165 MinOpts, 1,452,590 FullOpts); MISSED contexts: 808 (0.04%). Overall (-5,499 bytes), FullOpts (-5,499 bytes)

Throughput diffs

linux/arm64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/x64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
Diff results for #97616

Throughput diffs

linux/x64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
Resolved merge conflict. This should still be ready-for-review.
Diff results for #97616

Throughput diffs

windows/x86 ran on linux/x86: Overall (+0.03% to +0.04%), FullOpts (+0.03% to +0.04%)
osx/arm64 ran on linux/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
Diff results for #97616

Assembly diffs

linux/arm64 ran on windows/x64: based on 2,507,317 contexts (1,007,092 MinOpts, 1,500,225 FullOpts); MISSED contexts: 1 (0.00%). Overall (+12,788 bytes), FullOpts (+12,788 bytes)
linux/x64 ran on windows/x64: based on 2,517,908 contexts (991,070 MinOpts, 1,526,838 FullOpts); MISSED contexts: 1 (0.00%). Overall (-6,152 bytes), FullOpts (-6,152 bytes)
osx/arm64 ran on windows/x64: based on 2,270,868 contexts (932,669 MinOpts, 1,338,199 FullOpts); MISSED contexts: 2 (0.00%). Overall (+5,536 bytes), FullOpts (+5,536 bytes)
windows/arm64 ran on windows/x64: based on 2,341,108 contexts (938,449 MinOpts, 1,402,659 FullOpts); MISSED contexts: 9 (0.00%). Overall (+4,848 bytes), FullOpts (+4,848 bytes)
windows/x64 ran on windows/x64: based on 2,512,209 contexts (997,391 MinOpts, 1,514,818 FullOpts); MISSED contexts: 3 (0.00%). Overall (+1,801 bytes), FullOpts (+1,801 bytes)
windows/x86 ran on windows/x86: based on 2,293,451 contexts (839,658 MinOpts, 1,453,793 FullOpts); MISSED contexts: 45 (0.00%). Overall (-5,793 bytes), FullOpts (-5,793 bytes)

Throughput diffs

linux/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.05% to +0.06%)
linux/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
windows/arm64 ran on windows/x64: Overall (+0.03% to +0.06%), MinOpts (-0.01% to +0.00%), FullOpts (+0.05% to +0.06%)
windows/x64 ran on windows/x64: Overall (+0.03% to +0.06%), FullOpts (+0.06%)
linux/arm ran on windows/x86: FullOpts (+0.00% to +0.01%)
linux/arm64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
linux/x64 ran on linux/x64: Overall (+0.02% to +0.04%), FullOpts (+0.03% to +0.04%)
LGTM.
SSA use count can be an over-estimate, but that should only lead to missed opportunities to propagate, which seems fine.
This resolves #97046 by updating constant prop of a LclVar into a HWIntrinsic node to only happen if the constant can actually be consumed directly.

HWIntrinsic nodes are special in many ways and the number of transforms/optimizations we do to them is incredibly limited. Outside of some minimal constant folding done in ValueNum and some minor transforms done in Morph, these nodes are effectively left alone until lowering. So by only propagating constants into them when we know that another phase might be able to take advantage of it, we can significantly improve the codegen in cases where a user has manually CSE'd such a constant themselves.
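
To make the rule in the description concrete, here is a minimal sketch under stated assumptions. It is not the code from this change: the struct, its fields and the function name are hypothetical, and how the SSA use count mentioned in the review combines with the consumer check is a guess made for illustration only.

```cpp
#include <cstdint>

// Hypothetical, heavily simplified stand-in for the information the JIT has
// about one use of a local that holds a vector constant. None of these names
// come from the actual change.
struct ConstUse
{
    uint32_t ssaUseCount;           // may over-estimate the real number of uses
    bool     consumerIsHWIntrinsic; // is the using node a HWIntrinsic?
    bool     consumerCanUseConst;   // could a later phase take advantage of the constant?
};

// Returns true when the constant should replace the local at this use.
inline bool shouldPropagate(const ConstUse& use)
{
    if (!use.consumerIsHWIntrinsic)
        return true; // assumed: other node kinds keep their existing behavior

    if (use.consumerCanUseConst)
        return true; // the intrinsic can consume the constant directly

    // Assumed: with a single use nothing is shared, so substituting the
    // constant cannot duplicate it. With several uses, leave the local alone
    // so the value can stay in one register.
    return use.ssaUseCount <= 1;
}
```

In this model an over-estimated use count can only turn a true result into false, so the worst case is a skipped propagation, which matches the review remark that the over-estimate should only lead to missed opportunities.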