Improve SpanHelpers.ClearWithReferences for arm64 #93346
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details

Follow up to #93214, now for Span<GC>.Clear().

I noticed that the existing implementation is slightly sub-optimal on arm: it relies on complex addressing modes and is not aligned to 16 bytes. It turns out that 16-byte alignment makes a lot of sense even if we don't use SIMD - stp (two 8-byte stores) benefits from it as well.

Standalone benchmark (for simpler tests)

Apple M2 Max (also tested on Ampere):

Ryzen 7950X:

For x64, it slightly improves small inputs. It seems that if we pin the input and use SIMD we can see a >2x improvement for large inputs, but I was not sure about doing pinning here (from the GC's point of view, atomicity is fine) and left it as is, cc @jkotas
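For illustration, here is a minimal sketch of the "pin + SIMD" idea mentioned above; the PR itself deliberately does not pin. The class and method names are made up, IntPtr is used as a stand-in for pointer-sized reference slots, and whether 16-byte zero stores are acceptable for reference fields is exactly the atomicity/pinning question raised in the description.

using System;
using System.Runtime.Intrinsics;

static unsafe class ClearWithSimdSketch // hypothetical name
{
    public static void Clear(Span<IntPtr> span)
    {
        // Pin so raw 16-byte stores cannot race with the GC moving the buffer.
        fixed (IntPtr* p = span)
        {
            byte* b = (byte*)p;
            nuint byteLen = (nuint)span.Length * (nuint)sizeof(IntPtr);
            nuint i = 0;
            // Main loop: 32 bytes per iteration via two 16-byte zero stores.
            for (; i + 32 <= byteLen; i += 32)
            {
                Vector128<byte>.Zero.Store(b + i);
                Vector128<byte>.Zero.Store(b + i + 16);
            }
            // Clear any remaining pointer-sized slots one by one.
            for (; i < byteLen; i += (nuint)sizeof(IntPtr))
                *(IntPtr*)(b + i) = IntPtr.Zero;
        }
    }
}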
Write1:
    Debug.Assert(pointerSizeLength >= 1);
    // This switch is expected to be lowered to a jump table
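For context, a minimal C# sketch of the remainder dispatch this comment refers to. It is not the PR's actual diff (ClearRemainder and the goto-case chaining are illustrative only): a switch over the 0..8 leftover pointer-sized slots that the JIT is expected to lower to a single jump table. The arm64 codegen later in this thread shows the cases fully unrolled rather than chained.

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class JumpTableSketch // hypothetical name
{
    // Clears 'count' pointer-sized slots starting at 'ip'; count must be in [0, 8].
    static void ClearRemainder(ref IntPtr ip, uint count)
    {
        Debug.Assert(count <= 8);
        // This switch is expected to be lowered to a jump table:
        // one indirect branch instead of a tree of conditional branches.
        switch (count)
        {
            case 8: Unsafe.Add(ref ip, 7) = default; goto case 7;
            case 7: Unsafe.Add(ref ip, 6) = default; goto case 6;
            case 6: Unsafe.Add(ref ip, 5) = default; goto case 5;
            case 5: Unsafe.Add(ref ip, 4) = default; goto case 4;
            case 4: Unsafe.Add(ref ip, 3) = default; goto case 3;
            case 3: Unsafe.Add(ref ip, 2) = default; goto case 2;
            case 2: Unsafe.Add(ref ip, 1) = default; goto case 1;
            case 1: ip = default; break;
            case 0: break;
        }
    }
}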
This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in the managed memcpy originally, but it was later removed.
Should I revert to the previous shape? I was mostly interested in the 16-byte alignment (it's the reason I am seeing nice results on arm64) and in a slightly different addressing mode in the main loop, so that it can be folded into stp.
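Roughly what "foldable into stp" means here, as a hedged sketch (ClearMainLoop is a made-up name, not the PR's code): the main loop stores pairs of zeros at constant offsets from a pointer that advances once per iteration, so each adjacent pair of 8-byte stores can be emitted as a single 16-byte stp. The previous shape instead recomputed end-relative addresses for every store (the add x3, x0, x2 / str xzr, [x3, #-0x08] pattern visible in the old codegen below).

using System;
using System.Runtime.CompilerServices;

static class StpFoldingSketch // hypothetical name
{
    // Clears pointerSizeLength/8 blocks of 8 pointer-sized slots (64 bytes each).
    static void ClearMainLoop(ref IntPtr ip, nuint pointerSizeLength)
    {
        nuint blocks = pointerSizeLength >> 3;
        while (blocks != 0)
        {
            // Adjacent zero stores at constant offsets 0..7 from 'ip';
            // each pair can fold into one stp xzr, xzr instruction.
            Unsafe.Add(ref ip, 0) = default; Unsafe.Add(ref ip, 1) = default;
            Unsafe.Add(ref ip, 2) = default; Unsafe.Add(ref ip, 3) = default;
            Unsafe.Add(ref ip, 4) = default; Unsafe.Add(ref ip, 5) = default;
            Unsafe.Add(ref ip, 6) = default; Unsafe.Add(ref ip, 7) = default;
            ip = ref Unsafe.Add(ref ip, 8); // advance by 64 bytes
            blocks--;
        }
    }
}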
Btw, presumably, PGO can re-shuffle the switch (or convert it to if-else) based on the profile. However, in the real world it likely has a mixed profile.
Should I revert to the previous shape?
Yes, unless you can prove that the switch with jump table is better for real workloads.
For reference, dotnet/coreclr#9786 is the PR that got rid of the switch from memory copy. It went through extensive validation.
This works great for microbenchmarks, but it works poorly for real workloads where the jump target is poorly predicted. We used to have a switch table like this in the managed memcpy originally, but it was later removed.
@jkotas, I'm not sure what you mean here; could you elaborate?
The hardware optimization manuals and the underlying native memcpy algorithms across all 3 major implementations (MSVC, GCC, Clang; for x64 and Arm64) use/recommend a jump table explicitly for this reason: an unconditional, unpredictable branch is better than a tight tree of conditional, still-unpredictable branches.
The premise is that a fallthrough jump table allows a tight, unrolled handler for the trailing elements: a single unpredicted branch decides which of the immediately following, equidistant switch cases to jump to, and those cases are often already in the instruction/decode cache because they sit in the same or the following 32-byte code window.
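For contrast, here is a rough reconstruction of the "previous shape" being debated: the trailing slots handled by a small tree of conditional branches with overlapping stores instead of a jump table. This is inferred from the old arm64 codegen later in this thread, not copied from the actual runtime source, and BranchTreeSketch/ClearRemainder are made-up names.

using System;
using System.Runtime.CompilerServices;

static class BranchTreeSketch // hypothetical name
{
    // Clears 'count' pointer-sized slots (count in [0, 7]) using range checks
    // and overlapping stores rather than a switch/jump table.
    static void ClearRemainder(ref IntPtr ip, nuint count)
    {
        if (count >= 4)
        {
            Unsafe.Add(ref ip, 2) = default;
            Unsafe.Add(ref ip, 3) = default;
            Unsafe.Add(ref ip, (int)count - 3) = default;
            Unsafe.Add(ref ip, (int)count - 2) = default;
        }
        if (count >= 2)
        {
            Unsafe.Add(ref ip, 1) = default;
            Unsafe.Add(ref ip, (int)count - 1) = default;
        }
        if (count != 0)
            ip = default;
    }
}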
dotnet/coreclr#9786 was done by Intel engineers to address bottlenecks observed in real workloads. Yes, it goes contrary to the guidance you have quoted.
Possible explanation: conditional jumps and indirect jumps use two different predictors. The relative performance of conditional vs. indirect jumps depends on the relative pressure on each predictor. .NET code tends to have more indirect jumps than average code, which means the indirect jump predictor sees more pressure in .NET code, so it is more profitable not to depend on the indirect jump predictor in the .NET memcpy implementation.
Codegen on arm64:

New codegen:
; Method Prog:ClearWithReferences2(byref,ulong) (FullOpts)
G_M64195_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=8 PerfScore 12.00
G_M64195_IG02: ;; offset=0x0008
mov x2, x1
cmp x2, #8
bhi G_M64195_IG12
;; size=12 bbWeight=8 PerfScore 16.00
G_M64195_IG03: ;; offset=0x0014
cmp w2, #8
bhi G_M64195_IG12
mov w1, w2
adr x2, [@RWD00]
ldr w2, [x2, x1, LSL #2]
adr x3, [G_M64195_IG02]
add x2, x2, x3
br x2
;; size=32 bbWeight=4 PerfScore 30.00
G_M64195_IG04: ;; offset=0x0034
str xzr, [x0]
b G_M64195_IG15
align [0 bytes for IG13]
align [0 bytes]
align [0 bytes]
align [0 bytes]
;; size=8 bbWeight=0.50 PerfScore 1.00
G_M64195_IG05: ;; offset=0x003C
stp xzr, xzr, [x0]
b G_M64195_IG15
;; size=8 bbWeight=0.50 PerfScore 1.00
G_M64195_IG06: ;; offset=0x0044
stp xzr, xzr, [x0]
str xzr, [x0, #0x10]
b G_M64195_IG15
;; size=12 bbWeight=0.50 PerfScore 1.50
G_M64195_IG07: ;; offset=0x0050
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
b G_M64195_IG15
;; size=12 bbWeight=0.50 PerfScore 1.50
G_M64195_IG08: ;; offset=0x005C
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
str xzr, [x0, #0x20]
b G_M64195_IG15
;; size=16 bbWeight=0.50 PerfScore 2.00
G_M64195_IG09: ;; offset=0x006C
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
stp xzr, xzr, [x0, #0x20]
b G_M64195_IG15
;; size=16 bbWeight=0.50 PerfScore 2.00
G_M64195_IG10: ;; offset=0x007C
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
stp xzr, xzr, [x0, #0x20]
str xzr, [x0, #0x30]
b G_M64195_IG15
;; size=20 bbWeight=0.50 PerfScore 2.50
G_M64195_IG11: ;; offset=0x0090
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
stp xzr, xzr, [x0, #0x20]
stp xzr, xzr, [x0, #0x30]
b G_M64195_IG15
;; size=20 bbWeight=0.50 PerfScore 2.50
G_M64195_IG12: ;; offset=0x00A4
lsr x2, x1, #3
;; size=4 bbWeight=4 PerfScore 4.00
G_M64195_IG13: ;; offset=0x00A8
stp xzr, xzr, [x0]
stp xzr, xzr, [x0, #0x10]
stp xzr, xzr, [x0, #0x20]
stp xzr, xzr, [x0, #0x30]
add x0, x0, #64
sub x2, x2, #1
cbnz x2, G_M64195_IG13
;; size=28 bbWeight=32 PerfScore 192.00
G_M64195_IG14: ;; offset=0x00C4
and x1, x1, #7
b G_M64195_IG02
;; size=8 bbWeight=4 PerfScore 6.00
G_M64195_IG15: ;; offset=0x00CC
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
RWD00 dd 000000C4h ; case G_M64195_IG15
dd 0000002Ch ; case G_M64195_IG04
dd 00000034h ; case G_M64195_IG05
dd 0000003Ch ; case G_M64195_IG06
dd 00000048h ; case G_M64195_IG07
dd 00000054h ; case G_M64195_IG08
dd 00000064h ; case G_M64195_IG09
dd 00000074h ; case G_M64195_IG10
dd 00000088h ; case G_M64195_IG11
; Total bytes of code: 212

Previous codegen:
; Method Prog:ClearWithReferences(byref,ulong) (FullOpts)
G_M45361_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M45361_IG02: ;; offset=0x0008
cmp x1, #8
blo G_M45361_IG04
align [0 bytes for IG03]
align [0 bytes]
align [0 bytes]
align [0 bytes]
;; size=8 bbWeight=1 PerfScore 1.50
G_M45361_IG03: ;; offset=0x0010
lsl x2, x1, #3
add x3, x0, x2
str xzr, [x3, #-0x08]
add x3, x0, x2
str xzr, [x3, #-0x10]
add x3, x0, x2
str xzr, [x3, #-0x18]
add x3, x0, x2
str xzr, [x3, #-0x20]
add x3, x0, x2
str xzr, [x3, #-0x28]
add x3, x0, x2
str xzr, [x3, #-0x30]
add x3, x0, x2
str xzr, [x3, #-0x38]
add x2, x0, x2
str xzr, [x2, #-0x40]
sub x1, x1, #8
cmp x1, #8
bhs G_M45361_IG03
;; size=80 bbWeight=4 PerfScore 60.00
G_M45361_IG04: ;; offset=0x0060
cmp x1, #4
bhs G_M45361_IG07
;; size=8 bbWeight=1 PerfScore 1.50
G_M45361_IG05: ;; offset=0x0068
cmp x1, #2
bhs G_M45361_IG08
cbnz x1, G_M45361_IG09
;; size=12 bbWeight=0.50 PerfScore 1.25
G_M45361_IG06: ;; offset=0x0074
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=0.50 PerfScore 1.00
G_M45361_IG07: ;; offset=0x007C
stp xzr, xzr, [x0, #0x10]
lsl x2, x1, #3
add x3, x0, x2
str xzr, [x3, #-0x18]
add x2, x0, x2
str xzr, [x2, #-0x10]
;; size=24 bbWeight=0.50 PerfScore 2.50
G_M45361_IG08: ;; offset=0x0094
str xzr, [x0, #0x08]
lsl x2, x1, #3
add x1, x0, x2
str xzr, [x1, #-0x08]
;; size=16 bbWeight=0.50 PerfScore 1.75
G_M45361_IG09: ;; offset=0x00A4
str xzr, [x0]
;; size=4 bbWeight=0.50 PerfScore 0.50
G_M45361_IG10: ;; offset=0x00A8
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=0.50 PerfScore 1.00
; Total bytes of code: 176

I can tune the JIT to fold … to SIMD in the newly added …
I have mixed results with this change; I'll wait for fixes for #93382 (needed for the remainder) and #76067 (comment) (needed for the main loop) and try again after that.