-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression of XXHash128 for large buffers with .NET 8 #87107
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsHey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:
I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.
|
The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to? cc: @EgorBo |
Sorry, udpated, the naming was from an old benchmark, but implementation is using XXHash128 |
@xoofx could you please share your benchmark (which version of public static IEnumerable<byte[]> TestData()
{
yield return new byte[1024*1024];
}
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] Test(byte[] data) => System.IO.Hashing.XxHash128.Hash(data); and .net 8.0 is notably faster:
|
This is relatively similar but I'm using [Benchmark()]
[ArgumentsSource(nameof(Data))]
public UInt128 XXH3(byte[] data)
{
return XxHash128.HashToUInt128(data);
}
public IEnumerable<byte[]> Data()
{
yield return Enumerable.Range(0, 1 << 20).Select(x => (byte)x).ToArray();
} Looking at the generated assembly - with Disasmo - between .NET 7 and .NET 8, this code XxHashShared.ScrambleAccumulator256 is generating quite something weird for .NET 8. The .NET 7 version is: ; Assembly listing for method System.IO.Hashing.c(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 2 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4883EC78 sub rsp, 120
C5F877 vzeroupper
G_M000_IG02: ;; offset=0007H
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5F573D02F vpsrlq ymm1, ymm0, 47
C5FDEFC1 vpxor ymm0, ymm0, ymm1
C4C17DEF00 vpxor ymm0, ymm0, ymmword ptr[r8]
C5FD11442420 vmovupd ymmword ptr[rsp+20H], ymm0
C5FD100579000000 vmovupd ymm0, ymmword ptr[reloc @RWD00]
C5FD110424 vmovupd ymmword ptr[rsp], ymm0
488B442420 mov rax, qword ptr [rsp+20H]
480FAF0424 imul rax, qword ptr [rsp]
4889442440 mov qword ptr [rsp+40H], rax
488B442428 mov rax, qword ptr [rsp+28H]
480FAF442408 imul rax, qword ptr [rsp+08H]
4889442448 mov qword ptr [rsp+48H], rax
488B442430 mov rax, qword ptr [rsp+30H]
480FAF442410 imul rax, qword ptr [rsp+10H]
4889442450 mov qword ptr [rsp+50H], rax
488B442438 mov rax, qword ptr [rsp+38H]
480FAF442418 imul rax, qword ptr [rsp+18H]
4889442458 mov qword ptr [rsp+58H], rax
C5FD10442440 vmovupd ymm0, ymmword ptr[rsp+40H]
C5FD1102 vmovupd ymmword ptr[rdx], ymm0
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5FD1101 vmovupd ymmword ptr[rcx], ymm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0080H
C5F877 vzeroupper
4883C478 add rsp, 120
C3 ret
RWD00 dq 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h
; Total bytes of code 136
while the .NET 8 version is generating: ; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 19 single block inlinees; 13 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
sub rsp, 312
vzeroupper
G_M000_IG02: ;; offset=000AH
vmovups ymm0, ymmword ptr [rdx]
vpsrlq ymm1, ymm0, 47
vpxor ymm0, ymm0, ymm1
vpxor ymm0, ymm0, ymmword ptr [r8]
vmovups ymmword ptr [rsp+100H], ymm0
vmovups ymm0, ymmword ptr [reloc @RWD00]
vmovups ymmword ptr [rsp+E0H], ymm0
vmovups xmm1, xmmword ptr [rsp+100H]
vmovaps xmmword ptr [rsp+B0H], xmm1
vmovups xmm1, xmmword ptr [rsp+E0H]
vmovaps xmmword ptr [rsp+A0H], xmm1
mov rax, qword ptr [rsp+B0H]
mov qword ptr [rsp+90H], rax
mov rax, qword ptr [rsp+A0H]
mov qword ptr [rsp+88H], rax
mov rax, qword ptr [rsp+90H]
imul rax, qword ptr [rsp+88H]
mov qword ptr [rsp+98H], rax
mov rax, qword ptr [rsp+98H]
mov r8, qword ptr [rsp+B8H]
mov qword ptr [rsp+78H], r8
mov r8, qword ptr [rsp+A8H]
mov qword ptr [rsp+70H], r8
mov r8, qword ptr [rsp+78H]
imul r8, qword ptr [rsp+70H]
mov qword ptr [rsp+80H], r8
mov r8, qword ptr [rsp+80H]
mov qword ptr [rsp+60H], rax
mov qword ptr [rsp+68H], r8
vmovups ymmword ptr [rsp+C0H], ymm0
vmovaps xmm0, xmmword ptr [rsp+60H]
vmovups xmm1, xmmword ptr [rsp+110H]
vmovaps xmmword ptr [rsp+50H], xmm1
vmovups xmm1, xmmword ptr [rsp+D0H]
vmovaps xmmword ptr [rsp+40H], xmm1
mov rax, qword ptr [rsp+50H]
mov qword ptr [rsp+30H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+28H], rax
mov rax, qword ptr [rsp+30H]
imul rax, qword ptr [rsp+28H]
mov qword ptr [rsp+38H], rax
mov rax, qword ptr [rsp+38H]
mov r8, qword ptr [rsp+58H]
mov qword ptr [rsp+18H], r8
mov r8, qword ptr [rsp+48H]
mov qword ptr [rsp+10H], r8
mov r8, qword ptr [rsp+18H]
imul r8, qword ptr [rsp+10H]
mov qword ptr [rsp+20H], r8
mov r8, qword ptr [rsp+20H]
mov qword ptr [rsp], rax
mov qword ptr [rsp+08H], r8
vinserti128 ymm0, ymm0, xmmword ptr [rsp], 1
vmovups ymmword ptr [rdx], ymm0
vmovups ymm0, ymmword ptr [rdx]
vmovups ymmword ptr [rcx], ymm0
mov rax, rcx
G_M000_IG03: ;; offset=0178H
vzeroupper
add rsp, 312
ret
RWD00 dq 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h
; Total bytes of code 387 The .NET 8 version is going a bit crazy, loading/storing multiple times the same register to the stack, taking an intermediate route with some xmm registers... quite weird... This code is then called from the loop in |
But even in .NET 7 the |
Sounds great 🙂 Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b) => a * b; emits on my machine: ; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
G_M3664_IG01:
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M3664_IG02:
vmovups ymm0, ymmword ptr [rdx]
vpmullq ymm0, ymm0, ymmword ptr [r8]
vmovups ymmword ptr [rcx], ymm0
mov rax, rcx
;; size=17 bbWeight=1 PerfScore 24.25
G_M3664_IG03:
vzeroupper
ret
;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 24 but it needs AVX-512 to present, if it does not, I get: ; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this
G_M62636_IG01:
push rdi
push rsi
push rbp
push rbx
sub rsp, 248
vzeroupper
vmovaps xmmword ptr [rsp+E0H], xmm6
mov rbx, rdx
mov rsi, r8
mov rdi, r9
;; size=32 bbWeight=4 PerfScore 32.00
G_M62636_IG02:
vmovups xmm0, xmmword ptr [rsi]
vmovaps xmmword ptr [rsp+D0H], xmm0
vmovups xmm0, xmmword ptr [rdi]
vmovaps xmmword ptr [rsp+C0H], xmm0
mov rdx, qword ptr [rsp+D0H]
mov qword ptr [rsp+B0H], rdx
mov rdx, qword ptr [rsp+C0H]
mov qword ptr [rsp+A8H], rdx
mov rdx, qword ptr [rsp+B0H]
imul rdx, qword ptr [rsp+A8H]
mov qword ptr [rsp+B8H], rdx
mov rbp, qword ptr [rsp+B8H]
mov rdx, qword ptr [rsp+D8H]
mov qword ptr [rsp+98H], rdx
mov rdx, qword ptr [rsp+C8H]
mov qword ptr [rsp+90H], rdx
mov rcx, qword ptr [rsp+98H]
mov rdx, qword ptr [rsp+90H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+A0H], rax
mov rdx, qword ptr [rsp+A0H]
mov qword ptr [rsp+80H], rbp
mov qword ptr [rsp+88H], rdx
vmovaps xmm6, xmmword ptr [rsp+80H]
vmovups xmm0, xmmword ptr [rsi+10H]
vmovaps xmmword ptr [rsp+70H], xmm0
vmovups xmm0, xmmword ptr [rdi+10H]
vmovaps xmmword ptr [rsp+60H], xmm0
mov rdx, qword ptr [rsp+70H]
mov qword ptr [rsp+50H], rdx
mov rdx, qword ptr [rsp+60H]
mov qword ptr [rsp+48H], rdx
mov rcx, qword ptr [rsp+50H]
mov rdx, qword ptr [rsp+48H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+58H], rax
mov rsi, qword ptr [rsp+58H]
mov rdx, qword ptr [rsp+78H]
mov qword ptr [rsp+38H], rdx
mov rdx, qword ptr [rsp+68H]
mov qword ptr [rsp+30H], rdx
mov rcx, qword ptr [rsp+38H]
mov rdx, qword ptr [rsp+30H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+40H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+20H], rsi
mov qword ptr [rsp+28H], rax
vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
vmovups ymmword ptr [rbx], ymm0
mov rax, rbx
;; size=325 bbWeight=4 PerfScore 309.00
G_M62636_IG03:
vmovaps xmm6, xmmword ptr [rsp+E0H]
vzeroupper
add rsp, 248
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381 |
Yep, it is. So it is more likely that you should get better performance on your machine |
Btw, if you know how to optimize the multiplication of two |
Yep, exactly, that's what I would have tried, thanks for confirming. Let me check If I can come up with something there. |
Cool, by generating a compatible AVX2 64bit * 64bit vectorized multiplication, I'm now getting 10% faster than the C++ version, that was it! 😎 Will try to prepare a PR later to add AVX2 + SSE2 and double check if can create also an ARM64 version. |
@EgorBo I don't see any code that is using e.g AVX2, SSE2 in this code path and the operator is marked as an intrinsic. I have 2 questions:
Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build |
Right, because it's implemented in JIT, namely, here: runtime/src/coreclr/jit/hwintrinsicxarch.cpp Lines 2056 to 2103 in da1da02
So you either implement it in JIT right there or in C#. When JIT doesn't handle the intrinsic it falls back to C# implementation so if you implement a path
it should be taken. Whatever path you want to take is up to you, I, personally, prefer to do it in C# if it's possible - it's simpler and is ILLink friendly (also, might help mono as well).
I personally rarely do so, but to open the sln I do:
|
Btw, #86811 touches the same path in JIT for |
Fix proposal via PR #87113 (Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands) |
I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it? |
The following code from [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<ulong> ScrambleAccumulator256(Vector256<ulong> accVec, Vector256<ulong> secret)
{
Vector256<ulong> xorShift = accVec ^ Vector256.ShiftRightLogical(accVec, 47);
Vector256<ulong> xorWithKey = xorShift ^ secret;
accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
return accVec;
} My PR was going to optimize vmovups ymmword ptr [rsp+100H], ymm0
vmovups ymm0, ymmword ptr [reloc @RWD00]
vmovups ymmword ptr [rsp+E0H], ymm0 ; <----- Double store of reloc @RWD00 above
vmovups xmm1, xmmword ptr [rsp+100H] ; <----- reloading from ymm0 above or: mov rax, qword ptr [rsp+B0H]
mov qword ptr [rsp+90H], rax
mov rax, qword ptr [rsp+A0H]
mov qword ptr [rsp+88H], rax
mov rax, qword ptr [rsp+90H] ; <---- Reloading from rax just above
imul rax, qword ptr [rsp+88H] ; <---- Reloading from rax just above |
This won't land in .NET 9 due to timing and other higher priority fixes being needed. The suggested code fix ended up causing problems and we likely need to use a more standard algorithm for emulating long multiplication. The #87142 PR prototypes this, but has some failures that still needs looking at. |
Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:
I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.
The text was updated successfully, but these errors were encountered: