Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression of XXHash128 for large buffers with .NET 8 #87107

Open
xoofx opened this issue Jun 4, 2023 · 18 comments
Open

Performance regression of XXHash128 for large buffers with .NET 8 #87107

xoofx opened this issue Jun 4, 2023 · 18 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@xoofx
Copy link
Member

xoofx commented Jun 4, 2023

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2


|          Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|---------------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXHash128Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXHash128 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXHash128 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |

I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

@xoofx xoofx added the tenet-performance Performance related issue label Jun 4, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 4, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 4, 2023
@ghost
Copy link

ghost commented Jun 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2


|     Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|----------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXH3Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXH3 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXH3 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |

I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

Author: xoofx
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@stephentoub
Copy link
Member

with XXHash128

XXH3

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

cc: @EgorBo

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

Sorry, udpated, the naming was from an old benchmark, but implementation is using XXHash128

@EgorBo
Copy link
Member

EgorBo commented Jun 4, 2023

@xoofx could you please share your benchmark (which version of System.IO.Hashing are you using btw?)? I tried locally with this:

public static IEnumerable<byte[]> TestData()
{
    yield return new byte[1024*1024];
}

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] Test(byte[] data) => System.IO.Hashing.XxHash128.Hash(data);

and .net 8.0 is notably faster:

AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 14.30 us | 0.150 us | 0.008 us |
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  ShortRun : .NET 7.0.3 (7.0.323.6910), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 21.53 us | 1.083 us | 0.059 us |

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

This is relatively similar but I'm using XxHash128.HashToUInt128:

    [Benchmark()]
    [ArgumentsSource(nameof(Data))]
    public UInt128 XXH3(byte[] data)
    {
        return XxHash128.HashToUInt128(data);
    }

    public IEnumerable<byte[]> Data()
    {
        yield return Enumerable.Range(0, 1 << 20).Select(x => (byte)x).ToArray();
    }

Looking at the generated assembly - with Disasmo - between .NET 7 and .NET 8, this code XxHashShared.ScrambleAccumulator256 is generating quite something weird for .NET 8.

The .NET 7 version is:

; Assembly listing for method System.IO.Hashing.c(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 2 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC78             sub      rsp, 120
       C5F877               vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5F573D02F           vpsrlq   ymm1, ymm0, 47
       C5FDEFC1             vpxor    ymm0, ymm0, ymm1
       C4C17DEF00           vpxor    ymm0, ymm0, ymmword ptr[r8]
       C5FD11442420         vmovupd  ymmword ptr[rsp+20H], ymm0
       C5FD100579000000     vmovupd  ymm0, ymmword ptr[reloc @RWD00]
       C5FD110424           vmovupd  ymmword ptr[rsp], ymm0
       488B442420           mov      rax, qword ptr [rsp+20H]
       480FAF0424           imul     rax, qword ptr [rsp]
       4889442440           mov      qword ptr [rsp+40H], rax
       488B442428           mov      rax, qword ptr [rsp+28H]
       480FAF442408         imul     rax, qword ptr [rsp+08H]
       4889442448           mov      qword ptr [rsp+48H], rax
       488B442430           mov      rax, qword ptr [rsp+30H]
       480FAF442410         imul     rax, qword ptr [rsp+10H]
       4889442450           mov      qword ptr [rsp+50H], rax
       488B442438           mov      rax, qword ptr [rsp+38H]
       480FAF442418         imul     rax, qword ptr [rsp+18H]
       4889442458           mov      qword ptr [rsp+58H], rax
       C5FD10442440         vmovupd  ymm0, ymmword ptr[rsp+40H]
       C5FD1102             vmovupd  ymmword ptr[rdx], ymm0
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5FD1101             vmovupd  ymmword ptr[rcx], ymm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0080H
       C5F877               vzeroupper 
       4883C478             add      rsp, 120
       C3                   ret      
 
RWD00  	dq	000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 136

while the .NET 8 version is generating:

; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 19 single block inlinees; 13 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       sub      rsp, 312
       vzeroupper 
 
G_M000_IG02:                ;; offset=000AH
       vmovups  ymm0, ymmword ptr [rdx]
       vpsrlq   ymm1, ymm0, 47
       vpxor    ymm0, ymm0, ymm1
       vpxor    ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0
       vmovups  xmm1, xmmword ptr [rsp+100H]
       vmovaps  xmmword ptr [rsp+B0H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+E0H]
       vmovaps  xmmword ptr [rsp+A0H], xmm1
       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H]
       imul     rax, qword ptr [rsp+88H]
       mov      qword ptr [rsp+98H], rax
       mov      rax, qword ptr [rsp+98H]
       mov      r8, qword ptr [rsp+B8H]
       mov      qword ptr [rsp+78H], r8
       mov      r8, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+70H], r8
       mov      r8, qword ptr [rsp+78H]
       imul     r8, qword ptr [rsp+70H]
       mov      qword ptr [rsp+80H], r8
       mov      r8, qword ptr [rsp+80H]
       mov      qword ptr [rsp+60H], rax
       mov      qword ptr [rsp+68H], r8
       vmovups  ymmword ptr [rsp+C0H], ymm0
       vmovaps  xmm0, xmmword ptr [rsp+60H]
       vmovups  xmm1, xmmword ptr [rsp+110H]
       vmovaps  xmmword ptr [rsp+50H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+D0H]
       vmovaps  xmmword ptr [rsp+40H], xmm1
       mov      rax, qword ptr [rsp+50H]
       mov      qword ptr [rsp+30H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+28H], rax
       mov      rax, qword ptr [rsp+30H]
       imul     rax, qword ptr [rsp+28H]
       mov      qword ptr [rsp+38H], rax
       mov      rax, qword ptr [rsp+38H]
       mov      r8, qword ptr [rsp+58H]
       mov      qword ptr [rsp+18H], r8
       mov      r8, qword ptr [rsp+48H]
       mov      qword ptr [rsp+10H], r8
       mov      r8, qword ptr [rsp+18H]
       imul     r8, qword ptr [rsp+10H]
       mov      qword ptr [rsp+20H], r8
       mov      r8, qword ptr [rsp+20H]
       mov      qword ptr [rsp], rax
       mov      qword ptr [rsp+08H], r8
       vinserti128 ymm0, ymm0, xmmword ptr [rsp], 1
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx
 
G_M000_IG03:                ;; offset=0178H
       vzeroupper 
       add      rsp, 312
       ret      
 
RWD00  	dq	000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 387

The .NET 8 version is going a bit crazy, loading/storing multiple times the same register to the stack, taking an intermediate route with some xmm registers... quite weird...

This code is then called from the loop in XxHashShared.Accumulate and the impact there is not great.

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

But even in .NET 7 the Vector256<ulong> * Vector256<ulong> is going back to scalar which is quite unfortunate. Would require a specialized code path with 32bits mult to keep it vectorized. I can make a separate PR maybe?

@EgorBo
Copy link
Member

EgorBo commented Jun 4, 2023

I can make a separate PR maybe?

Sounds great 🙂

Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b) => a * b;

emits on my machine:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
G_M3664_IG01:
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00

G_M3664_IG02:
       vmovups  ymm0, ymmword ptr [rdx]
       vpmullq  ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx
						;; size=17 bbWeight=1 PerfScore 24.25

G_M3664_IG03:
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 24

but it needs AVX-512 to present, if it does not, I get:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this
G_M62636_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 248
       vzeroupper 
       vmovaps  xmmword ptr [rsp+E0H], xmm6
       mov      rbx, rdx
       mov      rsi, r8
       mov      rdi, r9
						;; size=32 bbWeight=4 PerfScore 32.00

G_M62636_IG02:
       vmovups  xmm0, xmmword ptr [rsi]
       vmovaps  xmmword ptr [rsp+D0H], xmm0
       vmovups  xmm0, xmmword ptr [rdi]
       vmovaps  xmmword ptr [rsp+C0H], xmm0
       mov      rdx, qword ptr [rsp+D0H]
       mov      qword ptr [rsp+B0H], rdx
       mov      rdx, qword ptr [rsp+C0H]
       mov      qword ptr [rsp+A8H], rdx
       mov      rdx, qword ptr [rsp+B0H]
       imul     rdx, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+B8H], rdx
       mov      rbp, qword ptr [rsp+B8H]
       mov      rdx, qword ptr [rsp+D8H]
       mov      qword ptr [rsp+98H], rdx
       mov      rdx, qword ptr [rsp+C8H]
       mov      qword ptr [rsp+90H], rdx
       mov      rcx, qword ptr [rsp+98H]
       mov      rdx, qword ptr [rsp+90H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+A0H], rax
       mov      rdx, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+80H], rbp
       mov      qword ptr [rsp+88H], rdx
       vmovaps  xmm6, xmmword ptr [rsp+80H]
       vmovups  xmm0, xmmword ptr [rsi+10H]
       vmovaps  xmmword ptr [rsp+70H], xmm0
       vmovups  xmm0, xmmword ptr [rdi+10H]
       vmovaps  xmmword ptr [rsp+60H], xmm0
       mov      rdx, qword ptr [rsp+70H]
       mov      qword ptr [rsp+50H], rdx
       mov      rdx, qword ptr [rsp+60H]
       mov      qword ptr [rsp+48H], rdx
       mov      rcx, qword ptr [rsp+50H]
       mov      rdx, qword ptr [rsp+48H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+58H], rax
       mov      rsi, qword ptr [rsp+58H]
       mov      rdx, qword ptr [rsp+78H]
       mov      qword ptr [rsp+38H], rdx
       mov      rdx, qword ptr [rsp+68H]
       mov      qword ptr [rsp+30H], rdx
       mov      rcx, qword ptr [rsp+38H]
       mov      rdx, qword ptr [rsp+30H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+40H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+20H], rsi
       mov      qword ptr [rsp+28H], rax
       vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
       vmovups  ymmword ptr [rbx], ymm0
       mov      rax, rbx
						;; size=325 bbWeight=4 PerfScore 309.00

G_M62636_IG03:
       vmovaps  xmm6, xmmword ptr [rsp+E0H]
       vzeroupper 
       add      rsp, 248
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

but presumably it's AVX-512?

Yep, it is. So it is more likely that you should get better performance on your machine

@EgorBo
Copy link
Member

EgorBo commented Jun 4, 2023

Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-avx512 you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

Btw, if you know how to optimize the multiplication of two Vector256 using pre-avx512 you might want to do it inside the operator * itself? I mean here:

Yep, exactly, that's what I would have tried, thanks for confirming. Let me check If I can come up with something there.

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

Cool, by generating a compatible AVX2 64bit * 64bit vectorized multiplication, I'm now getting 10% faster than the C++ version, that was it! 😎
I should have checked more thoroughly the generated ASM in the first place. 😅

Will try to prepare a PR later to add AVX2 + SSE2 and double check if can create also an ARM64 version.

@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

@EgorBo I don't see any code that is using e.g AVX2, SSE2 in this code path and the operator is marked as an intrinsic. I have 2 questions:

  • Should I implement a specialized version of operator Vector256<ulong> * or should I if switch on the typeof(ulong) in the generic version?
  • Should I inline AVX2 implementation in this intrinsics operators? (safe guarded by Avx2.IsSupported)

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

@EgorBo
Copy link
Member

EgorBo commented Jun 4, 2023

@EgorBo I don't see any code that is using e.g AVX2, SSE2

Right, because it's implemented in JIT, namely, here:

case NI_Vector256_op_Multiply:
case NI_Vector512_op_Multiply:
{
assert(sig->numArgs == 2);
if ((simdSize == 32) && !varTypeIsFloating(simdBaseType) && !compExactlyDependsOn(InstructionSet_AVX2))
{
// We can't deal with TYP_SIMD32 for integral types if the compiler doesn't support AVX2
break;
}
assert(simdSize != 64 || IsBaselineVector512IsaSupportedDebugOnly());
if ((simdBaseType == TYP_BYTE) || (simdBaseType == TYP_UBYTE))
{
// TODO-XARCH-CQ: We should support byte/sbyte multiplication
break;
}
if (varTypeIsLong(simdBaseType))
{
if (simdSize != 64 && !compOpportunisticallyDependsOn(InstructionSet_AVX512DQ_VL))
{
// TODO-XARCH-CQ: We should support long/ulong multiplication
break;
}
// else if simdSize == 64 then above assert would check if baseline isa supported
#if defined(TARGET_X86)
// TODO-XARCH-CQ: We need to support 64-bit CreateBroadcast
break;
#endif // TARGET_X86
}
CORINFO_ARG_LIST_HANDLE arg1 = sig->args;
CORINFO_ARG_LIST_HANDLE arg2 = info.compCompHnd->getArgNext(arg1);
var_types argType = TYP_UNKNOWN;
CORINFO_CLASS_HANDLE argClass = NO_CLASS_HANDLE;
argType = JITtype2varType(strip(info.compCompHnd->getArgType(sig, arg2, &argClass)));
op2 = getArgForHWIntrinsic(argType, argClass);
argType = JITtype2varType(strip(info.compCompHnd->getArgType(sig, arg1, &argClass)));
op1 = getArgForHWIntrinsic(argType, argClass);
retNode = gtNewSimdBinOpNode(GT_MUL, retType, op1, op2, simdBaseJitType, simdSize);
break;
}
(as you can see it even has a TODO for V256 multiplication when Avx512dq_vl is not presented.

So you either implement it in JIT right there or in C#. When JIT doesn't handle the intrinsic it falls back to C# implementation so if you implement a path

if (typeof(T) == typeof(long) && Avx2.IsSupported)
{
    ...
}

it should be taken. Whatever path you want to take is up to you, I, personally, prefer to do it in C# if it's possible - it's simpler and is ILLink friendly (also, might help mono as well).

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

I personally rarely do so, but to open the sln I do:

.\build.cmd Clr -c Debug -vs .\src\coreclr\System.Private.CoreLib\System.Private.CoreLib.csproj

@EgorBo
Copy link
Member

EgorBo commented Jun 4, 2023

Btw, #86811 touches the same path in JIT for byte as far as I can see (but I still think it'd better to be done on the C# side)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 4, 2023
@xoofx
Copy link
Member Author

xoofx commented Jun 4, 2023

Fix proposal via PR #87113

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 4, 2023
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 5, 2023
@AndyAyersMS
Copy link
Member

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

@xoofx
Copy link
Member Author

xoofx commented Jun 5, 2023

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

The following code from XxHashShared.ScrambleAccumulator256 in .NET 8 is generating a lot more stack spilling (312 bytes - e.g storing and reloading the exact same value from the stack in a row, as captured below) than the .NET 7 equivalent (120 bytes), mainly because of accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<ulong> ScrambleAccumulator256(Vector256<ulong> accVec, Vector256<ulong> secret)
{
    Vector256<ulong> xorShift = accVec ^ Vector256.ShiftRightLogical(accVec, 47);
    Vector256<ulong> xorWithKey = xorShift ^ secret;
    accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
    return accVec;
}

My PR was going to optimize Vector256<ulong> * operator, but that doesn't explain the difference of codegen with .NET 8.
e.g

       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0    ; <----- Double store of reloc @RWD00 above
       vmovups  xmm1, xmmword ptr [rsp+100H]   ; <----- reloading from ymm0 above

or:

       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H] ; <---- Reloading from rax just above
       imul     rax, qword ptr [rsp+88H] ; <---- Reloading from rax just above

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 5, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Jun 19, 2023
@tannergooding tannergooding modified the milestones: 8.0.0, 9.0.0 Aug 11, 2023
@tannergooding
Copy link
Member

This won't land in .NET 9 due to timing and other higher priority fixes being needed.

The suggested code fix ended up causing problems and we likely need to use a more standard algorithm for emulating long multiplication. The #87142 PR prototypes this, but has some failures that still needs looking at.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 6, 2023
@tannergooding tannergooding removed their assignment Apr 18, 2024
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 9.0.0, Future May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
6 participants