Performance regression of XXHash128 for large buffers with .NET 8 #87107

xoofx · 2023-06-04T08:16:47Z

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2


|          Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|---------------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXHash128Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXHash128 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXHash128 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |

I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

The text was updated successfully, but these errors were encountered:

ghost · 2023-06-04T08:16:56Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2


|     Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|----------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXH3Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXH3 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXH3 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |

I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

Author:	xoofx
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

stephentoub · 2023-06-04T08:26:44Z

with XXHash128

XXH3

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

cc: @EgorBo

xoofx · 2023-06-04T08:28:23Z

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

Sorry, udpated, the naming was from an old benchmark, but implementation is using XXHash128

EgorBo · 2023-06-04T11:10:00Z

@xoofx could you please share your benchmark (which version of System.IO.Hashing are you using btw?)? I tried locally with this:

public static IEnumerable<byte[]> TestData()
{
    yield return new byte[1024*1024];
}

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] Test(byte[] data) => System.IO.Hashing.XxHash128.Hash(data);

and .net 8.0 is notably faster:

AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 14.30 us | 0.150 us | 0.008 us |

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  ShortRun : .NET 7.0.3 (7.0.323.6910), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 21.53 us | 1.083 us | 0.059 us |

xoofx · 2023-06-04T11:22:31Z

This is relatively similar but I'm using XxHash128.HashToUInt128:

    [Benchmark()]
    [ArgumentsSource(nameof(Data))]
    public UInt128 XXH3(byte[] data)
    {
        return XxHash128.HashToUInt128(data);
    }

    public IEnumerable<byte[]> Data()
    {
        yield return Enumerable.Range(0, 1 << 20).Select(x => (byte)x).ToArray();
    }

Looking at the generated assembly - with Disasmo - between .NET 7 and .NET 8, this code XxHashShared.ScrambleAccumulator256 is generating quite something weird for .NET 8.

The .NET 7 version is:

; Assembly listing for method System.IO.Hashing.c(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 2 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC78             sub      rsp, 120
       C5F877               vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5F573D02F           vpsrlq   ymm1, ymm0, 47
       C5FDEFC1             vpxor    ymm0, ymm0, ymm1
       C4C17DEF00           vpxor    ymm0, ymm0, ymmword ptr[r8]
       C5FD11442420         vmovupd  ymmword ptr[rsp+20H], ymm0
       C5FD100579000000     vmovupd  ymm0, ymmword ptr[reloc @RWD00]
       C5FD110424           vmovupd  ymmword ptr[rsp], ymm0
       488B442420           mov      rax, qword ptr [rsp+20H]
       480FAF0424           imul     rax, qword ptr [rsp]
       4889442440           mov      qword ptr [rsp+40H], rax
       488B442428           mov      rax, qword ptr [rsp+28H]
       480FAF442408         imul     rax, qword ptr [rsp+08H]
       4889442448           mov      qword ptr [rsp+48H], rax
       488B442430           mov      rax, qword ptr [rsp+30H]
       480FAF442410         imul     rax, qword ptr [rsp+10H]
       4889442450           mov      qword ptr [rsp+50H], rax
       488B442438           mov      rax, qword ptr [rsp+38H]
       480FAF442418         imul     rax, qword ptr [rsp+18H]
       4889442458           mov      qword ptr [rsp+58H], rax
       C5FD10442440         vmovupd  ymm0, ymmword ptr[rsp+40H]
       C5FD1102             vmovupd  ymmword ptr[rdx], ymm0
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5FD1101             vmovupd  ymmword ptr[rcx], ymm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0080H
       C5F877               vzeroupper 
       4883C478             add      rsp, 120
       C3                   ret      
 
RWD00  	dq	000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 136

while the .NET 8 version is generating:

; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 19 single block inlinees; 13 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       sub      rsp, 312
       vzeroupper 
 
G_M000_IG02:                ;; offset=000AH
       vmovups  ymm0, ymmword ptr [rdx]
       vpsrlq   ymm1, ymm0, 47
       vpxor    ymm0, ymm0, ymm1
       vpxor    ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0
       vmovups  xmm1, xmmword ptr [rsp+100H]
       vmovaps  xmmword ptr [rsp+B0H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+E0H]
       vmovaps  xmmword ptr [rsp+A0H], xmm1
       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H]
       imul     rax, qword ptr [rsp+88H]
       mov      qword ptr [rsp+98H], rax
       mov      rax, qword ptr [rsp+98H]
       mov      r8, qword ptr [rsp+B8H]
       mov      qword ptr [rsp+78H], r8
       mov      r8, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+70H], r8
       mov      r8, qword ptr [rsp+78H]
       imul     r8, qword ptr [rsp+70H]
       mov      qword ptr [rsp+80H], r8
       mov      r8, qword ptr [rsp+80H]
       mov      qword ptr [rsp+60H], rax
       mov      qword ptr [rsp+68H], r8
       vmovups  ymmword ptr [rsp+C0H], ymm0
       vmovaps  xmm0, xmmword ptr [rsp+60H]
       vmovups  xmm1, xmmword ptr [rsp+110H]
       vmovaps  xmmword ptr [rsp+50H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+D0H]
       vmovaps  xmmword ptr [rsp+40H], xmm1
       mov      rax, qword ptr [rsp+50H]
       mov      qword ptr [rsp+30H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+28H], rax
       mov      rax, qword ptr [rsp+30H]
       imul     rax, qword ptr [rsp+28H]
       mov      qword ptr [rsp+38H], rax
       mov      rax, qword ptr [rsp+38H]
       mov      r8, qword ptr [rsp+58H]
       mov      qword ptr [rsp+18H], r8
       mov      r8, qword ptr [rsp+48H]
       mov      qword ptr [rsp+10H], r8
       mov      r8, qword ptr [rsp+18H]
       imul     r8, qword ptr [rsp+10H]
       mov      qword ptr [rsp+20H], r8
       mov      r8, qword ptr [rsp+20H]
       mov      qword ptr [rsp], rax
       mov      qword ptr [rsp+08H], r8
       vinserti128 ymm0, ymm0, xmmword ptr [rsp], 1
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx
 
G_M000_IG03:                ;; offset=0178H
       vzeroupper 
       add      rsp, 312
       ret      
 
RWD00  	dq	000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 387

The .NET 8 version is going a bit crazy, loading/storing multiple times the same register to the stack, taking an intermediate route with some xmm registers... quite weird...

This code is then called from the loop in XxHashShared.Accumulate and the impact there is not great.

xoofx · 2023-06-04T11:30:32Z

But even in .NET 7 the Vector256<ulong> * Vector256<ulong> is going back to scalar which is quite unfortunate. Would require a specialized code path with 32bits mult to keep it vectorized. I can make a separate PR maybe?

EgorBo · 2023-06-04T11:35:57Z

I can make a separate PR maybe?

Sounds great 🙂

Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b) => a * b;

emits on my machine:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
G_M3664_IG01:
       vzeroupper 
						;; size=3 bbWeight=1 PerfScore 1.00

G_M3664_IG02:
       vmovups  ymm0, ymmword ptr [rdx]
       vpmullq  ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx
						;; size=17 bbWeight=1 PerfScore 24.25

G_M3664_IG03:
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 24

but it needs AVX-512 to present, if it does not, I get:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this
G_M62636_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 248
       vzeroupper 
       vmovaps  xmmword ptr [rsp+E0H], xmm6
       mov      rbx, rdx
       mov      rsi, r8
       mov      rdi, r9
						;; size=32 bbWeight=4 PerfScore 32.00

G_M62636_IG02:
       vmovups  xmm0, xmmword ptr [rsi]
       vmovaps  xmmword ptr [rsp+D0H], xmm0
       vmovups  xmm0, xmmword ptr [rdi]
       vmovaps  xmmword ptr [rsp+C0H], xmm0
       mov      rdx, qword ptr [rsp+D0H]
       mov      qword ptr [rsp+B0H], rdx
       mov      rdx, qword ptr [rsp+C0H]
       mov      qword ptr [rsp+A8H], rdx
       mov      rdx, qword ptr [rsp+B0H]
       imul     rdx, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+B8H], rdx
       mov      rbp, qword ptr [rsp+B8H]
       mov      rdx, qword ptr [rsp+D8H]
       mov      qword ptr [rsp+98H], rdx
       mov      rdx, qword ptr [rsp+C8H]
       mov      qword ptr [rsp+90H], rdx
       mov      rcx, qword ptr [rsp+98H]
       mov      rdx, qword ptr [rsp+90H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+A0H], rax
       mov      rdx, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+80H], rbp
       mov      qword ptr [rsp+88H], rdx
       vmovaps  xmm6, xmmword ptr [rsp+80H]
       vmovups  xmm0, xmmword ptr [rsi+10H]
       vmovaps  xmmword ptr [rsp+70H], xmm0
       vmovups  xmm0, xmmword ptr [rdi+10H]
       vmovaps  xmmword ptr [rsp+60H], xmm0
       mov      rdx, qword ptr [rsp+70H]
       mov      qword ptr [rsp+50H], rdx
       mov      rdx, qword ptr [rsp+60H]
       mov      qword ptr [rsp+48H], rdx
       mov      rcx, qword ptr [rsp+50H]
       mov      rdx, qword ptr [rsp+48H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+58H], rax
       mov      rsi, qword ptr [rsp+58H]
       mov      rdx, qword ptr [rsp+78H]
       mov      qword ptr [rsp+38H], rdx
       mov      rdx, qword ptr [rsp+68H]
       mov      qword ptr [rsp+30H], rdx
       mov      rcx, qword ptr [rsp+38H]
       mov      rdx, qword ptr [rsp+30H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+40H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+20H], rsi
       mov      qword ptr [rsp+28H], rax
       vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
       vmovups  ymmword ptr [rbx], ymm0
       mov      rax, rbx
						;; size=325 bbWeight=4 PerfScore 309.00

G_M62636_IG03:
       vmovaps  xmm6, xmmword ptr [rsp+E0H]
       vzeroupper 
       add      rsp, 248
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381

xoofx · 2023-06-04T11:38:07Z

but presumably it's AVX-512?

Yep, it is. So it is more likely that you should get better performance on your machine

EgorBo · 2023-06-04T11:42:03Z

Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-avx512 you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268

xoofx · 2023-06-04T11:43:07Z

Btw, if you know how to optimize the multiplication of two Vector256 using pre-avx512 you might want to do it inside the operator * itself? I mean here:

Yep, exactly, that's what I would have tried, thanks for confirming. Let me check If I can come up with something there.

xoofx · 2023-06-04T12:30:52Z

Cool, by generating a compatible AVX2 64bit * 64bit vectorized multiplication, I'm now getting 10% faster than the C++ version, that was it! 😎
I should have checked more thoroughly the generated ASM in the first place. 😅

Will try to prepare a PR later to add AVX2 + SSE2 and double check if can create also an ARM64 version.

xoofx · 2023-06-04T13:59:44Z

@EgorBo I don't see any code that is using e.g AVX2, SSE2 in this code path and the operator is marked as an intrinsic. I have 2 questions:

Should I implement a specialized version of operator Vector256<ulong> * or should I if switch on the typeof(ulong) in the generic version?
Should I inline AVX2 implementation in this intrinsics operators? (safe guarded by Avx2.IsSupported)

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

EgorBo · 2023-06-04T14:11:17Z

@EgorBo I don't see any code that is using e.g AVX2, SSE2

Right, because it's implemented in JIT, namely, here:

runtime/src/coreclr/jit/hwintrinsicxarch.cpp

Lines 2056 to 2103 in da1da02

    
                   case NI_Vector256_op_Multiply: 
        
                   case NI_Vector512_op_Multiply: 
        
                   { 
        
                       assert(sig->numArgs == 2); 
        
                       if ((simdSize == 32) && !varTypeIsFloating(simdBaseType) && !compExactlyDependsOn(InstructionSet_AVX2)) 
        
                       { 
        
                           // We can't deal with TYP_SIMD32 for integral types if the compiler doesn't support AVX2 
        
                           break; 
        
                       } 
        
                       assert(simdSize != 64 || IsBaselineVector512IsaSupportedDebugOnly()); 
        
                       if ((simdBaseType == TYP_BYTE) || (simdBaseType == TYP_UBYTE)) 
        
                       { 
        
                           // TODO-XARCH-CQ: We should support byte/sbyte multiplication 
        
                           break; 
        
                       } 
        
                       if (varTypeIsLong(simdBaseType)) 
        
                       { 
        
                           if (simdSize != 64 && !compOpportunisticallyDependsOn(InstructionSet_AVX512DQ_VL)) 
        
                           { 
        
                               // TODO-XARCH-CQ: We should support long/ulong multiplication 
        
                               break; 
        
                           } 
        
           // else if simdSize == 64 then above assert would check if baseline isa supported 
        
           #if defined(TARGET_X86) 
        
                           // TODO-XARCH-CQ: We need to support 64-bit CreateBroadcast 
        
                           break; 
        
           #endif // TARGET_X86 
        
                       } 
        
                       CORINFO_ARG_LIST_HANDLE arg1     = sig->args; 
        
                       CORINFO_ARG_LIST_HANDLE arg2     = info.compCompHnd->getArgNext(arg1); 
        
                       var_types               argType  = TYP_UNKNOWN; 
        
                       CORINFO_CLASS_HANDLE    argClass = NO_CLASS_HANDLE; 
        
                       argType = JITtype2varType(strip(info.compCompHnd->getArgType(sig, arg2, &argClass))); 
        
                       op2     = getArgForHWIntrinsic(argType, argClass); 
        
                       argType = JITtype2varType(strip(info.compCompHnd->getArgType(sig, arg1, &argClass))); 
        
                       op1     = getArgForHWIntrinsic(argType, argClass); 
        
                       retNode = gtNewSimdBinOpNode(GT_MUL, retType, op1, op2, simdBaseJitType, simdSize); 
        
                       break; 
        
                   }

(as you can see it even has a TODO for V256 multiplication when Avx512dq_vl is not presented.

So you either implement it in JIT right there or in C#. When JIT doesn't handle the intrinsic it falls back to C# implementation so if you implement a path

if (typeof(T) == typeof(long) && Avx2.IsSupported)
{
    ...
}

it should be taken. Whatever path you want to take is up to you, I, personally, prefer to do it in C# if it's possible - it's simpler and is ILLink friendly (also, might help mono as well).

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

I personally rarely do so, but to open the sln I do:

.\build.cmd Clr -c Debug -vs .\src\coreclr\System.Private.CoreLib\System.Private.CoreLib.csproj

EgorBo · 2023-06-04T14:14:09Z

Btw, #86811 touches the same path in JIT for byte as far as I can see (but I still think it'd better to be done on the C# side)

xoofx · 2023-06-04T15:12:12Z

Fix proposal via PR #87113

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

AndyAyersMS · 2023-06-05T15:37:11Z

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

xoofx · 2023-06-05T15:56:23Z

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

The following code from XxHashShared.ScrambleAccumulator256 in .NET 8 is generating a lot more stack spilling (312 bytes - e.g storing and reloading the exact same value from the stack in a row, as captured below) than the .NET 7 equivalent (120 bytes), mainly because of accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<ulong> ScrambleAccumulator256(Vector256<ulong> accVec, Vector256<ulong> secret)
{
    Vector256<ulong> xorShift = accVec ^ Vector256.ShiftRightLogical(accVec, 47);
    Vector256<ulong> xorWithKey = xorShift ^ secret;
    accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
    return accVec;
}

My PR was going to optimize Vector256<ulong> * operator, but that doesn't explain the difference of codegen with .NET 8.
e.g

       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0    ; <----- Double store of reloc @RWD00 above
       vmovups  xmm1, xmmword ptr [rsp+100H]   ; <----- reloading from ymm0 above

or:

       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H] ; <---- Reloading from rax just above
       imul     rax, qword ptr [rsp+88H] ; <---- Reloading from rax just above

tannergooding · 2023-08-11T17:17:10Z

This won't land in .NET 9 due to timing and other higher priority fixes being needed.

The suggested code fix ended up causing problems and we likely need to use a more standard algorithm for emulating long multiplication. The #87142 PR prototypes this, but has some failures that still needs looking at.

xoofx added the tenet-performance Performance related issue label Jun 4, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jun 4, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 4, 2023

xoofx mentioned this issue Jun 4, 2023

Add support for optimized AVX2/SSE2 multiply for VectorXXX<ulong/long> #87113

Closed

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 4, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 4, 2023

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jun 5, 2023

tannergooding mentioned this issue Jun 5, 2023

Handle Vector multiplication for long/ulong as intrinsic on pre-AVX512 hardware #87142

Closed

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 5, 2023

JulieLeeMSFT assigned tannergooding Jun 19, 2023

JulieLeeMSFT added this to the 8.0.0 milestone Jun 19, 2023

tannergooding modified the milestones: 8.0.0, 9.0.0 Aug 11, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 6, 2023

tannergooding removed their assignment Apr 18, 2024

JulieLeeMSFT modified the milestones: 9.0.0, Future May 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression of XXHash128 for large buffers with .NET 8 #87107

Performance regression of XXHash128 for large buffers with .NET 8 #87107

xoofx commented Jun 4, 2023 •

edited

Loading

ghost commented Jun 4, 2023

stephentoub commented Jun 4, 2023

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023

xoofx commented Jun 4, 2023

xoofx commented Jun 4, 2023

xoofx commented Jun 4, 2023 •

edited

Loading

EgorBo commented Jun 4, 2023

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023 •

edited

Loading

AndyAyersMS commented Jun 5, 2023

xoofx commented Jun 5, 2023 •

edited

Loading

tannergooding commented Aug 11, 2023

Performance regression of XXHash128 for large buffers with .NET 8 #87107

Performance regression of XXHash128 for large buffers with .NET 8 #87107

Comments

xoofx commented Jun 4, 2023 • edited Loading

ghost commented Jun 4, 2023

stephentoub commented Jun 4, 2023

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023 • edited Loading

xoofx commented Jun 4, 2023 • edited Loading

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023 • edited Loading

xoofx commented Jun 4, 2023

EgorBo commented Jun 4, 2023

xoofx commented Jun 4, 2023

xoofx commented Jun 4, 2023

xoofx commented Jun 4, 2023 • edited Loading

EgorBo commented Jun 4, 2023

EgorBo commented Jun 4, 2023 • edited Loading

xoofx commented Jun 4, 2023 • edited Loading

AndyAyersMS commented Jun 5, 2023

xoofx commented Jun 5, 2023 • edited Loading

tannergooding commented Aug 11, 2023

xoofx commented Jun 4, 2023 •

edited

Loading

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023 •

edited

Loading

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023 •

edited

Loading

EgorBo commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 4, 2023 •

edited

Loading

xoofx commented Jun 5, 2023 •

edited

Loading