Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate Vector128<long>::op_Multiply on x64 #103555

Merged
merged 21 commits into from
Jun 28, 2024
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jun 17, 2024

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX512 is not presented in the system. It makes XxHash128 faster, see #103555 (comment)

public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;

Current codegen on x64 cpu without AVX512:

; Method MyBench:Foo
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long]   ;;; not inlined call!
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 120

New codegen:

; Method MyBench:Foo
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [r9]
       vpmuludq xmm2, xmm1, xmm0
       vpshufd  xmm1, xmm1, -79
       vpmulld  xmm0, xmm1, xmm0
       vxorps   xmm1, xmm1, xmm1
       vphaddd  xmm0, xmm0, xmm1
       vpshufd  xmm0, xmm0, 115
       vpaddq   xmm0, xmm0, xmm2
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret      
; Total bytes of code: 50

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 17, 2024

Note: results should be better if we do it in JIT, it will enable loop hoisting, cse, etc for MUL

@neon-sunset
Copy link
Contributor

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from xxHash128 issue.

@dotnet dotnet deleted a comment from EgorBot Jun 20, 2024
@dotnet dotnet deleted a comment from EgorBot Jun 20, 2024
@EgorBo
Copy link
Member Author

EgorBo commented Jun 20, 2024

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}

@EgorBot
Copy link

EgorBot commented Jun 20, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
Method Toolchain Mean Error Ratio
BenchXxHash128 Main 43.41 μs 0.087 μs 1.00
BenchXxHash128 PR 43.33 μs 0.009 μs 1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot
Copy link

EgorBot commented Jun 20, 2024

Benchmark results on Amd
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
Method Toolchain Mean Error Ratio
BenchXxHash128 Main 71.20 μs 0.022 μs 1.00
BenchXxHash128 PR 43.84 μs 0.013 μs 0.62

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBot
Copy link

EgorBot commented Jun 20, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
Method Toolchain Mean Error Ratio
BenchXxHash128 Main 116.9 μs 0.11 μs 1.00
BenchXxHash128 PR 116.8 μs 0.07 μs 1.00

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 21, 2024

/azp list

This comment was marked as resolved.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 21, 2024

/azp run runtime-coreclr jitstress-isas-x86

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@EgorBo
Copy link
Member Author

EgorBo commented Jun 21, 2024

@tannergooding PTAL, I'll add arm64 separately, need to test different impls.
I've expanded it in importer similar to existing op_Multiply expansions

Benchmark improvement: #103555 (comment)

@EgorBo EgorBo requested a review from tannergooding June 21, 2024 12:29
@EgorBo EgorBo marked this pull request as ready for review June 24, 2024 14:26
Comment on lines +21627 to +21631
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
CORINFO_TYPE_UINT, simdSize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know in other places we've started avoiding hadd in favor of shuffle+add, might be worth seeing if that's appropriate here too (low priority, non blocking)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to benchmark different implementations for it and they all were equaly fast e.g. #99871 (comment)

{
// TODO-XARCH-CQ: We should support long/ulong multiplication
// TODO-XARCH-CQ: 32bit support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to be honest, that check was pre-existing, I only changed comment

@EgorBo EgorBo merged commit 33ca32d into dotnet:main Jun 28, 2024
167 of 173 checks passed
@EgorBo EgorBo deleted the arm-mul-64bit branch June 28, 2024 21:40
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants