Skip to content

Conversation

@alexcovington
Copy link
Contributor

The left shift operator for Vector128<byte>/Vector256<byte>/Vector512<byte> does not vectorize. This PR vectorizes the left shift operator for these types on x64.

For example, this C# code will now vectorize:

public Vector128<Byte> LeftShiftByte(Vector128<Byte> v, int shiftAmount)
{
    return v << shiftAmount;
}
Codegen

Before

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       sub       rsp,38
       mov       rax,[r8]
       mov       [rsp+28],rax
       movzx     eax,byte ptr [rsp+28]
       and       r9d,7
       shlx      eax,eax,r9d
       mov       [rsp+30],al
       movzx     eax,byte ptr [rsp+29]
       shlx      eax,eax,r9d
       mov       [rsp+31],al
       movzx     eax,byte ptr [rsp+2A]
       shlx      eax,eax,r9d
       mov       [rsp+32],al
       movzx     eax,byte ptr [rsp+2B]
       shlx      eax,eax,r9d
       mov       [rsp+33],al
       movzx     eax,byte ptr [rsp+2C]
       shlx      eax,eax,r9d
       mov       [rsp+34],al
       movzx     eax,byte ptr [rsp+2D]
       shlx      eax,eax,r9d
       mov       [rsp+35],al
       movzx     eax,byte ptr [rsp+2E]
       shlx      eax,eax,r9d
       mov       [rsp+36],al
       movzx     eax,byte ptr [rsp+2F]
       shlx      eax,eax,r9d
       mov       [rsp+37],al
       mov       rax,[rsp+30]
       mov       rcx,[r8+8]
       mov       [rsp+18],rcx
       movzx     ecx,byte ptr [rsp+18]
       shlx      ecx,ecx,r9d
       mov       [rsp+20],cl
       movzx     ecx,byte ptr [rsp+19]
       shlx      ecx,ecx,r9d
       mov       [rsp+21],cl
       movzx     ecx,byte ptr [rsp+1A]
       shlx      ecx,ecx,r9d
       mov       [rsp+22],cl
       movzx     ecx,byte ptr [rsp+1B]
       shlx      ecx,ecx,r9d
       mov       [rsp+23],cl
       movzx     ecx,byte ptr [rsp+1C]
       shlx      ecx,ecx,r9d
       mov       [rsp+24],cl
       movzx     ecx,byte ptr [rsp+1D]
       shlx      ecx,ecx,r9d
       mov       [rsp+25],cl
       movzx     ecx,byte ptr [rsp+1E]
       shlx      ecx,ecx,r9d
       mov       [rsp+26],cl
       movzx     ecx,byte ptr [rsp+1F]
       shlx      ecx,ecx,r9d
       mov       [rsp+27],cl
       mov       rcx,[rsp+20]
       mov       [rsp],rax
       mov       [rsp+8],rcx
       vmovaps   xmm0,[rsp]
       vmovups   [rdx],xmm0
       mov       rax,rdx
       add       rsp,38
       ret
; Total bytes of code 285

After

; Vector128LeftShiftRepro.VectorBenchmarks.LeftShiftByte(System.Runtime.Intrinsics.Vector128`1<Byte>, Int32)
       vmovups   xmm0,[r8]
       and       r9d,7
       vmovd     xmm1,r9d
       vpsllw    xmm0,xmm0,xmm1
       mov       eax,0FF
       shlx      eax,eax,r9d
       vpbroadcastb xmm1,eax
       vpand     xmm0,xmm1,xmm0
       vmovups   [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 46

I see performance improvements in some of the System.Numerics.Tensors microbenchmarks on my x64 machine:

Baseline

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |   304.26 ns |  0.937 ns |  0.876 ns |   304.17 ns |   303.00 ns |   305.91 ns |  1.00 |         - |          NA |
| ShiftLeft | Job-UPQSMH | \runtime-base\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         | 3,972.14 ns | 27.019 ns | 25.274 ns | 3,966.06 ns | 3,941.09 ns | 4,010.73 ns | 1.000 |         - |          NA |

Diff

| Method    | Job        | Toolchain                                                                             | BufferLength | Mean        | Error     | StdDev    | Median      | Min         | Max         | Ratio | Allocated | Alloc Ratio |
|---------- |----------- |-------------------------------------------------------------------------------------- |------------- |------------:|----------:|----------:|------------:|------------:|------------:|------:|----------:|------------:|
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 128          |    31.29 ns |  0.174 ns |  0.163 ns |    31.25 ns |    31.00 ns |    31.57 ns |  0.10 |         - |          NA |
| ShiftLeft | Job-ERWYNR | \runtime-diff\artifacts\tests\coreclr\windows.x64.Release\Tests\Core_Root\corerun.exe | 3079         |    31.36 ns |  0.269 ns |  0.252 ns |    31.34 ns |    30.86 ns |    31.75 ns | 0.008 |         - |          NA |

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 27, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 27, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@alexcovington alexcovington marked this pull request as ready for review September 30, 2024 15:56
intrinsic =
GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup, TYP_INT, simdSize, false);
GenTreeHWIntrinsic::GetHWIntrinsicIdForBinOp(this, op, op1, op2ForLookup,
op == GT_RSZ ? TYP_INT : TYP_SHORT, simdSize, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be TYP_SHORT for left shift?

pslld is also available for SSE2 and should achieve the same thing, correct? We typically prefer int where it is possible since instructions operating on 32-bit elements tend to have better overall throughput and optimization opportunities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're also right, it doesn't need to be TYP_SHORT. I hadn't originally considered that it could be TYP_INT as well.

I updated the type here to be TYP_INT.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one question about whether we actually need to use TYP_SHORT.

Thanks for getting this up and completed!

@tannergooding
Copy link
Member

CC. @dotnet/jit-contrib for secondary review.

Simple PR from our friends at AMD

@EgorBo
Copy link
Member

EgorBo commented Oct 14, 2024

/azp run Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding
Copy link
Member

Fuzzlyn failures are unrelated.

@tannergooding
Copy link
Member

PFX failure is also unrelated, was fixed by f059869

@tannergooding tannergooding merged commit 790a607 into dotnet:main Oct 16, 2024
115 checks passed
@tannergooding
Copy link
Member

Thanks for the contribution @alexcovington

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants