Performance regression in Vector<T>.Vector<T>(T) on x86/x64 #108929

@ap5d

Description

When using .NET 9 RC2, the Vector<T> constructor that broadcasts a scalar to all elements of a vector is not optimized to a broadcast instruction on x86/x64. The .NET 8 compiler performs this optimization.

Reproduction Steps

The regression can be reproduced by compiling the following function:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector<int> ScalarToVector(int scalar) => new(scalar);
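
For anyone who wants to run the repro end to end, here is a minimal sketch of a complete console program around the function above. The `Repro` class name, the `AllLanesEqual` helper, and the test value 42 are illustrative assumptions, not part of the original report; the method is made `public` only so it can be called from outside the class.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class Repro
{
    // Same body as the function in the report; NoInlining keeps it as a
    // separate method so its codegen can be inspected in isolation.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector<int> ScalarToVector(int scalar) => new(scalar);

    // Hypothetical helper: checks that every lane holds the expected value.
    public static bool AllLanesEqual(Vector<int> v, int expected)
    {
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            if (v[i] != expected) return false;
        }
        return true;
    }

    static void Main()
    {
        Vector<int> v = ScalarToVector(42);
        Console.WriteLine($"Vector<int>.Count = {Vector<int>.Count}");
        Console.WriteLine($"IsHardwareAccelerated = {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"All lanes equal 42: {AllLanesEqual(v, 42)}");
    }
}
```

One way to view the generated code, assuming a runtime that honors these settings, is to run a Release build with the environment variables `DOTNET_JitDisasm=ScalarToVector` and `DOTNET_TieredCompilation=0` set. The report does not say how its listings were captured, so this is only one possible setup.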

Expected behavior

I would expect the compiler to use only a few instructions to broadcast the scalar to all elements. This is what the .NET 8 compiler produces:

       vzeroupper 
       vpbroadcastd  ymm0, esi
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       ret      

So a single vpbroadcastd does the job when AVX2 is enabled.

Actual behavior

With .NET 9 RC2, the following machine code is generated:

       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x30], ymm0
       mov      dword ptr [rbp-0x30], esi
       mov      dword ptr [rbp-0x2C], esi
       mov      dword ptr [rbp-0x28], esi
       mov      dword ptr [rbp-0x24], esi
       mov      dword ptr [rbp-0x20], esi
       mov      dword ptr [rbp-0x1C], esi
       mov      dword ptr [rbp-0x18], esi
       mov      dword ptr [rbp-0x14], esi
       vmovups  ymm0, ymmword ptr [rbp-0x30]
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       add      rsp, 48
       pop      rbp
       ret      

As you can see, the compiler zeroes a temporary on the stack, stores the scalar into each of its eight elements individually, and then reloads the result as a vector, which is much slower.

Regression?

No response

Known Workarounds

Use .NET 8, or select the matching Vector128/256/512.Create method based on the Vector<T> length:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector<int> ScalarToVector(int scalar)
{
    if (Vector<int>.Count == 16)
    {
        return Vector512.Create(scalar).AsVector();
    }
    else if (Vector<int>.Count == 8)
    {
        return Vector256.Create(scalar).AsVector();
    }
    else if (Vector<int>.Count == 4)
    {
        return Vector128.Create(scalar).AsVector();
    }
    else
    {
        return new(scalar);
    }
}

This workaround results in the following machine code with .NET 9 RC2:

       vpbroadcastd ymm0, esi
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       ret    
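
Because the workaround replaces the constructor with a different API, it may be worth confirming that both paths produce the same vector. The sketch below is one way to do that; the `WorkaroundCheck` class, the `Matches` helper, and the sample values are assumptions made for illustration, and the method is made `public` only so it can be called from outside the class.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class WorkaroundCheck
{
    // The workaround from the report: pick the fixed-size Create overload
    // that matches the runtime width of Vector<int>.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector<int> ScalarToVector(int scalar)
    {
        if (Vector<int>.Count == 16)
        {
            return Vector512.Create(scalar).AsVector();
        }
        else if (Vector<int>.Count == 8)
        {
            return Vector256.Create(scalar).AsVector();
        }
        else if (Vector<int>.Count == 4)
        {
            return Vector128.Create(scalar).AsVector();
        }
        else
        {
            return new(scalar);
        }
    }

    // Hypothetical helper: Vector<T>'s == operator returns true only when
    // all elements of both vectors are equal.
    public static bool Matches(int scalar) =>
        ScalarToVector(scalar) == new Vector<int>(scalar);

    static void Main()
    {
        foreach (int s in new[] { 0, 1, -1, int.MinValue, int.MaxValue })
        {
            Console.WriteLine($"{s}: {Matches(s)}");
        }
    }
}
```

This only checks that the results agree. Whether the faster instruction sequence is actually emitted still has to be verified from the disassembly.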

Configuration

No response

Other information

No response

Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
in-pr: There is an active PR which will close this issue when it is merged