Performance regression on .NET 9.0 when creating AVX constants inside loops via System.Numerics.Vector #110125

@Chicken-Bones

Description

There is a significant performance regression in the .NET 9 JIT when a System.Numerics.Vector constant is created inline inside a loop. Run the following benchmark to reproduce. According to Disasmo, the issue occurs at FullOpts regardless of whether tiered compilation or PGO is enabled.

using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using System.Numerics;
using System.Runtime.CompilerServices;

[SimpleJob(RuntimeMoniker.Net481)]
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net90)]
public class VectorStackSpill
{
	public IEnumerable<object> Args() // for single argument it's an IEnumerable of objects (object)
	{
		yield return new int[10000];
	}

	[Benchmark]
	[ArgumentsSource(nameof(Args))]
	public void NumericsVectorDecrement(int[] array)
	{
		if ((array.Length & (Vector<int>.Count - 1)) != 0)
			throw new ArgumentOutOfRangeException("Not a multiple of vector length");

		ref var arrStart = ref array[0];
		for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
			ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
			p -= new Vector<int>(1);
		}
	}

	[Benchmark]
	[ArgumentsSource(nameof(Args))]
	public void NumericsVectorDecrementConstantExtracted(int[] array)
	{
		if ((array.Length & (Vector<int>.Count - 1)) != 0)
			throw new ArgumentOutOfRangeException("Not a multiple of vector length");

		var one = new Vector<int>(1);

		ref var arrStart = ref array[0];
		for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
			ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
			p -= one;
		}
	}
}

Regression?

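Yes, this is a regression from .NET 8.0. In the data below, NumericsVectorDecrement goes from 612.6 ns on .NET 8.0 to 7,866.9 ns on .NET 9.0 (roughly 13x slower), while the variant with the constant extracted is essentially unchanged.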

Data

| Method                                   | Job                  | Runtime              | array        | Mean       | Error    | StdDev   |
|----------------------------------------- |--------------------- |--------------------- |------------- |-----------:|---------:|---------:|
| NumericsVectorDecrement                  | .NET 8.0             | .NET 8.0             | Int32[10000] |   612.6 ns | 12.21 ns | 15.87 ns |
| NumericsVectorDecrementConstantExtracted | .NET 8.0             | .NET 8.0             | Int32[10000] |   560.8 ns | 11.04 ns | 17.51 ns |
| NumericsVectorDecrement                  | .NET 9.0             | .NET 9.0             | Int32[10000] | 7,866.9 ns | 49.79 ns | 46.58 ns |
| NumericsVectorDecrementConstantExtracted | .NET 9.0             | .NET 9.0             | Int32[10000] |   566.6 ns |  6.10 ns |  5.41 ns |
| NumericsVectorDecrement                  | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] |   692.3 ns | 13.85 ns | 20.30 ns |
| NumericsVectorDecrementConstantExtracted | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] |   533.6 ns |  4.27 ns |  3.57 ns |
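
To put the means in the table in per-iteration terms, here is a minimal sketch of the arithmetic. It assumes Vector<int>.Count is 8 on the test machine, which is consistent with the ymm registers and the `add eax, 8` step in the disassembly. The class and method names are hypothetical and are not part of the benchmark.

```csharp
using System;

public static class RegressionMath
{
    // Mean time divided by the number of loop iterations (length / lanes).
    // Assumes length is a multiple of lanes, as the benchmark itself requires.
    public static double PerIterationNs(double meanNs, int length, int lanes)
        => meanNs / (length / lanes);

    // Ratio of the regressed mean to the baseline mean.
    public static double Slowdown(double baselineNs, double regressedNs)
        => regressedNs / baselineNs;
}
```

With the table's values, `RegressionMath.Slowdown(612.6, 7866.9)` is about 12.8, and the per-iteration cost of NumericsVectorDecrement moves from about 0.49 ns on .NET 8.0 to about 6.29 ns on .NET 9.0 over 1250 iterations.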

Analysis

Looking at the x64 disassembly in Disasmo reveals the issue.

On .NET 8, the constant is loaded into ymm0 from reloc @RWD00 once, before the loop:

       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       align    [5 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0030
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vmovups  ymm1, ymmword ptr [r8]
       vpsubd   ymm1, ymm1, ymm0
       vmovups  ymmword ptr [r8], ymm1
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03

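On .NET 9, the constant is rebuilt on the stack on every iteration: a 32-byte slot at [rsp+0x20] is zeroed, filled with eight scalar stores, and then read back as the memory operand of vpsubd: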

       align    [0 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0024
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rsp+0x20], ymm0
       vmovups  ymm0, ymmword ptr [r8]
       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x24], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x2C], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x34], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x3C], 1
       vpsubd   ymm0, ymm0, ymmword ptr [rsp+0x20]
       vmovups  ymmword ptr [r8], ymm0
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03

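Manually moving the constant outside the loop works around the issue in .NET 9. The constant is still built on the stack with eight scalar stores, but only once, before the loop: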

       mov      dword ptr [rsp+0x20], 1
       mov      dword ptr [rsp+0x24], 1
       mov      dword ptr [rsp+0x28], 1
       mov      dword ptr [rsp+0x2C], 1
       mov      dword ptr [rsp+0x30], 1
       mov      dword ptr [rsp+0x34], 1
       mov      dword ptr [rsp+0x38], 1
       mov      dword ptr [rsp+0x3C], 1
...
       align    [2 bytes for IG03]
 
G_M000_IG03:                ;; offset=0x0070
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+4*r8]
       vmovups  ymm0, ymmword ptr [r8]
       vpsubd   ymm0, ymm0, ymmword ptr [rsp+0x20]
       vmovups  ymmword ptr [r8], ymm0
       add      eax, 8
       cmp      ecx, eax
       jge      SHORT G_M000_IG03
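
For reference, the workaround can be sketched in a self-contained form without BenchmarkDotNet or Unsafe. This is an illustration of the hoisting pattern only, not the benchmark code from this report: the class and method names are hypothetical, and it uses the array-based Vector<int> constructor and CopyTo instead of reinterpreting a ref, so the generated code will differ from the listings above.

```csharp
using System;
using System.Numerics;

public static class HoistedConstantSketch
{
    // Hypothetical helper: decrements every element by one, creating the
    // vector constant once before the loop instead of inside it.
    public static void DecrementHoisted(int[] array)
    {
        if (array.Length % Vector<int>.Count != 0)
            throw new ArgumentException(
                "Length must be a multiple of Vector<int>.Count.", nameof(array));

        var one = new Vector<int>(1); // created once, outside the loop

        for (var i = 0; i < array.Length; i += Vector<int>.Count)
        {
            var v = new Vector<int>(array, i); // load Vector<int>.Count elements
            (v - one).CopyTo(array, i);        // store the decremented lanes back
        }
    }
}
```

Vector<int>.One is another way to spell the same constant. Whether using it inline avoids the .NET 9 codegen shown above has not been checked here.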

Labels: area-CodeGen-coreclr, tenet-performance