Closed
Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance: Performance related issue
Description
There is a significant performance regression in the .NET 9 JIT with System.Numerics.Vector<T> when vector constants are created inline inside a loop: in the benchmark below, the inline variant runs roughly 13x slower on .NET 9 than on .NET 8. Run the following benchmark to reproduce. The issue occurs at FullOpts, regardless of whether tiered compilation or PGO is enabled (according to Disasmo).
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

[SimpleJob(RuntimeMoniker.Net481)]
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net90)]
public class VectorStackSpill
{
    public IEnumerable<object> Args() // for single argument it's an IEnumerable of objects (object)
    {
        yield return new int[10000];
    }

    // Slow on .NET 9: the vector constant is created inline, inside the loop.
    [Benchmark]
    [ArgumentsSource(nameof(Args))]
    public void NumericsVectorDecrement(int[] array)
    {
        if ((array.Length & (Vector<int>.Count - 1)) != 0)
            throw new ArgumentOutOfRangeException("Not a multiple of vector length");
        ref var arrStart = ref array[0];
        for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
            ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
            p -= new Vector<int>(1);
        }
    }

    // Workaround: the vector constant is created once, before the loop.
    [Benchmark]
    [ArgumentsSource(nameof(Args))]
    public void NumericsVectorDecrementConstantExtracted(int[] array)
    {
        if ((array.Length & (Vector<int>.Count - 1)) != 0)
            throw new ArgumentOutOfRangeException("Not a multiple of vector length");
        var one = new Vector<int>(1);
        ref var arrStart = ref array[0];
        for (var i = 0; i <= (array.Length - Vector<int>.Count); i += Vector<int>.Count) {
            ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref arrStart, i));
            p -= one;
        }
    }
}
Regression?
This is a regression from .NET 8.0.
Data
| Method | Job | Runtime | array | Mean | Error | StdDev |
|----------------------------------------- |--------------------- |--------------------- |------------- |-----------:|---------:|---------:|
| NumericsVectorDecrement | .NET 8.0 | .NET 8.0 | Int32[10000] | 612.6 ns | 12.21 ns | 15.87 ns |
| NumericsVectorDecrementConstantExtracted | .NET 8.0 | .NET 8.0 | Int32[10000] | 560.8 ns | 11.04 ns | 17.51 ns |
| NumericsVectorDecrement | .NET 9.0 | .NET 9.0 | Int32[10000] | 7,866.9 ns | 49.79 ns | 46.58 ns |
| NumericsVectorDecrementConstantExtracted | .NET 9.0 | .NET 9.0 | Int32[10000] | 566.6 ns | 6.10 ns | 5.41 ns |
| NumericsVectorDecrement | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] | 692.3 ns | 13.85 ns | 20.30 ns |
| NumericsVectorDecrementConstantExtracted | .NET Framework 4.8.1 | .NET Framework 4.8.1 | Int32[10000] | 533.6 ns | 4.27 ns | 3.57 ns |
Analysis
Looking at the x64 disassembly in Disasmo reveals the issue.
On .NET 8 the constant is loaded into ymm0 from reloc @RWD00 once, outside the loop:
vmovups ymm0, ymmword ptr [reloc @RWD00]
align [5 bytes for IG03]
G_M000_IG03: ;; offset=0x0030
movsxd r8, eax
lea r8, bword ptr [rdx+4*r8]
vmovups ymm1, ymmword ptr [r8]
vpsubd ymm1, ymm1, ymm0
vmovups ymmword ptr [r8], ymm1
add eax, 8
cmp ecx, eax
jge SHORT G_M000_IG03
On .NET 9 the constant is created on the stack inside the loop:
align [0 bytes for IG03]
G_M000_IG03: ;; offset=0x0024
movsxd r8, eax
lea r8, bword ptr [rdx+4*r8]
vxorps ymm0, ymm0, ymm0
vmovups ymmword ptr [rsp+0x20], ymm0
vmovups ymm0, ymmword ptr [r8]
mov dword ptr [rsp+0x20], 1
mov dword ptr [rsp+0x24], 1
mov dword ptr [rsp+0x28], 1
mov dword ptr [rsp+0x2C], 1
mov dword ptr [rsp+0x30], 1
mov dword ptr [rsp+0x34], 1
mov dword ptr [rsp+0x38], 1
mov dword ptr [rsp+0x3C], 1
vpsubd ymm0, ymm0, ymmword ptr [rsp+0x20]
vmovups ymmword ptr [r8], ymm0
add eax, 8
cmp ecx, eax
jge SHORT G_M000_IG03
Manually moving the constant outside the loop works around the issue in .NET 9. The vector is still built on the stack, but only once, before the loop:
mov dword ptr [rsp+0x20], 1
mov dword ptr [rsp+0x24], 1
mov dword ptr [rsp+0x28], 1
mov dword ptr [rsp+0x2C], 1
mov dword ptr [rsp+0x30], 1
mov dword ptr [rsp+0x34], 1
mov dword ptr [rsp+0x38], 1
mov dword ptr [rsp+0x3C], 1
...
align [2 bytes for IG03]
G_M000_IG03: ;; offset=0x0070
movsxd r8, eax
lea r8, bword ptr [rdx+4*r8]
vmovups ymm0, ymmword ptr [r8]
vpsubd ymm0, ymm0, ymmword ptr [rsp+0x20]
vmovups ymmword ptr [r8], ymm0
add eax, 8
cmp ecx, eax
jge SHORT G_M000_IG03
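Since the workaround only changes where the constant is created, the two loop shapes should leave the array in the same state. The following is a minimal sketch of that check, separate from the benchmark; the class name `HoistedConstantCheck` and the method `ProduceSameResult` are hypothetical names introduced here for illustration, and the sketch assumes a non-empty array whose length is a multiple of `Vector<int>.Count`.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class HoistedConstantCheck
{
    // Inline form from the report: the constant is constructed inside the loop body.
    public static void DecrementInline(int[] array)
    {
        ref var start = ref array[0];
        for (var i = 0; i <= array.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref start, i));
            p -= new Vector<int>(1);
        }
    }

    // Workaround form: the constant is constructed once, before the loop.
    public static void DecrementHoisted(int[] array)
    {
        var one = new Vector<int>(1);
        ref var start = ref array[0];
        for (var i = 0; i <= array.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            ref var p = ref Unsafe.As<int, Vector<int>>(ref Unsafe.Add(ref start, i));
            p -= one;
        }
    }

    // Hypothetical helper: runs both forms on identical input and compares the
    // results element by element, also checking each value was decremented by one.
    public static bool ProduceSameResult(int length)
    {
        if (length <= 0 || length % Vector<int>.Count != 0)
            throw new ArgumentException("Length must be a positive multiple of Vector<int>.Count.", nameof(length));

        var a = new int[length];
        var b = new int[length];
        for (var i = 0; i < length; i++)
        {
            a[i] = i;
            b[i] = i;
        }

        DecrementInline(a);
        DecrementHoisted(b);

        for (var i = 0; i < length; i++)
        {
            if (a[i] != b[i] || a[i] != i - 1)
                return false;
        }
        return true;
    }
}
```

Calling `HoistedConstantCheck.ProduceSameResult(Vector<int>.Count * 16)` from a console program exercises both forms. The check only covers correctness; the performance difference shows up in the benchmark and the disassembly above.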