Instrument stackalloc for PGO #119041
Conversation
@EgorBot -amd -intel -arm

```csharp
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    [Arguments(2)]
    [Arguments(16)]
    [Arguments(64)]
    [Arguments(100)]
    [Arguments(300)]
    [Arguments(512)]
    [Arguments(2048)]
    [Arguments(16 * 1024)]
    public void NonConstantStackalloc(int n) => Consume(stackalloc byte[n]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Consume(Span<byte> span) { }
}
```
thanks!
Pull Request Overview
This PR adds Profile-Guided Optimization (PGO) instrumentation for stackalloc operations to improve performance by enabling constant-size optimization for frequently used stack allocation sizes.
Key changes:
- Introduces profiling and optimization for non-constant `stackalloc` operations
- Refactors profile value picking into a reusable utility function (the idea is sketched below)
- Creates a specialized tree node type for `stackalloc` to store IL offset information
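For intuition only, here is a minimal C# sketch of what "picking a profiled value" means, assuming a simple (value, count) histogram; `TryPickDominantValue` and the threshold are hypothetical stand-ins, not the JIT's actual `pickProfiledValue`:

```csharp
using System.Collections.Generic;
using System.Linq;

static class ProfileSketch
{
    // Hypothetical stand-in: given the (value, count) histogram collected by
    // instrumentation, return the dominant value only if it covers enough of
    // the samples to justify guarding the fast path on it.
    public static bool TryPickDominantValue(
        IReadOnlyDictionary<long, long> histogram, double threshold, out long value)
    {
        value = 0;
        long total = histogram.Values.Sum();
        if (total == 0)
        {
            return false;
        }

        var best = histogram.OrderByDescending(kv => kv.Value).First();
        if ((double)best.Value / total < threshold)
        {
            return false; // no single size dominates; don't specialize
        }

        value = best.Key;
        return true;
    }
}
```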
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/importercalls.cpp | Refactors profile value picking logic into reusable pickProfiledValue function |
| src/coreclr/jit/importer.cpp | Adds PGO instrumentation and optimization for stackalloc operations |
| src/coreclr/jit/gtstructs.h | Maps new GenTreeOpWithILOffset struct to GT_LCLHEAP node type |
| src/coreclr/jit/gtlist.h | Updates GT_LCLHEAP to use new specialized node type with IL offset |
| src/coreclr/jit/gentree.h | Defines new GenTreeOpWithILOffset struct for storing IL offset |
| src/coreclr/jit/gentree.cpp | Adds support for new node type in comparison, hashing, and cloning operations |
| src/coreclr/jit/fgprofile.cpp | Extends value profiling infrastructure to handle stackalloc operations |
| src/coreclr/jit/compiler.h | Adds declarations for new utility functions |
| src/coreclr/jit/block.h | Adds schema index field for value instrumentation |
```cpp
int bbHistogramSchemaIndex; // schema index for histogram instrumentation
int bbValueSchemaIndex;     // schema index for value instrumentation
```
Count and HandleHistogram each have their own index fields; value probing used to reuse the handle histogram's index, which could lead to asserts. The new field doesn't grow BasicBlock's layout (the struct had padding): still the same 272 bytes on Release 64-bit.
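A toy C# sketch of the fix's shape (the field names here are invented; the real fields live on the JIT's BasicBlock):

```csharp
// Toy illustration: each probe kind keeps its own schema cursor, so emitting
// one kind's schema record can never shift or clobber another kind's slot,
// which is the assert scenario described above.
class BlockProbeIndices
{
    public int CountSchemaIndex = -1;     // block counter probes
    public int HistogramSchemaIndex = -1; // handle histogram probes
    public int ValueSchemaIndex = -1;     // value probes (the new field)
}
```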
How does this work if there are multiple value probes in a block?
```diff
- op1 = gtNewOperNode(GT_LCLHEAP, TYP_I_IMPL, op2);
- // May throw a stack overflow exception. Obviously, we don't want locallocs to be CSE'd.
- op1->gtFlags |= (GTF_EXCEPT | GTF_DONT_CSE);
+ op1 = gtNewLclHeapNode(op2, opcodeOffs);
```
I'll move the entire CEE_LOCALLOC importation to a separate function in a separate PR to simplify code review.
PTAL @AndyAyersMS @dotnet/jit-contrib. This significantly speeds up non-constant stackalloc zeroing with the help of value profiling. I had to introduce a new node type, `GenTreeOpWithILOffset`, so the instrumentation can record the IL offset.
```cpp
if (lengthNode->TypeGet() != TYP_I_IMPL)
{
    lengthNode = compiler->gtNewCastNode(TYP_I_IMPL, lengthNode, /* isUnsigned */ false, TYP_I_IMPL);
}
```
Previously, all memset/memcpy primitives used a TYP_I_IMPL length; GT_LCLHEAP uses TYP_INT.
The ECMA spec is weird here:

> III.3.47
> ...
> The localloc instruction allocates size (type native unsigned int or U4) bytes from the local dynamic memory pool.

At any rate, representing it as TYP_I_IMPL seems OK.
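As a purely illustrative C# analogy for the two readings of the size operand (nothing here is from the PR):

```csharp
using System;

int n = 300;

// The spec's "native unsigned int or U4" wording suggests zero-extension
// to native width:
nuint unsignedSize = (uint)n;

// Treating the length as TYP_I_IMPL (a signed native int) instead means
// sign-extension; the two only differ when the 32-bit value is negative:
nint signedSize = n;

Console.WriteLine($"{unsignedSize} {signedSize}");
```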
Co-authored-by: Jakob Botsch Nielsen <[email protected]>
Can you try having two variable-sized stackallocs back to back and verify we get the right profile for each?
```cpp
}
else
{
    // NOTE: we don't want to convert the fastpath stackalloc to a local like we
```
If this block is executed frequently enough, maybe we should convert to a local? You can compare the block's weight to that of the method entry; if the ratio is close to 1, consider the conversion.
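A rough sketch of that weight comparison (all names and the threshold are hypothetical; the real check would use the JIT's internal block weights):

```csharp
// Hypothetical heuristic: treat the non-constant stackalloc block as a
// candidate for conversion to a local when it runs on nearly every call,
// i.e. its weight is close to the method entry's weight.
static bool ShouldConvertToLocal(double blockWeight, double entryWeight)
{
    if (entryWeight <= 0)
    {
        return false; // no meaningful profile data
    }
    double ratio = blockWeight / entryWeight;
    return ratio >= 0.9; // "close to 1" threshold, illustrative only
}
```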
If a non-constant stackalloc (without SkipLocalsInit) is used mostly with the same size, we can guard on that size and convert the hot path to a constant-size allocation (the shape of the transformation is sketched below), so it can benefit from faster zeroing.
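A conceptual C# sketch of the before/after shape (the 512 is an illustrative profiled size, not a value from the PR; the actual transformation happens on JIT IR, not source):

```csharp
using System;

static class StackallocSketch
{
    // Before: the size is only known at run time, so the JIT zeroes the
    // allocation with a generic loop.
    static void Before(int n)
    {
        Span<byte> span = stackalloc byte[n];
        Use(span);
    }

    // After (conceptual): with a profiled dominant size, the hot path becomes
    // a constant-size stackalloc whose zeroing the JIT can expand cheaply;
    // other sizes fall back to the original non-constant allocation.
    static void After(int n)
    {
        Span<byte> span = n == 512
            ? stackalloc byte[512]  // fast path: constant size
            : stackalloc byte[n];   // fallback: non-constant size
        Use(span);
    }

    static void Use(Span<byte> span) { }
}
```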
Benchmark results on linux_azure_cascadelake.