Instrument stackalloc for PGO #119041
Conversation
@EgorBot -amd -intel -arm

```csharp
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    [Arguments(2)]
    [Arguments(16)]
    [Arguments(64)]
    [Arguments(100)]
    [Arguments(300)]
    [Arguments(512)]
    [Arguments(2048)]
    [Arguments(16 * 1024)]
    public void NonConstantStackalloc(int n) => Consume(stackalloc byte[n]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Consume(Span<byte> span) { }
}
```
thanks!
Pull Request Overview
This PR adds Profile-Guided Optimization (PGO) instrumentation for stackalloc operations to improve performance by enabling constant-size optimization for frequently used stack allocation sizes.
Key changes:
- Introduces profiling and optimization for non-constant `stackalloc` operations
- Refactors profile value picking into a reusable utility function (the idea is sketched below)
- Creates a specialized tree node type for `stackalloc` to store IL offset information
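For intuition only, here is a minimal C# sketch of what "picking a profiled value" means, assuming a simple (value, count) histogram; `TryPickDominantValue` and the threshold are hypothetical stand-ins, not the JIT's actual `pickProfiledValue`:

```csharp
using System.Collections.Generic;
using System.Linq;

static class ProfileSketch
{
    // Hypothetical stand-in: given the (value, count) histogram collected by
    // instrumentation, return the dominant value only if it covers enough of
    // the samples to justify guarding the fast path on it.
    public static bool TryPickDominantValue(
        IReadOnlyDictionary<long, long> histogram, double threshold, out long value)
    {
        value = 0;
        long total = histogram.Values.Sum();
        if (total == 0)
        {
            return false;
        }

        var best = histogram.OrderByDescending(kv => kv.Value).First();
        if ((double)best.Value / total < threshold)
        {
            return false; // no single size dominates; don't specialize
        }

        value = best.Key;
        return true;
    }
}
```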
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/importercalls.cpp | Refactors profile value picking logic into reusable pickProfiledValue function |
| src/coreclr/jit/importer.cpp | Adds PGO instrumentation and optimization for stackalloc operations |
| src/coreclr/jit/gtstructs.h | Maps new GenTreeOpWithILOffset struct to GT_LCLHEAP node type |
| src/coreclr/jit/gtlist.h | Updates GT_LCLHEAP to use new specialized node type with IL offset |
| src/coreclr/jit/gentree.h | Defines new GenTreeOpWithILOffset struct for storing IL offset |
| src/coreclr/jit/gentree.cpp | Adds support for new node type in comparison, hashing, and cloning operations |
| src/coreclr/jit/fgprofile.cpp | Extends value profiling infrastructure to handle stackalloc operations |
| src/coreclr/jit/compiler.h | Adds declarations for new utility functions |
| src/coreclr/jit/block.h | Adds schema index field for value instrumentation |
```cpp
int bbHistogramSchemaIndex; // schema index for histogram instrumentation
int bbValueSchemaIndex;     // schema index for value instrumentation
```
Count and HandleHistogram each have their own index fields; value probing used to reuse the handle histogram's index, which could lead to asserts. The new field doesn't grow BasicBlock's layout (the struct had padding): still the same 272 bytes on Release 64-bit.
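A toy C# sketch of the fix's shape (the field names here are invented; the real fields live on the JIT's BasicBlock):

```csharp
// Toy illustration: each probe kind keeps its own schema cursor, so emitting
// one kind's schema record can never shift or clobber another kind's slot,
// which is the assert scenario described above.
class BlockProbeIndices
{
    public int CountSchemaIndex = -1;     // block counter probes
    public int HistogramSchemaIndex = -1; // handle histogram probes
    public int ValueSchemaIndex = -1;     // value probes (the new field)
}
```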
How does this work if there are multiple value probes in a block?
```diff
- op1 = gtNewOperNode(GT_LCLHEAP, TYP_I_IMPL, op2);
- // May throw a stack overflow exception. Obviously, we don't want locallocs to be CSE'd.
- op1->gtFlags |= (GTF_EXCEPT | GTF_DONT_CSE);
+ op1 = gtNewLclHeapNode(op2, opcodeOffs);
```
I'll move the entire CEE_LOCALLOC importation to a separate function in a separate PR to simplify code review.
PTAL @AndyAyersMS @dotnet/jit-contrib. This significantly speeds up non-constant stackalloc zeroing with the help of value profiling. I had to introduce a new node type, `GenTreeOpWithILOffset`, so the instrumentation can record the IL offset.
```cpp
if (lengthNode->TypeGet() != TYP_I_IMPL)
{
    lengthNode = compiler->gtNewCastNode(TYP_I_IMPL, lengthNode, /* isUnsigned */ false, TYP_I_IMPL);
}
```
Previously, all memset/memcpy primitives used a TYP_I_IMPL length; GT_LCLHEAP uses TYP_INT.
The ECMA spec is weird here:

> III.3.47
> ...
> The localloc instruction allocates size (type native unsigned int or U4) bytes from the local dynamic memory pool.

At any rate, representing it as TYP_I_IMPL seems OK.
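As a purely illustrative C# analogy for the two readings of the size operand (nothing here is from the PR):

```csharp
using System;

int n = 300;

// The spec's "native unsigned int or U4" wording suggests zero-extension
// to native width:
nuint unsignedSize = (uint)n;

// Treating the length as TYP_I_IMPL (a signed native int) instead means
// sign-extension; the two only differ when the 32-bit value is negative:
nint signedSize = n;

Console.WriteLine($"{unsignedSize} {signedSize}");
```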
Co-authored-by: Jakob Botsch Nielsen <[email protected]>
Can you try having two variable-sized stackallocs back to back and verify we get the right profile for each?
```cpp
}
else
{
    // NOTE: we don't want to convert the fastpath stackalloc to a local like we
```
If this block is executed frequently enough, maybe we should convert to a local? You can compare the block's weight to that of the method entry; if the ratio is close to 1, consider the conversion.
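A rough sketch of that weight comparison (all names and the threshold are hypothetical; the real check would use the JIT's internal block weights):

```csharp
// Hypothetical heuristic: treat the non-constant stackalloc block as a
// candidate for conversion to a local when it runs on nearly every call,
// i.e. its weight is close to the method entry's weight.
static bool ShouldConvertToLocal(double blockWeight, double entryWeight)
{
    if (entryWeight <= 0)
    {
        return false; // no meaningful profile data
    }
    double ratio = blockWeight / entryWeight;
    return ratio >= 0.9; // "close to 1" threshold, illustrative only
}
```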
If a non-constant stackalloc (without SkipLocalsInit) is used mostly with the same size, we can guard on that size and convert the hot path to a constant-size allocation (the shape of the transformation is sketched below), so it can benefit from faster zeroing.
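A conceptual C# sketch of the before/after shape (the 512 is an illustrative profiled size, not a value from the PR; the actual transformation happens on JIT IR, not source):

```csharp
using System;

static class StackallocSketch
{
    // Before: the size is only known at run time, so the JIT zeroes the
    // allocation with a generic loop.
    static void Before(int n)
    {
        Span<byte> span = stackalloc byte[n];
        Use(span);
    }

    // After (conceptual): with a profiled dominant size, the hot path becomes
    // a constant-size stackalloc whose zeroing the JIT can expand cheaply;
    // other sizes fall back to the original non-constant allocation.
    static void After(int n)
    {
        Span<byte> span = n == 512
            ? stackalloc byte[512]  // fast path: constant size
            : stackalloc byte[n];   // fallback: non-constant size
        Use(span);
    }

    static void Use(Span<byte> span) { }
}
```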
Benchmark results on linux_azure_cascadelake.