Member

@jakobbotsch jakobbotsch commented Oct 1, 2025

  • Add new JIT-EE API to report back debug information about the generated state machine and continuations
  • Refactor debug info storage on VM side to be more easily extensible. The new format has either a thin or fat header. The fat header is used when we have either uninstrumented bounds, patchpoint info, rich debug info or async debug info, and stores the blob sizes of all of those components in addition to the bounds and vars. It is indicated by the first field (size of bounds) having value 0, which is an uncommon value for this field.
  • Add new async debug information to the storage on the VM side
  • Set target method desc for async resumption stubs, to be used for mapping from continuations back to the async IL function that it will resume.
  • Implement new format in R2R as well, bump R2R major version (might as well do this now as we expect to need to store async debug info in R2R during .NET 11 anyway)
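To illustrate the thin/fat header distinction described above, here is a minimal sketch. It is not the VM's decoder: it assumes the header fields have already been decoded into plain integers (the real storage is compressed), and the struct name, function name, the thin header's field pair, and the order of the fat header's size fields are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded view of the header; the field order is an assumption.
struct DebugInfoSizes
{
    bool     fat = false;
    uint32_t bounds = 0;
    uint32_t vars = 0;
    uint32_t uninstrumentedBounds = 0;
    uint32_t patchpointInfo = 0;
    uint32_t richDebugInfo = 0;
    uint32_t asyncInfo = 0;
};

// `fields` holds header values that were already decompressed (assumption).
// Returns false if the header is truncated.
inline bool ParseDebugInfoHeader(const std::vector<uint32_t>& fields, DebugInfoSizes* out)
{
    if (fields.empty())
        return false;

    DebugInfoSizes sizes;
    if (fields[0] != 0)
    {
        // Thin header: the first field is the (non-zero) size of the bounds.
        if (fields.size() < 2)
            return false;
        sizes.bounds = fields[0];
        sizes.vars = fields[1];
    }
    else
    {
        // Fat header: a bounds size of 0 acts as the marker, and the sizes of
        // all components follow.
        if (fields.size() < 7)
            return false;
        sizes.fat = true;
        sizes.bounds = fields[1];
        sizes.vars = fields[2];
        sizes.uninstrumentedBounds = fields[3];
        sizes.patchpointInfo = fields[4];
        sizes.richDebugInfo = fields[5];
        sizes.asyncInfo = fields[6];
    }

    *out = sizes;
    return true;
}
```

A reader that only understands bounds and vars can keep using the thin path unchanged, which is the extensibility benefit the description refers to.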

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 1, 2025
@jakobbotsch jakobbotsch changed the title from "Add debug information for runtime async information" to "Add debug information for runtime async methods" Oct 1, 2025

CONTRACTL
{
-    NOTHROW;
+    THROWS;
Member Author

RestorePatchpointInfo is called from JitPatchpointWorker, which has STANDARD_VM_CONTRACT. I think throwing should be ok, since we only throw here on an internal inconsistency in the compressed data when encountered by NibbleReader. That's consistent with the other Restore routines and saves us having to write separate decoding routines for the patchpoint info.

Comment on lines +6548 to +6555
CompressDebugInfo::RestoreRichDebugInfo(
fpNew, pNewData,
pDebugInfo,
ppInlineTree, pNumInlineTree,
ppRichMappings, pNumRichMappings);

return TRUE;
}
Member Author

I figured I might as well hook this one up in ReadyToRunJitManager too, since the format now technically allows for R2R images that contain the rich debug info, even if crossgen2 doesn't produce it.

uint32_t Offset;
// Index in continuation's object[] data where this variable's GC pointers are stored, or 0xFFFFFFFF
// if the variable does not have any GC pointers
uint32_t GCIndex;
Member

Since we are emitting custom methodtables for the continuations, should we rather make the field layout flat and avoid these extra layers?

Member

Right now Continuation is the same single type for all the uses:

internal sealed unsafe class Continuation
{
    public Continuation? Next;
    public delegate*<Continuation, Continuation?> Resume;
    public uint State;
    public CorInfoContinuationFlags Flags;

    // Data and GCData contain the state of the continuation.
    // Note: The JIT is ultimately responsible for laying out these arrays.
    // However, other parts of the system depend on the layout to
    // know where to locate or place various pieces of data:
    //
    // 1. Resumption stubs need to know where to place the return value
    // inside the next continuation. If the return value has GC references
    // then it is boxed and placed at GCData[0]; otherwise, it is placed
    // inside Data at offset 0 if
    // CORINFO_CONTINUATION_OSR_IL_OFFSET_IN_DATA is NOT set and otherwise
    // at offset 4.
    //
    // 2. Likewise, Finalize[Value]TaskReturningThunk needs to know from
    // where to extract the return value.
    //
    // 3. The dispatcher needs to know where to place the exception inside
    // the next continuation with a handler. Continuations with handlers
    // have CORINFO_CONTINUATION_NEEDS_EXCEPTION set. The exception is
    // placed at GCData[0] if CORINFO_CONTINUATION_RESULT_IN_GCDATA is NOT
    // set, and otherwise at GCData[1].
    //
    public byte[]? Data;
    public object?[]? GCData;

Member

Have we done any modeling whether it is worth it to have the data in separate blocks from the continuation? The extra object headers and indirections are not free.

Member

@VSadov VSadov Oct 7, 2025

The design makes Continuation shape mostly opaque. Most helpers and the infrastructure do not dig into internals of Continuation. As long as JIT is self-consistent in terms of serialization/deserialization of locals, most other things are not concerned with the shape thus JIT can change the continuation shape, modulo R2R and debug interfaces.

There are just a few things that are meaningful to the infrastructure: the link to the Next continuation, the flags, and the locations of the return value or an exception, for when the infrastructure, upon completion of the callee, needs to place results into the caller's continuation before resuming it.
How most of the locals are stored is an opaque agreement between the method that serializes locals into a continuation and the method that deserializes them, which is the same method.

Currently we use two arrays (array of bytes + array of objects) to serialize/deserialize locals. The biggest advantage is "time-to-market", obviously.
There are other advantages - like there is no type or GC layout generation. The shape of continuation is different for every await, so it is useful to only emit site-specific code, as we must do anyways, but not types/layouts. There could be more than one await per method so it could add up. We optimize for suspensions never happening though, statistically speaking, so slightly less efficient format which has fewer upfront/static requirements is attractive.

There are disadvantages:

  • it is not the most compact format. Think of capturing just one int and one object.
  • return values that happen to be object-containing structs need to be boxed.
  • Also, the format could be difficult for external introspection, such as by a debugger.
    For example structs containing a mix of int and object fields get their fields stored in different arrays accordingly. It is not a problem for JIT to reconstitute such struct, but it could be an inconvenience for other observers.

Anyways. The API here tries to support the current format, but leave the door open for future changes.
I think it is good to not have too many parts changing at once, unless it is blocking or costly to change later, so I think it is a good approach even if we think of tweaking the continuation format.

Member Author

Since we are emitting custom methodtables for the continuations, should we rather make the field layout flat and avoid these extra layers?

Yes, once/if #120411 makes it in, we only need Offset here.

Originally I didn't add GCIndex for precisely the reason that I expected we would do that. However @tommcdon was playing around with the debug info earlier and wondered about the data he was seeing, and since it's not a large change I decided to just add it for now, with the catch that it might change in the future.

I agree with all of @VSadov's points, but we probably need to measure it. I also am somewhat worried about creating custom MethodTable for every continuation up front. Although the actual creation in #120411 is lightweight so maybe it will not be a problem.
In the end we could even take a hybrid approach with the byte[]/object[] version used for tier0/debug and the flat versions used for tier1. Of course that makes it more complicated for everyone since now they need to be aware of two possible formats.

Member

For example structs containing a mix of int and object fields get their fields stored in different arrays accordingly. It is not a problem for JIT to reconstitute such struct, but it could be an inconvenience for other observers.

Is there a design for how the debugger could reconstitute such a struct? It doesn't seem like the current debug info is sufficiently powerful to represent that.

The API here tries to support the current format, but leave the door open for future changes.

If you'd like to leave the door open to store locals inline within the Continuation rather than indirected, perhaps represent the variable location as:

        enum Base
        {
            ContinuationObj = 1, // variable stored at continuation + offset
            DataArray = 2,       // variable stored at continuation->Data + offset
            GCDataArray = 3      // variable stored at continuation->GCData + offset
        }
        Base OffsetBase;
        uint32_t Offset;

Alternatively, changing the contract sometime in the next 6 months probably isn't too costly. Doing it close to or after .NET 11 ships would carry additional burdens.
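As a rough illustration of how a consumer might use the OffsetBase/Offset pair suggested above, the sketch below resolves a variable's address from the three possible bases. `ContinuationView`, `VarLocation`, and `ResolveVarAddress` are hypothetical names invented for this example, and the view simply holds raw pointers to where each base starts; it is not how the runtime or any debugger actually represents a continuation.

```cpp
#include <cstddef>
#include <cstdint>

// The enum mirrors the values proposed in the comment above.
enum class Base : uint32_t
{
    ContinuationObj = 1, // variable stored at continuation + offset
    DataArray = 2,       // variable stored at continuation->Data + offset
    GCDataArray = 3      // variable stored at continuation->GCData + offset
};

struct VarLocation
{
    Base     OffsetBase;
    uint32_t Offset;
};

// Hypothetical view: raw pointers to the start of each possible base.
struct ContinuationView
{
    uint8_t* self;
    uint8_t* data;
    uint8_t* gcData;
};

// Returns the address of the variable, or nullptr if the base is unknown or
// the corresponding array is absent.
inline uint8_t* ResolveVarAddress(const ContinuationView& c, const VarLocation& loc)
{
    uint8_t* base = nullptr;
    switch (loc.OffsetBase)
    {
        case Base::ContinuationObj: base = c.self;   break;
        case Base::DataArray:       base = c.data;   break;
        case Base::GCDataArray:     base = c.gcData; break;
    }
    return base == nullptr ? nullptr : base + loc.Offset;
}
```

With this shape, moving locals inline into the continuation object would only change which base the JIT reports, not the consumer's resolution logic.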

Member Author

Is there a design for how the debugger could reconstitute such a struct? It doesn't seem like the current debug info is sufficiently powerful to represent that.

To replicate the way the JIT currently stores/restores these values, for a type of size S:

  1. If Offset != UINT_MAX, take S bytes from Data starting from Offset
  2. If GCIndex != UINT_MAX, fill in the GC pointers in ascending order of offset in the type, starting with the GC pointer at index GCIndex
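The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not debugger or runtime code: `ObjectRef`, `AsyncVarLocation`, and `ReconstructValue` are hypothetical names, the arrays are modeled as plain vectors, and the caller is assumed to supply the byte offsets of the object references inside the type (in ascending order), since the debug info itself does not carry them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for runtime concepts; not the real VM definitions.
using ObjectRef = void*;
constexpr uint32_t kNoLocation = 0xFFFFFFFFu; // the UINT_MAX sentinel

struct AsyncVarLocation
{
    uint32_t Offset;  // byte offset into Data, or kNoLocation
    uint32_t GCIndex; // first index into GCData, or kNoLocation
};

// Rebuilds a value of `size` bytes into `dest`.
// gcFieldOffsets: byte offsets of the object references within the type, in
// ascending order (assumed to come from the type's GC layout).
inline void ReconstructValue(const AsyncVarLocation& loc,
                             const std::vector<uint8_t>& data,
                             const std::vector<ObjectRef>& gcData,
                             const std::vector<size_t>& gcFieldOffsets,
                             size_t size,
                             uint8_t* dest)
{
    // Step 1: take `size` bytes from Data starting at Offset.
    if (loc.Offset != kNoLocation)
        std::memcpy(dest, data.data() + loc.Offset, size);

    // Step 2: fill in the GC pointers in ascending field-offset order,
    // starting with the entry at GCIndex.
    if (loc.GCIndex != kNoLocation)
    {
        for (size_t i = 0; i < gcFieldOffsets.size(); i++)
        {
            std::memcpy(dest + gcFieldOffsets[i],
                        &gcData[loc.GCIndex + i],
                        sizeof(ObjectRef));
        }
    }
}
```

Note that the consumer has to know the type's GC layout to perform step 2, which is part of why this scheme is awkward for external observers.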

Member

If debuggers hard-code that algorithm it becomes a breaking change if the JIT ever wants to lay out the data differently. Is this an algorithm we want to set in stone or just a stop-gap for now? I haven't had a chance to get a good look at #120411 yet but it suggests our plans for field layout are still in flux.

In that algorithm above, do we have to worry about alignment padding, or will all the fields be packed?

Member Author

@jakobbotsch jakobbotsch Oct 9, 2025

Is this an algorithm we want to set in stone or just a stop-gap for now?

At this point I am expecting/hopeful that #120411 makes it in. It makes things simpler -- a single offset and the type is just stored in the normal way at that offset in the continuation. But it has other implications for various components as @davidwrighton pointed out in that PR.

The current storage mechanism exists almost unchanged since the original prototype in 2023. It was the simplest thing I could think of that didn't require boxing all structs with object fields in them.

In that algorithm above, do we have to worry about alignment padding or all the fields will be packed?

Do you mean for step 2? The GCIndex encodes an index into the GCData array. If the value has N object refs in it, then GCData will store N object refs starting at that index. However, it will be up to the reconstruction code to figure out where those N object refs are stored in the value and to copy the GC refs from GCData across.

There can be alignment padding in Data to make sure value types are aligned properly but reconstruction doesn't need to worry about that.

Member

I still have a little confusion about the current algorithm, but since it doesn't look like we'll be keeping it much longer no need for me to suss out the details :) I'm assuming the new continuation types will trigger a new round of updates here.

Member

jkotas commented Oct 7, 2025

it is not the most compact format. Think of capturing just one int and one object.

Right. It is kind of similar to how Async v1 used to work in .NET Framework, where the state and Task were separate objects. We got a nice improvement in .NET Core by getting rid of the indirections and stuffing everything into one object.

costly to change later,

This is starting to establish contracts for diagnostics. Those are always costly to change later.

Member

VSadov commented Oct 7, 2025

Prior to Roslyn, if I remember correctly, some locals were captured into an array of objects, because display types were created too early, when not all locals were known. And native CSC was capturing everything visible within containing { } scopes.
Roslyn moved async much further in lowering, so locals are known, and started using liveness analysis to see what can possibly live across an await (the rest can be ordinary locals).
It is a more compact design, although some scenarios had concerns with producing numerous display struct shapes.

Runtime async could tailor continuation shape to the corresponding await as well.

I think one possibility (just a rough example) - we could have ContinuationBase that has only what infrastructure needs - Next, Resume, Flags. And the local state would be stored in the derived Continuation_SomeDisambiguationNumber. The locals could be stored as flat fields and fields could have name-mangling after the locals they represent - kind of like Roslyn does.
For such format storing just Offset for every IL local would be sufficient.

If we have something like this by the end of .NET 11, GCIndex could be dropped.

I do not think we have a lot of choices right now, if we want to make progress. GCIndex is needed to parse the current format.

Other parts are less likely to change.

@jakobbotsch
Member Author

I think one possibility (just a rough example) - we could have ContinuationBase that has only what infrastructure needs - Next, Resume, Flags. And the local state would be stored in the derived Continuation_SomeDisambiguationNumber. The locals could be stored as flat fields and fields could have name-mangling after the locals they represent - kind of like Roslyn does.
For such format storing just Offset for every IL local would be sufficient.

This is essentially what #120411 is doing, although there is no dynamic description of fields or anything like that. The JIT just hands the VM a map of the fields that have object refs in them.

return TRUE;
}

BOOL EEJitManager::GetAsyncDebugInfo(
Member

In order to invoke GetAsyncDebugInfo we'd first need to create a DebugInfoRequest which requires knowing the starting address of the method. Presumably an async stackwalking algorithm is starting from a Continuation object to represent an async frame. Is there a plan yet for how we resolve from Continuation to code start address in order to get at any of this other method data?

Member Author

Currently, the plan is that looking up the call address goes something like:

  1. Continuation.Resume contains a pointer to an IL stub. Resolve that function pointer into the MethodDesc*.
  2. The IL stub's resolver contains a pointer to the original async IL method's MethodDesc*
  3. The MethodDesc* has the compressed async debug info, so access that (requires decompression)
  4. Now use Continuation.State to map back to the IL offset of the call that resulted in the suspension point with that State
  5. Finally use IL -> Native mappings to get a native IP

There are concerns being raised about the performance of this if we are making the async stackwalking part of regular stackwalking. I think we can make the improvement you suggested above to store native IP instead. I also think we can store the State -> IP mapping in the compressed debug info in a way that it can be accessed in constant time without fully decompressing it. Then the process looks something like:

  1. Same as above
  2. Same as above
  3. Map Continuation.State -> IP by accessing the debug info directly

That hopefully puts this resolution process in the same realm as normal unwinding data processing of the standard synchronous stackwalking. Or faster than that.
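A minimal sketch of what a constant-time State -> IP lookup could look like, assuming the mapping is stored as a flat, fixed-width table indexed by state. The table type and function name are hypothetical, and nothing here reflects the actual compressed debug info encoding; it only shows why a directly indexable layout avoids full decompression.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-method table: entry i is the native offset (relative to
// the method start) associated with Continuation.State == i.
struct SuspensionPointTable
{
    std::vector<uint32_t> nativeOffsets;
};

// Maps a state number to a native IP with a single bounds check and one
// indexed load. Returns false for an out-of-range state.
inline bool TryMapStateToIP(const SuspensionPointTable& table,
                            uintptr_t methodStart,
                            uint32_t state,
                            uintptr_t* ip)
{
    if (state >= table.nativeOffsets.size())
        return false;

    *ip = methodStart + table.nativeOffsets[state];
    return true;
}
```

The method start would still have to come from the first two steps (resumption stub to target method), so this only addresses the last step of the resolution.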

Member

What does it mean to do the first two steps, going from the Continuation.Resume -> MethodDesc * of the IL stub, some map funcPtr -> MethodDesc ? Once having the MethodDesc to the stub do AsDynamicMethodDesc()->GetILStubResolver()->GetStubTargetMethodDesc() to get to the MethodDesc * of the underlying async method? Last step looks straightforward and fast, not sure about the first step or maybe there is a quick path to go from function pointer to MethodDesc*?

Member Author

I think NonVirtualEntry2MethodDesc can be used to do the first mapping.

Member

@lateralusX lateralusX Oct 8, 2025

OK. There was some discussion above about storing the native IPs on the continuation for the suspension point and potentially the resumption point. I guess that has its pros and cons: size increase, additional bookkeeping, and state that needs to be updated if the method changes in any way while continuations are still around. Could an alternative be to store the MethodDesc representing the underlying async method directly in the continuation? Then it would be fast to get hold of the MethodDesc* given a continuation, and the state could be used to get the native IP of the suspension and potentially resumption points for that state. Having said that, maybe NonVirtualEntry2MethodDesc will be fast enough to go from a continuation to a native IP during a stackwalk. We benchmark against a normal unwinder step getting to the next frame, so given that the continuation chain will be quicker to walk (like a shadow stack), it would be nice not to waste that perf on extensive lookups to get the native IP of the suspension point.

Member

@jkotas jkotas Oct 16, 2025

I think you can construct benchmarks that show benefits of each scheme. It looks like a variant of the optimize for size vs. speed tradeoff.

What would it take to allow both schemes to co-exist? Obviously, the JIT would have to pass down whether to create the continuation optimized for size or speed - that's "just work". It may be interesting to think about the impact that this would have on async stackwalking and diagnostics.

(I am not asking to build this as part of this PR.)

Member

@lateralusX lateralusX Oct 16, 2025

I made a different comment below on a similar topic, the impact on stackwalking and diagnostics: #120303 (comment). At least for EventPipe, we have external APIs and tooling that expect native IPs representing each frame in the stackwalk, meaning we would need to resolve the native IP for each async frame we report into an EventPipe callstack. I see a lot of value in being able to quickly get to the native IP represented by a continuation frame during a normal stackwalk, in a way that works with the high volume of events that EventPipe can generate. If we can't do that, then we would be stuck using sync frames for all existing EventPipe scenarios, or doing whatever is needed to resolve async frames into native IPs, with an impact on stackwalk performance. If we would like to change any of this, I imagine we would need a new version of the nettrace format that could encode stack traces differently, and tools would need to update to support the new nettrace version with updated stack metadata. Totally doable of course, but with a longer adaption tail and lots of more work to do. @noahfalk thoughts?

Member Author

What would it take to allow both schemes to co-exist? Obviously, the JIT would have to pass down whether to create the continuation optimized for size or speed - that's "just work". It may be interesting to think about the impact that this would have on async stackwalking and diagnostics.

At a minimum, the places that dynamically need to access these things need to learn about the two ways to do that. That's during the main continuation dispatch loop where we access Continuation.Resume and Continuation.Flags. I think we would introduce

class ContinuationShared : Continuation
{
  public delegate*<Continuation, Continuation?> Resume;
  public uint State;
  public CorInfoContinuationFlags Flags;
}

class ContinuationUnique : Continuation
{
}

and either use virtual methods or (more likely, given our experience with GDV) guard on the base MT being either ContinuationShared or ContinuationUnique. Accessing the IP in the ContinuationUnique case would be simple (some indirections through the method table) while accessing the IP for the shared case needs some lookup scheme like we are discussing here. It will add some cost during dispatch, but hard to say if that would be noticeable.

Member

Totally doable of course, but with a longer adaption tail and lots of more work to do.

Yeah, looking at the complications that this optimization would introduce, it does not seem to be worth it.

Member

Totally doable of course, but with a longer adaption tail and lots of more work to do. @noahfalk thoughts?

Agreed. EventPipe expects stack traces are represented by an array of IPs + optionally some extra symbolic data in the trace that helps resolve those IPs into method names, IL offsets, source, etc. I am hoping we do a Continuation object -> IP conversion inside the runtime for each frame. Representing the stack trace differently from that would require a breaking change to the trace format and requires users to update their profilers which is not only work for us but also for the entire ecosystem.

It may be interesting to think about the impact that this would have on async stackwalking and diagnostics.

Using multiple discrete encoding schemes in different situations adds some complexity to the runtime code that has to process them + makes the data contracts more complex. It sounds like modest extra work that we could do if the performance gain was substantial enough to justify it.

Contributor

rcj1 commented Oct 14, 2025

The native offsets appear fine, however now pMD->GetNativeCode() doesn’t work, giving me an address that is about 0x1000 different from method start. However, in Windbg I am able to get the proper IP by going through the DAC, specifically by going through NativeCodeVersion::GetNativeCode().

Do you know why this is?

Member

jkotas commented Oct 14, 2025

The native offsets appear fine, however now pMD->GetNativeCode() doesn’t work anymore

One method can have multiple copies of native code due to code versioning (tiered compilation, etc.). pMD->GetNativeCode() will give you the most recent instance of the native code, but it may not match the native code that you are trying to map the offset for.

The correct way to do this is to go from IP to debug info like what DebugInfoManager::GetBoundariesAndVars does, and never roundtrip through MethodDesc since it may give you mismatched debug info.

(The root cause of the problem you are hitting may be something else, but this would become a problem eventually as well.)
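To make the code-versioning point concrete, here is a small sketch of an IP-keyed lookup. It does not use any real runtime API: `CodeVersion`, `CodeRangeMap`, and the `debugInfoTag` field are hypothetical, and the map merely stands in for whatever structure lets the runtime go from an IP to the code body (and therefore the debug info) that actually contains it.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record for one native code body of a method.
struct CodeVersion
{
    uintptr_t   start;
    uintptr_t   size;
    std::string debugInfoTag; // stands in for that body's debug info
};

class CodeRangeMap
{
public:
    void Add(const CodeVersion& version) { m_byStart[version.start] = version; }

    // Finds the code body whose [start, start + size) range contains ip.
    const CodeVersion* FindByIP(uintptr_t ip) const
    {
        auto it = m_byStart.upper_bound(ip); // first body starting after ip
        if (it == m_byStart.begin())
            return nullptr;
        --it;                                // last body starting at or before ip
        const CodeVersion& candidate = it->second;
        return (ip - candidate.start < candidate.size) ? &candidate : nullptr;
    }

private:
    std::map<uintptr_t, CodeVersion> m_byStart;
};
```

If a method has both a tier0 and a tier1 body registered, looking up an IP inside the tier0 body returns the tier0 record even though the "most recent" code for the method is tier1, which is the mismatch that a MethodDesc round trip can introduce.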

Contributor

rcj1 commented Oct 14, 2025

The native offsets appear fine, however now pMD->GetNativeCode() doesn’t work anymore

One method can have multiple copies of native code due to code versioning (tiered compilation, etc.). pMD->GetNativeCode() will give you the most recent instance of the native code, but it may not match the native code that you are trying to map the offset for.

The correct way to do this is to go from IP to debug info like what DebugInfoManager::GetBoundariesAndVars does, and never roundtrip through MethodDesc since it may give you mismatched debug info.

(The root cause of the problem you are hitting may be something else, but this would become a problem eventually as well.)

Ultimately I am trying to find the IP in the first place from the resume, which is a fix up precode stub that jumps to the stub to the actual method. I need it to do the native -> IL mapping.

The way I see to get this information now is through the code versions, as you mentioned. What do you think about the perf implications of this? This is another reason to have the IP directly in the continuation, or at least to have a state -> IP mapping available as you suggest with as little overhead as possible @jakobbotsch

@jakobbotsch
Member Author

Ultimately I am trying to find the IP in the first place from the resume, which is a fix up precode stub that jumps to the stub to the actual method. I need it to do the native -> IL mapping.

The way I see to get this information now is through the code versions, as you mentioned. What do you think about the perf implications of this? This is another reason to have the IP directly in the continuation, or at least to have a state -> IP mapping available as you suggest with as little overhead as possible @jakobbotsch

Good point -- to get from the resumption stub back to the original code we need a lookup that gives back the IP of the exact code version we resume in. Today that gets allocated here while we JIT:

{
m_finalCodeAddressSlot = (PCODE*)amTracker.Track(m_pMethodBeingCompiled->GetLoaderAllocator()->GetHighFrequencyHeap()->AllocMem(S_SIZE_T(sizeof(PCODE))));
}

I think we can subclass ILStubResolver and keep it there for the resumption IL stubs, then get rid of that loader heap allocation. Let me look into this.

@jakobbotsch
Member Author

@rcj1 I pushed a commit that adds a new AsyncResumeILStubResolver, and the async resumption stubs will have this resolver. There is an AsyncResumeILStubResolver::GetFinalResumeMethodStartAddress() that can be used to retrieve the start address of the method that resumption is going to end up in.

static BOOL GetAsyncDebugInfo(
const DebugInfoRequest & request,
IN FP_IDS_NEW fpNew, IN void * pNewData,
OUT ICorDebugInfo::AsyncInfo* pAsyncInfo,
Member

Any reason why we keep the number of suspension points in AsyncInfo and number of vars as an out parameter? Since we have the AsyncInfo struct, wouldn't it make sense to put all the out parameters inside that struct?

Member Author

It is just the fact that the length of the async vars array is not an interesting piece of semantic information about the async method, while the number of suspension points is. So I included the length of that array in the normal "API hygienic" way, while I put the number of suspension points inside ICorDebugInfo::AsyncInfo which contains the semantically interesting method-level information.

I also considered duplicating the length of the suspension points array in the API signature, for API hygiene/consistency, but it feels redundant/confusing to have the same number twice.

OUT ICorDebugInfo::RichOffsetMapping** ppRichMappings,
OUT ULONG32* pNumRichMappings);

static BOOL GetAsyncDebugInfo(
Member

Would it be possible to get an optimized version of this function? It will be a common scenario when stackwalking to just request data for a specific continuation resume state index and we are only interested in the suspension point data and not local vars. If we could scope it down to just one item, then I could have a custom fpNew and pNewData using a stack allocated ICorDebugInfo::AsyncSuspensionPoint, meaning there is no need for any dynamic memory allocation, and we could skip to the requested index in async debug info and only extract requested information.

Member Author

I will look into a way to extract the native offset of a particular state number in constant time.

Member

stackwalking to just request data for a specific continuation resume state index

It still feels wrong for stackwalking to parse the debug info.

Member

Perhaps we should be treating the state index <-> native IP mapping as new unwind data rather than new debug data? Alternately if the Continuations aren't shared across different async methods then putting the info directly in the MethodTable is an option.

Member

@jkotas jkotas Oct 16, 2025

To make sure that we are using the same terminology, there are two steps:

  1. Stack walking: Populates Exception._stackTrace with raw data. For async methods, the raw data is (Resume, State) pair and potential keep alive root. We should not need debug info to find (Resume, State) pair. Is that correct?
  2. Stack trace formatting: Converting Exception._stackTrace to a string, like what Exception.ToString() does. This is several orders of magnitude more expensive than (1). It can use metadata, debug info, etc.

(There is similar two-step process with other diagnostic scenarios, e.g. CPU profiling.)
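The two-step split above might be sketched like this. All names are hypothetical and the `Continuation` struct is a simplified stand-in for the managed type: step 1 only copies (Resume, State) pairs off the continuation chain, and step 2 takes a caller-supplied resolver that represents the expensive metadata and debug info lookups.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the managed Continuation object.
struct Continuation
{
    const Continuation* Next;
    const void*         Resume;
    uint32_t            State;
};

// Raw data recorded by the stack walk for one async frame.
struct RawAsyncFrame
{
    const void* Resume;
    uint32_t    State;
};

// Step 1 (stack walking): cheap, needs no debug info.
inline std::vector<RawAsyncFrame> CaptureAsyncFrames(const Continuation* head)
{
    std::vector<RawAsyncFrame> frames;
    for (const Continuation* c = head; c != nullptr; c = c->Next)
        frames.push_back(RawAsyncFrame{c->Resume, c->State});
    return frames;
}

// Step 2 (formatting): the resolver stands in for the slow lookups that turn
// a (Resume, State) pair into a method name, IL offset, source line, etc.
using FrameResolver = std::string (*)(const RawAsyncFrame&);

inline std::string FormatAsyncFrames(const std::vector<RawAsyncFrame>& frames,
                                     FrameResolver resolve)
{
    std::string text;
    for (const RawAsyncFrame& frame : frames)
    {
        text += "   at ";
        text += resolve(frame);
        text += '\n';
    }
    return text;
}
```

Keeping the resolver out of the capture path is what lets step 1 stay cheap regardless of how the State -> IP mapping ends up being stored.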

Member Author

Just to be clear: this would help find which final method we resume in, but we would still keep the state -> IP mapping that takes us to the actual internal place within the function where control resumes after state has been restored, and stack walking would still do this mapping? I.e., we would not try to duplicate IL mappings at each of the trampolines.

It sounds workable and should unify the first mapping steps for all the runtimes. I assume it won't make things faster, so it is just about moving the work into the JIT and out of the runtimes.

Member

> we would still keep the state -> IP mapping that takes us to the actual internal place within the function

Yes, I think it is fine to keep the state and switch-resume model.

> stack walking would still do this mapping

Nit: Stack trace formatting would do this mapping (#120303 (comment))

> we would not try to duplicate IL mappings at each of the trampolines

Yes, I think we can start with one shared trampoline that has no IL mapping. We can keep the non-shared trampolines with IL mappings in our back-pocket in case we run into troubles with propagating the state in addition to the IP in all scenarios.

> I assume it won't make things faster

Stack trace formatting is slow by design. Stack trace formatting is about gathering and combining data that come from different data sources. I think the main benefit is that the stack trace formatting does not need IL stub -> method translation and the associated data source(s) anymore.

Member Author

> Nit: Stack trace formatting would do this mapping (#120303 (comment))

I see. So we do expect to capture a (Resume, State) pair as part of the stack walking. Does it mean that e.g. ETW stack traces will need a separate post-processing step to turn this into a unique IP? (Maybe it already has something like that.)

My main confusion stems from trying to understand what kind of stack traces diagnostics will see. Do we always expect that the stack traces we make available have been processed in a way that makes the IP more friendly before they make it out of the runtime?

Member

@lateralusX lateralusX Oct 17, 2025

For EventPipe stack traces we will need a native IP representing the suspension point for each continuation when writing the event, see:

https://github.com/microsoft/perfview/blob/main/src/TraceEvent/EventPipe/NetTraceFormat.md#stackblock-object

We will need a quick way to go from continuation -> native IP inside runtime stackwalking. The "formatting" is taken care of by the tools: native IP -> method + IL offset -> source line.

ETW/User Events is a different story: call stacks are captured by the OS, so we currently have no control over the stack walk, meaning ETW/User Events call stacks won't be able to capture any async frames as part of the underlying OS API implementation. There are different routes we could take to mitigate this: emit async frame data in the event payload (hidden), emit an extra event handled by tooling to stitch together complete stacks, or use a side channel of events to recreate continuation chains externally. Regardless of the scenario, they would all benefit from the ability to efficiently go from continuation -> native IP inside the runtime before emitting the events.

Member

> To make sure that we are using the same terminology, there are two steps:
>
>   1. Stack walking: Populates Exception._stackTrace with raw data. For async methods, the raw data is (Resume, State) pair and potential keep alive root.
>   2. Stack trace formatting ...

The distinction of two phases is good, though (Resume,State) isn't the data which is exchanged across that boundary today. Converting Exception/StackTrace to operate that way in the future would take some modest work but doing the same for EventPipe would be onerous. I am treating all the code that runs prior to serializing an event into the NetTrace file as the 'stack walking' phase of EventPipe and everything that happens in the profiler parsing the NetTrace file as the 'Stack trace formatting' phase. For EventPipe the exchange format across those two phases is an array of IPs serialized in the file and the formatting phase will do IP -> IL offset -> source line conversion. Changing to put (Resume,State) in the file is a breaking change in the format and requires updates to all profilers. I really hope to avoid that.

> Yes, I think we can start with one shared trampoline that has no IL mapping. We can keep the non-shared trampolines with IL mappings in our back-pocket in case we run into troubles with propagating the state in addition to the IP in all scenarios.

For EventPipe the IP that we serialize in the NetTrace file needs to be one with a usable native->IL mapping. We'd wind up converting between these different representations during the 'stackwalking' phase to produce one.

  1. state index + async method trampoline IP
  2. state index + async method start IP
  3. async method IP that has usable IL offset mapping

Getting between (1) and (2) probably has fewer indirections vs. starting with the resume stub IP but moving between step (2) and (3) still pulls in the debug data if that is where we are storing the state index -> native offset mapping.
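To illustrate the chain of conversions, here is a hedged sketch under two assumptions that are not settled in this thread: each async method has its own trampoline IP, and the state index -> native offset mapping is available as a flat per-method table. `AsyncMethodInfo`, `TrampolineMap`, and `ResolveSerializableIP` are hypothetical names.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-method record; none of these names come from the runtime.
struct AsyncMethodInfo {
    uintptr_t startIP;                    // (2) native start of the async method
    std::vector<uint32_t> resumeOffsets;  // state index -> native offset from startIP
};

// Hypothetical registry keyed by the method's trampoline IP.
using TrampolineMap = std::unordered_map<uintptr_t, AsyncMethodInfo>;

// Produces an IP inside the async method body, i.e. one that can have a
// usable native -> IL mapping and so can be serialized in the trace.
bool ResolveSerializableIP(const TrampolineMap& methods, uintptr_t trampolineIP,
                           uint32_t state, uintptr_t* ip) {
    auto it = methods.find(trampolineIP);           // (1) -> (2)
    if (it == methods.end()) return false;
    const AsyncMethodInfo& m = it->second;
    if (state >= m.resumeOffsets.size()) return false;
    *ip = m.startIP + m.resumeOffsets[state];       // (2) -> (3)
    return true;
}
```

The second step is the one that reaches into the debug data today, which is why where that table lives matters for the stack-walking cost.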

Hopefully a minor detail, but if we do the funclet approach we also need to ensure that we have a mechanism for the stackwalker to distinguish the funclet call from other recursive calls so that we can filter it out of any synchronous stackwalks, the same as we want to do with the resume stub.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI runtime-async

8 participants