fix: workaround for Task unwinding race condition by KowalskiThomas · Pull Request #211 · P403n1x87/echion

KowalskiThomas · 2025-12-11T09:50:31Z

What does this PR do?

https://datadoghq.atlassian.net/browse/PROF-13137

In short

This depends on

fix: correctly unwind on-CPU tasks #198

Note that this PR (unfortunately) doesn't make all our problems go away (as the red CI testifies...)
There still are issues, just the one that I explained here is gone, as far as I can tell.

In depth

The PR aims at reducing the issues caused by race conditions between sampling of CPU Stacks and asyncio Stacks.
There are three places we can have a race condition. The race condition can happen because the three following events can happen at the same time as the Python thread that runs asyncio Tasks is living its life at the same time we are sampling:

Sample Python Thread stack (a.k.a. “normal synchronous Stack”)
- If it’s currently running a Task, it will include (top sync entry point) (asyncio runtime) (Handle._run) (coroutine(s)) (sync functions called by coroutines)
- If it’s not currently running a Task, it will include (top sync entry point) (asyncio runtime) (Runner._select)
Generate TaskInfo/GenInfo objects
- If it’s currently running a Task/Coroutine, the GenInfo's will have is_running set to true
- If it’s not currently running, then it will be false
Unwind Task Frames
- If the Task is currently being run, the Frame::read we do in TaskInfo::unwind will show the “upper Python Stack” (sync entry point and asyncio runtime) on top of the Task’s Coroutine Frame
- If the Task is not being run, Frame::read will just show the Task’s Coroutine Frame

If any two of those three Samples are out of sync (one returns “we’re running this Task” and any other one returns “we’re not running this Task”), we are at risk of producing incorrect (and potentially inconsistent) Stacks.

The case at hand is “Python Stack captures a running state for Task” and then “all GenInfo objects are marked as off-CPU” (this leads to having one Frame of the previously-running Task in an unrelated Task’s Stack)
We also often witness instances of “GenInfo objects are marked as on-CPU” and then “calling Frame::read doesn’t detect the upper Stack” (this leads to having asynchronous Stacks that do not have their Python entrypoint/asyncio runtime Stacks attached on top)
… I don’t know if we have instances of the other cases – potentially they’re more stealthy (e.g. something is running but we don’t detect it as such; it makes the result technically incorrect but in a way we can’t easily see) and/or extremely rare for whatever reason

We need a way to either exclude those cases (give up on sampling) or recover from them (I think this will come at a performance cost). In this PR, I implemented the former, which is much easier to do and has no performance cost. We may lose a few samples as a result, but in practice this really rarely happens except upon Task completion.

Some implementation details

Detecting whether the Task ought to be running according to the Python Stack is... not that simple. The one way I found to do it is to iterate over Frames and... string match them. As far as I know, there is no easier/more efficient way to do that.

In practice, it means going over all Frames and checking

On 3.11+, whether the function name is Handle._run
Before 3.11, whether function name is run and the filename ends with asyncio/events.py. Checking only the function name (which is unqualified prior to 3.11) would put us at risk of accidentally identifying a user's function named _run as the one that we're interested in.

To make that somewhat more efficient, once we find the relevant Frame, its Cache Key is kept in memory so that subsequent runs only need to check if a certain Frame is present (based on its Cache Key), so that we do not have to do string comparison again.

Testing

I'm happy to say that, following those changes, all the tests (well, except for the ones that are causing errors for unrelated reasons) pass from the first time in the CI! 🎉

KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from fc85934 to df6e3bd Compare December 11, 2025 16:22

fix: workaround for Task unwinding race condition

53649ca

KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from df6e3bd to 53649ca Compare December 12, 2025 12:41

KowalskiThomas closed this Jan 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: workaround for Task unwinding race condition#211

fix: workaround for Task unwinding race condition#211
KowalskiThomas wants to merge 1 commit intoP403n1x87:mainfrom
KowalskiThomas:kowalski/fix-workaround-for-task-unwinding-race-condition

KowalskiThomas commented Dec 11, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

KowalskiThomas commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

In short

In depth

Some implementation details

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

KowalskiThomas commented Dec 11, 2025 •

edited

Loading