Skip to content

fix: workaround for Task unwinding race condition#211

Closed
KowalskiThomas wants to merge 1 commit intoP403n1x87:mainfrom
KowalskiThomas:kowalski/fix-workaround-for-task-unwinding-race-condition
Closed

fix: workaround for Task unwinding race condition#211
KowalskiThomas wants to merge 1 commit intoP403n1x87:mainfrom
KowalskiThomas:kowalski/fix-workaround-for-task-unwinding-race-condition

Conversation

@KowalskiThomas
Copy link
Collaborator

@KowalskiThomas KowalskiThomas commented Dec 11, 2025

What does this PR do?

https://datadoghq.atlassian.net/browse/PROF-13137

In short

This depends on

Note that this PR (unfortunately) doesn't make all our problems go away (as the red CI testifies...)
There still are issues, just the one that I explained here is gone, as far as I can tell.

In depth

The PR aims at reducing the issues caused by race conditions between sampling of CPU Stacks and asyncio Stacks.
There are three places we can have a race condition. The race condition can happen because the three following events can happen at the same time as the Python thread that runs asyncio Tasks is living its life at the same time we are sampling:

  • Sample Python Thread stack (a.k.a. “normal synchronous Stack”)
    • If it’s currently running a Task, it will include (top sync entry point) (asyncio runtime) (Handle._run) (coroutine(s)) (sync functions called by coroutines)
    • If it’s not currently running a Task, it will include (top sync entry point) (asyncio runtime) (Runner._select)
  • Generate TaskInfo/GenInfo objects
    • If it’s currently running a Task/Coroutine, the GenInfo's will have is_running set to true
    • If it’s not currently running, then it will be false
  • Unwind Task Frames
    • If the Task is currently being run, the Frame::read we do in TaskInfo::unwind will show the “upper Python Stack” (sync entry point and asyncio runtime) on top of the Task’s Coroutine Frame
    • If the Task is not being run, Frame::read will just show the Task’s Coroutine Frame

If any two of those three Samples are out of sync (one returns “we’re running this Task” and any other one returns “we’re not running this Task”), we are at risk of producing incorrect (and potentially inconsistent) Stacks.

  • The case at hand is “Python Stack captures a running state for Task” and then “all GenInfo objects are marked as off-CPU” (this leads to having one Frame of the previously-running Task in an unrelated Task’s Stack)
  • We also often witness instances of “GenInfo objects are marked as on-CPU” and then “calling Frame::read doesn’t detect the upper Stack” (this leads to having asynchronous Stacks that do not have their Python entrypoint/asyncio runtime Stacks attached on top)
  • … I don’t know if we have instances of the other cases – potentially they’re more stealthy (e.g. something is running but we don’t detect it as such; it makes the result technically incorrect but in a way we can’t easily see) and/or extremely rare for whatever reason

We need a way to either exclude those cases (give up on sampling) or recover from them (I think this will come at a performance cost). In this PR, I implemented the former, which is much easier to do and has no performance cost. We may lose a few samples as a result, but in practice this really rarely happens except upon Task completion.

Some implementation details

Detecting whether the Task ought to be running according to the Python Stack is... not that simple. The one way I found to do it is to iterate over Frames and... string match them. As far as I know, there is no easier/more efficient way to do that.

In practice, it means going over all Frames and checking

  • On 3.11+, whether the function name is Handle._run
  • Before 3.11, whether function name is run and the filename ends with asyncio/events.py. Checking only the function name (which is unqualified prior to 3.11) would put us at risk of accidentally identifying a user's function named _run as the one that we're interested in.

To make that somewhat more efficient, once we find the relevant Frame, its Cache Key is kept in memory so that subsequent runs only need to check if a certain Frame is present (based on its Cache Key), so that we do not have to do string comparison again.

Testing

I'm happy to say that, following those changes, all the tests (well, except for the ones that are causing errors for unrelated reasons) pass from the first time in the CI! 🎉

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from fc85934 to df6e3bd Compare December 11, 2025 16:22
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from df6e3bd to 53649ca Compare December 12, 2025 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant