fix: workaround for Task unwinding race condition #44

Closed
KowalskiThomas wants to merge 4 commits into kowalski/fix-on-cpu-tasks from kowalski/fix-workaround-for-task-unwinding-race-condition
@KowalskiThomas commented Dec 9, 2025

What does this PR do?

https://datadoghq.atlassian.net/browse/PROF-13137

In short

This depends on

Note that this PR (unfortunately) doesn't make all our problems go away (as the red CI testifies...)
There are still issues; only the one explained here is gone, as far as I can tell.

In depth

This PR aims to reduce the issues caused by race conditions between the sampling of CPU Stacks and asyncio Stacks.
There are three places where a race can occur. The Python thread that runs asyncio Tasks keeps executing while we sample, so each of the following three steps can observe a different state:

  • Sample Python Thread stack (a.k.a. “normal synchronous Stack”)
    • If it’s currently running a Task, it will include (top sync entry point) (asyncio runtime) (Handle._run) (coroutine(s)) (sync functions called by coroutines)
    • If it’s not currently running a Task, it will include (top sync entry point) (asyncio runtime) (Runner._select)
  • Generate TaskInfo/GenInfo objects
    • If it’s currently running a Task/Coroutine, the GenInfo's will have is_running set to true
    • If it’s not currently running, then it will be false
  • Unwind Task Frames
    • If the Task is currently being run, the Frame::read we do in TaskInfo::unwind will show the “upper Python Stack” (sync entry point and asyncio runtime) on top of the Task’s Coroutine Frame
    • If the Task is not being run, Frame::read will just show the Task’s Coroutine Frame
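To make the "running vs. not running" distinction concrete, here is a small pure-Python sketch that uses CPython's own coroutine attribute `cr_running`. It only illustrates that the same Task reports a different state depending on when it is observed. The assumption that `GenInfo`'s `is_running` reflects this kind of state is mine, and the `worker`/`main` helpers are made up for the example.

```python
import asyncio


async def worker(observed: dict) -> None:
    # Observed from inside the Task: its coroutine is currently executing.
    coro = asyncio.current_task().get_coro()
    observed["inside"] = coro.cr_running
    await asyncio.sleep(0)  # suspend and hand control back to the event loop


async def main() -> dict:
    observed: dict = {}
    task = asyncio.create_task(worker(observed))
    await asyncio.sleep(0)  # let the worker run up to its first await
    # Observed from outside while the worker is suspended.
    observed["outside"] = task.get_coro().cr_running
    await task
    return observed


result = asyncio.run(main())
print(result)  # {'inside': True, 'outside': False}
```

A sampler running in another thread can land on either side of that transition, which is why the three steps above may disagree with each other.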

If any two of those three Samples are out of sync (one returns “we’re running this Task” and any other one returns “we’re not running this Task”), we are at risk of producing incorrect (and potentially inconsistent) Stacks.

  • The case at hand is “Python Stack captures a running state for Task” and then “all GenInfo objects are marked as off-CPU” (this leads to having one Frame of the previously-running Task in an unrelated Task’s Stack)
  • We also often witness instances of “GenInfo objects are marked as on-CPU” and then “calling Frame::read doesn’t detect the upper Stack” (this leads to having asynchronous Stacks that do not have their Python entrypoint/asyncio runtime Stacks attached on top)
  • … I don’t know if we have instances of the other cases – potentially they’re more stealthy (e.g. something is running but we don’t detect it as such; it makes the result technically incorrect but in a way we can’t easily see) and/or extremely rare for whatever reason
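The cases above can be modelled as three booleans per Task, one for each sampling step. The sketch below is a simplified model and not the code in this PR: `Observations`, `Verdict` and `classify` are hypothetical names, and it looks at a single Task, whereas the first case above concerns all GenInfo objects being off-CPU.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Verdict(Enum):
    CONSISTENT = auto()
    FOREIGN_FRAME_RISK = auto()   # thread Stack says running, GenInfo says off-CPU
    MISSING_UPPER_STACK = auto()  # GenInfo says on-CPU, Frame read shows no upper Stack
    OTHER_MISMATCH = auto()       # the remaining, less visible combinations


@dataclass(frozen=True)
class Observations:
    # Hypothetical summary of what each of the three steps saw for one Task.
    thread_stack_running: bool
    geninfo_running: bool
    frame_read_running: bool


def classify(obs: Observations) -> Verdict:
    if obs.thread_stack_running == obs.geninfo_running == obs.frame_read_running:
        return Verdict.CONSISTENT
    if obs.thread_stack_running and not obs.geninfo_running:
        return Verdict.FOREIGN_FRAME_RISK
    if obs.geninfo_running and not obs.frame_read_running:
        return Verdict.MISSING_UPPER_STACK
    return Verdict.OTHER_MISMATCH


print(classify(Observations(True, False, False)).name)
print(classify(Observations(True, True, False)).name)
```

The two named mismatches correspond to the two symptoms described above; everything else falls into `OTHER_MISMATCH`, which matches the "more stealthy" cases we have no clear evidence for.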

We need a way to either exclude those cases (give up on sampling) or recover from them (I think this will come at a performance cost). In this PR, I implemented the former, which is much easier to do and has no performance cost. We may lose a few samples as a result, but in practice this happens very rarely, except upon Task completion.
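As a rough illustration of the "exclude" strategy, the sketch below keeps a sample only when all three observations agree and counts the ones it drops. This is a model of the idea under my own assumptions, not the actual change: `Sample` and `keep_consistent` are hypothetical names, and the real check lives in the native unwinding code.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    # Hypothetical per-Task record of what each sampling step observed.
    thread_stack_running: bool
    geninfo_running: bool
    frame_read_running: bool
    frames: Tuple[str, ...]


def keep_consistent(samples: Iterable[Sample]) -> Tuple[List[Sample], int]:
    """Return the samples whose three observations agree, plus the drop count."""
    kept: List[Sample] = []
    dropped = 0
    for sample in samples:
        if (
            sample.thread_stack_running
            == sample.geninfo_running
            == sample.frame_read_running
        ):
            kept.append(sample)
        else:
            dropped += 1  # give up on this sample rather than emit a wrong Stack
    return kept, dropped


samples = [
    Sample(True, True, True, ("main", "Handle._run", "coro")),
    Sample(True, False, False, ("coro",)),  # raced: would be attributed wrongly
    Sample(False, False, False, ("coro",)),
]
kept, dropped = keep_consistent(samples)
print(len(kept), dropped)  # 2 1
```

Dropping costs nothing beyond the comparison itself, which is why it is cheaper than trying to re-sample or repair an inconsistent Stack.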

@KowalskiThomas changed the title from "test: improve summary_to_json" to "fix: workaround for Task unwinding race condition" Dec 9, 2025
fix: do not remove links for just-created Tasks
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch from 1d4b884 to 6a3631e December 10, 2025 16:49
@KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from 6e518ea to d9a621d December 10, 2025 16:50
@KowalskiThomas force-pushed the kowalski/fix-workaround-for-task-unwinding-race-condition branch from d9a621d to fc85934 December 11, 2025 09:45