fix: work around race conditions for on-CPU Tasks #43

Closed

KowalskiThomas wants to merge 2 commits into kowalski/fix-on-cpu-tasks from
Conversation
Force-pushed from 92ed602 to aa14006
Force-pushed from 81b2bf5 to fdd153e
Force-pushed from 919485b to 6914de2
Force-pushed from fdd153e to baef4f2
Force-pushed from 6914de2 to 7f876a6
github-merge-queue bot pushed a commit to DataDog/dd-trace-py that referenced this pull request on Dec 11, 2025:
## What does this PR do?

More details and investigation results are available in https://datadoghq.atlassian.net/browse/PROF-13137

Related PRs

* Dependency: #15552
* Dependent: #15579
* Echion PR: P403n1x87/echion#198

This PR fixes the way we sample and capture stacks for on-CPU asyncio Tasks. This is a first step towards fixing _all_ the weird (sub-)stacks we are seeing (both in Echion and dd-trace-py), for example this one...

```
background_math_function
Task-background_wait
background_wait_function
sleep
```

... where in practice `background_math_function` was executed as part of a Task that was completely unrelated to `Task-background_wait` and should never have appeared in the same stack.

This by itself doesn't fix everything (see **The fix(es, actually)**), but, as I said, it's a first step.

See [this document](https://docs.google.com/document/d/10HmKhY6tM5j7ojdk7CG8anWsN7qP7IyeZlj3-zAXudM/edit?pli=1&tab=t.9pzrl0uli4g) for more explanations of how this works / is supposed to work; the fix is explained at the bottom of this description.

## The fix(es, actually)

### Flawed existing logic...

The fix logic is the following:

* To correctly interleave Task Stacks and Task names, we need to determine the "synchronous Python top-of-stack" that goes on top of the on-CPU Task (if there is an on-CPU Task).
  * This includes the Python stack that eventually led to doing something like `asyncio.run`, as well as the internal `asyncio` Runner machinery.
  * We will have to push this on top of every Task Stack we capture (note that the on-CPU Stack will already have it by definition).
* To do that, we need to determine which Task is "the on-CPU Task" (if any) and process it first, so that we know how "high" the "upper Python stack" is and can re-use it for subsequent Task unwindings.
  * To ensure the first Task we unwind is the on-CPU one, we first sort the Leaf Tasks array by on-CPU-ness.
  * Note that I'm saying _the_ on-CPU Task because, as always in Python, there can only be one thing running at a time.
  * (Well, that's the case at least in theory – in practice on-CPU-ness may change as we sample, so we could end up with more than one on-CPU Task, but that is fine because it won't change how deep the "upper stack" is.)
* Once we've done this, we know how high the "top of Stack" is, because it is the height of the "pure Python" Stack we get by unwinding the first (top-most) Frame of the on-CPU Task. We need to remember that "height" to push that many items from the Python Thread Stack on top of each Task's Stack.
* One final thing to keep in mind is that we "stack" Stacks (hehe) such that if `Task-1` is awaiting `Task-2`, the Stack for `Task-2` goes under that for `Task-1`. We only ever want to add the "pure Python top-of-stack" once per Stack, so we need to do that _once per leaf Task_, after we've recursively unwound the awaiting Tasks for the leaf Task.
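To make the ordering above concrete, here is a minimal Python sketch; `LeafTask`, `build_task_stacks` and the frame representation are made up for this example and do not match the actual (C++) Echion code:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LeafTask:
    name: str
    on_cpu: bool
    # Frames obtained by recursively unwinding this leaf Task and the Tasks
    # awaiting it (Task-2 under Task-1 when Task-1 awaits Task-2), ordered
    # root first, currently-executing frame last.
    frames: List[str] = field(default_factory=list)


def build_task_stacks(leaf_tasks: List[LeafTask], thread_stack: List[str]) -> List[List[str]]:
    # Sort so that the on-CPU Task (if any) is processed first: it tells us
    # how many frames of the thread's Python stack form the shared "upper"
    # stack (entry point, `asyncio.run`, the Runner machinery, ...).
    leaf_tasks.sort(key=lambda t: t.on_cpu, reverse=True)

    upper_height = 0
    stacks = []
    for task in leaf_tasks:
        if task.on_cpu:
            # The thread stack ends with the on-CPU Task's own frames; whatever
            # sits below them is the shared upper Python stack. (Simplified:
            # the real height computation unwinds from the Task's top frame.)
            upper_height = len(thread_stack) - len(task.frames)
        # Add the shared upper stack exactly once per leaf Task, after the
        # whole awaiting chain for that leaf has been unwound.
        stacks.append(thread_stack[:upper_height] + task.frames)
    return stacks
```

Processing the on-CPU Task first is what lets the "upper stack" height be computed once and re-used for every other leaf Task.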
### ... and a race condition?

On top of that – because it's not enough – I think I discovered a rare but very real race condition (which was actually the root cause of that weird, rare but common-in-CI bug with the stack I showed at the top). The race condition happens when the pure Python Stack we capture involves some asynchronous Frames, but the associated Task/Coroutine completes or yields before we get to unwinding it.

In that case, Stack interleaving assumes the Task is running. By doing so, we include some Frames from the Python Stack that actually _are not_ in the Stack generated by unwinding the Task, and because we share the same Python Stack for all Tasks (we only use its "upper part", which is the same for every Task), some Frames from the previously-running-but-now-paused-or-completed Coroutine are leaked to other Tasks.

Working around this is not simple; honestly, we should also have a discussion about whether we want to properly work around it (which is going to be somewhat difficult, imperfect and probably somewhat costly) or just give up on sampling when we detect something is off. In the interest of keeping things somewhat simple, this will be handled in a separate PR, but I described the bug here anyway, for clarity (temporary PR: KowalskiThomas/echion#43).

### Caveats

I realised after the fact that an assumption we implicitly make (that _the on-CPU Task – if it exists – is a Leaf Task_) is actually not generally true. With certain `asyncio` utilities, the Parent Task will from time to time execute code / be on-CPU to manage the Child Tasks it is awaiting. This is an edge case, and something that happens rarely, but saying it can never happen is false (and it shows in Profiles). We will also need to figure something out for that – it probably isn't _that_ hard but will require some refactoring (probably splitting the unwinding into two steps: (1) only determine the upper Python Stack, (2) unwind the Tasks strictly speaking, starting from the Leaf Tasks).

## Testing

The PR adds new tests around on-CPU Tasks and what their Stacks look like.

The PR also updates a lot of existing tests to output their `DataSummary` as JSON. This is immensely useful to understand what is actually going on when we don't see the expected Stacks (or see unexpected ones), and I do think it'd be worth keeping.

After these changes, the tests that still fail locally and in CI (flakily) are unrelated to on-CPU Tasks which, I guess, means _mission failed successfully_?
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request on Dec 11, 2025.
What does this PR do?
https://datadoghq.atlassian.net/browse/PROF-13137
In short
This depends on
Note that this PR (unfortunately) doesn't make all our problems go away (as the red CI testifies...)
There are still issues; only the one I explained here is gone, as far as I can tell.
In depth
The PR aims to reduce the issues caused by race conditions between the sampling of CPU Stacks and asyncio Stacks.
There are three places where a race condition can occur, because the following three events can each happen while the Python thread that runs asyncio Tasks keeps executing as we sample:
If any two of those three Samples are out of sync (one returns “we’re running this Task” and any other one returns “we’re not running this Task”), we are at risk of producing incorrect (and potentially inconsistent) Stacks.
We need a way to either exclude those cases (give up on sampling) or recover from them (I think this will come at a performance cost). In this PR, I implemented the former, which is much easier to do and has no performance cost. We may lose a few samples as a result, but in practice this rarely happens except upon Task completion.
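To make the "exclude those cases" option concrete, here is a rough, self-contained Python sketch; `SampledTask` and its fields are hypothetical stand-ins for the observations being compared, not Echion's actual (C++) data structures:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampledTask:
    name: str
    # What the thread's Python stack told us when we captured it.
    running_per_thread_stack: bool
    # What the Task object itself tells us once we get around to unwinding it.
    running_per_task_object: bool
    frames: List[str]


def keep_sample(task: SampledTask) -> Optional[List[str]]:
    if task.running_per_thread_stack != task.running_per_task_object:
        # The Task completed or yielded between the two observations:
        # interleaving the Stacks would leak frames from another Coroutine,
        # so give up on this sample instead of trying to recover.
        return None
    return task.frames
```

Dropping the sample when the observations disagree is the cheap option described above; the alternative (trying to recover a consistent Stack) is the one expected to come at a performance cost.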