fix: correctly unwind on-CPU tasks #198

Merged
KowalskiThomas merged 5 commits into P403n1x87:main from KowalskiThomas:kowalski/fix-on-cpu-tasks
Dec 11, 2025

Conversation


@KowalskiThomas commented on Nov 28, 2025

What does this PR do?

dd-trace-py PR:

Note this PR depends on

Note the diff says 23 files changed, but it's mostly tests, don't worry :)

This PR fixes the way we sample and capture stacks for on-CPU asyncio Tasks. This is a first step towards fixing all weird (sub-)stacks we are seeing (both in Echion and dd-trace-py), for example this one...

```
background_math_function
Task-background_wait
background_wait_function
sleep
```

... where in practice, `background_math_function` was executed as part of a Task that was completely unrelated to `Task-background_wait` and should never have appeared in the same stack.

This by itself doesn't fix everything (see The fix(es, actually)), but, as I said, it's a first step.

See [this document](https://docs.google.com/document/d/10HmKhY6tM5j7ojdk7CG8anWsN7qP7IyeZlj3-zAXudM/edit?pli=1&tab=t.9pzrl0uli4g) for more explanation of how this works / is supposed to work; the fix is explained at the bottom of this description.

The fix(es, actually)

Flawed existing logic...

The fix logic is the following:

  • To correctly interleave Task Stacks and Task names, we need to determine the "synchronous Python top-of-stack" that goes on top of the on-CPU Task (if there is an on-CPU Task).
    • This includes the Python stack that eventually led to doing something like asyncio.run, as well as the internal asyncio Runner machinery.
    • We will have to push this on top of every Task Stack we capture (note that the on-CPU Stack will already have it by definition).
  • To do that, we need to determine which Task is "the on CPU Task" (if any) and process it first, so that we know how "high" the "upper Python stack" is and we can re-use it for subsequent Task unwindings.
    • To ensure the first Task we unwind is the on-CPU one, we first sort the Leaf Tasks array by on-CPU-ness
      • Note I'm saying _the_ on-CPU Task because, as always in Python, there can only be one thing running at a time.
      • (Well, that's the case at least in theory – in practice on-CPU-ness may change as we sample and so we could end up with more than one on-CPU Task, but that is fine because it won't change how deep the "upper stack" is)
    • Once we've done this, we know how high the "top of Stack" is: it is the height of the "pure Python" Stack we get by unwinding the first (top-most) Frame of the on-CPU Task. We need to remember that "height" to push that many items from the Python Thread Stack on top of each Task's Stack.
  • One final thing to keep in mind is that we "stack" Stacks (hehe) such that if Task-1 is awaiting Task-2, the Stack for Task-2 goes under that for Task-1. We only ever want to add the "pure Python top-of-stack" once per Stack, so we need to do that once per leaf Task after we've recursively unwound awaiting Tasks for the leaf Task.
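The steps above can be sketched as follows. This is a simplified, hypothetical model: `LeafTask`, `interleave_stacks` and the frame names are all made up for illustration (the real unwinding works on Echion's own structures), and stacks are stored root-first.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Illustrative model only; not Echion's actual API.
struct LeafTask {
    std::string name;
    bool on_cpu;
    std::vector<std::string> task_stack;  // frames from unwinding the task chain
};

std::vector<std::vector<std::string>> interleave_stacks(
    std::vector<LeafTask> leaf_tasks,
    const std::vector<std::string>& thread_stack)  // root-first Python thread stack
{
    // Step 1: process the on-CPU task first so we learn the upper stack height.
    std::sort(leaf_tasks.begin(), leaf_tasks.end(),
              [](const LeafTask& a, const LeafTask& b) { return a.on_cpu > b.on_cpu; });

    // Step 2: everything in the thread stack above the on-CPU task's own frames
    // is the shared "upper" stack (caller of asyncio.run + runner machinery).
    // With no on-CPU task, the whole thread stack is synchronous.
    std::size_t upper_height = thread_stack.size();
    if (!leaf_tasks.empty() && leaf_tasks.front().on_cpu)
        upper_height -= leaf_tasks.front().task_stack.size();

    // Step 3: push the shared upper stack onto every task stack, once per leaf.
    std::vector<std::vector<std::string>> out;
    for (const auto& t : leaf_tasks) {
        std::vector<std::string> s(thread_stack.begin(),
                                   thread_stack.begin() + upper_height);
        s.insert(s.end(), t.task_stack.begin(), t.task_stack.end());
        out.push_back(std::move(s));
    }
    return out;
}
```

Note that the on-CPU task's merged stack is, by construction, the full thread stack, while every paused task gets only the shared upper part prepended.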

... and a race condition?

On top of that – because it's not enough – I think I discovered a rare but very real race condition (which actually was the root cause of that weird, rare but common-in-CI bug with the stack I showed on top). The race condition happens when the pure Python Stack we capture involves some asynchronous Frames, but the associated Task/Coroutine completes or yields before we get to unwinding it. In that case, Stack interleaving assumes the Task is still running, and by doing so we include some Frames from the Python Stack that actually are not in the Stack generated by unwinding the Task. Because we share the same Python Stack for all Tasks (we only use its "upper part", which is the same for every Task), some Frames from the previously-running-but-now-paused-or-completed Coroutine are leaked to other Tasks.

Working around this is not simple; honestly, we should have a discussion about whether we want to properly work around it (which is going to be somewhat difficult, imperfect and probably somewhat costly) or just give up on sampling when we detect something is off.
In the interest of keeping things somewhat simple, this will be handled in a separate PR, but I described the bug here anyway, for clarity (temporary PR: KowalskiThomas#43).
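To make the leak concrete, here is a contrived illustration of the race (the `leaked_stack` helper and all frame names are made up for the example): the thread stack was captured while `a_step` was on CPU, but that Coroutine completed before unwinding, so the "upper stack" comes out one frame too high and `a_step` ends up in an unrelated Task's Stack.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Stack = std::vector<std::string>;

// The thread stack was captured while Task-A's frame "a_step" was executing.
// By unwinding time, Task-A has completed: it unwinds to zero frames, yet the
// interleaving logic still treats it as the on-CPU task.
Stack leaked_stack() {
    Stack thread_stack = {"main", "asyncio.run", "Runner._run_once", "a_step"};
    Stack task_b_frames = {"b_step"};

    std::size_t on_cpu_unwound_frames = 0;  // Task-A already completed
    // Should be 3 (everything above the task's frames), but comes out as 4:
    std::size_t upper_height = thread_stack.size() - on_cpu_unwound_frames;

    Stack merged(thread_stack.begin(), thread_stack.begin() + upper_height);
    merged.insert(merged.end(), task_b_frames.begin(), task_b_frames.end());
    return merged;  // "a_step" has leaked into Task-B's stack
}
```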

Caveats

I realised after the fact that an assumption we implicitly make (that the on-CPU Task – if it exists – is a Leaf Task) is actually not generally true. With certain asyncio utils, the Parent Task will from time to time execute code / be on-CPU to manage the Child Tasks it is awaiting. This is an edge case that happens rarely, but saying it can never happen is false (and it shows in Profiles).
We will also need to figure something out for that – it probably isn't that hard, but it will require some refactors (probably splitting the unwinding in two steps: (1) determine the upper Python Stack only, then (2) unwind Tasks strictly speaking, starting from Leaf Tasks).

Testing

The PR adds new tests around on-CPU Tasks and what their Stacks look like.

The PR also updates a lot of existing tests to output their DataSummary as JSON. This is immensely useful to understand what is actually going on when we don't see the expected Stacks (or see the unexpected ones) and I do think it'd be worth keeping.

After these changes, the tests that still fail locally and in the CI (flakily) are unrelated to on-CPU Tasks which, I guess, means mission failed successfully?

@KowalskiThomas changed the title from "fix: use official PyGen_yf for 3.13" to "fix: on cpu tasks" on Nov 28, 2025
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 4 times, most recently from bff4477 to 1451a74 on November 28, 2025 at 15:43
@KowalskiThomas changed the title from "fix: on cpu tasks" to "fix: correctly unwind on-CPU tasks" on Dec 1, 2025
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch from 1451a74 to 001ff2c on December 3, 2025 at 12:45
@KowalskiThomas marked this pull request as ready for review on December 3, 2025 at 12:48
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch from 001ff2c to f464e57 on December 3, 2025 at 12:57
@KowalskiThomas marked this pull request as draft on December 3, 2025 at 13:24
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 6 times, most recently from 5a5c7ac to 1041ea8 on December 4, 2025 at 08:57
KowalskiThomas added a commit that referenced this pull request Dec 4, 2025
refactor: change `GenInfo::is_running`

This PR changes the way `GenInfo::is_running` works. Previously, it would indicate whether the _current_ coroutine was on CPU; now, it indicates whether the current coroutine _or the coroutine it (recursively) awaits_ is on CPU.  
Making that change also allows us to do less work when we check whether the current coroutine is on CPU or not. Because a Coroutine / Generator / `GenInfo` can only be running if it is not awaiting another Generator, we do not need to compute `is_running` if we have `await != nullptr` (and we take `await->is_running` for the value in that case). 

Making that change makes it easier to check whether a Task is currently on-CPU, and allows us to do less work when we decide how to unwind `asyncio` Tasks (cf. the changes in `TaskInfo`, which no longer needs the `is_on_cpu` method that iterates over the `await` chain).

Note that I checked whether `GenInfo::is_running` was used in any way other than the one I describe and simplify; it is not, so I do think this change is safe to make as-is.
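A minimal sketch of the invariant (a hypothetical struct and `resolve_running` helper, not the actual Echion code): since a coroutine can only be executing when it is not suspended on another generator, the flag can be propagated instead of recomputed.

```cpp
#include <cassert>

// Simplified model of GenInfo after the change: `is_running` now means
// "this coroutine, or one it (recursively) awaits, is on CPU".
struct GenInfo {
    GenInfo* await = nullptr;  // generator this coroutine is suspended on
    bool is_running = false;

    // Hypothetical helper: `frame_is_executing` stands for whatever check
    // determines that this coroutine's own frame is on CPU. It is only
    // needed when there is no awaited generator.
    void resolve_running(bool frame_is_executing) {
        is_running = (await != nullptr) ? await->is_running : frame_is_executing;
    }
};
```

With this, a caller like `TaskInfo` can read a single flag instead of walking the `await` chain on every check.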

**Note** This PR makes sense on its own, but it is in the context of making #198 simpler.
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 4 times, most recently from 5b1ad47 to 374cfb7 on December 4, 2025 at 13:41
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request Dec 5, 2025
## Description

This change replicates P403n1x87/echion#202.


**Note** This doesn't need a changelog entry as it makes no difference
to the user, it's purely a refactor.
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 5 times, most recently from b4eb48a to 710a455 on December 5, 2025 at 14:36
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 6 times, most recently from 4c369fe to 51c9072 on December 8, 2025 at 09:36
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch 2 times, most recently from 81b2bf5 to fdd153e on December 8, 2025 at 09:44
@KowalskiThomas marked this pull request as ready for review on December 8, 2025 at 12:32
@KowalskiThomas force-pushed the kowalski/fix-on-cpu-tasks branch from 1d4b884 to 6a3631e on December 10, 2025 at 16:49
@KowalskiThomas merged commit 619ff0e into P403n1x87:main on Dec 11, 2025
88 of 96 checks passed
github-merge-queue bot pushed a commit to DataDog/dd-trace-py that referenced this pull request Dec 11, 2025
## What does this PR do?

More details and investigation results available in
https://datadoghq.atlassian.net/browse/PROF-13137

Related PRs
* Dependency: #15552
* Dependent: #15579
* Echion PR: P403n1x87/echion#198

github-merge-queue bot pushed a commit to DataDog/dd-trace-py that referenced this pull request Dec 11, 2025
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request Dec 11, 2025