fix(profiling): upper bound on iterations for TaskInfo::unwind #16510

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 4 commits into main from
kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind
Feb 17, 2026

Conversation

@KowalskiThomas
Contributor

@KowalskiThomas KowalskiThomas commented Feb 14, 2026

Description

This PR updates the Profiler's Task unwinding logic to put an upper bound on (1) the number of Tasks unwound in the Task chain and (2) the number of coroutines unwound in the coroutine chain.

This matters because the memory we read can be corrupted or inconsistent. That is very possible: we don't take a snapshot of the interpreter memory but copy select "chunks" over time, and the state of Tasks can change while we copy those "chunks". Without a bound, we could end up looping infinitely (bad for obvious reasons) and, as a result, try to add an infinite number of items to the Frame Stack (arguably significantly worse, as this means trying to allocate an infinite amount of memory 💣).

We spotted this issue when we deployed 4.5.0rc2 to internal Rapid Python HTTP services, see IR-49542.
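
To make the idea concrete, here is a minimal sketch of a bounded chain walk. It is not the code from this PR: `Node`, `unwind_bounded`, and the limit `kMaxChainLength` are hypothetical names and values chosen for illustration, and the real `TaskInfo::unwind` operates on the profiler's own types.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Task or coroutine copied from the interpreter.
// The real profiler types are not shown here.
struct Node {
    int id;
    const Node* next;  // next element in the chain; may be stale or corrupt
};

// Assumed limit, picked only for this example.
constexpr std::size_t kMaxChainLength = 128;

// Walks the chain starting at `start` and appends at most kMaxChainLength
// ids to `stack`. Returns false if the bound was reached before the end of
// the chain, which covers both very long chains and accidental cycles.
bool unwind_bounded(const Node* start, std::vector<int>& stack) {
    std::size_t iterations = 0;
    for (const Node* n = start; n != nullptr; n = n->next) {
        if (iterations++ >= kMaxChainLength) {
            return false;  // give up instead of looping and allocating forever
        }
        stack.push_back(n->id);
    }
    return true;
}
```

With a counter like this, a corrupted chain that points back to itself costs at most `kMaxChainLength` iterations and the same number of stack entries, rather than unbounded time and memory.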

@KowalskiThomas KowalskiThomas added the changelog/no-changelog A changelog entry is not required for this PR. label Feb 14, 2026
@KowalskiThomas KowalskiThomas force-pushed the kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind branch from 0cedab4 to 47da8ea Compare February 14, 2026 23:05
@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da bot commented Feb 14, 2026

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/src/echion/tasks.cc            @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/echion/threads.cc          @DataDog/profiling-python
releasenotes/notes/profiling-fix-max-iterations-unwind-tasks-671d743912c7d600.yaml  @DataDog/apm-python

@KowalskiThomas KowalskiThomas changed the title test(profiling): upper bound on iterations for taskinfo::unwind test(profiling): upper bound on iterations for TaskInfo::unwind Feb 16, 2026
@KowalskiThomas KowalskiThomas added the Profiling Continous Profling label Feb 16, 2026
@KowalskiThomas KowalskiThomas changed the title test(profiling): upper bound on iterations for TaskInfo::unwind fix(profiling): upper bound on iterations for TaskInfo::unwind Feb 16, 2026
@KowalskiThomas KowalskiThomas marked this pull request as ready for review February 16, 2026 09:30
@KowalskiThomas KowalskiThomas requested review from a team as code owners February 16, 2026 09:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb0aecac07

@KowalskiThomas KowalskiThomas removed the changelog/no-changelog A changelog entry is not required for this PR. label Feb 16, 2026
Contributor

@P403n1x87 P403n1x87 left a comment

Should we try to detect potential cycles as well?

@KowalskiThomas
Contributor Author

Should we try to detect potential cycles as well?

Funnily(?) I had a PR that did exactly this, but I never merged it (see here: #15712). We discussed this recently because detecting cycles means using hash maps, which is more costly than just using a counter.

We should probably decide on one way forward (only counters, or only hash sets, or possibly bloom filters), but the latest thing we settled on was "let's not introduce more hash sets", so that's what I followed here.
Given the cost and our current overhead, I think a bloom filter would probably be the best tradeoff, but I only thought of that this weekend and we've never discussed it before.
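
For readers unfamiliar with the bloom filter option, here is a rough sketch of what cycle detection with a small fixed-size filter could look like. Everything in it is an assumption made for illustration: `AddressBloomFilter`, `probably_cyclic`, the 1024-bit size, and the mixing function are hypothetical and do not come from this PR or from #15712.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size Bloom filter over addresses. It needs no heap
// allocation, unlike a hash set, but it can report false positives.
class AddressBloomFilter {
public:
    // Records `addr`. Returns true if `addr` was possibly seen before.
    // False positives are possible; false negatives are not.
    bool test_and_set(std::uintptr_t addr) {
        const std::size_t i = mix(addr, 0x9E3779B97F4A7C15ULL) % kBits;
        const std::size_t j = mix(addr, 0xC2B2AE3D27D4EB4FULL) % kBits;
        const bool seen = bits_.test(i) && bits_.test(j);
        bits_.set(i);
        bits_.set(j);
        return seen;
    }

private:
    static constexpr std::size_t kBits = 1024;  // assumed size

    // Arbitrary 64-bit mixing step; any reasonable hash would do.
    static std::uint64_t mix(std::uint64_t x, std::uint64_t seed) {
        x ^= seed;
        x ^= x >> 33;
        x *= 0xFF51AFD7ED558CCDULL;
        x ^= x >> 33;
        return x;
    }

    std::bitset<kBits> bits_;
};

// Hypothetical chain element, standing in for a Task or coroutine.
struct Node {
    const Node* next;
};

// Returns true as soon as an address is (possibly) revisited. A real cycle
// is always caught; an acyclic chain can occasionally be flagged by mistake.
bool probably_cyclic(const Node* start) {
    AddressBloomFilter seen;
    for (const Node* n = start; n != nullptr; n = n->next) {
        if (seen.test_and_set(reinterpret_cast<std::uintptr_t>(n))) {
            return true;
        }
    }
    return false;
}
```

The tradeoff in this sketch is that a false positive would cut a legitimate chain short, whereas a plain counter never truncates a chain shorter than its limit, so the two approaches are not strictly interchangeable.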

@KowalskiThomas KowalskiThomas force-pushed the kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind branch from 5abd871 to 8d12117 Compare February 16, 2026 16:09
@KowalskiThomas
Contributor Author

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 bot commented Feb 16, 2026

View all feedbacks in Devflow UI.

2026-02-16 16:09:59 UTC ℹ️ Start processing command /merge


2026-02-16 16:10:06 UTC ℹ️ MergeQueue: waiting for PR to be ready

This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
It will be added to the queue as soon as checks pass and/or get approvals. View in MergeQueue UI.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.


2026-02-16 17:16:08 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-16 19:17:23 UTC MergeQueue: The build pipeline has timed out

The merge request has been interrupted because the build 96780885 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

@KowalskiThomas
Contributor Author

/merge

@KowalskiThomas KowalskiThomas force-pushed the kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind branch from 8d12117 to 0bc5769 Compare February 16, 2026 22:44
@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 bot commented Feb 16, 2026

2026-02-16 22:44:20 UTC ℹ️ Start processing command /merge


2026-02-16 22:44:26 UTC ℹ️ MergeQueue: waiting for PR to be ready


2026-02-16 23:36:09 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-17 00:45:01 UTC MergeQueue: The checks failed on this merge request

Tests failed on this commit f6718bc:

What to do next?

  • Investigate the failures and when ready, re-add your pull request to the queue!
  • If your PR checks are green, try to rebase/merge. It might be because the CI run is a bit old.
  • Any question, go check the FAQ.

@KowalskiThomas
Contributor Author

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 bot commented Feb 17, 2026

2026-02-17 07:55:01 UTC ℹ️ Start processing command /merge


2026-02-17 07:55:05 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-17 08:41:06 UTC MergeQueue: The checks failed on this merge request

Tests failed on this commit 04dbd01:

@KowalskiThomas
Contributor Author

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 bot commented Feb 17, 2026

2026-02-17 09:32:58 UTC ℹ️ Start processing command /merge


2026-02-17 09:33:04 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-17 10:21:06 UTC MergeQueue: The checks failed on this merge request

Tests failed on this commit f5a3b91:

@KowalskiThomas KowalskiThomas force-pushed the kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind branch from 0bc5769 to 974ff92 Compare February 17, 2026 13:05
@KowalskiThomas
Contributor Author

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 bot commented Feb 17, 2026

2026-02-17 13:23:19 UTC ℹ️ Start processing command /merge


2026-02-17 13:23:28 UTC ℹ️ MergeQueue: waiting for PR to be ready


2026-02-17 14:19:05 UTC ℹ️ MergeQueue: merge request added to the queue

The expected merge time in main is approximately 5h (p90).


2026-02-17 14:44:02 UTC ℹ️ MergeQueue: Readding this merge request to the queue because another merge request processed with yours failed. No action is needed from your side.


2026-02-17 14:45:54 UTC ℹ️ MergeQueue: Readding this merge request to the queue because another merge request processed with yours failed. No action is needed from your side.


2026-02-17 16:01:20 UTC ℹ️ MergeQueue: This merge request was merged

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit 0cfe067 into main Feb 17, 2026
392 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot deleted the kowalski/test-profiling-upper-bound-on-iterations-for-taskinfo-unwind branch February 17, 2026 16:01
@github-actions
Contributor

This change is marked for backport to 4.5 and it does not conflict with that branch.
The command used to test backporting was

git checkout 4.5 && git cherry-pick -x --mainline 1 0cfe067b01a81fd4ea886950eb02a9a05bbfdf17

dd-octo-sts bot pushed a commit that referenced this pull request Feb 17, 2026
)

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
(cherry picked from commit 0cfe067)
emmettbutler pushed a commit that referenced this pull request Feb 17, 2026
)

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
(cherry picked from commit 0cfe067)
Signed-off-by: Emmett Butler <emmett.butler321@gmail.com>
emmettbutler pushed a commit that referenced this pull request Feb 17, 2026
…kport 4.5] (#16542)

Backport #16510 to 4.5

Signed-off-by: Emmett Butler <emmett.butler321@gmail.com>
Co-authored-by: Thomas Kowalski <thomas.kowalski@datadoghq.com>

Labels

backport 4.5 Profiling Continous Profling

5 participants