fix: do not remove links for just-created Tasks#210
Merged
KowalskiThomas merged 1 commit intoP403n1x87:mainfrom Dec 9, 2025
Conversation
KowalskiThomas
added a commit
to DataDog/dd-trace-py
that referenced
this pull request
Dec 9, 2025
## Description Echion PR: P403n1x87/echion#210 This PR fixes a race condition happening around linking Parent and Children Tasks when using utilities like `asyncio.gather` or `asyncio.as_completed`. ### The problem The race condition would manifest in the following way 1. User calls e.g. `asyncio.as_completed` from `parent_task` 2. Our patch creates Tasks for each awaitable passed to the function and inserts resulting Task objects into the Task Links Map (one of each is `new_task`) 3. A Sample is taken exactly at that moment, Task Link Map cleanup logic kicks in. a. `new_task` was created only just now, so was not part of the `all_tasks` `MirrorSet` snapshotted by the Sampler b. The cleanup logic sees that there's a link from `new_task` to `parent_task` but `new_task` isn't in `all_tasks` c. The cleanup logic assumes that means `new_task` previously existed but has completed, deduces the link must be removed d. Link is removed 4. `new_task` starts executing, but isn't linked to `parent_task` because we've cleaned it up before it got a chance to start ### Proposed fix The fix I propose is to introduce a new `previous_task_objects` set which contains the addresses of all existing Task objects at the time we previously sampled. If we see a Task object in the Task Links Map that doesn't exist in `all_tasks`, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look at `previous_task_objects` – if the Task is there then it previously existed and just completed (⇒ need to remove); if not, it was just created (⇒ shouldn't remove). ### Admitted caveat I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links: ```py async def super_quick(): return None async def parent(): await asyncio.sleep(1.0) asyncio.gather(asyncio.create_task(super_quick())) ``` Given that super_quick will be… super quick, the following could happen: * We sample; we see parent ; `all_tasks = { parent }; task_links = {} ; previous_all_tasks = { parent }` * Parent continues, calls `asyncio.gather(...)` creates another Task ; `all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }` * (We don’t sample yet because we rarely sample…) * Task execution continues, `super_quick` runs super quickly and finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent }` * We sample; we see only parent (only Task in existence at that point); `current all_tasks = { parent }`; * `previous_all_tasks = { parent }` (still from the first sample we took) and `task_links = { super_quick -> parent }` * Our “debouncing” logic considers that `super_quick` is a new Task [because it’s not in `previous_all_tasks`] and doesn’t remove the link ⇒ The link will never go away because we’ll keep repeating this (potentially adding/removing other Tasks and Links, but this specific one will stay forever) ### Additional change The PR also fixes an issue with our patch for `as_completed` that would crash if `parent` was `None`. I don't think this can ever happen (since `as_completed` is necessarily called from a Task), but better safe than sorry. ### Testing As this change fixes a race condition, it is pretty hard to strictly speaking test it. `dd-trace-py` already has a test that flaked in the past due to it (and that flaked in Echion as well): [`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
Collaborator
Author
|
This has been reviewed+approved by Taegyun in dd-trace-py (here: DataDog/dd-trace-py#15552) so I'll go ahead and merge. |
KowalskiThomas
added a commit
to DataDog/dd-trace-py
that referenced
this pull request
Jan 7, 2026
Echion PR: P403n1x87/echion#210 This PR fixes a race condition happening around linking Parent and Children Tasks when using utilities like `asyncio.gather` or `asyncio.as_completed`. The race condition would manifest in the following way 1. User calls e.g. `asyncio.as_completed` from `parent_task` 2. Our patch creates Tasks for each awaitable passed to the function and inserts resulting Task objects into the Task Links Map (one of each is `new_task`) 3. A Sample is taken exactly at that moment, Task Link Map cleanup logic kicks in. a. `new_task` was created only just now, so was not part of the `all_tasks` `MirrorSet` snapshotted by the Sampler b. The cleanup logic sees that there's a link from `new_task` to `parent_task` but `new_task` isn't in `all_tasks` c. The cleanup logic assumes that means `new_task` previously existed but has completed, deduces the link must be removed d. Link is removed 4. `new_task` starts executing, but isn't linked to `parent_task` because we've cleaned it up before it got a chance to start The fix I propose is to introduce a new `previous_task_objects` set which contains the addresses of all existing Task objects at the time we previously sampled. If we see a Task object in the Task Links Map that doesn't exist in `all_tasks`, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look at `previous_task_objects` – if the Task is there then it previously existed and just completed (⇒ need to remove); if not, it was just created (⇒ shouldn't remove). I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links: ```py async def super_quick(): return None async def parent(): await asyncio.sleep(1.0) asyncio.gather(asyncio.create_task(super_quick())) ``` Given that super_quick will be… super quick, the following could happen: * We sample; we see parent ; `all_tasks = { parent }; task_links = {} ; previous_all_tasks = { parent }` * Parent continues, calls `asyncio.gather(...)` creates another Task ; `all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }` * (We don’t sample yet because we rarely sample…) * Task execution continues, `super_quick` runs super quickly and finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent }` * We sample; we see only parent (only Task in existence at that point); `current all_tasks = { parent }`; * `previous_all_tasks = { parent }` (still from the first sample we took) and `task_links = { super_quick -> parent }` * Our “debouncing” logic considers that `super_quick` is a new Task [because it’s not in `previous_all_tasks`] and doesn’t remove the link ⇒ The link will never go away because we’ll keep repeating this (potentially adding/removing other Tasks and Links, but this specific one will stay forever) The PR also fixes an issue with our patch for `as_completed` that would crash if `parent` was `None`. I don't think this can ever happen (since `as_completed` is necessarily called from a Task), but better safe than sorry. As this change fixes a race condition, it is pretty hard to strictly speaking test it. `dd-trace-py` already has a test that flaked in the past due to it (and that flaked in Echion as well): [`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
KowalskiThomas
added a commit
to DataDog/dd-trace-py
that referenced
this pull request
Jan 8, 2026
Echion PR: P403n1x87/echion#210 This PR fixes a race condition happening around linking Parent and Children Tasks when using utilities like `asyncio.gather` or `asyncio.as_completed`. The race condition would manifest in the following way 1. User calls e.g. `asyncio.as_completed` from `parent_task` 2. Our patch creates Tasks for each awaitable passed to the function and inserts resulting Task objects into the Task Links Map (one of each is `new_task`) 3. A Sample is taken exactly at that moment, Task Link Map cleanup logic kicks in. a. `new_task` was created only just now, so was not part of the `all_tasks` `MirrorSet` snapshotted by the Sampler b. The cleanup logic sees that there's a link from `new_task` to `parent_task` but `new_task` isn't in `all_tasks` c. The cleanup logic assumes that means `new_task` previously existed but has completed, deduces the link must be removed d. Link is removed 4. `new_task` starts executing, but isn't linked to `parent_task` because we've cleaned it up before it got a chance to start The fix I propose is to introduce a new `previous_task_objects` set which contains the addresses of all existing Task objects at the time we previously sampled. If we see a Task object in the Task Links Map that doesn't exist in `all_tasks`, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look at `previous_task_objects` – if the Task is there then it previously existed and just completed (⇒ need to remove); if not, it was just created (⇒ shouldn't remove). I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links: ```py async def super_quick(): return None async def parent(): await asyncio.sleep(1.0) asyncio.gather(asyncio.create_task(super_quick())) ``` Given that super_quick will be… super quick, the following could happen: * We sample; we see parent ; `all_tasks = { parent }; task_links = {} ; previous_all_tasks = { parent }` * Parent continues, calls `asyncio.gather(...)` creates another Task ; `all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }` * (We don’t sample yet because we rarely sample…) * Task execution continues, `super_quick` runs super quickly and finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent }` * We sample; we see only parent (only Task in existence at that point); `current all_tasks = { parent }`; * `previous_all_tasks = { parent }` (still from the first sample we took) and `task_links = { super_quick -> parent }` * Our “debouncing” logic considers that `super_quick` is a new Task [because it’s not in `previous_all_tasks`] and doesn’t remove the link ⇒ The link will never go away because we’ll keep repeating this (potentially adding/removing other Tasks and Links, but this specific one will stay forever) The PR also fixes an issue with our patch for `as_completed` that would crash if `parent` was `None`. I don't think this can ever happen (since `as_completed` is necessarily called from a Task), but better safe than sorry. As this change fixes a race condition, it is pretty hard to strictly speaking test it. `dd-trace-py` already has a test that flaked in the past due to it (and that flaked in Echion as well): [`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
This PR fixes a race condition happening around linking Parent and Children Tasks when using utilities like
asyncio.gatherorasyncio.as_completed.The problem
The race condition would manifest in the following way
asyncio.as_completedfromparent_tasknew_task)a.
new_taskwas created only just now, so was not part of theall_tasksMirrorSetsnapshotted by the Samplerb. The cleanup logic sees that there's a link from
new_tasktoparent_taskbutnew_taskisn't inall_tasksc. The cleanup logic assumes that means
new_taskpreviously existed but has completed, deduces the link must be removedd. Link is removed
new_taskstarts executing, but isn't linked toparent_taskbecause we've cleaned it up before it got a chance to startProposed fix
The fix I propose is to introduce a new
previous_task_objectsset which contains the addresses of all existing Task objects at the time we previously sampled.If we see a Task object in the Task Links Map that doesn't exist in
all_tasks, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look atprevious_task_objects– if the Task is there then it previously existed and just completed (⇒ need to remove); if not, it was just created (⇒ shouldn't remove).Admitted caveat
I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links:
Given that super_quick will be… super quick, the following could happen:
all_tasks = { parent }; task_links = {} ; previous_all_tasks = { parent }asyncio.gather(...)creates another Task ;all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }super_quickruns super quickly and finishes ;all_tasks = { parent }; task_links = { super_quick -> parent }current all_tasks = { parent };previous_all_tasks = { parent }(still from the first sample we took) andtask_links = { super_quick -> parent }super_quickis a new Task [because it’s not inprevious_all_tasks] and doesn’t remove the link⇒ The link will never go away because we’ll keep repeating this (potentially adding/removing other Tasks and Links, but this specific one will stay forever)
Additional change
The PR also fixes an issue with our patch for
as_completedthat would crash ifparentwasNone. I don't think this can ever happen (sinceas_completedis necessarily called from a Task), but better safe than sorry.