Skip to content

fix: do not remove links for just-created Tasks#210

Merged
KowalskiThomas merged 1 commit intoP403n1x87:mainfrom
KowalskiThomas:kowalski/fix-do-not-remove-links-for-just-created-tasks
Dec 9, 2025
Merged

fix: do not remove links for just-created Tasks#210
KowalskiThomas merged 1 commit intoP403n1x87:mainfrom
KowalskiThomas:kowalski/fix-do-not-remove-links-for-just-created-tasks

Conversation

@KowalskiThomas
Copy link
Collaborator

@KowalskiThomas KowalskiThomas commented Dec 8, 2025

What does this PR do?

This PR fixes a race condition happening around linking Parent and Children Tasks when using utilities like asyncio.gather or asyncio.as_completed.

The problem

The race condition would manifest in the following way

  1. User calls e.g. asyncio.as_completed from parent_task
  2. Our patch creates Tasks for each awaitable passed to the function and inserts resulting Task objects into the Task Links Map (one of each is new_task)
  3. A Sample is taken exactly at that moment, Task Link Map cleanup logic kicks in.
    a. new_task was created only just now, so was not part of the all_tasks MirrorSet snapshotted by the Sampler
    b. The cleanup logic sees that there's a link from new_task to parent_task but new_task isn't in all_tasks
    c. The cleanup logic assumes that means new_task previously existed but has completed, deduces the link must be removed
    d. Link is removed
  4. new_task starts executing, but isn't linked to parent_task because we've cleaned it up before it got a chance to start

Proposed fix

The fix I propose is to introduce a new previous_task_objects set which contains the addresses of all existing Task objects at the time we previously sampled.
If we see a Task object in the Task Links Map that doesn't exist in all_tasks, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look at previous_task_objects – if the Task is there then it previously existed and just completed (⇒ need to remove); if not, it was just created (⇒ shouldn't remove).

Admitted caveat

I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links:

async def super_quick():
  return None

async def parent():
  await asyncio.sleep(1.0)
  asyncio.gather(asyncio.create_task(super_quick()))

Given that super_quick will be… super quick, the following could happen:

  • We sample; we see parent ; all_tasks = { parent }; task_links = {} ; previous_all_tasks = { parent }
  • Parent continues, calls asyncio.gather(...) creates another Task ; all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }
  • (We don’t sample yet because we rarely sample…)
  • Task execution continues, super_quick runs super quickly and finishes ; all_tasks = { parent }; task_links = { super_quick -> parent }
  • We sample; we see only parent (only Task in existence at that point); current all_tasks = { parent };
    • previous_all_tasks = { parent } (still from the first sample we took) and task_links = { super_quick -> parent }
    • Our “debouncing” logic considers that super_quick is a new Task [because it’s not in previous_all_tasks] and doesn’t remove the link

⇒ The link will never go away because we’ll keep repeating this (potentially adding/removing other Tasks and Links, but this specific one will stay forever)

Additional change

The PR also fixes an issue with our patch for as_completed that would crash if parent was None. I don't think this can ever happen (since as_completed is necessarily called from a Task), but better safe than sorry.

@KowalskiThomas KowalskiThomas marked this pull request as ready for review December 8, 2025 14:15
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request Dec 9, 2025
## Description

Echion PR: P403n1x87/echion#210

This PR fixes a race condition happening around linking Parent and
Children Tasks when using utilities like `asyncio.gather` or
`asyncio.as_completed`.

### The problem

The race condition would manifest in the following way
1.  User calls e.g. `asyncio.as_completed` from `parent_task`
2. Our patch creates Tasks for each awaitable passed to the function and
inserts resulting Task objects into the Task Links Map (one of each is
`new_task`)
3. A Sample is taken exactly at that moment, Task Link Map cleanup logic
kicks in.
a. `new_task` was created only just now, so was not part of the
`all_tasks` `MirrorSet` snapshotted by the Sampler
b. The cleanup logic sees that there's a link from `new_task` to
`parent_task` but `new_task` isn't in `all_tasks`
c. The cleanup logic assumes that means `new_task` previously existed
but has completed, deduces the link must be removed
  d. Link is removed
4. `new_task` starts executing, but isn't linked to `parent_task`
because we've cleaned it up before it got a chance to start

### Proposed fix

The fix I propose is to introduce a new `previous_task_objects` set
which contains the addresses of all existing Task objects at the time we
previously sampled.
If we see a Task object in the Task Links Map that doesn't exist in
`all_tasks`, then two things are possible: either the Task used to exist
and the link should be removed, or the Task was just created and the
link should stay. To determine which case we're in, we look at
`previous_task_objects` – if the Task is there then it previously
existed and just completed (⇒ need to remove); if not, it was just
created (⇒ shouldn't remove).

### Admitted caveat

I think this comes with one caveat: if we sample very rarely (which may
happen in certain circumstances with adaptive sampling), then we could
miss certain Tasks and never get rid of their links:

```py
async def super_quick():
  return None

async def parent():
  await asyncio.sleep(1.0)
  asyncio.gather(asyncio.create_task(super_quick()))
```

Given that super_quick will be… super quick, the following could happen:

* We sample; we see parent ; `all_tasks = { parent }; task_links = {} ;
previous_all_tasks = { parent }`
* Parent continues, calls `asyncio.gather(...)` creates another Task ;
`all_tasks = { parent, super_quick }; task_links = { super_quick ->
parent }`
* (We don’t sample yet because we rarely sample…)
* Task execution continues, `super_quick` runs super quickly and
finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent
}`
* We sample; we see only parent (only Task in existence at that point);
`current all_tasks = { parent }`;
* `previous_all_tasks = { parent }` (still from the first sample we
took) and `task_links = { super_quick -> parent }`
* Our “debouncing” logic considers that `super_quick` is a new Task
[because it’s not in `previous_all_tasks`] and doesn’t remove the link

⇒ The link will never go away because we’ll keep repeating this
(potentially adding/removing other Tasks and Links, but this specific
one will stay forever)

### Additional change

The PR also fixes an issue with our patch for `as_completed` that would
crash if `parent` was `None`. I don't think this can ever happen (since
`as_completed` is necessarily called from a Task), but better safe than
sorry.

### Testing

As this change fixes a race condition, it is pretty hard to strictly
speaking test it. `dd-trace-py` already has a test that flaked in the
past due to it (and that flaked in Echion as well):
[`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
@KowalskiThomas
Copy link
Collaborator Author

This has been reviewed+approved by Taegyun in dd-trace-py (here: DataDog/dd-trace-py#15552) so I'll go ahead and merge.

@KowalskiThomas KowalskiThomas merged commit a98f7be into P403n1x87:main Dec 9, 2025
91 of 96 checks passed
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request Jan 7, 2026
Echion PR: P403n1x87/echion#210

This PR fixes a race condition happening around linking Parent and
Children Tasks when using utilities like `asyncio.gather` or
`asyncio.as_completed`.

The race condition would manifest in the following way
1.  User calls e.g. `asyncio.as_completed` from `parent_task`
2. Our patch creates Tasks for each awaitable passed to the function and
inserts resulting Task objects into the Task Links Map (one of each is
`new_task`)
3. A Sample is taken exactly at that moment, Task Link Map cleanup logic
kicks in.
a. `new_task` was created only just now, so was not part of the
`all_tasks` `MirrorSet` snapshotted by the Sampler
b. The cleanup logic sees that there's a link from `new_task` to
`parent_task` but `new_task` isn't in `all_tasks`
c. The cleanup logic assumes that means `new_task` previously existed
but has completed, deduces the link must be removed
  d. Link is removed
4. `new_task` starts executing, but isn't linked to `parent_task`
because we've cleaned it up before it got a chance to start

The fix I propose is to introduce a new `previous_task_objects` set
which contains the addresses of all existing Task objects at the time we
previously sampled.
If we see a Task object in the Task Links Map that doesn't exist in
`all_tasks`, then two things are possible: either the Task used to exist
and the link should be removed, or the Task was just created and the
link should stay. To determine which case we're in, we look at
`previous_task_objects` – if the Task is there then it previously
existed and just completed (⇒ need to remove); if not, it was just
created (⇒ shouldn't remove).

I think this comes with one caveat: if we sample very rarely (which may
happen in certain circumstances with adaptive sampling), then we could
miss certain Tasks and never get rid of their links:

```py
async def super_quick():
  return None

async def parent():
  await asyncio.sleep(1.0)
  asyncio.gather(asyncio.create_task(super_quick()))
```

Given that super_quick will be… super quick, the following could happen:

* We sample; we see parent ; `all_tasks = { parent }; task_links = {} ;
previous_all_tasks = { parent }`
* Parent continues, calls `asyncio.gather(...)` creates another Task ;
`all_tasks = { parent, super_quick }; task_links = { super_quick ->
parent }`
* (We don’t sample yet because we rarely sample…)
* Task execution continues, `super_quick` runs super quickly and
finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent
}`
* We sample; we see only parent (only Task in existence at that point);
`current all_tasks = { parent }`;
* `previous_all_tasks = { parent }` (still from the first sample we
took) and `task_links = { super_quick -> parent }`
* Our “debouncing” logic considers that `super_quick` is a new Task
[because it’s not in `previous_all_tasks`] and doesn’t remove the link

⇒ The link will never go away because we’ll keep repeating this
(potentially adding/removing other Tasks and Links, but this specific
one will stay forever)

The PR also fixes an issue with our patch for `as_completed` that would
crash if `parent` was `None`. I don't think this can ever happen (since
`as_completed` is necessarily called from a Task), but better safe than
sorry.

As this change fixes a race condition, it is pretty hard to strictly
speaking test it. `dd-trace-py` already has a test that flaked in the
past due to it (and that flaked in Echion as well):
[`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
KowalskiThomas added a commit to DataDog/dd-trace-py that referenced this pull request Jan 8, 2026
Echion PR: P403n1x87/echion#210

This PR fixes a race condition happening around linking Parent and
Children Tasks when using utilities like `asyncio.gather` or
`asyncio.as_completed`.

The race condition would manifest in the following way
1.  User calls e.g. `asyncio.as_completed` from `parent_task`
2. Our patch creates Tasks for each awaitable passed to the function and
inserts resulting Task objects into the Task Links Map (one of each is
`new_task`)
3. A Sample is taken exactly at that moment, Task Link Map cleanup logic
kicks in.
a. `new_task` was created only just now, so was not part of the
`all_tasks` `MirrorSet` snapshotted by the Sampler
b. The cleanup logic sees that there's a link from `new_task` to
`parent_task` but `new_task` isn't in `all_tasks`
c. The cleanup logic assumes that means `new_task` previously existed
but has completed, deduces the link must be removed
  d. Link is removed
4. `new_task` starts executing, but isn't linked to `parent_task`
because we've cleaned it up before it got a chance to start

The fix I propose is to introduce a new `previous_task_objects` set
which contains the addresses of all existing Task objects at the time we
previously sampled.
If we see a Task object in the Task Links Map that doesn't exist in
`all_tasks`, then two things are possible: either the Task used to exist
and the link should be removed, or the Task was just created and the
link should stay. To determine which case we're in, we look at
`previous_task_objects` – if the Task is there then it previously
existed and just completed (⇒ need to remove); if not, it was just
created (⇒ shouldn't remove).

I think this comes with one caveat: if we sample very rarely (which may
happen in certain circumstances with adaptive sampling), then we could
miss certain Tasks and never get rid of their links:

```py
async def super_quick():
  return None

async def parent():
  await asyncio.sleep(1.0)
  asyncio.gather(asyncio.create_task(super_quick()))
```

Given that super_quick will be… super quick, the following could happen:

* We sample; we see parent ; `all_tasks = { parent }; task_links = {} ;
previous_all_tasks = { parent }`
* Parent continues, calls `asyncio.gather(...)` creates another Task ;
`all_tasks = { parent, super_quick }; task_links = { super_quick ->
parent }`
* (We don’t sample yet because we rarely sample…)
* Task execution continues, `super_quick` runs super quickly and
finishes ; `all_tasks = { parent }; task_links = { super_quick -> parent
}`
* We sample; we see only parent (only Task in existence at that point);
`current all_tasks = { parent }`;
* `previous_all_tasks = { parent }` (still from the first sample we
took) and `task_links = { super_quick -> parent }`
* Our “debouncing” logic considers that `super_quick` is a new Task
[because it’s not in `previous_all_tasks`] and doesn’t remove the link

⇒ The link will never go away because we’ll keep repeating this
(potentially adding/removing other Tasks and Links, but this specific
one will stay forever)

The PR also fixes an issue with our patch for `as_completed` that would
crash if `parent` was `None`. I don't think this can ever happen (since
`as_completed` is necessarily called from a Task), but better safe than
sorry.

As this change fixes a race condition, it is pretty hard to strictly
speaking test it. `dd-trace-py` already has a test that flaked in the
past due to it (and that flaked in Echion as well):
[`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant