# fix(profiling): do not remove links for just-created Tasks [backport 4.0] #15892
## Conversation
### Bootstrap import analysis

Comparison of import times between this PR and base.

**Summary**

- The average import time from this PR is: 253 ± 5 ms.
- The average import time from base is: 257 ± 6 ms.
- The import time difference between this PR and base is: -4.5 ± 0.2 ms.

**Import time breakdown**

The following import paths have shrunk:
### Performance SLOs

Comparing candidate `kowalski/fix-profiling-do-not-remove-links-for-just-created-tasks-15552` (3f7ae53) with baseline `4.0` (dc3c5e9).

**🟡 Near SLO Breach (2 suites)**

#### 🟡 packages: packageforrootmodulemapping (4/4)

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ cache_off | ✅ 341.160ms | <354.300ms (-3.7%) | -0.8% | ✅ 41.094MB | <41.500MB (🟡 -1.0%) | +5.0% |
| ✅ cache_on | ✅ 0.385µs | <10.000µs (📉 -96.1%) | +0.7% | ✅ 39.773MB | <41.000MB (-3.0%) | +4.8% |

#### 🟡 telemetry: addmetric (30/30)

| Benchmark | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| ✅ 1-count-metric-1-times | ✅ 2.980µs | <20.000µs (📉 -85.1%) | +1.9% | ✅ 34.544MB | <35.500MB (-2.7%) | +4.7% |
| ✅ 1-count-metrics-100-times | ✅ 202.856µs | <220.000µs (-7.8%) | +1.8% | ✅ 34.583MB | <35.500MB (-2.6%) | +4.9% |
| ✅ 1-distribution-metric-1-times | ✅ 3.343µs | <20.000µs (📉 -83.3%) | +1.2% | ✅ 34.564MB | <35.500MB (-2.6%) | +4.6% |
| ✅ 1-distribution-metrics-100-times | ✅ 220.580µs | <230.000µs (-4.1%) | +0.8% | ✅ 34.505MB | <35.500MB (-2.8%) | +4.8% |
| ✅ 1-gauge-metric-1-times | ✅ 2.191µs | <20.000µs (📉 -89.0%) | +0.6% | ✅ 34.524MB | <35.500MB (-2.7%) | +4.7% |
| ✅ 1-gauge-metrics-100-times | ✅ 137.491µs | <150.000µs (-8.3%) | -1.0% | ✅ 34.544MB | <35.500MB (-2.7%) | +4.7% |
| ✅ 1-rate-metric-1-times | ✅ 3.291µs | <20.000µs (📉 -83.5%) | +6.0% | ✅ 34.603MB | <35.500MB (-2.5%) | +5.0% |
| ✅ 1-rate-metrics-100-times | ✅ 217.054µs | <250.000µs (📉 -13.2%) | +1.9% | ✅ 34.603MB | <35.500MB (-2.5%) | +4.9% |
| ✅ 100-count-metrics-100-times | ✅ 20.057ms | <22.000ms (-8.8%) | +0.3% | ✅ 34.505MB | <35.500MB (-2.8%) | +4.8% |
| ✅ 100-distribution-metrics-100-times | ✅ 2.273ms | <2.300ms (🟡 -1.2%) | +0.3% | ✅ 34.603MB | <35.500MB (-2.5%) | +4.9% |
| ✅ 100-gauge-metrics-100-times | ✅ 1.408ms | <1.550ms (-9.2%) | -0.7% | ✅ 34.564MB | <35.500MB (-2.6%) | +4.7% |
| ✅ 100-rate-metrics-100-times | ✅ 2.194ms | <2.550ms (📉 -14.0%) | +0.2% | ✅ 34.603MB | <35.500MB (-2.5%) | +4.9% |
| ✅ flush-1-metric | ✅ 4.758µs | <20.000µs (📉 -76.2%) | +5.2% | ✅ 34.603MB | <35.500MB (-2.5%) | +5.0% |
| ✅ flush-100-metrics | ✅ 173.930µs | <250.000µs (📉 -30.4%) | ~same | ✅ 34.524MB | <35.500MB (-2.7%) | +4.9% |
| ✅ flush-1000-metrics | ✅ 2.112ms | <2.500ms (📉 -15.5%) | -0.4% | ✅ 35.350MB | <36.500MB (-3.2%) | +4.6% |
**r1viollet** left a comment:
LGTM
Echion PR: P403n1x87/echion#210

This PR fixes a race condition in the linking of parent and child Tasks when using utilities like `asyncio.gather` or `asyncio.as_completed`. The race would manifest in the following way:

1. The user calls e.g. `asyncio.as_completed` from `parent_task`.
2. Our patch creates a Task for each awaitable passed to the function and inserts the resulting Task objects into the Task Links Map (call one of them `new_task`).
3. A sample is taken at exactly that moment, and the Task Links Map cleanup logic kicks in:
   a. `new_task` was only just created, so it was not part of the `all_tasks` `MirrorSet` snapshotted by the sampler.
   b. The cleanup logic sees that there is a link from `new_task` to `parent_task`, but `new_task` is not in `all_tasks`.
   c. The cleanup logic assumes this means `new_task` previously existed but has completed, and deduces that the link must be removed.
   d. The link is removed.
4. `new_task` starts executing, but it is not linked to `parent_task`, because we cleaned the link up before the Task got a chance to start.

The fix I propose introduces a new `previous_task_objects` set containing the addresses of all Task objects that existed at the time of the previous sample. If we see a Task object in the Task Links Map that is not in `all_tasks`, two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we are in, we look at `previous_task_objects`: if the Task is there, it previously existed and has just completed (⇒ remove the link); if not, it was just created (⇒ keep the link). A minimal sketch of this decision is shown at the end of this description.

I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), we could miss certain Tasks and never get rid of their links:

```py
async def super_quick():
    return None

async def parent():
    await asyncio.sleep(1.0)
    asyncio.gather(asyncio.create_task(super_quick()))
```

Given that `super_quick` will be… super quick, the following could happen:

* We sample; we see `parent`; `all_tasks = { parent }; task_links = {}; previous_all_tasks = { parent }`.
* `parent` continues and calls `asyncio.gather(...)`, which creates another Task; `all_tasks = { parent, super_quick }; task_links = { super_quick -> parent }`.
* (We don't sample yet, because we rarely sample…)
* Task execution continues, `super_quick` runs super quickly and finishes; `all_tasks = { parent }; task_links = { super_quick -> parent }`.
* We sample; we see only `parent` (the only Task in existence at that point); current `all_tasks = { parent }`.
* `previous_all_tasks = { parent }` (still from the first sample we took) and `task_links = { super_quick -> parent }`.
* Our "debouncing" logic considers `super_quick` a new Task (because it is not in `previous_all_tasks`) and does not remove the link.

⇒ The link will never go away, because we will keep repeating this (potentially adding and removing other Tasks and links, but this specific one stays forever).

The PR also fixes an issue with our patch for `as_completed` that would crash if `parent` was `None`. I don't think this can ever happen (since `as_completed` is necessarily called from a Task), but better safe than sorry.
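To make the rule above concrete, here is a minimal Python sketch of the cleanup decision, written against the description rather than the actual implementation: the real logic lives in Echion (in C++), and the function name `cleanup_task_links` as well as the plain dict/set types stand in for the real Task Links Map and `MirrorSet` structures.

```py
# Hypothetical sketch only; the real cleanup lives in Echion's C++ sampler.
# task_links maps a child Task's address to its parent Task's address.

def cleanup_task_links(task_links, all_tasks, previous_task_objects):
    """Drop links only for Tasks that existed at the previous sample and are now gone."""
    for child in list(task_links):
        if child in all_tasks:
            continue  # Task is still alive: keep the link.
        if child in previous_task_objects:
            # The Task was seen on the previous sample and has since completed,
            # so its link is stale and can be removed.
            del task_links[child]
        # Otherwise the Task was created after the previous sample and simply
        # hasn't been observed yet: keep the link so the parent/child
        # relationship isn't lost before the Task even starts running.
    # Remember what exists now, for the next cleanup pass.
    return set(all_tasks)
```

On each sample, the returned set becomes the next pass's `previous_task_objects`, so a link is only dropped for a Task that was seen on a previous sample and is gone now. A Task that is born and finishes entirely between two samples is never seen at all, which is exactly the caveat described above.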
As this change fixes a race condition, it is, strictly speaking, pretty hard to test. `dd-trace-py` already has a test that flaked in the past because of it (and that also flaked in Echion): [`test_asyncio_as_completed`](https://github.com/datadog/dd-trace-py/blob/5b8add2c81604dafaa2b6c772c18efed71a6ae19/tests/profiling/collector/test_asyncio_as_completed.py#L62).
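For context, the workload that test exercises looks roughly like the hypothetical snippet below (not the actual test code): `as_completed` wraps many short-lived coroutines in Tasks at once, so Tasks are constantly being created and finishing right as samples may be taken, which is exactly the window in which the old cleanup logic could drop a fresh link.

```py
import asyncio

async def quick(i):
    # Short-lived coroutine: its Task can start and finish between two samples.
    return i

async def main():
    # as_completed wraps each coroutine in a Task; the parent/child links for
    # these Tasks are what the fixed cleanup logic must not drop prematurely.
    results = [await fut for fut in asyncio.as_completed([quick(i) for i in range(100)])]
    assert sorted(results) == list(range(100))

asyncio.run(main())
```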
**What is this PR?**
This PR is a backport of #15552.