fix(profiling): correctly unwind on-CPU Tasks #15562
Conversation
Force-pushed 6a44721 to 166b449
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: The average import time from this PR is 250 ± 3 ms. The average import time from base is 253 ± 3 ms. The import time difference between this PR and base is -3.2 ± 0.1 ms.

Import time breakdown: The following import paths have shrunk:
Performance SLOs

Comparing candidate kowalski/fix-profiling-correctly-unwind-on-cpu-tasks (5edd9ff) with baseline main (7bff945).

📈 Performance Regressions (2 suites)

📈 iastaspectsospath - 24/24

| Benchmark | Time | Time SLO | Time vs baseline | Memory | Memory SLO | Memory vs baseline |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ ospathbasename_aspect | 5.104µs | <10.000µs (📉 -49.0%) | 📈 +21.3% | 38.574MB | <41.000MB (-5.9%) | +5.3% |
| ✅ ospathbasename_noaspect | 1.085µs | <10.000µs (📉 -89.2%) | +0.8% | 38.417MB | <41.000MB (-6.3%) | +4.3% |
| ✅ ospathjoin_aspect | 6.031µs | <10.000µs (📉 -39.7%) | -0.1% | 38.496MB | <41.000MB (-6.1%) | +4.9% |
| ✅ ospathjoin_noaspect | 2.275µs | <10.000µs (📉 -77.3%) | -0.7% | 38.653MB | <41.000MB (-5.7%) | +5.2% |
| ✅ ospathnormcase_aspect | 3.495µs | <10.000µs (📉 -65.0%) | +2.1% | 38.574MB | <41.000MB (-5.9%) | +5.3% |
| ✅ ospathnormcase_noaspect | 0.568µs | <10.000µs (📉 -94.3%) | ~same | 38.378MB | <41.000MB (-6.4%) | +4.2% |
| ✅ ospathsplit_aspect | 4.793µs | <10.000µs (📉 -52.1%) | -1.0% | 38.516MB | <41.000MB (-6.1%) | +4.5% |
| ✅ ospathsplit_noaspect | 1.580µs | <10.000µs (📉 -84.2%) | -0.5% | 38.653MB | <41.000MB (-5.7%) | +5.4% |
| ✅ ospathsplitdrive_aspect | 3.673µs | <10.000µs (📉 -63.3%) | ~same | 38.594MB | <41.000MB (-5.9%) | +5.2% |
| ✅ ospathsplitdrive_noaspect | 0.698µs | <10.000µs (📉 -93.0%) | +1.0% | 38.555MB | <41.000MB (-6.0%) | +4.9% |
| ✅ ospathsplitext_aspect | 4.563µs | <10.000µs (📉 -54.4%) | +1.0% | 38.673MB | <41.000MB (-5.7%) | +5.3% |
| ✅ ospathsplitext_noaspect | 1.367µs | <10.000µs (📉 -86.3%) | -1.4% | 38.358MB | <41.000MB (-6.4%) | +4.3% |

📈 telemetryaddmetric - 30/30

| Benchmark | Time | Time SLO | Time vs baseline | Memory | Memory SLO | Memory vs baseline |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ 1-count-metric-1-times | 3.342µs | <20.000µs (📉 -83.3%) | 📈 +16.2% | 34.800MB | <35.500MB (🟡 -2.0%) | +4.0% |
| ✅ 1-count-metrics-100-times | 198.993µs | <220.000µs (-9.5%) | -0.5% | 34.839MB | <35.500MB (🟡 -1.9%) | +4.2% |
| ✅ 1-distribution-metric-1-times | 3.182µs | <20.000µs (📉 -84.1%) | -0.9% | 34.780MB | <35.500MB (-2.0%) | +4.2% |
| ✅ 1-distribution-metrics-100-times | 214.835µs | <230.000µs (-6.6%) | +0.6% | 35.016MB | <35.500MB (🟡 -1.4%) | +4.7% |
| ✅ 1-gauge-metric-1-times | 2.155µs | <20.000µs (📉 -89.2%) | +0.2% | 34.800MB | <35.500MB (🟡 -2.0%) | +4.0% |
| ✅ 1-gauge-metrics-100-times | 136.578µs | <150.000µs (-8.9%) | -1.0% | 34.800MB | <35.500MB (🟡 -2.0%) | +3.9% |
| ✅ 1-rate-metric-1-times | 2.990µs | <20.000µs (📉 -85.1%) | -0.7% | 34.780MB | <35.500MB (-2.0%) | +4.2% |
| ✅ 1-rate-metrics-100-times | 211.851µs | <250.000µs (📉 -15.3%) | -1.1% | 34.898MB | <35.500MB (🟡 -1.7%) | +4.5% |
| ✅ 100-count-metrics-100-times | 20.086ms | <22.000ms (-8.7%) | -0.4% | 34.780MB | <35.500MB (-2.0%) | +5.0% |
| ✅ 100-distribution-metrics-100-times | 2.226ms | <2.550ms (📉 -12.7%) | ~same | 34.996MB | <35.500MB (🟡 -1.4%) | +4.7% |
| ✅ 100-gauge-metrics-100-times | 1.411ms | <1.550ms (-9.0%) | +0.7% | 34.721MB | <35.500MB (-2.2%) | +5.0% |
| ✅ 100-rate-metrics-100-times | 2.203ms | <2.550ms (📉 -13.6%) | +0.4% | 34.682MB | <35.500MB (-2.3%) | +4.6% |
| ✅ flush-1-metric | 4.528µs | <20.000µs (📉 -77.4%) | -0.4% | 35.095MB | <35.500MB (🟡 -1.1%) | +4.9% |
| ✅ flush-100-metrics | 174.459µs | <250.000µs (📉 -30.2%) | +0.2% | 35.173MB | <35.500MB (🟡 -0.9%) | +5.1% |
| ✅ flush-1000-metrics | 2.168ms | <2.500ms (📉 -13.3%) | -0.4% | 35.920MB | <36.500MB (🟡 -1.6%) | +5.1% |

🟡 Near SLO Breach (14 suites)

🟡 coreapiscenario - 10/10 (1 unstable)
Force-pushed 3851514 to 5edd9ff
taegyunkim
left a comment
If you could attach a very simple program with flamegraphs showing before/after, it would be really helpful for anyone who reads the PR.
Mostly LGTM with some comments.
KowalskiThomas
left a comment
Pushing the fixes soon ⏰
Force-pushed c49fe88 to d299788
Squashed my last two commits and pushed to the merge queue.
emmettbutler
left a comment
release note looks fine
What does this PR do?
More details and investigation results are available in https://datadoghq.atlassian.net/browse/PROF-13137
Note this PR depends on
This PR fixes the way we sample and capture stacks for on-CPU asyncio Tasks. This is a first step towards fixing all of the weird (sub-)stacks we are seeing (both in Echion and dd-trace-py), for example this one...

... where, in practice, `background_math_function` was executed as part of a Task that was completely unrelated to `Task-background_wait` and should never have appeared in the same stack. This by itself doesn't fix everything (see "The fix(es, actually)" below), but, as I said, it's a first step.
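In the spirit of the review comment above, a minimal program of the kind where these mixed-up stacks show up might look like the sketch below. This is a hypothetical repro: only `background_math_function` and `Task-background_wait` are taken from the example above; the rest of the names and the structure are assumptions.

```python
# Hypothetical minimal repro: one Task burns CPU while an unrelated Task
# spends its time suspended. The mixed-up stacks showed frames from the
# CPU-bound Task under the unrelated waiting Task.
import asyncio


def background_math_function(n: int) -> int:
    # Pure-CPU work; this is the kind of frame that leaked into unrelated Task stacks.
    return sum(i * i for i in range(n))


async def background_math() -> None:
    for _ in range(100):
        background_math_function(100_000)
        await asyncio.sleep(0)  # yield to the event loop between bursts


async def background_wait() -> None:
    # A completely unrelated Task that just waits.
    await asyncio.sleep(1)


async def main() -> None:
    math_task = asyncio.create_task(background_math(), name="Task-background_math")
    wait_task = asyncio.create_task(background_wait(), name="Task-background_wait")
    await asyncio.gather(math_task, wait_task)


if __name__ == "__main__":
    asyncio.run(main())
```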
See this document for more explanation of how this works (or is supposed to work); the fix is explained at the bottom.
The fix(es, actually)
Flawed existing logic...
The fixed logic is the following:

- ... `asyncio.run`, as well as the internal `asyncio` Runner machinery.
- If `Task-1` is awaiting `Task-2`, the Stack for `Task-2` goes under that for `Task-1`. We only ever want to add the "pure Python top-of-stack" once per Stack, so we need to do that once per leaf Task, after we've recursively unwound the awaiting Tasks for that leaf Task.
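A minimal sketch of what that composition means, written in Python rather than in the actual stack_v2/echion C++ code (`TaskInfo`, `unwind_leaf_task` and the frame representation are illustrative, not the real data structures):

```python
# Schematic sketch of the intended composition, not the actual implementation:
# an awaited Task's frames sit leaf-ward of the awaiting Task's frames, and the
# shared "pure Python top-of-stack" is appended exactly once per sample, after
# the leaf Task has been recursively unwound.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskInfo:
    name: str
    frames: List[str]                     # frames owned by this Task's coroutines
    waiter: Optional["TaskInfo"] = None   # the Task awaiting this one, if any


def unwind_leaf_task(leaf: TaskInfo, python_top_of_stack: List[str]) -> List[str]:
    """Build one sample for a leaf Task, leaf-most frame first, root-most last."""
    stack: List[str] = []
    task: Optional[TaskInfo] = leaf
    while task is not None:
        # Task-2's Stack goes under that of Task-1 when Task-1 awaits Task-2,
        # so the awaiting Task's frames are appended after the awaited one's.
        stack.extend(task.frames)
        task = task.waiter
    # The pure Python top-of-stack (asyncio.run, the Runner machinery, ...)
    # is added once per Stack, only after the recursive unwinding is done.
    stack.extend(python_top_of_stack)
    return stack
```

With `Task-1` awaiting `Task-2`, calling `unwind_leaf_task` on `Task-2` yields Task-2's frames, then Task-1's, then the shared top-of-stack, added exactly once.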
... and a race condition?

On top of that – because it's not enough – I think I discovered a rare but very real race condition (which was actually the root cause of that weird, rare-but-common-in-CI bug with the stack I showed at the top). The race condition happens when the pure Python Stack we capture involves some asynchronous Frames, but the associated Task/Coroutine completes or yields before we get to unwinding it. In that case, during Stack interleaving we assume the Task is still running; by doing so we include some Frames from the Python Stack that are not actually in the Stack generated by unwinding the Task. And because we share the same Python Stack for all Tasks (we only use its "upper part", which is the same for every Task), some Frames from the previously-running-but-now-paused-or-completed Coroutine are leaked to other Tasks.
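To make the timing issue concrete, here is a tiny illustration of the underlying Python behaviour (plain asyncio, not the profiler's capture path; the names are made up):

```python
# Illustration of why the timing matters: once the coroutine behind a Task
# completes, the frames captured for it earlier no longer describe a running
# Task, so interleaving them as if the Task were still running leaks frames.
import asyncio


async def leaf_coro():
    await asyncio.sleep(0)


async def main():
    task = asyncio.create_task(leaf_coro())
    coro = task.get_coro()

    await asyncio.sleep(0)   # let the Task start and suspend at its await
    print(coro.cr_frame)     # a live frame object: interleaving is safe here

    await task               # the Task completes
    print(coro.cr_frame)     # None: interleaving the previously captured
                             # Python Stack as if this Task were still running
                             # would leak its Frames into other Tasks' Stacks


asyncio.run(main())
```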
Working around this is not simple; honestly, we should also have a discussion about whether we want to properly work around it (which is going to be somewhat difficult, imperfect, and probably somewhat costly) or just give up on sampling when we detect something is off.
In the interest of keeping things somewhat simple, this will be handled in a separate PR. But I described the bug here anyway, for clarity (temporary PR: KowalskiThomas/echion#43).
Caveats
I realised after the fact that an assumption we implicitly make (that the on-CPU Task – if it exists – is a Leaf Task) is actually not generally true. In the case of certain `asyncio` utilities, the Parent Task will from time to time execute code / be on-CPU to manage the Child Tasks it is awaiting. This is an edge case, and something that happens rarely, but saying it can never happen is false (and it shows in Profiles); see the sketch at the end of this section.

We will also need to figure something out for that – it probably isn't that hard, but it will require some refactoring (probably splitting the unwinding into two steps: (1) determine only the upper Python Stack, (2) unwind the Tasks strictly speaking, starting from Leaf Tasks).
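One plausible example of the kind of `asyncio` utility meant above (the PR doesn't name a specific one): with `asyncio.as_completed`, the Parent Task wakes up and runs on-CPU each time one of the Child Tasks it is awaiting finishes, even though it is not a Leaf Task in the usual sense.

```python
# Sketch: the parent Task is briefly the on-CPU Task every time a child
# finishes, while it is still awaiting the remaining children.
import asyncio


async def child(i: int) -> int:
    await asyncio.sleep(0.01 * i)
    return i


async def parent() -> None:
    children = [asyncio.create_task(child(i)) for i in range(5)]
    for next_done in asyncio.as_completed(children):
        # While collecting each result, the parent Task is the on-CPU Task,
        # yet it is not a Leaf Task: it is still awaiting other children.
        print("child finished:", await next_done)


asyncio.run(parent())
```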
Testing
- The PR adds new tests around on-CPU Tasks and what their Stacks look like.
- The PR also updates a lot of existing tests to output their `DataSummary` as JSON (a sketch of the idea is at the end of this section). This is immensely useful for understanding what is actually going on when we don't see the expected Stacks (or see unexpected ones), and I do think it'd be worth keeping.

After these changes, the tests that still fail locally and in CI (flakily) are unrelated to on-CPU Tasks, which, I guess, means mission failed successfully?
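To make the DataSummary-as-JSON idea concrete, here is a hedged sketch of what such a dump could look like; the actual `DataSummary` type and its fields are defined by the test suite, so the structure below is purely illustrative.

```python
# Hedged sketch of the debugging idea only: the field names and nesting here
# are invented, not the real DataSummary structure.
import json

summary = {
    "threads": {
        "MainThread": {
            "tasks": {
                "Task-background_wait": ["<module>", "asyncio.run", "main", "background_wait"],
            },
        },
    },
}

# Emitting the summary as JSON makes it easy to diff the expected Stacks
# against the ones actually captured when a test fails.
print(json.dumps(summary, indent=2, sort_keys=True))
```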