fix(profiling): do not remove links for just-created Tasks #15552
Conversation
The branch was force-pushed from 3efe284 to 5b8add2.
taegyunkim
left a comment
Would it be possible to have a test case with what you wrote in the PR description?
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary
The average import time from this PR is: 250 ± 3 ms.
The average import time from base is: 252 ± 3 ms.
The import time difference between this PR and base is: -2.6 ± 0.1 ms.

Import time breakdown
The following import paths have shrunk:
Performance SLOs

Comparing candidate kowalski/fix-profiling-do-not-remove-links-for-just-created-tasks (a7baa36) with baseline main (73210fe).

📈 Performance Regressions (3 suites)
- iastaspects - 118/118 within SLO; flagged regression vs baseline: stringio_aspect time 📈 +14.7%
- iastaspectsospath - 24/24 within SLO; flagged regression vs baseline: ospathbasename_aspect time 📈 +26.5%
- telemetryaddmetric - 30/30 within SLO; flagged regression vs baseline: 1-count-metric-1-times time 📈 +14.5%

🟡 Near SLO Breach (17 suites)
🟡 coreapiscenario - 10/10 (1 unstable)

[Full per-benchmark time/memory breakdown elided; every benchmark listed above remained within its time and memory SLOs.]
I'll double-check, but this fixes an edge case that is already covered by a test (one that was a bit flaky, at least in Echion, due to this race condition). I'll also try to add a test that triggers the issue in a surefire way.
@taegyunkim I double-checked, and the test that was spuriously failing in Echion due to this race condition seems not to flake (see here) in dd-trace-py, potentially due to our lower sampling rate. If you look at the CI history in the Echion repo, though, you will easily find instances of the test for
The branch was force-pushed from 5b8add2 to eba6d01.
@taegyunkim I just had another look at the test, and it seems the one we had in dd-trace-py was much more basic than the one in Echion – probably due to its tendency to flake. I updated it (in the current PR) to be the same as the Echion one – many Tasks, and we check that we've seen what we expect in all Tasks [except for 1, as we have a known issue in dd-trace-py that makes one of the Tasks appear as part of the Parent Task]. I wouldn't have realised this without your comment – thanks for calling it out! 🙇
The branch was force-pushed from eba6d01 to a7baa36.
taegyunkim
left a comment
Thanks for updating the test!
## What does this PR do?

More details and investigation results are available in https://datadoghq.atlassian.net/browse/PROF-13137

Related PRs
* Dependency: #15552
* Dependent: #15579
* Echion PR: P403n1x87/echion#198

This PR fixes the way we sample and capture stacks for on-CPU asyncio Tasks. This is a first step towards fixing _all_ the weird (sub-)stacks we are seeing (both in Echion and dd-trace-py), for example this one...

```
background_math_function
Task-background_wait
background_wait_function
sleep
```

... where in practice, `background_math_function` was executed as part of a Task that was completely unrelated to `Task-background_wait` and should never have appeared in the same stack. This by itself doesn't fix everything (see **The fix(es, actually)**), but, as I said, it's a first step.

See [this document](https://docs.google.com/document/d/10HmKhY6tM5j7ojdk7CG8anWsN7qP7IyeZlj3-zAXudM/edit?pli=1&tab=t.9pzrl0uli4g) for more explanations around how this works / is supposed to work; the fix is explained at the bottom there.

## The fix(es, actually)

### Flawed existing logic...

The fix logic is the following:

* To correctly interleave Task Stacks and Task names, we need to determine the "synchronous Python top-of-stack" that goes on top of the on-CPU Task (if there is an on-CPU Task).
  * This includes the Python stack that eventually led to doing something like `asyncio.run`, as well as the internal `asyncio` Runner machinery.
  * We will have to push this on top of every Task Stack we capture (note that the on-CPU Stack will already have it by definition).
* To do that, we need to determine which Task is "the on-CPU Task" (if any) and process it first, so that we know how "high" the "upper Python stack" is and can re-use it for subsequent Task unwindings.
  * To ensure the first Task we unwind is the on-CPU one, we first sort the Leaf Tasks array by on-CPU-ness.
  * Note I'm saying _the_ on-CPU Task because, as always in Python, there can only be one thing running at a time.
  * (Well, that's the case at least in theory – in practice, on-CPU-ness may change as we sample, so we could end up with more than one on-CPU Task, but that is fine because it won't change how deep the "upper stack" is.)
* Once we've done this, we know how high the "top of Stack" is, because it is the height of the "pure Python" Stack we get by unwinding the first (top-most) Frame of the on-CPU Task. We need to remember that height so we can push that many items from the Python Thread Stack on top of each Task's Stack.
* One final thing to keep in mind is that we "stack" Stacks (hehe) such that if `Task-1` is awaiting `Task-2`, the Stack for `Task-2` goes under that for `Task-1`. We only ever want to add the "pure Python top-of-stack" once per Stack, so we need to do that _once per leaf Task_, after we've recursively unwound awaiting Tasks for the leaf Task.

### ... and a race condition?

On top of that – because it's not enough – I think I discovered a rare but very real race condition (which was actually the root cause of that weird, rare-but-common-in-CI bug with the stack I showed on top). The race condition happens when the pure Python Stack we capture involves some asynchronous Frames, but the associated Task/Coroutine completes or yields before we get to unwinding it. In that case, the Stack-interleaving logic assumes the Task is running; by doing so, we include some Frames from the Python Stack that actually _are not_ in the Stack generated by unwinding the Task, and because we share the same Python Stack for all Tasks (we only use its "upper part", which is the same for every Task), some Frames from the previously-running-but-now-paused-or-completed Coroutine are leaked to other Tasks.

Working around this is not simple; honestly, we should also have a discussion around whether we want to properly work around it (which is going to be somewhat difficult, imperfect, and probably somewhat costly) or just give up on sampling when we detect something is off. In the interest of keeping things somewhat simple, this will be handled in a separate PR. But I described the bug here anyway, for clarity (temporary PR: KowalskiThomas/echion#43).

### Caveats

I realised after the fact that an assumption we implicitly make (that _the on-CPU Task – if it exists – is a Leaf Task_) is actually not generally true. In the case of certain `asyncio` utils, the Parent Task will actually, from time to time, execute code / be on-CPU to manage the Child Tasks it is awaiting. This is an edge case, and something that happens rarely, but saying it can never happen is false (and it shows in Profiles). We will also need to figure something out for that – it probably isn't _that_ hard, but it will require some refactors (probably splitting the unwinding in two steps: (1) only determine the upper Python Stack, (2) unwind the Tasks strictly speaking, starting from Leaf Tasks).

## Testing

The PR adds new tests around on-CPU Tasks and what their Stacks look like.

The PR also updates a lot of existing tests to output their `DataSummary` as JSON. This is immensely useful for understanding what is actually going on when we don't see the expected Stacks (or see unexpected ones), and I do think it'd be worth keeping.

After these changes, the tests that still fail locally and in the CI (flakily) are unrelated to on-CPU Tasks, which, I guess, means _mission failed successfully_?
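For illustration, here is a minimal Python sketch of the interleaving described above. This is not the actual implementation (the real one lives in Echion's C++ unwinder); `LeafTask`, `interleave`, and the frame lists are hypothetical names chosen to mirror the description:

```python
from dataclasses import dataclass

@dataclass
class LeafTask:
    name: str
    on_cpu: bool
    frames: list  # Task Stack (root first); awaited Tasks already merged underneath

def interleave(thread_stack, leaf_tasks):
    """Prepend the shared 'synchronous Python top-of-stack' once per leaf Task."""
    # Process the on-CPU Task first so we learn how high the upper stack is.
    leaf_tasks.sort(key=lambda t: t.on_cpu, reverse=True)
    upper_height = 0
    stacks = {}
    for i, task in enumerate(leaf_tasks):
        if i == 0 and task.on_cpu:
            # The on-CPU Task's frames are the deepest part of the thread
            # stack; whatever sits above them (asyncio.run, the Runner
            # machinery, ...) is the shared "upper Python stack".
            upper_height = len(thread_stack) - len(task.frames)
        stacks[task.name] = thread_stack[:upper_height] + task.frames
    return stacks

print(interleave(
    ["<module>", "main", "asyncio.run", "Runner.run", "task_a_coro"],
    [LeafTask("Task-b", False, ["task_b_coro", "sleep"]),
     LeafTask("Task-a", True, ["task_a_coro"])],
))
```

Note how the on-CPU Task's Stack already equals the full thread stack "by definition", while the other leaf Task only borrows the shared upper part.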
Description
Echion PR: P403n1x87/echion#210
This PR fixes a race condition happening around linking Parent and Child Tasks when using utilities like `asyncio.gather` or `asyncio.as_completed`.

## The problem

The race condition would manifest in the following way:

1. `asyncio.as_completed` is called from `parent_task`, creating a brand-new Task (`new_task`).
2. The Sampler takes a sample, then the cleanup logic runs:
   a. `new_task` was created only just now, so it was not part of the `all_tasks` `MirrorSet` snapshotted by the Sampler.
   b. The cleanup logic sees that there's a link from `new_task` to `parent_task`, but `new_task` isn't in `all_tasks`.
   c. The cleanup logic assumes that means `new_task` previously existed but has completed, and deduces the link must be removed.
   d. The link is removed.
3. `new_task` starts executing, but isn't linked to `parent_task`, because we've cleaned it up before it got a chance to start.
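A minimal sketch of the triggering pattern (a hypothetical repro, not a test from the PR): `as_completed` wraps the coroutine in a brand-new Task – the `new_task` from the steps above – and if the Sampler runs between that Task's creation and its first step, the old cleanup logic wrongly removed its link.

```python
import asyncio

async def child():
    await asyncio.sleep(0)

async def parent():
    # as_completed() creates a new Task from the coroutine; the window
    # between Task creation and its first step is where the race lives.
    for fut in asyncio.as_completed([child()]):
        await fut

asyncio.run(parent())
```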
## Proposed fix

The fix I propose is to introduce a new `previous_task_objects` set, which contains the addresses of all Task objects that existed at the time we previously sampled.

If we see a Task object in the Task Links Map that doesn't exist in `all_tasks`, then two things are possible: either the Task used to exist and the link should be removed, or the Task was just created and the link should stay. To determine which case we're in, we look at `previous_task_objects`: if the Task is there, then it previously existed and just completed (⇒ we need to remove the link); if not, it was just created (⇒ we shouldn't remove it).
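In Python pseudocode, the rule looks roughly like this (the real logic lives in Echion's C++ sampler, so the names and types here are illustrative only):

```python
def cleanup_task_links(task_links: dict, all_tasks: set, previous_task_objects: set):
    """Drop links whose child Task completed; keep links of just-created Tasks."""
    for child in list(task_links):
        if child in all_tasks:
            continue  # Task is still alive: nothing to clean up
        if child in previous_task_objects:
            # Seen at the previous sample, gone now => it completed.
            del task_links[child]
        # else: never seen before => just created, keep its link

# Example: 0x1 completed (it was known previously), 0x2 was just created.
links = {0x1: 0x9, 0x2: 0x9}
cleanup_task_links(links, all_tasks={0x9}, previous_task_objects={0x1, 0x9})
assert links == {0x2: 0x9}
```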
## Admitted caveat

I think this comes with one caveat: if we sample very rarely (which may happen in certain circumstances with adaptive sampling), then we could miss certain Tasks and never get rid of their links.
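For concreteness, the scenario might look like this (a hypothetical reconstruction – the original snippet isn't visible on this page – with a child Task so short-lived that it is created and finished entirely between two samples):

```python
import asyncio

async def super_quick():
    return  # finishes immediately

async def parent():
    await asyncio.gather(super_quick())

asyncio.run(parent())
```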
Given that `super_quick` will be… super quick, the following could happen:
1. We sample: `all_tasks = { parent }`; `task_links = {}`; `previous_all_tasks = { parent }`.
2. `asyncio.gather(...)` creates another Task: `all_tasks = { parent, super_quick }`; `task_links = { super_quick -> parent }`.
3. `super_quick` runs super quickly and finishes: `all_tasks = { parent }`; `task_links = { super_quick -> parent }`.
4. We sample again: current `all_tasks = { parent }`; `previous_all_tasks = { parent }` (still from the first sample we took) and `task_links = { super_quick -> parent }`.
5. The cleanup logic concludes `super_quick` is a new Task [because it's not in `previous_all_tasks`] and doesn't remove the link.

⇒ The link will never go away, because we'll keep repeating this (potentially adding/removing other Tasks and links, but this specific one will stay forever).
## Additional change
The PR also fixes an issue with our patch for `as_completed` that would crash if `parent` was `None`. I don't think this can ever happen (since `as_completed` is necessarily called from a Task), but better safe than sorry.
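The shape of that guard, sketched in Python (the actual patch is in dd-trace-py's `as_completed` wrapping; all names below are hypothetical):

```python
def _link_task(child_task, parent):
    # Stand-in for the real linking helper.
    print(f"link {child_task} -> {parent}")

def _on_as_completed(child_task, parent):
    # `parent` should always be the calling Task, but guard against None anyway.
    if parent is None:
        return
    _link_task(child_task, parent)
```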
## Testing

As this change fixes a race condition, it is pretty hard to test, strictly speaking.

`dd-trace-py` already has a test that flaked in the past due to it (and that flaked in Echion as well): `test_asyncio_as_completed`.