
chore(profiling): detect cycles in asyncio#15712

Open
KowalskiThomas wants to merge 2 commits into main from
kowalski/chore-profiling-detect-cycles-in-asyncio

Conversation


@KowalskiThomas KowalskiThomas commented Dec 18, 2025

Description

This PR adds detection of cycles in asyncio Tasks and asyncio coroutine chains. Without it, the profiler can end up in an infinite loop and starve the process of memory in two cases: (1) there actually is a cycle in a Task or coroutine chain; (2) there is no cycle, but we copied memory at two different instants and the state changed between our copies, making the chain appear cyclic to us, so we loop forever (and insert a LOT of items into collections).

Looking at profiles of the Python profiler itself, the additional overhead (building the set, bookkeeping, and membership checks) seems cheap enough: ~14ms per minute on top of ~290ms per minute (a ~5% increase). I think it is fine to merge as-is, especially since it fixes a very real problem that can bring the whole Python process to a halt as the profiler spends all its time trying to satisfy an infinite hunger for more memory.
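For illustration, here is a minimal pure-Python sketch of the kind of guard this PR adds in the native (echion) code: walk an awaiter chain while tracking visited object ids in a set, and stop as soon as an object repeats. The `Fake`/`unwind_coroutine_chain` names are hypothetical and not from this PR; they just demonstrate the visited-set technique on a `cr_await`-style chain.

```python
def unwind_coroutine_chain(coro):
    """Walk a cr_await-style awaiter chain, guarding against cycles.

    Hypothetical sketch: keep a set of ids of objects already visited
    and break out when one is seen twice, so a cyclic (or concurrently
    mutated) chain cannot make the walk loop forever.
    """
    frames = []
    seen = set()  # ids of awaitables already visited
    while coro is not None:
        if id(coro) in seen:
            break  # cycle detected: stop unwinding instead of looping
        seen.add(id(coro))
        frames.append(getattr(coro, "__name__", type(coro).__name__))
        coro = getattr(coro, "cr_await", None)
    return frames


class Fake:
    """Stand-in awaitable exposing a cr_await link, for the demo below."""

    def __init__(self, name):
        self.__name__ = name
        self.cr_await = None


# Build a deliberately cyclic chain: a -> b -> a
a, b = Fake("a"), Fake("b")
a.cr_await, b.cr_await = b, a

# Each node is recorded once; the walk terminates despite the cycle.
print(unwind_coroutine_chain(a))  # → ['a', 'b']
```

The set membership check is O(1) per step, which matches the cheap bookkeeping overhead reported above.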


@github-actions

CODEOWNERS have been resolved as:

ddtrace/internal/datadog/profiling/stack/echion/echion/tasks.h          @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/echion/echion/threads.h        @DataDog/profiling-python

@KowalskiThomas KowalskiThomas added changelog/no-changelog A changelog entry is not required for this PR. Profiling Continuous Profiling labels Dec 19, 2025

pr-commenter bot commented Dec 19, 2025

Performance SLOs

Comparing candidate kowalski/chore-profiling-detect-cycles-in-asyncio (e7dbf7f) with baseline main (c5b5e17)

📈 Performance Regressions (3 suites)
📈 iastaspects - 118/118

✅ add_aspect

Time: ✅ 17.847µs (SLO: <20.000µs 📉 -10.8%) vs baseline: 📈 +19.3%

Memory: ✅ 42.979MB (SLO: <43.250MB 🟡 -0.6%) vs baseline: +4.9%


✅ add_inplace_aspect

Time: ✅ 14.852µs (SLO: <20.000µs 📉 -25.7%) vs baseline: ~same

Memory: ✅ 43.057MB (SLO: <43.250MB 🟡 -0.4%) vs baseline: +5.0%


✅ add_inplace_noaspect

Time: ✅ 0.338µs (SLO: <10.000µs 📉 -96.6%) vs baseline: -0.5%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +4.8%


✅ add_noaspect

Time: ✅ 0.545µs (SLO: <10.000µs 📉 -94.6%) vs baseline: -0.2%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.8%


✅ bytearray_aspect

Time: ✅ 17.991µs (SLO: <30.000µs 📉 -40.0%) vs baseline: +0.1%

Memory: ✅ 42.959MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ bytearray_extend_aspect

Time: ✅ 23.796µs (SLO: <30.000µs 📉 -20.7%) vs baseline: -0.4%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.7%


✅ bytearray_extend_noaspect

Time: ✅ 2.771µs (SLO: <10.000µs 📉 -72.3%) vs baseline: +1.0%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.0%


✅ bytearray_noaspect

Time: ✅ 1.486µs (SLO: <10.000µs 📉 -85.1%) vs baseline: +0.9%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ bytes_aspect

Time: ✅ 16.614µs (SLO: <20.000µs 📉 -16.9%) vs baseline: -0.3%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.8%


✅ bytes_noaspect

Time: ✅ 1.426µs (SLO: <10.000µs 📉 -85.7%) vs baseline: ~same

Memory: ✅ 42.939MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.9%


✅ bytesio_aspect

Time: ✅ 55.283µs (SLO: <70.000µs 📉 -21.0%) vs baseline: -0.5%

Memory: ✅ 42.939MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.6%


✅ bytesio_noaspect

Time: ✅ 3.295µs (SLO: <10.000µs 📉 -67.0%) vs baseline: +0.3%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ capitalize_aspect

Time: ✅ 14.750µs (SLO: <20.000µs 📉 -26.2%) vs baseline: +0.3%

Memory: ✅ 42.959MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ capitalize_noaspect

Time: ✅ 2.619µs (SLO: <10.000µs 📉 -73.8%) vs baseline: +1.0%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.5%


✅ casefold_aspect

Time: ✅ 14.641µs (SLO: <20.000µs 📉 -26.8%) vs baseline: -0.2%

Memory: ✅ 42.959MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ casefold_noaspect

Time: ✅ 3.148µs (SLO: <10.000µs 📉 -68.5%) vs baseline: -0.9%

Memory: ✅ 43.057MB (SLO: <43.500MB 🟡 -1.0%) vs baseline: +5.1%


✅ decode_aspect

Time: ✅ 15.659µs (SLO: <30.000µs 📉 -47.8%) vs baseline: +0.4%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ decode_noaspect

Time: ✅ 1.603µs (SLO: <10.000µs 📉 -84.0%) vs baseline: ~same

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.0%


✅ encode_aspect

Time: ✅ 18.180µs (SLO: <30.000µs 📉 -39.4%) vs baseline: 📈 +22.0%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.8%


✅ encode_noaspect

Time: ✅ 1.496µs (SLO: <10.000µs 📉 -85.0%) vs baseline: -1.2%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.8%


✅ format_aspect

Time: ✅ 171.260µs (SLO: <200.000µs 📉 -14.4%) vs baseline: +0.1%

Memory: ✅ 43.195MB (SLO: <43.250MB 🟡 -0.1%) vs baseline: +4.9%


✅ format_map_aspect

Time: ✅ 191.277µs (SLO: <200.000µs -4.4%) vs baseline: ~same

Memory: ✅ 43.214MB (SLO: <43.500MB 🟡 -0.7%) vs baseline: +5.1%


✅ format_map_noaspect

Time: ✅ 3.818µs (SLO: <10.000µs 📉 -61.8%) vs baseline: -0.3%

Memory: ✅ 43.018MB (SLO: <43.250MB 🟡 -0.5%) vs baseline: +5.0%


✅ format_noaspect

Time: ✅ 3.150µs (SLO: <10.000µs 📉 -68.5%) vs baseline: -0.2%

Memory: ✅ 42.939MB (SLO: <43.250MB 🟡 -0.7%) vs baseline: +4.8%


✅ index_aspect

Time: ✅ 15.286µs (SLO: <20.000µs 📉 -23.6%) vs baseline: -0.5%

Memory: ✅ 42.998MB (SLO: <43.250MB 🟡 -0.6%) vs baseline: +5.0%


✅ index_noaspect

Time: ✅ 0.465µs (SLO: <10.000µs 📉 -95.4%) vs baseline: +0.5%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.7%


✅ join_aspect

Time: ✅ 17.065µs (SLO: <20.000µs 📉 -14.7%) vs baseline: ~same

Memory: ✅ 42.959MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ join_noaspect

Time: ✅ 1.548µs (SLO: <10.000µs 📉 -84.5%) vs baseline: -1.2%

Memory: ✅ 42.998MB (SLO: <43.250MB 🟡 -0.6%) vs baseline: +4.8%


✅ ljust_aspect

Time: ✅ 20.759µs (SLO: <30.000µs 📉 -30.8%) vs baseline: +0.4%

Memory: ✅ 43.096MB (SLO: <43.250MB 🟡 -0.4%) vs baseline: +5.0%


✅ ljust_noaspect

Time: ✅ 2.694µs (SLO: <10.000µs 📉 -73.1%) vs baseline: -2.0%

Memory: ✅ 42.959MB (SLO: <43.250MB 🟡 -0.7%) vs baseline: +4.8%


✅ lower_aspect

Time: ✅ 17.967µs (SLO: <30.000µs 📉 -40.1%) vs baseline: +0.1%

Memory: ✅ 42.959MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +5.0%


✅ lower_noaspect

Time: ✅ 2.415µs (SLO: <10.000µs 📉 -75.9%) vs baseline: -0.9%

Memory: ✅ 42.959MB (SLO: <43.250MB 🟡 -0.7%) vs baseline: +4.6%


✅ lstrip_aspect

Time: ✅ 17.681µs (SLO: <30.000µs 📉 -41.1%) vs baseline: +0.9%

Memory: ✅ 42.959MB (SLO: <43.250MB 🟡 -0.7%) vs baseline: +5.0%


✅ lstrip_noaspect

Time: ✅ 1.874µs (SLO: <10.000µs 📉 -81.3%) vs baseline: +1.1%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.6%


✅ modulo_aspect

Time: ✅ 166.700µs (SLO: <200.000µs 📉 -16.7%) vs baseline: +0.3%

Memory: ✅ 43.175MB (SLO: <43.500MB 🟡 -0.7%) vs baseline: +5.0%


✅ modulo_aspect_for_bytearray_bytearray

Time: ✅ 174.691µs (SLO: <200.000µs 📉 -12.7%) vs baseline: ~same

Memory: ✅ 43.175MB (SLO: <43.500MB 🟡 -0.7%) vs baseline: +5.1%


✅ modulo_aspect_for_bytes

Time: ✅ 168.588µs (SLO: <200.000µs 📉 -15.7%) vs baseline: ~same

Memory: ✅ 43.057MB (SLO: <43.500MB 🟡 -1.0%) vs baseline: +4.6%


✅ modulo_aspect_for_bytes_bytearray

Time: ✅ 172.353µs (SLO: <200.000µs 📉 -13.8%) vs baseline: +0.1%

Memory: ✅ 43.096MB (SLO: <43.500MB 🟡 -0.9%) vs baseline: +4.8%


✅ modulo_noaspect

Time: ✅ 3.726µs (SLO: <10.000µs 📉 -62.7%) vs baseline: +1.0%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +5.0%


✅ replace_aspect

Time: ✅ 211.709µs (SLO: <300.000µs 📉 -29.4%) vs baseline: ~same

Memory: ✅ 43.116MB (SLO: <44.000MB -2.0%) vs baseline: +4.7%


✅ replace_noaspect

Time: ✅ 2.903µs (SLO: <10.000µs 📉 -71.0%) vs baseline: -0.5%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.1%


✅ repr_aspect

Time: ✅ 1.422µs (SLO: <10.000µs 📉 -85.8%) vs baseline: ~same

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.7%


✅ repr_noaspect

Time: ✅ 0.524µs (SLO: <10.000µs 📉 -94.8%) vs baseline: ~same

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +5.0%


✅ rstrip_aspect

Time: ✅ 19.081µs (SLO: <30.000µs 📉 -36.4%) vs baseline: -0.3%

Memory: ✅ 43.037MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.0%


✅ rstrip_noaspect

Time: ✅ 2.062µs (SLO: <10.000µs 📉 -79.4%) vs baseline: +6.3%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.8%


✅ slice_aspect

Time: ✅ 15.894µs (SLO: <20.000µs 📉 -20.5%) vs baseline: ~same

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +5.0%


✅ slice_noaspect

Time: ✅ 0.603µs (SLO: <10.000µs 📉 -94.0%) vs baseline: +0.5%

Memory: ✅ 43.037MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.0%


✅ stringio_aspect

Time: ✅ 54.141µs (SLO: <80.000µs 📉 -32.3%) vs baseline: +0.2%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ stringio_noaspect

Time: ✅ 3.648µs (SLO: <10.000µs 📉 -63.5%) vs baseline: -0.2%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.1%


✅ strip_aspect

Time: ✅ 17.552µs (SLO: <20.000µs 📉 -12.2%) vs baseline: -0.7%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.8%


✅ strip_noaspect

Time: ✅ 1.866µs (SLO: <10.000µs 📉 -81.3%) vs baseline: -0.3%

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ swapcase_aspect

Time: ✅ 18.525µs (SLO: <30.000µs 📉 -38.3%) vs baseline: ~same

Memory: ✅ 42.998MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.9%


✅ swapcase_noaspect

Time: ✅ 2.801µs (SLO: <10.000µs 📉 -72.0%) vs baseline: -0.1%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +5.0%


✅ title_aspect

Time: ✅ 22.182µs (SLO: <30.000µs 📉 -26.1%) vs baseline: 📈 +21.1%

Memory: ✅ 42.979MB (SLO: <43.000MB 🟡 ~same) vs baseline: +4.9%


✅ title_noaspect

Time: ✅ 2.682µs (SLO: <10.000µs 📉 -73.2%) vs baseline: -0.2%

Memory: ✅ 42.880MB (SLO: <43.500MB 🟡 -1.4%) vs baseline: +4.7%


✅ translate_aspect

Time: ✅ 20.498µs (SLO: <30.000µs 📉 -31.7%) vs baseline: +0.3%

Memory: ✅ 42.979MB (SLO: <43.500MB 🟡 -1.2%) vs baseline: +4.7%


✅ translate_noaspect

Time: ✅ 4.327µs (SLO: <10.000µs 📉 -56.7%) vs baseline: -0.5%

Memory: ✅ 42.939MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +5.0%


✅ upper_aspect

Time: ✅ 17.949µs (SLO: <30.000µs 📉 -40.2%) vs baseline: -0.3%

Memory: ✅ 43.018MB (SLO: <43.500MB 🟡 -1.1%) vs baseline: +4.8%


✅ upper_noaspect

Time: ✅ 2.448µs (SLO: <10.000µs 📉 -75.5%) vs baseline: +0.8%

Memory: ✅ 42.920MB (SLO: <43.500MB 🟡 -1.3%) vs baseline: +4.7%


📈 iastaspectsospath - 24/24

✅ ospathbasename_aspect

Time: ✅ 5.169µs (SLO: <10.000µs 📉 -48.3%) vs baseline: 📈 +22.2%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.8%


✅ ospathbasename_noaspect

Time: ✅ 4.271µs (SLO: <10.000µs 📉 -57.3%) vs baseline: -0.7%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +5.1%


✅ ospathjoin_aspect

Time: ✅ 6.267µs (SLO: <10.000µs 📉 -37.3%) vs baseline: -0.5%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.7%


✅ ospathjoin_noaspect

Time: ✅ 6.265µs (SLO: <10.000µs 📉 -37.3%) vs baseline: -0.8%

Memory: ✅ 41.465MB (SLO: <43.500MB -4.7%) vs baseline: +5.1%


✅ ospathnormcase_aspect

Time: ✅ 3.551µs (SLO: <10.000µs 📉 -64.5%) vs baseline: +0.2%

Memory: ✅ 41.366MB (SLO: <43.500MB -4.9%) vs baseline: +4.6%


✅ ospathnormcase_noaspect

Time: ✅ 3.614µs (SLO: <10.000µs 📉 -63.9%) vs baseline: +0.1%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.9%


✅ ospathsplit_aspect

Time: ✅ 4.902µs (SLO: <10.000µs 📉 -51.0%) vs baseline: +0.2%

Memory: ✅ 41.465MB (SLO: <43.500MB -4.7%) vs baseline: +5.0%


✅ ospathsplit_noaspect

Time: ✅ 4.976µs (SLO: <10.000µs 📉 -50.2%) vs baseline: +0.3%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +4.9%


✅ ospathsplitdrive_aspect

Time: ✅ 3.731µs (SLO: <10.000µs 📉 -62.7%) vs baseline: -0.5%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.7%


✅ ospathsplitdrive_noaspect

Time: ✅ 0.749µs (SLO: <10.000µs 📉 -92.5%) vs baseline: -0.4%

Memory: ✅ 41.504MB (SLO: <43.500MB -4.6%) vs baseline: +5.0%


✅ ospathsplitext_aspect

Time: ✅ 4.642µs (SLO: <10.000µs 📉 -53.6%) vs baseline: +1.3%

Memory: ✅ 41.465MB (SLO: <43.500MB -4.7%) vs baseline: +4.9%


✅ ospathsplitext_noaspect

Time: ✅ 4.623µs (SLO: <10.000µs 📉 -53.8%) vs baseline: -0.1%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.7%


📈 telemetryaddmetric - 30/30

✅ 1-count-metric-1-times

Time: ✅ 3.389µs (SLO: <20.000µs 📉 -83.1%) vs baseline: 📈 +14.0%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +5.0%


✅ 1-count-metrics-100-times

Time: ✅ 203.198µs (SLO: <220.000µs -7.6%) vs baseline: +2.4%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +4.8%


✅ 1-distribution-metric-1-times

Time: ✅ 3.336µs (SLO: <20.000µs 📉 -83.3%) vs baseline: +0.7%

Memory: ✅ 34.721MB (SLO: <35.500MB -2.2%) vs baseline: +4.6%


✅ 1-distribution-metrics-100-times

Time: ✅ 215.777µs (SLO: <230.000µs -6.2%) vs baseline: +2.0%

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.9%


✅ 1-gauge-metric-1-times

Time: ✅ 2.180µs (SLO: <20.000µs 📉 -89.1%) vs baseline: -0.3%

Memory: ✅ 34.859MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.9%


✅ 1-gauge-metrics-100-times

Time: ✅ 135.787µs (SLO: <150.000µs -9.5%) vs baseline: -0.7%

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +5.0%


✅ 1-rate-metric-1-times

Time: ✅ 3.160µs (SLO: <20.000µs 📉 -84.2%) vs baseline: +0.7%

Memory: ✅ 34.859MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.9%


✅ 1-rate-metrics-100-times

Time: ✅ 214.379µs (SLO: <250.000µs 📉 -14.2%) vs baseline: +1.0%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +5.1%


✅ 100-count-metrics-100-times

Time: ✅ 20.154ms (SLO: <22.000ms -8.4%) vs baseline: +1.1%

Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.9%


✅ 100-distribution-metrics-100-times

Time: ✅ 2.215ms (SLO: <2.550ms 📉 -13.1%) vs baseline: -2.0%

Memory: ✅ 34.760MB (SLO: <35.500MB -2.1%) vs baseline: +4.3%


✅ 100-gauge-metrics-100-times

Time: ✅ 1.405ms (SLO: <1.550ms -9.4%) vs baseline: +0.6%

Memory: ✅ 34.780MB (SLO: <35.500MB -2.0%) vs baseline: +4.7%


✅ 100-rate-metrics-100-times

Time: ✅ 2.182ms (SLO: <2.550ms 📉 -14.4%) vs baseline: -0.3%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +5.0%


✅ flush-1-metric

Time: ✅ 4.610µs (SLO: <20.000µs 📉 -76.9%) vs baseline: ~same

Memory: ✅ 35.212MB (SLO: <35.500MB 🟡 -0.8%) vs baseline: +5.2%


✅ flush-100-metrics

Time: ✅ 174.269µs (SLO: <250.000µs 📉 -30.3%) vs baseline: +0.3%

Memory: ✅ 35.232MB (SLO: <35.500MB 🟡 -0.8%) vs baseline: +5.0%


✅ flush-1000-metrics

Time: ✅ 2.193ms (SLO: <2.500ms 📉 -12.3%) vs baseline: +0.6%

Memory: ✅ 35.920MB (SLO: <36.500MB 🟡 -1.6%) vs baseline: +4.6%

🟡 Near SLO Breach (14 suites)
🟡 coreapiscenario - 10/10 (1 unstable)

⚠️ context_with_data_listeners

Time: ⚠️ 13.290µs (SLO: <20.000µs 📉 -33.5%) vs baseline: +0.4%

Memory: ✅ 34.996MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +5.0%


✅ context_with_data_no_listeners

Time: ✅ 3.269µs (SLO: <10.000µs 📉 -67.3%) vs baseline: ~same

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +5.0%


✅ get_item_exists

Time: ✅ 0.581µs (SLO: <10.000µs 📉 -94.2%) vs baseline: +0.6%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +5.0%


✅ get_item_missing

Time: ✅ 0.639µs (SLO: <10.000µs 📉 -93.6%) vs baseline: +0.8%

Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.5%


✅ set_item

Time: ✅ 24.619µs (SLO: <30.000µs 📉 -17.9%) vs baseline: +1.1%

Memory: ✅ 34.977MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.9%


🟡 djangosimple - 30/30

✅ appsec

Time: ✅ 19.525ms (SLO: <22.300ms 📉 -12.4%) vs baseline: -0.2%

Memory: ✅ 68.312MB (SLO: <70.500MB -3.1%) vs baseline: +4.9%


✅ exception-replay-enabled

Time: ✅ 1.358ms (SLO: <1.450ms -6.4%) vs baseline: ~same

Memory: ✅ 66.175MB (SLO: <67.500MB 🟡 -2.0%) vs baseline: +4.7%


✅ iast

Time: ✅ 19.549ms (SLO: <22.250ms 📉 -12.1%) vs baseline: -0.6%

Memory: ✅ 68.370MB (SLO: <70.000MB -2.3%) vs baseline: +4.9%


✅ profiler

Time: ✅ 14.757ms (SLO: <16.550ms 📉 -10.8%) vs baseline: +0.4%

Memory: ✅ 56.058MB (SLO: <57.500MB -2.5%) vs baseline: +4.8%


✅ resource-renaming

Time: ✅ 19.488ms (SLO: <21.750ms 📉 -10.4%) vs baseline: -0.2%

Memory: ✅ 68.341MB (SLO: <70.500MB -3.1%) vs baseline: +5.1%


✅ span-code-origin

Time: ✅ 19.875ms (SLO: <28.200ms 📉 -29.5%) vs baseline: +0.9%

Memory: ✅ 68.311MB (SLO: <71.000MB -3.8%) vs baseline: +4.9%


✅ tracer

Time: ✅ 19.640ms (SLO: <21.750ms -9.7%) vs baseline: +0.2%

Memory: ✅ 68.279MB (SLO: <70.000MB -2.5%) vs baseline: +4.9%


✅ tracer-and-profiler

Time: ✅ 20.908ms (SLO: <23.500ms 📉 -11.0%) vs baseline: +0.2%

Memory: ✅ 69.461MB (SLO: <71.000MB -2.2%) vs baseline: +4.9%


✅ tracer-dont-create-db-spans

Time: ✅ 19.618ms (SLO: <21.500ms -8.8%) vs baseline: -0.5%

Memory: ✅ 68.299MB (SLO: <70.000MB -2.4%) vs baseline: +5.0%


✅ tracer-minimal

Time: ✅ 16.804ms (SLO: <17.500ms -4.0%) vs baseline: +0.1%

Memory: ✅ 68.111MB (SLO: <70.000MB -2.7%) vs baseline: +4.8%


✅ tracer-native

Time: ✅ 19.468ms (SLO: <21.750ms 📉 -10.5%) vs baseline: -0.2%

Memory: ✅ 68.340MB (SLO: <72.500MB -5.7%) vs baseline: +5.0%


✅ tracer-no-caches

Time: ✅ 17.616ms (SLO: <19.650ms 📉 -10.3%) vs baseline: -0.2%

Memory: ✅ 68.283MB (SLO: <70.000MB -2.5%) vs baseline: +5.0%


✅ tracer-no-databases

Time: ✅ 19.103ms (SLO: <20.100ms -5.0%) vs baseline: -0.1%

Memory: ✅ 68.075MB (SLO: <70.000MB -2.8%) vs baseline: +4.7%


✅ tracer-no-middleware

Time: ✅ 19.291ms (SLO: <21.500ms 📉 -10.3%) vs baseline: -0.2%

Memory: ✅ 68.253MB (SLO: <70.000MB -2.5%) vs baseline: +5.0%


✅ tracer-no-templates

Time: ✅ 19.608ms (SLO: <22.000ms 📉 -10.9%) vs baseline: +1.3%

Memory: ✅ 68.263MB (SLO: <70.500MB -3.2%) vs baseline: +4.9%


🟡 errortrackingdjangosimple - 6/6

✅ errortracking-enabled-all

Time: ✅ 16.258ms (SLO: <19.850ms 📉 -18.1%) vs baseline: -0.1%

Memory: ✅ 69.933MB (SLO: <70.000MB 🟡 ~same) vs baseline: +5.0%


✅ errortracking-enabled-user

Time: ✅ 16.318ms (SLO: <19.400ms 📉 -15.9%) vs baseline: +0.2%

Memory: ✅ 69.913MB (SLO: <70.000MB 🟡 -0.1%) vs baseline: +4.7%


✅ tracer-enabled

Time: ✅ 16.308ms (SLO: <19.450ms 📉 -16.2%) vs baseline: +0.1%

Memory: ✅ 69.851MB (SLO: <70.000MB 🟡 -0.2%) vs baseline: +4.7%


🟡 errortrackingflasksqli - 6/6

✅ errortracking-enabled-all

Time: ✅ 2.066ms (SLO: <2.300ms 📉 -10.2%) vs baseline: ~same

Memory: ✅ 55.935MB (SLO: <56.500MB 🟡 -1.0%) vs baseline: +5.0%


✅ errortracking-enabled-user

Time: ✅ 2.079ms (SLO: <2.250ms -7.6%) vs baseline: +0.4%

Memory: ✅ 55.876MB (SLO: <56.500MB 🟡 -1.1%) vs baseline: +4.8%


✅ tracer-enabled

Time: ✅ 2.072ms (SLO: <2.300ms -9.9%) vs baseline: +0.1%

Memory: ✅ 55.915MB (SLO: <56.500MB 🟡 -1.0%) vs baseline: +4.8%


🟡 flasksimple - 18/18

✅ appsec-get

Time: ✅ 3.379ms (SLO: <4.750ms 📉 -28.9%) vs baseline: ~same

Memory: ✅ 55.915MB (SLO: <66.500MB 📉 -15.9%) vs baseline: +5.0%


✅ appsec-post

Time: ✅ 2.859ms (SLO: <6.750ms 📉 -57.6%) vs baseline: +0.2%

Memory: ✅ 56.030MB (SLO: <66.500MB 📉 -15.7%) vs baseline: +4.9%


✅ appsec-telemetry

Time: ✅ 3.400ms (SLO: <4.750ms 📉 -28.4%) vs baseline: +1.0%

Memory: ✅ 55.894MB (SLO: <66.500MB 📉 -15.9%) vs baseline: +4.9%


✅ debugger

Time: ✅ 1.871ms (SLO: <2.000ms -6.5%) vs baseline: +0.1%

Memory: ✅ 47.819MB (SLO: <49.500MB -3.4%) vs baseline: +4.9%


✅ iast-get

Time: ✅ 1.858ms (SLO: <2.000ms -7.1%) vs baseline: ~same

Memory: ✅ 44.851MB (SLO: <49.000MB -8.5%) vs baseline: +4.9%


✅ profiler

Time: ✅ 1.856ms (SLO: <2.100ms 📉 -11.6%) vs baseline: ~same

Memory: ✅ 48.744MB (SLO: <50.000MB -2.5%) vs baseline: +4.9%


✅ resource-renaming

Time: ✅ 3.356ms (SLO: <3.650ms -8.1%) vs baseline: -0.4%

Memory: ✅ 55.931MB (SLO: <56.000MB 🟡 -0.1%) vs baseline: +4.9%


✅ tracer

Time: ✅ 3.370ms (SLO: <3.650ms -7.7%) vs baseline: -0.1%

Memory: ✅ 55.955MB (SLO: <56.500MB 🟡 -1.0%) vs baseline: +4.8%


✅ tracer-native

Time: ✅ 3.373ms (SLO: <3.650ms -7.6%) vs baseline: +0.4%

Memory: ✅ 55.910MB (SLO: <60.000MB -6.8%) vs baseline: +4.7%


🟡 flasksqli - 6/6

✅ appsec-enabled

Time: ✅ 2.062ms (SLO: <4.200ms 📉 -50.9%) vs baseline: +0.2%

Memory: ✅ 55.915MB (SLO: <66.000MB 📉 -15.3%) vs baseline: +4.9%


✅ iast-enabled

Time: ✅ 2.068ms (SLO: <2.800ms 📉 -26.2%) vs baseline: ~same

Memory: ✅ 55.896MB (SLO: <62.500MB 📉 -10.6%) vs baseline: +4.8%


✅ tracer-enabled

Time: ✅ 2.061ms (SLO: <2.250ms -8.4%) vs baseline: +0.2%

Memory: ✅ 55.856MB (SLO: <56.500MB 🟡 -1.1%) vs baseline: +4.8%


🟡 httppropagationextract - 60/60

✅ all_styles_all_headers

Time: ✅ 82.099µs (SLO: <100.000µs 📉 -17.9%) vs baseline: -0.1%

Memory: ✅ 35.036MB (SLO: <35.500MB 🟡 -1.3%) vs baseline: +4.2%


✅ b3_headers

Time: ✅ 14.304µs (SLO: <20.000µs 📉 -28.5%) vs baseline: -0.4%

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.8%


✅ b3_single_headers

Time: ✅ 13.397µs (SLO: <20.000µs 📉 -33.0%) vs baseline: -0.5%

Memory: ✅ 34.859MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.6%


✅ datadog_tracecontext_tracestate_not_propagated_on_trace_id_no_match

Time: ✅ 63.940µs (SLO: <80.000µs 📉 -20.1%) vs baseline: ~same

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +4.2%


✅ datadog_tracecontext_tracestate_propagated_on_trace_id_match

Time: ✅ 69.671µs (SLO: <80.000µs 📉 -12.9%) vs baseline: +5.1%

Memory: ✅ 35.016MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +4.1%


✅ empty_headers

Time: ✅ 1.620µs (SLO: <10.000µs 📉 -83.8%) vs baseline: +0.8%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.7%


✅ full_t_id_datadog_headers

Time: ✅ 22.711µs (SLO: <30.000µs 📉 -24.3%) vs baseline: ~same

Memory: ✅ 34.977MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.7%


✅ invalid_priority_header

Time: ✅ 6.520µs (SLO: <10.000µs 📉 -34.8%) vs baseline: -0.2%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.8%


✅ invalid_span_id_header

Time: ✅ 6.494µs (SLO: <10.000µs 📉 -35.1%) vs baseline: -0.1%

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.7%


✅ invalid_tags_header

Time: ✅ 6.497µs (SLO: <10.000µs 📉 -35.0%) vs baseline: +0.3%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.9%


✅ invalid_trace_id_header

Time: ✅ 6.512µs (SLO: <10.000µs 📉 -34.9%) vs baseline: +0.3%

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.6%


✅ large_header_no_matches

Time: ✅ 27.553µs (SLO: <30.000µs -8.2%) vs baseline: ~same

Memory: ✅ 34.996MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +4.7%


✅ large_valid_headers_all

Time: ✅ 28.735µs (SLO: <40.000µs 📉 -28.2%) vs baseline: +0.4%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.9%


✅ medium_header_no_matches

Time: ✅ 9.808µs (SLO: <20.000µs 📉 -51.0%) vs baseline: ~same

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +4.6%


✅ medium_valid_headers_all

Time: ✅ 11.180µs (SLO: <20.000µs 📉 -44.1%) vs baseline: -0.6%

Memory: ✅ 34.977MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.9%


✅ none_propagation_style

Time: ✅ 1.721µs (SLO: <10.000µs 📉 -82.8%) vs baseline: +0.5%

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +5.0%


✅ tracecontext_headers

Time: ✅ 34.679µs (SLO: <40.000µs 📉 -13.3%) vs baseline: -0.3%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.9%


✅ valid_headers_all

Time: ✅ 6.504µs (SLO: <10.000µs 📉 -35.0%) vs baseline: +0.1%

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.8%


✅ valid_headers_basic

Time: ✅ 6.073µs (SLO: <10.000µs 📉 -39.3%) vs baseline: +0.4%

Memory: ✅ 35.036MB (SLO: <35.500MB 🟡 -1.3%) vs baseline: +5.2%


✅ wsgi_empty_headers

Time: ✅ 1.601µs (SLO: <10.000µs 📉 -84.0%) vs baseline: ~same

Memory: ✅ 35.036MB (SLO: <35.500MB 🟡 -1.3%) vs baseline: +5.2%


✅ wsgi_invalid_priority_header

Time: ✅ 6.601µs (SLO: <10.000µs 📉 -34.0%) vs baseline: +1.0%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +5.0%


✅ wsgi_invalid_span_id_header

Time: ✅ 1.597µs (SLO: <10.000µs 📉 -84.0%) vs baseline: -0.2%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +4.9%


✅ wsgi_invalid_tags_header

Time: ✅ 6.525µs (SLO: <10.000µs 📉 -34.7%) vs baseline: -0.4%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +5.0%


✅ wsgi_invalid_trace_id_header

Time: ✅ 6.686µs (SLO: <10.000µs 📉 -33.1%) vs baseline: +1.9%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.8%


✅ wsgi_large_header_no_matches

Time: ✅ 28.621µs (SLO: <40.000µs 📉 -28.4%) vs baseline: -0.6%

Memory: ✅ 35.016MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +4.6%


✅ wsgi_large_valid_headers_all

Time: ✅ 29.785µs (SLO: <40.000µs 📉 -25.5%) vs baseline: ~same

Memory: ✅ 35.016MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +4.7%


✅ wsgi_medium_header_no_matches

Time: ✅ 10.053µs (SLO: <20.000µs 📉 -49.7%) vs baseline: -0.8%

Memory: ✅ 34.957MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +4.8%


✅ wsgi_medium_valid_headers_all

Time: ✅ 11.566µs (SLO: <20.000µs 📉 -42.2%) vs baseline: -0.5%

Memory: ✅ 34.996MB (SLO: <35.500MB 🟡 -1.4%) vs baseline: +4.9%


✅ wsgi_valid_headers_all

Time: ✅ 6.556µs (SLO: <10.000µs 📉 -34.4%) vs baseline: -0.2%

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.6%


✅ wsgi_valid_headers_basic

Time: ✅ 6.113µs (SLO: <10.000µs 📉 -38.9%) vs baseline: +0.2%

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.6%


🟡 httppropagationinject - 16/16

✅ ids_only

Time: ✅ 22.106µs (SLO: <30.000µs 📉 -26.3%) vs baseline: +5.8%

Memory: ✅ 34.977MB (SLO: <35.500MB 🟡 -1.5%) vs baseline: +5.1%


✅ with_all

Time: ✅ 27.756µs (SLO: <40.000µs 📉 -30.6%) vs baseline: ~same

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.9%


✅ with_dd_origin

Time: ✅ 24.495µs (SLO: <30.000µs 📉 -18.4%) vs baseline: ~same

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +5.1%


✅ with_priority_and_origin

Time: ✅ 23.949µs (SLO: <40.000µs 📉 -40.1%) vs baseline: -0.2%

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +4.9%


✅ with_sampling_priority

Time: ✅ 20.909µs (SLO: <30.000µs 📉 -30.3%) vs baseline: -0.3%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +4.9%


✅ with_tags

Time: ✅ 25.902µs (SLO: <40.000µs 📉 -35.2%) vs baseline: +0.3%

Memory: ✅ 35.055MB (SLO: <35.500MB 🟡 -1.3%) vs baseline: +5.4%


✅ with_tags_invalid

Time: ✅ 27.471µs (SLO: <40.000µs 📉 -31.3%) vs baseline: ~same

Memory: ✅ 34.937MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.7%


✅ with_tags_max_size

Time: ✅ 26.546µs (SLO: <40.000µs 📉 -33.6%) vs baseline: +0.4%

Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +5.3%


🟡 otelspan - 22/22

✅ add-event

Time: ✅ 40.327ms (SLO: <47.150ms 📉 -14.5%) vs baseline: ~same

Memory: ✅ 39.647MB (SLO: <47.000MB 📉 -15.6%) vs baseline: +5.5%


✅ add-metrics

Time: ✅ 262.104ms (SLO: <344.800ms 📉 -24.0%) vs baseline: +1.1%

Memory: ✅ 43.818MB (SLO: <47.500MB -7.8%) vs baseline: +5.0%


✅ add-tags

Time: ✅ 316.114ms (SLO: <321.000ms 🟡 -1.5%) vs baseline: -0.8%

Memory: ✅ 43.829MB (SLO: <47.500MB -7.7%) vs baseline: +5.1%


✅ get-context

Time: ✅ 80.451ms (SLO: <92.350ms 📉 -12.9%) vs baseline: ~same

Memory: ✅ 39.933MB (SLO: <46.500MB 📉 -14.1%) vs baseline: +5.1%


✅ is-recording

Time: ✅ 37.901ms (SLO: <44.500ms 📉 -14.8%) vs baseline: ~same

Memory: ✅ 39.501MB (SLO: <47.500MB 📉 -16.8%) vs baseline: +4.9%


✅ record-exception

Time: ✅ 58.866ms (SLO: <67.650ms 📉 -13.0%) vs baseline: -0.4%

Memory: ✅ 40.029MB (SLO: <47.000MB 📉 -14.8%) vs baseline: +4.9%


✅ set-status

Time: ✅ 44.299ms (SLO: <50.400ms 📉 -12.1%) vs baseline: -0.1%

Memory: ✅ 39.425MB (SLO: <47.000MB 📉 -16.1%) vs baseline: +4.7%


✅ start

Time: ✅ 37.820ms (SLO: <43.450ms 📉 -13.0%) vs baseline: +2.4%

Memory: ✅ 39.399MB (SLO: <47.000MB 📉 -16.2%) vs baseline: +4.5%


✅ start-finish

Time: ✅ 82.739ms (SLO: <88.000ms -6.0%) vs baseline: ~same

Memory: ✅ 37.434MB (SLO: <46.500MB 📉 -19.5%) vs baseline: +5.0%


✅ start-finish-telemetry

Time: ✅ 84.690ms (SLO: <89.000ms -4.8%) vs baseline: +0.4%

Memory: ✅ 37.356MB (SLO: <46.500MB 📉 -19.7%) vs baseline: +4.7%


✅ update-name

Time: ✅ 38.808ms (SLO: <45.150ms 📉 -14.0%) vs baseline: +0.4%

Memory: ✅ 39.569MB (SLO: <47.000MB 📉 -15.8%) vs baseline: +4.5%


🟡 ratelimiter - 12/12

✅ defaults

Time: ✅ 2.377µs (SLO: <10.000µs 📉 -76.2%) vs baseline: +1.7%

Memory: ✅ 35.173MB (SLO: <35.500MB 🟡 -0.9%) vs baseline: +5.1%


✅ high_rate_limit

Time: ✅ 2.414µs (SLO: <10.000µs 📉 -75.9%) vs baseline: -0.1%

Memory: ✅ 35.095MB (SLO: <35.500MB 🟡 -1.1%) vs baseline: +4.7%


✅ long_window

Time: ✅ 2.364µs (SLO: <10.000µs 📉 -76.4%) vs baseline: +0.2%

Memory: ✅ 35.055MB (SLO: <35.500MB 🟡 -1.3%) vs baseline: +4.5%


✅ low_rate_limit

Time: ✅ 2.371µs (SLO: <10.000µs 📉 -76.3%) vs baseline: +0.8%

Memory: ✅ 35.134MB (SLO: <35.500MB 🟡 -1.0%) vs baseline: +5.0%


✅ no_rate_limit

Time: ✅ 0.830µs (SLO: <10.000µs 📉 -91.7%) vs baseline: +0.9%

Memory: ✅ 35.114MB (SLO: <35.500MB 🟡 -1.1%) vs baseline: +4.9%


✅ short_window

Time: ✅ 2.494µs (SLO: <10.000µs 📉 -75.1%) vs baseline: +0.8%

Memory: ✅ 35.212MB (SLO: <35.500MB 🟡 -0.8%) vs baseline: +4.9%


🟡 recursivecomputation - 8/8

✅ deep

Time: ✅ 308.795ms (SLO: <320.950ms -3.8%) vs baseline: -0.1%

Memory: ✅ 36.019MB (SLO: <36.500MB 🟡 -1.3%) vs baseline: +4.7%


✅ deep-profiled

Time: ✅ 315.174ms (SLO: <359.150ms 📉 -12.2%) vs baseline: ~same

Memory: ✅ 39.872MB (SLO: <40.500MB 🟡 -1.6%) vs baseline: +4.7%


✅ medium

Time: ✅ 6.974ms (SLO: <7.400ms -5.8%) vs baseline: -0.5%

Memory: ✅ 34.839MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +5.0%


✅ shallow

Time: ✅ 0.945ms (SLO: <1.050ms -10.0%) vs baseline: +0.9%

Memory: ✅ 34.760MB (SLO: <35.500MB -2.1%) vs baseline: +4.6%


🟡 sethttpmeta - 32/32

✅ all-disabled

Time: ✅ 10.544µs (SLO: <20.000µs 📉 -47.3%) vs baseline: -0.3%

Memory: ✅ 35.291MB (SLO: <36.000MB 🟡 -2.0%) vs baseline: +4.5%


✅ all-enabled

Time: ✅ 40.958µs (SLO: <50.000µs 📉 -18.1%) vs baseline: +1.8%

Memory: ✅ 35.448MB (SLO: <36.000MB 🟡 -1.5%) vs baseline: +5.0%


✅ collectipvariant_exists

Time: ✅ 41.135µs (SLO: <50.000µs 📉 -17.7%) vs baseline: +0.2%

Memory: ✅ 35.389MB (SLO: <36.000MB 🟡 -1.7%) vs baseline: +5.3%


✅ no-collectipvariant

Time: ✅ 40.134µs (SLO: <50.000µs 📉 -19.7%) vs baseline: -0.4%

Memory: ✅ 35.448MB (SLO: <36.000MB 🟡 -1.5%) vs baseline: +4.7%


✅ no-useragentvariant

Time: ✅ 38.883µs (SLO: <50.000µs 📉 -22.2%) vs baseline: -0.7%

Memory: ✅ 35.330MB (SLO: <36.000MB 🟡 -1.9%) vs baseline: +4.8%


✅ obfuscation-no-query

Time: ✅ 40.683µs (SLO: <50.000µs 📉 -18.6%) vs baseline: -0.1%

Memory: ✅ 35.330MB (SLO: <36.000MB 🟡 -1.9%) vs baseline: +4.7%


✅ obfuscation-regular-case-explicit-query

Time: ✅ 75.919µs (SLO: <90.000µs 📉 -15.6%) vs baseline: -0.3%

Memory: ✅ 35.743MB (SLO: <36.500MB -2.1%) vs baseline: +4.9%


✅ obfuscation-regular-case-implicit-query

Time: ✅ 76.595µs (SLO: <90.000µs 📉 -14.9%) vs baseline: ~same

Memory: ✅ 35.724MB (SLO: <36.500MB -2.1%) vs baseline: +5.2%


✅ obfuscation-send-querystring-disabled

Time: ✅ 154.408µs (SLO: <170.000µs -9.2%) vs baseline: -0.1%

Memory: ✅ 35.645MB (SLO: <36.500MB -2.3%) vs baseline: +4.7%


✅ obfuscation-worst-case-explicit-query

Time: ✅ 148.931µs (SLO: <160.000µs -6.9%) vs baseline: -0.2%

Memory: ✅ 35.665MB (SLO: <36.500MB -2.3%) vs baseline: +4.6%


✅ obfuscation-worst-case-implicit-query

Time: ✅ 155.416µs (SLO: <170.000µs -8.6%) vs baseline: +0.4%

Memory: ✅ 35.665MB (SLO: <36.500MB -2.3%) vs baseline: +4.6%


✅ useragentvariant_exists_1

Time: ✅ 39.847µs (SLO: <50.000µs 📉 -20.3%) vs baseline: +0.6%

Memory: ✅ 35.311MB (SLO: <36.000MB 🟡 -1.9%) vs baseline: +4.2%


✅ useragentvariant_exists_2

Time: ✅ 41.098µs (SLO: <50.000µs 📉 -17.8%) vs baseline: +0.4%

Memory: ✅ 35.311MB (SLO: <36.000MB 🟡 -1.9%) vs baseline: +4.6%


✅ useragentvariant_exists_3

Time: ✅ 40.273µs (SLO: <50.000µs 📉 -19.5%) vs baseline: -0.4%

Memory: ✅ 35.370MB (SLO: <36.000MB 🟡 -1.8%) vs baseline: +5.0%


✅ useragentvariant_not_exists_1

Time: ✅ 39.815µs (SLO: <50.000µs 📉 -20.4%) vs baseline: ~same

Memory: ✅ 35.311MB (SLO: <36.000MB 🟡 -1.9%) vs baseline: +4.4%


✅ useragentvariant_not_exists_2

Time: ✅ 39.789µs (SLO: <50.000µs 📉 -20.4%) vs baseline: +0.4%

Memory: ✅ 35.429MB (SLO: <36.000MB 🟡 -1.6%) vs baseline: +5.2%


🟡 span - 26/26

✅ add-event

Time: ✅ 18.206ms (SLO: <22.500ms 📉 -19.1%) vs baseline: ~same

Memory: ✅ 36.904MB (SLO: <53.000MB 📉 -30.4%) vs baseline: +4.5%


✅ add-metrics

Time: ✅ 88.640ms (SLO: <93.500ms -5.2%) vs baseline: +0.2%

Memory: ✅ 41.168MB (SLO: <53.000MB 📉 -22.3%) vs baseline: +4.9%


✅ add-tags

Time: ✅ 141.239ms (SLO: <155.000ms -8.9%) vs baseline: -0.7%

Memory: ✅ 41.183MB (SLO: <53.000MB 📉 -22.3%) vs baseline: +4.9%


✅ get-context

Time: ✅ 16.945ms (SLO: <20.500ms 📉 -17.3%) vs baseline: +0.1%

Memory: ✅ 36.754MB (SLO: <53.000MB 📉 -30.7%) vs baseline: +4.8%


✅ is-recording

Time: ✅ 17.319ms (SLO: <20.500ms 📉 -15.5%) vs baseline: +0.8%

Memory: ✅ 36.696MB (SLO: <53.000MB 📉 -30.8%) vs baseline: +4.5%


✅ record-exception

Time: ✅ 36.745ms (SLO: <40.000ms -8.1%) vs baseline: ~same

Memory: ✅ 37.374MB (SLO: <53.000MB 📉 -29.5%) vs baseline: +4.8%


✅ set-status

Time: ✅ 18.787ms (SLO: <22.000ms 📉 -14.6%) vs baseline: +1.1%

Memory: ✅ 36.794MB (SLO: <53.000MB 📉 -30.6%) vs baseline: +4.8%


✅ start

Time: ✅ 17.390ms (SLO: <20.500ms 📉 -15.2%) vs baseline: +4.0%

Memory: ✅ 36.758MB (SLO: <53.000MB 📉 -30.6%) vs baseline: +4.7%


✅ start-finish

Time: ✅ 51.004ms (SLO: <52.500ms -2.9%) vs baseline: -0.3%

Memory: ✅ 34.800MB (SLO: <35.500MB 🟡 -2.0%) vs baseline: +4.6%


✅ start-finish-telemetry

Time: ✅ 52.311ms (SLO: <54.500ms -4.0%) vs baseline: +0.2%

Memory: ✅ 34.839MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +5.0%


✅ start-finish-traceid128

Time: ✅ 54.397ms (SLO: <57.000ms -4.6%) vs baseline: +0.5%

Memory: ✅ 34.800MB (SLO: <35.500MB 🟡 -2.0%) vs baseline: +4.7%


✅ start-traceid128

Time: ✅ 17.426ms (SLO: <22.500ms 📉 -22.6%) vs baseline: +1.2%

Memory: ✅ 36.783MB (SLO: <53.000MB 📉 -30.6%) vs baseline: +4.6%


✅ update-name

Time: ✅ 17.376ms (SLO: <22.000ms 📉 -21.0%) vs baseline: +0.7%

Memory: ✅ 36.821MB (SLO: <53.000MB 📉 -30.5%) vs baseline: +4.8%


🟡 tracer - 6/6

✅ large

Time: ✅ 29.034ms (SLO: <32.950ms 📉 -11.9%) vs baseline: -0.3%

Memory: ✅ 35.920MB (SLO: <36.500MB 🟡 -1.6%) vs baseline: +4.6%


✅ medium

Time: ✅ 2.875ms (SLO: <3.200ms 📉 -10.2%) vs baseline: ~same

Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +5.0%


✅ small

Time: ✅ 330.329µs (SLO: <370.000µs 📉 -10.7%) vs baseline: +1.7%

Memory: ✅ 34.918MB (SLO: <35.500MB 🟡 -1.6%) vs baseline: +4.9%

⚠️ Unstable Tests (1 suite)
⚠️ packagesupdateimporteddependencies - 24/24 (1 unstable)

✅ import_many

Time: ✅ 154.913µs (SLO: <170.000µs -8.9%) vs baseline: +0.2%

Memory: ✅ 39.673MB (SLO: <43.000MB -7.7%) vs baseline: +4.9%


✅ import_many_cached

Time: ✅ 120.859µs (SLO: <130.000µs -7.0%) vs baseline: -0.3%

Memory: ✅ 40.313MB (SLO: <43.000MB -6.2%) vs baseline: +6.6%


✅ import_many_stdlib

Time: ✅ 0.758ms (SLO: <1.750ms 📉 -56.7%) vs baseline: +0.2%

Memory: ✅ 39.679MB (SLO: <43.000MB -7.7%) vs baseline: +4.7%


⚠️ import_many_stdlib_cached

Time: ⚠️ 0.173ms (SLO: <1.100ms 📉 -84.3%) vs baseline: +0.9%

Memory: ✅ 39.611MB (SLO: <43.000MB -7.9%) vs baseline: +5.4%


✅ import_many_unknown

Time: ✅ 837.810µs (SLO: <890.000µs -5.9%) vs baseline: +0.5%

Memory: ✅ 39.817MB (SLO: <43.000MB -7.4%) vs baseline: +4.7%


✅ import_many_unknown_cached

Time: ✅ 794.565µs (SLO: <870.000µs -8.7%) vs baseline: ~same

Memory: ✅ 40.052MB (SLO: <43.000MB -6.9%) vs baseline: +5.4%


✅ import_one

Time: ✅ 19.757µs (SLO: <30.000µs 📉 -34.1%) vs baseline: -0.1%

Memory: ✅ 39.748MB (SLO: <43.000MB -7.6%) vs baseline: +5.6%


✅ import_one_cache

Time: ✅ 6.294µs (SLO: <10.000µs 📉 -37.1%) vs baseline: -0.4%

Memory: ✅ 39.781MB (SLO: <43.000MB -7.5%) vs baseline: +5.0%


✅ import_one_stdlib

Time: ✅ 18.704µs (SLO: <20.000µs -6.5%) vs baseline: ~same

Memory: ✅ 39.570MB (SLO: <43.000MB -8.0%) vs baseline: +4.4%


✅ import_one_stdlib_cache

Time: ✅ 6.316µs (SLO: <10.000µs 📉 -36.8%) vs baseline: +0.3%

Memory: ✅ 39.651MB (SLO: <43.000MB -7.8%) vs baseline: +4.3%


✅ import_one_unknown

Time: ✅ 45.247µs (SLO: <50.000µs -9.5%) vs baseline: -0.3%

Memory: ✅ 39.823MB (SLO: <43.000MB -7.4%) vs baseline: +5.7%


✅ import_one_unknown_cache

Time: ✅ 6.304µs (SLO: <10.000µs 📉 -37.0%) vs baseline: +0.5%

Memory: ✅ 39.636MB (SLO: <43.000MB -7.8%) vs baseline: +4.0%

✅ All Tests Passing (6 suites)
iast_aspects - 40/40

✅ re_expand_aspect

Time: ✅ 37.285µs (SLO: <40.000µs -6.8%) vs baseline: +6.4%

Memory: ✅ 41.347MB (SLO: <43.500MB -5.0%) vs baseline: +4.7%


✅ re_expand_noaspect

Time: ✅ 35.256µs (SLO: <40.000µs 📉 -11.9%) vs baseline: +0.3%

Memory: ✅ 41.366MB (SLO: <43.500MB -4.9%) vs baseline: +4.8%


✅ re_findall_aspect

Time: ✅ 3.422µs (SLO: <10.000µs 📉 -65.8%) vs baseline: +0.1%

Memory: ✅ 41.366MB (SLO: <43.500MB -4.9%) vs baseline: +4.8%


✅ re_findall_noaspect

Time: ✅ 3.238µs (SLO: <10.000µs 📉 -67.6%) vs baseline: -0.4%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.9%


✅ re_finditer_aspect

Time: ✅ 4.524µs (SLO: <10.000µs 📉 -54.8%) vs baseline: -0.6%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +5.0%


✅ re_finditer_noaspect

Time: ✅ 3.287µs (SLO: <10.000µs 📉 -67.1%) vs baseline: -1.2%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.9%


✅ re_fullmatch_aspect

Time: ✅ 2.752µs (SLO: <10.000µs 📉 -72.5%) vs baseline: -1.3%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.8%


✅ re_fullmatch_noaspect

Time: ✅ 3.090µs (SLO: <10.000µs 📉 -69.1%) vs baseline: +0.5%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.9%


✅ re_group_aspect

Time: ✅ 4.826µs (SLO: <10.000µs 📉 -51.7%) vs baseline: +0.2%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +4.8%


✅ re_group_noaspect

Time: ✅ 4.884µs (SLO: <10.000µs 📉 -51.2%) vs baseline: -0.1%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.7%


✅ re_groups_aspect

Time: ✅ 4.970µs (SLO: <10.000µs 📉 -50.3%) vs baseline: -0.1%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +4.9%


✅ re_groups_noaspect

Time: ✅ 4.990µs (SLO: <10.000µs 📉 -50.1%) vs baseline: ~same

Memory: ✅ 41.484MB (SLO: <43.500MB -4.6%) vs baseline: +5.0%


✅ re_match_aspect

Time: ✅ 2.831µs (SLO: <10.000µs 📉 -71.7%) vs baseline: -0.5%

Memory: ✅ 41.484MB (SLO: <43.500MB -4.6%) vs baseline: +5.3%


✅ re_match_noaspect

Time: ✅ 3.107µs (SLO: <10.000µs 📉 -68.9%) vs baseline: +0.6%

Memory: ✅ 41.543MB (SLO: <43.500MB -4.5%) vs baseline: +5.3%


✅ re_search_aspect

Time: ✅ 2.638µs (SLO: <10.000µs 📉 -73.6%) vs baseline: -0.7%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.8%


✅ re_search_noaspect

Time: ✅ 2.896µs (SLO: <10.000µs 📉 -71.0%) vs baseline: -0.2%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.8%


✅ re_sub_aspect

Time: ✅ 3.553µs (SLO: <10.000µs 📉 -64.5%) vs baseline: +0.6%

Memory: ✅ 41.445MB (SLO: <43.500MB -4.7%) vs baseline: +4.9%


✅ re_sub_noaspect

Time: ✅ 4.042µs (SLO: <10.000µs 📉 -59.6%) vs baseline: +2.0%

Memory: ✅ 41.366MB (SLO: <43.500MB -4.9%) vs baseline: +4.5%


✅ re_subn_aspect

Time: ✅ 3.791µs (SLO: <10.000µs 📉 -62.1%) vs baseline: +0.1%

Memory: ✅ 41.347MB (SLO: <43.500MB -5.0%) vs baseline: +4.7%


✅ re_subn_noaspect

Time: ✅ 4.066µs (SLO: <10.000µs 📉 -59.3%) vs baseline: -0.6%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +5.0%


iastaspectssplit - 12/12

✅ rsplit_aspect

Time: ✅ 1.593µs (SLO: <10.000µs 📉 -84.1%) vs baseline: +3.5%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.9%


✅ rsplit_noaspect

Time: ✅ 1.612µs (SLO: <10.000µs 📉 -83.9%) vs baseline: -0.4%

Memory: ✅ 41.425MB (SLO: <43.500MB -4.8%) vs baseline: +5.0%


✅ split_aspect

Time: ✅ 1.546µs (SLO: <10.000µs 📉 -84.5%) vs baseline: -0.2%

Memory: ✅ 41.366MB (SLO: <43.500MB -4.9%) vs baseline: +4.6%


✅ split_noaspect

Time: ✅ 1.617µs (SLO: <10.000µs 📉 -83.8%) vs baseline: +0.2%

Memory: ✅ 41.406MB (SLO: <43.500MB -4.8%) vs baseline: +4.8%


✅ splitlines_aspect

Time: ✅ 1.495µs (SLO: <10.000µs 📉 -85.1%) vs baseline: -1.0%

Memory: ✅ 41.327MB (SLO: <43.500MB -5.0%) vs baseline: +4.6%


✅ splitlines_noaspect

Time: ✅ 1.556µs (SLO: <10.000µs 📉 -84.4%) vs baseline: -0.2%

Memory: ✅ 41.386MB (SLO: <43.500MB -4.9%) vs baseline: +4.7%


iastpropagation - 7/7

✅ no-propagation

Time: ✅ 48.673µs (SLO: <60.000µs 📉 -18.9%) vs baseline: ~same

Memory: ✅ 38.358MB (SLO: <42.000MB -8.7%) vs baseline: +4.6%


✅ propagation_enabled

Time: ✅ 135.900µs (SLO: <190.000µs 📉 -28.5%) vs baseline: -0.6%

Memory: ✅ 38.339MB (SLO: <42.000MB -8.7%) vs baseline: +5.1%


✅ propagation_enabled_100

Time: ✅ 1.583ms (SLO: <2.300ms 📉 -31.2%) vs baseline: +0.1%


✅ propagation_enabled_1000

Time: ✅ 29.359ms (SLO: <34.550ms 📉 -15.0%) vs baseline: -0.2%

Memory: ✅ 38.319MB (SLO: <42.000MB -8.8%) vs baseline: +4.3%


otelsdkspan - 24/24

✅ add-event

Time: ✅ 40.363ms (SLO: <42.000ms -3.9%) vs baseline: +0.1%

Memory: ✅ 37.749MB (SLO: <39.000MB -3.2%) vs baseline: +4.9%


✅ add-link

Time: ✅ 36.213ms (SLO: <38.550ms -6.1%) vs baseline: -0.2%

Memory: ✅ 37.709MB (SLO: <39.000MB -3.3%) vs baseline: +4.5%


✅ add-metrics

Time: ✅ 220.973ms (SLO: <232.000ms -4.8%) vs baseline: +0.3%

Memory: ✅ 37.768MB (SLO: <39.000MB -3.2%) vs baseline: +5.0%


✅ add-tags

Time: ✅ 214.648ms (SLO: <221.600ms -3.1%) vs baseline: +2.1%

Memory: ✅ 37.690MB (SLO: <39.000MB -3.4%) vs baseline: +5.0%


✅ get-context

Time: ✅ 29.055ms (SLO: <31.300ms -7.2%) vs baseline: -0.1%

Memory: ✅ 37.847MB (SLO: <39.000MB -3.0%) vs baseline: +5.4%


✅ is-recording

Time: ✅ 29.055ms (SLO: <31.000ms -6.3%) vs baseline: -0.4%

Memory: ✅ 37.788MB (SLO: <39.000MB -3.1%) vs baseline: +5.0%


✅ record-exception

Time: ✅ 63.016ms (SLO: <65.850ms -4.3%) vs baseline: ~same

Memory: ✅ 37.650MB (SLO: <39.000MB -3.5%) vs baseline: +4.6%


✅ set-status

Time: ✅ 31.628ms (SLO: <34.150ms -7.4%) vs baseline: ~same

Memory: ✅ 37.690MB (SLO: <39.000MB -3.4%) vs baseline: +5.1%


✅ start

Time: ✅ 29.294ms (SLO: <30.150ms -2.8%) vs baseline: +2.1%

Memory: ✅ 37.670MB (SLO: <39.000MB -3.4%) vs baseline: +4.5%


✅ start-finish

Time: ✅ 34.059ms (SLO: <35.350ms -3.7%) vs baseline: +1.0%

Memory: ✅ 37.729MB (SLO: <39.000MB -3.3%) vs baseline: +4.7%


✅ start-finish-telemetry

Time: ✅ 33.747ms (SLO: <35.450ms -4.8%) vs baseline: -0.2%

Memory: ✅ 37.690MB (SLO: <39.000MB -3.4%) vs baseline: +4.8%


✅ update-name

Time: ✅ 30.889ms (SLO: <33.400ms -7.5%) vs baseline: -0.7%

Memory: ✅ 37.808MB (SLO: <39.000MB -3.1%) vs baseline: +5.1%


packagespackageforrootmodulemapping - 4/4

✅ cache_off

Time: ✅ 343.226ms (SLO: <354.300ms -3.1%) vs baseline: -0.8%

Memory: ✅ 41.244MB (SLO: <43.500MB -5.2%) vs baseline: +5.5%


✅ cache_on

Time: ✅ 0.384µs (SLO: <10.000µs 📉 -96.2%) vs baseline: +0.3%

Memory: ✅ 40.071MB (SLO: <43.000MB -6.8%) vs baseline: +4.2%


samplingrules - 8/8

✅ average_match

Time: ✅ 137.600µs (SLO: <290.000µs 📉 -52.6%) vs baseline: +0.3%

Memory: ✅ 34.741MB (SLO: <35.500MB -2.1%) vs baseline: +4.4%


✅ high_match

Time: ✅ 173.781µs (SLO: <480.000µs 📉 -63.8%) vs baseline: -0.4%

Memory: ✅ 34.741MB (SLO: <35.500MB -2.1%) vs baseline: +4.9%


✅ low_match

Time: ✅ 98.750µs (SLO: <120.000µs 📉 -17.7%) vs baseline: ~same

Memory: ✅ 603.704MB (SLO: <700.000MB 📉 -13.8%) vs baseline: +4.8%


✅ very_low_match

Time: ✅ 2.669ms (SLO: <8.500ms 📉 -68.6%) vs baseline: +0.2%

Memory: ✅ 71.153MB (SLO: <75.000MB -5.1%) vs baseline: +4.8%

ℹ️ Scenarios Missing SLO Configuration (26 scenarios)

The following scenarios exist in candidate data but have no SLO thresholds configured:

  • coreapiscenario-core_dispatch_listeners
  • coreapiscenario-core_dispatch_no_listeners
  • coreapiscenario-core_dispatch_with_results_listeners
  • coreapiscenario-core_dispatch_with_results_no_listeners
  • djangosimple-baseline
  • errortrackingdjangosimple-baseline
  • errortrackingflasksqli-baseline
  • flasksimple-baseline
  • flasksqli-baseline
  • sethttpmeta-obfuscation-disabled
  • startup-baseline
  • startup-baseline_django
  • startup-baseline_flask
  • startup-ddtrace_run
  • startup-ddtrace_run_appsec
  • startup-ddtrace_run_profiling
  • startup-ddtrace_run_runtime_metrics
  • startup-ddtrace_run_send_span
  • startup-ddtrace_run_telemetry_disabled
  • startup-ddtrace_run_telemetry_enabled
  • startup-import_ddtrace
  • startup-import_ddtrace_auto
  • startup-import_ddtrace_auto_django
  • startup-import_ddtrace_auto_flask
  • startup-import_ddtrace_django
  • startup-import_ddtrace_flask

@KowalskiThomas KowalskiThomas force-pushed the kowalski/chore-profiling-detect-cycles-in-asyncio branch from b848bad to c0e6f61 Compare December 22, 2025 15:27
@KowalskiThomas KowalskiThomas marked this pull request as ready for review December 22, 2025 15:52
@KowalskiThomas KowalskiThomas requested a review from a team as a code owner December 22, 2025 15:52
@KowalskiThomas KowalskiThomas force-pushed the kowalski/chore-profiling-detect-cycles-in-asyncio branch from fe7b099 to fc65590 Compare December 23, 2025 08:28
@KowalskiThomas KowalskiThomas force-pushed the kowalski/chore-profiling-detect-cycles-in-asyncio branch from fc65590 to e7dbf7f Compare December 23, 2025 12:18
@vlad-scherbich left a comment

Image

KowalskiThomas added a commit that referenced this pull request Jan 1, 2026
## Description

Related PRs
- Related: #15712
- Dependent: #15789
- Research PR: https://github.com/DataDog/dd-trace-py/pull/15675/commits
(if needed for code archeology...)

### What is this about?

This PR updates the Task unwinding logic in the Profiler to (more)
properly handle race conditions around running/"on CPU" Tasks. A Task
can be either in a _running_ state (i.e. actively _computing_ something
itself, like executing a regular Python function) or in a _sleeping_
state (i.e. waiting for something else to happen to wake up).

<img width="1076" height="434" alt="image"
src="https://github.com/user-attachments/assets/be6759eb-0255-43ef-b3ce-d47486bb653c"
/>

After those changes, this problem does not appear anymore: only Frames
that are actually from the same Stack appear within a given Stack.

<img width="1387" height="445" alt="image"
src="https://github.com/user-attachments/assets/31287863-f918-47a8-a39b-b3a0d27dce8f"
/>



### Why do we need it?

Because we don't take a "snapshot of the whole Python process at once",
there is a race condition in our Sampler.
We first capture the Thread Stack (i.e. for the current Thread, if it is
running, what Python code the interpreter is running), then for each
Task in the Thread's Event Loop [if it exists] we look at the Task's own
Stack. (Since Task/Coroutines are pausable, they have their own Stack
that is kept in memory when they're paused, then re-loaded into context
when they're resumed. Walking each Task's Stack allows us to e.g. know
what code they're "running", even when they aren't actually currently
running code...)
Going back to the race condition question, we may have a discrepancy
between what the Python Thread Stack tells us (what the interpreter is
running) and what Task objects themselves tell us (because a tiny amount
of time actually elapses between the moment we capture the Thread Stack
and the moment we inspect the Task objects, so _what is happening_ may
have changed in the meantime).
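The Task-stack walking described above can be sketched at the Python level (the native profiler does this in C++ against interpreter internals; `cr_await` is the real CPython coroutine attribute, while `walk_await_chain` and the `FakeCoro` cycle below are illustrative assumptions):

```python
import asyncio

def walk_await_chain(coro, limit=128):
    # Follow a (possibly suspended) coroutine's await chain, roughly as the
    # profiler does when reconstructing a Task's stack. A visited set guards
    # against cycles -- real or apparent (state changed between two memory
    # reads) -- which would otherwise make the walk loop forever.
    chain, seen = [], set()
    while coro is not None and len(chain) < limit:
        if id(coro) in seen:
            break  # cycle detected: stop instead of looping
        seen.add(id(coro))
        chain.append(coro)
        coro = getattr(coro, "cr_await", None)
    return chain

# A suspended Task keeps its await chain alive in memory:
async def inner():
    await asyncio.sleep(1)

async def outer():
    await inner()

async def main():
    task = asyncio.ensure_future(outer())
    await asyncio.sleep(0)  # let the task start, then suspend
    chain = walk_await_chain(task.get_coro())
    task.cancel()
    return len(chain)

depth = asyncio.run(main())  # at least outer -> inner

# An (artificial) cycle terminates instead of hanging:
class FakeCoro:
    cr_await = None

a, b = FakeCoro(), FakeCoro()
a.cr_await, b.cr_await = b, a
cycle_len = len(walk_await_chain(a))
```

The `limit` bound plays the same defensive role as the visited set: even if identity-based detection were fooled, the walk stays finite.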

I've already in the past gone into more detail regarding what
buggy/unexpected behaviour may result from that race condition; this PR
improves this.

Note that there is a pretty obvious tradeoff here. When we detect a
discrepancy, we can:
- Ignore the fact we know something bad is going to happen – I'd rather
not do that because it can look terrible for customers (and we don't
want to look _obviously wrong_ to them). That would mean _quantity over
quality_.
- Try to recover by doing clever tricks (this can be somewhat costly
because we have to interleave the various Stacks we have... I think we
may want to do that at some point but not without putting more thought
into it; plus those clever tricks can also sometimes be brittle tricks).
That would mean _quality and quantity over cost_ (which in practice
probably also means _quality over quantity_ because increasing costs
will most probably lead to more adaptive sampling).
- Give up and just pretend this never happened – skip that Sample (for
the current Task, and in certain cases for the current Thread
altogether). That would mean _quality over quantity_.

For the time being, things can only get better because we're in a state
where we don't deal with the problem at all. The current PR biases
towards a mix: we detect more reliably the depth of the pure Python
Stack (which allows us not to rely on un-unwinding Task Stacks), and
then we skip Samples that we know will be bogus. If the latter happens
sufficiently rarely [a claim I still need numbers to back] then this is
OK.

### How does it work?

The main problem we are trying to avoid here is having some of Task A's
Frames appearing as part of Task B's Stack. Working around this requires
properly splitting the Python Stack when it says it is running a Task,
such that we only push the `asyncio` runtime Frames on top of each
non-Task A Task. Walking the Python Stack allows us to do that properly.

We thus walk the Python Stack (once per Thread) to detect whether we see
`Handle.run` Frames – those indicate that the Event Loop is currently
_stepping_ the Coroutine – in other words executing code. (When that
happens, we expect at least one Task to be marked as _running_ (there
could be more – that's also a race condition, but it's OK, at least as far
as CPU Time is concerned...))
As soon as we see a `run` Frame, we know the depth of the "pure Python
Stack" and we can push it on top of every Task's Stack!
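The splitting step can be illustrated with plain name lists standing in for frames (the stack layout and the `split_at_handle_run` helper are simplified assumptions; only the `Handle.run` marker comes from the description above):

```python
def split_at_handle_run(thread_stack):
    # Walk the thread stack from outermost to innermost frame and return the
    # prefix up to and including the first Handle.run frame, i.e. the pure
    # Python + event-loop machinery part that is common to every Task.
    # Returns None if the loop is not currently stepping a Task.
    for depth, name in enumerate(thread_stack):
        if name == "Handle.run":
            return thread_stack[: depth + 1]
    return None

thread_stack = [
    "<module>", "asyncio.run", "BaseEventLoop.run_forever",
    "BaseEventLoop._run_once", "Handle.run",   # event-loop machinery
    "Task.__step", "my_coro", "busy_helper",   # frames of the running Task only
]
prefix = split_at_handle_run(thread_stack)

# The common prefix is combined with every Task's own stack, so Task A's
# innermost frames never leak into Task B's rendered stack:
task_stacks = {"task_a": ["my_coro", "busy_helper"], "task_b": ["other_coro"]}
rendered = {name: prefix + stack for name, stack in task_stacks.items()}
```

If no `Handle.run` frame is found, the event loop is idle (or the snapshot is inconsistent) and no prefix should be pushed.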

### What does this cost us?

This is not completely free – we're doing more work (namely, walking the
stack at each Sample). Looking at Full Host Profiles on a high-CPU
`asyncio`-based Python script, I'm getting the following difference.
Note that the total Profiler overhead is about 360ms/minute, meaning the
additional ~20ms we're using here represent an extra 5% overhead. Given
the importance of getting Stacks right (or at least not completely
wrong), I'd say it's worth it, but it's still noticeable.

I tried to do otherwise – but as far as I can tell, as long as the race
condition between unwinding the Python Stack and unwinding Task Stacks
exists (which it inevitably does), we will not be able to tell for sure how
many pure Python Stack Frames we need to push. There are heuristics that
can get us there in theoretically better time (e.g. only walk the Python
Stack if all Tasks are reported as non-running), but those come at a
correctness and code readability cost, and it is not even certain
that their overhead would be lower.

I also have another PR that should reduce the cost of unwinding Tasks
(that uses the fact we now walk the Python Stack only once):
#15789 so hopefully it evens
things out.

<img width="1926" height="921" alt="image"
src="https://github.com/user-attachments/assets/e21a6698-2b18-43bb-aff9-4e8d59354332"
/>
brettlangdon pushed a commit that referenced this pull request Jan 6, 2026
dd-octo-sts bot pushed a commit that referenced this pull request Jan 6, 2026
(cherry picked from commit 61b1799)
KowalskiThomas added a commit that referenced this pull request Jan 6, 2026
## Description

Related PRs
- Related: #15712
- Dependent: #15789
- Research PR: https://github.com/DataDog/dd-trace-py/pull/15675/commits
(if needed for code archeology...)

### What is this about?

This PR updates the Task unwinding logic in the Profiler to (more)
properly handle race conditions around running/"on CPU" Tasks. A Task
can be either in a _running_ state (i.e. actively _computing_ something
itself, like executing a regular Python function) or in a _sleeping_
state (i.e. waiting for something else to happen to wake up).

<img width="1076" height="434" alt="image"
src="https://github.com/user-attachments/assets/be6759eb-0255-43ef-b3ce-d47486bb653c"
/>

After those changes, this problem does not appear anymore: only Frames
that are actually from the same Stack appear within a given Stack.

<img width="1387" height="445" alt="image"
src="https://github.com/user-attachments/assets/31287863-f918-47a8-a39b-b3a0d27dce8f"
/>

### Why do we need it?

Because we don't take a "snapshot of the whole Python process at once",
there is a race condition in our Sampler.
We first capture the Thread Stack (i.e. for the current Thread, if it is
running, what Python code the interpreter is running), then for each
Task in the Thread's Event Loop [if it exists] we look at the Task's own
Stack. (Since Task/Coroutines are pausable, they have their own Stack
that is kept in memory when they're paused, then re-loaded into context
when they're resumed. Walking each Task's Stack allows us to e.g. know
what code they're "running", even when they aren't actually currently
running code...)
Going back to the race condition question, we may have a discrepancy
between what the Python Thread Stack tells us (what the interpreter is
running) and what Task objects themselves tell us (because a tiny amount
of time actually elapses between the moment we capture the Thread Stack
and the moment we inspect the Task objects, so _what is happening_ may
have changed in the meantime).
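The Task-side walk described above can be mimicked at the Python level (the real unwinder does the equivalent in C++ over raw interpreter memory, and this PR adds the cycle guard to it). A minimal sketch, assuming only the public `Task.get_coro()` API and the coroutine `cr_await` attribute; `awaited_chain` and its `limit` cap are hypothetical names for illustration:

```python
import asyncio

def awaited_chain(task, limit=32):
    """Collect names along a Task's await chain, guarding against cycles."""
    seen = set()   # object ids already visited: a revisit means a cycle
    chain = []
    coro = task.get_coro()
    while coro is not None and len(chain) < limit:
        if id(coro) in seen:
            break  # cycle (or, for a concurrent sampler, a torn read)
        seen.add(id(coro))
        chain.append(getattr(coro, "__qualname__", type(coro).__name__))
        coro = getattr(coro, "cr_await", None)
    return chain

async def leaf():
    await asyncio.sleep(0.01)

async def mid():
    await leaf()

async def main():
    t = asyncio.create_task(mid())
    await asyncio.sleep(0)      # let t run until it suspends in sleep()
    chain = awaited_chain(t)    # names from t's await chain, starting at 'mid'
    await t
    return chain
```

Without the `seen` set (or an equivalent guard), a chain that loops back on itself, or that merely looks cyclic because it changed between two reads, would make the walk run forever.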

I've gone into more detail in the past about the buggy/unexpected
behaviour that can result from that race condition; this PR improves on
it.

Note that there is a pretty obvious tradeoff here. When we detect a
discrepancy, we can:
- Ignore the fact we know something bad is going to happen – I'd rather
not do that because it can look terrible for customers (and we don't
want to look _obviously wrong_ to them). That would mean _quantity over
quality_.
- Try to recover by doing clever tricks (this can be somewhat costly
because we have to interleave the various Stacks we have... I think we
may want to do that at some point but not without putting more thought
into it; plus those clever tricks can also sometimes be brittle tricks).
That would mean _quality and quantity over cost_ (which in practice
probably also means _quality over quantity_ because increasing costs
will most probably lead to more adaptive sampling).
- Give up and just pretend this never happened – skip that Sample (for
the current Task, and in certain cases for the current Thread
altogether). That would mean _quality over quantity_.

For the time being, things can only get better because we're in a state
where we don't deal with the problem at all. The current PR biases
towards a mix: we detect more reliably the depth of the pure Python
Stack (which allows us not to rely on un-unwinding Task Stacks), and
then we skip Samples that we know will be bogus. If the latter happens
sufficiently rarely [a claim I still need numbers to back] then this is
OK.

### How does it work?

The main problem we are trying to avoid here is having some of Task A's
Frames appearing as part of Task B's Stack. Working around this requires
properly splitting the Python Stack when it says it is running a Task,
such that we only push the `asyncio` runtime Frames on top of each
non-Task A Task. Walking the Python Stack allows us to do that properly.

We thus walk the Python Stack (once per Thread) to detect whether we see
`Handle.run` Frames – those indicate that the Event Loop is currently
_stepping_ the Coroutine – in other words executing code. (When that
happens, we expect at least one Task to be marked as _running_ (there
could be more – that's also a race condition, but it's OK, as far as CPU
Time is not concerned...))
As soon as we see a `run` Frame, we know the depth of the "pure Python
Stack" and we can push it on top of every Task's Stack!

### What does this cost us?

This is not completely free – we're doing more work (namely, walking the
stack at each Sample). Looking at Full Host Profiles on a high-CPU
`asyncio`-based Python script, I'm getting the following difference.
Note that the total Profiler overhead is about 360ms/minute, meaning the
additional ~20ms we're using here represent an extra 5% overhead. Given
the importance of getting Stacks right (or at least not completely
wrong), I'd say it's worth it, but it's still noticeable.

I tried to do otherwise – but as far as I can tell, as long as the race
condition between unwinding the Python Stack and unwinding Task Stacks
exists (and it necessarily does), we will not be able to tell for sure
how many pure Python Stack Frames we need to push. There are heuristics
that could get us there in theoretically better time (e.g. only walk the
Python Stack if all Tasks are reported as non-running), but those come
at a correctness and code readability cost, and it's not even certain
their overhead would be lower.

I also have another PR that should reduce the cost of unwinding Tasks
(it uses the fact that we now walk the Python Stack only once):
#15789; hopefully that evens things out.

<img width="1926" height="921" alt="image"
src="https://github.com/user-attachments/assets/e21a6698-2b18-43bb-aff9-4e8d59354332"
/>

(cherry picked from commit 61b1799)
KowalskiThomas added a commit that referenced this pull request Jan 6, 2026
…15854)

Backport 61b1799 from #15780 to 4.1.

Co-authored-by: Thomas Kowalski <thomas.kowalski@datadoghq.com>
kianjones9 pushed a commit to kianjones9/dd-trace-py that referenced this pull request Jan 9, 2026
Contributor

@taegyunkim taegyunkim left a comment


Do we still want to merge this?

One thing I keep asking myself when adding these kinds of checks is that we tend to ignore the fact that `std::unordered_set::find()` can show up in the profiles, meaning that it could be costly. Given that we have adaptive sampling in the works, it wouldn't be too problematic, but IMHO it's worth thinking about.

The upstream CPython profiler tends to use a cap on the number of items to iterate over, and checks whether a pointer has stopped progressing (an additional variable to compare against). It does not use a 'seen' container like we do.
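For contrast, a toy Python sketch of the two guard styles discussed here (the real code is C++ in echion; `Node` is a stand-in for one link in a Task/coroutine chain, and the function names are made up for illustration):

```python
class Node:
    """Stand-in for a coroutine/Task link in an awaiter chain."""
    def __init__(self, name):
        self.name = name
        self.nxt = None

def walk_with_seen(node, limit=1000):
    # 'seen' set guard: detects a cycle on the first revisit, at the
    # cost of one hash insert + lookup per step and O(n) extra memory.
    seen, out = set(), []
    while node is not None and len(out) < limit:
        if id(node) in seen:
            break
        seen.add(id(node))
        out.append(node)
        node = node.nxt
    return out

def walk_with_slow_pointer(node, limit=1000):
    # CPython-style guard: a hard cap plus a slow pointer advanced every
    # other step. Constant memory and no hashing, but the cycle is
    # noticed later (after up to about twice the cycle length in steps),
    # so some nodes may be emitted more than once before the walk stops.
    out, slow, advance = [], node, False
    while node is not None and len(out) < limit:
        out.append(node)
        node = node.nxt
        if node is slow:      # pointer stopped making progress: cycle
            break
        if advance:
            slow = slow.nxt
        advance = not advance
    return out
```

Both terminate on a cyclic chain well before the cap; the trade-off is per-step cost and memory versus how quickly the cycle is caught.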

@KowalskiThomas
Contributor Author

The very reason why I had second thoughts about merging this PR was exactly that, and that's why I haven't merged it since I got the approval! 😅

Indeed, I've seen some `unordered_set` and `unordered_map` frames show up in profiles, and I don't think such a change is going to be free performance-wise, so I wanted to make sure it was worth it feature-wise (and I haven't been able to confirm that since).

@github-actions
Contributor

This pull request has been automatically closed after a period of inactivity.
After this much time, it will likely be easier to open a new pull request with the
same changes than to update this one from the base branch. Please comment or reopen
if you think this pull request was closed in error.

@github-actions github-actions bot closed this Feb 17, 2026
@github-actions github-actions bot removed the stale label Feb 18, 2026