perf(profiling): unwind only one Frame per Task #15789
Performance SLOs

Comparing candidate kowalski/perf-profiling-unwind-only-one-frame-per-task (04aa20b) with baseline main (21d5a9b).

📈 Performance Regressions (3 suites), all benchmarks still within their SLOs:
- iastaspects (118/118 ✅): `add_aspect` time 📈 +20.7% vs baseline, `encode_aspect` 📈 +23.2%, `title_aspect` 📈 +19.6%
- iastaspectsospath (24/24 ✅): `ospathbasename_aspect` time 📈 +20.9% vs baseline
- telemetryaddmetric (30/30 ✅): `1-count-metric-1-times` time 📈 +15.2% vs baseline

🟡 Near SLO Breach (16 suites), e.g. coreapiscenario (10/10, 1 unstable). All remaining benchmarks are within their time SLOs (most well below them), with memory consistently around +5% vs baseline across suites.
## Description

Related PRs:
- Related: #15712
- Dependent: #15789
- Research PR: https://github.com/DataDog/dd-trace-py/pull/15675/commits (if needed for code archeology...)

### What is this about?

This PR updates the Task unwinding logic in the Profiler to (more) properly handle race conditions around running/"on CPU" Tasks. A Task can be either in a _running_ state (i.e. actively _computing_ something itself, like executing a regular Python function) or in a _sleeping_ state (i.e. waiting for something else to happen to wake up).

<img width="1076" height="434" alt="image" src="https://github.com/user-attachments/assets/be6759eb-0255-43ef-b3ce-d47486bb653c" />

After these changes, the problem shown above no longer appears: only Frames that actually belong to the same Stack appear within a given Stack.

<img width="1387" height="445" alt="image" src="https://github.com/user-attachments/assets/31287863-f918-47a8-a39b-b3a0d27dce8f" />

### Why do we need it?

Because we don't take a "snapshot of the whole Python process at once", there is a race condition in our Sampler. We first capture the Thread Stack (i.e., for the current Thread, if it is running, what Python code the interpreter is executing), then, for each Task in the Thread's Event Loop (if it exists), we look at the Task's own Stack. (Since Tasks/Coroutines are pausable, they have their own Stack that is kept in memory while they're paused and re-loaded into context when they're resumed. Walking each Task's Stack lets us know, for example, what code they're "running", even when they aren't actually running code at that instant...)
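The two-phase capture described above can be sketched in pure Python (the real Sampler does this natively in C++; `sample_thread` and its shape are illustrative, not the actual API):

```python
# Hypothetical sketch of the two-phase capture: first the Thread Stack,
# then each Task's own Stack. The time that elapses between the two
# phases is exactly where the race condition lives.
import asyncio
import sys
import traceback


def sample_thread(thread_id, loop):
    # Phase 1: capture the Thread Stack -- what the interpreter is
    # executing right now on this thread.
    frame = sys._current_frames().get(thread_id)
    thread_stack = traceback.extract_stack(frame) if frame else []

    # Phase 2: capture each Task's own Stack. Paused coroutines keep
    # their frames alive, so get_stack() works even for sleeping Tasks.
    task_stacks = {
        task.get_name(): task.get_stack()
        for task in asyncio.all_tasks(loop)
    }
    return thread_stack, task_stacks
```

Anything the loop does between the two phases (a Task finishing its step, another being resumed) makes the two views disagree, which is the discrepancy discussed below.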
Going back to the race condition question: we may have a discrepancy between what the Python Thread Stack tells us (what the interpreter is running) and what the Task objects themselves tell us, because a tiny amount of time elapses between the moment we capture the Thread Stack and the moment we inspect the Task objects, so _what is happening_ may have changed in the meantime. I've already gone into more detail in the past about what buggy/unexpected behaviour can result from that race condition; this PR improves on it.

Note that there is a pretty obvious tradeoff here. When we detect a discrepancy, we can:

- Ignore the fact that we know something bad is going to happen – I'd rather not do that, because it can look terrible for customers (and we don't want to look _obviously wrong_ to them). That would mean _quantity over quality_.
- Try to recover with clever tricks. This can be somewhat costly, because we have to interleave the various Stacks we have; I think we may want to do that at some point, but not without putting more thought into it, and clever tricks can also be brittle tricks. That would mean _quality and quantity over cost_ (which in practice probably also means _quality over quantity_, because increasing costs will most likely lead to more adaptive sampling).
- Give up and pretend this never happened: skip that Sample (for the current Task, and in certain cases for the current Thread altogether). That would mean _quality over quantity_.

For the time being, things can only get better, because we currently don't deal with the problem at all. This PR biases towards a mix: we detect the depth of the pure Python Stack more reliably (which means we don't have to rely on un-unwinding Task Stacks), and we skip Samples that we know would be bogus. If the latter happens sufficiently rarely [a claim I still need numbers to back], then this is OK.

### How does it work?
The main problem we are trying to avoid is having some of Task A's Frames appear as part of Task B's Stack. Working around this requires properly splitting the Python Stack when it says it is running a Task, so that, for each Task other than Task A, we only push the `asyncio` runtime Frames on top of its Stack. Walking the Python Stack allows us to do that properly.

We thus walk the Python Stack (once per Thread) to detect whether we see `Handle.run` Frames – those indicate that the Event Loop is currently _stepping_ a Coroutine, in other words executing its code. (When that happens, we expect at least one Task to be marked as _running_; there could be more – that's also a race condition, but it's acceptable as long as CPU Time is not affected...) As soon as we see such a `run` Frame, we know the depth of the "pure Python Stack", and we can push it on top of every Task's Stack.

### What does this cost us?

This is not completely free – we're doing more work (namely, walking the Python Stack at each Sample). Looking at Full Host Profiles on a high-CPU `asyncio`-based Python script, I'm getting the following difference. Note that the total Profiler overhead is about 360ms/minute, so the additional ~20ms we're spending here represents an extra ~5% overhead. Given the importance of getting Stacks right (or at least not completely wrong), I'd say it's worth it, but it's still noticeable.

I tried to do otherwise, but as far as I can tell, as long as the race condition between unwinding the Python Stack and unwinding Task Stacks exists (and it cannot be avoided), we will not be able to tell for sure how many pure Python Stack Frames we need to push. There are heuristics that could theoretically get us there faster (e.g. only walk the Python Stack if all Tasks are reported as non-running), but they come at a correctness and code-readability cost, and it's not even certain their overhead would be lower.
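The `Handle.run` detection described under "How does it work?" can be sketched as follows (a simplified Python rendering under stated assumptions: the real walker is native code, and Frames here are plain `(function_name, filename)` pairs; the helper names are invented):

```python
def split_at_handle_run(thread_stack):
    """Given a Thread Stack ordered outermost-to-innermost, return the
    "pure Python Stack" prefix ending at the Event Loop's Handle.run
    Frame, or None if the loop is not currently stepping a Task.
    """
    for depth, (func, filename) in enumerate(thread_stack):
        # Handle.run lives in asyncio/events.py; seeing it means the
        # Event Loop is currently stepping a Coroutine.
        if func == "run" and filename.endswith("asyncio/events.py"):
            # Everything up to and including this Frame belongs to the
            # runtime, not to any Task: that is the prefix we push on
            # top of every Task's Stack.
            return thread_stack[: depth + 1]
    return None


def merged_stack(pure_python_prefix, task_stack):
    # Push the pure Python prefix on top of a Task's own Stack.
    return pure_python_prefix + task_stack
```

With this split, a sleeping Task's Stack is only ever prefixed by the runtime Frames that genuinely sit below it, never by another Task's Frames.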
I also have another PR that should reduce the cost of unwinding Tasks (it leverages the fact that we now walk the Python Stack only once): #15789 – so hopefully it evens things out.

<img width="1926" height="921" alt="image" src="https://github.com/user-attachments/assets/e21a6698-2b18-43bb-aff9-4e8d59354332" />
(cherry picked from commit 61b1799)
## Description

Related PRs:

- Related: #15712
- Dependent: #15789
- Research PR: https://github.com/DataDog/dd-trace-py/pull/15675/commits (if needed for code archeology)

### What is this about?

This PR updates the Task unwinding logic in the Profiler to handle race conditions around running/"on CPU" Tasks more properly. A Task can be either in a _running_ state (i.e. actively _computing_ something itself, like executing a regular Python function) or in a _sleeping_ state (i.e. waiting for something else to happen before waking up).

<img width="1076" height="434" alt="image" src="https://github.com/user-attachments/assets/be6759eb-0255-43ef-b3ce-d47486bb653c" />

After these changes, this problem no longer appears: only Frames that actually come from the same Stack appear within a given Stack.

<img width="1387" height="445" alt="image" src="https://github.com/user-attachments/assets/31287863-f918-47a8-a39b-b3a0d27dce8f" />

### Why do we need it?

Because we don't take a "snapshot of the whole Python process at once", there is a race condition in our Sampler. We first capture the Thread Stack (i.e. for the current Thread, if it is running, what Python code the interpreter is executing); then, for each Task in the Thread's Event Loop (if it exists), we look at the Task's own Stack. (Since Tasks/Coroutines are pausable, they have their own Stack that is kept in memory while they're paused, then re-loaded into context when they're resumed. Walking each Task's Stack allows us, for example, to know what code they're "running" even when they aren't actually running code at that moment.)

Going back to the race condition: we may see a discrepancy between what the Python Thread Stack tells us (what the interpreter is running) and what the Task objects themselves tell us, because a tiny amount of time elapses between the moment we capture the Thread Stack and the moment we inspect the Task objects, so _what is happening_ may have changed in the meantime. I've gone into more detail in the past about the buggy/unexpected behaviour that can result from this race condition; this PR improves the situation.

Note that there is a pretty obvious tradeoff here. When we detect a discrepancy, we can:

- Ignore the fact that we know something bad is going to happen. I'd rather not do that, because it can look terrible for customers (and we don't want to look _obviously wrong_ to them). That would mean _quantity over quality_.
- Try to recover with clever tricks. This can be somewhat costly, because we have to interleave the various Stacks we have; I think we may want to do that at some point, but not without putting more thought into it, and clever tricks can also sometimes be brittle tricks. That would mean _quality and quantity over cost_ (which in practice probably also means _quality over quantity_, because increased costs will most probably lead to more adaptive sampling).
- Give up and pretend this never happened: skip that Sample (for the current Task, and in certain cases for the current Thread altogether). That would mean _quality over quantity_.

For the time being, things can only get better, because we are currently in a state where we don't deal with the problem at all. This PR biases towards a mix: we detect the depth of the pure Python Stack more reliably (which allows us not to rely on un-unwinding Task Stacks), and then we skip Samples that we know will be bogus. If the latter happens sufficiently rarely [a claim I still need numbers to back], then this is OK.

### How does it work?

The main problem we are trying to avoid is having some of Task A's Frames appear as part of Task B's Stack. Working around this requires properly splitting the Python Stack when it says it is running a Task, such that we only push the `asyncio` runtime Frames on top of each Task other than Task A. Walking the Python Stack allows us to do that properly.

We thus walk the Python Stack (once per Thread) to detect `Handle.run` Frames; those indicate that the Event Loop is currently _stepping_ a Coroutine, in other words executing code. (When that happens, we expect at least one Task to be marked as _running_. There could be more than one, which is also a race condition, but that's OK, at least where CPU Time is not concerned.) As soon as we see a `run` Frame, we know the depth of the "pure Python Stack" and we can push it on top of every Task's Stack.

### What does this cost us?

This is not completely free: we're doing more work, namely walking the stack at each Sample. Looking at Full Host Profiles of a high-CPU `asyncio`-based Python script, I'm getting the following difference. Note that the total Profiler overhead is about 360ms/minute, meaning the additional ~20ms we're spending here represents an extra 5% overhead. Given the importance of getting Stacks right (or at least not completely wrong), I'd say it's worth it, but it's still noticeable.

I tried to do otherwise, but as far as I can tell, as long as the race condition between unwinding the Python Stack and unwinding Task Stacks exists (and it cannot not exist), we will not be able to tell for sure how many pure Python Stack Frames we need to push. There are heuristics that could get us there in theoretically better time (e.g. only walk the Python Stack if all Tasks are reported as non-running), but those come at a correctness and code readability cost, and it's not even certain their overhead would be lower.
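To make the splitting concrete, here is a minimal, hypothetical Python sketch of the idea described above. The real implementation lives in C++; function names like `pure_python_depth` and `stacks_for_tasks`, and the list-of-strings stack representation, are illustrative assumptions, not the actual code.

```python
# Hypothetical sketch: split the sampled Thread Stack at `Handle.run` and
# prepend the shared pure Python Stack to each Task's own Stack.

def pure_python_depth(thread_stack):
    """Return how many frames of the captured Thread Stack belong to the
    "pure Python Stack": everything up to and including the `Handle.run`
    frame, walking from the outermost frame. Returns None when the Event
    Loop is not currently stepping any Task."""
    for depth, frame_name in enumerate(thread_stack):
        if frame_name == "Handle.run":
            return depth + 1  # keep the `run` frame itself
    return None


def stacks_for_tasks(thread_stack, task_stacks):
    """Attach the shared pure Python Stack to every Task's own Stack."""
    depth = pure_python_depth(thread_stack)
    if depth is None:
        # Discrepancy detected: we cannot reliably split, so we skip
        # these Samples (quality over quantity).
        return {}
    prefix = thread_stack[:depth]
    return {name: prefix + stack for name, stack in task_stacks.items()}


# Example: frames listed outermost -> innermost for the sampled Thread.
thread = ["main", "run_until_complete", "Handle.run", "my_coro", "helper"]
tasks = {"Task-1": ["my_coro", "helper"], "Task-2": ["other_coro"]}
print(stacks_for_tasks(thread, tasks)["Task-2"])
# -> ['main', 'run_until_complete', 'Handle.run', 'other_coro']
```

Note how Task-2's Frames never mix with Task-1's: each Task only receives the shared runtime prefix, which is the whole point of the split.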
I also have another PR that should reduce the cost of unwinding Tasks (it uses the fact that we now walk the Python Stack only once): #15789. Hopefully that evens things out.

<img width="1926" height="921" alt="image" src="https://github.com/user-attachments/assets/e21a6698-2b18-43bb-aff9-4e8d59354332" />

(cherry picked from commit 61b1799)
…15854) Backport 61b1799 from #15780 to 4.1.

Co-authored-by: Thomas Kowalski <thomas.kowalski@datadoghq.com>
Force-pushed: 8304d45 to 912e274, then 912e274 to 04aa20b.
## Description

This is a small (but still worthwhile) performance improvement in the context of `asyncio` Task unwinding.

Previously, we would use `unwind_frame` to get the current Frame for an `asyncio` Task. In the case of a running Task, this would also yield all the Python `asyncio` runtime Frames that were "on top" of that Task Frame (and that we would later have to remove from the Stack). Building that Python Stack we don't care about takes some time (because we need to walk the Frame chain), so we should only do it when we need to. Here, we clearly don't need to, as we only care about the Task Frame, so I added a new argument to `unwind_frame` that allows an early exit once a certain Stack depth has been reached (and I set it to `1` when unwinding a Task Frame).
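As a rough illustration of the early-exit idea, here is a hypothetical Python sketch. The real `unwind_frame` is C++ and its exact signature is not shown here; the `Frame` class and names below are assumptions for illustration only.

```python
# Hypothetical sketch of frame unwinding with an early-exit depth limit.

class Frame:
    """A toy frame node; `back` points to the next outer frame, like
    f_back on CPython frame objects."""
    def __init__(self, name, back=None):
        self.name = name
        self.back = back


def unwind_frame(frame, max_depth=None):
    """Walk the frame chain outwards, collecting frame names.

    With max_depth=None the whole chain is walked. With max_depth=1 we
    stop right after the Task's own Frame and never materialize the
    asyncio runtime Frames sitting on top of it, which is all the Task
    unwinder needs."""
    stack = []
    while frame is not None:
        stack.append(frame.name)
        if max_depth is not None and len(stack) >= max_depth:
            break
        frame = frame.back
    return stack


# A running Task: its frame chain continues into asyncio runtime frames.
runtime = Frame("Handle.run", Frame("run_once", Frame("run_forever")))
task_frame = Frame("my_coro", runtime)

print(unwind_frame(task_frame))               # -> 4 frames, runtime included
print(unwind_frame(task_frame, max_depth=1))  # -> ['my_coro']
```

The saving is proportional to the length of the runtime frame chain that is skipped, which is why it adds up when unwinding many Tasks per Sample.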
## Description

Related PRs:

- Echion PR: P403n1x87/echion#199
- Depends on: #15789
- https://datadoghq.atlassian.net/browse/PROF-13106

This PR adds support for _weak links_ between `asyncio` Tasks in the Python Profiler. Weak Links (as opposed to _Strong Links_) are links between the Task that creates another Task and the created Task itself.

We need Weak Links because without them, creating a Task without awaiting it, or without awaiting it _immediately_, results in the created Task appearing as "independent" of anything else (because nothing is awaiting it), which makes us show a separate Stack (really, a whole separate Flame Graph) for it. That isn't great in terms of user experience, as we usually make Task relationships appear in the Flame Graph (the Stack for Task A awaiting Task B is appended on top of the Stack for Task B).

Note that Weak Links are named _Weak Links_ (as opposed to _Strong Links_) because they're only used as a fallback. If a certain Task is awaited by a Task other than the one that created it, the Weak Link is not used (in favour of the "real" `await` link).

---

Here are screenshots of two Flame Graphs, one before and one after, for the following script:

```py
import asyncio


async def func_not_awaited() -> None:
    await asyncio.sleep(0.5)


async def func_awaited() -> None:
    await asyncio.sleep(1)


async def parent() -> asyncio.Task:
    t_not_awaited = asyncio.create_task(func_not_awaited(), name="Task-not_awaited")
    t_awaited = asyncio.create_task(func_awaited(), name="Task-awaited")
    await t_awaited
    # At this point, we have not awaited t_not_awaited but it should have finished
    # before t_awaited as the delay is much shorter.
    # Returning it to avoid the warning on unused variable.
    return t_not_awaited


def main():
    while True:
        asyncio.run(parent())


if __name__ == "__main__":
    main()
```

Before the change: `func_not_awaited` gets its own Flame Graph, outside `parent`.

<img width="1391" height="127" alt="image" src="https://github.com/user-attachments/assets/db9c804d-eb78-43ad-81f3-650f1b11ed72" />

After the change: even though `func_not_awaited` runs in a Task that isn't being awaited by the Task running `parent`, it appears under it because it was created by that coroutine.

<img width="1393" height="128" alt="image" src="https://github.com/user-attachments/assets/580ef206-662a-4d00-8ac4-034b6ca8affb" />

## Testing

I added a unit test and tested in staging.

## Performance

This change should come at very little (or zero) performance cost. We now do more work than we used to in the Python patches (every `create_task` call is instrumented), but that isn't in the _real_ hot path. On the C++ side of things, the processing is slightly more complex (because we need to keep track of Weak Links on top of the links we already tracked), but the complexity class is unchanged, and those parts of the code aren't where we spend the better part of our time today.
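The fallback rule can be sketched as follows. This is a hypothetical Python illustration of the bookkeeping; the actual tracking lives in C++/Echion, and the `TaskLinks` class and method names are assumptions, not the real API.

```python
# Hypothetical sketch of weak-vs-strong link resolution for Task parenting.

class TaskLinks:
    def __init__(self):
        self.strong = {}  # task -> task currently awaiting it (real `await` link)
        self.weak = {}    # task -> task that created it (create_task caller)

    def record_created(self, task, creator):
        self.weak[task] = creator

    def record_awaited(self, task, awaiter):
        self.strong[task] = awaiter

    def parent(self, task):
        """The Strong Link always wins; the Weak Link is only a fallback,
        so a Task awaited by someone other than its creator is attached
        to the awaiter, as described above."""
        if task in self.strong:
            return self.strong[task]
        return self.weak.get(task)


links = TaskLinks()
links.record_created("Task-not_awaited", "Task-parent")
links.record_created("Task-awaited", "Task-parent")
links.record_awaited("Task-awaited", "Task-parent")

print(links.parent("Task-awaited"))      # strong link -> Task-parent
print(links.parent("Task-not_awaited"))  # weak fallback -> Task-parent
```

In the example script above, this is exactly why `func_not_awaited`'s Task now appears under `parent`: nothing awaits it, so the weak (creation) link is used.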