Debug what's going on with pipeline monitor #6576

andrewlock · 2025-01-20T14:15:34Z

Summary of changes

Enable debug for the pipeline monitor

Reason for change

Our pipeline monitor started dropping traces recently, and we want to know why

Implementation details

Moved some config to the pipeline to avoid needing to re-initialize the tracer
Enabled telemetry and debug

Test coverage

This will hopefully give some info

datadog-ddstaging · 2025-01-20T14:37:14Z

Datadog Report

Branch report: andrew/ci/understand-pipeline-errors
Commit report: 496a4a8
Test service: dd-trace-dotnet

✅ 0 Failed, 242283 Passed, 1984 Skipped, 19h 0m 56.48s Total Time

andrewlock · 2025-01-20T15:10:28Z

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

Welch test with statistical test for significance of 5%
Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (69ms)  : 67, 71
     .   : milestone, 69,
    master - mean (69ms)  : 66, 72
     .   : milestone, 69,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (985ms)  : 961, 1009
     .   : milestone, 985,
    master - mean (982ms)  : 957, 1007
     .   : milestone, 982,

gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (108ms)  : 105, 111
     .   : milestone, 108,
    master - mean (108ms)  : 105, 110
     .   : milestone, 108,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (680ms)  : 666, 694
     .   : milestone, 680,
    master - mean (681ms)  : 666, 696
     .   : milestone, 681,

gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (91ms)  : 89, 93
     .   : milestone, 91,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (635ms)  : 621, 649
     .   : milestone, 635,
    master - mean (634ms)  : 616, 651
     .   : milestone, 634,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (189ms)  : 186, 193
     .   : milestone, 189,
    master - mean (189ms)  : 184, 193
     .   : milestone, 189,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (1,086ms)  : 1052, 1119
     .   : milestone, 1086,
    master - mean (1,082ms)  : 1051, 1114
     .   : milestone, 1082,

gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (275ms)  : 271, 280
     .   : milestone, 275,
    master - mean (276ms)  : 271, 281
     .   : milestone, 276,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (868ms)  : 842, 894
     .   : milestone, 868,
    master - mean (866ms)  : 835, 897
     .   : milestone, 866,

gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (264ms)  : 259, 268
     .   : milestone, 264,
    master - mean (263ms)  : 259, 266
     .   : milestone, 263,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (843ms)  : 810, 876
     .   : milestone, 843,
    master - mean (848ms)  : 810, 886
     .   : milestone, 848,

andrewlock · 2025-01-20T15:48:34Z

Benchmarks Report for tracer 🐌

Benchmarks for #6576 compared to master:

3 benchmarks are slower, with geometric mean 1.130
All benchmarks have the same allocations

The following thresholds were used for comparing the benchmark speeds:

Mann–Whitney U test with statistical test for significance of 5%
Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Gen 2	Allocated
master	`StartStopWithChild`	net6.0	8.17μs	46ns	312ns	0.0159	0.00397	0	5.61 KB
master	`StartStopWithChild`	netcoreapp3.1	10.2μs	56.5ns	348ns	0.0197	0.00986	0	5.8 KB
master	`StartStopWithChild`	net472	16μs	43ns	161ns	1.05	0.322	0.0967	6.22 KB
#6576	`StartStopWithChild`	net6.0	7.9μs	43.9ns	260ns	0.019	0.00761	0	5.61 KB
#6576	`StartStopWithChild`	netcoreapp3.1	9.88μs	53.2ns	305ns	0.0295	0.0147	0	5.79 KB
#6576	`StartStopWithChild`	net472	16.3μs	67.3ns	261ns	1.03	0.295	0.082	6.2 KB

Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`WriteAndFlushEnrichedTraces`	net6.0	526μs	1.46μs	5.65μs	0	2.7 KB
master	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	652μs	189ns	653ns	0	2.7 KB
master	`WriteAndFlushEnrichedTraces`	net472	837μs	406ns	1.57μs	0.417	3.3 KB
#6576	`WriteAndFlushEnrichedTraces`	net6.0	487μs	304ns	1.1μs	0	2.7 KB
#6576	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	657μs	343ns	1.24μs	0	2.7 KB
#6576	`WriteAndFlushEnrichedTraces`	net472	851μs	572ns	2.14μs	0.425	3.3 KB

Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendRequest`	net6.0	133μs	506ns	1.96μs	0.186	14.47 KB
master	`SendRequest`	netcoreapp3.1	147μs	333ns	1.29μs	0.218	17.27 KB
master	`SendRequest`	net472	0.00622ns	0.00194ns	0.00751ns	0	0 b
#6576	`SendRequest`	net6.0	129μs	462ns	1.79μs	0.189	14.47 KB
#6576	`SendRequest`	netcoreapp3.1	146μs	279ns	1.08μs	0.218	17.27 KB
#6576	`SendRequest`	net472	0.00292ns	0.00114ns	0.00442ns	0	0 b

Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Gen 2	Allocated
master	`WriteAndFlushEnrichedTraces`	net6.0	550μs	2.07μs	7.76μs	0.534	0	0	41.66 KB
master	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	683μs	3.81μs	29.7μs	0.322	0	0	41.73 KB
master	`WriteAndFlushEnrichedTraces`	net472	839μs	3.82μs	14.8μs	8.45	2.53	0.422	53.29 KB
#6576	`WriteAndFlushEnrichedTraces`	net6.0	566μs	3.12μs	18.4μs	0.558	0	0	41.53 KB
#6576	`WriteAndFlushEnrichedTraces`	netcoreapp3.1	663μs	3.58μs	19.3μs	0.334	0	0	41.71 KB
#6576	`WriteAndFlushEnrichedTraces`	net472	835μs	2.07μs	7.48μs	8.28	2.48	0.414	53.3 KB

Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`ExecuteNonQuery`	net6.0	1.32μs	0.563ns	2.03ns	0.014	0	1.02 KB
master	`ExecuteNonQuery`	netcoreapp3.1	1.77μs	1.87ns	7.23ns	0.0133	0	1.02 KB
master	`ExecuteNonQuery`	net472	2.07μs	2.57ns	9.95ns	0.157	0.00104	987 B
#6576	`ExecuteNonQuery`	net6.0	1.28μs	1ns	3.88ns	0.0141	0	1.02 KB
#6576	`ExecuteNonQuery`	netcoreapp3.1	1.74μs	2.43ns	9.41ns	0.0138	0	1.02 KB
#6576	`ExecuteNonQuery`	net472	2.05μs	2.19ns	8.49ns	0.157	0.00102	987 B

Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`CallElasticsearch`	net6.0	1.18μs	0.872ns	3.38ns	0.0135	976 B
master	`CallElasticsearch`	netcoreapp3.1	1.5μs	0.619ns	2.32ns	0.0135	976 B
master	`CallElasticsearch`	net472	2.55μs	1.56ns	5.82ns	0.158	995 B
master	`CallElasticsearchAsync`	net6.0	1.27μs	0.712ns	2.66ns	0.0133	952 B
master	`CallElasticsearchAsync`	netcoreapp3.1	1.59μs	1.04ns	4.03ns	0.0144	1.02 KB
master	`CallElasticsearchAsync`	net472	2.59μs	1.31ns	4.9ns	0.167	1.05 KB
#6576	`CallElasticsearch`	net6.0	1.22μs	0.504ns	1.89ns	0.0138	976 B
#6576	`CallElasticsearch`	netcoreapp3.1	1.53μs	0.985ns	3.68ns	0.013	976 B
#6576	`CallElasticsearch`	net472	2.64μs	2.28ns	8.85ns	0.158	995 B
#6576	`CallElasticsearchAsync`	net6.0	1.17μs	1.73ns	6.7ns	0.0135	952 B
#6576	`CallElasticsearchAsync`	netcoreapp3.1	1.69μs	0.694ns	2.6ns	0.0133	1.02 KB
#6576	`CallElasticsearchAsync`	net472	2.65μs	1.01ns	3.79ns	0.167	1.05 KB

Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`ExecuteAsync`	net6.0	1.21μs	0.593ns	2.22ns	0.013	952 B
master	`ExecuteAsync`	netcoreapp3.1	1.77μs	1.17ns	4.4ns	0.0131	952 B
master	`ExecuteAsync`	net472	1.89μs	0.508ns	1.97ns	0.145	915 B
#6576	`ExecuteAsync`	net6.0	1.26μs	0.518ns	1.94ns	0.0133	952 B
#6576	`ExecuteAsync`	netcoreapp3.1	1.71μs	0.777ns	2.91ns	0.0128	952 B
#6576	`ExecuteAsync`	net472	1.88μs	0.435ns	1.69ns	0.145	915 B

Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendAsync`	net6.0	4.52μs	1.39ns	5.21ns	0.0315	2.31 KB
master	`SendAsync`	netcoreapp3.1	5.3μs	4.79ns	18.5ns	0.0368	2.85 KB
master	`SendAsync`	net472	7.4μs	1.43ns	5.35ns	0.495	3.12 KB
#6576	`SendAsync`	net6.0	4.3μs	2.11ns	7.9ns	0.0325	2.31 KB
#6576	`SendAsync`	netcoreapp3.1	5.3μs	1.63ns	6.1ns	0.0372	2.85 KB
#6576	`SendAsync`	net472	7.48μs	2.46ns	9.52ns	0.494	3.12 KB

Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	1.45μs	0.98ns	3.67ns	0.0231	1.64 KB
master	`EnrichedLog`	netcoreapp3.1	2.12μs	1.62ns	6.07ns	0.0222	1.64 KB
master	`EnrichedLog`	net472	2.74μs	1.5ns	5.8ns	0.25	1.57 KB
#6576	`EnrichedLog`	net6.0	1.47μs	1.16ns	4.48ns	0.0227	1.64 KB
#6576	`EnrichedLog`	netcoreapp3.1	2.09μs	2.35ns	8.81ns	0.0218	1.64 KB
#6576	`EnrichedLog`	net472	2.64μs	0.731ns	2.83ns	0.249	1.57 KB

Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Gen 1	Allocated
master	`EnrichedLog`	net6.0	114μs	189ns	732ns	0.0569	0	4.28 KB
master	`EnrichedLog`	netcoreapp3.1	119μs	206ns	798ns	0	0	4.28 KB
master	`EnrichedLog`	net472	149μs	72.5ns	271ns	0.668	0.223	4.46 KB
#6576	`EnrichedLog`	net6.0	116μs	171ns	661ns	0.0578	0	4.28 KB
#6576	`EnrichedLog`	netcoreapp3.1	120μs	142ns	551ns	0	0	4.28 KB
#6576	`EnrichedLog`	net472	151μs	78.1ns	302ns	0.678	0.226	4.46 KB

Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	3.01μs	1.07ns	4.13ns	0.0305	2.2 KB
master	`EnrichedLog`	netcoreapp3.1	4.23μs	1.89ns	7.32ns	0.0296	2.2 KB
master	`EnrichedLog`	net472	4.84μs	0.408ns	1.47ns	0.32	2.02 KB
#6576	`EnrichedLog`	net6.0	3μs	2.92ns	11.3ns	0.0299	2.2 KB
#6576	`EnrichedLog`	netcoreapp3.1	4.12μs	2.07ns	8.02ns	0.0289	2.2 KB
#6576	`EnrichedLog`	net472	4.82μs	1.47ns	5.31ns	0.319	2.02 KB

Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`SendReceive`	net6.0	1.33μs	1.18ns	4.41ns	0.0159	1.14 KB
master	`SendReceive`	netcoreapp3.1	1.75μs	5.92ns	22.9ns	0.0158	1.14 KB
master	`SendReceive`	net472	2.12μs	0.722ns	2.8ns	0.183	1.16 KB
#6576	`SendReceive`	net6.0	1.28μs	0.973ns	3.51ns	0.016	1.14 KB
#6576	`SendReceive`	netcoreapp3.1	1.85μs	0.88ns	3.29ns	0.0148	1.14 KB
#6576	`SendReceive`	net472	2.11μs	0.714ns	2.67ns	0.183	1.16 KB

Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`EnrichedLog`	net6.0	2.71μs	0.648ns	2.34ns	0.0216	1.6 KB
master	`EnrichedLog`	netcoreapp3.1	3.85μs	1.65ns	5.94ns	0.0214	1.65 KB
master	`EnrichedLog`	net472	4.43μs	3.58ns	13.8ns	0.324	2.04 KB
#6576	`EnrichedLog`	net6.0	2.72μs	1.14ns	4.43ns	0.0217	1.6 KB
#6576	`EnrichedLog`	netcoreapp3.1	4.01μs	2.45ns	9.48ns	0.0221	1.65 KB
#6576	`EnrichedLog`	net472	4.38μs	3.06ns	11.9ns	0.323	2.04 KB

Benchmarks.Trace.SpanBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #6576

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net6.0	1.135	478.06	542.46
Benchmarks.Trace.SpanBenchmark.StartFinishSpan‑net6.0	1.129	400.90	452.46

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`StartFinishSpan`	net6.0	401ns	0.496ns	1.92ns	0.00815	576 B
master	`StartFinishSpan`	netcoreapp3.1	568ns	1.8ns	6.98ns	0.00775	576 B
master	`StartFinishSpan`	net472	604ns	0.974ns	3.77ns	0.0918	578 B
master	`StartFinishScope`	net6.0	477ns	0.702ns	2.72ns	0.00967	696 B
master	`StartFinishScope`	netcoreapp3.1	665ns	1.03ns	3.97ns	0.00935	696 B
master	`StartFinishScope`	net472	874ns	1.41ns	5.47ns	0.104	658 B
#6576	`StartFinishSpan`	net6.0	452ns	0.799ns	2.99ns	0.00802	576 B
#6576	`StartFinishSpan`	netcoreapp3.1	621ns	0.471ns	1.7ns	0.00775	576 B
#6576	`StartFinishSpan`	net472	630ns	0.49ns	1.9ns	0.0917	578 B
#6576	`StartFinishScope`	net6.0	543ns	0.683ns	2.64ns	0.00973	696 B
#6576	`StartFinishScope`	netcoreapp3.1	717ns	0.944ns	3.66ns	0.00957	696 B
#6576	`StartFinishScope`	net472	788ns	2.74ns	10.6ns	0.104	658 B

Benchmarks.Trace.TraceAnnotationsBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #6576

Benchmark	diff/base	Base Median (ns)	Diff Median (ns)	Modality
Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin‑net6.0	1.127	646.22	728.39

Raw results

Branch	Method	Toolchain	Mean	StdError	StdDev	Gen 0	Allocated
master	`RunOnMethodBegin`	net6.0	646ns	0.639ns	2.48ns	0.00962	696 B
master	`RunOnMethodBegin`	netcoreapp3.1	905ns	1.14ns	4.43ns	0.00912	696 B
master	`RunOnMethodBegin`	net472	1.11μs	2.23ns	8.64ns	0.104	658 B
#6576	`RunOnMethodBegin`	net6.0	728ns	0.983ns	3.81ns	0.00957	696 B
#6576	`RunOnMethodBegin`	netcoreapp3.1	905ns	1.87ns	7.23ns	0.00912	696 B
#6576	`RunOnMethodBegin`	net472	1.12μs	2.02ns	7.54ns	0.104	658 B

## Summary of changes Tries to fix the pipeline monitor ## Reason for change The pipeline monitor job we run at the end of our pipeline, which drives some of our dashboards, started failing on master for some reason recently. #6576 added some debug logs to try to figure out what's going on. As expected, the issue was that we were creating giant traces, and the spans were being dropped 😅 ## Implementation details - Fix the sample rate config value (it's a value from 0-1, setting it to 100 is a bit over the top) - Disable debug logs (it was actually the instrumentation telemetry that allowed me to diagnose this) - Enable partial flush ## Test coverage This is the test ## Other details We could consider adding a debug log mentioning when an overfull buffer is encountered. Currently we log that spans were dropped, but not explicitly _why_ in this case. There's an argument for not doing it though - if an app is generating sufficient spans to hit this limit, it's possible enabling debug logs to try to find this issue could cause their server to fall over. Then again, one _more_ log probably isn't going to make the difference as to whether they _can_ enable it or not, so maybe best to do it anyway?

Debug what's going on with pipeline monitor

496a4a8

andrewlock added the area:builds project files, build scripts, pipelines, versioning, releases, packages label Jan 20, 2025

andrewlock requested a review from a team as a code owner January 20, 2025 14:15

andrewlock force-pushed the andrew/ci/understand-pipeline-errors branch from cd5582d to 496a4a8 Compare January 20, 2025 14:33

andrewlock requested a review from a team as a code owner January 20, 2025 14:33

kevingosse approved these changes Jan 21, 2025

View reviewed changes

andrewlock merged commit bede7cc into master Jan 22, 2025
125 of 127 checks passed

andrewlock deleted the andrew/ci/understand-pipeline-errors branch January 22, 2025 13:52

github-actions bot added this to the vNext-v3 milestone Jan 22, 2025

andrewlock mentioned this pull request Feb 3, 2025

Try fix pipeline monitor #6620

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Debug what's going on with pipeline monitor #6576

Debug what's going on with pipeline monitor #6576

andrewlock commented Jan 20, 2025

datadog-ddstaging bot commented Jan 20, 2025 •

edited

Loading

andrewlock commented Jan 20, 2025

andrewlock commented Jan 20, 2025

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Slower ⚠️ in #6576

Raw results

Slower ⚠️ in #6576

Raw results

Debug what's going on with pipeline monitor #6576

Debug what's going on with pipeline monitor #6576

Conversation

andrewlock commented Jan 20, 2025

Summary of changes

Reason for change

Implementation details

Test coverage

datadog-ddstaging bot commented Jan 20, 2025 • edited Loading

Datadog Report

andrewlock commented Jan 20, 2025

Execution-Time Benchmarks Report ⏱️

andrewlock commented Jan 20, 2025

Benchmarks Report for tracer 🐌

Benchmark details

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Raw results

Slower ⚠️ in #6576

Raw results

Slower ⚠️ in #6576

Raw results

datadog-ddstaging bot commented Jan 20, 2025 •

edited

Loading