Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug what's going on with pipeline monitor #6576

Merged
merged 1 commit into from
Jan 22, 2025

Conversation

andrewlock
Copy link
Member

Summary of changes

Enable debug for the pipeline monitor

Reason for change

Our pipeline monitor started dropping traces recently, and we want to know why

Implementation details

  • Moved some config to the pipeline to avoid needing to re-initialize the tracer
  • Enabled telemetry and debug

Test coverage

This will hopefully give some info

@andrewlock andrewlock added the area:builds project files, build scripts, pipelines, versioning, releases, packages label Jan 20, 2025
@andrewlock andrewlock requested a review from a team as a code owner January 20, 2025 14:15
@andrewlock andrewlock force-pushed the andrew/ci/understand-pipeline-errors branch from cd5582d to 496a4a8 Compare January 20, 2025 14:33
@andrewlock andrewlock requested a review from a team as a code owner January 20, 2025 14:33
@datadog-ddstaging
Copy link

datadog-ddstaging bot commented Jan 20, 2025

Datadog Report

Branch report: andrew/ci/understand-pipeline-errors
Commit report: 496a4a8
Test service: dd-trace-dotnet

✅ 0 Failed, 242283 Passed, 1984 Skipped, 19h 0m 56.48s Total Time

@andrewlock
Copy link
Member Author

Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing the following branches/commits:

Execution-time benchmarks measure the whole time it takes to execute a program. And are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are shown in red. The following thresholds were used for comparing the execution times:

  • Welch test with statistical test for significance of 5%
  • Only results indicating a difference greater than 5% and 5 ms are considered.

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard.

Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (69ms)  : 67, 71
     .   : milestone, 69,
    master - mean (69ms)  : 66, 72
     .   : milestone, 69,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (985ms)  : 961, 1009
     .   : milestone, 985,
    master - mean (982ms)  : 957, 1007
     .   : milestone, 982,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (108ms)  : 105, 111
     .   : milestone, 108,
    master - mean (108ms)  : 105, 110
     .   : milestone, 108,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (680ms)  : 666, 694
     .   : milestone, 680,
    master - mean (681ms)  : 666, 696
     .   : milestone, 681,

Loading
gantt
    title Execution time (ms) FakeDbCommand (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (91ms)  : 89, 93
     .   : milestone, 91,
    master - mean (92ms)  : 90, 94
     .   : milestone, 92,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (635ms)  : 621, 649
     .   : milestone, 635,
    master - mean (634ms)  : 616, 651
     .   : milestone, 634,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.6.2) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (189ms)  : 186, 193
     .   : milestone, 189,
    master - mean (189ms)  : 184, 193
     .   : milestone, 189,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (1,086ms)  : 1052, 1119
     .   : milestone, 1086,
    master - mean (1,082ms)  : 1051, 1114
     .   : milestone, 1082,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (275ms)  : 271, 280
     .   : milestone, 275,
    master - mean (276ms)  : 271, 281
     .   : milestone, 276,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (868ms)  : 842, 894
     .   : milestone, 868,
    master - mean (866ms)  : 835, 897
     .   : milestone, 866,

Loading
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6) 
    dateFormat  X
    axisFormat %s
    todayMarker off
    section Baseline
    This PR (6576) - mean (264ms)  : 259, 268
     .   : milestone, 264,
    master - mean (263ms)  : 259, 266
     .   : milestone, 263,

    section CallTarget+Inlining+NGEN
    This PR (6576) - mean (843ms)  : 810, 876
     .   : milestone, 843,
    master - mean (848ms)  : 810, 886
     .   : milestone, 848,

Loading

@andrewlock
Copy link
Member Author

Benchmarks Report for tracer 🐌

Benchmarks for #6576 compared to master:

  • 3 benchmarks are slower, with geometric mean 1.130
  • All benchmarks have the same allocations

The following thresholds were used for comparing the benchmark speeds:

  • Mann–Whitney U test with statistical test for significance of 5%
  • Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.ActivityBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartStopWithChild net6.0 8.17μs 46ns 312ns 0.0159 0.00397 0 5.61 KB
master StartStopWithChild netcoreapp3.1 10.2μs 56.5ns 348ns 0.0197 0.00986 0 5.8 KB
master StartStopWithChild net472 16μs 43ns 161ns 1.05 0.322 0.0967 6.22 KB
#6576 StartStopWithChild net6.0 7.9μs 43.9ns 260ns 0.019 0.00761 0 5.61 KB
#6576 StartStopWithChild netcoreapp3.1 9.88μs 53.2ns 305ns 0.0295 0.0147 0 5.79 KB
#6576 StartStopWithChild net472 16.3μs 67.3ns 261ns 1.03 0.295 0.082 6.2 KB
Benchmarks.Trace.AgentWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 526μs 1.46μs 5.65μs 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 652μs 189ns 653ns 0 0 0 2.7 KB
master WriteAndFlushEnrichedTraces net472 837μs 406ns 1.57μs 0.417 0 0 3.3 KB
#6576 WriteAndFlushEnrichedTraces net6.0 487μs 304ns 1.1μs 0 0 0 2.7 KB
#6576 WriteAndFlushEnrichedTraces netcoreapp3.1 657μs 343ns 1.24μs 0 0 0 2.7 KB
#6576 WriteAndFlushEnrichedTraces net472 851μs 572ns 2.14μs 0.425 0 0 3.3 KB
Benchmarks.Trace.AspNetCoreBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendRequest net6.0 133μs 506ns 1.96μs 0.186 0 0 14.47 KB
master SendRequest netcoreapp3.1 147μs 333ns 1.29μs 0.218 0 0 17.27 KB
master SendRequest net472 0.00622ns 0.00194ns 0.00751ns 0 0 0 0 b
#6576 SendRequest net6.0 129μs 462ns 1.79μs 0.189 0 0 14.47 KB
#6576 SendRequest netcoreapp3.1 146μs 279ns 1.08μs 0.218 0 0 17.27 KB
#6576 SendRequest net472 0.00292ns 0.00114ns 0.00442ns 0 0 0 0 b
Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net6.0 550μs 2.07μs 7.76μs 0.534 0 0 41.66 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 683μs 3.81μs 29.7μs 0.322 0 0 41.73 KB
master WriteAndFlushEnrichedTraces net472 839μs 3.82μs 14.8μs 8.45 2.53 0.422 53.29 KB
#6576 WriteAndFlushEnrichedTraces net6.0 566μs 3.12μs 18.4μs 0.558 0 0 41.53 KB
#6576 WriteAndFlushEnrichedTraces netcoreapp3.1 663μs 3.58μs 19.3μs 0.334 0 0 41.71 KB
#6576 WriteAndFlushEnrichedTraces net472 835μs 2.07μs 7.48μs 8.28 2.48 0.414 53.3 KB
Benchmarks.Trace.DbCommandBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteNonQuery net6.0 1.32μs 0.563ns 2.03ns 0.014 0 0 1.02 KB
master ExecuteNonQuery netcoreapp3.1 1.77μs 1.87ns 7.23ns 0.0133 0 0 1.02 KB
master ExecuteNonQuery net472 2.07μs 2.57ns 9.95ns 0.157 0.00104 0 987 B
#6576 ExecuteNonQuery net6.0 1.28μs 1ns 3.88ns 0.0141 0 0 1.02 KB
#6576 ExecuteNonQuery netcoreapp3.1 1.74μs 2.43ns 9.41ns 0.0138 0 0 1.02 KB
#6576 ExecuteNonQuery net472 2.05μs 2.19ns 8.49ns 0.157 0.00102 0 987 B
Benchmarks.Trace.ElasticsearchBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master CallElasticsearch net6.0 1.18μs 0.872ns 3.38ns 0.0135 0 0 976 B
master CallElasticsearch netcoreapp3.1 1.5μs 0.619ns 2.32ns 0.0135 0 0 976 B
master CallElasticsearch net472 2.55μs 1.56ns 5.82ns 0.158 0 0 995 B
master CallElasticsearchAsync net6.0 1.27μs 0.712ns 2.66ns 0.0133 0 0 952 B
master CallElasticsearchAsync netcoreapp3.1 1.59μs 1.04ns 4.03ns 0.0144 0 0 1.02 KB
master CallElasticsearchAsync net472 2.59μs 1.31ns 4.9ns 0.167 0 0 1.05 KB
#6576 CallElasticsearch net6.0 1.22μs 0.504ns 1.89ns 0.0138 0 0 976 B
#6576 CallElasticsearch netcoreapp3.1 1.53μs 0.985ns 3.68ns 0.013 0 0 976 B
#6576 CallElasticsearch net472 2.64μs 2.28ns 8.85ns 0.158 0 0 995 B
#6576 CallElasticsearchAsync net6.0 1.17μs 1.73ns 6.7ns 0.0135 0 0 952 B
#6576 CallElasticsearchAsync netcoreapp3.1 1.69μs 0.694ns 2.6ns 0.0133 0 0 1.02 KB
#6576 CallElasticsearchAsync net472 2.65μs 1.01ns 3.79ns 0.167 0 0 1.05 KB
Benchmarks.Trace.GraphQLBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteAsync net6.0 1.21μs 0.593ns 2.22ns 0.013 0 0 952 B
master ExecuteAsync netcoreapp3.1 1.77μs 1.17ns 4.4ns 0.0131 0 0 952 B
master ExecuteAsync net472 1.89μs 0.508ns 1.97ns 0.145 0 0 915 B
#6576 ExecuteAsync net6.0 1.26μs 0.518ns 1.94ns 0.0133 0 0 952 B
#6576 ExecuteAsync netcoreapp3.1 1.71μs 0.777ns 2.91ns 0.0128 0 0 952 B
#6576 ExecuteAsync net472 1.88μs 0.435ns 1.69ns 0.145 0 0 915 B
Benchmarks.Trace.HttpClientBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendAsync net6.0 4.52μs 1.39ns 5.21ns 0.0315 0 0 2.31 KB
master SendAsync netcoreapp3.1 5.3μs 4.79ns 18.5ns 0.0368 0 0 2.85 KB
master SendAsync net472 7.4μs 1.43ns 5.35ns 0.495 0 0 3.12 KB
#6576 SendAsync net6.0 4.3μs 2.11ns 7.9ns 0.0325 0 0 2.31 KB
#6576 SendAsync netcoreapp3.1 5.3μs 1.63ns 6.1ns 0.0372 0 0 2.85 KB
#6576 SendAsync net472 7.48μs 2.46ns 9.52ns 0.494 0 0 3.12 KB
Benchmarks.Trace.ILoggerBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 1.45μs 0.98ns 3.67ns 0.0231 0 0 1.64 KB
master EnrichedLog netcoreapp3.1 2.12μs 1.62ns 6.07ns 0.0222 0 0 1.64 KB
master EnrichedLog net472 2.74μs 1.5ns 5.8ns 0.25 0 0 1.57 KB
#6576 EnrichedLog net6.0 1.47μs 1.16ns 4.48ns 0.0227 0 0 1.64 KB
#6576 EnrichedLog netcoreapp3.1 2.09μs 2.35ns 8.81ns 0.0218 0 0 1.64 KB
#6576 EnrichedLog net472 2.64μs 0.731ns 2.83ns 0.249 0 0 1.57 KB
Benchmarks.Trace.Log4netBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 114μs 189ns 732ns 0.0569 0 0 4.28 KB
master EnrichedLog netcoreapp3.1 119μs 206ns 798ns 0 0 0 4.28 KB
master EnrichedLog net472 149μs 72.5ns 271ns 0.668 0.223 0 4.46 KB
#6576 EnrichedLog net6.0 116μs 171ns 661ns 0.0578 0 0 4.28 KB
#6576 EnrichedLog netcoreapp3.1 120μs 142ns 551ns 0 0 0 4.28 KB
#6576 EnrichedLog net472 151μs 78.1ns 302ns 0.678 0.226 0 4.46 KB
Benchmarks.Trace.NLogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 3.01μs 1.07ns 4.13ns 0.0305 0 0 2.2 KB
master EnrichedLog netcoreapp3.1 4.23μs 1.89ns 7.32ns 0.0296 0 0 2.2 KB
master EnrichedLog net472 4.84μs 0.408ns 1.47ns 0.32 0 0 2.02 KB
#6576 EnrichedLog net6.0 3μs 2.92ns 11.3ns 0.0299 0 0 2.2 KB
#6576 EnrichedLog netcoreapp3.1 4.12μs 2.07ns 8.02ns 0.0289 0 0 2.2 KB
#6576 EnrichedLog net472 4.82μs 1.47ns 5.31ns 0.319 0 0 2.02 KB
Benchmarks.Trace.RedisBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendReceive net6.0 1.33μs 1.18ns 4.41ns 0.0159 0 0 1.14 KB
master SendReceive netcoreapp3.1 1.75μs 5.92ns 22.9ns 0.0158 0 0 1.14 KB
master SendReceive net472 2.12μs 0.722ns 2.8ns 0.183 0 0 1.16 KB
#6576 SendReceive net6.0 1.28μs 0.973ns 3.51ns 0.016 0 0 1.14 KB
#6576 SendReceive netcoreapp3.1 1.85μs 0.88ns 3.29ns 0.0148 0 0 1.14 KB
#6576 SendReceive net472 2.11μs 0.714ns 2.67ns 0.183 0 0 1.16 KB
Benchmarks.Trace.SerilogBenchmark - Same speed ✔️ Same allocations ✔️

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net6.0 2.71μs 0.648ns 2.34ns 0.0216 0 0 1.6 KB
master EnrichedLog netcoreapp3.1 3.85μs 1.65ns 5.94ns 0.0214 0 0 1.65 KB
master EnrichedLog net472 4.43μs 3.58ns 13.8ns 0.324 0 0 2.04 KB
#6576 EnrichedLog net6.0 2.72μs 1.14ns 4.43ns 0.0217 0 0 1.6 KB
#6576 EnrichedLog netcoreapp3.1 4.01μs 2.45ns 9.48ns 0.0221 0 0 1.65 KB
#6576 EnrichedLog net472 4.38μs 3.06ns 11.9ns 0.323 0 0 2.04 KB
Benchmarks.Trace.SpanBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #6576

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.SpanBenchmark.StartFinishScope‑net6.0 1.135 478.06 542.46
Benchmarks.Trace.SpanBenchmark.StartFinishSpan‑net6.0 1.129 400.90 452.46

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartFinishSpan net6.0 401ns 0.496ns 1.92ns 0.00815 0 0 576 B
master StartFinishSpan netcoreapp3.1 568ns 1.8ns 6.98ns 0.00775 0 0 576 B
master StartFinishSpan net472 604ns 0.974ns 3.77ns 0.0918 0 0 578 B
master StartFinishScope net6.0 477ns 0.702ns 2.72ns 0.00967 0 0 696 B
master StartFinishScope netcoreapp3.1 665ns 1.03ns 3.97ns 0.00935 0 0 696 B
master StartFinishScope net472 874ns 1.41ns 5.47ns 0.104 0 0 658 B
#6576 StartFinishSpan net6.0 452ns 0.799ns 2.99ns 0.00802 0 0 576 B
#6576 StartFinishSpan netcoreapp3.1 621ns 0.471ns 1.7ns 0.00775 0 0 576 B
#6576 StartFinishSpan net472 630ns 0.49ns 1.9ns 0.0917 0 0 578 B
#6576 StartFinishScope net6.0 543ns 0.683ns 2.64ns 0.00973 0 0 696 B
#6576 StartFinishScope netcoreapp3.1 717ns 0.944ns 3.66ns 0.00957 0 0 696 B
#6576 StartFinishScope net472 788ns 2.74ns 10.6ns 0.104 0 0 658 B
Benchmarks.Trace.TraceAnnotationsBenchmark - Slower ⚠️ Same allocations ✔️

Slower ⚠️ in #6576

Benchmark diff/base Base Median (ns) Diff Median (ns) Modality
Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin‑net6.0 1.127 646.22 728.39

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master RunOnMethodBegin net6.0 646ns 0.639ns 2.48ns 0.00962 0 0 696 B
master RunOnMethodBegin netcoreapp3.1 905ns 1.14ns 4.43ns 0.00912 0 0 696 B
master RunOnMethodBegin net472 1.11μs 2.23ns 8.64ns 0.104 0 0 658 B
#6576 RunOnMethodBegin net6.0 728ns 0.983ns 3.81ns 0.00957 0 0 696 B
#6576 RunOnMethodBegin netcoreapp3.1 905ns 1.87ns 7.23ns 0.00912 0 0 696 B
#6576 RunOnMethodBegin net472 1.12μs 2.02ns 7.54ns 0.104 0 0 658 B

@andrewlock andrewlock merged commit bede7cc into master Jan 22, 2025
125 of 127 checks passed
@andrewlock andrewlock deleted the andrew/ci/understand-pipeline-errors branch January 22, 2025 13:52
@github-actions github-actions bot added this to the vNext-v3 milestone Jan 22, 2025
andrewlock added a commit that referenced this pull request Feb 4, 2025
## Summary of changes

Tries to fix the pipeline monitor

## Reason for change

The pipeline monitor job we run at the end of our pipeline, which drives
some of our dashboards, started failing on master for some reason
recently. #6576 added some debug logs to try to figure out what's going
on. As expected, the issue was that we were creating giant traces, and
the spans were being dropped 😅

## Implementation details

- Fix the sample rate config value (it's a value from 0-1, setting it to
100 is a bit over the top)
- Disable debug logs (it was actually the instrumentation telemetry that
allowed me to diagnose this)
- Enable partial flush

## Test coverage

This is the test

## Other details

We could consider adding a debug log mentioning when an overfull buffer
is encountered. Currently we log that spans were dropped, but not
explicitly _why_ in this case. There's an argument for not doing it
though - if an app is generating sufficient spans to hit this limit,
it's possible enabling debug logs to try to find this issue could cause
their server to fall over. Then again, one _more_ log probably isn't
going to make the difference as to whether they _can_ enable it or not,
so maybe best to do it anyway?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:builds project files, build scripts, pipelines, versioning, releases, packages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants