-
Notifications
You must be signed in to change notification settings - Fork 398
DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork #5159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Typing analysisNote: Ignored files are excluded from the next sections.
|
60cf81b to
c3dbf11
Compare
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage 🔗 Commit SHA: 9433984 | Docs | Datadog PR Page | Was this helpful? Give us feedback! |
c3dbf11 to
58a925a
Compare
c70c900 to
fef8703
Compare
681f596 to
953ab61
Compare
953ab61 to
ddb2e5e
Compare
2921959 to
10b5119
Compare
sig/datadog/core/telemetry/event/synth_app_client_configuration_change.rbs
Show resolved
Hide resolved
|
The tests I added are still flaking in CI. The most recent failure was on ruby 2.5 and I have seen previous failures on 2.5 also. I spent ~3 weeks trying to figure out the root cause and the problem is that once I start adding diagnostics the failures disappear. So, after 3 weeks, I am skipping the tests on 2.5 and if they fail on anything below 3.0 I will skip all of those versions also. |
* master: (129 commits) Transports: remove api_version (#5164) DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork (#5159) Ignore "leaked" pipe file descriptors in JRuby, improve diagnostics (#5188) debug-4548 Increase number of iterations for flakiness (#5184) [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20684824141 [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20616292456 (#5190) downgrade ffi for ruby 4.0 & 2.5 (#5189) Fix ruby warnings when accessing undefined instance variables (#5178) DEBUG-3499 DI: fix accounting when intrumenting upon class definition, add instr… (#5168) DEBUG-3499 RC: add diagnostics for invalid values (#5167) DEBUG-4548 Core: fix worker shutdown race (#5176) Retry system-test build (#5181) Fix Baggage type check (#5182) [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20487829791 (#5183) [🤖] Update Latest Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/20401889084 (#5180) DEBUG-3499 DI: do not instrument when there is already an installed probe with the same id (#5169) DEBUG-3499 DI: rework RC interface (#5165) set DI test duration upper bound to 1000 seconds (#5161) add missing supported config default value [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20401907816 (#5179) ...
* master: Transports: remove api_version (#5164) DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork (#5159) Ignore "leaked" pipe file descriptors in JRuby, improve diagnostics (#5188) debug-4548 Increase number of iterations for flakiness (#5184) [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20684824141
Worker race fix (described in https://github.com/DataDog/ruby-guild/issues/279 and originally reported as performance regression) has been extracted into #5176.
What does this PR do?
Restores #5074 and implements telemetry metrics reset after fork, which was causing wrong metric counts in datadog-ci end to end tests.
The metrics reset is done via the "at fork monkey patch". Existing worker race fix (in #5176) used the worker after-fork treatment - this PR consolidates after fork logic into the monkey patch handler.
Motivation:
Dynamic Instrumentation / Live Debugger require telemetry app-heartbeat events to properly render UI. These events are normally sent from forked children in forking web servers, and presently are missing for most customers.
Change log entry
Yes: fix Live Debugger / Dynamic Instrumentation UI for forking web servers
Additional Notes:
The original PR and the race fix are in separate commits for ease of review.
This PR now uses "at fork monkey patch" to reset the metrics manager after fork. The added tests caused existing tests to start failing due to shared global state; this was fixed in #5175.
We still need to mark the worker as "restarting after fork", but the restart is triggered from the monkey patch callback and not from enqueue call like before.
How to test the change?
New integration tests added. They have been aggravatingly flaky in CI with the previous implementation that used both worker after fork logic and at fork monkey patch, and they seem much less flaky with the current implementation that does everything from the monkey patch.
Additionally I manually tested against @anmarchenko 's reproducer.