@p-datadog p-datadog commented Dec 16, 2025

The worker race fix (described in https://github.com/DataDog/ruby-guild/issues/279 and originally reported as a performance regression) has been extracted into #5176.

What does this PR do?

Restores #5074 and implements a telemetry metrics reset after fork. The missing reset was causing wrong metric counts in datadog-ci end-to-end tests.

The metrics reset is done via the "at fork monkey patch". The existing worker race fix (in #5176) used the worker's after-fork handling; this PR consolidates the after-fork logic into the monkey patch handler.
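
For readers unfamiliar with the approach, a minimal sketch of an at-fork hook that resets per-process metrics might look like the following. Every name here (`AtForkSketch`, `ProcessPatch`, `MetricsSketch`, `on_child`) is hypothetical and is not the dd-trace-rb API. The sketch assumes Ruby 3.1+, where `Process._fork` is the hook that `fork` goes through; older Rubies would need `Kernel#fork` and `Process.fork` patched directly.

```ruby
# Sketch only: every name here is hypothetical, not dd-trace-rb's API.
module AtForkSketch
  @child_callbacks = []

  class << self
    # Register a block to run in the child process after fork.
    def on_child(&block)
      @child_callbacks << block
    end

    # Run every registered block, in registration order.
    def run_child_callbacks
      @child_callbacks.each(&:call)
    end
  end

  # Prepended onto Process's singleton class so that it wraps the
  # built-in Process._fork (assumption: Ruby 3.1 or newer).
  module ProcessPatch
    def _fork
      pid = super
      AtForkSketch.run_child_callbacks if pid == 0 # 0 means "in the child"
      pid
    end
  end
end

# Install the patch only where the hook exists.
Process.singleton_class.prepend(AtForkSketch::ProcessPatch) if Process.respond_to?(:_fork)

# Toy stand-in for a metrics manager holding per-process counters.
class MetricsSketch
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def inc(name)
    @counts[name] += 1
  end

  # Drop counts inherited from the parent so the child starts from zero.
  def reset!
    @counts = Hash.new(0)
  end
end

METRICS = MetricsSketch.new
AtForkSketch.on_child { METRICS.reset! }
```

With this in place, a child created by `fork` would start with empty counters while the parent keeps its own, which is the property the end-to-end tests depend on.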

Motivation:

Dynamic Instrumentation / Live Debugger require telemetry app-heartbeat events to render their UI properly. In forking web servers these events are normally sent from the forked children, and they are presently missing for most customers.

Change log entry
Yes: fix Live Debugger / Dynamic Instrumentation UI for forking web servers

Additional Notes:

The original PR and the race fix are in separate commits for ease of review.

This PR now uses the "at fork monkey patch" to reset the metrics manager after fork. The added tests caused existing tests to start failing due to shared global state; this was fixed in #5175.

We still need to mark the worker as "restarting after fork", but the restart is now triggered from the monkey patch callback rather than from the enqueue call as before.
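
As a rough illustration of that split, the sketch below keeps `enqueue` free of any fork handling and puts the restart in an `after_fork` method that a fork callback would call in the child. `WorkerSketch` and all of its methods are hypothetical. In particular, what the real worker does with its "restarting after fork" flag is not shown in this PR description; here the flag only lets `start` tell a post-fork restart from a first start.

```ruby
# Sketch only: hypothetical names, not the dd-trace-rb telemetry worker.
class WorkerSketch
  attr_reader :start_reasons

  def initialize
    @queue = Queue.new
    @thread = nil
    @restarting_after_fork = false
    @start_reasons = [] # records why each start happened, for illustration
  end

  # No fork detection here: enqueue only adds work.
  def enqueue(event)
    @queue << event
  end

  # Start the background thread unless one is already running.
  def start
    return if @thread&.alive?

    @start_reasons << (@restarting_after_fork ? :after_fork : :initial)
    @thread = Thread.new { loop { process(@queue.pop) } }
  end

  # Meant to be called from the at-fork callback, in the child only.
  def after_fork
    @restarting_after_fork = true
    @thread = nil  # the parent's thread does not exist in the child
    @queue.clear   # drop events that were queued by the parent
    start
  ensure
    @restarting_after_fork = false
  end

  private

  def process(event)
    # A real worker would send the event; the sketch discards it.
  end
end
```

The design point is that only one place (the fork callback) decides when a restart happens, so `enqueue` and the callback cannot race to restart the same worker.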

How to test the change?
New integration tests were added. They were aggravatingly flaky in CI with the previous implementation, which used both the worker after-fork logic and the at fork monkey patch. They seem much less flaky with the current implementation, which does everything from the monkey patch.

Additionally, I manually tested against @anmarchenko's reproducer.

@p-datadog p-datadog requested a review from a team as a code owner December 16, 2025 19:49
@github-actions github-actions bot added the core Involves Datadog core libraries label Dec 16, 2025
github-actions bot commented Dec 16, 2025

Typing analysis

Note: Ignored files are excluded from the next sections.

steep:ignore comments

This PR introduces 1 steep:ignore comment.

steep:ignore comments (+1-0)
Introduced:
lib/datadog/core/telemetry/worker.rb:269

Untyped methods

This PR introduces 1 untyped method and 7 partially typed methods, and clears 1 untyped method and 7 partially typed methods. It increases the percentage of typed methods from 56.14% to 56.34% (+0.2%).

Untyped methods (+1-1)
Introduced:
sig/datadog/core/telemetry/worker.rbs:61
└── def buffer_klass: () -> untyped
Cleared:
sig/datadog/core/telemetry/worker.rbs:60
└── def buffer_klass: () -> untyped
Partially typed methods (+7-7)
Introduced:
sig/datadog/core/telemetry/component.rbs:24
└── def self.build: (untyped settings, Datadog::Core::Configuration::AgentSettings agent_settings, Datadog::Core::Logger logger) -> Component
sig/datadog/core/telemetry/component.rbs:26
└── def initialize: (logger: Core::Logger, settings: untyped, agent_settings: Datadog::Core::Configuration::AgentSettings, enabled: true | false) -> void
sig/datadog/core/telemetry/event/app_started.rbs:19
└── def configuration: (untyped settings, Core::Configuration::AgentSettings agent_settings) -> Array[Hash[Symbol, untyped]]
sig/datadog/core/telemetry/event/app_started.rbs:23
└── def conf_value: (String name, untyped value, Integer seq_id, String origin) -> Hash[Symbol, untyped]
sig/datadog/core/telemetry/event/app_started.rbs:27
└── def install_signature: (untyped settings) -> Hash[Symbol, Object]
sig/datadog/core/telemetry/event/app_started.rbs:29
└── def get_telemetry_origin: (untyped settings, String config_path) -> String
sig/datadog/core/telemetry/event/synth_app_client_configuration_change.rbs:8
└── def payload: () -> { ?products: untyped, configuration: untyped, ?install_signature: untyped }
Cleared:
sig/datadog/core/telemetry/component.rbs:23
└── def self.build: (untyped settings, Datadog::Core::Configuration::AgentSettings agent_settings, Datadog::Core::Logger logger) -> Component
sig/datadog/core/telemetry/component.rbs:25
└── def initialize: (logger: Core::Logger, settings: untyped, agent_settings: Datadog::Core::Configuration::AgentSettings, enabled: true | false) -> void
sig/datadog/core/telemetry/event/app_started.rbs:17
└── def configuration: (untyped settings, Core::Configuration::AgentSettings agent_settings) -> Array[Hash[Symbol, untyped]]
sig/datadog/core/telemetry/event/app_started.rbs:21
└── def conf_value: (String name, untyped value, Integer seq_id, String origin) -> Hash[Symbol, untyped]
sig/datadog/core/telemetry/event/app_started.rbs:25
└── def install_signature: (untyped settings) -> Hash[Symbol, Object]
sig/datadog/core/telemetry/event/app_started.rbs:27
└── def get_telemetry_origin: (untyped settings, String config_path) -> String
sig/datadog/core/telemetry/event/synth_app_client_configuration_change.rbs:8
└── def payload: () -> { configuration: untyped }

If you believe a method or an attribute is rightfully untyped or partially typed, you can add # untyped:accept to the end of the line to remove it from the stats.

@p-datadog p-datadog requested review from a team as code owners December 17, 2025 01:23
@p-datadog p-datadog requested a review from vpellan December 17, 2025 01:23
@p-datadog p-datadog force-pushed the telemetry-fork-2 branch 4 times, most recently from 60cf81b to c3dbf11 Compare December 17, 2025 01:40
datadog-official bot commented Dec 17, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 82.52%
Overall Coverage: 95.17% (-0.04%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 9433984

pr-commenter bot commented Dec 17, 2025

Benchmarks

Benchmark execution time: 2026-01-05 19:02:02

Comparing candidate commit 9433984 in PR branch telemetry-fork-2 with baseline commit 9dcc803 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 2 unstable metrics.

@p-datadog p-datadog added this to the 2.24.0 milestone Dec 17, 2025
@p-datadog p-datadog changed the title Telemetry: send events in forked children + worker race fix DEBUG-4548 Telemetry: send events in forked children + worker race fix Dec 17, 2025
@p-datadog

The worker race fix has been extracted into #5176; this PR will be rebased on top of it.

@p-datadog p-datadog marked this pull request as draft December 20, 2025 14:40
@p-datadog p-datadog force-pushed the telemetry-fork-2 branch 2 times, most recently from c70c900 to fef8703 Compare December 20, 2025 19:48
@p-datadog p-datadog changed the title DEBUG-4548 Telemetry: send events in forked children + worker race fix DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork Dec 22, 2025
@p-datadog p-datadog force-pushed the telemetry-fork-2 branch 6 times, most recently from 681f596 to 953ab61 Compare December 29, 2025 23:52
@p-datadog p-datadog marked this pull request as ready for review December 30, 2025 02:59
@p-datadog

The tests I added are still flaking in CI. The most recent failure was on Ruby 2.5, and I have seen previous failures on 2.5 as well. I spent ~3 weeks trying to find the root cause; the problem is that the failures disappear once I start adding diagnostics. So, after 3 weeks, I am skipping the tests on 2.5, and if they fail on anything below 3.0 I will skip all of those versions too.
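
One way such a version gate could be written is sketched below. The helper name `skip_fork_telemetry_tests?` and its `below:` parameter are hypothetical; the actual skip condition used in the test suite is not shown in this thread.

```ruby
require 'rubygems' # provides Gem::Version; normally loaded by default

# Hypothetical helper: true when the flaky tests should be skipped
# on the given Ruby version (defaults to the running interpreter).
def skip_fork_telemetry_tests?(ruby_version = RUBY_VERSION, below: '2.6')
  Gem::Version.new(ruby_version) < Gem::Version.new(below)
end
```

In an RSpec example group this could be used as `before { skip 'flaky on old Rubies' if skip_fork_telemetry_tests? }`, and widening the gate to everything below 3.0 would only mean passing `below: '3.0'`.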

@p-datadog p-datadog merged commit 7403efc into master Jan 5, 2026
636 of 637 checks passed
@p-datadog p-datadog deleted the telemetry-fork-2 branch January 5, 2026 19:19
p-datadog pushed a commit that referenced this pull request Jan 5, 2026
* master: (129 commits)
  Transports: remove api_version (#5164)
  DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork (#5159)
  Ignore "leaked" pipe file descriptors in JRuby, improve diagnostics  (#5188)
  debug-4548 Increase number of iterations for flakiness (#5184)
  [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20684824141
  [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20616292456 (#5190)
  downgrade ffi for ruby 4.0 & 2.5 (#5189)
  Fix ruby warnings when accessing undefined instance variables (#5178)
  DEBUG-3499 DI: fix accounting when intrumenting upon class definition, add instr… (#5168)
  DEBUG-3499 RC: add diagnostics for invalid values (#5167)
  DEBUG-4548 Core: fix worker shutdown race  (#5176)
  Retry system-test build (#5181)
  Fix Baggage type check (#5182)
  [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20487829791 (#5183)
  [🤖] Update Latest Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/20401889084 (#5180)
  DEBUG-3499 DI: do not instrument when there is already an installed probe with the same id (#5169)
  DEBUG-3499 DI: rework RC interface (#5165)
  set DI test duration upper bound to 1000 seconds (#5161)
  add missing supported config default value
  [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20401907816 (#5179)
  ...
p-datadog pushed a commit that referenced this pull request Jan 5, 2026
* master:
  Transports: remove api_version (#5164)
  DEBUG-4548 Telemetry: send events in forked children + telemetry metrics reset after fork (#5159)
  Ignore "leaked" pipe file descriptors in JRuby, improve diagnostics  (#5188)
  debug-4548 Increase number of iterations for flakiness (#5184)
  [🤖] Update System Tests: https://github.com/DataDog/dd-trace-rb/actions/runs/20684824141
@Strech Strech mentioned this pull request Jan 8, 2026
Labels

core (Involves Datadog core libraries), tracing

6 participants