Skip to content
66 changes: 66 additions & 0 deletions docs/TelemetryDevelopment.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# Telemetry Development

## Telemetry Presence

`dd-trace-rb` is written to assume that the telemetry component is always
present. If telemetry is disabled, the component is still created but does
nothing.

Most components call methods on `telemetry` unconditionally. There are two
exceptons: DI and Data Streams are written to assume that `telemetry` may be nil.
However, this assumption is not necessary and these components may be
changed in the future to assume that `telemetry` is always present.

## Event Submission Prior To Start

Telemetry is unique among other components in that it permits events to be
submitted to it prior to its worker starting. This is done so that errors
during `Datadog.configure` processing can be reported via telemetry, because
the errors can be produced prior to telemetry worker starting. The telemetry
component keeps the events and sends them after the worker starts.

## Initial Event

`dd-trace-rb` can be initialized multiple times during application boot.
For example, if customers follow our documentation and require
`datadog/auto_instrument`, and call `Datadog.configure`, they would get
`Datadog.configure` invoked two times total (the first time by `auto_instrument`)
and thus telemetry instance would be created twice. This happens in the
applications used with system tests.

System tests, on the other hand, require that there is only one `app-started`
event emitted, because they think the application is launched once.
To deal with this we have a hack in the telemetry code to send an
`app-client-configuration-change` event instead of the second `app-started`
event. This is implemented via the `SynthAppClientConfigurationChange` class.

## Fork Handling

We must send telemetry data from forked children.

Telemetry started out as a diagnostic tool used during application boot,
but is now used for reporting application liveness (and settings/state)
throughout the application lifetime. Live Debugger / Dynamic Instrumentation,
for example, require ongoing `app-heartbeat` events emitted via telemetry
to provide a working UI to customers.

It is somewhat common for customers to preload the application in the parent
web server process and process requests from children. This means telemetry
is initialized from the parent process, and it must emit events in the
forked children.

We use the standard worker `after_fork` handler to recreated the worker
thread in forked children. However, there are two caveats to keep in mind
which are specific to telemetry:

1. Due to telemetry permitting event submission prior to its start, it is
not sufficient to simply reset the state from the worker's `perform` method,
as is done in other components. We must only reset the state when we are
in the forked child, otherwise we'll trash any events submitted to telemetry
prior to its worker starting.

2. The child process is a brand new application as far as the backend/UI is
concerned, having a new runtime ID, and therefore the initial event in the
forked child must always be `app-started`. Since we track the initial event
in the telemetry component, this event must be changed to `app-started` in
forked children regardless of what it was in the parent.
63 changes: 51 additions & 12 deletions lib/datadog/core/telemetry/component.rb
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,26 @@
module Datadog
module Core
module Telemetry
# Telemetry entrypoint, coordinates sending telemetry events at various points in app lifecycle.
# Note: Telemetry does not spawn its worker thread in fork processes, thus no telemetry is sent in forked processes.
# Telemetry entry point, coordinates sending telemetry events at
# various points in application lifecycle.
#
# @api private
class Component
ENDPOINT_COLLECTION_MESSAGE_LIMIT = 300

attr_reader :enabled, :logger, :transport, :worker
ONLY_ONCE = Utils::OnlyOnce.new

attr_reader :enabled
attr_reader :logger
attr_reader :transport
attr_reader :worker
attr_reader :settings
attr_reader :agent_settings
attr_reader :metrics_manager

# Alias for consistency with other components.
# TODO Remove +enabled+ method
alias_method :enabled?, :enabled

include Core::Utils::Forking
include Telemetry::Logging
Expand Down Expand Up @@ -50,6 +62,17 @@ def initialize( # standard:disable Metrics/MethodLength
logger:,
enabled:
)
ONLY_ONCE.run do
Utils::AtForkMonkeyPatch.apply!

# All of the other at fork monkey patch callbacks reference
# globals, follow that pattern here to avoid having the component
# referenced via the at fork callbacks.
Datadog::Core::Utils::AtForkMonkeyPatch.at_fork(:child) do
Datadog.send(:components, allow_initialization: false)&.telemetry&.after_fork
end
end

@enabled = enabled
@log_collection_enabled = settings.telemetry.log_collection_enabled
@logger = logger
Expand Down Expand Up @@ -110,7 +133,7 @@ def disable!
end

def start(initial_event_is_change = false, components:)
return if !@enabled
return unless enabled?

initial_event = if initial_event_is_change
Event::SynthAppClientConfigurationChange.new(
Expand All @@ -136,19 +159,19 @@ def shutdown!
end

def emit_closing!
return if !@enabled || forked?
return unless enabled?

@worker.enqueue(Event::AppClosing.new)
end

def integrations_change!
return if !@enabled || forked?
return unless enabled?

@worker.enqueue(Event::AppIntegrationsChange.new)
end

def log!(event)
return if !@enabled || forked? || !@log_collection_enabled
return unless enabled? && @log_collection_enabled

@worker.enqueue(event)
end
Expand All @@ -158,22 +181,22 @@ def log!(event)
# been flushed, or nil if telemetry is disabled.
#
# @api private
def flush
return if !@enabled || forked?
def flush(timeout: nil)
return unless enabled?

@worker.flush
@worker.flush(timeout: timeout)
end

# Report configuration changes caused by Remote Configuration.
def client_configuration_change!(changes)
return if !@enabled || forked?
return unless enabled?

@worker.enqueue(Event::AppClientConfigurationChange.new(changes, 'remote_config'))
end

# Report application endpoints
def app_endpoints_loaded(endpoints, page_size: ENDPOINT_COLLECTION_MESSAGE_LIMIT)
return if !@enabled || forked?
return unless enabled?

endpoints.each_slice(page_size).with_index do |endpoints_slice, i|
@worker.enqueue(Event::AppEndpointsLoaded.new(endpoints_slice, is_first: i.zero?))
Expand Down Expand Up @@ -204,6 +227,22 @@ def rate(namespace, metric_name, value, tags: {}, common: true)
def distribution(namespace, metric_name, value, tags: {}, common: true)
@metrics_manager.distribution(namespace, metric_name, value, tags: tags, common: common)
end

# When a fork happens, we generally need to do two things inside the
# child proess:
# 1. Restart the worker.
# 2. Discard any events and metrics that were submitted in the
# parent process (because they will be sent out in the parent
# process, sending them in the child would cause duplicate
# submission).
def after_fork
# We cannot simply create a new instance of metrics manager because
# it is referenced from other objects (e.g. the worker).
# We must reset the existing instance.
@metrics_manager.clear

worker&.send(:after_fork_monkey_patched)
end
end
end
end
Expand Down
15 changes: 15 additions & 0 deletions lib/datadog/core/telemetry/event/app_started.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,12 @@ class AppStarted < Base
def initialize(components:)
# To not hold a reference to the component tree, generate
# the event payload here in the constructor.
#
# Important: do not store data that contains (or is derived from)
# the runtime_id or sequence numbers.
# This event is reused when a process forks, but in the
# child process the runtime_id would be different and sequence
# number is reset.
@configuration = configuration(components.settings, components.agent_settings)
@install_signature = install_signature(components.settings)
@products = products(components)
Expand All @@ -30,6 +36,15 @@ def payload
}
end

# Whether the event is actually the app-started event.
# For the app-started event we follow up by sending
# app-dependencies-loaded, if the event is
# app-client-configuration-change we don't send
# app-dependencies-loaded.
def app_started?
true
end

private

def products(components)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,13 +28,36 @@ module Event
# and app-closing events.
class SynthAppClientConfigurationChange < AppStarted
def type
'app-client-configuration-change'
if reset?
super
else
'app-client-configuration-change'
end
end

def payload
{
configuration: @configuration,
}
if reset?
super
else
{
configuration: @configuration,
}
end
end

def app_started?
reset?
end

# Revert this event to a "regular" AppStarted event.
#
# Used in after_fork to send the AppStarted event in child processes.
def reset!
@reset = true
end

def reset?
defined?(@reset) && !!@reset
end
end
end
Expand Down
9 changes: 9 additions & 0 deletions lib/datadog/core/telemetry/metrics_manager.rb
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,11 @@ module Datadog
module Core
module Telemetry
# MetricsManager aggregates and flushes metrics and distributions
#
# @api private
class MetricsManager
attr_reader :enabled
attr_reader :collections

def initialize(aggregation_interval:, enabled:)
@interval = aggregation_interval
Expand Down Expand Up @@ -68,6 +71,12 @@ def disable!
@enabled = false
end

def clear
@mutex.synchronize do
@collections = {}
end
end

private

def fetch_or_create_collection(namespace)
Expand Down
Loading
Loading