refactor: Only reload telemetry when needed by BrynCooke · Pull Request #8328 · apollographql/router

BrynCooke · 2025-09-24T09:21:21Z

This PR contains a refactor of the telemetry reload lifecycle to allow us to fix UpDownCounters and move away from Gauges.

The problem with our hot reloads currently is that telemetry will ALWAYS get reinitialized. This is problematic for up down counters as we cannot rely on them for capturing important metrics across reloads, for instance client connections.

We now only reload telemetry if significant config has changed, so as long as the user does not modify telemetry config, UpDownCounters will work as intended.

The first few commits of this PR can be reviewed, but later commits are either moves or fixes.

Notes:

Logic to handle safe shutdown of exporters is now centralized in Activation. This struct frees the rest of the code from dealing with the blocking io that may occur during shutdown.
Collection and orchestration during reload is handled in builder.rs. It will check for changes in config and build what is needed.
Code that relates to building the individual exporters has been pulled out into separate modules. We now have a dedicated builder for tracing as well as metrics.
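
The split between builder and activation described in these notes can be pictured roughly as follows. This is a minimal std-only sketch under stated assumptions, not the router's code: `TelemetryConfig`, `Exporter`, `build` and `commit` are hypothetical names, and a plain thread stands in for whatever blocking-safe mechanism the real `Activation` uses.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the telemetry section of the router config.
#[derive(Clone, Debug, PartialEq)]
struct TelemetryConfig {
    endpoint: String,
}

/// Hypothetical exporter whose shutdown may block on io.
struct Exporter {
    endpoint: String,
}

impl Exporter {
    fn shutdown(self) -> String {
        // A real exporter would flush buffered data here.
        format!("shut down {}", self.endpoint)
    }
}

/// Carries the freshly built parts that still have to be applied.
struct Activation {
    exporter: Exporter,
}

/// Builder step: compare old and new config and build only when something changed.
fn build(previous: Option<&TelemetryConfig>, next: &TelemetryConfig) -> Option<Activation> {
    if previous == Some(next) {
        return None;
    }
    Some(Activation {
        exporter: Exporter {
            endpoint: next.endpoint.clone(),
        },
    })
}

impl Activation {
    /// Activation step: swap in the new exporter, then shut the old one down
    /// on another thread so the caller never waits on blocking io.
    fn commit(self, current: &mut Exporter) -> mpsc::Receiver<String> {
        let old = std::mem::replace(current, self.exporter);
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let _ = tx.send(old.shutdown());
        });
        rx
    }
}
```

With this shape, an unchanged config yields no `Activation` at all, so existing instruments are left untouched.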

Once this is merged there will be a separate PR to migrate gauges to UpDownCounters, where the same instrument is retained to allow the decrement to happen on drop.
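
To illustrate why retaining the same instrument matters, here is a small sketch of a guard that increments on creation and decrements on drop. It assumes a hypothetical `UpDownCounter` built on an atomic integer rather than the OpenTelemetry type, and `ActiveGuard` is an invented name.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an OpenTelemetry UpDownCounter instrument.
#[derive(Default)]
struct UpDownCounter {
    value: AtomicI64,
}

impl UpDownCounter {
    fn add(&self, delta: i64) {
        self.value.fetch_add(delta, Ordering::Relaxed);
    }

    fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Holds on to the instrument it incremented, so the matching decrement
/// lands on the same instrument when the guard is dropped.
struct ActiveGuard {
    counter: Arc<UpDownCounter>,
}

impl ActiveGuard {
    fn new(counter: Arc<UpDownCounter>) -> Self {
        counter.add(1);
        Self { counter }
    }
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        self.counter.add(-1);
    }
}
```

If telemetry were reinitialized between the increment and the drop, a new instrument would start from zero and never see the earlier increments, which is the problem this PR is working around.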

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

Exceptions

No new functionality so no docs.
This is super hard to unit test as the generated exporters are opaque to us. Was a view configured correctly? Difficult to tell.


Notes

It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
Configuration is an important part of many changes. Where applicable please try to document configuration examples.
A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.
Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


apollo-librarian · 2025-09-24T09:21:31Z

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 2 changed, 0 removed

* graphos/routing/(latest)/observability/graphos/graphos-reporting.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/apm-guides/prometheus/otel-traces-to-prometheus.mdx

Build ID: 1c0ee032b4c4b9a1834238b8
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/1c0ee032b4c4b9a1834238b8


aaronArinder

first pass; mostly clarifying questions, moving on to trying to break it

aaronArinder · 2025-09-25T20:06:11Z

Cargo.lock

 "percent-encoding",
 "pin-project-lite",
- "socket2 0.5.10",
+ "socket2 0.6.0",


did this need to change? it came as part of bbcb3963033937bd4f9c3b, but there wasn't a change to any Cargo.tomls in that commit. worried about increased risk, but who knows, maybe it was necessary for some reason?

Merge with dev removed this

aaronArinder · 2025-09-25T20:12:47Z

apollo-router/src/metrics/aggregation.rs

-    providers: HashMap<MeterProviderType, (FilterMeterProvider, HashMap<MeterId, Meter>)>,
+    providers: Vec<(FilterMeterProvider, HashMap<MeterId, Meter>)>,


I'm not sure I fully understand the change to a vec? easier to iterate over?

I have converted the code to always expect something for every meter provider type. A HashMap would allow a slot to be empty, which adds a layer of complexity. Now we just have a fixed set of meter providers always.
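
A rough sketch of the fixed-slot idea, assuming a hypothetical three-variant `MeterProviderType` and a toy `Provider` enum in place of `FilterMeterProvider` (the real enum and provider types may differ):

```rust
/// Hypothetical set of provider kinds; the real enum may have other variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MeterProviderType {
    PublicPrometheus,
    Apollo,
    Public,
}

impl MeterProviderType {
    const ALL: [MeterProviderType; 3] = [Self::PublicPrometheus, Self::Apollo, Self::Public];

    /// Each kind maps to a fixed slot in the Vec.
    fn index(self) -> usize {
        self as usize
    }
}

/// Toy provider: a disabled slot holds `Noop` rather than being absent.
#[derive(Clone, Debug, PartialEq)]
enum Provider {
    Noop,
    Named(String),
}

struct Providers {
    slots: Vec<Provider>,
}

impl Providers {
    /// Every slot is filled up front, so lookups never have to handle a missing entry.
    fn new() -> Self {
        Self {
            slots: MeterProviderType::ALL.iter().map(|_| Provider::Noop).collect(),
        }
    }

    /// Replaces the provider in a slot and returns the old one for shutdown.
    fn set(&mut self, kind: MeterProviderType, provider: Provider) -> Provider {
        std::mem::replace(&mut self.slots[kind.index()], provider)
    }

    fn get(&self, kind: MeterProviderType) -> &Provider {
        &self.slots[kind.index()]
    }
}
```

Compared with a map keyed by provider type, `get` here returns a reference directly instead of an `Option`, which is the simplification the reply describes.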

aaronArinder · 2025-09-29T16:02:39Z

apollo-router/src/plugins/telemetry/error_counter/tests.rs

 }

-#[tokio::test]
+#[tokio::test(flavor = "multi_thread")]


why so many changes to multi-threaded tests?

The move to use block_in_place requires it as it requires a multi-threaded tokio runtime.

apollo-router/src/plugins/telemetry/apollo.rs

apollo-router/src/plugins/telemetry/reload/builder.rs

aaronArinder · 2025-09-29T18:55:22Z

apollo-router/src/plugins/telemetry/reload/builder.rs

+        if self.tracing_config_changed::<otlp::Config>()
+            || self.tracing_config_changed::<datadog::Config>()
+            || self.tracing_config_changed::<zipkin::Config>()
+            || self.tracing_config_changed::<apollo::Config>()


if I'm understanding this correctly, any change to the apollo bit of the config section will trigger a telemetry reload, even something like client_name_header. Is that expected/wanted? Wondering if folks will be confused by the difference in metrics after reloading with something like new client headers

This section will replace the tracer provider, so it won't replace metrics. But yeah, if anything changes, all of tracing needs to be reinitialized. There is no way around this because there can only be one tracer provider and it has to have all the configured exporters in it.

I don't think users will be changing their telemetry config very often.
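
The per-exporter check in the diff above could be modelled along these lines. This is only a sketch: `TracingConfig`, `OtlpConfig`, `ZipkinConfig`, the `ExporterConfig` trait and the `Builder` fields are all hypothetical, and only the method name `tracing_config_changed` comes from the PR.

```rust
#[derive(Clone, Debug, Default, PartialEq)]
struct OtlpConfig {
    endpoint: Option<String>,
}

#[derive(Clone, Debug, Default, PartialEq)]
struct ZipkinConfig {
    enabled: bool,
}

#[derive(Clone, Debug, Default, PartialEq)]
struct TracingConfig {
    otlp: OtlpConfig,
    zipkin: ZipkinConfig,
}

/// Hypothetical trait that lets each exporter config locate itself
/// inside the overall tracing config.
trait ExporterConfig: PartialEq {
    fn get(config: &TracingConfig) -> &Self;
}

impl ExporterConfig for OtlpConfig {
    fn get(config: &TracingConfig) -> &Self {
        &config.otlp
    }
}

impl ExporterConfig for ZipkinConfig {
    fn get(config: &TracingConfig) -> &Self {
        &config.zipkin
    }
}

struct Builder<'a> {
    previous: Option<&'a TracingConfig>,
    current: &'a TracingConfig,
}

impl Builder<'_> {
    /// True on first startup, or when this exporter's section differs.
    fn tracing_config_changed<T: ExporterConfig>(&self) -> bool {
        match self.previous {
            Some(previous) => T::get(previous) != T::get(self.current),
            None => true,
        }
    }

    /// One tracer provider holds every exporter, so any change rebuilds all of tracing.
    fn is_tracing_reload_needed(&self) -> bool {
        self.tracing_config_changed::<OtlpConfig>()
            || self.tracing_config_changed::<ZipkinConfig>()
    }
}
```

Because the comparison is plain `PartialEq` on each section, any field change inside a section counts as a change, which matches the behaviour discussed in this thread.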

aaronArinder · 2025-09-29T19:05:59Z

see #8353 for more details, but I think changes to telemetry.exporters.tracing.* aren't actually reloading the plugin. Here's what I see:

reload does trigger
reload log shows up (with config: true)
the disjunction in builder.rs's setup_public_tracing evaluates to true
weird grpc error
test's expectations (noted in comments in the above pr's code) fail; metrics only increment, they don't reset (which I'm assuming they would if the plugin were reloaded)

BrynCooke · 2025-09-30T22:14:24Z

I took a look and I think things are behaving as they should. Modifying tracing config won't cause metrics to reload so this is why the prom metrics don't zero out.

I've added a tracing reload test in 6ddfce9.

Let's talk about the gRPC error though, as I'm not sure what this refers to.

lrlna

Some observations from a first pass at this!

apollo-router/src/metrics/aggregation.rs

apollo-router/src/metrics/filter.rs

apollo-router/src/plugins/telemetry/reload/activation.rs

lrlna · 2025-10-01T10:43:05Z

apollo-router/src/plugins/telemetry/reload/activation.rs

+/// are also shut down in a safe way.
+pub(crate) struct Activation {
+    /// The new tracer provider. None means leave the existing one
+    trace_provider: Option<opentelemetry_sdk::trace::TracerProvider>,


I think for all of these None should mean None were provided. Otherwise, you're encoding both the state of this field and its value in an Option.

From what I am seeing in builder.rs, we always provide a value for all of these fields if it's available in the new config. If it's not available in the new config, it needs to, of course, be removed. So None actually should mean there isn't any.

I'm going to rename the fields to communicate intent. To disable a meter provider or trace provider we actually set the noop versions rather than having nothing. This removes the complexity of the rest of the code having to deal with Some or None in many locations. It's not something I see often in Rust, but it is in line with the otel libraries, which also do not accept None as absence.

Hopefully this helps: c77255a
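
As a sketch of the noop approach described in this reply, the following uses a hypothetical `TracerProvider` trait with a no-op implementation instead of the OpenTelemetry SDK types; `Telemetry`, `ExportingTracerProvider` and `record` are invented for illustration.

```rust
/// Hypothetical provider interface; not the OpenTelemetry SDK trait.
trait TracerProvider {
    /// Returns a description of where the span went, or None if it was discarded.
    fn record(&self, span: &str) -> Option<String>;
}

/// Disabled tracing is represented by a provider that does nothing.
struct NoopTracerProvider;

impl TracerProvider for NoopTracerProvider {
    fn record(&self, _span: &str) -> Option<String> {
        None
    }
}

struct ExportingTracerProvider {
    endpoint: String,
}

impl TracerProvider for ExportingTracerProvider {
    fn record(&self, span: &str) -> Option<String> {
        Some(format!("{} -> {}", span, self.endpoint))
    }
}

/// There is always a provider, so callers never branch on Some or None.
struct Telemetry {
    tracer_provider: Box<dyn TracerProvider>,
}

impl Telemetry {
    fn new() -> Self {
        Self {
            tracer_provider: Box::new(NoopTracerProvider),
        }
    }

    fn set_tracer_provider(&mut self, provider: Box<dyn TracerProvider>) {
        self.tracer_provider = provider;
    }

    fn trace(&self, span: &str) -> Option<String> {
        self.tracer_provider.record(span)
    }
}
```

Disabling tracing is then the same operation as enabling it: set a provider, which in the disabled case happens to be the no-op one.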

apollo-router/src/plugins/telemetry/reload/builder.rs


bnjjj · 2025-10-01T14:08:42Z

apollo-router/src/plugins/telemetry/reload/activation.rs

+
+/// Allows us to keep track of the last registry that was used. Not ideal. Plugins would be better to have state
+/// that can be maintained across reloads.
+static REGISTRY: LazyLock<Mutex<Option<Registry>>> = LazyLock::new(Default::default);


Just a question: could it be replaced by OnceLock instead? I'm not able to see all the usages of REGISTRY here, so just asking to confirm.

LazyLock is like a simplified OnceLock; it just requires static initialisation, so the code is simpler.

But do you need a Mutex? OnceLock will have better perf than a Mutex if you don't need one.

I do need to be able to change the contained value over time when telemetry is reloaded. As far as I can tell for OnceLock I can't do this as it can only be set once?
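
The difference being discussed can be shown with the standard library alone. In this sketch `Registry` is a hypothetical stand-in for the Prometheus registry, and `replace_registry`, `current_registry` and `SET_ONCE` are names invented for the example.

```rust
use std::sync::{LazyLock, Mutex, OnceLock};

/// Hypothetical stand-in for the registry kept across reloads.
#[derive(Clone, Debug, PartialEq)]
struct Registry {
    generation: u32,
}

/// Lazily initialised to `Mutex::new(None)`; the Mutex allows later replacement.
static REGISTRY: LazyLock<Mutex<Option<Registry>>> = LazyLock::new(Default::default);

/// A OnceLock can be written exactly once.
static SET_ONCE: OnceLock<Registry> = OnceLock::new();

/// Stores a new registry and hands back the previous one, if any.
fn replace_registry(next: Registry) -> Option<Registry> {
    REGISTRY
        .lock()
        .expect("registry lock poisoned")
        .replace(next)
}

/// Returns a copy of the registry currently stored, if any.
fn current_registry() -> Option<Registry> {
    REGISTRY.lock().expect("registry lock poisoned").clone()
}
```

Calling `replace_registry` repeatedly keeps swapping the stored value, whereas a second `SET_ONCE.set(..)` returns an `Err` carrying the rejected value, so a bare OnceLock cannot track a registry that changes on reload.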

aaronArinder

behavior looks good to me! feel free to take over #8353 if you think it's useful, but otherwise I'll close it after this pr goes in


bryn added 6 commits September 23, 2025 12:23

Implement PartialEq for exporter config

dccc21b

Move activation into lifecycle.rs

33d9170

Remove Option when setting a meter provider on Aggregate meter provider.

bbcb396

Move some lifecycle stuff into activation.

Initialization rework

4923eb0

Introduce a new builder that is responsible for detecting if config needs reloading. If it does, then config is reloaded and new information is moved into activation, which is responsible for actually applying the new config and shutting down old exporters.

Renames and move stuff around

1eacd98

Clippy

6faaab8

bryn added 11 commits September 24, 2025 10:53

Move creation of logging layer into builder and apply during activation

96e5fc9

Make propagation reload uniform

323cf7e

Move stuff around

f4c5269

Fix change detection for apollo trace exporting

316cd3b

Move TracingConfigurator

cc970f0

Move prometheus logging

520c4e8

Fix shutdown test

300abcf

Fix apollo metrics config

21ed7ca

Fix some tests, but there are still some failures

116a5f7

Prevent schema reload from choosing a different port for prom

9a51f05

Updating the config should be identical to initial startup

Wait a little longer for metrics

6039eb5

BrynCooke changed the title ~~Only reload otel when needed~~ refactor: Only reload telemetry when needed Sep 25, 2025

bryn added 3 commits September 25, 2025 11:36

Changelog

93ab64e

Move view logic to builder

f65de82

Add test for metrics reloading to show that reloads don't always happen

78f3706

BrynCooke marked this pull request as ready for review September 25, 2025 11:21

BrynCooke requested a review from a team September 25, 2025 11:21

BrynCooke requested a review from a team as a code owner September 25, 2025 11:21

bryn added 4 commits September 25, 2025 15:07

Add some unit testing for the reload functionality

3ab26e9

Improve logging

b882be9

Improve logic around spawning a blocking safe task.

2e990e4

Use block_in_place during configuration just to be safe.

dd037c4

Removed custom spawn blocking

BrynCooke requested review from aaronArinder and bnjjj September 29, 2025 09:40

aaronArinder reviewed Sep 29, 2025

apollo-router/src/plugins/telemetry/reload/builder.rs (outdated)

aaronArinder reviewed Sep 29, 2025

aaronArinder mentioned this pull request Sep 29, 2025

chore(test): exploration of telemetry reloading #8353

Closed

10 tasks

bryn added 2 commits September 30, 2025 12:09

Merge branch 'dev' into bryn/otel-reload-simplification

984c233

Add tracing reload test

6ddfce9

lrlna reviewed Oct 1, 2025

Update apollo-router/src/metrics/aggregation.rs

f5d684d

Co-authored-by: Iryna Shestak <shestak.irina@gmail.com>

bnjjj approved these changes Oct 1, 2025

aaronArinder approved these changes Oct 1, 2025

bryn added 6 commits October 1, 2025 15:48

(refactor) create_registered_instrument logic moved to inner

1feb276

(docs) Clarify activation fields

6f72675

(refactor) Remove unused fns and convert others to test

b25b296

(refactor) Rename fields in activation

c77255a

(refactor) Rename boolean fns with is_ prefix

a7d51c6

(docs) deadlock explaination

1cb9e12

lrlna approved these changes Oct 2, 2025

bryn added 4 commits October 2, 2025 10:54

(refactor) Renames based on sync review

31d9956

(docs) Code comments and docs

9f26ac4

Lints

3faef4f

Use spawn_blocking instead of block in place to prevent blocking reload if an exporter is hanging shutting down.

469a12a

BrynCooke merged commit d6b97d7 into dev Oct 2, 2025
15 checks passed

BrynCooke deleted the bryn/otel-reload-simplification branch October 2, 2025 12:44

abernix mentioned this pull request Oct 27, 2025

prep release: v2.8.0 #8495

Merged
