Skip to content

refactor: Only reload telemetry when needed#8328

Merged
BrynCooke merged 37 commits intodevfrom
bryn/otel-reload-simplification
Oct 2, 2025
Merged

refactor: Only reload telemetry when needed#8328
BrynCooke merged 37 commits intodevfrom
bryn/otel-reload-simplification

Conversation

@BrynCooke
Copy link
Contributor

@BrynCooke BrynCooke commented Sep 24, 2025

This PR contains a refactor of the telemetry reload lifecycle to allow us to fix UpDownCounters and move away from Gauges.

The problem with our hot reloads currently is that telemetry will ALWAYS get reinitialized. This is problematic for up down counters as we cannot rely on them for capturing important metrics across reloads, for instance client connections.

We now only reload telemetry if significant config has changed, thus as long as the user dow not modify telemetry config UpDownCounters will work as intended.

The first few commits of this PR can be reviewed, but later commits are either moves or fixes.

Notes:

  1. logic to handle safe shutdown of exporters is now centralized Activation. This struct is frees the rest of the code from dealing with the blocking io that may occur during shutdown.
  2. Collection and orchestration during reload is handled in builder.rs. It will check for changes in config and build what is needed.
  3. Code that relates to building the individual exporters has been pulled out into separate modules. We now have a dedicated builder for tracing as well as metrics.

Once this is merged there will be a separate PR to fix migrate gauges to UpDownCounters where the same instrument is retained to allow the down to happen on drop.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added3 and documented
  • Tests added and passing4
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

No new functionality so no docs.
This is super hard to unit test as the generated exporters are opaque to us. Was a view configured correctly? Difficult to tell.

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

bryn added 6 commits September 23, 2025 12:23
 Introduce a new builder that is responsible detecting if config needs reloading. If it does then config is reloaded and new information is moved into activation which is responsible for actually applying the new config and shutting down old exporters.
@apollo-librarian
Copy link

apollo-librarian bot commented Sep 24, 2025

✅ Docs preview ready

The preview is ready to be viewed. View the preview

File Changes

0 new, 2 changed, 0 removed
* graphos/routing/(latest)/observability/graphos/graphos-reporting.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/apm-guides/prometheus/otel-traces-to-prometheus.mdx

Build ID: 1c0ee032b4c4b9a1834238b8
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/1c0ee032b4c4b9a1834238b8

@github-actions

This comment has been minimized.

@BrynCooke BrynCooke changed the title Only reload otel when needed refactor: Only reload telemetry when needed Sep 25, 2025
@BrynCooke BrynCooke marked this pull request as ready for review September 25, 2025 11:21
@BrynCooke BrynCooke requested a review from a team September 25, 2025 11:21
@BrynCooke BrynCooke requested a review from a team as a code owner September 25, 2025 11:21
Copy link
Contributor

@aaronArinder aaronArinder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass; mostly clarifying questions, moving on to trying to break it

Cargo.lock Outdated
"percent-encoding",
"pin-project-lite",
"socket2 0.5.10",
"socket2 0.6.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this need to change? it came as part of bbcb3963033937bd4f9c3b, but there wasn't a change to any Cargo.tomls in that commit. worried about increased risk, but who knows, maybe it was necessary for some reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge with dev removed this

Comment on lines -75 to +80
providers: HashMap<MeterProviderType, (FilterMeterProvider, HashMap<MeterId, Meter>)>,
providers: Vec<(FilterMeterProvider, HashMap<MeterId, Meter>)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand the change to a vec? easier to iterate over?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have converted the code to always expect something for every meter provider type. Hashmap would allow the slot to be empty which adds a layer of complexity. Now we just have a fixed set of meter providers always.

}

#[tokio::test]
#[tokio::test(flavor = "multi_thread")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so many changes to multi-threaded tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The move to use block_in_place requires it as it requires a multi-threaded tokio runtime.

if self.tracing_config_changed::<otlp::Config>()
|| self.tracing_config_changed::<datadog::Config>()
|| self.tracing_config_changed::<zipkin::Config>()
|| self.tracing_config_changed::<apollo::Config>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I'm understanding this correctly, any change to the apollo bit of the config section will trigger a telemetry reload, even something like client_name_header. Is that expected/wanted? Wondering if folks will be confused by the difference in metrics after reloading with something like new client headers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section will replace the tracer provider, so it won't replace metrics. But yeah, if anything all of tracing needs to be reinitialized. There is no way around this because there can only be one tracer provider and it has to have all the configured exporters in it.

I don't think users will be changing their telemetry config very often.

@aaronArinder
Copy link
Contributor

see #8353 for more details, but I think changes to telemetry.exporters.tracing.* aren't actually reloading the plugin. Here's what I see:

  • reload does trigger
  • reload log shows up (with config: true)
  • the disjunction in builder.rs's setup_public_tracing evalutes to true
  • weird grpc error
  • test's expectations (noted in comments in the above pr's code) fail; metrics only increment, they don't reset (which I'm assuming they would if the plugin were reloaded)

@BrynCooke
Copy link
Contributor Author

see #8353 for more details, but I think changes to telemetry.exporters.tracing.* aren't actually reloading the plugin. Here's what I see:

  • reload does trigger
  • reload log shows up (with config: true)
  • the disjunction in builder.rs's setup_public_tracing evalutes to true
  • weird grpc error
  • test's expectations (noted in comments in the above pr's code) fail; metrics only increment, they don't reset (which I'm assuming they would if the plugin were reloaded)

I took a look and I think things are behaving as they should. Modifying tracing config won't cause metrics to reload so this is why the prom metrics don't zero out.

I've added a tracing reload test in 6ddfce9.

Let's talk about the GRPC error though as I' not sure what this refers to.

Copy link
Member

@lrlna lrlna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some observations from a first pass at this!

/// are also shut down in a safe way.
pub(crate) struct Activation {
/// The new tracer provider. None means leave the existing one
trace_provider: Option<opentelemetry_sdk::trace::TracerProvider>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for all of these None should mean None were provided. Otherwise, you're encoding both the state of this field and its value in an Option.

From what I am seeing in builder.rs, we always provide a value for all of these fields if it's available in the new config. If it's not available in the new config, it needs to, of course, be removed. So None actually should mean there isn't any.

Copy link
Contributor Author

@BrynCooke BrynCooke Oct 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to rename the fields to communicate intent. To disable a meter provider our trace provider we actually set the noop versions rather than having nothing, this removes the complexity of having the rest of the code having to deal with Some or None in many locations. It's not something I see often in Rust but it is in line with the otel libraries which also do not accept None as absence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this helps: c77255a

Co-authored-by: Iryna Shestak <shestak.irina@gmail.com>

/// Allows us to keep track of the last registry that was used. Not ideal. Plugins would be better to have state
/// that can be maintained across reloads.
static REGISTRY: LazyLock<Mutex<Option<Registry>>> = LazyLock::new(Default::default);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, could it be replaced by OnceLock instead ? I don't know I'm not able to see all the usages of REGISTRY here but just asking to confirm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LazyLock is like a sImplified OnceLock, it just requires static initialisation so the code is simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do you need a Mutex ? Because OnceLock will have better perf than a Mutex if you don't need a mutex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do need to be able to change the contained value over time when telemetry is reloaded. As far as I can tell for OnceLock I can't do this as it can only be set once?

Copy link
Contributor

@aaronArinder aaronArinder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behavior looks good to me! feel free to take over #8353 if you think it's useful, but otherwise I'll close it after this pr goes in

@BrynCooke BrynCooke merged commit d6b97d7 into dev Oct 2, 2025
15 checks passed
@BrynCooke BrynCooke deleted the bryn/otel-reload-simplification branch October 2, 2025 12:44
@abernix abernix mentioned this pull request Oct 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants