refactor: Only reload telemetry when needed#8328
Conversation
Move some lifecycle stuff into activation.
Introduce a new builder that is responsible detecting if config needs reloading. If it does then config is reloaded and new information is moved into activation which is responsible for actually applying the new config and shutting down old exporters.
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 0 new, 2 changed, 0 removedBuild ID: 1c0ee032b4c4b9a1834238b8 URL: https://www.apollographql.com/docs/deploy-preview/1c0ee032b4c4b9a1834238b8 |
This comment has been minimized.
This comment has been minimized.
Updating the config should be identical to initial startup
aaronArinder
left a comment
There was a problem hiding this comment.
first pass; mostly clarifying questions, moving on to trying to break it
Cargo.lock
Outdated
| "percent-encoding", | ||
| "pin-project-lite", | ||
| "socket2 0.5.10", | ||
| "socket2 0.6.0", |
There was a problem hiding this comment.
did this need to change? it came as part of bbcb3963033937bd4f9c3b, but there wasn't a change to any Cargo.tomls in that commit. worried about increased risk, but who knows, maybe it was necessary for some reason?
There was a problem hiding this comment.
Merge with dev removed this
| providers: HashMap<MeterProviderType, (FilterMeterProvider, HashMap<MeterId, Meter>)>, | ||
| providers: Vec<(FilterMeterProvider, HashMap<MeterId, Meter>)>, |
There was a problem hiding this comment.
I'm not sure I fully understand the change to a vec? easier to iterate over?
There was a problem hiding this comment.
I have converted the code to always expect something for every meter provider type. Hashmap would allow the slot to be empty which adds a layer of complexity. Now we just have a fixed set of meter providers always.
| } | ||
|
|
||
| #[tokio::test] | ||
| #[tokio::test(flavor = "multi_thread")] |
There was a problem hiding this comment.
why so many changes to multi-threaded tests?
There was a problem hiding this comment.
The move to use block_in_place requires it as it requires a multi-threaded tokio runtime.
| if self.tracing_config_changed::<otlp::Config>() | ||
| || self.tracing_config_changed::<datadog::Config>() | ||
| || self.tracing_config_changed::<zipkin::Config>() | ||
| || self.tracing_config_changed::<apollo::Config>() |
There was a problem hiding this comment.
if I'm understanding this correctly, any change to the apollo bit of the config section will trigger a telemetry reload, even something like client_name_header. Is that expected/wanted? Wondering if folks will be confused by the difference in metrics after reloading with something like new client headers
There was a problem hiding this comment.
This section will replace the tracer provider, so it won't replace metrics. But yeah, if anything all of tracing needs to be reinitialized. There is no way around this because there can only be one tracer provider and it has to have all the configured exporters in it.
I don't think users will be changing their telemetry config very often.
|
see #8353 for more details, but I think changes to
|
I took a look and I think things are behaving as they should. Modifying tracing config won't cause metrics to reload so this is why the prom metrics don't zero out. I've added a tracing reload test in 6ddfce9. Let's talk about the GRPC error though as I' not sure what this refers to. |
lrlna
left a comment
There was a problem hiding this comment.
Some observations from a first pass at this!
| /// are also shut down in a safe way. | ||
| pub(crate) struct Activation { | ||
| /// The new tracer provider. None means leave the existing one | ||
| trace_provider: Option<opentelemetry_sdk::trace::TracerProvider>, |
There was a problem hiding this comment.
I think for all of these None should mean None were provided. Otherwise, you're encoding both the state of this field and its value in an Option.
From what I am seeing in builder.rs, we always provide a value for all of these fields if it's available in the new config. If it's not available in the new config, it needs to, of course, be removed. So None actually should mean there isn't any.
There was a problem hiding this comment.
I'm going to rename the fields to communicate intent. To disable a meter provider our trace provider we actually set the noop versions rather than having nothing, this removes the complexity of having the rest of the code having to deal with Some or None in many locations. It's not something I see often in Rust but it is in line with the otel libraries which also do not accept None as absence.
Co-authored-by: Iryna Shestak <shestak.irina@gmail.com>
|
|
||
| /// Allows us to keep track of the last registry that was used. Not ideal. Plugins would be better to have state | ||
| /// that can be maintained across reloads. | ||
| static REGISTRY: LazyLock<Mutex<Option<Registry>>> = LazyLock::new(Default::default); |
There was a problem hiding this comment.
Just a question, could it be replaced by OnceLock instead ? I don't know I'm not able to see all the usages of REGISTRY here but just asking to confirm
There was a problem hiding this comment.
LazyLock is like a sImplified OnceLock, it just requires static initialisation so the code is simpler.
There was a problem hiding this comment.
But do you need a Mutex ? Because OnceLock will have better perf than a Mutex if you don't need a mutex
There was a problem hiding this comment.
I do need to be able to change the contained value over time when telemetry is reloaded. As far as I can tell for OnceLock I can't do this as it can only be set once?
aaronArinder
left a comment
There was a problem hiding this comment.
behavior looks good to me! feel free to take over #8353 if you think it's useful, but otherwise I'll close it after this pr goes in
…ad if an exporter is hanging shutting down.
This PR contains a refactor of the telemetry reload lifecycle to allow us to fix
UpDownCountersand move away from Gauges.The problem with our hot reloads currently is that telemetry will ALWAYS get reinitialized. This is problematic for up down counters as we cannot rely on them for capturing important metrics across reloads, for instance client connections.
We now only reload telemetry if significant config has changed, thus as long as the user dow not modify telemetry config
UpDownCounterswill work as intended.The first few commits of this PR can be reviewed, but later commits are either moves or fixes.
Notes:
Activation. This struct is frees the rest of the code from dealing with the blocking io that may occur during shutdown.Once this is merged there will be a separate PR to fix migrate gauges to UpDownCounters where the same instrument is retained to allow the down to happen on drop.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
No new functionality so no docs.
This is super hard to unit test as the generated exporters are opaque to us. Was a view configured correctly? Difficult to tell.
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩