Provide unit conversion for common non-second duration instruments (TSH-20621)#8415
Provide unit conversion for common non-second duration instruments (TSH-20621)#8415theJC wants to merge 8 commits intoapollographql:devfrom
Conversation
bnjjj
left a comment
There was a problem hiding this comment.
Could you provide a proper description of the PR please to make sure I understand the end goal. Also I'm not against this change but I can see 1 main issue. First one is I think the lack of consistency (we probably already have) on the way we measure time here. I think it might be good to be consistent and always use seconds everywhere except if it's for really long duration potentially.
|
@bnjjj -- Updated the description, apologies, I ran out of steam before crashing last night ;) In a clean room implementation of a brand new supergraph, I completely agree with you on wanting to be consistent with time units when possible. However for migration cases where there already exists a metric from the Router predecessor in which the migration plan requires continuity of the emanation of the same metric from Router OR case where OTLP integration platform has limitations, I believe Router's customers require the ability for a small amount of flexibility on the units. Also, in case this helps, ref: TSH-20621 |
|
System testing this change leveraging otel-collector docker image: Sent a request in and had the collector log out the custom metric (using ms) and the http.client.request.duration (using seconds). Additional attributes on the metric removed here to keep these snippets terse: |
7515b9d to
0a3f1cc
Compare
| telemetry: | ||
| instrumentation: | ||
| instruments: | ||
| router: | ||
| http.server.request.duration: | ||
| unit: "ms" # Values are now automatically converted to milliseconds |
There was a problem hiding this comment.
This configuration is probably wrong. Because it's not a custom metric it's a built-in/otel metric. So it would work for a custom one but not for built-in/otel ones.
There was a problem hiding this comment.
Good catch! Updated to an example that does work and is representative of actual intended use case.
There was a problem hiding this comment.
I think it's still invalid. Let me give you a correct one
There was a problem hiding this comment.
Thanks for the suggestion, I've used it
| /// Defaults to seconds for any other unit string. | ||
| fn duration_to_f64(duration: std::time::Duration, unit: &str) -> f64 { | ||
| match unit { | ||
| "ms" => duration.as_secs_f64() * 1000.0, |
There was a problem hiding this comment.
why not as_millis() as f64 for consistency with the other units?
There was a problem hiding this comment.
Good 👀. I'm using as_secs_f64() * 1000 here because:
- duration.as_millis() returns a u128 integer, which truncates any fractional milliseconds.
- as_millis_f64() exists but it's currently unstable (Tracking Issue for
Duration::as_millis_{f64,f32}rust-lang/rust#122451)
…-generated suggestions)
…-generated suggestions)
4e47c08 to
2a688cc
Compare
| telemetry: | ||
| instrumentation: | ||
| instruments: | ||
| router: | ||
| http.server.request.duration: | ||
| unit: "ms" # Values are now automatically converted to milliseconds |
There was a problem hiding this comment.
I think it's still invalid. Let me give you a correct one
Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>
|
@Mergifyio copy dev |
✅ Pull request copies have been createdDetails
|
Customers of Apollo integrate their OTLP streams with various different observability platform providers that have varying levels of sophistication on how they ingest incoming data.
We have a migration blocking use case where we require the ability to have a metric stream be sent in units of milliseconds. If this was a brand new metric we wouldnt need to, but:
The custom metric has existed for years on our Apollo Gateway based federated graph, emanating this metric as milliseconds and is defined as milliseconds in Datadog
Our internal customers have significantly leveraged this metric over the years and there are 2,322 different datadog artifacts (dashboards, SLOs, monitors, etc) that use this metric, is referred 4,184 different times. We do not have the capacity of transitioning customers to a new metric that uses different units at this time, it would have to be a gradual process over time, after completion of our Router migration
We need both Router and Gateway to be producing this metric as we migrate the remainder of traffic off our Gateway solution so that those using this metric to monitor the performance and availability of subgraphs have continuity throughout the migration of clients from the Gateway soluction to the Router solution.
When attempting to send this metric via Router, it was observed that the values emanated from Router are 1000 times lower... the Datadog ingestion pipeline will not do unit conversion of incoming data to match how the metric is defined in the metric metadata.
Therefore we need Router the ability for Router to emanate this metric in ms units. I am fine with verbiage in the documentation that one should strive to use second units for durations, and only use non-seconds when they uncover the reality of integrating with various OTLP ingesting systems requires some flexibility, especially for migrations where your customers may have heavily invested in a particular metric in their previous incarnation of a federated graph with Gateway.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩