Provide unit conversion for common non-second duration instruments (TSH-20621) #8415

Closed
theJC wants to merge 8 commits into apollographql:dev from theJC:convertNonSecondTimeUnits

@theJC
Contributor

@theJC theJC commented Oct 14, 2025

Apollo customers integrate their OTLP streams with a variety of observability platform providers, which differ in how sophisticated they are about ingesting incoming data.

We have a migration-blocking use case that requires a metric stream to be sent in units of milliseconds. If this were a brand-new metric we wouldn't need this, but:

  • The custom metric has existed for years on our Apollo Gateway-based federated graph. It is emitted in milliseconds and is defined as milliseconds in Datadog.

  • Our internal customers have leveraged this metric heavily over the years: 2,322 different Datadog artifacts (dashboards, SLOs, monitors, etc.) use it, referring to it 4,184 times in total. We do not have the capacity to transition customers to a new metric with different units at this time; that would have to be a gradual process after our Router migration is complete.

  • We need both Router and Gateway to produce this metric while we migrate the remaining traffic off our Gateway solution, so that those who use it to monitor the performance and availability of subgraphs have continuity throughout the migration of clients from the Gateway solution to the Router solution.

  • When we attempted to send this metric via Router, the values emitted by Router were 1000 times lower, because Router reports the duration in seconds. The Datadog ingestion pipeline does not convert incoming data to match the unit defined in the metric's metadata.

  • Therefore we need Router to be able to emit this metric in ms units. I am fine with wording in the documentation saying that one should strive to use seconds for durations and use other units only when the reality of integrating with various OTLP-ingesting systems requires some flexibility, especially for migrations where customers have invested heavily in a particular metric in their previous Gateway-based incarnation of a federated graph.
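To make the factor of 1000 concrete, here is a minimal Rust sketch of converting one measured duration into a configured unit. The helper name `to_unit` is hypothetical and this is not the Router's implementation; it only assumes, as described above, that `"ms"` is a supported unit string and that anything unrecognized falls back to seconds.

```rust
use std::time::Duration;

/// Hypothetical helper (not the Router's code): express a duration in the
/// requested unit, falling back to seconds for any unrecognized unit string.
fn to_unit(duration: Duration, unit: &str) -> f64 {
    match unit {
        // Milliseconds: scale floating-point seconds by 1000.
        "ms" => duration.as_secs_f64() * 1000.0,
        // Default: report seconds.
        _ => duration.as_secs_f64(),
    }
}

fn main() {
    // The same 250 ms measurement, reported in two units.
    let elapsed = Duration::from_millis(250);
    println!("as seconds:      {}", to_unit(elapsed, "s"));
    println!("as milliseconds: {}", to_unit(elapsed, "ms"));
}
```

A backend whose metric metadata says milliseconds, and which does no conversion of its own, would read the seconds value (0.25) as 0.25 ms rather than 250 ms, which is the discrepancy described in the list above.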


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@theJC theJC requested a review from a team October 14, 2025 04:48
Contributor

@bnjjj bnjjj left a comment

Could you provide a proper description of the PR, please, so I can be sure I understand the end goal? Also, I'm not against this change, but I can see one main issue: the lack of consistency (which we probably already have) in the way we measure time here. I think it would be good to be consistent and always use seconds everywhere, except potentially for really long durations.

@theJC
Contributor Author

theJC commented Oct 14, 2025

@bnjjj -- Updated the description, apologies, I ran out of steam before crashing last night ;)

In a clean-room implementation of a brand-new supergraph, I completely agree with you on wanting to be consistent with time units when possible. However, for migration cases where a metric already exists from the Router's predecessor and the migration plan requires Router to keep emitting that same metric, or for cases where the OTLP integration platform has limitations, I believe Router's customers need a small amount of flexibility on units.

Also, in case this helps, ref: TSH-20621

@theJC theJC changed the title from "Provide unit conversion for common non-second duration instruments" to "Provide unit conversion for common non-second duration instruments (TSH-20621)" Oct 14, 2025
@theJC
Contributor Author

theJC commented Oct 14, 2025

System testing this change leveraging otel-collector docker image:

Sent a request in and had the collector log out the custom metric (using ms) and the http.client.request.duration (using seconds). Additional attributes on the metric removed here to keep these snippets terse:

Metric #0
Descriptor:
     -> Name: http.client.request.duration
     -> Description: Duration of HTTP client requests.
     -> Unit: s
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
Data point attributes:
     -> clientname: Str(client-name)
     -> graphql.operation.name: Str(thisIsATestOperation)
     -> http.response.status_code: Int(200)
     -> subgraph.name: Str(graphql-diagnostics-api)
StartTimestamp: 2025-10-14 15:33:42.996522 +0000 UTC
Timestamp: 2025-10-14 15:57:08.024024 +0000 UTC
Count: 1
Sum: 0.122261
Min: 0.122261
Max: 0.122261
Metric #1
Descriptor:
     -> Name: custom.metric.call.time
     -> Description: Call time for a subgraph service and process results
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
Data point attributes:
     -> clientname: Str(client-name)
     -> graphql.operation.name: Str(thisIsATestOperation)
     -> subgraph.name: Str(graphql-diagnostics-api)
StartTimestamp: 2025-10-14 15:33:42.996545 +0000 UTC
Timestamp: 2025-10-14 15:57:08.024048 +0000 UTC
Count: 1
Sum: 122.359542
Min: 122.359542
Max: 122.359542

@theJC theJC force-pushed the convertNonSecondTimeUnits branch from 7515b9d to 0a3f1cc Compare October 14, 2025 16:11
@theJC theJC requested a review from a team as a code owner October 14, 2025 17:47
Comment on lines +19 to +24
telemetry:
  instrumentation:
    instruments:
      router:
        http.server.request.duration:
          unit: "ms" # Values are now automatically converted to milliseconds
Contributor

This configuration is probably wrong, because this is a built-in/OTel metric rather than a custom one. The setting would work for a custom metric but not for built-in/OTel ones.

Contributor Author

@theJC theJC Oct 15, 2025

Good catch! Updated to an example that does work and is representative of actual intended use case.

Contributor

I think it's still invalid. Let me give you a correct one

Contributor Author

Thanks for the suggestion, I've used it

/// Defaults to seconds for any other unit string.
fn duration_to_f64(duration: std::time::Duration, unit: &str) -> f64 {
    match unit {
        "ms" => duration.as_secs_f64() * 1000.0,
Contributor

Why not `as_millis() as f64`, for consistency with the other units?

Contributor Author

@theJC theJC Oct 15, 2025

Good 👀. I'm using as_secs_f64() * 1000 here because:
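For context on the two expressions under discussion, the following sketch shows a general property of the Rust standard library; it is not a restatement of the reasons given in the reply. `Duration::as_millis()` returns a whole number of milliseconds as a `u128`, so the sub-millisecond remainder is dropped, while `as_secs_f64() * 1000.0` keeps it. The function names `millis_truncated` and `millis_fractional` are hypothetical and exist only for this illustration.

```rust
use std::time::Duration;

/// Whole milliseconds only: `as_millis()` returns a `u128`, so any
/// sub-millisecond remainder is dropped before the cast to `f64`.
fn millis_truncated(d: Duration) -> f64 {
    d.as_millis() as f64
}

/// Fractional milliseconds: the duration is first expressed as
/// floating-point seconds and then scaled, so the remainder is kept.
fn millis_fractional(d: Duration) -> f64 {
    d.as_secs_f64() * 1000.0
}

fn main() {
    // 1500 microseconds, i.e. one and a half milliseconds.
    let d = Duration::from_micros(1_500);
    println!("as_millis() as f64:     {}", millis_truncated(d));
    println!("as_secs_f64() * 1000.0: {}", millis_fractional(d));
}
```

For a histogram of fast calls, the truncating form would report every sub-millisecond measurement as 0, so the choice between the two affects recorded values and is more than a matter of style.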

@theJC theJC force-pushed the convertNonSecondTimeUnits branch from 4e47c08 to 2a688cc Compare October 15, 2025 15:04
@theJC theJC requested review from BrynCooke and bnjjj October 15, 2025 15:07
Co-authored-by: Coenen Benjamin <benjamin.coenen@hotmail.com>
@bnjjj
Contributor

bnjjj commented Oct 15, 2025

@Mergifyio copy dev
@theJC Thanks so much for opening this! Now that it looks like approval is on the horizon, we're going to move this PR over to a direct branch on the repository so the full CI run can happen, including having access to our GITHUB_TOKEN which allows us to go over the GitHub anonymous download rate-limits which aren't currently being permitted on your PR.
You will briefly see a new PR show up in the metadata here, and it will preserve your contribution credit!

@mergify
Contributor

mergify bot commented Oct 15, 2025

copy dev

✅ Pull request copies have been created

@bnjjj bnjjj closed this Oct 15, 2025
bnjjj pushed a commit that referenced this pull request Oct 16, 2025
…SH-20621) (copy #8415) (#8423)

Signed-off-by: Benjamin <5719034+bnjjj@users.noreply.github.com>
Co-authored-by: Jon Christiansen <467023+theJC@users.noreply.github.com>
@abernix abernix mentioned this pull request Oct 27, 2025
@theJC theJC deleted the convertNonSecondTimeUnits branch December 8, 2025 18:13

3 participants