[Subgraph Insights] Add Apollo Subgraph Fetch Histogram to Telemetry Plugin #8013

Merged
rregitsky merged 47 commits into dev from
rreg/PULSR-1673/top-level-subgraph-fetch-historgram
Aug 5, 2025

Contributor

@rregitsky rregitsky commented Jul 30, 2025


Description
This change adds a new, experimental histogram to capture subgraph fetch duration for Apollo Studio. The instrument,
apollo.router.operations.fetch.duration, has the following attributes:

  • apollo.client.name
  • apollo.client.version
  • has_errors
  • apollo.operation.id
  • graphql.operation.kind
  • graphql.operation.name
  • subgraph.name

This can be toggled on using a new boolean config flag:

telemetry:
  apollo:
    experimental_subgraph_metrics: true

The instrument is currently sent only to GraphOS and is not available in 3rd-party OTel export targets. It is not
user customizable. Users who need a customizable equivalent can use the existing instrument
http.client.request.duration, which measures the same value.

Implementation
This is implemented using the custom instrument framework. These are Apollo-controlled metrics, so rather than allowing user customization as the existing use cases do, we hardcode the CustomHistogram with specific attributes. This lets us add Apollo metrics more easily in the future and means we can use the shared, standard ways of pulling data off of the request/response/context.

Cardinality is high, so the new metric follows the realtime_metrics path, which allows the scheduled_delay config to change how often the metric is sent.

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

apollo-librarian bot commented Jul 30, 2025

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 7643b6341b64a907fc66240e

Contributor

@bonnici bonnici left a comment

The parts that I understand are looking good to me.

StaticInstrument::Histogram(
    meter
        .f64_histogram(APOLLO_ROUTER_OPERATIONS_FETCH_DURATION)
        .with_unit("s")
Contributor

I guess seconds is the standard unit to use? Milliseconds or nanoseconds makes more sense to me but again I'm new to this code.

Contributor Author

@rregitsky rregitsky Jul 31, 2025

We actually don't have control over this for CustomHistogram.

Some(instant.elapsed().as_secs_f64())

In fact, all the histogram refs I could find in the router were using f64 to measure seconds.
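
The pattern quoted above can be sketched in isolation with only the standard library. This is an illustration, not the router's code: `timed_seconds` and `fake_fetch` are hypothetical names, and the only call taken from the thread is `instant.elapsed().as_secs_f64()`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of the router): runs `work` and returns its
/// wall-clock duration as fractional seconds, which is the shape of value a
/// histogram declared with unit "s" would receive.
fn timed_seconds<F: FnOnce()>(work: F) -> f64 {
    let instant = Instant::now();
    work();
    // Same conversion as the line quoted in the comment above.
    instant.elapsed().as_secs_f64()
}

/// Stand-in for a subgraph fetch; it only sleeps for five milliseconds.
fn fake_fetch() {
    std::thread::sleep(Duration::from_millis(5));
}
```

Calling `timed_seconds(fake_fetch)` should return a value a little above 0.005, since the sleep lasts at least five milliseconds.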

Contributor

I suppose if it were possible for this to be milliseconds, we might be able to use a 32-bit value and save some space over the wire. I'd also need to handle this on the ingestion side, since we don't respect the unit yet and just assume that everything is in seconds. That said, I'm not sure this would work, since the protobuf specifies a double type for the bounds: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L198

Member

As far as I understand, it is indeed the standard unit to use for all time measurements, but anyone displaying it would convert the unit to the most appropriate scale at that point.
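
As a rough illustration of converting at display time, a consumer could keep the stored value in seconds and pick a scale only when rendering. The sketch below is an assumption about how such a helper might look; `format_seconds` is a hypothetical name and is not taken from the router or from Studio.

```rust
/// Hypothetical display helper: takes a duration in seconds (as recorded)
/// and renders it at a scale that is easier to read.
fn format_seconds(secs: f64) -> String {
    if secs >= 1.0 {
        format!("{:.2} s", secs)
    } else if secs >= 1e-3 {
        format!("{:.2} ms", secs * 1e3)
    } else if secs >= 1e-6 {
        format!("{:.2} us", secs * 1e6)
    } else {
        format!("{:.0} ns", secs * 1e9)
    }
}
```

With this shape, the recorded unit stays uniform across instruments and only the presentation layer needs to know about milliseconds or microseconds.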

Contributor Author

@timbotnik as I mentioned, CustomHistogram controls the unit recorded to the underlying histogram. We could likely add a way to specify a unit in its constructor, but if that's the route we want I'd prefer to add it as a follow-up to this PR.

Contributor

IMO there's not a great reason to change this now (an argument about precision would be the only reason, but the numbers we're dealing with are precise enough). Let's make a tech debt ticket which would include updating the ingestion side to fail gracefully on different units (or support them!)
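
One way to sanity-check the precision point is to convert nanosecond counts to f64 seconds and back. The sketch below uses only the standard library; `round_trip_nanos` is a hypothetical helper written for this illustration, and the durations chosen are arbitrary examples rather than measurements from the router.

```rust
use std::time::Duration;

/// Converts a nanosecond count to f64 seconds (as a histogram would record
/// it) and back again, returning the recovered nanosecond count.
fn round_trip_nanos(nanos: u64) -> u64 {
    let secs = Duration::from_nanos(nanos).as_secs_f64();
    // Rounding absorbs the tiny floating-point error of the conversion.
    (secs * 1e9).round() as u64
}
```

For durations in the range of typical fetches (for example one nanosecond, about two minutes, or an hour plus one nanosecond), the round trip recovers the original count, which is consistent with the view that f64 seconds are precise enough here.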

attributes: SubgraphAttributes::builder()
    .subgraph_name(StandardAttribute::Bool(true))
    .graphql_operation_type(StandardAttribute::Aliased {
        alias: "operation.kind".to_string(),
Contributor

@timbotnik timbotnik Jul 31, 2025

We're looking for graphql.operation.type on the ingestion side so this probably should come from the supergraph operation type and shouldn't require an alias. Then again, I wonder if there are any use-cases where the supergraph operation type and the subgraph operation type can be different? Actually in the case of federation or subscriptions perhaps it can be different... in which case we might actually want to track both the supergraph type and subgraph type. Let's ask someone from the Router team to confirm that it's possible.

Contributor Author

The implementation here actually notes that it should always be the same:

// Subgraph operation type wil always match the supergraph operation type

That said, I may as well pull it from the context to make it "future proof" in case that changes.

@rregitsky rregitsky marked this pull request as ready for review July 31, 2025 14:18
@rregitsky rregitsky requested a review from a team July 31, 2025 14:18
@rregitsky rregitsky requested a review from a team as a code owner July 31, 2025 14:18
@rregitsky rregitsky merged commit 9a0ac15 into dev Aug 5, 2025
15 checks passed
@rregitsky rregitsky deleted the rreg/PULSR-1673/top-level-subgraph-fetch-historgram branch August 5, 2025 14:12
@lrlna lrlna mentioned this pull request Aug 25, 2025

4 participants