(feat) add router overhead metric #8455

Merged: BrynCooke merged 16 commits into dev from bryn/router-overhead on Oct 21, 2025

@BrynCooke BrynCooke commented Oct 20, 2025

Router overhead metric

The apollo.router.overhead histogram directly measures router processing overhead: the time the router spends on work other than waiting for subgraph requests, including GraphQL parsing, validation, query planning, response composition, and plugin execution.

The overhead calculation excludes time spent waiting for subgraph responses, giving you visibility into the router's actual processing time versus subgraph latency. This metric helps identify when the router itself is a bottleneck versus when delays are caused by downstream services.
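The exclusion of subgraph wait time can be illustrated with a minimal sketch. This is not the code added in this PR; `OverheadTracker` and its methods are hypothetical names, and timestamps are passed in as offsets from the start of the request so the example stays deterministic. The assumption is that overhead is the wall-clock time during which no subgraph request is in flight.

```rust
use std::time::Duration;

/// Hypothetical tracker (not the router's implementation): accumulates the
/// time during which no subgraph request is in flight. All timestamps are
/// offsets from the start of the router request.
#[derive(Debug, Default)]
pub struct OverheadTracker {
    active_subgraph_requests: u32,
    last_event: Duration,
    overhead: Duration,
}

impl OverheadTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Credits the interval since the previous event to overhead, but only
    /// if no subgraph request was active during that interval.
    fn advance(&mut self, now: Duration) {
        if self.active_subgraph_requests == 0 {
            self.overhead += now.saturating_sub(self.last_event);
        }
        self.last_event = now;
    }

    pub fn subgraph_request_started(&mut self, now: Duration) {
        self.advance(now);
        self.active_subgraph_requests += 1;
    }

    pub fn subgraph_request_finished(&mut self, now: Duration) {
        self.advance(now);
        self.active_subgraph_requests = self.active_subgraph_requests.saturating_sub(1);
    }

    /// Returns the accumulated overhead and whether any subgraph request
    /// was still active when the measurement was taken.
    pub fn finish(&mut self, now: Duration) -> (Duration, bool) {
        self.advance(now);
        (self.overhead, self.active_subgraph_requests > 0)
    }
}
```

In this sketch, overlapping subgraph requests are not double-counted: only intervals with zero active requests contribute to overhead. The boolean returned by `finish` plays the role of the `subgraph.active_requests` attribute described below.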

Important considerations for router overhead metrics:

  • Version variability: Router overhead may vary between router versions. For example, a correctness fix or security improvement may result in higher overhead. Always compare overhead measurements within the same router version.

  • Configuration requirements: For meaningful overhead measurements, configure operation limits and traffic shaping. Without these controls, unbounded request complexity or traffic spikes can skew overhead measurements.

  • CPU saturation: High overhead values often indicate CPU saturation. When the router's CPU resources are exhausted, processing time increases significantly. Monitor CPU utilization alongside overhead metrics to identify resource constraints.

Available attributes:

  • subgraph.active_requests: A boolean indicating whether any subgraph requests were active at the time the overhead was calculated. This attribute is critical for filtering meaningful overhead measurements.

    When to filter out measurements with subgraph.active_requests set to true: For operations that stream results (such as queries with @defer), the overhead metric is less meaningful while subgraph requests are still active, because the router is waiting rather than actively processing. When analyzing overhead to identify router processing bottlenecks, exclude measurements where subgraph.active_requests is true so that subgraph wait time does not distort the router's processing time.
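As a rough illustration of the filtering described above, the following sketch averages only the measurements taken with no active subgraph requests. `OverheadSample` and `mean_settled_overhead` are hypothetical names for this example; in practice the filter would more likely be applied as an attribute filter in the metrics backend rather than in application code.

```rust
use std::time::Duration;

/// Hypothetical exported data point: one overhead measurement together with
/// the value of its `subgraph.active_requests` attribute.
#[derive(Debug, Clone, Copy)]
pub struct OverheadSample {
    pub overhead: Duration,
    pub subgraph_active_requests: bool,
}

/// Mean overhead over samples recorded with no active subgraph requests.
/// Returns `None` when no sample qualifies.
pub fn mean_settled_overhead(samples: &[OverheadSample]) -> Option<Duration> {
    let settled: Vec<Duration> = samples
        .iter()
        .filter(|s| !s.subgraph_active_requests) // drop samples taken while waiting
        .map(|s| s.overhead)
        .collect();
    if settled.is_empty() {
        return None;
    }
    let total: Duration = settled.iter().sum();
    Some(total / settled.len() as u32)
}
```

A sample recorded while a deferred response was still waiting on a subgraph is excluded, so it cannot pull the average away from the router's own processing time.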

Configuration example:

telemetry:
  instrumentation:
    instruments:
      router:
        apollo.router.overhead: true

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@BrynCooke BrynCooke requested a review from a team October 20, 2025 09:34
@BrynCooke BrynCooke requested a review from a team as a code owner October 20, 2025 09:34

apollo-librarian bot commented Oct 20, 2025

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

4 new, 9 changed, 9 removed
+ graphos/routing/(latest)/performance/caching/distributed.mdx
+ graphos/routing/(latest)/performance/caching/entity.mdx
+ graphos/routing/(latest)/performance/caching/in-memory.mdx
+ graphos/routing/(latest)/performance/caching/index.mdx
* graphos/routing/(latest)/configuration/yaml.mdx
* graphos/routing/(latest)/customization/coprocessor/reference.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/selectors.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/standard-instruments.mdx
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/usage-guides/debugging-subgraph-requests.mdx
* graphos/routing/(latest)/performance/query-batching.mdx
* graphos/routing/(latest)/query-planning/query-planning-best-practices.mdx
* graphos/routing/(latest)/graphos-features.mdx
* graphos/routing/(latest)/_sidebar.yaml
- graphos/routing/(latest)/operations/apq.mdx
- graphos/routing/(latest)/performance/caching/response-caching/customization.mdx
- graphos/routing/(latest)/performance/caching/response-caching/faq.mdx
- graphos/routing/(latest)/performance/caching/response-caching/invalidation.mdx
- graphos/routing/(latest)/performance/caching/response-caching/observability.mdx
- graphos/routing/(latest)/performance/caching/response-caching/overview.mdx
- graphos/routing/(latest)/performance/caching/response-caching/quickstart.mdx
- graphos/routing/(latest)/performance/index.mdx
- graphos/routing/(latest)/query-planning/caching.mdx

Build ID: db0ec3998b801bf109974510
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/db0ec3998b801bf109974510

@BrynCooke BrynCooke force-pushed the bryn/router-overhead branch 2 times, most recently from cddaf34 to 6bf8c79 Compare October 20, 2025 10:19
bryn added 7 commits October 20, 2025 11:59
…Value

This refactors the increment handling in the telemetry system to use
opentelemetry::Value (F64) instead of i64. This provides better type
compatibility with the OpenTelemetry specification and allows for more
flexible metric value handling.

Technical changes:
- Updated Increment enum variants to use Option<opentelemetry::Value>
- Modified value conversion functions to work with opentelemetry::Value
- Updated all increment operations in counters and histograms

Adds a new metric to track router processing overhead, which measures
the time spent in the router that is not waiting for subgraph requests.

This metric provides insight into the router's own processing time and
helps identify performance bottlenecks within the router itself.

Technical changes:
- Added router_overhead instrument configuration
- Integrated RouterOverheadAttributes for metric attributes
- Wired up metric collection in router request/response lifecycle
@BrynCooke BrynCooke force-pushed the bryn/router-overhead branch from 6bf8c79 to a69f46d Compare October 20, 2025 12:14
@BrynCooke BrynCooke requested a review from bnjjj October 20, 2025 12:46
@BrynCooke BrynCooke force-pushed the bryn/router-overhead branch from d7eb2a6 to 82af50e Compare October 20, 2025 12:50
.f64_histogram(ROUTER_OVERHEAD_METRIC)
.with_unit("s")
.with_description(
"Router processing overhead (time not spent waiting for subgraphs).",

Should we document somewhere that, if users have a coprocessor, its time is included in this metric?

@BrynCooke BrynCooke Oct 20, 2025

bnjjj commented Oct 20, 2025

BTW, if PR #8389 lands, I think we'll also be able to enable this layer for coprocessors.

@BrynCooke BrynCooke force-pushed the bryn/router-overhead branch from 6aa1dab to 1b7b555 Compare October 20, 2025 14:39
@BrynCooke BrynCooke requested a review from bnjjj October 20, 2025 15:36
@BrynCooke BrynCooke enabled auto-merge (squash) October 21, 2025 09:40
@BrynCooke BrynCooke merged commit e819a7a into dev Oct 21, 2025
15 checks passed
@BrynCooke BrynCooke deleted the bryn/router-overhead branch October 21, 2025 09:59
@abernix abernix mentioned this pull request Oct 27, 2025