[Error counting] Move error counting to telemetry plugin by timbotnik · Pull Request #7781 · apollographql/router

timbotnik · 2025-06-29T23:58:43Z

This migrates error counting for telemetry from the Router service layer to the Telemetry plugin. This will allow us to capture more errors than before, including:

Errors introduced after the router layer, usually in plugins (e.g. Free Plan rate limiting)
Errors with information redacted before the router layer (e.g. subgraph errors redacted by include_subgraph_error config)

Errors are now counted at each layer in the response path. We prevent double counting using a new Error ID released in a previous PR to keep track of previously counted errors. The ID is internal to the router and is not serializable. Each time we count an error on the response, we then store its ID in the response context for the next layer to check against. The outlier is the Router Service layer where we are working with a serialized response. To avoid adding additional latency by deserializing, we instead store the list of raw errors in context (keyed by ID) when the response is first created. We then check the IDs against previously counted errors as normal.
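
The de-duplication described above can be sketched roughly as follows. This is a minimal illustration, not the router's actual code: `ErrorId`, `GraphQLError`, `Context`, `count_new_errors`, and `demo_layers` are all hypothetical names, and the real Error ID type and context storage may look quite different.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the router's internal, non-serializable error ID.
type ErrorId = u64;

/// Hypothetical, simplified GraphQL error carrying an internal ID.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct GraphQLError {
    id: ErrorId,
    message: String,
}

/// Hypothetical stand-in for the per-request response context.
#[derive(Debug, Default)]
struct Context {
    counted_error_ids: HashSet<ErrorId>,
}

/// Counts only the errors that no earlier layer has counted, and records
/// their IDs in the context so that later layers skip them.
/// Returns the number of newly counted errors.
fn count_new_errors(context: &mut Context, errors: &[GraphQLError]) -> usize {
    errors
        .iter()
        // `HashSet::insert` returns true only when the ID was not present yet.
        .filter(|error| context.counted_error_ids.insert(error.id))
        .count()
}

/// Simulates two layers on the response path looking at overlapping errors.
fn demo_layers() -> (usize, usize) {
    let mut context = Context::default();
    let subgraph_error = GraphQLError { id: 1, message: "subgraph failed".into() };
    let plugin_error = GraphQLError { id: 2, message: "rate limited".into() };

    // An inner layer sees one error and counts it.
    let first = count_new_errors(&mut context, &[subgraph_error.clone()]);
    // An outer layer sees the same error plus one introduced by a plugin;
    // only the new one is counted.
    let second = count_new_errors(&mut context, &[subgraph_error, plugin_error]);
    (first, second)
}
```

In this sketch each layer reports one new error, so the total across layers equals the number of distinct errors rather than the number of times an error was observed.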

Part 3 of a split from #7357. Part 1 can be found here: #7699 and part 2 here: #7712.

apollo-librarian · 2025-06-30T02:42:00Z

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 7abce71001fd18553629afaf

apollo-router/tests/integration/telemetry/apollo_otel_metrics.rs

apollo-router/tests/common.rs

apollo-router/src/plugins/license_enforcement/mod.rs

apollo-router/src/plugins/telemetry/error_counter.rs

apollo-router/src/plugins/telemetry/mod.rs

apollo-router/tests/integration/telemetry/apollo_otel_metrics.rs

...r/tests/integration/telemetry/fixtures/apollo_otel_metrics_csrf_required_headers.router.yaml

rregitsky · 2025-07-18T19:11:55Z

apollo-router/src/metrics/mod.rs

+        fn equal_attributes(expected: &AttributeSet, actual: &[KeyValue]) -> bool {
+            // If lengths are different, we can short circuit. This also accounts for a bug where
+            // an empty attributes list would always be considered "equal" due to zip capping at
+            // the shortest iter's length
+            if expected.iter().count() != actual.len() {
+                return false;
+            }
+            // This works because the attributes are always sorted
+            expected.iter().zip(actual.iter()).all(|((k, v), kv)| {


Noting that I fixed a bug here: because of how zip works, these assertions would always have passed regardless of the actual number of attributes on a matching emitted metric:

// This used to pass
u64_counter!("my.metric.count", 1, attr = "val");
assert_counter!("my.metric.count", 1, &[]);
assert_counter!("my.metric.count", 1);

// This also used to pass
u64_counter!("my.other.metric", 1);
assert_counter!("my.other.metric", 1, attr = "val");

Didn't want it to get lost in the massive deletions in this file.

E.g. this miswritten existing test that should have been failing:

router/apollo-router/src/metrics/mod.rs

Lines 1772 to 1773 in 52ce4da

i64_up_down_counter_with_unit!("test", "test description", "{request}", 1);

assert_up_down_counter!("test", 1, "attr" = "val");
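
The zip behaviour behind this bug can be reproduced in isolation. The following is a minimal sketch using plain string pairs in place of the `AttributeSet` and `KeyValue` types from the diff; both function names are hypothetical and exist only for this illustration.

```rust
/// Buggy comparison: `zip` stops at the shorter iterator, so an empty
/// slice on either side yields no pairs and `all` returns true.
fn equal_attributes_zip_only(expected: &[(&str, &str)], actual: &[(&str, &str)]) -> bool {
    expected.iter().zip(actual.iter()).all(|(e, a)| e == a)
}

/// Fixed comparison: reject differing lengths before zipping.
/// As in the diff, this assumes both sides are sorted the same way.
fn equal_attributes_checked(expected: &[(&str, &str)], actual: &[(&str, &str)]) -> bool {
    expected.len() == actual.len()
        && expected.iter().zip(actual.iter()).all(|(e, a)| e == a)
}
```

With these definitions, `equal_attributes_zip_only(&[], &[("attr", "val")])` is `true`, while `equal_attributes_checked(&[], &[("attr", "val")])` is `false`, which is the difference the length check in the diff introduces.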

timbotnik

This is looking pretty good to me. I have one question, but I'd also push for a final Router team review.

timbotnik · 2025-07-21T07:29:47Z

apollo-router/src/plugins/telemetry/mod.rs


-                        response
+                        if let Ok(resp) = response {
+                            Ok(count_router_errors(resp, &config.apollo.errors).await)


Not sure how hard this is, but wondering if instead of rebuilding the response we could use a mutator like resp.count_router_errors. Might save some clones.

Hmm... I'm not sure there's a good way to avoid it. I have to split the response (technically the response.response) into parts to be able to pull out the errors. That action consumes the original response which I believe means we are forced to rebuild it. Happy to be told otherwise though.

The only exception is the router layer, in which the errors are sitting on the context. I'd prefer to keep a similar pattern between all layers though instead of mutating the response in the router layer only.
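
The split-and-rebuild pattern discussed in this thread can be sketched with simplified types. Everything here is hypothetical (`Parts`, `Body`, `Response`, and `count_errors` are stand-ins, not the router's types); it only illustrates why a consuming `into_parts` forces the caller to reassemble the response afterwards.

```rust
/// Hypothetical response metadata.
#[derive(Debug, PartialEq)]
struct Parts {
    status: u16,
}

/// Hypothetical response body holding GraphQL error messages.
#[derive(Debug, PartialEq)]
struct Body {
    errors: Vec<String>,
}

/// Hypothetical response made of metadata plus body.
#[derive(Debug, PartialEq)]
struct Response {
    parts: Parts,
    body: Body,
}

impl Response {
    /// Consumes the response and hands back ownership of its pieces.
    fn into_parts(self) -> (Parts, Body) {
        (self.parts, self.body)
    }

    /// Reassembles a response from previously split pieces.
    fn from_parts(parts: Parts, body: Body) -> Self {
        Response { parts, body }
    }
}

/// Counts the errors on a response. Because `into_parts` takes `self`
/// by value, the original response is gone after the split and a new one
/// has to be built from the same pieces before returning.
fn count_errors(response: Response, counter: &mut usize) -> Response {
    let (parts, body) = response.into_parts();
    *counter += body.errors.len();
    Response::from_parts(parts, body)
}
```

In this simplified model the pieces are moved rather than cloned, so the rebuild itself is cheap; whether the same holds for the router's real response types is not something this sketch can confirm.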

BrynCooke

Hopefully fairly simple, but the telemetry mod.rs is now 4000 lines. Please can you split the tests for the new functionality into a new module so that we can start breaking stuff up.


rregitsky and others added 22 commits June 16, 2025 16:43

Add error builder files

c77fa51

add setter, update some tests

3351df8

remove changes to invalidation_endpoint

2d1d80d

simplify some test id overrides. Revert response refactor changes

995696b

Merge remote-tracking branch 'origin/dev' into rreg/err_count_error_refactor_PULSR_1504

b30f3ff

resolve bool extension_code to None

40412ba

Switch id overwrites to new assert macros

8d25a94

new assert in a few more tests

3c817b2

more test cleanup

8bf1095

Merge remote-tracking branch 'origin/dev' into rreg/err_count_error_refactor_PULSR_1504

c273ee3

Fix typo

d92357b

Co-authored-by: timbotnik <tim@apollographql.com>

Merge remote-tracking branch 'origin/dev' into rreg/err_count_error_refactor_PULSR_1504

a6346be

Merge remote-tracking branch 'origin/rreg/err_count_error_refactor_PULSR_1504' into rreg/err_count_error_refactor_PULSR_1504

dbb4237

Review notes

6c95f60

Fix default issues

83ac7b3

Merge remote-tracking branch 'origin/dev' into rreg/err_count_error_refactor_PULSR_1504

4dfc4de

# Conflicts:
#   apollo-router/src/plugins/connectors/handle_responses.rs
#   apollo-router/src/services/layers/persisted_queries/mod.rs

Merge fix

bdffaba

Merge branch 'dev' into rreg/err_count_error_refactor_PULSR_1504

a86e9e5

Merge branch 'dev' into rreg/err_count_error_refactor_PULSR_1504

4cedc83

pull files for error counting

fdfd2e1

pull in remaining files

1bd1d4f

router plugin tests. Fix router layer ID mismatch

2bb088d

timbotnik changed the base branch from dev to rreg/err_count_error_refactor_PULSR_1504 June 29, 2025 23:58

timbotnik added 2 commits June 30, 2025 10:37

Fix existing error_counter unit tests

15f255e

error_counter: add subgraph layer unit test

c6ec49a

timbotnik added 3 commits June 30, 2025 12:56

error_counter: add execution layer unit test

d371288

error_counter: add router layer unit test

b9e1120

error_counter: remove superfluous defaults

abe820c

rregitsky added 2 commits July 17, 2025 10:57

lint fixes

ed4fc96

Merge remote-tracking branch 'origin/dev' into rreg/err_count_move_to_telemetry_PULSR_1504

69fc040

rregitsky marked this pull request as ready for review July 17, 2025 16:20

rregitsky requested a review from a team July 17, 2025 16:20

rregitsky reviewed Jul 17, 2025

apollo-router/tests/integration/telemetry/apollo_otel_metrics.rs

Fix otlp wait flakiness and reduce timings

db73218

timbotnik commented Jul 18, 2025

apollo-router/tests/common.rs

timbotnik commented Jul 18, 2025

rregitsky added 5 commits July 18, 2025 09:22

Remove todos. Remove mut in telemetry plugin

df5a395

inline integration test config

81f6be9

Merge remote-tracking branch 'origin/dev' into rreg/err_count_move_to_telemetry_PULSR_1504

ccb11af

Fix graphql_error empty string bug. Fix bug with equal_attributes. Remove todos

52ce4da

clippy fixes

9540c2e

rregitsky reviewed Jul 18, 2025

rregitsky added 2 commits July 18, 2025 15:20

Fix broken test

7f3c860

lint fix

3632d2e

timbotnik commented Jul 21, 2025

BrynCooke requested changes Jul 22, 2025

rregitsky added 2 commits July 22, 2025 09:44

Merge remote-tracking branch 'origin/dev' into rreg/err_count_move_to_telemetry_PULSR_1504

10b8448

move telemetry tests into their own file

87ff2d4

rregitsky requested a review from BrynCooke July 22, 2025 13:54

rregitsky added 3 commits July 22, 2025 10:04

Revert "move telemetry tests into their own file"

e611421

This reverts commit 87ff2d4.

move err counter to own dir. Split tests to own file

8b6aaea

lint fixes

3087199

BrynCooke requested changes Jul 22, 2025

apollo-router/src/lib.rs (outdated)

apollo-router/src/context/mod.rs (outdated)

un-public json_ext. Add cautionary comments to new consts

729b180

BrynCooke self-requested a review July 23, 2025 09:21

BrynCooke approved these changes Jul 23, 2025

rregitsky merged commit 4ad0cc7 into dev Jul 23, 2025
15 checks passed

rregitsky deleted the rreg/err_count_move_to_telemetry_PULSR_1504 branch July 23, 2025 14:52
