Skip to content

[Error counting] Move error counting to telemetry plugin#7781

Merged
rregitsky merged 77 commits intodevfrom
rreg/err_count_move_to_telemetry_PULSR_1504
Jul 23, 2025
Merged

[Error counting] Move error counting to telemetry plugin#7781
rregitsky merged 77 commits intodevfrom
rreg/err_count_move_to_telemetry_PULSR_1504

Conversation

@timbotnik
Copy link
Contributor

@timbotnik timbotnik commented Jun 29, 2025

This migrates error counting for telemetry from the Router service layer to the Telemetry plugin. This will allow us to capture more errors than before, including:

  • Errors introduced after the router layer, usually in plugins (e.g. Free Plan rate limiting)
  • Errors with information redacted before the router layer (e.g. subgraph errors redacted by include_subgraph_error config)

Errors are now counted at each layer in the response path. We prevent double counting using a new Error ID released in a previous PR to keep track of previously counted errors. The ID is internal to the router and is not serializable. Each time we count an error on the response, we then store its ID in the response context for the next layer to check against. The outlier is the Router Service layer where we are working with a serialized response. To avoid adding additional latency by deserializing, we instead store the list of raw errors in context (keyed by ID) when the response is first created. We then check the IDs against previously counted errors as normal.

Part 3 of a split from #7357. Part 1 can be found here: #7699 and part 2 here: #7712.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added3 and documented
  • Tests added and passing4
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

rregitsky and others added 22 commits June 16, 2025 16:43
Co-authored-by: timbotnik <tim@apollographql.com>
…LSR_1504' into rreg/err_count_error_refactor_PULSR_1504
…efactor_PULSR_1504

# Conflicts:
#	apollo-router/src/plugins/connectors/handle_responses.rs
#	apollo-router/src/services/layers/persisted_queries/mod.rs
@github-actions

This comment has been minimized.

@timbotnik timbotnik changed the base branch from dev to rreg/err_count_error_refactor_PULSR_1504 June 29, 2025 23:58
@apollo-librarian
Copy link

apollo-librarian bot commented Jun 30, 2025

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 7abce71001fd18553629afaf

@rregitsky rregitsky marked this pull request as ready for review July 17, 2025 16:20
@rregitsky rregitsky requested a review from a team July 17, 2025 16:20
Comment on lines +379 to +387
fn equal_attributes(expected: &AttributeSet, actual: &[KeyValue]) -> bool {
// If lengths are different, we can short circuit. This also accounts for a bug where
// an empty attributes list would always be considered "equal" due to zip capping at
// the shortest iter's length
if expected.iter().count() != actual.len() {
return false;
}
// This works because the attributes are always sorted
expected.iter().zip(actual.iter()).all(|((k, v), kv)| {
Copy link
Contributor

@rregitsky rregitsky Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that I fixed a bug here where these would have always passed regardless of the actual number of attributes in a matching emitted metric due to how zip works.:

// This used to pass
u64_counter!('my.other.metric', 1, attr = "val");
assert_counter!('my.metric.count', 1, &[])
assert_counter!('my.metric.count', 1)

// This also used to pass
u64_counter!('my.other.metric', 1);
assert_counter!('my.other.metric, 1, attr = "val");

Didn't want it to get lost in the massive deletions in this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. this miswritten existing test that should have been failing:

i64_up_down_counter_with_unit!("test", "test description", "{request}", 1);
assert_up_down_counter!("test", 1, "attr" = "val");

Copy link
Contributor Author

@timbotnik timbotnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good to me. One question but I'd be pushing for a final Router team review.


response
if let Ok(resp) = response {
Ok(count_router_errors(resp, &config.apollo.errors).await)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how hard this is, but wondering if instead of rebuilding the response we could use a mutator like resp.count_router_errors. Might save some clones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I'm not sure there's a good way to avoid it. I have to split the response (technically the response.response) into parts to be able to pull out the errors. That action consumes the original response which I believe means we are forced to rebuild it. Happy to be told otherwise though.

The only exception is the router layer, in which the errors are sitting on the context. I'd prefer to keep a similar pattern between all layers though instead of mutating the response in the router layer only.

Copy link
Contributor

@BrynCooke BrynCooke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully fairly simple, but the telemetry mod.rs is now 4000 lines. Please can you split the tests for the new functionality into a new module so that we can start breaking stuff up.

@rregitsky rregitsky requested a review from BrynCooke July 22, 2025 13:54
@BrynCooke BrynCooke self-requested a review July 23, 2025 09:21
@rregitsky rregitsky merged commit 4ad0cc7 into dev Jul 23, 2025
15 checks passed
@rregitsky rregitsky deleted the rreg/err_count_move_to_telemetry_PULSR_1504 branch July 23, 2025 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants