[Error counting] Move error counting to telemetry plugin#7781
fn equal_attributes(expected: &AttributeSet, actual: &[KeyValue]) -> bool {
    // If lengths are different, we can short circuit. This also accounts for a bug where
    // an empty attributes list would always be considered "equal" due to zip capping at
    // the shortest iter's length
    if expected.iter().count() != actual.len() {
        return false;
    }
    // This works because the attributes are always sorted
    expected.iter().zip(actual.iter()).all(|((k, v), kv)| {
Noting that I fixed a bug here: because of how zip works, these assertions would always have passed regardless of the actual number of attributes in a matching emitted metric:
// This used to pass
u64_counter!("my.other.metric", 1, attr = "val");
assert_counter!("my.metric.count", 1, &[]);
assert_counter!("my.metric.count", 1);
// This also used to pass
u64_counter!("my.other.metric", 1);
assert_counter!("my.other.metric", 1, attr = "val");
Didn't want it to get lost in the massive deletions in this file.
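To illustrate the zip behaviour described in this comment, here is a minimal standalone sketch. It does not use the router's `AttributeSet` or `KeyValue` types; the `Attr` alias and both function names are hypothetical stand-ins chosen only for this example.

```rust
/// Hypothetical stand-in for a metric attribute: a (key, value) pair.
type Attr<'a> = (&'a str, &'a str);

/// Comparison without a length check. `zip` stops at the shorter input,
/// so when one side is empty, `all` sees no pairs and returns `true`.
fn equal_attributes_zip_only<'a>(expected: &[Attr<'a>], actual: &[Attr<'a>]) -> bool {
    expected.iter().zip(actual.iter()).all(|(e, a)| e == a)
}

/// Comparison with the short circuit: differing lengths can never be equal.
fn equal_attributes_checked<'a>(expected: &[Attr<'a>], actual: &[Attr<'a>]) -> bool {
    expected.len() == actual.len() && equal_attributes_zip_only(expected, actual)
}
```

With one emitted attribute and an empty expected list, `equal_attributes_zip_only` reports a match while `equal_attributes_checked` does not, which is the class of false positive the length check is meant to rule out.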
E.g. this miswritten existing test that should have been failing:
router/apollo-router/src/metrics/mod.rs
Lines 1772 to 1773 in 52ce4da
timbotnik left a comment
This is looking pretty good to me. One question but I'd be pushing for a final Router team review.
response
if let Ok(resp) = response {
    Ok(count_router_errors(resp, &config.apollo.errors).await)
Not sure how hard this is, but wondering if instead of rebuilding the response we could use a mutator like resp.count_router_errors. Might save some clones.
Hmm... I'm not sure there's a good way to avoid it. I have to split the response (technically the response.response) into parts to be able to pull out the errors. That action consumes the original response which I believe means we are forced to rebuild it. Happy to be told otherwise though.
The only exception is the router layer, in which the errors are sitting on the context. I'd prefer to keep a similar pattern between all layers though instead of mutating the response in the router layer only.
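The ownership constraint discussed in this thread can be sketched with simplified types. Everything below is hypothetical: `Response`, `Body`, `into_parts`, `from_parts`, and `count_errors` are illustrative stand-ins, not the router's real types or the PR's `count_router_errors`. The sketch only shows why a consuming split forces the caller to rebuild the response.

```rust
/// Hypothetical, simplified response body (not the router's real type).
#[derive(Debug, PartialEq)]
struct Body {
    errors: Vec<String>,
    data: Option<String>,
}

/// Hypothetical, simplified response wrapper.
#[derive(Debug, PartialEq)]
struct Response {
    status: u16,
    body: Body,
}

impl Response {
    /// Takes `self` by value: after this call the original response is gone.
    fn into_parts(self) -> (u16, Body) {
        (self.status, self.body)
    }

    /// Reassembles a response from previously split parts.
    fn from_parts(status: u16, body: Body) -> Self {
        Response { status, body }
    }
}

/// Splits the response to reach the errors, counts them, then rebuilds it.
fn count_errors(resp: Response, counter: &mut usize) -> Response {
    let (status, body) = resp.into_parts();
    *counter += body.errors.len();
    Response::from_parts(status, body)
}
```

In this simplified form the parts are moved rather than cloned. Whether the real implementation can avoid clones depends on details of the router's response types that are not shown here.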
BrynCooke left a comment
Hopefully fairly simple, but the telemetry mod.rs is now 4000 lines. Please can you split the tests for the new functionality into a new module so that we can start breaking stuff up.
This migrates error counting for telemetry from the Router service layer to the Telemetry plugin. This will allow us to capture more errors than before, including:
Errors are now counted at each layer in the response path. We prevent double counting using a new Error ID released in a previous PR to keep track of previously counted errors. The ID is internal to the router and is not serializable. Each time we count an error on the response, we then store its ID in the response context for the next layer to check against. The outlier is the Router Service layer where we are working with a serialized response. To avoid adding additional latency by deserializing, we instead store the list of raw errors in context (keyed by ID) when the response is first created. We then check the IDs against previously counted errors as normal.
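The double-counting guard described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's implementation: `GraphQLError`, `Context`, `counted_error_ids`, and `count_errors` are hypothetical names, and a plain `u64` stands in for the real error ID type, which is not shown here.

```rust
use std::collections::HashSet;

/// Hypothetical error carrying an internal, non-serialized ID.
struct GraphQLError {
    id: u64,
}

/// Hypothetical per-request context that remembers which error IDs
/// earlier layers have already counted.
#[derive(Default)]
struct Context {
    counted_error_ids: HashSet<u64>,
}

/// Counts only the errors whose IDs are not yet in the context and records
/// them, so the next layer skips them. Returns the number newly counted.
fn count_errors(errors: &[GraphQLError], ctx: &mut Context) -> usize {
    errors
        .iter()
        // `HashSet::insert` returns `true` only for IDs not seen before.
        .filter(|e| ctx.counted_error_ids.insert(e.id))
        .count()
}
```

If one layer counts errors 1 and 2, and a later layer sees errors 1, 2, and 3 on the same request, the second call counts only error 3.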
Part 3 of a split from #7357. Part 1 can be found here: #7699 and part 2 here: #7712.