fix: documenting and updating span / metric error cases#2681
Conversation
|
Preview deployment for your docs. Learn more about Mintlify Previews.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR distinguishes client-side disconnections from server-side errors in OpenTelemetry spans and metrics. Client disconnections (context.Canceled) no longer mark spans as ERROR or increment error metrics, while server-side failures continue to do so. Implementation includes coordinated changes across error handling, tracing, and metrics code paths. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Router image scan passed✅ No security vulnerabilities found in image: |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
router/core/errors.go (1)
126-133: Consider documenting the distinction fromcontext.DeadlineExceeded.The code specifically handles
context.Canceled(client disconnect / graceful shutdown) but notcontext.DeadlineExceeded(timeout). This seems intentional since timeouts are typically server-side configuration issues that should be tracked as errors.Consider adding a brief comment clarifying this distinction, especially since the
getErrorTypefunction (Line 64-76) already distinguishes betweenerrorTypeContextCanceledanderrorTypeContextTimeout.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/errors.go` around lines 126 - 133, Add a brief clarifying comment above the context.Canceled handling to explicitly state why context.Canceled (client disconnect / graceful shutdown) is treated differently from context.DeadlineExceeded (timeout) and therefore not marked as a server-side error; reference the existing getErrorType function (which distinguishes errorTypeContextCanceled and errorTypeContextTimeout) to explain that timeouts are treated as server-side errors/metrics while cancellations are not. Ensure the comment mentions requestContext.SetError(err) and requestContext.clientDisconnected = true remain for logging/response handling but that spans/metrics are not incremented for cancellations.router/core/operation_metrics_test.go (1)
97-105: Remove unused fieldrequestErrorCalledfromspyRouterMetrics.The
requestErrorCalledfield on line 104 is declared but never read or written. The actual tracking ofMeasureRequestErrorcalls happens viaspyMetricStore.requestErrorCalled(line 135), which is correctly used in the test assertions.🧹 Proposed fix
type spyRouterMetrics struct { store metric.Store schemaUsageCalled bool schemaUsageHasError bool promUsageCalled bool promUsageHasError bool - requestErrorCalled bool }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/operation_metrics_test.go` around lines 97 - 105, The struct spyRouterMetrics contains an unused field requestErrorCalled; remove this unused field from the spyRouterMetrics definition and any related tests so tracking uses the existing spyMetricStore.requestErrorCalled instead; specifically update the spyRouterMetrics type (remove the requestErrorCalled field) and ensure MeasureRequestError-related checks rely solely on spyMetricStore.requestErrorCalled rather than the deleted field.router-tests/telemetry/span_error_status_test.go (1)
57-57: Consider usingrequire.Eventuallyinstead oftime.Sleepfor robustness.Hard-coded sleeps (500ms) can be flaky in CI environments with variable timing. The coding guidelines recommend using
require.Eventuallyto poll for expected items. This would make the tests more reliable while potentially running faster.As per coding guidelines: "For periodic exporters, wait for ALL expected items using require.Eventually, not just one sentinel value, to avoid race conditions with export cycles."
Also applies to: 102-102, 145-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router-tests/telemetry/span_error_status_test.go` at line 57, Replace the hard-coded time.Sleep(500 * time.Millisecond) calls in the tests with require.Eventually polling to wait for the expected telemetry items; import testify/require if not already present and use require.Eventually(t, func() bool { /* check that all expected spans/errors have been exported (same check used after sleep) */ return condition }, timeout, interval) so the tests in span_error_status_test.go (the three occurrences flagged) wait deterministically for ALL expected items rather than sleeping a fixed 500ms.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router-tests/telemetry/span_error_status_test.go`:
- Around line 57-74: The test currently skips assertions if no spans are
exported because rootSpan(exporter.GetSpans().Snapshots()) can be nil; add a
fast fail assertion like require.NotNil(t, rootSpan, "expected root span to be
exported for client disconnections") immediately after computing rootSpan to
ensure the test fails when no spans are present, then keep the existing checks
(Status code and exception event). Reference the rootSpan(...) call and
exporter.GetSpans().Snapshots() to locate the change.
---
Nitpick comments:
In `@router-tests/telemetry/span_error_status_test.go`:
- Line 57: Replace the hard-coded time.Sleep(500 * time.Millisecond) calls in
the tests with require.Eventually polling to wait for the expected telemetry
items; import testify/require if not already present and use
require.Eventually(t, func() bool { /* check that all expected spans/errors have
been exported (same check used after sleep) */ return condition }, timeout,
interval) so the tests in span_error_status_test.go (the three occurrences
flagged) wait deterministically for ALL expected items rather than sleeping a
fixed 500ms.
In `@router/core/errors.go`:
- Around line 126-133: Add a brief clarifying comment above the context.Canceled
handling to explicitly state why context.Canceled (client disconnect / graceful
shutdown) is treated differently from context.DeadlineExceeded (timeout) and
therefore not marked as a server-side error; reference the existing getErrorType
function (which distinguishes errorTypeContextCanceled and
errorTypeContextTimeout) to explain that timeouts are treated as server-side
errors/metrics while cancellations are not. Ensure the comment mentions
requestContext.SetError(err) and requestContext.clientDisconnected = true remain
for logging/response handling but that spans/metrics are not incremented for
cancellations.
In `@router/core/operation_metrics_test.go`:
- Around line 97-105: The struct spyRouterMetrics contains an unused field
requestErrorCalled; remove this unused field from the spyRouterMetrics
definition and any related tests so tracking uses the existing
spyMetricStore.requestErrorCalled instead; specifically update the
spyRouterMetrics type (remove the requestErrorCalled field) and ensure
MeasureRequestError-related checks rely solely on
spyMetricStore.requestErrorCalled rather than the deleted field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 97412857-7a68-446e-851e-78af4ab59a8b
📒 Files selected for processing (12)
docs-website/docs.jsondocs-website/router/metrics-and-monitoring.mdxdocs-website/router/open-telemetry/span-error-status.mdxrouter-tests/telemetry/span_error_status_test.gorouter/core/context.gorouter/core/errors.gorouter/core/graphql_prehandler.gorouter/core/operation_metrics.gorouter/core/operation_metrics_test.gorouter/core/request_context_fields.gorouter/pkg/grpcconnector/grpccommon/grpc_plugin_client.gorouter/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2681 +/- ##
==========================================
- Coverage 64.27% 63.00% -1.27%
==========================================
Files 305 245 -60
Lines 43272 26300 -16972
Branches 4630 0 -4630
==========================================
- Hits 27811 16570 -11241
+ Misses 15439 8388 -7051
- Partials 22 1342 +1320
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go (1)
50-59: Consider extracting the exception event check to a helper function.The exception event verification loop is duplicated in two test cases. A small helper would improve readability and maintainability.
♻️ Proposed helper extraction
+func hasExceptionEvent(t *testing.T, span tracetest.SpanStub) bool { + t.Helper() + for _, event := range span.Events { + if event.Name == "exception" { + return true + } + } + return false +}Then replace the inline loops:
- hasException := false - for _, event := range span.Events { - if event.Name == "exception" { - hasException = true - break - } - } - require.True(t, hasException, "span should have an exception event") + require.True(t, hasExceptionEvent(t, span), "span should have an exception event")Also applies to: 91-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go` around lines 50 - 59, Extract the duplicated loop that checks for an "exception" event on span.Events into a small test helper function, e.g. func assertSpanHasExceptionEvent(t *testing.T, span trace.SpanSnapshot) { require.NotEmpty(t, span.Events); for _, e := range span.Events { if e.Name == "exception" { return } }; require.True(t, false, "span should have an exception event") }, then replace the inline loops in grpc_plugin_client_test.go (both occurrences around the span.Events checks) with calls to assertSpanHasExceptionEvent(t, span); ensure you import or reference the same trace/S panSnapshot type and require is still used for assertions.router/core/graphql_prehandler.go (1)
1222-1239: Minor: Consider usingsync.WaitGroup.Go()for Go 1.25+.Based on learnings, the codebase prefers
sync.WaitGroup.Go(func())over manualwg.Add(1)followed bygo func() { defer wg.Done(); ... }()patterns for Go 1.25+.♻️ Proposed refactor using WaitGroup.Go
wg := &sync.WaitGroup{} - wg.Add(1) - go func() { - defer wg.Done() + wg.Go(func() { if err := h.metrics.MetricStore().Flush(ctx); err != nil { requestLogger.Error("Failed to flush OTEL metrics", zap.Error(err)) } - }() + }) - wg.Add(1) - go func() { - defer wg.Done() + wg.Go(func() { if err := h.tracerProvider.ForceFlush(ctx); err != nil { requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err)) } - }() + })Based on learnings: "In Go code (Go 1.25+), prefer using sync.WaitGroup.Go(func()) to run a function in a new goroutine."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/graphql_prehandler.go` around lines 1222 - 1239, Replace the manual wg.Add(1); go func() { defer wg.Done(); ... }() pattern with the Go 1.25+ WaitGroup.Go method for the two flush goroutines: use wg.Go(func() { if err := h.metrics.MetricStore().Flush(ctx); err != nil { requestLogger.Error("Failed to flush OTEL metrics", zap.Error(err)) } }) and similarly use wg.Go(func() { if err := h.tracerProvider.ForceFlush(ctx); err != nil { requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err)) } }); keep the same wg.Wait() semantics and existing requestLogger/zap error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@router/core/graphql_prehandler.go`:
- Around line 1222-1239: Replace the manual wg.Add(1); go func() { defer
wg.Done(); ... }() pattern with the Go 1.25+ WaitGroup.Go method for the two
flush goroutines: use wg.Go(func() { if err :=
h.metrics.MetricStore().Flush(ctx); err != nil { requestLogger.Error("Failed to
flush OTEL metrics", zap.Error(err)) } }) and similarly use wg.Go(func() { if
err := h.tracerProvider.ForceFlush(ctx); err != nil {
requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err)) } }); keep
the same wg.Wait() semantics and existing requestLogger/zap error handling.
In `@router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go`:
- Around line 50-59: Extract the duplicated loop that checks for an "exception"
event on span.Events into a small test helper function, e.g. func
assertSpanHasExceptionEvent(t *testing.T, span trace.SpanSnapshot) {
require.NotEmpty(t, span.Events); for _, e := range span.Events { if e.Name ==
"exception" { return } }; require.True(t, false, "span should have an exception
event") }, then replace the inline loops in grpc_plugin_client_test.go (both
occurrences around the span.Events checks) with calls to
assertSpanHasExceptionEvent(t, span); ensure you import or reference the same
trace/S panSnapshot type and require is still used for assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bd08fa01-1dc3-4c87-ae8e-c3df3a1d3ab4
📒 Files selected for processing (7)
router/core/errors.gorouter/core/graphql_prehandler.gorouter/core/operation_metrics.gorouter/core/operation_metrics_test.gorouter/core/request_context_fields.gorouter/pkg/grpcconnector/grpccommon/grpc_plugin_client.gorouter/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- router/core/operation_metrics.go
- router/core/request_context_fields.go
- router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go
The goal of this PR is to identify if there any scenarios where we don't mark spans as errors, as well as update documentation.
This PR introduces
@StarpTech should we allow Client Disconnections not being recorded as an error for spans be based on a flag? I could imagine users where they want it to be treated as errors (for spans).
Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes
Tests
Checklist
Open Source AI Manifesto
This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.