Skip to content

fix: documenting and updating span / metric error cases#2681

Merged
SkArchon merged 10 commits intomainfrom
milinda/eng-8286-evaluate-span-metric-error-behaviour
Mar 24, 2026
Merged

fix: documenting and updating span / metric error cases#2681
SkArchon merged 10 commits intomainfrom
milinda/eng-8286-evaluate-span-metric-error-behaviour

Conversation

@SkArchon
Copy link
Copy Markdown
Contributor

@SkArchon SkArchon commented Mar 23, 2026

The goal of this PR is to identify if there any scenarios where we don't mark spans as errors, as well as update documentation.

This PR introduces

  • Client Disconnections: error is recorded but span is not marked as error
  • GRPC service connectivity issues: the span is set to error
  • Updated documentation

@StarpTech should we allow Client Disconnections not being recorded as an error for spans be based on a flag? I could imagine users where they want it to be treated as errors (for spans).

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added documentation clarifying how client disconnections vs server-side failures affect span error status and error metrics reporting.
  • Bug Fixes

    • Client disconnections are now excluded from error spans, error metrics, and error-level logging; only server-side failures increment error counts.
  • Tests

    • Added test coverage for client disconnection telemetry, error handling behavior, and schema usage metrics export.

Checklist

Open Source AI Manifesto

This project follows the principles of the Open Source AI Manifesto. Please ensure your contribution aligns with its principles.

@mintlify
Copy link
Copy Markdown
Contributor

mintlify Bot commented Mar 23, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
wundergraphinc 🟢 Ready View Preview Mar 23, 2026, 1:34 PM

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR distinguishes client-side disconnections from server-side errors in OpenTelemetry spans and metrics. Client disconnections (context.Canceled) no longer mark spans as ERROR or increment error metrics, while server-side failures continue to do so. Implementation includes coordinated changes across error handling, tracing, and metrics code paths.

Changes

Cohort / File(s) Summary
Documentation Navigation & Reference
docs-website/docs.json, docs-website/router/metrics-and-monitoring.mdx, docs-website/router/metrics-and-monitoring/span-error-status.mdx
Added new documentation page entry in navigation; clarified existing docs that wg.request.error represents server-side errors only; created comprehensive new reference page defining client vs. server error taxonomy and span/metric status rules.
Core Error Handling
router/core/errors.go, router/core/graphql_prehandler.go, router/core/request_context_fields.go
Modified error-to-tracing/logging behavior to check errors.Is(err, context.Canceled) before attaching errors to spans or assigning error log levels; client disconnects now bypass span error attachment and error-level logging.
Metrics & Operation Lifecycle
router/core/operation_metrics.go, router/core/operation_metrics_test.go
Changed request error metric emission and schema-usage export to exclude context.Canceled errors via hasError := error != nil && !errors.Is(error, context.Canceled); added unit test covering three request-outcome scenarios with spy implementations.
gRPC Plugin Client Tracing
router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go, router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go
Added span status updates (codes.Error) on failure paths; introduced shared pluginNotActiveErr variable; added test coverage for three Invoke scenarios (plugin timeout, client closed, healthy operation) verifying span emission and error handling.
Integration Test Coverage
router-tests/telemetry/span_error_status_test.go
New test module with TestClientDisconnectionBehavior validating telemetry outcomes (tracing, metrics, logging) for client disconnection vs. downstream failure scenarios using in-memory exporters.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: documenting error-handling behavior and updating how spans/metrics treat error cases, particularly for client disconnections.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 23, 2026

Router image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-b902f6a88601123a946ebe7eb8355653ab303460

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
router/core/errors.go (1)

126-133: Consider documenting the distinction from context.DeadlineExceeded.

The code specifically handles context.Canceled (client disconnect / graceful shutdown) but not context.DeadlineExceeded (timeout). This seems intentional since timeouts are typically server-side configuration issues that should be tracked as errors.

Consider adding a brief comment clarifying this distinction, especially since the getErrorType function (Line 64-76) already distinguishes between errorTypeContextCanceled and errorTypeContextTimeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/errors.go` around lines 126 - 133, Add a brief clarifying comment
above the context.Canceled handling to explicitly state why context.Canceled
(client disconnect / graceful shutdown) is treated differently from
context.DeadlineExceeded (timeout) and therefore not marked as a server-side
error; reference the existing getErrorType function (which distinguishes
errorTypeContextCanceled and errorTypeContextTimeout) to explain that timeouts
are treated as server-side errors/metrics while cancellations are not. Ensure
the comment mentions requestContext.SetError(err) and
requestContext.clientDisconnected = true remain for logging/response handling
but that spans/metrics are not incremented for cancellations.
router/core/operation_metrics_test.go (1)

97-105: Remove unused field requestErrorCalled from spyRouterMetrics.

The requestErrorCalled field on line 104 is declared but never read or written. The actual tracking of MeasureRequestError calls happens via spyMetricStore.requestErrorCalled (line 135), which is correctly used in the test assertions.

🧹 Proposed fix
 type spyRouterMetrics struct {
 	store metric.Store

 	schemaUsageCalled   bool
 	schemaUsageHasError bool
 	promUsageCalled     bool
 	promUsageHasError   bool
-	requestErrorCalled  bool
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/operation_metrics_test.go` around lines 97 - 105, The struct
spyRouterMetrics contains an unused field requestErrorCalled; remove this unused
field from the spyRouterMetrics definition and any related tests so tracking
uses the existing spyMetricStore.requestErrorCalled instead; specifically update
the spyRouterMetrics type (remove the requestErrorCalled field) and ensure
MeasureRequestError-related checks rely solely on
spyMetricStore.requestErrorCalled rather than the deleted field.
router-tests/telemetry/span_error_status_test.go (1)

57-57: Consider using require.Eventually instead of time.Sleep for robustness.

Hard-coded sleeps (500ms) can be flaky in CI environments with variable timing. The coding guidelines recommend using require.Eventually to poll for expected items. This would make the tests more reliable while potentially running faster.

As per coding guidelines: "For periodic exporters, wait for ALL expected items using require.Eventually, not just one sentinel value, to avoid race conditions with export cycles."

Also applies to: 102-102, 145-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router-tests/telemetry/span_error_status_test.go` at line 57, Replace the
hard-coded time.Sleep(500 * time.Millisecond) calls in the tests with
require.Eventually polling to wait for the expected telemetry items; import
testify/require if not already present and use require.Eventually(t, func() bool
{ /* check that all expected spans/errors have been exported (same check used
after sleep) */ return condition }, timeout, interval) so the tests in
span_error_status_test.go (the three occurrences flagged) wait deterministically
for ALL expected items rather than sleeping a fixed 500ms.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router-tests/telemetry/span_error_status_test.go`:
- Around line 57-74: The test currently skips assertions if no spans are
exported because rootSpan(exporter.GetSpans().Snapshots()) can be nil; add a
fast fail assertion like require.NotNil(t, rootSpan, "expected root span to be
exported for client disconnections") immediately after computing rootSpan to
ensure the test fails when no spans are present, then keep the existing checks
(Status code and exception event). Reference the rootSpan(...) call and
exporter.GetSpans().Snapshots() to locate the change.

---

Nitpick comments:
In `@router-tests/telemetry/span_error_status_test.go`:
- Line 57: Replace the hard-coded time.Sleep(500 * time.Millisecond) calls in
the tests with require.Eventually polling to wait for the expected telemetry
items; import testify/require if not already present and use
require.Eventually(t, func() bool { /* check that all expected spans/errors have
been exported (same check used after sleep) */ return condition }, timeout,
interval) so the tests in span_error_status_test.go (the three occurrences
flagged) wait deterministically for ALL expected items rather than sleeping a
fixed 500ms.

In `@router/core/errors.go`:
- Around line 126-133: Add a brief clarifying comment above the context.Canceled
handling to explicitly state why context.Canceled (client disconnect / graceful
shutdown) is treated differently from context.DeadlineExceeded (timeout) and
therefore not marked as a server-side error; reference the existing getErrorType
function (which distinguishes errorTypeContextCanceled and
errorTypeContextTimeout) to explain that timeouts are treated as server-side
errors/metrics while cancellations are not. Ensure the comment mentions
requestContext.SetError(err) and requestContext.clientDisconnected = true remain
for logging/response handling but that spans/metrics are not incremented for
cancellations.

In `@router/core/operation_metrics_test.go`:
- Around line 97-105: The struct spyRouterMetrics contains an unused field
requestErrorCalled; remove this unused field from the spyRouterMetrics
definition and any related tests so tracking uses the existing
spyMetricStore.requestErrorCalled instead; specifically update the
spyRouterMetrics type (remove the requestErrorCalled field) and ensure
MeasureRequestError-related checks rely solely on
spyMetricStore.requestErrorCalled rather than the deleted field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97412857-7a68-446e-851e-78af4ab59a8b

📥 Commits

Reviewing files that changed from the base of the PR and between f0aa414 and 70bdea1.

📒 Files selected for processing (12)
  • docs-website/docs.json
  • docs-website/router/metrics-and-monitoring.mdx
  • docs-website/router/open-telemetry/span-error-status.mdx
  • router-tests/telemetry/span_error_status_test.go
  • router/core/context.go
  • router/core/errors.go
  • router/core/graphql_prehandler.go
  • router/core/operation_metrics.go
  • router/core/operation_metrics_test.go
  • router/core/request_context_fields.go
  • router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go
  • router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go

Comment thread router-tests/telemetry/span_error_status_test.go
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.00%. Comparing base (733c14b) to head (db129c2).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2681      +/-   ##
==========================================
- Coverage   64.27%   63.00%   -1.27%     
==========================================
  Files         305      245      -60     
  Lines       43272    26300   -16972     
  Branches     4630        0    -4630     
==========================================
- Hits        27811    16570   -11241     
+ Misses      15439     8388    -7051     
- Partials       22     1342    +1320     
Files with missing lines Coverage Δ
router/core/errors.go 76.74% <100.00%> (ø)
router/core/graphql_prehandler.go 84.66% <100.00%> (ø)
router/core/operation_metrics.go 100.00% <100.00%> (ø)
router/core/request_context_fields.go 72.02% <100.00%> (ø)
...pkg/grpcconnector/grpccommon/grpc_plugin_client.go 78.82% <100.00%> (ø)

... and 545 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread router/core/context.go Outdated
Comment thread router/core/graphql_prehandler.go Outdated
Comment thread router/core/errors.go Outdated
Comment thread router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go (1)

50-59: Consider extracting the exception event check to a helper function.

The exception event verification loop is duplicated in two test cases. A small helper would improve readability and maintainability.

♻️ Proposed helper extraction
+func hasExceptionEvent(t *testing.T, span tracetest.SpanStub) bool {
+	t.Helper()
+	for _, event := range span.Events {
+		if event.Name == "exception" {
+			return true
+		}
+	}
+	return false
+}

Then replace the inline loops:

-		hasException := false
-		for _, event := range span.Events {
-			if event.Name == "exception" {
-				hasException = true
-				break
-			}
-		}
-		require.True(t, hasException, "span should have an exception event")
+		require.True(t, hasExceptionEvent(t, span), "span should have an exception event")

Also applies to: 91-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go` around lines
50 - 59, Extract the duplicated loop that checks for an "exception" event on
span.Events into a small test helper function, e.g. func
assertSpanHasExceptionEvent(t *testing.T, span trace.SpanSnapshot) {
require.NotEmpty(t, span.Events); for _, e := range span.Events { if e.Name ==
"exception" { return } }; require.True(t, false, "span should have an exception
event") }, then replace the inline loops in grpc_plugin_client_test.go (both
occurrences around the span.Events checks) with calls to
assertSpanHasExceptionEvent(t, span); ensure you import or reference the same
trace/S panSnapshot type and require is still used for assertions.
router/core/graphql_prehandler.go (1)

1222-1239: Minor: Consider using sync.WaitGroup.Go() for Go 1.25+.

Based on learnings, the codebase prefers sync.WaitGroup.Go(func()) over manual wg.Add(1) followed by go func() { defer wg.Done(); ... }() patterns for Go 1.25+.

♻️ Proposed refactor using WaitGroup.Go
 	wg := &sync.WaitGroup{}
-	wg.Add(1)
-	go func() {
-		defer wg.Done()
+	wg.Go(func() {
 		if err := h.metrics.MetricStore().Flush(ctx); err != nil {
 			requestLogger.Error("Failed to flush OTEL metrics", zap.Error(err))
 		}
-	}()
+	})

-	wg.Add(1)
-	go func() {
-		defer wg.Done()
+	wg.Go(func() {
 		if err := h.tracerProvider.ForceFlush(ctx); err != nil {
 			requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err))
 		}
-	}()
+	})

Based on learnings: "In Go code (Go 1.25+), prefer using sync.WaitGroup.Go(func()) to run a function in a new goroutine."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/graphql_prehandler.go` around lines 1222 - 1239, Replace the
manual wg.Add(1); go func() { defer wg.Done(); ... }() pattern with the Go 1.25+
WaitGroup.Go method for the two flush goroutines: use wg.Go(func() { if err :=
h.metrics.MetricStore().Flush(ctx); err != nil { requestLogger.Error("Failed to
flush OTEL metrics", zap.Error(err)) } }) and similarly use wg.Go(func() { if
err := h.tracerProvider.ForceFlush(ctx); err != nil {
requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err)) } }); keep
the same wg.Wait() semantics and existing requestLogger/zap error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@router/core/graphql_prehandler.go`:
- Around line 1222-1239: Replace the manual wg.Add(1); go func() { defer
wg.Done(); ... }() pattern with the Go 1.25+ WaitGroup.Go method for the two
flush goroutines: use wg.Go(func() { if err :=
h.metrics.MetricStore().Flush(ctx); err != nil { requestLogger.Error("Failed to
flush OTEL metrics", zap.Error(err)) } }) and similarly use wg.Go(func() { if
err := h.tracerProvider.ForceFlush(ctx); err != nil {
requestLogger.Error("Failed to flush OTEL tracer", zap.Error(err)) } }); keep
the same wg.Wait() semantics and existing requestLogger/zap error handling.

In `@router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go`:
- Around line 50-59: Extract the duplicated loop that checks for an "exception"
event on span.Events into a small test helper function, e.g. func
assertSpanHasExceptionEvent(t *testing.T, span trace.SpanSnapshot) {
require.NotEmpty(t, span.Events); for _, e := range span.Events { if e.Name ==
"exception" { return } }; require.True(t, false, "span should have an exception
event") }, then replace the inline loops in grpc_plugin_client_test.go (both
occurrences around the span.Events checks) with calls to
assertSpanHasExceptionEvent(t, span); ensure you import or reference the same
trace/S panSnapshot type and require is still used for assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bd08fa01-1dc3-4c87-ae8e-c3df3a1d3ab4

📥 Commits

Reviewing files that changed from the base of the PR and between a58990a and c70571d.

📒 Files selected for processing (7)
  • router/core/errors.go
  • router/core/graphql_prehandler.go
  • router/core/operation_metrics.go
  • router/core/operation_metrics_test.go
  • router/core/request_context_fields.go
  • router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go
  • router/pkg/grpcconnector/grpccommon/grpc_plugin_client_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • router/core/operation_metrics.go
  • router/core/request_context_fields.go
  • router/pkg/grpcconnector/grpccommon/grpc_plugin_client.go

@SkArchon SkArchon requested a review from Noroth March 24, 2026 10:59
@SkArchon SkArchon merged commit d413a26 into main Mar 24, 2026
54 of 55 checks passed
@SkArchon SkArchon deleted the milinda/eng-8286-evaluate-span-metric-error-behaviour branch March 24, 2026 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants