Skip to content

feat: expression support for retry condition#2167

Merged
SkArchon merged 21 commits intomainfrom
dustin/eng-7974-allow-to-configure-retry-condition
Aug 28, 2025
Merged

feat: expression support for retry condition#2167
SkArchon merged 21 commits intomainfrom
dustin/eng-7974-allow-to-configure-retry-condition

Conversation

@StarpTech
Copy link
Copy Markdown
Contributor

@StarpTech StarpTech commented Aug 26, 2025

Summary by CodeRabbit

  • New Features
    • Expression-driven, configurable retry policies for subgraph requests with sensible defaults; retries skip mutations and honor Retry-After. Transport clients now wire logging, circuit-breaker, and conditional tracing; OnRetry callbacks receive computed sleep durations.
  • Configuration
    • Retry expression exposed in config (env/YAML) with a safe default and new algorithm/max-duration/interval env bindings.
  • Tests
    • Extensive unit and integration coverage for retry expressions, Retry-After semantics, callbacks, and edge cases.
  • Chores
    • Dependency upgraded to expr-lang v1.17.6.
  • Breaking Changes
    • Public retry API expanded; call sites updated to new signature.

Example configuration

traffic_shaping:
  all:
    retry:
      expression: IsRetryableStatusCode() || IsConnectionError() || IsTimeout()

Retry Strategy Decision Table

Scenario Expression Description
Conservative IsRetryableStatusCode() Only retry standard HTTP errors (500,502,503,504)
Balanced IsRetryableStatusCode() || IsTimeout() || IsConnectionError() Most transient errors - recommended default
Aggressive Is5xxError() || IsTimeout() || IsConnectionError() Maximum availability, higher retry volume
Timeout Only IsTimeout() Only network/HTTP timeouts
HTTP Read Timeout IsHttpReadTimeout() Only HTTP read timeouts
Connection Only IsConnectionError() Only network connectivity issues
Custom Status statusCode == 503 Specific HTTP status codes
Error Text error contains "timeout" Match error message text

Available Methods: IsTimeout(), IsConnectionError(), IsConnectionRefused(), IsConnectionReset(), Is5xxError(), IsRetryableStatusCode(), IsHttpReadTimeout()
Available Fields: statusCode (int), error (string)

Default: IsRetryableStatusCode() || IsConnectionError() || IsTimeout()

Note: Mutations are never retried regardless of this expression. EOF errors are always retried.

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 26, 2025

Walkthrough

Adds expression-driven subgraph retry logic: new RetryContext and RetryExpressionManager, expression-to-ShouldRetry builder, Retry-After-aware transport retries and OnRetry semantics, config/schema retry expression fields, expanded WithSubgraphRetryOptions API, an expr dependency bump, and related tests.

Changes

Cohort / File(s) Summary
Dependency updates
router/go.mod, router-tests/go.mod
Bump github.com/expr-lang/expr v1.17.3 → v1.17.6.
Core wiring & API changes
router/core/graph_server.go, router/core/router.go, router/core/supervisor_instance.go, router/core/context.go
Introduce processing of retry options via ProcessRetryOptions; wire TransportOptions fields Logger, EnableTraceClient, and CircuitBreaker; compute and pass EnableTraceClient; remove inline retry guard and unused import; expand WithSubgraphRetryOptions signature and update call sites; remove isMutationRequest.
Retry builder
router/core/retry_builder.go, router/core/retry_builder_test.go
Add ProcessRetryOptions and buildRetryFunction to validate algorithm, compile expression into a ShouldRetry function (with default expression), treat EOF as default retryable, disable retries for mutations, and add comprehensive tests.
Expression subsystem
router/internal/expr/retry_context.go, router/internal/expr/retry_expression.go, router/internal/expr/retry_expression_test.go
Add RetryContext with helper predicates (timeouts, connection errors, retryable status codes), RetryExpressionManager to compile/evaluate boolean expressions against RetryContext, and extensive unit tests.
Retry transport & tests
router/internal/retrytransport/retry_transport.go, router/internal/retrytransport/retry_transport_test.go
Add OnRetryFunc, Algorithm, and Expression fields to RetryOptions; parse/honor Retry-After header (clamped to MaxDuration); drain response bodies between retries; remove legacy IsRetryableError; update tests for new OnRetry signature and Retry-After semantics.
Config & schema
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json
Add Expression field to BackoffJitterRetry with env binding and default expression; expose RETRY_ALGORITHM, RETRY_MAX_ATTEMPTS, RETRY_MAX_DURATION, RETRY_INTERVAL env vars; update JSON testdata and schema.
Integration & unit test updates
router-tests/*, router-tests/retry_test.go
Update multiple test call sites to the expanded WithSubgraphRetryOptions arity (router-tests/error_handling_test.go, router-tests/panic_test.go, router-tests/structured_logging_test.go, router-tests/telemetry/telemetry_test.go); add router-tests/retry_test.go integration tests for retry counters, Retry-After timing, and flaky retry scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbit in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbit in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbit gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbit read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbit help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbit ignore or @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbit summary or @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbit or @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Aug 26, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-32283b9fc758e09cde1c77a35c114f590b91af5e-nonroot

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (22)
router/pkg/config/testdata/config_defaults.json (1)

149-151: Default retry expression added — ensure empty/invalid expressions gracefully fall back at runtime.

The default expression matches the PR intent. Please confirm the loader/evaluator:

  • Treats empty or whitespace-only expressions as “use default”.
  • On compile/eval error, logs a clear message including the offending config path (All vs Subgraph) and falls back to the default instead of disabling retries.

You could add a tiny regression around defaults to lock this in. Example (Go test outline): load this defaults file, assert that the resolved expression equals the runtime DefaultRetryExpression and that the builder returns a ShouldRetry func which retries on 500 and timeout errors.

router/pkg/config/testdata/config_full.json (1)

353-355: Clarify semantics of empty expression for subgraph-level override.

Setting "Expression": "" for products implies fallback-to-default at runtime. Two suggestions:

  • Document in code comments/tests that empty string and missing property behave identically.
  • Optionally consider omitting the property (or allowing null) in the sample to make “inherit default” self-explanatory.

To verify behavior, assert in tests that:

  • Empty string => uses DefaultRetryExpression
  • Non-empty invalid expression => logs + falls back to DefaultRetryExpression
  • Valid custom expression (e.g., "statusCode == 503") is honored
router/pkg/config/config.schema.json (1)

3242-3247: Strengthen schema docs with examples and explicit field/method list.

Great addition. Concrete examples will help users and tooling. Consider enriching the property with examples and a crisper “available fields/methods” list to mirror the PR description.

Apply this diff inside the "expression" property:

-            "expression": {
-              "type": "string",
-              "description": "The expression used to determine if a request should be retried. The expression can reference status codes, error messages, and helper functions like IsRetryableStatusCode(), IsConnectionError(), IsHttpReadTimeout(), IsTimeout() (includes HTTP read timeouts). See https://expr-lang.org/ for expression syntax. Note: Mutations are never retried regardless of this expression. EOF errors are always retried at the transport layer regardless of this expression.",
-              "default": "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"
-            }
+            "expression": {
+              "type": "string",
+              "description": "Boolean expr-lang expression deciding whether to retry a subgraph request. Available methods: IsTimeout(), IsConnectionError(), IsConnectionRefused(), IsConnectionReset(), Is5xxError(), IsRetryableStatusCode(), IsHttpReadTimeout(). Available fields: status_code (int), error (string). See https://expr-lang.org/ for syntax. Notes: Mutations are never retried regardless of this expression. EOF errors are always retried at the transport layer regardless of this expression.",
+              "default": "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()",
+              "examples": [
+                "IsRetryableStatusCode()",
+                "IsRetryableStatusCode() || IsTimeout() || IsConnectionError()",
+                "Is5xxError() || IsTimeout() || IsConnectionError()",
+                "IsTimeout()",
+                "IsHttpReadTimeout()",
+                "IsConnectionError()",
+                "status_code == 503",
+                "error contains \"timeout\""
+              ]
+            }

Also ensure the runtime treats empty string as “inherit default”. If that’s the contract, consider adding a short note like “Empty string inherits the default” to the description.

router/internal/expr/retry_context.go (6)

21-31: Broaden HTTP read-timeout detection to cover common net/http phrasing.

Some Go versions and environments surface the header wait timeout as “Client.Timeout exceeded while awaiting headers”. We already lowercase the string, so consider matching that variant too, not only “timeout awaiting response headers”.

Apply this diff:

 func (ctx RetryContext) IsHttpReadTimeout() bool {
   // Only check for HTTP-specific timeout awaiting response headers
   if ctx.Error != "" {
     errLower := strings.ToLower(ctx.Error)
-    return strings.Contains(errLower, "timeout awaiting response headers")
+    return strings.Contains(errLower, "timeout awaiting response headers") ||
+      strings.Contains(errLower, "client.timeout exceeded while awaiting headers")
   }

   return false
 }

3-10: Import context to robustly detect deadline exceeded errors.

To be future-proof and explicit, add context to imports; we’ll use it right below in IsTimeout().

 import (
   "errors"
+  "context"
   "net"
   "net/http"
   "os"
   "strings"
   "syscall"
 )

33-58: Consider explicitly handling context.DeadlineExceeded in IsTimeout().

os.ErrDeadlineExceeded maps to context.DeadlineExceeded in modern Go, but being explicit improves readability and backward compatibility with older toolchains. No functional change for Go ≥1.20; safer on mixed environments.

 func (ctx RetryContext) IsTimeout() bool {
   // Check for HTTP-specific read timeouts
   if ctx.IsHttpReadTimeout() {
     return true
   }

   // Check for net package timeout errors using the standard Go method
   if ctx.originalError != nil {
     if netErr, ok := ctx.originalError.(net.Error); ok && netErr.Timeout() {
       return true
     }
     // Check for deadline exceeded errors
-    if errors.Is(ctx.originalError, os.ErrDeadlineExceeded) {
+    if errors.Is(ctx.originalError, os.ErrDeadlineExceeded) ||
+      errors.Is(ctx.originalError, context.DeadlineExceeded) {
       return true
     }
     // Also check for direct syscall timeout errors not wrapped in net.Error
     if errors.Is(ctx.originalError, syscall.ETIMEDOUT) {
       return true
     }
   }

   return false
 }

60-78: Expand connection error classification (DNS/route/unreachable).

Today we only cover refused/reset and a few strings. Consider catching common “network is unreachable” classes and DNS failures via errors.As and syscall checks. This reduces reliance on error message wording.

 func (ctx RetryContext) IsConnectionError() bool {
   // Use existing helpers for specific connection errors
   if ctx.IsConnectionRefused() || ctx.IsConnectionReset() {
     return true
   }

+  if ctx.originalError != nil {
+    // DNS resolution errors (e.g., NXDOMAIN, temporary failures)
+    var dnsErr *net.DNSError
+    if errors.As(ctx.originalError, &dnsErr) {
+      return true
+    }
+    // Common routing/unreachable classes
+    if errors.Is(ctx.originalError, syscall.EHOSTUNREACH) ||
+      errors.Is(ctx.originalError, syscall.ENETUNREACH) ||
+      errors.Is(ctx.originalError, syscall.ECONNABORTED) {
+      return true
+    }
+  }
+
   // Fall back to string matching for other connection errors not covered by specific helpers
   if ctx.Error != "" {
     errLower := strings.ToLower(ctx.Error)
     return strings.Contains(errLower, "no such host") ||
+      strings.Contains(errLower, "no route to host") ||
+      strings.Contains(errLower, "network is unreachable") ||
       strings.Contains(errLower, "handshake failure") ||
       strings.Contains(errLower, "handshake timeout")
   }

   return false
 }

21-31: Optional: Go initialism style alias (IsHTTPReadTimeout).

External expressions currently use IsHttpReadTimeout(). If you want to align with Go naming without breaking expressions, add a small alias that delegates to the existing method.

 func (ctx RetryContext) IsHttpReadTimeout() bool {
   // ...
 }
 
+// IsHTTPReadTimeout is an alias for IsHttpReadTimeout to satisfy Go initialism conventions.
+// It’s exposed so users may call either in expressions.
+func (ctx RetryContext) IsHTTPReadTimeout() bool {
+  return ctx.IsHttpReadTimeout()
+}

101-116: Cross-platform errno (Windows) note.

If Windows clients/routers are relevant, consider mapping WSAECONNREFUSED/WSAECONNRESET under build tags in a platform-specific file. Referencing those constants directly here would break non-Windows builds, so only do it with per-OS files.

router/pkg/config/config.go (1)

231-238: Default expression duplication risk between config and core.

The envDefault here hardcodes the same default as core.DefaultRetryExpression. If one changes and the other doesn’t, behavior and docs will drift.

Options:

  • Minimal: Add a unit test (in core or a small integration test) that asserts the two defaults match.
  • Better: Remove envDefault here and rely on BuildRetryFunction("") to inject the default; the config stays empty when not specified. This avoids duplication but would require updating config_defaults.json expectations.

Would you like me to draft the test or a refactor plan?

router/core/supervisor_instance.go (1)

194-203: Consider future per-subgraph overrides for retry expressions.

Transport and circuit breaker already support SubgraphMap overrides. If product direction allows, adding per-subgraph retry expression overrides would complete the symmetry with other traffic-shaping knobs.

router/core/retry_expression_builder_test.go (1)

93-114: Add coverage for HTTP read-timeout string matching.

Given IsHttpReadTimeout uses string inspection, add a test to ensure “timeout awaiting response headers” (and the common “Client.Timeout exceeded…” variant) triggers retries under the default expression.

Apply this diff to append a new test case:

@@
 	t.Run("empty expression uses default", func(t *testing.T) {
 		fn, err := BuildRetryFunction("", logger)
 		assert.NoError(t, err)
 		assert.NotNil(t, fn)
@@
 		assert.False(t, fn(err, req, nil))
 	})
 
+	t.Run("http read timeout strings are treated as retryable by default", func(t *testing.T) {
+		fn, err := BuildRetryFunction(DefaultRetryExpression, logger)
+		assert.NoError(t, err)
+		assert.NotNil(t, fn)
+
+		req, _ := http.NewRequest("GET", "http://example.com", nil)
+
+		// net/http canonical phrasing
+		err = errors.New("net/http: timeout awaiting response headers")
+		assert.True(t, fn(err, req, nil))
+
+		// Common alternate phrasing
+		err = errors.New("Client.Timeout exceeded while awaiting headers")
+		assert.True(t, fn(err, req, nil))
+	})
router/internal/retrytransport/retry_transport.go (2)

52-54: Consider adding safeguards against excessive backoff duration.

The backoff mechanism could potentially wait for long durations as MaxDuration increases. Consider adding documentation about recommended values or implementing a safety check to prevent overly long sleep durations that could impact request latency.

Would you like me to suggest appropriate safeguards or documentation for backoff duration limits?


59-59: Consider refactoring the retry condition for better clarity.

The retry condition combines EOF retries with expression-based retries using an OR operation. This could be made clearer by extracting the logic into a separate method or adding a comment explaining the precedence.

-	for (isDefaultRetryableError(err) || rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
+	// Retry if: (1) it's an EOF error (always retryable) OR (2) the expression evaluates to true
+	// AND we haven't exceeded the max retry count
+	for (isDefaultRetryableError(err) || rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
router/internal/expr/retry_expression_test.go (8)

128-136: Future-proof the table test loop and tighten failure expectations.

  • Capture the range variable to prevent the classic t.Run + range pitfall if these ever become parallelized.
  • When a compile error is expected, also assert that the manager is nil.
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
+	for _, tt := range tests {
+		tt := tt // capture range variable for safety if parallelization is introduced
+		t.Run(tt.name, func(t *testing.T) {
 			manager, err := NewRetryExpressionManager(tt.expression)
 			if tt.expectErr {
-				assert.Error(t, err)
+				assert.Error(t, err)
+				assert.Nil(t, manager)
 				return
 			}
 			require.NoError(t, err)
 			require.NotNil(t, manager)

79-104: Normalize subtest names to match exported helper identifiers.

Purely cosmetic but keeps names consistent with the public API and improves grepability.

-			name:       "is5xxError helper",
+			name:       "Is5xxError helper",
...
-			name:       "is5xxError with non-5xx",
+			name:       "Is5xxError with non-5xx",
...
-			name:       "isConnectionError helper",
+			name:       "IsConnectionError helper",
...
-			name:       "isRetryableStatusCode helper",
+			name:       "IsRetryableStatusCode helper",

103-109: Exercise the full decision table for IsRetryableStatusCode (500, 502, 503, 504, 429) and a negative.

Right now only 429 is covered. Add explicit status cases to lock the contract.

 		{
-			name:       "isRetryableStatusCode helper",
+			name:       "IsRetryableStatusCode helper",
 			expression: "IsRetryableStatusCode()",
 			ctx:        RetryContext{StatusCode: 429},
 			expected:   true,
 		},
+		{
+			name:       "IsRetryableStatusCode 500",
+			expression: "IsRetryableStatusCode()",
+			ctx:        RetryContext{StatusCode: 500},
+			expected:   true,
+		},
+		{
+			name:       "IsRetryableStatusCode 502",
+			expression: "IsRetryableStatusCode()",
+			ctx:        RetryContext{StatusCode: 502},
+			expected:   true,
+		},
+		{
+			name:       "IsRetryableStatusCode 503",
+			expression: "IsRetryableStatusCode()",
+			ctx:        RetryContext{StatusCode: 503},
+			expected:   true,
+		},
+		{
+			name:       "IsRetryableStatusCode 504",
+			expression: "IsRetryableStatusCode()",
+			ctx:        RetryContext{StatusCode: 504},
+			expected:   true,
+		},
+		{
+			name:       "IsRetryableStatusCode negative (501)",
+			expression: "IsRetryableStatusCode()",
+			ctx:        RetryContext{StatusCode: 501},
+			expected:   false,
+		},

3-14: Add context.DeadlineExceeded coverage (imports).

You already cover os.ErrDeadlineExceeded; adding context.DeadlineExceeded ensures both common timeout sentinels are recognized.

 import (
+	"context"
 	"errors"
 	"fmt"
 	"net"
 	"net/http"
 	"os"
 	"syscall"
 	"testing"

351-400: Add a test for context.DeadlineExceeded in IsTimeout syscall section.

This is a frequent timeout surface (e.g., context with deadline on client). Ensuring positive detection keeps behavior consistent with os.ErrDeadlineExceeded.

 		tests := []struct {
 			name     string
 			err      error
 			expected bool
 		}{
 			{
 				name:     "ETIMEDOUT detected",
 				err:      syscall.ETIMEDOUT,
 				expected: true,
 			},
+			{
+				name:     "context.DeadlineExceeded detected",
+				err:      context.DeadlineExceeded,
+				expected: true,
+			},
 			{
 				name:     "wrapped ETIMEDOUT detected",
 				err:      fmt.Errorf("read error: %w", syscall.ETIMEDOUT),
 				expected: true,
 			},

16-149: Optionally test runtime evaluation errors from ShouldRetry.

You cover compile-time errors and empty expressions. Add a runtime failure (e.g., division by zero based on context) to validate error propagation from ShouldRetry.

Add this new test (outside the existing table) near other manager tests:

func TestRetryExpressionManager_RuntimeError(t *testing.T) {
	expression := "(1 / statusCode) > 0" // compiles, but will divide by zero at runtime when statusCode == 0
	manager, err := NewRetryExpressionManager(expression)
	require.NoError(t, err)
	require.NotNil(t, manager)

	ctx := RetryContext{StatusCode: 0}
	_, runErr := manager.ShouldRetry(ctx)
	assert.Error(t, runErr)
}

16-35: Follow-up: consider constraining expression size with expr.MaxNodes at compile time.

Small, but valuable hardening: set a reasonable node budget (and unit-test the rejection path) to prevent user-supplied expressions from becoming pathological.

  • Example: add expr.MaxNodes(N) to NewRetryExpressionManager compile options and a test asserting compilation fails for an oversized expression.

Would you like a follow-up PR to add expr.MaxNodes and a unit test that exercises the budget breach?


16-126: Optional: add a tiny property loop for Is5xxError.

A quick loop from 500..599 would catch accidental off-by-one regressions (e.g., 600 boundary).

Add alongside existing Is5xxError tests:

t.Run("Is5xxError range 500..599", func(t *testing.T) {
	manager, err := NewRetryExpressionManager("Is5xxError()")
	require.NoError(t, err)
	require.NotNil(t, manager)

	for code := 500; code <= 599; code++ {
		t.Run(fmt.Sprintf("code=%d", code), func(t *testing.T) {
			ctx := RetryContext{StatusCode: code}
			result, err := manager.ShouldRetry(ctx)
			assert.NoError(t, err)
			assert.True(t, result)
		})
	}
})
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9d6aabb and 17ca80d.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
  • router-tests/go.mod (1 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
  • router/core/router.go (1 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/go.mod (1 hunks)
  • router/internal/expr/retry_context.go (1 hunks)
  • router/internal/expr/retry_expression.go (1 hunks)
  • router/internal/expr/retry_expression_test.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
  • router/internal/retrytransport/retry_transport_test.go (12 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
PR: wundergraph/cosmo#2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/go.mod
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/core/supervisor_instance.go
🧬 Code graph analysis (9)
router/core/retry_expression_builder.go (3)
router/internal/retrytransport/retry_transport.go (1)
  • ShouldRetryFunc (13-13)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
router/core/retry_expression_builder_test.go (1)
router/core/retry_expression_builder.go (2)
  • BuildRetryFunction (15-50)
  • DefaultRetryExpression (12-12)
router/core/router.go (1)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (15-23)
router/internal/expr/retry_expression.go (1)
router/internal/expr/retry_context.go (1)
  • RetryContext (13-19)
router/internal/expr/retry_context.go (1)
router/pkg/pubsub/datasource/error.go (1)
  • Error (3-6)
router/core/graph_server.go (2)
router/core/retry_expression_builder.go (1)
  • BuildRetryFunction (15-50)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (15-23)
router/internal/expr/retry_expression_test.go (2)
router/internal/expr/retry_context.go (2)
  • RetryContext (13-19)
  • LoadRetryContext (138-152)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/core/supervisor_instance.go (1)
router/pkg/config/config.go (1)
  • BackoffJitterRetry (231-238)
router/internal/retrytransport/retry_transport_test.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • RetryHTTPTransport (27-31)
  • RetryOptions (15-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_test
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: lint
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (21)
router/go.mod (1)

67-67: Approve expr v1.17.6 upgrade; no breaking evaluator changes detected

  • Release notes for v1.17.6 (Aug 12, 2025) show only parser/lexer improvements (#811), struct-tag filtering and “in” behavior (#806), unaddressable‐slice fix (#803), and a VM memory‐leak patch (#813)—no changes to opcode/AST semantics, short-circuiting, type coercion, IsTimeout(), or string-contains behavior (github.com)
  • Verified all expr.Compile and expr.Run call sites handle errors appropriately via handleExpressionError or wrapped fmt.Errorf:
    • router/internal/expr/resolvers.go (lines 12–14, 23–25, 43–45)
    • router/internal/expr/retry_expression.go (lines 28–30, 45–47)
    • router/internal/expr/expr_manager.go (lines 41–43)
  • No further changes required; merging.
router/pkg/config/testdata/config_full.json (1)

314-316: Top-level “All” expression matches intended default — LGTM.

Keeps behavior explicit in the full sample and aligns with defaults.

router/internal/expr/retry_context.go (1)

86-100: Retryable status codes: looks good.

The 500/502/503/504/429 set matches common practice. No further action.

router/core/supervisor_instance.go (1)

196-203: Wiring expression: LGTM.

Plumbing BackoffJitterRetry.Expression into WithSubgraphRetryOptions is correct and preserves the “empty means default” semantics downstream.

router/internal/retrytransport/retry_transport.go (2)

15-23: LGTM!

The addition of the Expression field to RetryOptions enables expression-based retry configuration while maintaining backward compatibility. The existing ShouldRetry field continues to hold the compiled function for runtime evaluation.


120-130: LGTM!

Good implementation of the default retry behavior for EOF errors. The function correctly checks for the "unexpected eof" error which indicates connection issues that should always trigger a retry, regardless of the configured expression.

router/core/retry_expression_builder.go (2)

15-50: LGTM! Well-structured retry expression builder.

The implementation properly handles empty expressions by defaulting to DefaultRetryExpression, ensures mutations are never retried, and includes proper error handling for expression evaluation failures. The error logging is appropriately placed to help debug configuration issues without failing the request.


30-31: No missing isMutationRequest – function exists in the same package

  • The function isMutationRequest(ctx context.Context) bool is implemented in router/core/context.go at lines 638–644.
  • Since both files reside in the router/core package, no additional import is required.

No further action required.

router/core/graph_server.go (3)

1159-1164: LGTM! Proper error handling for retry expression compilation.

The code correctly builds the retry function from the expression and handles compilation errors appropriately by returning an error that will prevent the server from starting with an invalid configuration.


1165-1172: LGTM! Clean retry options setup.

The RetryOptions struct is properly initialized with all necessary fields, including the compiled ShouldRetry function and the original Expression for reference/debugging purposes.


1188-1201: Review TransportOptions utilization

Based on the struct definition (router/core/transport.go:354–367) and the graph server initialization (router/core/graph_server.go:1188–1201), all three new fields are indeed declared and populated:

  • CircuitBreaker: defined in the struct and set to s.circuitBreakerManager in graph_server.go
  • Logger: defined in the struct and set to s.logger in graph_server.go
  • EnableTraceClient: defined in the struct and set to enableTraceClient in graph_server.go

Additionally, we confirmed that opts.Logger is consumed by the transport layer (router/core/transport.go:384) and that subgraph-circuit-breaker options are exercised in existing circuit-breaker tests. However, I didn’t find an explicit use of:

  • opts.CircuitBreaker within the request execution pipeline
  • opts.EnableTraceClient to conditionally wrap or inject the trace client

Please review router/core/transport.go (around the struct literal at lines ~362–384) and ensure that both CircuitBreaker and EnableTraceClient are actually acted upon when building or executing transports, so that neither flag becomes a silent no-op.

router/internal/expr/retry_expression.go (2)

17-36: LGTM! Well-designed expression manager with proper null handling.

The constructor properly handles empty expressions by returning nil, allowing for graceful degradation to default behavior. The expression compilation is correctly configured with RetryContext as the environment and boolean return type enforcement.


38-56: LGTM! Robust evaluation with proper type checking.

The ShouldRetry method includes appropriate nil checks, comprehensive error handling, and type assertion to ensure the expression returns a boolean value. The fallback to false when no expression is configured is a safe default.

router/internal/retrytransport/retry_transport_test.go (5)

20-30: LGTM! Good test helper function.

The simpleShouldRetry function provides a clear and simple retry logic for testing purposes, making the tests more readable and maintainable.


40-84: LGTM! Comprehensive test for HTTP 5xx retries.

The test properly validates that the retry mechanism respects MaxRetryCount and tracks both retry count and total attempts. The assertions correctly verify that the system retries exactly maxRetries times and makes maxRetries + 1 total attempts.


130-188: LGTM! Excellent test for custom ShouldRetry logic.

This test effectively validates that the retry mechanism properly respects the ShouldRetry function's decision to stop retrying when it returns false. The scenario progression from retryable errors to a non-retryable error is well-designed.


218-258: LGTM! Important short-circuit test.

This test validates a critical optimization - ensuring that successful responses immediately return without triggering any retries. The assertion that OnRetry should not be called is a good verification of this behavior.


448-506: LGTM! Thorough test of OnRetry callback.

Excellent test coverage for the OnRetry callback, validating that it's called with correct parameters (0-based retry count, error, and response) for each retry attempt. The test properly captures and validates all callback invocations.

router/internal/expr/retry_expression_test.go (3)

16-554: Excellent coverage and realistic scenarios — this suite meaningfully de-risks the expression engine.

  • Broad table-driven checks, syscall-wrapped errors, mixed-case string matching, and integration-style ShouldRetry assertions are all on point.
  • Nice separation of concerns: manager compile semantics, context loader, syscall helpers, and composed expressions are each exercised.

498-554: Nice syscall + status composition tests.

These cover the cross-product of helpers and HTTP codes well and verify expression wiring end-to-end.


1-554: Sanity check — defaults & integration verified

I ran the grep you requested. Evidence shows the default expression, the builder, the mutation guard, EOF handling, and tests are present.

  • router/core/retry_expression_builder.go — DefaultRetryExpression = "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"; BuildRetryFunction falls back to the default when empty and the returned ShouldRetry func early-returns false for mutation requests (mutation guard).
  • router/internal/retrytransport/retry_transport.go — transport retry loop uses isDefaultRetryableError(err) || rt.RetryOptions.ShouldRetry(err, req, resp); EOF is handled as always-retryable (comment + strings.Contains(errStr, "unexpected eof")).
  • router/core/retry_expression_builder_test.go — unit tests call BuildRetryFunction(DefaultRetryExpression, ...) and explicitly assert EOF is handled at the transport layer (assert.False for "unexpected EOF").
  • router/internal/expr/retry_expression_test.go — comprehensive expression tests (status codes, IsTimeout/IsHttpReadTimeout, syscall errors) exercise manager.ShouldRetry.
  • router/pkg/config/config.schema.json — documentation/schema includes the same default expression and the explicit note: "Mutations are never retried regardless of this expression. EOF errors are always retried at the transport layer regardless of this expression."

Conclusion: the default expression, builder behavior, mutation guard, EOF transport handling, and tests are implemented and wired as requested. No further changes required from this check.

Comment thread router-tests/go.mod
Comment thread router/core/router.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (7)
router/core/retry_expression_builder.go (3)

28-33: Defensive nil request check to avoid panic

ShouldRetry dereferences req via req.Context(). If an upstream caller ever passes a nil request (even by mistake), this will panic. Cheap guard recommended.

-	return func(err error, req *http.Request, resp *http.Response) bool {
-		reqContext := getRequestContext(req.Context())
+	return func(err error, req *http.Request, resp *http.Response) bool {
+		// Defensive: ensure we never panic on a nil *http.Request
+		if req == nil {
+			return false
+		}
+		reqContext := getRequestContext(req.Context())

35-38: Use OperationTypeMutation constant and guard missing operation context

Comparing to the string "mutation" is brittle, and Operation() may be nil for some edge flows. Guard and use the typed constant to avoid regressions if the literal ever changes.

-		// Never retry mutations, regardless of expression result
-		if reqContext.Operation().Type() == "mutation" {
-			return false
-		}
+		// Never retry mutations, regardless of expression result
+		if op := reqContext.Operation(); op == nil || op.Type() == OperationTypeMutation {
+			return false
+		}

47-51: Optional: enrich evaluation-error logs with context

If evaluation fails repeatedly, having quick context helps. Consider adding operation name/type and statusCode/error from the retry context to the log entry.

Example:

  • zap.String("operation", reqContext.Operation().Name())
  • zap.String("operation_type", string(reqContext.Operation().Type()))
  • zap.Int("status_code", ctx.StatusCode)
  • zap.String("error", ctx.Error)
router/core/graph_server.go (1)

1159-1172: Keep RetryOptions.Expression in sync with the effective expression (handles empty -> default)

You compile a fallback expression when config is empty, but RetryOptions.Expression still carries the original (possibly empty) value. Persist the effective expression for observability/metrics consistency.

-	// Build retry options and handle any expression compilation errors
-	shouldRetryFunc, err := BuildRetryFunction(s.retryOptions.Expression)
+	// Build retry options and handle any expression compilation errors
+	effectiveExpr := s.retryOptions.Expression
+	if effectiveExpr == "" {
+		effectiveExpr = DefaultRetryExpression
+	}
+	shouldRetryFunc, err := BuildRetryFunction(effectiveExpr)
 	if err != nil {
 		return nil, fmt.Errorf("failed to build retry function: %w", err)
 	}
 
-	retryOptions := retrytransport.RetryOptions{
+	retryOptions := retrytransport.RetryOptions{
 		Enabled:       s.retryOptions.Enabled,
 		MaxRetryCount: s.retryOptions.MaxRetryCount,
 		MaxDuration:   s.retryOptions.MaxDuration,
 		Interval:      s.retryOptions.Interval,
-		Expression:    s.retryOptions.Expression,
+		Expression:    effectiveExpr,
 		ShouldRetry:   shouldRetryFunc,
 	}

Optional micro-optimization (no diff necessary): only compile the expression when s.retryOptions.Enabled is true; otherwise, defer the cost. Transport can handle a nil ShouldRetry when retries are disabled.

router/core/retry_expression_builder_test.go (3)

317-363: Add a test for missing request context (nil reqContext) to pin the early-return behavior

There’s no test that exercises ShouldRetry when the request lacks a bound requestContext. This helps prevent accidental future changes that remove the guard.

Add the following test (outside the current subtests):

func TestBuildRetryFunction_NoRequestContext(t *testing.T) {
	fn, err := BuildRetryFunction("true")
	assert.NoError(t, err)
	assert.NotNil(t, fn)

	// Intentionally do NOT attach a requestContext
	req, _ := http.NewRequest("POST", "http://example.com/graphql", nil)
	resp := &http.Response{StatusCode: 500}

	// Should short-circuit to false without a request context
	assert.False(t, fn(nil, req, resp))
}

51-74: Optional: mark subtests parallel and use require for construction errors

Speed up the suite and fail fast on setup errors.

  • Call t.Parallel() at the start of TestBuildRetryFunction and in each t.Run body.
  • Use require.NoError for BuildRetryFunction(...) to avoid follow-up nil derefs in the same subtest.

68-73: Optional: use a realistic connection error

To better reflect IsConnectionError() semantics, consider constructing a net.OpError wrapping syscall.ECONNREFUSED instead of a plain error string. Keeps tests robust if implementation stops matching string contents.

Example:

opErr := &net.OpError{Op: "dial", Err: syscall.ECONNREFUSED}
assert.True(t, fn(opErr, req, nil))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 17ca80d and 7d70025.

⛔ Files ignored due to path filters (1)
  • router-tests/go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • router/core/context.go (0 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
💤 Files with no reviewable changes (1)
  • router/core/context.go
🧰 Additional context used
🧬 Code graph analysis (3)
router/core/retry_expression_builder_test.go (3)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/core/retry_expression_builder.go (2)
  • BuildRetryFunction (15-58)
  • DefaultRetryExpression (12-12)
router/core/context.go (3)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
  • OperationTypeSubscription (490-490)
router/core/retry_expression_builder.go (4)
router/internal/retrytransport/retry_transport.go (1)
  • ShouldRetryFunc (13-13)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
router/core/graph_server.go (2)
router/core/retry_expression_builder.go (1)
  • BuildRetryFunction (15-58)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (15-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: lint
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: build_push_image
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
🔇 Additional comments (4)
router/core/retry_expression_builder.go (1)

14-26: Good defaulting and compile-time validation

Using DefaultRetryExpression as a fallback and compiling once up front is solid. Clear error wrapping on compile failures is also helpful for ops.

router/core/graph_server.go (1)

1189-1195: LGTM: wiring retry into TransportOptions

RetryOptions plumbing into TransportOptions looks correct and preserves existing pre/post handlers and metrics.

router/core/retry_expression_builder_test.go (2)

49-151: Strong coverage across default/empty/custom expressions

Good breadth: default path, empty expression fallback, compile error, status and error predicates. Clear assertions and readable helpers.


193-217: Mutation guard behavior well covered

Validates that even “true” expressions won’t retry mutations. This preserves the contract regardless of config.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
router-tests/structured_logging_test.go (1)

712-712: Optional: de-duplicate the repeated “no-retry” option.

This file repeats core.WithSubgraphRetryOptions(false, 0, 0, 0, "") several times. Consider a small helper var to DRY it up.

Apply at file scope (near imports or top-level consts):

+var noSubgraphRetry = core.WithSubgraphRetryOptions(false, 0, 0, 0, "")

And then replace per call site, e.g.:

-                core.WithSubgraphRetryOptions(false, 0, 0, 0, ""),
+                noSubgraphRetry,
router-tests/error_handling_test.go (1)

1383-1480: Optional: factor a shared “no-retry” Option for readability.

This file also repeats the disabled-retry Option. Consider a file-local var to reduce repetition.

Add near the top of the file:

+var noSubgraphRetry = core.WithSubgraphRetryOptions(false, 0, 0, 0, "")

Then replace occurrences within RouterOptions:

-                core.WithSubgraphRetryOptions(false, 0, 0, 0, ""),
+                noSubgraphRetry,
router-tests/telemetry/telemetry_test.go (1)

8709-8710: Use explicit default retry expression constant for clarity (nit).

Passing an empty string works (retries are disabled here anyway), but using an explicit default improves readability and future-proofs intent if defaults ever change.

-                    core.WithSubgraphRetryOptions(false, 0, 0, 0, ""),
+                    core.WithSubgraphRetryOptions(false, 0, 0, 0, core.DefaultRetryExpression),

If core.DefaultRetryExpression isn’t exported in the core package, keep the empty string.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 7d70025 and 36c93f0.

📒 Files selected for processing (4)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
  • router-tests/structured_logging_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
🧬 Code graph analysis (4)
router-tests/error_handling_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1762)
router-tests/panic_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1762)
router-tests/telemetry/telemetry_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1762)
router-tests/structured_logging_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1762)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: lint
  • GitHub Check: Analyze (go)
🔇 Additional comments (12)
router-tests/panic_test.go (2)

51-51: API arity update applied correctly (expression arg).

Passing "" for the new expression while retries are disabled (enabled=false) keeps behavior unchanged. LGTM.


83-83: Consistent update to WithSubgraphRetryOptions.

Same note as above; good alignment with the new 5-arg signature.

router-tests/structured_logging_test.go (6)

712-712: Arity bump handled; retries remain disabled.

Adding the empty expression argument is correct and preserves the intent of the test (NoRetryClient + disabled subgraph retries).


832-832: Correct propagation of new expression parameter.

No change in semantics since enabled=false; good.


964-964: LGTM: expression parameter added while keeping retries off.

Matches the updated Option signature.


1100-1100: LGTM: updated WithSubgraphRetryOptions call.

Maintains previous behavior of the test.


2214-2214: LGTM: consistent use of the new argument across tests.

All call sites here pass the empty expression; consistent with other tests in this suite.


712-712: Verification Passed: All WithSubgraphRetryOptions Call Sites Now Use 5 Arguments
Ran the recommended repo-wide verifier script; no instances with fewer than five arguments were found.

router-tests/error_handling_test.go (4)

1383-1383: Updated call site matches new signature.

Empty expression with retries disabled is correct for this test setup.


1416-1416: LGTM: expression parameter added while keeping retries off.


1448-1448: LGTM: consistent Option usage after API change.


1480-1480: LGTM: correct 5th argument added.

Same semantics retained.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (8)
router/internal/expr/retry_context.go (1)

60-78: Recognize DNS resolution failures via errors.As

Relying solely on substring matching (“no such host”) is brittle. Use errors.As to detect net.DNSError explicitly and treat it as a connection error.

Apply this diff:

 func (ctx RetryContext) IsConnectionError() bool {
 	// Use existing helpers for specific connection errors
 	if ctx.IsConnectionRefused() || ctx.IsConnectionReset() {
 		return true
 	}
 
+	// Detect DNS resolution failures via type assertion
+	if ctx.originalError != nil {
+		var dnsErr *net.DNSError
+		if errors.As(ctx.originalError, &dnsErr) {
+			return true
+		}
+	}
+
 	// Fall back to string matching for other connection errors not covered by specific helpers
 	if ctx.Error != "" {
 		errLower := strings.ToLower(ctx.Error)
 		return strings.Contains(errLower, "no such host") ||
 			strings.Contains(errLower, "handshake failure") ||
 			strings.Contains(errLower, "handshake timeout")
 	}
router/internal/retrytransport/retry_transport.go (4)

60-74: Honor Retry-After for 503 Service Unavailable as well

RFC 7231 allows Retry-After on 503 in addition to 429. Using the header for 503 improves backoff fidelity on server overload.

Apply this diff:

-// shouldUseRetryAfter determines if we should use Retry-After header for 429 responses
+// shouldUseRetryAfter determines if we should use Retry-After header for
+// 429 (Too Many Requests) and 503 (Service Unavailable) responses.
 func shouldUseRetryAfter(resp *http.Response) (time.Duration, bool) {
-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}

42-46: Trim whitespace when parsing Retry-After seconds

Minor robustness improvement: handle headers with surrounding whitespace.

Apply this diff:

 	// Try parsing as delay-seconds (integer)
-	if seconds, err := strconv.Atoi(retryAfter); err == nil && seconds >= 0 {
+	if seconds, err := strconv.Atoi(strings.TrimSpace(retryAfter)); err == nil && seconds >= 0 {
 		return time.Duration(seconds) * time.Second
 	}

158-160: Clarify comment grammar in drainBody

Tighten the wording for clarity.

Apply this diff:

-	// When we close the body only will go internally marks the persisted connection as true
-	// which is important so that it can reuse the connection internally for retrying
+	// Draining and closing the body marks the connection as reusable,
+	// which is important so the transport can reuse it for the retry.

172-182: Consider expanding default retryable errors (optional)

Only “unexpected EOF” is covered. You may also want to consider other transient network errors (e.g., “server closed idle connection”) to improve resilience. If you prefer keeping defaults minimal, leave as-is and rely on expressions.

router/internal/retrytransport/retry_transport_test.go (3)

520-582: These tests will sleep in real time; consider a Sleeper injection to make them fast

Tests using Retry-After of 1s (and HTTP-date +1s) will introduce multi-second sleeps. Recommend adding an optional Sleeper func(time.Duration) to RetryOptions (defaulting to time.Sleep) and using it in RoundTrip. Tests can then inject a no-op or scaled sleeper to keep the suite fast and deterministic.

I can send a small patch to:

  • add RetryOptions.Sleeper,
  • use it instead of time.Sleep in RoundTrip,
  • update these tests to inject a no-op sleeper,
    so they complete near-instantly. Want me to open that follow-up?

272-280: Remove unused mockRoundTripper type

This type is not referenced anywhere in the file; keeping both MockTransport and mockRoundTripper is redundant.

Apply this diff to delete it:

-// Mock round tripper for testing
-type mockRoundTripper struct {
-	roundTripFunc func(req *http.Request) (*http.Response, error)
-}
-
-func (m *mockRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
-	return m.roundTripFunc(req)
-}

447-458: Rename variable for clarity

The local name unusedLogger is misleading; it is returned and used. Rename to logger.

Apply this diff:

-	unusedLogger := zap.New(core)
-	return &buf, unusedLogger
+	logger := zap.New(core)
+	return &buf, logger
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 36c93f0 and a394639.

📒 Files selected for processing (4)
  • router/internal/expr/retry_context.go (1 hunks)
  • router/internal/expr/retry_expression_test.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (5 hunks)
  • router/internal/retrytransport/retry_transport_test.go (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/expr/retry_expression_test.go
🧰 Additional context used
🧬 Code graph analysis (2)
router/internal/retrytransport/retry_transport_test.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • RetryHTTPTransport (28-32)
  • RetryOptions (16-24)
router/internal/expr/retry_context.go (1)
router/pkg/pubsub/datasource/error.go (1)
  • Error (3-6)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/retrytransport/retry_transport_test.go (1)

880-929: Nice coverage of 429 opt-out behavior

Good negative test ensuring that 429s are not retried when ShouldRetry excludes them, even with Retry-After present. Aligns with the configurable policy intent.

Comment thread router/internal/expr/retry_context.go
Comment thread router/internal/expr/retry_context.go
Comment thread router/internal/retrytransport/retry_transport.go Outdated
Comment thread router/internal/retrytransport/retry_transport.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
router/internal/expr/retry_context.go (2)

3-10: Include context.DeadlineExceeded in IsTimeout() and import context

Many network timeouts surface as context.DeadlineExceeded. Treat it as a timeout alongside os.ErrDeadlineExceeded and ETIMEDOUT.

Apply this diff:

 import (
+	"context"
 	"errors"
 	"net"
 	"net/http"
 	"os"
 	"strings"
 	"syscall"
 )
@@
 	// Check for net package timeout errors using the standard Go method
 	if ctx.originalError != nil {
 		if netErr, ok := ctx.originalError.(net.Error); ok && netErr.Timeout() {
 			return true
 		}
 		// Check for deadline exceeded errors
 		if errors.Is(ctx.originalError, os.ErrDeadlineExceeded) {
 			return true
 		}
+		// Treat context deadline exceeded as a timeout
+		if errors.Is(ctx.originalError, context.DeadlineExceeded) {
+			return true
+		}
 		// Also check for direct syscall timeout errors not wrapped in net.Error
 		if errors.Is(ctx.originalError, syscall.ETIMEDOUT) {
 			return true
 		}
 	}

Also applies to: 42-58


87-99: Include 429 (Too Many Requests) in IsRetryableStatusCode to align default behavior and Retry-After handling

Per the PR objectives and tests around Retry-After, 429 should be considered retryable by the helper, so the default “Balanced” expression can honor Retry-After without custom ShouldRetry overrides.

Apply this diff:

-// IsRetryableStatusCode returns true if the HTTP status code is generally
-// considered retryable, including 500, 502, 503, and 504.
+// IsRetryableStatusCode returns true if the HTTP status code is generally
+// considered retryable, including 429, 500, 502, 503, and 504.
 func (ctx RetryContext) IsRetryableStatusCode() bool {
 	switch ctx.StatusCode {
+	case http.StatusTooManyRequests,
 	case http.StatusInternalServerError,
 		http.StatusBadGateway,
 		http.StatusServiceUnavailable,
 		http.StatusGatewayTimeout:
 		return true
 	default:
 		return false
 	}
 }

To verify impact and adjust tests if needed, run:

#!/bin/bash
# Where do we reference 429 in tests and expressions?
rg -n --no-heading -C2 'StatusTooManyRequests|429' .

# Where is IsRetryableStatusCode() used in expression defaults?
rg -n --no-heading -C3 'IsRetryableStatusCode\(\)' .

# Sanity-check tests that explicitly exclude 429 to ensure intent remains correct
rg -n --no-heading -C2 'shouldNotRetry429|shouldRetryWith429' router/internal/retrytransport
🧹 Nitpick comments (4)
router/internal/expr/retry_context.go (1)

61-79: Broaden IsConnectionError to cover unreachable host/network conditions

Optional hardening: include common OS-level connectivity failures (EHOSTUNREACH, ENETUNREACH) and their string variants. This improves signal for transient network partitions without overrelying on string checks.

Apply this diff:

 func (ctx RetryContext) IsConnectionError() bool {
 	// Use existing helpers for specific connection errors
 	if ctx.IsConnectionRefused() || ctx.IsConnectionReset() {
 		return true
 	}
 
+	// OS-level unreachable/network conditions
+	if ctx.originalError != nil {
+		if errors.Is(ctx.originalError, syscall.EHOSTUNREACH) ||
+			errors.Is(ctx.originalError, syscall.ENETUNREACH) {
+			return true
+		}
+	}
+
 	// Fall back to string matching for other connection errors not covered by specific helpers
 	if ctx.Error != "" {
 		errLower := strings.ToLower(ctx.Error)
 		return strings.Contains(errLower, "no such host") ||
+			strings.Contains(errLower, "no route to host") ||
+			strings.Contains(errLower, "network is unreachable") ||
+			strings.Contains(errLower, "host is unreachable") ||
 			strings.Contains(errLower, "handshake failure") ||
 			strings.Contains(errLower, "handshake timeout")
 	}
 
 	return false
 }
router/internal/retrytransport/retry_transport_test.go (3)

511-572: Tests depend on real time.Sleep for Retry-After; add a test seam to avoid slow/flaky timing

The 429 tests use real Retry-After delays (e.g., 1s). Consider injecting a sleeper/clock into RetryHTTPTransport so tests can stub it and assert exact durations without wall-clock waits.

Proposed approach (high level):

  • Add an optional Sleeper func(time.Duration) to RetryOptions (default: time.Sleep).
  • Use it wherever backoff/delay is applied (including Retry-After).
  • In tests, pass a stub sleeper that records durations without sleeping.

Example diffs (snippets; production file changes required):

- type RetryOptions struct {
+ type RetryOptions struct {
     Enabled       bool
     MaxRetryCount int
     Interval      time.Duration
     MaxDuration   time.Duration
     Expression    string
     OnRetry       func(count int, req *http.Request, resp *http.Response, err error)
     ShouldRetry   ShouldRetryFunc
+    Sleeper       func(time.Duration) // nil => time.Sleep
  }
- sleeper := time.Sleep
+ sleeper := time.Sleep
+ if rt.RetryOptions.Sleeper != nil {
+   sleeper = rt.RetryOptions.Sleeper
+ }
 // ... later
- time.Sleep(delay)
+ sleeper(delay)

Then in tests, use:

var slept []time.Duration
noopSleeper := func(d time.Duration) { slept = append(slept, d) }
RetryOptions{ Sleeper: noopSleeper, ... }

This will speed up CI and let you assert the exact delay logic deterministically.


978-1057: Add a ShouldUseRetryAfter test case for "Retry-After: 0"

A header of 0 seconds is valid and should mean “retry immediately.” We already cover parseRetryAfterHeader("0"). Add a case to TestShouldUseRetryAfter ensuring duration == 0 and use == true for 429 with header "0".

Apply this diff to extend the table:

 		{
 			name: "429 with invalid Retry-After",
 			resp: func() *http.Response {
 				resp := &http.Response{
 					StatusCode: http.StatusTooManyRequests,
 					Header:     make(http.Header),
 				}
 				resp.Header.Set("Retry-After", "invalid")
 				return resp
 			}(),
 			expectedDur: 0,
 			expectedUse: false,
 		},
+		{
+			name: "429 with zero Retry-After seconds",
+			resp: func() *http.Response {
+				resp := &http.Response{
+					StatusCode: http.StatusTooManyRequests,
+					Header:     make(http.Header),
+				}
+				resp.Header.Set("Retry-After", "0")
+				return resp
+			}(),
+			expectedDur: 0,
+			// Zero is a valid header and should be honored as "no wait"
+			expectedUse: true,
+		},

441-449: Nit: variable name “unusedLogger” is misleading

This logger is returned and used by callers; rename to logger.

-	unusedLogger := zap.New(core)
-	return &buf, unusedLogger
+	logger := zap.New(core)
+	return &buf, logger
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a394639 and a496eff.

📒 Files selected for processing (3)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/internal/expr/retry_context.go (1 hunks)
  • router/internal/retrytransport/retry_transport_test.go (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/retry_expression_builder.go
🧰 Additional context used
🧬 Code graph analysis (2)
router/internal/retrytransport/retry_transport_test.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • RetryHTTPTransport (28-32)
  • RetryOptions (16-24)
router/internal/expr/retry_context.go (1)
router/pkg/pubsub/datasource/error.go (1)
  • Error (3-6)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
router/internal/retrytransport/retry_transport.go (2)

109-111: Guard against nil ShouldRetry to avoid panics.

RoundTrip dereferences ShouldRetry without a nil check. If wiring forgets to set it (e.g., retries disabled), this panics.

Apply this diff:

-	for (isDefaultRetryableError(err) || rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
+	for (isDefaultRetryableError(err) || (rt.RetryOptions.ShouldRetry != nil && rt.RetryOptions.ShouldRetry(err, req, resp))) && retries < rt.RetryOptions.MaxRetryCount {

135-137: Make backoff sleep cancelable via request context.

Currently time.Sleep blocks even if the request is canceled, increasing tail latency under client aborts/timeouts.

Apply this diff:

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration, but allow context cancellation
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				if !timer.Stop() {
+					<-timer.C
+				}
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}
🧹 Nitpick comments (7)
router-tests/panic_test.go (2)

51-51: Make the 7-arg call self-documenting to prevent misordered params

The update to the new signature is correct. Given three zeros and two nils, it’s easy to misorder or misread the durations and overrides in future edits. Consider expanding to multi-line with inline comments for clarity.

-               core.WithSubgraphRetryOptions(false, 0, 0, 0, "", nil, nil),
+               core.WithSubgraphRetryOptions(
+                 false, // enabled
+                 0,     // maxRetryCount
+                 0,     // retryMaxDuration
+                 0,     // retryInterval
+                 "",    // expression (unused because retries are disabled)
+                 nil,   // onRetryFunc
+                 nil,   // roundTripOverride
+               ),

83-83: Repeat readability tweak for the second occurrence

Same reasoning as above; expanding and commenting makes intent explicit and reduces future maintenance risk.

-               core.WithSubgraphRetryOptions(false, 0, 0, 0, "", nil, nil),
+               core.WithSubgraphRetryOptions(
+                 false, // enabled
+                 0,     // maxRetryCount
+                 0,     // retryMaxDuration
+                 0,     // retryInterval
+                 "",    // expression (unused because retries are disabled)
+                 nil,   // onRetryFunc
+                 nil,   // roundTripOverride
+               ),
router-tests/telemetry/telemetry_test.go (1)

8709-8710: Consider a tiny helper to avoid churn if the API changes again

This call shape will likely recur across tests. A small wrapper reads better and centralizes any future signature updates.

Apply this diff in this file:

-					core.WithSubgraphRetryOptions(false, 0, 0, 0, "", nil, nil),
+					withNoSubgraphRetries(),

And add this helper (anywhere in this test file or a shared test helper):

func withNoSubgraphRetries() core.Option {
	return core.WithSubgraphRetryOptions(false, 0, 0, 0, "", nil, nil)
}
router/core/retry_expression_builder.go (1)

12-12: Expose default expression in docs and config examples.

DefaultRetryExpression matches the PR’s decision table “Balanced (default)”. Please mirror this constant in docs/examples so users know what they get when expression is empty.

router/internal/retrytransport/retry_transport.go (1)

180-190: Treat both io.ErrUnexpectedEOF and io.EOF as retryable defaults.

PR notes say “EOF errors are always retried,” but the current check only matches “unexpected EOF” by substring. Use errors.Is for both canonical errors.

Apply this diff:

-func isDefaultRetryableError(err error) bool {
-	if err == nil {
-		return false
-	}
-
-	errStr := strings.ToLower(err.Error())
-	// EOF errors are always retryable as they indicate connection issues
-	return strings.Contains(errStr, "unexpected eof")
-}
+func isDefaultRetryableError(err error) bool {
+	if err == nil {
+		return false
+	}
+	// EOF errors indicate connection issues; always retry them by default.
+	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF)
+}

Additional change required: add the missing import.

Add this import near the top of the file:

import "errors"
router-tests/retry_test.go (2)

133-181: Time-based assertions: consider widening tolerance to reduce flakiness.

Backoff and scheduling jitter can vary under load. You already subtract 20–50ms; if CI shows occasional flakes, consider increasing slack or asserting “within a range”.


387-394: Optional: add a mutation case to assert “mutations are never retried.”

To lock in the invariant, add a subtest that issues a failing mutation with expression "true" and verifies no retries occur (onRetryCounter == 0, service called once).

Example (add as a new t.Run):

t.Run("mutations are never retried", func(t *testing.T) {
	t.Parallel()
	onRetryCounter := atomic.Int32{}
	retryCounterFunc := CreateRetryCounterFunc(&onRetryCounter)

	options := core.WithSubgraphRetryOptions(true, 5, 5*time.Second, 200*time.Millisecond, "true", retryCounterFunc, nil)

	testenv.Run(t, &testenv.Config{
		NoRetryClient: true,
		RouterOptions: []core.Option{ options },
		Subgraphs: testenv.SubgraphsConfig{
			Employees: testenv.SubgraphConfig{
				Middleware: func(_ http.Handler) http.Handler {
					return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
						w.WriteHeader(http.StatusBadGateway)
					})
				},
			},
		},
	}, func(t *testing.T, xEnv *testenv.Environment) {
		_, err := xEnv.MakeGraphQLRequest(testenv.GraphQLRequest{
			Query: `mutation { doSomething }`,
		})
		require.NoError(t, err)
		require.Equal(t, 0, int(onRetryCounter.Load()))
	})
})
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a496eff and d39bcee.

📒 Files selected for processing (12)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/retry_test.go (1 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
  • router/core/router.go (1 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/internal/expr/retry_expression_test.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • router/core/graph_server.go
  • router/core/supervisor_instance.go
  • router/core/retry_expression_builder_test.go
  • router/internal/expr/retry_expression_test.go
  • router-tests/error_handling_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
  • router-tests/structured_logging_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/telemetry/telemetry_test.go
  • router-tests/structured_logging_test.go
🧬 Code graph analysis (6)
router-tests/telemetry/telemetry_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1773)
router-tests/retry_test.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Body (92-94)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • SubgraphsConfig (365-378)
  • Environment (1727-1763)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1752-1773)
  • Option (172-172)
router/pkg/config/config.go (1)
  • CustomAttribute (43-47)
router-tests/panic_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1773)
router/core/router.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • OnRetryFunc (15-15)
  • RetryOptions (17-28)
router/core/retry_expression_builder.go (3)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-28)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
router-tests/structured_logging_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1752-1773)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (5)
router-tests/telemetry/telemetry_test.go (2)

8709-8710: API update to WithSubgraphRetryOptions looks correct

The expanded call matches the new signature and keeps retries disabled for this scenario. Passing 0 for the duration parameters and nil for the optional hooks is acceptable here.


8709-8710: All WithSubgraphRetryOptions calls use the 7-arg signature
Verified via ripgrep and comma-count check that every WithSubgraphRetryOptions(...) invocation across the tests and core code has exactly six commas (seven arguments), so no instances of the old 4-arg form remain.

router-tests/structured_logging_test.go (1)

712-713: All WithSubgraphRetryOptions calls are now using the new 7-arg signature

Verified across the entire repo that every invocation of WithSubgraphRetryOptions includes all seven parameters—no outdated 5-arg or lower-arity calls remain. Passing an empty expression ("") and nil hooks is appropriate for these test scenarios.

• No action required on call sites in:
– router-tests/panic_test.go
– router-tests/retry_test.go
– router-tests/error_handling_test.go
– router-tests/structured_logging_test.go
– router-tests/telemetry/telemetry_test.go
– router/core/supervisor_instance.go
– router/core/router.go

Optional: you may implement a short-lived compatibility shim (see comment in core/router.go) to ease the transition for external integrators.

router/core/retry_expression_builder.go (1)

35-64: Expression evaluation and mutation guard look solid.

  • Never retrying mutations aligns with PR notes.
  • Logging on evaluation error and failing closed is reasonable since the transport still retries default EOFs.

No changes needed.

router-tests/retry_test.go (1)

186-217: EOF default retry test looks good.

Using RoundTripOverride to inject io.ErrUnexpectedEOF validates the “always retry EOF” path independent of expression. Assertions on retry counts are correct.

Comment thread router-tests/retry_test.go Outdated
Comment thread router/core/router.go
Comment thread router/internal/retrytransport/retry_transport.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
router/internal/retrytransport/retry_transport.go (2)

145-146: Make backoff sleep cancelable via request context

If the client cancels the request, we should stop sleeping and return promptly. Replace time.Sleep with a context-aware wait.

Apply this diff:

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration, but allow context cancellation
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				if !timer.Stop() {
+					<-timer.C
+				}
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}

117-121: Fix potential nil dereference when retries are disabled (ShouldRetry can be nil)

When retries are disabled, BuildRetryFunction returns nil and ShouldRetry remains nil. The loop dereferences it unconditionally and will panic on first failure response. Guard the call.

Apply this diff:

-	for (rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
+	for (rt.RetryOptions.ShouldRetry != nil && rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
🧹 Nitpick comments (9)
router/internal/retrytransport/retry_transport.go (3)

115-116: Guard against nil request logger provider

If getRequestLogger is nil or returns nil, later Debug/Error calls will panic. Default to zap.NewNop().

Apply this diff:

-	requestLogger := rt.getRequestLogger(req)
+	requestLogger := zap.NewNop()
+	if rt.getRequestLogger != nil {
+		if l := rt.getRequestLogger(req); l != nil {
+			requestLogger = l
+		}
+	}

66-72: Nil-safe logging for malformed Retry-After

parseRetryAfterHeader logs with logger.Error without verifying logger is non-nil. Add a guard.

Apply this diff:

-	if errJoin != nil {
-		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
-	}
+	if errJoin != nil && logger != nil {
+		logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+	}

141-143: Align OnRetry count semantics with tests (0-based vs 1-based)

OnRetry currently receives count starting at 1 (because retries++ happens before the call). The tests expect a 0-based count. Pass retries-1 to the callback.

Apply this diff:

-		if rt.RetryOptions.OnRetry != nil {
-			rt.RetryOptions.OnRetry(retries, req, resp, sleepDuration, err)
-		}
+		if rt.RetryOptions.OnRetry != nil {
+			rt.RetryOptions.OnRetry(retries-1, req, resp, sleepDuration, err)
+		}

If you prefer 1-based semantics, adjust the tests instead. Pick one convention and document it on OnRetryFunc.

router/internal/retrytransport/retry_transport_test.go (1)

348-349: Reduce flakiness in Retry-After/interval assertions

Asserting NotEqual to retryInterval can still occasionally match due to jitter. Prefer range checks, e.g., 0 < sleepDuration < retryInterval+some_tolerance, or explicitly assert shouldUseRetryAfter(...) outcome.

Example:

assert.Greater(t, sleepDuration.Load(), int64(0))
assert.InDelta(t, int64(retryInterval/2), sleepDuration.Load(), int64(retryInterval)) // within a loose window

Also applies to: 397-398

router-tests/retry_test.go (2)

198-245: Good: validates 429 Retry-After seconds and captures sleepDuration via OnRetry

Clear, deterministic check using headerRetryIntervalInSeconds. As a tiny enhancement, consider validating that all retries used the same duration (not only the last observed), e.g., accumulate durations into a slice and assert them all equal to the header value.


352-400: Zero Retry-After semantics: clarify expectation

With “Retry-After: 0”, current transport falls back to normal backoff; this test only asserts it’s not equal to the fixed retryInterval. If you intend “immediate retry” on zero, assert sleepDuration == 0 and update transport accordingly. Otherwise, consider a tighter range assertion (e.g., sleepDuration > 0 and != retryInterval).

router/core/retry_expression_builder.go (3)

13-13: Default expression matches the “Balanced” profile; add a doc comment for discoverability

This constant aligns with the PR’s decision table. Add a short doc comment so it’s easy for contributors to find in IDE hovers and godoc.

-const DefaultRetryExpression = "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"
+// DefaultRetryExpression is applied when RetryOptions.Expression is empty.
+// It corresponds to the "Balanced" profile in the retry decision table.
+const DefaultRetryExpression = "IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"

35-47: Guard against nil http.Request to avoid a potential panic

If ShouldRetry is ever called with a nil req (unlikely but possible during error paths), getRequestContext(req.Context()) would panic. A minimal guard is cheap and safe.

-	return func(err error, req *http.Request, resp *http.Response) bool {
-		reqContext := getRequestContext(req.Context())
+	return func(err error, req *http.Request, resp *http.Response) bool {
+		if req == nil {
+			return false
+		}
+		reqContext := getRequestContext(req.Context())

Would you confirm the retry transport never calls ShouldRetry with a nil request?


55-65: Avoid potential nil logger dereference on eval error

If reqContext.logger can ever be nil, the error logging would panic and mask the evaluation failure. Defensive check keeps this path safe.

-		if evalErr != nil {
-			reqContext.logger.Error("Failed to evaluate retry expression",
+		if evalErr != nil {
+			if reqContext.logger != nil {
+				reqContext.logger.Error("Failed to evaluate retry expression",
 					zap.Error(evalErr),
 					zap.String("expression", expression),
-			)
+				)
+			}
 			// Disable retries on evaluation error
 			return false
 		}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d39bcee and d639aec.

📒 Files selected for processing (12)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/retry_test.go (1 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
  • router/core/router.go (1 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (3 hunks)
  • router/internal/retrytransport/retry_transport_test.go (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • router-tests/error_handling_test.go
  • router/core/router.go
  • router/core/retry_expression_builder_test.go
  • router-tests/panic_test.go
  • router-tests/telemetry/telemetry_test.go
  • router/core/supervisor_instance.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router-tests/structured_logging_test.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/structured_logging_test.go
🧬 Code graph analysis (4)
router/core/graph_server.go (2)
router/core/retry_expression_builder.go (1)
  • BuildRetryFunction (16-69)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-28)
router/core/retry_expression_builder.go (4)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-28)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
router-tests/retry_test.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Body (92-94)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • SubgraphsConfig (365-378)
  • Environment (1727-1763)
router/core/router.go (2)
  • WithSubgraphRetryOptions (1752-1765)
  • Option (172-172)
router/pkg/config/config.go (1)
  • CustomAttribute (43-47)
router/internal/retrytransport/retry_transport_test.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • RetryHTTPTransport (32-36)
  • RetryOptions (17-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
🔇 Additional comments (5)
router-tests/structured_logging_test.go (1)

712-713: LGTM: test call sites updated to new WithSubgraphRetryOptions signature

All updated invocations correctly pass the additional expression string and onRetry func (using empty string and nil where not needed). No further changes required here.

Also applies to: 831-832, 963-964, 1099-1100, 2215-2216

router/internal/retrytransport/retry_transport_test.go (1)

513-574: Comprehensive coverage for Retry-After and backoff behavior looks solid

The new tests cover numeric and HTTP-date Retry-After parsing, invalid/negative headers, and mixed 429/non-429 scenarios. They also validate OnRetry and attempt counts. Nicely done.

Also applies to: 576-638, 640-684, 686-751, 753-801, 802-849, 850-935, 937-984, 986-1041, 1043-1142

router-tests/retry_test.go (1)

252-300: Nice bound checks for max duration vs interval

The test asserts both upper and lower bounds, leaving jitter slack. Good balance between determinism and real-world backoff behavior.

router/core/retry_expression_builder.go (2)

15-21: Early exit when retries are disabled: LGTM

Returning (nil, nil) when retries are disabled is clear and prevents unnecessary expression compilation. This also intentionally ignores bad expressions when disabled, which is reasonable.


23-34: Defaulting and compilation path looks solid

Falling back to DefaultRetryExpression and surfacing a wrapped compile error keeps failure modes explicit. Nice.

Comment thread router/core/graph_server.go
Comment thread router/core/retry_expression_builder.go
Comment thread router/internal/retrytransport/retry_transport_test.go
Comment thread router/internal/retrytransport/retry_transport.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
router/internal/retrytransport/retry_transport.go (3)

116-119: Nil-guard ShouldRetry to avoid panics.

RoundTrip dereferences RetryOptions.ShouldRetry without a nil check. If wiring forgets to set it, this will panic. Short-circuit the call when it’s nil.

Apply:

-	for (rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {
+	for (rt.RetryOptions.ShouldRetry != nil && rt.RetryOptions.ShouldRetry(err, req, resp)) && retries < rt.RetryOptions.MaxRetryCount {

145-147: Make sleep cancelable via request context.

Blocking time.Sleep ignores request cancellation and can hold resources unnecessarily.

Apply:

-		// Wait for the specified duration
-		time.Sleep(sleepDuration)
+		// Wait for the specified duration, but allow context cancellation
+		if sleepDuration > 0 {
+			timer := time.NewTimer(sleepDuration)
+			select {
+			case <-req.Context().Done():
+				if !timer.Stop() {
+					<-timer.C
+				}
+				return resp, req.Context().Err()
+			case <-timer.C:
+			}
+		}

84-90: Fix clamp-to-zero bug when MaxDuration is unset (0).

If MaxDuration is 0 (meaning “no cap”), the current logic forces duration to 0 and disables Retry-After handling. Guard the clamp with maxDuration > 0.

Apply:

-	if duration > maxDuration {
-		duration = maxDuration
-	}
+	if maxDuration > 0 && duration > maxDuration {
+		duration = maxDuration
+	}
🧹 Nitpick comments (9)
router/internal/retrytransport/retry_transport.go (4)

73-91: Honor Retry-After for 503 Service Unavailable as well.

Per RFC 7231, Retry-After applies to both 429 and 503. Supporting 503 improves interoperability with common upstreams/CDNs.

Apply:

-func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
-	if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
+func shouldUseRetryAfter(logger *zap.Logger, resp *http.Response, maxDuration time.Duration) (time.Duration, bool) {
+	if resp == nil || (resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable) {
 		return 0, false
 	}

And update the debug message at Line 125 to avoid hardcoding “429” (optional).


114-115: Defensive: default to a no-op logger if getRequestLogger returns nil.

Avoid potential nil deref in logging paths.

Apply:

-	requestLogger := rt.getRequestLogger(req)
+	requestLogger := rt.getRequestLogger(req)
+	if requestLogger == nil {
+		requestLogger = zap.NewNop()
+	}

117-121: Optional: respect RetryOptions.Enabled to bypass retries entirely.

Right now retries run solely based on ShouldRetry. If Enabled=false is a supported config, short-circuit before retry loop.

Apply:

 	// Retry logic
 	retries := 0
+	if !rt.RetryOptions.Enabled {
+		return resp, err
+	}

65-71: Right-size log level for malformed Retry-After headers.

Parsing failures on third-party headers are frequent and non-fatal. Consider downgrading to Warn/Debug or sampling to reduce noise.

For example:

-	logger.Error("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))
+	logger.Warn("Failed to parse Retry-After header", zap.String("retry-after", retryAfter), zap.Error(errJoin))

(or wrap with a sampled logger).

router/internal/retrytransport/retry_transport_test.go (5)

551-559: Speed up 429 seconds-based test by capping MaxDuration to milliseconds.

This test currently incurs ~2s of real sleep (1s per retry). Reduce runtime by capping transport MaxDuration so the sleep is bounded while still exercising Retry-After logic.

Apply:

-			MaxDuration:   10 * time.Second,
+			MaxDuration:   10 * time.Millisecond,

If you also want to assert the actual chosen delay, capture sleepDuration via OnRetry and assert it equals the cap.


615-623: Align transport MaxDuration with test’s local maxDuration to keep it fast and verify capping end-to-end.

Right now the handler validates shouldUseRetryAfter with a 500ms cap, but the transport still sleeps up to 1s. Use the same cap in RetryOptions.

Apply:

-			MaxDuration:   10 * time.Second,
+			MaxDuration:   maxDuration,

Optionally, assert that the OnRetry sleepDuration equals maxDuration.


729-737: Reduce HTTP-date based 429 test duration.

Bound transport MaxDuration to a few milliseconds; the header path is still taken, but the sleep is short.

Apply:

-			MaxDuration:   10 * time.Second,
+			MaxDuration:   10 * time.Millisecond,

909-917: Shrink mixed 429/other-errors test runtime.

Use millisecond-level MaxDuration instead of 10s to avoid long sleeps for the 429-with-header case.

Apply:

-			MaxDuration:   10 * time.Second,
+			MaxDuration:   10 * time.Millisecond,

513-575: Avoid real-time sleeps in tests by asserting sleepDuration via OnRetry or by injecting a sleeper.

These 429 tests currently incur real sleeps (seconds or ~1s) which slow the suite. Two options:

  • Minimal change: capture the sleepDuration parameter in OnRetry and assert it equals the expected Retry-After (or the cap). Then reduce MaxDuration so real sleep is tiny.
  • Better: inject a Sleeper func into the transport for tests (e.g., in RetryOptions) so tests can bypass sleeping entirely.

Example transport tweak (requires small production change):

 type RetryOptions struct {
   ...
   OnRetry OnRetryFunc
+  // Test/DI: optional override for sleeping between retries
+  Sleep func(ctx context.Context, d time.Duration) error
 }
 
 // in RoundTrip in place of time.Sleep(sleepDuration):
- time.Sleep(sleepDuration)
+ if rt.RetryOptions.Sleep != nil {
+   if err := rt.RetryOptions.Sleep(req.Context(), sleepDuration); err != nil {
+     return resp, err
+   }
+ } else {
+   // existing cancelable timer path
+ }

Then in tests set Sleep to a no-op that records d.

Also applies to: 576-639, 686-752, 802-849

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d639aec and 259f539.

📒 Files selected for processing (2)
  • router/internal/retrytransport/retry_transport.go (3 hunks)
  • router/internal/retrytransport/retry_transport_test.go (12 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T17:19:28.144Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.144Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/internal/retrytransport/retry_transport.go
🧬 Code graph analysis (1)
router/internal/retrytransport/retry_transport_test.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • RetryHTTPTransport (31-35)
  • RetryOptions (17-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (go)
🔇 Additional comments (4)
router/internal/retrytransport/retry_transport.go (2)

164-182: Body draining looks correct.

Draining then closing ensures connection reuse; nil checks and error logs are in place.


184-188: OK: success criteria is conventional 2xx.

Guarded by err==nil before invocation; no nil deref risk.

router/internal/retrytransport/retry_transport_test.go (2)

453-511: Good: OnRetry callback contract verified (count is 1-based).

The test validates call count and parameters; aligns with transport semantics.


312-375: Body-draining behavior is exercised well.

TrackableBody with throwOnRead/Close and log assertions give confidence in connection reuse and error logging.

Also applies to: 377-438

Comment thread router/internal/retrytransport/retry_transport_test.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
router/core/retry_expression_builder_test.go (1)

177-191: Clarify EOF behavior vs. PR note; add targeted test or adjust wording

PR text says “EOF errors are always retried.” Implementation (by design) matches only “unexpected EOF” via string matching; plain io.EOF is not retried. This is intentional per team preference captured in learnings, but it conflicts with the broad “EOF errors” phrasing. Consider either:

  • Add a test for io.EOF reflecting current behavior (not retried), and update docs to say “unexpected EOF” is always retried, or
  • Expand the default check to include io.EOF (and add tests).

Given the team’s stated preference to keep it simple and wait for feedback, I recommend updating wording/tests for clarity now.

Would you like me to push a follow-up commit that updates the docs/test names to explicitly say “unexpected EOF”?

router/core/retry_expression_builder.go (1)

75-85: Comment says “EOF errors are always retryable” but implementation matches only “unexpected EOF”

This is intentional per team guidance (keep it simple; customers can extend via expression). Let’s make the comment exact to avoid confusion with the PR blurb.

Apply this diff:

-// isDefaultRetryableError checks for errors that should always be retryable
-// regardless of the configured retry expression
+// isDefaultRetryableError checks for errors that should always be retryable
+// regardless of the configured retry expression.
+// Note: intentionally limited to “unexpected EOF” string match for now (team decision);
+// customers can handle other EOF variants via custom expressions.
@@
-	// EOF errors are always retryable as they indicate connection issues
+	// Unexpected EOFs are treated as retryable (indicative of truncated/aborted reads)
 	return strings.Contains(errStr, "unexpected eof")
🧹 Nitpick comments (5)
router/core/retry_expression_builder_test.go (2)

84-85: Outdated comment: EOF is handled before expression evaluation, not in the transport

The handling occurs via isDefaultRetryableError inside BuildRetryFunction, prior to expression evaluation.

Apply this diff:

-    assert.True(t, fn(errors.New("unexpected EOF"), req, nil)) // EOF is now handled at transport layer, not expression
+    // EOF is handled prior to expression evaluation via isDefaultRetryableError (not by the expression itself)
+    assert.True(t, fn(errors.New("unexpected EOF"), req, nil))

316-335: Optional coverage: verify behavior with missing request context

We already cover the happy path. A small test asserting the function returns false when the request lacks an attached requestContext would guard against regressions.

Example test to append:

func TestShouldRetry_NoRequestContext_ReturnsFalse(t *testing.T) {
	fn, err := BuildRetryFunction(retrytransport.RetryOptions{
		Enabled:    true,
		Expression: "true",
	})
	assert.NoError(t, err)
	assert.NotNil(t, fn)

	req, _ := http.NewRequest("POST", "http://example.com/graphql", nil) // no requestContext attached
	assert.False(t, fn(nil, req, &http.Response{StatusCode: 500}))
}
router/core/retry_expression_builder.go (3)

33-37: Avoid double-wrapping the compile error message

expr.NewRetryExpressionManager already formats its error with “failed to compile retry expression: …”. Wrapping again duplicates the phrase.

Apply this diff:

-	manager, err := expr.NewRetryExpressionManager(expression)
-	if err != nil {
-		return nil, fmt.Errorf("failed to compile retry expression: %w", err)
-	}
+	manager, err := expr.NewRetryExpressionManager(expression)
+	if err != nil {
+		return nil, err
+	}

41-46: Defensive nil check for req to prevent a potential panic

ShouldRetryFunc is an internal callback; guarding against a nil http.Request is cheap and eliminates a narrow edge-case panic.

Apply this diff:

-		reqContext := getRequestContext(req.Context())
+		if req == nil {
+			return false
+		}
+		reqContext := getRequestContext(req.Context())

47-50: Use EqualFold and the OperationTypeMutation constant for clarity

Minor readability: avoid ToLower and compare against the declared constant.

Apply this diff:

-		// Never retry mutations, regardless of expression result
-		if strings.ToLower(reqContext.Operation().Type()) == "mutation" {
+		// Never retry mutations, regardless of expression result
+		if strings.EqualFold(reqContext.Operation().Type(), string(OperationTypeMutation)) {
 			return false
 		}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 259f539 and 95de8ad.

📒 Files selected for processing (3)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
  • router/core/router.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/core/router.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T17:30:41.187Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.187Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/core/retry_expression_builder_test.go
  • router/core/retry_expression_builder.go
🧬 Code graph analysis (2)
router/core/retry_expression_builder_test.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/core/retry_expression_builder.go (2)
  • BuildRetryFunction (20-73)
  • DefaultRetryExpression (13-13)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-27)
router/core/context.go (3)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
  • OperationTypeSubscription (490-490)
router/core/retry_expression_builder.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-27)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/core/retry_expression_builder.go (1)

19-25: Contract note: returning a noop func when disabled is fine; document it

Given the test was expecting nil, it’s worth documenting that the disabled path returns a non-nil noop to simplify downstream usage (no nil checks).

If you want, I can add a docstring clarifying this behavior.

Comment thread router/core/retry_expression_builder_test.go
Comment thread router/pkg/config/config.go
Comment thread router/core/retry_expression_builder.go Outdated
Comment thread router/core/retry_expression_builder.go Outdated
@SkArchon SkArchon requested a review from Noroth August 27, 2025 18:28
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
router/core/retry_expression_builder.go (1)

82-85: Comment vs behavior mismatch for EOFs; tiny tweak or clarify comment

Current code only matches “unexpected eof”, but the docs say “EOF errors are always retried.” If you want to keep the simple string approach (per team preference), consider also matching the common plain “eof” case; otherwise, narrow the comment.

Option A — broaden match (still string-based, minimal change):

-	errStr := strings.ToLower(err.Error())
-	// EOF errors are always retryable as they indicate connection issues
-	return strings.Contains(errStr, "unexpected eof")
+	errStr := strings.ToLower(err.Error())
+	// Treat common EOFs as retryable connection issues.
+	// Keep this simple per team guidance: match exact "eof" and "unexpected eof".
+	return errStr == "eof" || strings.Contains(errStr, "unexpected eof")

Option B — keep behavior, fix comment:

-	// EOF errors are always retryable as they indicate connection issues
+	// "unexpected eof" errors are always retryable as they indicate connection issues
🧹 Nitpick comments (3)
router/core/retry_expression_builder.go (3)

15-17: Type the noop func explicitly to the public alias

Minor clarity/maintainability win; future signature changes will fail fast at declaration.

-var noopRetryFunc = func(err error, req *http.Request, resp *http.Response) bool {
+var noopRetryFunc retrytransport.ShouldRetryFunc = func(err error, req *http.Request, resp *http.Response) bool {
 	return false
 }

41-46: Defensive nil-check for req to avoid a rare panic path

If an upstream caller ever passes a nil request, req.Context() will panic. Cheap guard.

-		reqContext := getRequestContext(req.Context())
-
-		if reqContext == nil {
-			return false
-		}
+		if req == nil {
+			return false
+		}
+		reqContext := getRequestContext(req.Context())
+		if reqContext == nil {
+			return false
+		}

48-50: Use EqualFold instead of ToLower for case-insensitive compare

Avoids allocs and reads clearer.

-		if strings.ToLower(reqContext.Operation().Type()) == "mutation" {
+		if strings.EqualFold(reqContext.Operation().Type(), "mutation") {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4a34c7e and e7d8ed0.

📒 Files selected for processing (3)
  • router/core/retry_expression_builder.go (1 hunks)
  • router/core/retry_expression_builder_test.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/config/config.go
  • router/core/retry_expression_builder_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/core/retry_expression_builder.go
🧬 Code graph analysis (1)
router/core/retry_expression_builder.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-27)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/core/retry_expression_builder.go (2)

19-26: Good early-return guard for disabled retries

Clear behavior when retries are off; bad expressions won’t block startup. LGTM.


60-69: Fail-closed on evaluation errors: confirm intent and log volume

Returning false on eval errors is a conservative choice; just confirm this is desired UX vs. “fail-open” for transient eval issues. Also, this log can be hot if an expression keeps failing—consider rate limiting or downgrading after the first hit.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
router/core/router.go (1)

1755-1776: Preserve backward compatibility: add a compat overload

The signature change breaks external callers. Add a deprecated 4-arg overload that forwards to the new one.

+// Deprecated: use WithSubgraphRetryOptions with algorithm/expression/hooks.
+func WithSubgraphRetryOptionsCompat(
+	enabled bool,
+	maxRetryCount int,
+	retryMaxDuration, retryInterval time.Duration,
+) Option {
+	return WithSubgraphRetryOptions(enabled, backoffJitter, maxRetryCount, retryMaxDuration, retryInterval, "", nil)
+}
🧹 Nitpick comments (2)
router/core/router.go (1)

1755-1776: Clarify that onRetryFunc is test-only

Document onRetryFunc as a test hook to avoid production misuse.

router/core/retry_builder_test.go (1)

440-472: Add a test for disabled retries with empty/invalid algorithm

Covers the new short-circuit behavior in ProcessRetryOptions to prevent regressions.

 func TestProcessRetryOptions(t *testing.T) {
+	t.Run("disabled retries tolerate empty algorithm", func(t *testing.T) {
+		opts := retrytransport.RetryOptions{
+			Enabled:   false,
+			Algorithm: "",
+		}
+		got, err := ProcessRetryOptions(opts)
+		assert.NoError(t, err)
+		assert.NotNil(t, got)
+		assert.NotNil(t, got.ShouldRetry)
+		assert.False(t, got.Enabled)
+	})
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e7d8ed0 and 9a2c9d7.

📒 Files selected for processing (12)
  • router-tests/error_handling_test.go (4 hunks)
  • router-tests/panic_test.go (2 hunks)
  • router-tests/retry_test.go (1 hunks)
  • router-tests/structured_logging_test.go (5 hunks)
  • router-tests/telemetry/telemetry_test.go (1 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/retry_builder.go (1 hunks)
  • router/core/retry_builder_test.go (1 hunks)
  • router/core/router.go (1 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/internal/retrytransport/retry_transport.go (3 hunks)
  • router/pkg/config/config.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/pkg/config/config.go
  • router-tests/retry_test.go
  • router-tests/error_handling_test.go
  • router/internal/retrytransport/retry_transport.go
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/core/retry_builder.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/core/supervisor_instance.go
  • router-tests/panic_test.go
  • router-tests/structured_logging_test.go
  • router-tests/telemetry/telemetry_test.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router-tests/structured_logging_test.go
  • router-tests/telemetry/telemetry_test.go
🧬 Code graph analysis (8)
router/core/retry_builder.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-28)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
router/core/retry_builder_test.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-28)
router/core/context.go (3)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
  • OperationTypeSubscription (490-490)
router/core/retry_builder.go (1)
  • ProcessRetryOptions (23-48)
router/core/supervisor_instance.go (1)
router/pkg/config/config.go (1)
  • BackoffJitterRetry (231-238)
router-tests/panic_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1755-1776)
router-tests/structured_logging_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1755-1776)
router-tests/telemetry/telemetry_test.go (1)
router/core/router.go (1)
  • WithSubgraphRetryOptions (1755-1776)
router/core/graph_server.go (2)
router/core/retry_builder.go (1)
  • ProcessRetryOptions (23-48)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-28)
router/core/router.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • OnRetryFunc (15-15)
  • RetryOptions (17-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_test
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (5)
router/core/supervisor_instance.go (1)

196-204: API update correctly wired: Algorithm and Expression are passed to WithSubgraphRetryOptions

The parameter order matches the updated signature (enabled, algorithm, maxAttempts, maxDuration, interval, expression, onRetry). Passing nil for onRetry is appropriate here.

router-tests/panic_test.go (1)

51-52: Tests updated to new 7-arg retry API — disabling retries explicitly

Using false, "", 0, 0, 0, "", nil cleanly disables retry in both subtests and aligns with the new signature.

Also applies to: 83-84

router-tests/telemetry/telemetry_test.go (1)

8709-8710: Ensure no calls to WithSubgraphRetryOptions use fewer than seven arguments
Run the following to list any 4–6-arg invocations and update them to the new 7-argument signature:

ast-grep --pattern $'WithSubgraphRetryOptions($_, $_, $_, $_)'  
ast-grep --pattern $'WithSubgraphRetryOptions($_, $_, $_, $_, $_)'  
ast-grep --pattern $'WithSubgraphRetryOptions($_, $_, $_, $_, $_, $_)'  
router/core/retry_builder.go (1)

113-116: EOF matching kept intentionally simple (ack)

String match for "unexpected eof" aligns with the team’s preference; no change requested.

router/core/graph_server.go (1)

1158-1165: Good: centralizing retry option processing and ensuring ShouldRetry is non-nil

Using ProcessRetryOptions here removes prior nil-callback pitfalls and centralizes validation.

Comment thread router-tests/structured_logging_test.go
Comment thread router/core/retry_builder.go
Comment thread router/core/retry_builder.go
Comment thread router/core/router.go
Comment thread router/core/supervisor_instance.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
router/core/retry_builder.go (1)

72-78: Guard against nil req and nil Operation; use constant + EqualFold for mutation check

Current code can panic if req is nil or if Operation() returns nil. Also prefer the typed constant and EqualFold.

Apply:

@@
-    reqContext := getRequestContext(req.Context())
+    if req == nil {
+        return false
+    }
+    reqContext := getRequestContext(req.Context())
@@
-    // Never retry mutations, regardless of expression result
-    if strings.ToLower(reqContext.Operation().Type()) == "mutation" {
+    // Never retry mutations, regardless of expression result
+    if op := reqContext.Operation(); op == nil {
+        return false
+    } else if strings.EqualFold(op.Type(), string(OperationTypeMutation)) {
         return false
     }

Also applies to: 79-82

🧹 Nitpick comments (4)
router/core/retry_builder.go (2)

60-63: De-duplicate default expression source

"defaultRetryExpression" duplicates config defaults. To avoid drift, centralize this constant (e.g., shared internal package) and reference it from both config and core.


114-117: “EOF errors are always retried”: include io.EOF and wrapped forms?

Implementation only matches “unexpected eof”. If the intent (and docs) is to retry all EOF-class errors, handle io.EOF and wrapped errors explicitly.

Option A (minimal, robust):

@@
-func isDefaultRetryableError(err error) bool {
+func isDefaultRetryableError(err error) bool {
     if err == nil {
         return false
     }
-    errStr := strings.ToLower(err.Error())
-    // EOF errors are always retryable as they indicate connection issues
-    return strings.Contains(errStr, "unexpected eof")
+    // EOF errors are always retryable as they indicate connection issues
+    if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
+        return true
+    }
+    // Keep the string fallback to catch non-wrapped variants
+    errStr := strings.ToLower(err.Error())
+    return strings.Contains(errStr, "unexpected eof")
 }

Add imports:

@@
-import (
+import (
+    "errors"
     "fmt"
+    "io"
     "net/http"
     "strings"

If the team intentionally limits to “unexpected eof” only (see learning), please confirm and update docs/tests to reflect that narrower scope.

router/core/retry_builder_test.go (2)

53-53: Add nil request and missing request-context guards

Ensure the retry func is resilient when req is nil or lacks a requestContext. This complements the guard fix in buildRetryFunction.

Append tests:

@@
 func TestBuildRetryFunction(t *testing.T) {
+    t.Run("graceful when request is nil", func(t *testing.T) {
+        fn, err := buildRetryFunction(retrytransport.RetryOptions{
+            Enabled:    true,
+            Expression: defaultRetryExpression,
+        })
+        assert.NoError(t, err)
+        assert.False(t, fn(nil, nil, &http.Response{StatusCode: 500}))
+    })
+
+    t.Run("graceful when request context is missing", func(t *testing.T) {
+        fn, err := buildRetryFunction(retrytransport.RetryOptions{
+            Enabled:    true,
+            Expression: defaultRetryExpression,
+        })
+        assert.NoError(t, err)
+        req, _ := http.NewRequest("POST", "http://example.com/graphql", nil) // no requestContext attached
+        assert.False(t, fn(nil, req, &http.Response{StatusCode: 500}))
+    })

440-481: Add coverage for empty algorithm when retries are enabled

ProcessRetryOptions currently errors when Enabled is true and Algorithm is empty. Add a test to lock this behavior (and improve error clarity if desired).

@@ func TestProcessRetryOptions(t *testing.T) {
     t.Run("process valid options", func(t *testing.T) {
         options := retrytransport.RetryOptions{
             Enabled:    true,
             Algorithm:  "backoff_jitter",
             Expression: "statusCode == 500 || IsTimeout() || IsConnectionError()",
         }
         response, err := ProcessRetryOptions(options)
         assert.NoError(t, err)
         assert.NotSame(t, &options, response)
     })
+
+    t.Run("process empty algorithm when enabled returns error", func(t *testing.T) {
+        _, err := ProcessRetryOptions(retrytransport.RetryOptions{
+            Enabled:    true,
+            Algorithm:  "",
+            Expression: "statusCode == 500",
+        })
+        assert.Error(t, err)
+        assert.ErrorContains(t, err, "unsupported retry algorithm")
+    })
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9a2c9d7 and f53e424.

📒 Files selected for processing (2)
  • router/core/retry_builder.go (1 hunks)
  • router/core/retry_builder_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-28T09:59:19.956Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.

Applied to files:

  • router/core/retry_builder.go
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/core/retry_builder.go
📚 Learning: 2025-08-28T09:59:10.337Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/router.go:1755-1776
Timestamp: 2025-08-28T09:59:10.337Z
Learning: In the router retry configuration, the algorithm parameter should not be defaulted when empty - users are expected to explicitly provide the algorithm value rather than relying on implicit defaults.

Applied to files:

  • router/core/retry_builder.go
🧬 Code graph analysis (2)
router/core/retry_builder_test.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-28)
router/core/context.go (3)
  • OperationTypeQuery (488-488)
  • OperationTypeMutation (489-489)
  • OperationTypeSubscription (490-490)
router/core/retry_builder.go (1)
  • ProcessRetryOptions (23-49)
router/core/retry_builder.go (4)
router/internal/expr/expr.go (2)
  • Request (65-74)
  • Context (34-38)
router/internal/retrytransport/retry_transport.go (2)
  • RetryOptions (17-28)
  • ShouldRetryFunc (14-14)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
router/internal/expr/retry_context.go (1)
  • LoadRetryContext (138-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/core/retry_builder.go (1)

23-49: ProcessRetryOptions: validation + copy semantics look solid

Algorithm validation respects the disabled flag, and returning a copy avoids mutating caller-owned options. No issues.

router/core/retry_builder_test.go (1)

53-438: Strong, comprehensive coverage

Good matrix across statuses, errors, expression shapes, and op types; compile-time error cases and option processing are covered.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
router/core/router.go (1)

1755-1762: Public API break: add a compat wrapper or a new -WithHooks function; document test-only hook

You expanded WithSubgraphRetryOptions; external callers using the 4-arg form will break. Either:

  • keep this 7-arg and add a deprecated 4-arg wrapper, or
  • rename this to WithSubgraphRetryOptionsWithHooks and reintroduce the old 4-arg WithSubgraphRetryOptions that forwards.

Also annotate OnRetry as test-only in the doc comment. This mirrors prior feedback.

Proposed shim (minimal, keeps current name; add below this function):

+// Deprecated: use WithSubgraphRetryOptionsWithHooks to pass algorithm, expression or test hooks.
+func WithSubgraphRetryOptions(
+	enabled bool,
+	maxRetryCount int,
+	retryMaxDuration, retryInterval time.Duration,
+) Option {
+	return WithSubgraphRetryOptionsWithHooks(
+		enabled,
+		backoffJitter, // explicit default to preserve prior behavior
+		maxRetryCount,
+		retryMaxDuration, retryInterval,
+		"",   // no expression
+		nil,  // no OnRetry hook
+	)
+}
+
+// WithSubgraphRetryOptionsWithHooks configures retry including algorithm, expression and test-only hooks.
+func WithSubgraphRetryOptionsWithHooks(
+	enabled bool,
+	algorithm string,
+	maxRetryCount int,
+	retryMaxDuration, retryInterval time.Duration,
+	expression string,
+	onRetryFunc retrytransport.OnRetryFunc,
+) Option {
+	return func(r *Router) {
+		r.retryOptions = retrytransport.RetryOptions{
+			Enabled:       enabled,
+			Algorithm:     algorithm,
+			MaxRetryCount: maxRetryCount,
+			MaxDuration:   retryMaxDuration,
+			Interval:      retryInterval,
+			Expression:    expression,
+			// Test case overrides
+			OnRetry: onRetryFunc,
+		}
+	}
+}

Additionally, add a brief doc comment above the live function noting OnRetry is intended for tests only.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f53e424 and 80c7cab.

⛔ Files ignored due to path filters (2)
  • router-tests/go.sum is excluded by !**/*.sum
  • router/go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • router-tests/go.mod (1 hunks)
  • router/core/graph_server.go (2 hunks)
  • router/core/router.go (1 hunks)
  • router/go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/go.mod
  • router-tests/go.mod
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-28T10:07:27.637Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router-tests/structured_logging_test.go:712-713
Timestamp: 2025-08-28T10:07:27.637Z
Learning: In the router retry configuration, when retries are disabled (enabled=false in WithSubgraphRetryOptions), the algorithm parameter should not be validated since it won't be used. Empty algorithm strings are acceptable when retries are disabled.

Applied to files:

  • router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
PR: wundergraph/cosmo#2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/core/router.go
📚 Learning: 2025-08-28T09:59:19.956Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/supervisor_instance.go:196-204
Timestamp: 2025-08-28T09:59:19.956Z
Learning: In router/pkg/config/config.go, the BackoffJitterRetry struct uses envDefault tags to provide default values for Algorithm ("backoff_jitter") and Expression ("IsRetryableStatusCode() || IsConnectionError() || IsTimeout()"), ensuring that even when YAML configs omit these fields, valid defaults are applied automatically, preventing ProcessRetryOptions from rejecting empty values.

Applied to files:

  • router/core/router.go
📚 Learning: 2025-08-28T09:59:10.337Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/router.go:1755-1776
Timestamp: 2025-08-28T09:59:10.337Z
Learning: In the router retry configuration, the algorithm parameter should not be defaulted when empty - users are expected to explicitly provide the algorithm value rather than relying on implicit defaults.

Applied to files:

  • router/core/router.go
🧬 Code graph analysis (2)
router/core/graph_server.go (2)
router/core/retry_builder.go (1)
  • ProcessRetryOptions (23-49)
router/internal/retrytransport/retry_transport.go (1)
  • RetryOptions (17-28)
router/core/router.go (1)
router/internal/retrytransport/retry_transport.go (2)
  • OnRetryFunc (15-15)
  • RetryOptions (17-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/core/router.go (1)

1764-1774: Intentional: no implicit defaulting of algorithm when empty

Leaving algorithm as-is (validated later) aligns with the design decision not to silently default the algorithm. Good.

router/core/graph_server.go (1)

1179-1185: Passing processed retry options looks correct

Using the processed options (with compiled expression) in TransportOptions is the right integration point.

Comment thread router/core/graph_server.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
router/internal/expr/retry_expression_test.go (3)

340-375: Add nil error case for IsTimeout.

Ensures IsTimeout() is false when no error is present.

       {
         name:     "operation timed out string not detected (no string matching)",
         err:      errors.New("operation timed out"),
         expected: false,
       },
+      {
+        name:     "nil error",
+        err:      nil,
+        expected: false,
+      },
       {
         name:     "connection errors not detected by IsTimeout",
         err:      syscall.ECONNREFUSED,
         expected: false,
       },

16-16: Speed up: mark top-level tests as parallel-safe.

These tests are read-only and independent. Enable t.Parallel() at top level for faster CI. Avoid subtest-level parallelism to keep the table loop simple.

 func TestRetryExpressionManager(t *testing.T) {
+  t.Parallel()
 func TestRetryExpressionManager_EmptyExpression(t *testing.T) {
+  t.Parallel()
 func TestLoadRetryContext(t *testing.T) {
+  t.Parallel()
 func TestRetryContext_SyscallErrorDetection(t *testing.T) {
+  t.Parallel()
 func TestRetryContext_ImprovedErrorDetection(t *testing.T) {
+  t.Parallel()
 func TestRetryExpressionManager_WithSyscallErrors(t *testing.T) {
+  t.Parallel()

Also applies to: 140-140, 146-146, 183-183, 297-297, 493-493


16-137: Optional: DRY up the expression-eval pattern with a small helper.

Reduces repetition and makes new cases trivial to add.

func eval(t *testing.T, expression string, ctx RetryContext) bool {
	t.Helper()
	m, err := NewRetryExpressionManager(expression)
	require.NoError(t, err)
	require.NotNil(t, m)
	got, err := m.ShouldRetry(ctx)
	require.NoError(t, err)
	return got
}

Use eval() inside TestRetryExpressionManager table loop.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 23ac555 and 8f5d283.

📒 Files selected for processing (2)
  • router/internal/expr/retry_context.go (1 hunks)
  • router/internal/expr/retry_expression_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/internal/expr/retry_context.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T17:30:41.212Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2167
File: router/core/retry_expression_builder.go:3-11
Timestamp: 2025-08-26T17:30:41.212Z
Learning: In router/core/retry_expression_builder.go, the team prefers to keep EOF error detection simple with string matching ("unexpected eof") rather than using errors.Is() for io.EOF and io.ErrUnexpectedEOF. They want to wait for customer feedback before making it more robust, since customers can modify retry expressions to handle specific cases.

Applied to files:

  • router/internal/expr/retry_expression_test.go
🧬 Code graph analysis (1)
router/internal/expr/retry_expression_test.go (2)
router/internal/expr/retry_context.go (2)
  • RetryContext (13-19)
  • LoadRetryContext (132-146)
router/internal/expr/retry_expression.go (1)
  • NewRetryExpressionManager (17-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan
  • GitHub Check: build_test
🔇 Additional comments (3)
router/internal/expr/retry_expression_test.go (3)

55-65: Good coverage of HTTP read timeout semantics and selective string matching.

Nice job distinguishing the specific “timeout awaiting response headers” case (true) from generic timeout strings (false), which matches the intended stricter matching.

Also applies to: 366-389


183-295: Solid syscall surface coverage and expression integration.

Tests for ECONNREFUSED/RESET/ETIMEDOUT across direct, wrapped, and net.OpError are comprehensive; the expression integration tests validate end-to-end behavior.

Also applies to: 493-549


92-97: No change needed: 429 is intentionally not retryable by default
The default IsRetryableStatusCode implementation deliberately excludes HTTP 429 (Too Many Requests) in the Balanced expression; the existing test expecting false is correct and should remain unchanged.

Likely an incorrect or invalid review comment.

Comment thread router/internal/expr/retry_expression_test.go
Comment thread router/internal/expr/retry_expression_test.go
@SkArchon SkArchon merged commit c1236f5 into main Aug 28, 2025
37 of 38 checks passed
@SkArchon SkArchon deleted the dustin/eng-7974-allow-to-configure-retry-condition branch August 28, 2025 11:57
@Noroth Noroth mentioned this pull request Sep 30, 2025
5 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Oct 13, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants