Skip to content

fix: jwt validation blocks on multiple requests#2216

Merged
SkArchon merged 18 commits intomainfrom
milinda/eng-8149-jwk-refreshunknownkid-causing-deadlocks
Sep 23, 2025
Merged

fix: jwt validation blocks on multiple requests#2216
SkArchon merged 18 commits intomainfrom
milinda/eng-8149-jwk-refreshunknownkid-causing-deadlocks

Conversation

@SkArchon
Copy link
Copy Markdown
Contributor

@SkArchon SkArchon commented Sep 15, 2025

We right now have the following line in our JWT configuration

RefreshUnknownKID: rate.NewLimiter(rate.Every(5*time.Minute), 1),

This means whenever a valid formed token (does not mean it is valid for auth) is sent to the router, it will try to find the correct JWKs entry based on the key id, but if it cannot find the key id it will on demand re-query the JWKs endpoint OUTSIDE of the refresh cycle within that request, hoping that the key id has been added to the JWKs. However to prevent misuse of this feature, there is a rate limit, so that if multiple requests are sent with an unknown keyid, we don't call the JWKs endpoint asap. However this blocks each request for 5 minutes cumulatively from my testing (because the burst is 1).

Cumulatively meaning,
if we send 6 requests at once

* 1st response is returned immediately
* 2nd response is returned in 5 * 1 minutes
* 3rd response is returned in 5 * 2 minutes
* 4th response is returned in 5 * 3 minutes
* 5th response is returned in 5 * 4 minutes
* 6th response is returned in 5 * 5 minutes

What happens now.

  • We refresh the jwks from the jwks endpoint per defined refresh cycle (1m, etc)
  • If the keyid is not present in the jwks, after the first request, subsequent requests will be blocked for 5 mins minimum
  • If by chance the keyid has been added to the jwks outside of the refresh cycle, setting RefreshUnknownKID would help

What happens with this PR

  • We still refresh the jwks in the refresh cycle
  • User can configure refreshing unknown kids, the following is an example with the defaults
authentication:
  jwt:
    jwks:
      - refresh_interval: 10m
        url: "https://www.googleapis.com/oauth2/v3/certs"
        refresh_unknown_kid:
          enabled: false # These are defaults
          max_wait: 2m 
          interval: 30s
          burst: 2

Also note on how max wait behaves with 2m and 30 second interval (0.5m = 30s) is the following

* 1st response is returned immediately
* 2nd response is returned in 0.5 * 1 minutes
* 3rd response is returned in 0.5 * 2 minutes
* 4th response is returned in 0.5 * 3 minutes
* 5th response is returned in 0.5 * 4 minutes
* 6th response is returned in 0.5 * 5 minutes, exceeds max_wait of 2 minutes, returns immediately
* 7th response is returned in 0.5 * 7 minutes, exceeds max_wait of 2 minutes, returns immediately

Summary by CodeRabbit

  • New Features

    • Configurable rate limiting for JWKS refreshes on unknown JWT KIDs (enable, interval, burst, max-wait).
  • Configuration

    • Added refresh_unknown_kid block to JWT JWKS settings in schema and examples.
  • Public API

    • JWKS configuration now exposes allowed signing algorithms and unknown-KID refresh settings.
    • Test signing helper signature updated to accept an additional flag.
  • Tests

    • Expanded tests for unknown-KID behavior, timing, bursts, max-wait, and concurrent scenarios.
    • Added test helper to run asynchronous assertions with timeouts.

Checklist

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 15, 2025

Walkthrough

Adds configurable rate limiting for unknown JWT KID refreshes, extends JWKSConfig with AllowedAlgorithms and RefreshUnknownKID, wires config through supervisor and config schema, and updates tests and JWKS test server API (TokenForKID gains a boolean flag) to cover new behaviors including algorithm filtering and burst/wait handling.

Changes

Cohort / File(s) Summary of changes
Tests and helpers
router-tests/authentication_test.go, router-tests/header_set_test.go, router-tests/jwks/jwks.go, router-tests/testenv/utils.go
Expanded authentication tests for unknown KID burst/wait logic and algorithm filtering; updated calls to TokenForKID(..., bool); added toJWKSConfig(..., allowedAlgorithms...); modified JWKS test server TokenForKID to accept a third boolean to enable signing with an invalid KID; added AwaitFunc test helper.
JWKS token decoder
router/pkg/authentication/jwks_token_decoder.go
Added exported RefreshUnknownKIDConfig and RefreshUnknownKID field on JWKSConfig; conditional rate-limiter creation using Interval/Burst and MaxWait when enabled; imports updated to include rate limiter.
Core wiring
router/core/supervisor_instance.go
Populated authentication.JWKSConfig.RefreshUnknownKID from runtime config when constructing token decoders.
Config types
router/pkg/config/config.go
Added public RefreshUnknownKID type and added JWKSConfiguration.RefreshUnknownKID field with YAML/env tags.
Config schema and samples
router/pkg/config/config.schema.json, router/pkg/config/testdata/config_full.json, router/pkg/config/fixtures/full.yaml
Added refresh_unknown_kid schema block (enabled, max_wait, interval, burst); updated full JSON and YAML fixtures with RefreshUnknownKID entries.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the PR's primary goal: fixing JWT validation stalls that occur when concurrent requests trigger on-demand JWKS refreshes; the implemented configurable RefreshUnknownKID behavior and rate-limit max-wait directly address this blocking issue.
✨ Finishing touches
  • 📝 Generate Docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Sep 15, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-4f86597259c9fb2b9a5eacc7cf781cbe9bc51c41-nonroot

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
router-tests/testenv/testenv.go (1)

2535-2554: Timeout formatting + parent context; consider cancelability to avoid goroutine leaks

  • Format specifier: time.Duration with %d prints raw ns. Prefer human‑readable.
  • Parent context: tie the timeout to the environment’s context.
  • Optional: testFunction never observes cancellation; on timeout it keeps running in a goroutine. Expose a ctx to it or document the risk.

Apply this minimal diff:

-func (e *Environment) WaitForTest(timeout time.Duration, testFunction func()) {
+func (e *Environment) WaitForTest(timeout time.Duration, testFunction func()) {
 	e.t.Helper()

-	ctx, cancel := context.WithTimeout(context.Background(), timeout)
+	ctx, cancel := context.WithTimeout(e.Context, timeout)
 	defer cancel()
@@
 	case <-ctx.Done():
-		e.t.Fatalf("test timed out, want %d", timeout)
-		return
+		e.t.Fatalf("test timed out after %s", timeout)
 	}
 }

Optional follow-up (nice-to-have): add a ctx variant and forward it into the closure to enable cooperative cancellation.

// wrapper keeps current signature for call sites
func (e *Environment) WaitForTestCtx(timeout time.Duration, fn func(ctx context.Context)) {
  ctx, cancel := context.WithTimeout(e.Context, timeout)
  defer cancel()
  done := make(chan struct{})
  go func() { defer close(done); fn(ctx) }()
  select { case <-done: return; case <-ctx.Done(): e.t.Fatalf("test timed out after %s", timeout) }
}
router-tests/jwks/jwks.go (1)

53-69: Stabilize provider selection when useInvalidKid=true

Current selection iterates over a map, which is randomized in Go. If future tests depend on the specific algorithm chosen for the “invalid kid” token, this may cause flakiness. Pick a deterministic provider (e.g., smallest KID) and guard the empty-map case.

 func (s *Server) TokenForKID(kid string, claims map[string]any, useInvalidKid bool) (string, error) {
 	provider, ok := s.providers[kid]
 	if !useInvalidKid && !ok {
 		return "", jwt.ErrInvalidKey
 	} else if useInvalidKid {
-		// If we don't care about the kid we don't care about the provider
-		// we just get the first random provider provided
-		for _, pr := range s.providers {
-			provider = pr
-			break
-		}
+		if len(s.providers) == 0 {
+			return "", jwt.ErrInvalidKey
+		}
+		// Choose deterministically
+		keys := make([]string, 0, len(s.providers))
+		for k := range s.providers { keys = append(keys, k) }
+		sort.Strings(keys)
+		provider = s.providers[keys[0]]
 	}

(Remember to import "sort".)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6267f8 and ba7534f.

📒 Files selected for processing (5)
  • router-tests/authentication_test.go (5 hunks)
  • router-tests/header_set_test.go (1 hunks)
  • router-tests/jwks/jwks.go (1 hunks)
  • router-tests/testenv/testenv.go (1 hunks)
  • router/pkg/authentication/jwks_token_decoder.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router-tests/authentication_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • Config (286-342)
  • Environment (1729-1765)
router-tests/utils.go (1)
  • ConfigureAuth (45-57)
router/core/router.go (2)
  • Option (172-172)
  • WithAccessController (1815-1819)
router/core/access_controller.go (1)
  • NewAccessController (23-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: Analyze (go)
🔇 Additional comments (7)
router/pkg/authentication/jwks_token_decoder.go (1)

92-99: LGTM: removing on-demand refresh on unknown KID to prevent request blocking

The HTTP client now only uses scheduled refresh via jwkset storage. This should avoid stampedes/deadlocks on invalid tokens. One behavioral change to acknowledge: tokens signed with a newly rotated KID may fail until the next RefreshInterval elapses. Make sure this is documented and the interval is acceptable for your deployments.

Would you like a brief doc blurb added to JWKSConfig explaining the trade-off and guidance on RefreshInterval?

router-tests/header_set_test.go (1)

278-279: LGTM: updated TokenForKID call matches new API

Explicit false reads clearly. No concerns.

router-tests/authentication_test.go (5)

1222-1224: LGTM: API migration to 3-arg TokenForKID

Matches new helper semantics; clear intent with false.


1347-1351: LGTM: API migration to 3-arg TokenForKID in OIDC tests

Usage is correct.


1461-1465: LGTM: API migration to 3-arg TokenForKID in multiple-keys tests

No issues.


1645-1647: LGTM: API migration to 3-arg TokenForKID in SupportedAlgorithms

Works as intended.


831-867: Use a classic for-loop for Go <1.22 and close res.Body explicitly per iteration

Verification failed — go.mod not found when running the check; confirm the project’s Go version. If the toolchain is <1.22, apply this portable loop + explicit close diff:

-				for range 5 {
+				for i := 0; i < 5; i++ {
 					func() {
@@
-						defer func() {
-							_ = res.Body.Close()
-						}()
+						// Close explicitly at end of iteration to avoid stacking defers.
@@
-						data, err := io.ReadAll(res.Body)
+						data, err := io.ReadAll(res.Body)
 						require.NoError(t, err)
 						require.JSONEq(t, unauthorizedExpectedData, string(data))
+						_ = res.Body.Close()
 					}()
 				}

Comment thread router-tests/jwks/jwks.go Outdated
Comment thread router-tests/jwks/jwks.go Outdated
Comment thread router-tests/testenv/testenv.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
router-tests/authentication_test.go (5)

831-872: Good regression test to prevent request‑path blocking; consider a slightly larger timeout.

To reduce flakiness on slow CI, bump maxDuration from 5s to 10s. Optional.

-            maxDuration := 5 * time.Second
+            maxDuration := 10 * time.Second

1227-1227: Boolean arg is unclear at call site.

TokenForKID(..., false) lacks intent. Consider an options pattern or a tiny helper wrapper in tests to document the flag.


1352-1352: Boolean arg is unclear at call site.

Same nit: replace the bare false with a named option/helper for readability in tests.


1466-1466: Boolean arg is unclear at call site.

Prefer a named option/helper over a raw false.


1650-1650: Boolean arg is unclear at call site.

Use a helper/option to avoid “magic boolean” literals in tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba7534f and 79adec8.

📒 Files selected for processing (3)
  • router-tests/authentication_test.go (5 hunks)
  • router-tests/jwks/jwks.go (1 hunks)
  • router/go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/jwks/jwks.go
🧰 Additional context used
🧬 Code graph analysis (1)
router-tests/authentication_test.go (5)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • Config (286-342)
  • Environment (1729-1765)
router-tests/utils.go (1)
  • ConfigureAuth (45-57)
router/core/router.go (2)
  • Option (172-172)
  • WithAccessController (1815-1819)
router/core/access_controller.go (1)
  • NewAccessController (23-28)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/go.mod (1)

168-168: x/time moved to indirect — no lingering direct usage found.

Search found only redis_rate.NewLimiter in router/core/ratelimiter.go (NewCosmoRateLimiter); no golang.org/x/time/rate imports or RefreshUnknownKID references detected.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
router-tests/jwks/jwks.go (1)

58-63: Fix nondeterministic provider selection (map iteration order is random).

Randomly picking a provider will cause flaky tests and unpredictable alg selection when multiple providers exist.

Apply this diff for deterministic behavior (pick the sole provider, otherwise the lexicographically smallest KID):

-        // If we don't care about the kid we don't care about the provider
-        // we just get the first random provider provided
-        for _, pr := range s.providers {
-            provider = pr
-            break
-        }
+        // Deterministic selection to avoid flakiness
+        if len(s.providers) == 1 {
+            for _, pr := range s.providers { provider = pr }
+        } else {
+            keys := make([]string, 0, len(s.providers))
+            for k := range s.providers { keys = append(keys, k) }
+            sort.Strings(keys)
+            provider = s.providers[keys[0]]
+        }

Add this import as well:

 import (
   "context"
   "encoding/json"
   "fmt"
   "log"
   "net"
   "net/http"
   "net/http/httptest"
   "testing"
   "time"
+  "sort"
🧹 Nitpick comments (2)
router-tests/jwks/jwks.go (2)

66-69: Guard against nil provider to avoid panic.

If no providers are configured (or selection fails), calling methods on a nil Crypto will panic.

- token := jwt.NewWithClaims(provider.SigningMethod(), jwt.MapClaims(claims))
+ if provider == nil {
+   return "", jwt.ErrInvalidKey
+ }
+ token := jwt.NewWithClaims(provider.SigningMethod(), jwt.MapClaims(claims))

53-53: Clarify API intent: avoid boolean flag parameter.

Boolean flags in public APIs reduce readability. Consider adding a clear wrapper and using it in tests.

Add alongside the existing method:

// TokenForUnknownKID creates a token whose header KID is not present in the JWKS.
func (s *Server) TokenForUnknownKID(kid string, claims map[string]any) (string, error) {
	return s.TokenForKID(kid, claims, true)
}

Optionally deprecate the boolean form in tests over time.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba7534f and 3166aa8.

📒 Files selected for processing (3)
  • router-tests/authentication_test.go (5 hunks)
  • router-tests/jwks/jwks.go (1 hunks)
  • router/go.mod (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • router-tests/authentication_test.go
🔇 Additional comments (1)
router/go.mod (1)

168-168: LGTM — golang.org/x/time is indirect; no direct imports found.

Scanned repo: no occurrences of golang.org/x/time or golang.org/x/time/rate; go.mod shows it only as indirect (router/go.mod:168, router-tests/go.mod:178, demo/go.mod:174). router/core/ratelimiter.go:36 calls redis_rate.NewLimiter (transitive dependency).

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
router-tests/authentication_test.go (1)

846-865: Stronger signal: issue the 5 calls concurrently

The regression involves blocking under concurrent load. Consider running the 5 requests in parallel (WaitGroup + per-call assertions) within the same AwaitFunc to better reflect the real failure mode. Keep the existing sequential check if you prefer both.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3166aa8 and eaf3c13.

📒 Files selected for processing (2)
  • router-tests/authentication_test.go (5 hunks)
  • router-tests/testenv/utils.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router-tests/authentication_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • Config (286-342)
  • Environment (1729-1765)
router-tests/utils.go (1)
  • ConfigureAuth (45-57)
router/core/access_controller.go (1)
  • NewAccessController (23-28)
router-tests/testenv/utils.go (1)
  • AwaitFunc (32-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router-tests/authentication_test.go (2)

1223-1223: LGTM: updated TokenForKID signature usage

Calls correctly pass the new boolean parameter (false) to preserve prior behavior.

Also applies to: 1348-1348, 1462-1462, 1645-1645


841-843: No action required — 'wg_static_kid' not present in JWKS fixtures.

rg over router-tests/jwks and router-tests returned the KID only in router-tests/authentication_test.go (lines 839–843); no matches in JWKS fixtures, so the test KID is effectively unknown to the JWKS servers.

Comment thread router-tests/authentication_test.go Outdated
Comment thread router-tests/testenv/utils.go
Comment thread router-tests/jwks/jwks.go Outdated
@StarpTech
Copy link
Copy Markdown
Contributor

Great PR description! I recommend making this configurable. I agree that, by default, we should be pessimistic and abort requests where the KID can't be found. That means only the refreshInterval is activated. Additionally, a user can define enabling refresh_unknown_kid, including a rate limit (disabled by default), if their case requires them to fetch them proactively. This could make sense to avoid avalanche effects in big router deployments.

Example

authentication:
  jwt:
    jwks:
      - refresh_interval: 10m
        url: "https://www.googleapis.com/oauth2/v3/certs"
        refresh_unknown_kid:
          enabled: true # Disabled by default
          max_wait: 10s # RateLimitWaitMax (Something reasonable like 10s)
          interval: 1m # rate.Every (Something reasonable like 1m)
          burst: 10 # burst of rate.Every (Something reasonable like 2)

RateLimitWaitMax is an option of jwkset.HTTPClientOptions

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (7)
router/pkg/config/testdata/config_full.json (1)

483-487: Example config uses zero interval/burst; prefer realistic, schema‑compliant values.

To avoid copy/paste of ineffective settings, set non‑zero examples matching the schema defaults (interval ≥ 1s, burst ≥ 1). Example:

           "RefreshUnknownKID": {
-            "Enabled": false,
-            "Interval": 0,
-            "Burst": 0
+            "Enabled": false,
+            "Interval": 60000000000,
+            "Burst": 2
           },

Ensure this file isn’t used by schema validation; if it is, zero values will violate the new constraints.

Also applies to: 500-504, 514-518

router/core/supervisor_instance.go (1)

268-272: Add defensive guards for bad limiter settings.

If users set enabled=true with interval ≤ 0 or burst < 1, the limiter can behave unexpectedly. Consider clamping or disabling with a warning before passing to the decoder.

router/pkg/config/config.go (1)

484-488: Optional: allow env overrides for nested fields.

If you want environment control, add an env prefix on the nested block.

-type RefreshUnknownKIDConfig struct {
+type RefreshUnknownKIDConfig struct {
   Enabled  bool          `yaml:"enabled" envDefault:"false"`
   Interval time.Duration `yaml:"interval" envDefault:"1m"`
   Burst    int           `yaml:"burst" envDefault:"2"`
 }

And (optional) on the parent field:

-RefreshUnknownKID RefreshUnknownKIDConfig `yaml:"refresh_unknown_kid"`
+RefreshUnknownKID RefreshUnknownKIDConfig `yaml:"refresh_unknown_kid" envPrefix:"REFRESH_UNKNOWN_KID_"`
router/pkg/config/config.schema.json (1)

1719-1745: Schema block looks good; suggest adding max_wait and disallowing for secret mode.

  • Add max_wait to cap per‑request wait on the limiter (matches jwkset HTTPClientOptions RateLimitWaitMax).
  • Optionally forbid refresh_unknown_kid on secret-based JWK entries to avoid confusion.
           "refresh_unknown_kid": {
             "type": "object",
             "description": "Controls rate-limited refresh behavior when a JWT KID is unknown.",
             "additionalProperties": false,
             "properties": {
               "enabled": {
                 "type": "boolean",
                 "description": "Enable refresh attempts on unknown KID.",
                 "default": false
               },
               "interval": {
                 "type": "string",
                 "description": "Token refill interval for the rate limiter.",
                 "default": "1m",
                 "duration": { "minimum": "1s" }
               },
+              "max_wait": {
+                "type": "string",
+                "description": "Maximum time a request may wait for a refresh slot before skipping the refresh.",
+                "default": "10s",
+                "duration": { "minimum": "0s" }
+              },
               "burst": {
                 "type": "integer",
                 "description": "Burst size for the rate limiter.",
                 "default": 2,
                 "minimum": 1
               }
             }
           }

And in the secret-based branch of the oneOf add:

-  "not": { "anyOf": [
+  "not": { "anyOf": [
       { "required": ["url"] },
       { "required": ["algorithms"] },
-      { "required": ["refresh_interval"] }
+      { "required": ["refresh_interval"] },
+      { "required": ["refresh_unknown_kid"] }
   ] }
router/pkg/authentication/jwks_token_decoder.go (1)

108-114: Optional: wire a max wait to prevent long per‑request blocks.

If you add refresh_unknown_kid.max_wait to config, pass it through to the HTTP client options.

-            jwksetHTTPClientOptions := jwkset.HTTPClientOptions{
+            jwksetHTTPClientOptions := jwkset.HTTPClientOptions{
                 HTTPURLs: map[string]jwkset.Storage{ c.URL: store },
                 PrioritizeHTTP:    true,
                 RefreshUnknownKID: refreshLimiter,
+                // RateLimitWaitMax caps per-request wait when refresh is enabled.
+                // Set when the config grows a corresponding field.
+                // RateLimitWaitMax: c.RefreshUnknownKID.MaxWait,
             }
router-tests/authentication_test.go (2)

59-107: Stabilize timing assertion to avoid CI flakiness

With Interval=1s and Burst=1, the second request should wait ~1s. A 600ms lower bound is brittle on slower runners. Bump to ~900ms (or similar) to reduce flakes.

-            require.True(t, elapsed >= 600*time.Millisecond)
+            require.True(t, elapsed >= 900*time.Millisecond, "expected refresh limiter to block ~1s; got %v", elapsed)

109-158: Relax non‑blocking threshold; 100ms can be too tight under load

Even when not blocked, end‑to‑end request times can exceed 100ms in CI. Loosen the bound a bit to avoid false failures.

-            require.True(t, elapsed < 100*time.Millisecond)
+            require.True(t, elapsed < 250*time.Millisecond, "should not block when within burst; elapsed=%v", elapsed)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dac1b99 and a04b953.

📒 Files selected for processing (6)
  • router-tests/authentication_test.go (5 hunks)
  • router/core/supervisor_instance.go (1 hunks)
  • router/pkg/authentication/jwks_token_decoder.go (3 hunks)
  • router/pkg/config/config.go (2 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2224
File: router/pkg/authentication/jwks_token_decoder.go:176-199
Timestamp: 2025-09-17T18:13:31.021Z
Learning: In router/pkg/authentication/jwks_token_decoder.go, when AllowEmptyAlgorithm is true, the allowedAlgorithms slice is populated by getSupportedAlgorithms() from validation_store.go, ensuring it's never empty and preventing unintended token rejections.
Learnt from: SkArchon
PR: wundergraph/cosmo#2224
File: router/pkg/authentication/jwks_token_decoder.go:176-199
Timestamp: 2025-09-17T18:13:31.021Z
Learning: In router/pkg/authentication/jwks_token_decoder.go, when AllowEmptyAlgorithm is true, the allowedAlgorithms slice is populated by getSupportedAlgorithms() from validation_store.go, ensuring it's never empty and preventing unintended token rejections. The validation store maintains its own internal supported algorithms map that gets returned regardless of input parameters.
📚 Learning: 2025-09-17T13:51:50.823Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2216
File: router-tests/authentication_test.go:847-847
Timestamp: 2025-09-17T13:51:50.823Z
Learning: Go 1.22 introduced the ability to range over integers, making `for range n` syntax valid for iterating n times in loops.

Applied to files:

  • router-tests/authentication_test.go
🧬 Code graph analysis (4)
router/core/supervisor_instance.go (2)
router/pkg/authentication/jwks_token_decoder.go (1)
  • RefreshUnknownKIDConfig (56-60)
router/pkg/config/config.go (1)
  • RefreshUnknownKIDConfig (484-488)
router/pkg/config/config.go (1)
router/pkg/authentication/jwks_token_decoder.go (1)
  • RefreshUnknownKIDConfig (56-60)
router/pkg/authentication/jwks_token_decoder.go (1)
router/pkg/config/config.go (1)
  • RefreshUnknownKIDConfig (484-488)
router-tests/authentication_test.go (6)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • Config (286-342)
  • Environment (1729-1765)
router-tests/jwks/jwks.go (1)
  • NewServer (174-180)
router-tests/utils.go (1)
  • ConfigureAuthWithJwksConfig (59-70)
router/pkg/authentication/jwks_token_decoder.go (2)
  • JWKSConfig (42-54)
  • RefreshUnknownKIDConfig (56-60)
router/core/access_controller.go (1)
  • NewAccessController (23-28)
router-tests/testenv/utils.go (1)
  • AwaitFunc (32-42)
🔇 Additional comments (6)
router/core/supervisor_instance.go (1)

268-272: Wiring LGTM, but will be a no‑op until config decoding maps the nested block correctly.

Once JWKSConfiguration uses yaml:"refresh_unknown_kid" (not inline), these fields will populate as expected.

router-tests/authentication_test.go (5)

1330-1331: LGTM: updated TokenForKID signature use

Explicit third-arg usage looks correct here.


1455-1456: LGTM: updated TokenForKID signature use

Callsite reflects the new API and intent.


1569-1570: LGTM: updated TokenForKID signature use

Consistent with other callsites.


1753-1754: LGTM: updated TokenForKID signature use

Matches the revised API in supported algorithms tests.


160-202: No change needed — repository uses Go 1.25, so for range 5 is supported.
All go.mod files declare go 1.25, satisfying the Go 1.22+ requirement.

Comment thread router/pkg/authentication/jwks_token_decoder.go Outdated
Comment thread router/pkg/config/config.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/pkg/config/config.go (1)

484-490: Consider adding validation constraints for rate limiting parameters.

The RefreshUnknownKID struct looks well-structured with sensible defaults. However, consider adding validation to ensure that the rate limiting parameters make sense together (e.g., Burst should be positive when Enabled is true, MaxWait should be reasonable, etc.).

Based on the search results, I can see that Go provides several approaches for struct validation, with the most popular being the go-playground/validator package. However, looking at the current codebase structure and the need to keep things simple, I'd recommend adding basic validation logic rather than introducing a new dependency.

 type RefreshUnknownKID struct {
-	Enabled  bool          `yaml:"enabled" envDefault:"false"`
-	MaxWait  time.Duration `yaml:"max_wait" envDefault:"2m"`
-	Interval time.Duration `yaml:"interval" envDefault:"30s"`
-	Burst    int           `yaml:"burst" envDefault:"2"`
+	Enabled  bool          `yaml:"enabled" envDefault:"false"`
+	MaxWait  time.Duration `yaml:"max_wait" envDefault:"2m" validate:"min=1s,max=10m"`
+	Interval time.Duration `yaml:"interval" envDefault:"30s" validate:"min=1s,max=5m"`
+	Burst    int           `yaml:"burst" envDefault:"2" validate:"min=1,max=100"`
+}
+
+// Validate checks that the RefreshUnknownKID configuration is valid
+func (r *RefreshUnknownKID) Validate() error {
+	if r.Enabled {
+		if r.Burst < 1 {
+			return fmt.Errorf("burst must be at least 1 when refresh_unknown_kid is enabled")
+		}
+		if r.MaxWait < time.Second {
+			return fmt.Errorf("max_wait must be at least 1 second")
+		}
+		if r.Interval < time.Second {
+			return fmt.Errorf("interval must be at least 1 second")
+		}
+		if r.MaxWait < r.Interval {
+			return fmt.Errorf("max_wait (%v) should be greater than or equal to interval (%v)", r.MaxWait, r.Interval)
+		}
+	}
+	return nil
 }

Or add this validation method to be called during configuration loading:

// Add to the Config struct or call during LoadConfig
func (c *Config) validateJWKSConfiguration() error {
	for i, jwks := range c.Authentication.JWT.JWKS {
		if err := jwks.RefreshUnknownKID.Validate(); err != nil {
			return fmt.Errorf("invalid JWKS[%d] refresh_unknown_kid configuration: %w", i, err)
		}
	}
	return nil
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd38a25 and 24025e4.

📒 Files selected for processing (2)
  • router-tests/authentication_test.go (6 hunks)
  • router/pkg/config/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-17T13:51:50.823Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2216
File: router-tests/authentication_test.go:847-847
Timestamp: 2025-09-17T13:51:50.823Z
Learning: Go 1.22 introduced the ability to range over integers, making `for range n` syntax valid for iterating n times in loops.

Applied to files:

  • router-tests/authentication_test.go
🧬 Code graph analysis (1)
router-tests/authentication_test.go (5)
router-tests/testenv/testenv.go (3)
  • Run (107-124)
  • Config (286-342)
  • Environment (1729-1765)
router-tests/jwks/jwks.go (1)
  • NewServer (174-180)
router-tests/utils.go (1)
  • ConfigureAuthWithJwksConfig (59-70)
router/pkg/authentication/jwks_token_decoder.go (2)
  • JWKSConfig (42-54)
  • RefreshUnknownKIDConfig (56-61)
router-tests/testenv/utils.go (1)
  • AwaitFunc (32-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (12)
router/pkg/config/config.go (1)

468-473: LGTM! Configuration structure properly implements the schema.

The RefreshUnknownKID field is now properly configured with the correct YAML tag yaml:"refresh_unknown_kid" which matches the expected nested structure in the configuration schema. The field placement within JWKSConfiguration is appropriate and the integration looks good.

router-tests/authentication_test.go (11)

11-12: LGTM: Appropriate imports for concurrency testing

The addition of sync and sync/atomic imports is correct for the new parallel test cases that exercise rate limiting behaviors with concurrent requests.


61-109: Thorough test coverage for burst limit exceeded scenario

This test effectively validates the rate limiting behavior when the burst limit is exceeded. The test correctly:

  • Configures a restrictive burst limit of 1
  • Makes two sequential requests with unknown KIDs
  • Verifies the second request is delayed by at least 700ms (matching the 1-second interval minus some buffer)

111-160: Validates proper behavior when burst is within limits

This test correctly demonstrates that when enough time passes between requests (allowing the rate limiter to regenerate tokens), subsequent requests don't experience blocking delays. The 1200ms sleep appropriately exceeds the 1-second interval.


164-213: Excellent test for max wait boundary conditions

This test validates an important edge case where the rate limiter can predict that waiting wouldn't help (when required wait time exceeds MaxWait). The configuration is well-designed with MaxWait (700ms) being less than the interval (1s), ensuring immediate return.


215-264: Proper validation of wait behavior when MaxWait allows it

This test complements the previous one by showing the opposite scenario where MaxWait is larger than the interval, allowing the request to actually wait and succeed. The timing assertions are appropriate.


268-342: Sophisticated concurrent testing with atomic counters

This test demonstrates advanced testing techniques:

  • Uses goroutines to simulate concurrent requests
  • Employs atomic counters to track fast responses (< 100ms)
  • Validates that exactly one request returns quickly while others wait
  • The logic correctly accounts for the accumulated wait time scenario

344-386: Comprehensive test for disabled refresh behavior

This test ensures that when RefreshUnknownKID is disabled (the default), requests don't block and return immediately with unauthorized status. The use of testenv.AwaitFunc with multiple iterations provides good coverage.


369-369: Consider consistent loop syntax usage

Based on the retrieved learning, the for range 5 syntax is valid in Go 1.22+. However, for consistency across the codebase, consider whether all similar loops should use the same style.


1514-1514: Consistent usage of updated TokenForKID API

All calls to TokenForKID consistently use false as the third parameter, which appears to align with the test requirements. The usage pattern is consistent throughout the file.

Also applies to: 1639-1639, 1753-1753, 1936-1936


2836-2842: Helper function properly supports algorithm constraints

The toJWKSConfig helper function correctly accepts variadic allowedAlgorithms parameter and maps it to the AllowedAlgorithms field in the JWKSConfig struct, enabling algorithm filtering in tests.


85-85: No action required — the third parameter is intentional and correct

TokenForKID is defined at router-tests/jwks/jwks.go:53 as func (s *Server) TokenForKID(kid string, claims map[string]any, useInvalidKID bool) ...; when useInvalidKID==true it selects any provider to sign the token while still setting token.Header[jwkset.HeaderKID]=kid, so calls like TokenForKID("unknown_kid", nil, true) intentionally produce a token signed by a valid key with an unknown KID header.

Copy link
Copy Markdown
Contributor

@StarpTech StarpTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SkArchon SkArchon dismissed endigma’s stale review September 23, 2025 07:58

comments been met

@SkArchon SkArchon merged commit 2d5a02a into main Sep 23, 2025
37 of 38 checks passed
@SkArchon SkArchon deleted the milinda/eng-8149-jwk-refreshunknownkid-causing-deadlocks branch September 23, 2025 07:58
@Noroth Noroth mentioned this pull request Sep 30, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants