feat(router): enable fail open on rate limit redis availability failure #1659
Conversation
linter has lots of opinions
This PR was marked stale due to lack of activity. It will be closed in 14 days.
jensneuse left a comment
I have made a few comments we should clarify.
In addition, we need to add tests before we can merge this.
```diff
 }
 key, err := c.generateKey(ctx)
-if err != nil {
+if err != nil && c.failOpen{
```
generateKey is not interacting with redis, what's the reason you skip here as well?
Will spin up my test environment and sort this out then get back to you!
```diff
 Debug:               s.rateLimit.Debug,
 RejectStatusCode:    s.rateLimit.SimpleStrategy.RejectStatusCode,
 KeySuffixExpression: s.rateLimit.KeySuffixExpression,
+FailOpen:            s.rateLimit.FailOpen,
```
Do we want to keep this a "generic" FailOpen (could be fine) or do we want to make the feature more specific?
Naming is hard, but something like: IgnoreRedisUnavailableErrors
I am game for naming this however you think would best fit the specific use case. More specificity would almost certainly be a good idea. I like IgnoreRedisUnavailableErrors, IgnoreRateLimitsWhenRedisUnavailable, or FailOpenRateLimitsDependencyError.
```diff
 allow, err := c.limiter.AllowN(ctx.Context(), key, limit, requestRate)
-if err != nil {
+if err != nil && c.failOpen{
+	return nil, nil
```
Can you check here if redis is truly unavailable?
If you're treating all errors equal, I'm wondering what the downsides could be.
That's an entirely fair callout; we should be concerned about treating all errors equally here. When implementing this, my thinking was that anything is better than dropping the request with an internal service error. I didn't encounter unexpected side effects at the time. I wonder whether strict integration testing would be enough, or if adding complexity to this solution is the best path forward?
Walkthrough: A new boolean configuration field, `fail_open`, is added to the rate limiting configuration, allowing the router to keep serving requests when the rate-limit Redis backend is unavailable.
Actionable comments posted: 1
♻️ Duplicate comments (2)
router/core/ratelimiter.go (2)
79-81: Reconsider fail-open behavior for key generation errors.

The `generateKey` function doesn't interact with Redis; it only performs string operations and expression evaluation. Failing open for key generation errors may not be appropriate since these are likely configuration or logic errors rather than Redis unavailability issues.

Consider removing the fail-open logic here and only applying it to Redis-related operations:

```diff
-	if err != nil && c.failOpen{
-		return nil, nil
-	} else if err != nil {
+	if err != nil {
 		return nil, err
 	}
```
85-87: Consider implementing more granular error handling for Redis operations.

The current implementation treats all errors from `AllowN` equally, but some errors might be configuration issues rather than Redis unavailability. Consider checking if the error is specifically related to Redis connectivity before applying fail-open behavior. This would provide more precise fail-open behavior and avoid masking configuration errors:

```diff
-	if err != nil && c.failOpen{
-		return nil, nil
-	} else if err != nil {
+	if err != nil {
+		if c.failOpen && isRedisConnectivityError(err) {
+			return nil, nil
+		}
 		return nil, err
 	}
```

You would need to implement `isRedisConnectivityError` to check for specific Redis connectivity-related errors. Let me search for examples of Redis connectivity error types to help implement more granular error handling:

```shell
#!/bin/bash
# Search for Redis error handling patterns in the codebase
rg -A 5 -B 5 "redis.*error|Error.*redis" --type go
```
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)
476-476: Add a descriptive comment for the new `FailOpen` flag.

Other non-obvious fields in `RateLimitConfiguration` are documented with inline comments. Adding one here keeps the struct consistently self-documenting.

```diff
-	FailOpen bool `yaml:"fail_open" envDefault:"false" env:"RATE_LIMIT_FAIL_OPEN"`
+	// FailOpen allows the router to "fail open" – i.e. continue serving
+	// requests when the Redis backend for rate-limiting is unreachable.
+	// When false (default) such storage failures result in a 5xx response.
+	FailOpen bool `yaml:"fail_open" envDefault:"false" env:"RATE_LIMIT_FAIL_OPEN"`
```

router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)
79-79: Consider making the warning message more context-specific.

The warning message mentions "Ratelimit Fail Open" but this Redis client is used for persisted operation storage, not just rate limiting. Consider making the message more generic or context-appropriate.

```diff
-	opts.Logger.Warn(fmt.Sprintf("Ratelimit Fail Open activated: redis client is currently not responding with provided URLs: %q", err))
+	opts.Logger.Warn(fmt.Sprintf("Fail Open activated: redis client is currently not responding with provided URLs: %q", err))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- router/core/graph_server.go (1 hunks)
- router/core/ratelimiter.go (5 hunks)
- router/core/router.go (2 hunks)
- router/internal/persistedoperation/operationstorage/redis/rdcloser.go (3 hunks)
- router/pkg/config/config.go (1 hunks)
- router/pkg/config/config.schema.json (1 hunks)
- router/pkg/config/testdata/config_defaults.json (1 hunks)
- router/pkg/config/testdata/config_full.json (1 hunks)
🔇 Additional comments (8)
router/core/graph_server.go (1)
1182-1182: LGTM! Proper configuration propagation.

The addition of the `FailOpen` field correctly propagates the fail-open configuration from the server's rate limit settings to the rate limiter options, enabling the new fail-open behavior when Redis is unavailable.

router/pkg/config/testdata/config_defaults.json (1)

251-252: LGTM! Conservative default configuration.

Setting `FailOpen` to `false` by default is the right approach, ensuring backward compatibility and preventing unintended fail-open behavior unless explicitly enabled.

router/pkg/config/testdata/config_full.json (1)

506-507: LGTM! Consistent configuration across test files.

The addition of `FailOpen: false` maintains consistency with the default configuration, ensuring uniform behavior across all test scenarios.

router/core/router.go (2)

801-801: LGTM! Proper configuration propagation.

The `FailOpen` configuration is correctly passed to the Redis client options, enabling the fail-open behavior when Redis is unavailable.

1257-1257: LGTM! Good observability addition.

Adding the `failOpen` field to the rate limiting configuration logs improves debugging and configuration verification capabilities.

router/internal/persistedoperation/operationstorage/redis/rdcloser.go (2)

24-24: Verify the fail-open behavior implementation.

The fail-open logic is correctly implemented: when Redis is unavailable and `FailOpen` is true, the function logs a warning and returns the Redis client anyway, allowing the application to continue operating despite Redis connectivity issues.

Also applies to: 78-81

78-78: Fix the formatting issue in the if statement.

There's a missing space after `if` which will cause a syntax error.

```diff
-	if(opts.FailOpen) {
+	if opts.FailOpen {
```

Likely an incorrect or invalid review comment.

router/core/ratelimiter.go (1)

32-32: LGTM on the struct modifications.

The addition of the `FailOpen` field to both the options and the rate limiter struct is correctly implemented and properly initialized.

Also applies to: 43-43, 65-65
```json
},
"fail_open": {
  "type": "boolean",
  "description": "Enable Rate Limit fail open on redis availability failure. This interacts with Redis timeout configuration parameters, essentially adding to each requests latency in failure.",
  "default": false
}
```
🛠️ Refactor suggestion
Add conditional schema logic to prevent silent mis-configuration
`fail_open` is a welcome addition, but with the current schema a user can set `rate_limit.enabled: true`, omit `storage.urls`, and forget to flip `fail_open` to true. This will compile but blow up at runtime when Redis is missing. Guard against that class of error by expressing the relationship directly in the schema:
```diff
   "fail_open": {
     "type": "boolean",
     "description": "Enable Rate Limit fail open on redis availability failure. This interacts with Redis timeout configuration parameters, essentially adding to each requests latency in failure.",
     "default": false
   }
+},
+"allOf": [
+  {
+    "if": {
+      "properties": {
+        "enabled": { "const": true },
+        "fail_open": { "const": false }
+      }
+    },
+    "then": {
+      "required": ["storage"]
+    },
+    "else": {
+      "not": {
+        "anyOf": [
+          { "required": ["storage"] }
+        ]
+      }
+    }
+  }
+]
```
This ensures:

- When rate-limiting is on and `fail_open` is false, `storage` must be present.
- When `fail_open` is true, `storage` is optional, mirroring the intended "Redis optional" behaviour.
Keeps configuration errors in CI instead of production.
Closed as inactive. Feel free to reopen if this PR is still being worked on. |

Motivation and Context
Per #1555
On testing a Cosmo rate limit implementation, I discovered that every request gets a 500 response if Redis is scaled to zero or if its authentication is malformed.
This small change introduces a fail-open configuration bool on `rate_limit` that allows requests to succeed, albeit delayed by the per-request timeout incurred when failing to reach Redis.
With fail open enabled the server is also able to start up without Redis being available; when Redis comes back up, it will begin rate limiting again.
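For illustration, enabling the option might look like this in a router config file. The `fail_open` key and the `RATE_LIMIT_FAIL_OPEN` env override come from this PR's diff; the exact nesting of the `storage` block is assumed here, not taken from the real schema:

```yaml
rate_limit:
  enabled: true
  # New in this PR; defaults to false. Env override: RATE_LIMIT_FAIL_OPEN.
  fail_open: true
  storage:
    # The nesting of this storage block is assumed for illustration.
    url: "redis://redis-master.redis.svc.cluster.local:6379?read_timeout=20ms&dial_timeout=100ms"
```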
The per-request penalty can be ameliorated by setting tighter timeouts on your Redis connection, for example:
`redis://:PASSWORD@redis-master.redis.svc.cluster.local:6379?read_timeout=20ms&write_timeout=20ms&max_retries=-1&dial_timeout=100ms`

Potential improvements
This implementation is naive in that you pay a per-request penalty whenever Redis isn't available, which wouldn't be ideal when Redis is unavailable for more than one request at a time.
It would be better to implement an internal cache where Redis availability is polled at some interval, asynchronously from the request pipeline. The request threads could then simply read the cache to decide whether to proceed with rate limiting or fail open fast. The tradeoff is that you can't guarantee a rate-limit attempt was made on each request.
Checklist
Testing this additional feature is challenging and would likely need to take place in integration tests; glad to take a stab at that if y'all are interested in moving forward with this!