
feat(router): enable fail open on rate limit redis availability failure #1659

Closed
Sam-tesouro wants to merge 37 commits into wundergraph:main from Sam-tesouro:feat(router)--enable-fail-open-on-rate-limit-redis-unavailability

Conversation


@Sam-tesouro Sam-tesouro commented Mar 5, 2025

Motivation and Context

Per #1555

While testing a Cosmo rate limit implementation, I discovered that every request gets a 500 response if Redis is scaled to zero or if authentication is malformed.

This small change introduces a fail open configuration bool on rate_limit that allows requests to succeed, albeit delayed by the timeout incurred on each request that fails to reach Redis.

With fail open enabled, the server can also start up while Redis is unavailable. When Redis comes back up, rate limiting resumes.

The per request penalty can be ameliorated by setting tighter timeouts on your Redis connection, for example:
redis://:PASSWORD@redis-master.redis.svc.cluster.local:6379?read_timeout=20ms&write_timeout=20ms&max_retries=-1&dial_timeout=100ms

Potential improvements

This implementation is naive in that you pay a per-request penalty whenever Redis isn't available, which is not ideal when Redis stays unavailable for more than one request at a time.

It would be better to implement an internal cache in which Redis availability is polled at some interval, asynchronously to the request pipeline. The request threads could then just read the cache to determine whether to proceed with rate limiting or fail open fast. The tradeoff is that you can't guarantee you tried to rate limit on each request.

Checklist

Testing this additional feature is challenging and would likely need to take place in integration tests. I'm glad to take a stab at that if y'all are interested in moving forward with this!

Summary by CodeRabbit

  • New Features

    • Added a new "fail open" option for rate limiting, allowing requests to proceed if Redis encounters errors or is unavailable, based on configuration.
    • Introduced a configurable "fail open" setting via YAML, environment variables, and JSON schema for rate limiting.
  • Bug Fixes

    • Improved error handling for rate limiting by allowing optional bypass on internal errors when "fail open" is enabled.
  • Documentation

    • Updated configuration schema and example files to include the new "fail open" option for rate limiting.

@github-actions github-actions Bot added the router label Mar 5, 2025
@Sam-tesouro Sam-tesouro marked this pull request as draft March 5, 2025 14:09
Sam-tesouro and others added 2 commits March 5, 2025 09:24
@Sam-tesouro Sam-tesouro marked this pull request as ready for review March 5, 2025 14:36
@Sam-tesouro Sam-tesouro marked this pull request as draft March 7, 2025 21:14
@Sam-tesouro Sam-tesouro marked this pull request as ready for review March 8, 2025 03:09
@Tesouro-Chris Tesouro-Chris left a comment


gif

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label May 13, 2025
Member

@jensneuse jensneuse left a comment


I have made a few comments we should clarify.
In addition, we need to add tests before we can merge this.

 	}
 	key, err := c.generateKey(ctx)
-	if err != nil {
+	if err != nil && c.failOpen{
Member

generateKey does not interact with Redis. What's the reason you skip here as well?

Author

Will spin up my test environment and sort this out then get back to you!

 	Debug: s.rateLimit.Debug,
 	RejectStatusCode: s.rateLimit.SimpleStrategy.RejectStatusCode,
 	KeySuffixExpression: s.rateLimit.KeySuffixExpression,
+	FailOpen: s.rateLimit.FailOpen,
Member

Do we want to keep this a "generic" FailOpen (could be fine) or do we want to make the feature more specific?
Naming is hard, but something like: IgnoreRedisUnavailableErrors

Author

I am game for naming this however you think would best fit the specific use case. More specificity would almost certainly be a good idea. I like IgnoreRedisUnavailableErrors, IgnoreRateLimitsWhenRedisUnavailable, or FailOpenRateLimitsDependencyError.

 	allow, err := c.limiter.AllowN(ctx.Context(), key, limit, requestRate)
-	if err != nil {
+	if err != nil && c.failOpen{
+		return nil, nil
Member

Can you check here if redis is truly unavailable?
If you're treating all errors equal, I'm wondering what the downsides could be.

Author

That's an entirely fair callout; we should be concerned about treating all errors equally here. When implementing this, my thinking was that anything is better than dropping the request with an internal service error. I didn't encounter unexpected side effects at that time. I wonder whether strict integration testing would be enough, or whether adding complexity to this solution is the best path forward.

@jensneuse jensneuse requested a review from StarpTech May 13, 2025 07:00
@github-actions github-actions Bot removed the Stale label May 14, 2025

github-actions Bot commented Jun 4, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Jun 4, 2025
@github-actions github-actions Bot removed the Stale label Jun 10, 2025
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Jun 24, 2025
Contributor

coderabbitai Bot commented Jul 3, 2025

Walkthrough

A new boolean configuration field FailOpen is introduced to control whether rate limiting should fail open when Redis or internal errors occur. This option is added to configuration structs, JSON schema, and relevant logic in the rate limiter and Redis client initialization, allowing requests to proceed when enabled and errors are encountered.

Changes

| File(s) | Change Summary |
| --- | --- |
| router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json | Added FailOpen boolean field to rate limit configuration structs, JSON schema, and test data. |
| router/core/ratelimiter.go | Added FailOpen to options and implementation; updated error handling to optionally fail open. |
| router/core/graph_server.go, router/core/router.go | Propagated FailOpen from configuration to rate limiter and Redis client initialization; logging. |
| router/internal/persistedoperation/operationstorage/redis/rdcloser.go | Added FailOpen to RedisCloserOptions and implemented fail-open logic for Redis connection errors. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
router/core/ratelimiter.go (2)

79-81: Reconsider fail-open behavior for key generation errors.

The generateKey function doesn't interact with Redis - it only performs string operations and expression evaluation. Failing open for key generation errors may not be appropriate since these are likely configuration or logic errors rather than Redis unavailability issues.

Consider removing the fail-open logic here and only applying it to Redis-related operations:

-	if err != nil && c.failOpen{
-		return nil, nil
-	} else if err != nil {
+	if err != nil {
		return nil, err
	}

85-87: Consider implementing more granular error handling for Redis operations.

The current implementation treats all errors from AllowN equally, but some errors might be configuration issues rather than Redis unavailability. Consider checking if the error is specifically related to Redis connectivity before applying fail-open behavior.

This would provide more precise fail-open behavior and avoid masking configuration errors:

-	if err != nil && c.failOpen{
-		return nil, nil
-	} else if err != nil {
+	if err != nil {
+		if c.failOpen && isRedisConnectivityError(err) {
+			return nil, nil
+		}
		return nil, err
	}

You would need to implement isRedisConnectivityError to check for specific Redis connectivity-related errors.

Let me search for examples of Redis connectivity error types to help implement more granular error handling:

#!/bin/bash
# Search for Redis error handling patterns in the codebase
rg -A 5 -B 5 "redis.*error|Error.*redis" --type go
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)

476-476: Add a descriptive comment for the new FailOpen flag

Other non-obvious fields in RateLimitConfiguration are documented with inline comments. Adding one here keeps the struct consistently self-documenting.

-    FailOpen            bool                        `yaml:"fail_open" envDefault:"false" env:"RATE_LIMIT_FAIL_OPEN"`
+    // FailOpen allows the router to “fail open” – i.e. continue serving
+    // requests when the Redis backend for rate-limiting is unreachable.
+    // When false (default) such storage failures result in a 5xx response.
+    FailOpen            bool                        `yaml:"fail_open" envDefault:"false" env:"RATE_LIMIT_FAIL_OPEN"`
router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)

79-79: Consider making the warning message more context-specific.

The warning message mentions "Ratelimit Fail Open" but this Redis client is used for persisted operation storage, not just rate limiting. Consider making the message more generic or context-appropriate.

-			opts.Logger.Warn(fmt.Sprintf("Ratelimit Fail Open activated: redis client is currently not responding with provided URLs: %q", err))
+			opts.Logger.Warn(fmt.Sprintf("Fail Open activated: redis client is currently not responding with provided URLs: %q", err))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c00e210 and 3898392.

📒 Files selected for processing (8)
  • router/core/graph_server.go (1 hunks)
  • router/core/ratelimiter.go (5 hunks)
  • router/core/router.go (2 hunks)
  • router/internal/persistedoperation/operationstorage/redis/rdcloser.go (3 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
router/pkg/config/testdata/config_defaults.json (2)
Learnt from: endigma
PR: wundergraph/cosmo#2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.755Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Learnt from: SkArchon
PR: wundergraph/cosmo#1929
File: router/internal/circuit/manager.go:16-25
Timestamp: 2025-06-30T20:39:02.376Z
Learning: In the Cosmo router project, parameter validation for circuit breaker configuration is handled at the JSON schema level rather than through runtime validation methods on structs. The config.schema.json file contains comprehensive validation constraints for circuit breaker parameters.
router/pkg/config/testdata/config_full.json (2)
Same two learnings as listed above for config_defaults.json (endigma, PR wundergraph/cosmo#2009; SkArchon, PR wundergraph/cosmo#1929).
router/pkg/config/config.go (1)
Same learning as listed above from endigma (PR wundergraph/cosmo#2009).
router/pkg/config/config.schema.json (2)
Same two learnings as listed above for config_defaults.json (endigma, PR wundergraph/cosmo#2009; SkArchon, PR wundergraph/cosmo#1929).

🔇 Additional comments (8)
router/core/graph_server.go (1)

1182-1182: LGTM! Proper configuration propagation.

The addition of the FailOpen field correctly propagates the fail-open configuration from the server's rate limit settings to the rate limiter options, enabling the new fail-open behavior when Redis is unavailable.

router/pkg/config/testdata/config_defaults.json (1)

251-252: LGTM! Conservative default configuration.

Setting FailOpen to false by default is the right approach, ensuring backward compatibility and preventing unintended fail-open behavior unless explicitly enabled.

router/pkg/config/testdata/config_full.json (1)

506-507: LGTM! Consistent configuration across test files.

The addition of FailOpen: false maintains consistency with the default configuration, ensuring uniform behavior across all test scenarios.

router/core/router.go (2)

801-801: LGTM! Proper configuration propagation.

The FailOpen configuration is correctly passed to the Redis client options, enabling the fail-open behavior when Redis is unavailable.


1257-1257: LGTM! Good observability addition.

Adding the failOpen field to the rate limiting configuration logs improves debugging and configuration verification capabilities.

router/internal/persistedoperation/operationstorage/redis/rdcloser.go (2)

24-24: Verify the fail-open behavior implementation.

The fail-open logic is correctly implemented - when Redis is unavailable and FailOpen is true, the function logs a warning and returns the Redis client anyway, allowing the application to continue operating despite Redis connectivity issues.

Also applies to: 78-81


78-78: Fix the formatting issue in the if statement.

There's a missing space after if which will cause a syntax error.

-		if(opts.FailOpen) {
+		if opts.FailOpen {

Likely an incorrect or invalid review comment.

router/core/ratelimiter.go (1)

32-32: LGTM on the struct modifications.

The addition of the FailOpen field to both the options and the rate limiter struct is correctly implemented and properly initialized.

Also applies to: 43-43, 65-65

Comment on lines +1813 to 1818
},
"fail_open": {
"type": "boolean",
"description": "Enable Rate Limit fail open on redis availability failure. This interacts with Redis timeout configuration parameters, essentially adding to each requests latency in failure.",
"default": false
}
Contributor

🛠️ Refactor suggestion

Add conditional schema logic to prevent silent mis-configuration

fail_open is a welcome addition, but with the current schema a user can set
rate_limit.enabled: true, omit storage.urls, and forget to flip fail_open to true.
This passes schema validation but fails at runtime when Redis is missing.

Guard against that class of error by expressing the relationship directly in the
schema:

         "fail_open": {
           "type": "boolean",
           "description": "Enable Rate Limit fail open on redis availability failure. This interacts with Redis timeout configuration parameters, essentially adding to each requests latency in failure.",
           "default": false
         }
+      },
+      "allOf": [
+        {
+          "if": {
+            "properties": {
+              "enabled": { "const": true },
+              "fail_open": { "const": false }
+            },
+            "required": ["enabled"]
+          },
+          "then": {
+            "required": ["storage"]
+          }
+        }
+      ]

This ensures:

  1. When rate-limiting is on and fail_open is false → storage must be present.
  2. When fail_open is true → storage is optional, mirroring the intended “Redis optional” behaviour.

Keeps configuration errors in CI instead of production.

🤖 Prompt for AI Agents
In router/pkg/config/config.schema.json around lines 1813 to 1818, the schema
allows rate_limit.enabled to be true with fail_open left false and storage.urls
omitted, which causes runtime errors if Redis is missing. To fix this, add
conditional schema logic ("if"/"then") to enforce that when rate_limit.enabled
is true and fail_open is false, storage.urls must be present, and when fail_open
is true, storage.urls can be omitted. This will prevent misconfiguration by
catching these dependencies during schema validation.

@github-actions github-actions Bot removed the Stale label Jul 4, 2025
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Jul 18, 2025

github-actions Bot commented Aug 1, 2025

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions Bot closed this Aug 1, 2025