Add fault-tolerance for cache system errors #577

johnnyshields · 2022-03-18T11:46:01Z

Fixes #511.

This PR adds several features designed to increase fault-tolerance in case the cache infrastructure (Redis, Memcached, etc.) fails. This introduces a few new configs; the intention here is to make a "sensible default" that will work for 99% of real-world Rails apps, while adding customizability for the DIY crowd.

Catch and handle all errors once per request.
Remove the rescuing blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency where N is the number of Rack::Attack metrics and M is the cache requests per metric.
Add Rack::Attack.allowed_errors config. This defaults to Dalli::DalliError and Redis::BaseError.
Add Rack::Attack.failure_cooldown config. This temporarily disables Rack::Attack after an error occurs (including allowed errors), to prevent cache connection latency. The default is 60 seconds.
Add Rack::Attack.error_handler which takes a Proc for custom error handling. It's probably not needed but there may be esoteric use cases for it. You can also use the shortcut symbols :block, :throttle, and :allow to respond to errors using those.
Add Rack::Attack.calling? method which uses Thread.current (or RequestStore, if available) to indicate that Rack::Attack code is executing. The reason for this is so Rails Cache error hander can raise the error for Rack::Attack to handle (i.e. if the error occurred while Rack::Attack was executing.) Refer to readme.
Add "Fault Tolerance & Error Handling" section to Readme which includes all of the above.
Bit of refactor to main Rack::Attack class; code is now split into more methods.

In addition, I have added RSpec Mocks as a development dependency. Specifically for tests related to raising/handling errors, I find RSpec Mocks a heck of a lot easier to work with than MiniTest, however, if there are any MiniTest gurus out there who can suggest a pure-MiniTest way for my tests, I'd be happy to change it. Note that RSpec Mocks is added by itself without requiring all of RSpec.

Lastly, I'd like to give a special shoutout to Amazon MemoryDB for Redis for causing the failure that inspired me to write this PR.

- Catch and handle all errors once per request. - Remove the `rescuing` blocks from the store proxies; rescuing per-method (read, write, increment) is bad because (a) it may result in undefined behavior, and (b) it will trigger repeated connection timeouts if your cache is down, e.g. N * M * timeout latency where N is the number of Rack::Attack metrics and M is the cache requests per metric. - Add `Rack::Attack.ignored_errors` config. This defaults to Dalli::DalliError and Redis::BaseError. - Add `Rack::Attack.failure_cooldown` config. This temporarily disables Rack::Attack after an error occurs (including ignored errors), to prevent cache connection latency. The default is 60 seconds. - Add `Rack::Attack.error_handler` which takes a Proc for custom error handling. It's probably not needed but there may be esoteric use cases for it. You can also use the shortcut symbols :block, :throttle, and :allow to respond to errors using those. - Add `Rack::Attack.calling?` method which uses Thread.current (or RequestStore, if available) to indicate that Rack::Attack code is executing. The reason for this is to add custom error handlers in the Rails Cache, i.e. "raise the error if it occurred while Rack::Attack was executing, so that Rack::Attack and handle it." Refer to readme. - Add "Fault Tolerance & Error Handling" section to Readme which includes all of the above.

johnnyshields · 2022-04-01T17:26:45Z

@grzuy can you please review this PR?

johnnyshields · 2022-04-20T12:22:33Z

FYI I've been running this in production now for several weeks with no issue.

johnnyshields · 2022-06-22T05:46:32Z

@grzuy any consideration to merge this?

timdiggins

I hope this PR can be merged. Just adding a few points and a review in case helpful

lib/rack/attack.rb

README.md

lib/rack/attack.rb

- Improvements to readme

johnnyshields · 2022-06-24T06:32:13Z

@timdiggins I've updated the PR in response to your comments. Thanks!

johnnyshields · 2022-08-30T17:24:32Z

@grzuy any chance to merge this? I've been using in Production for 6 months.

johnnyshields · 2023-01-18T08:05:31Z

@grzuy please review.

justinhoward · 2023-01-19T23:11:10Z

I love the idea of this PR. I just have one concern about it being enabled by default. If an attacker is able to influence a system to enter the bypass mode, they might be able to compromise a critical security feature in some cases.

Consider if someone uses rack-attack to throttle requests to their sign in form. They also use the same Redis cache for other application features. If an attacker is able to DOS the system so that Redis fails and rack-attack enters bypass mode, they may be able to also bypass the throttling on the sign in form.

Personally, I'd classify rack-attack as a security component, and it violates the "principal of least surprise" that a security system would automatically disable itself when under threat. I do think there are use-cases where this is useful, but it should probably be something that users need to enable explicitly once they understand what it does.

Of course, I could be wrong, and maybe there are mitigating factors here I don't understand. I'm happy to be corrected. Thanks either way!

johnnyshields · 2023-01-20T01:47:26Z

@justinhoward Im fine to disable this behavior by default on this PR. If i do that can someone please merge?

jrochkind · 2023-06-22T16:12:35Z

I am very interested in this too, I'm running into problem cases similar to described in original issue -- if your store is misbehaving, it can wind up bringing your whole app down. Which is pretty disastrous. Ability to configure to in some cases just ignore rack-attack failure and proceed with app processing seems crucial.

jrochkind · 2023-06-22T16:46:07Z

Actually, now trying to trace what's going on in my app, I am realizing that when using Memcached/Dalli, similar behavior may actually already be built into Dalli? eg down_retry_delay reminds me of what I was looking for with failure_cooldown here...

But I think you really do want to suspend rack-attack at this higher level notwithstanding... I realize I am not sure I understand what's going on.

johnnyshields · 2023-08-22T06:59:21Z

@grzuy please review.

henryaj · 2023-11-13T15:47:49Z

@grzuy any chance of getting this into master? I just had my production app nearly taken out by a bad memcached!

santib · 2023-11-14T16:59:41Z

Hi @johnnyshields, I think that the problem you are solving is legit, thanks for your contribution 🙌

As you said, this PR adds several features, which IMO makes it hard to be accepted and merged by maintainers. Maybe if we keep this PR as simple as possible, and separate the other features/improvements (like allowed_errors, failure_cooldown, calling?, etc) into their own separate PRs, it’d be easier to discuss each of them and get them approved. So I’m wondering: what would be the simplest solution that would solve the main issue while also taking into consideration @justinhoward’s comment?.

What comes to my mind is something like:

Catch one error per request (to avoid the N*M timeouts)
Allow devs to handle this exception (by default re-raise it to keep existing behavior)

Maybe this could be done by keeping the existing catch statements and re-raising a new custom tailored exception Rack::Attack::CacheError that bubbles up and is caught in the main method with something like

rescue Rack::Attack::CacheError => e
  Rack::Attack.exception_handler.call(e)
  @app.call(env)
end

My question is: if we do that, will it have solved the issue you experienced and the ones described in #511? What else would be missing? Does the solution has any drawback in terms of behavior, security, or any other aspect?

johnnyshields · 2023-11-14T17:53:23Z

@santib I am happy to try to break this into a series of smaller PRs if I can get your commitment to review and merge in a timely manner.

FWIW I've served billions of requests at TableCheck in the last ~1 year using this code, with no issues.

johnnyshields · 2023-11-14T17:54:46Z

I just had my production app nearly taken out by a bad memcached!

@henryaj welcome to the club!

johnnyshields · 2023-11-14T18:00:53Z

@santib

Maybe this could be done by keeping the existing catch statements.. and introduce a custom tailored Rack::Attack::CacheError

That would be possible, but since the internal Dalli/RedisProxy classes would be now raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference. TBH I think is desirable to have the original database-driver specific error messages, rather than catching and raising a new one. By removing all the silly rescuing { } blocks the code is both cleaner and has (slightly) better performance.

santib · 2023-11-14T18:29:51Z

Hey, just in case, no, I'm not a maintainer.

That would be possible, but since the internal Dalli/RedisProxy classes would be now raising an error (instead of suppressing it as previously), from an app-owner perspective I don't think it makes a huge difference.

Agree, the difference is mostly for the maintainance of this gem, so that each store_proxy continues to encapsulate the specifics of each store.

Leaving here an example of what I'm referring to: main...santib:rack-attack:system-hardening-refactor

Apart from the introduced Rack::Attack::CacheError, the changes are:

The fact that we let the errors bubble up from the cache stores and catch them in the main flow, allow us to not have the N*M timeouts (just one) and solve the denial of service issue.
The cache_error_handler does nothing by default (keeping existing behavior) but allow devs to notify/log about it.

johnnyshields · 2023-11-14T18:39:58Z

Yep I understand what you are proposing. IMHO that's actually more complex than my PR. My preference is to keep things as simple as possible and just use the native errors--less code, less bloat, no new not-yet-seen error types, etc.

That being said I will defer to the maintainers whether they prefer the bloat. If a maintainer can chime in here I'd appreciate it.

jrochkind · 2023-11-14T18:51:25Z

@johnnyshields if you fork rack-attack, I'll use your fork! It does not appear to me that current maintainers of rack-attack are interested in improvements.

johnnyshields · 2023-11-14T18:59:00Z

@jrochkind it will be second gem I've forked this week 🤣 OK give me a few days I'll do it over the weekend.

jamiemccarthy · 2023-11-14T20:06:15Z

It does not appear to me that current maintainers of rack-attack are interested in improvements

I will note that over on #578 one of the maintainers has been engaging with me, on that PR I submitted last year. He's had some great feedback, and some progress has been made just in this past week.

I'm not sure I'd give up on them just yet!

jrochkind · 2023-11-14T20:29:23Z

@jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....

@santib , if you give detailed feedback and requests for changes that sound like they are coming from a maintainer, and people invest their time in discussing and making those changes thinking if they meet your approval they will get their PR merged by you... they may be dismayed when they find out you actually have no influence over that process...

santib · 2023-11-14T20:39:10Z

@jamiemccarthy I'm sorry if I left the wrong impression 🙏

@jrochkind Don't worry, I'll make sure to always clarify I'm not a maintainer from now on 👍

What I do think is that if we make PRs simpler they'll eventually get reviewed by maintainers. But of course you don't have to trust me.

jamiemccarthy · 2023-11-14T20:47:59Z

Oops! My misunderstanding. Well, I'm not dismayed — @santib I do think your feedback has been helpful!

In any case the repo has had 8 PRs merged in October, so it doesn't seem like it's abandoned.

grzuy · 2023-11-15T02:16:13Z

Thank you @johnnyshields for your work and patience on this PR.

And thank you @timdiggins, @justinhoward, @jrochkind, @henryaj, @santib and @jamiemccarthy for all the helpful comments and feedback.

Agree with @justinhoward and @santib comments in that we would want to narrow it down to the smallest changeset possible that address yours issue in production, being an opt-in setting. I would prefer, given the considerable amount of rack-attack users and maturity of the gem, that mostly nothing changes for existing users that don't opt-in.

Thanks again.

johnnyshields · 2023-11-15T05:21:06Z

@grzuy Appreciate the comment. Let me try to break this into a series of smaller PRs.

One thing that will be inevitable is that we need to raise an error from the Redis/DalliProxy classes to the parent level. Currently this is being swallowed inside the proxy classes themselves. Two options here:

@santib has proposed to introduce a new "wrapper" error class such as Rack::Attack::CacheError
I am proposing to use the existing Dalli/RedisErrors, which may have more info and are easier to work with IMHO. It makes the code simpler if we do this, and we can always go @santib's way later if there is a strong reason to do so.

Do you have a strong opinion on which way we should go?

grzuy · 2023-11-18T13:27:06Z

@jamiemccarthy it looks to me like over on #578 you have been discussing your PR with @santib .... who said here on this issue that they are not a maintainer....

Just for what is worth, he is a mantainer now #637.

johnnyshields mentioned this pull request Mar 18, 2022

Don't fail when Redis is misbehaving, e.g. OOM command not allowed when used memory > 'maxmemory' #511

Open

johnnyshields force-pushed the system-hardening branch from a141aa1 to ae9d7fd Compare March 20, 2022 09:13

Clarify custom error handling in readme

d3853e8

johnnyshields mentioned this pull request Mar 20, 2022

Add error_handler option as Proc to Dalli::Store petergoldstein/dalli#907

Closed

johnnyshields added 3 commits March 20, 2022 19:30

Improve clarity of readme

78c0939

Rename ignored_errors to allowed_errors

6af71d3

Add specs for failure_cooldown? and failed! methods

59d5619

timdiggins reviewed Jun 22, 2022

View reviewed changes

lib/rack/attack.rb Outdated Show resolved Hide resolved

README.md Outdated Show resolved Hide resolved

lib/rack/attack.rb Outdated Show resolved Hide resolved

lib/rack/attack.rb Outdated Show resolved Hide resolved

- Use request.env rather than separate env parameter

9cb7822

- Improvements to readme

Merge branch 'main' into system-hardening

63a8651

Merge branch 'main' into system-hardening

958ebe1

Merge branch 'main' into system-hardening

4ff0da1

johnnyshields mentioned this pull request Nov 16, 2023

System Hardening Phase 1: Catch errors at top level #639

Closed

9 tasks

Merge branch 'main' into system-hardening

2791b9d

johnnyshields mentioned this pull request Nov 25, 2023

System Hardening Phase 1: Catch errors at top level, take two #641

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fault-tolerance for cache system errors #577

Add fault-tolerance for cache system errors #577

johnnyshields commented Mar 18, 2022 •

edited

Loading

johnnyshields commented Apr 1, 2022

johnnyshields commented Apr 20, 2022

johnnyshields commented Jun 22, 2022

timdiggins left a comment

johnnyshields commented Jun 24, 2022

johnnyshields commented Aug 30, 2022

johnnyshields commented Jan 18, 2023

justinhoward commented Jan 19, 2023

johnnyshields commented Jan 20, 2023 •

edited

Loading

jrochkind commented Jun 22, 2023

jrochkind commented Jun 22, 2023

johnnyshields commented Aug 22, 2023

henryaj commented Nov 13, 2023

santib commented Nov 14, 2023 •

edited

Loading

johnnyshields commented Nov 14, 2023 •

edited

Loading

johnnyshields commented Nov 14, 2023

johnnyshields commented Nov 14, 2023 •

edited

Loading

santib commented Nov 14, 2023

johnnyshields commented Nov 14, 2023 •

edited

Loading

jrochkind commented Nov 14, 2023

johnnyshields commented Nov 14, 2023

jamiemccarthy commented Nov 14, 2023

jrochkind commented Nov 14, 2023

santib commented Nov 14, 2023

jamiemccarthy commented Nov 14, 2023

grzuy commented Nov 15, 2023

johnnyshields commented Nov 15, 2023

grzuy commented Nov 18, 2023

Add fault-tolerance for cache system errors #577

Are you sure you want to change the base?

Add fault-tolerance for cache system errors #577

Conversation

johnnyshields commented Mar 18, 2022 • edited Loading

johnnyshields commented Apr 1, 2022

johnnyshields commented Apr 20, 2022

johnnyshields commented Jun 22, 2022

timdiggins left a comment

Choose a reason for hiding this comment

johnnyshields commented Jun 24, 2022

johnnyshields commented Aug 30, 2022

johnnyshields commented Jan 18, 2023

justinhoward commented Jan 19, 2023

johnnyshields commented Jan 20, 2023 • edited Loading

jrochkind commented Jun 22, 2023

jrochkind commented Jun 22, 2023

johnnyshields commented Aug 22, 2023

henryaj commented Nov 13, 2023

santib commented Nov 14, 2023 • edited Loading

johnnyshields commented Nov 14, 2023 • edited Loading

johnnyshields commented Nov 14, 2023

johnnyshields commented Nov 14, 2023 • edited Loading

santib commented Nov 14, 2023

johnnyshields commented Nov 14, 2023 • edited Loading

jrochkind commented Nov 14, 2023

johnnyshields commented Nov 14, 2023

jamiemccarthy commented Nov 14, 2023

jrochkind commented Nov 14, 2023

santib commented Nov 14, 2023

jamiemccarthy commented Nov 14, 2023

grzuy commented Nov 15, 2023

johnnyshields commented Nov 15, 2023

grzuy commented Nov 18, 2023

johnnyshields commented Mar 18, 2022 •

edited

Loading

johnnyshields commented Jan 20, 2023 •

edited

Loading

santib commented Nov 14, 2023 •

edited

Loading

johnnyshields commented Nov 14, 2023 •

edited

Loading

johnnyshields commented Nov 14, 2023 •

edited

Loading

johnnyshields commented Nov 14, 2023 •

edited

Loading