
Refactor: less-spiky throttles #578

Open
wants to merge 19 commits into base: main

Conversation

jamiemccarthy

Overview

My company's been using Rack::Attack for years, and it's an important part of how our Rails application controls traffic. But because every IP's 60-second request period rolls over at the same time, we were seeing a spiky traffic pattern.

At the top of the minute, all the badly-behaving clients' requests would be passed through our application's Rails server. Then over the course of that minute they would gradually get themselves blocked. Near the end of the minute, traffic would be lower. And it would repeat, with Rails getting slammed again at the top of each minute.

It's better for performance purposes to spread that traffic out over the minute as evenly as possible. This is especially important if we want to improve our 99th-percentile performance.

This PR implements that by giving each discriminator (such as an IP number) a pseudo-random offset into the time period. While the same duration is used, in our case 60 seconds, each individual IP's rollover time could be fixed at :00, :59, or anywhere in-between.

Badly-behaving clients can still slam Rails when their period rolls over. But they don't all slam it together.

The change is implemented inside Rack::Attack::Cache. The count method now takes an optional argument to enable the period offset, defaulting to the old behavior. Rack::Attack::Throttle makes use of it, unless there's an incompatible callback (see below). But it wouldn't make sense for safelist, blocklist and track, so their behavior is unaffected.
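
Conceptually, the offset works like this (a minimal sketch with illustrative method names, not this PR's exact internals):

    require "digest/md5"

    # Each discriminator gets a stable pseudo-random offset, derived from a
    # hash of itself modulo the period, so its bucket boundary shifts while
    # its period length stays the same.
    def offset_for(discriminator, period)
      Digest::MD5.hexdigest(discriminator).hex % period
    end

    def period_number(discriminator, period, now = Time.now.to_i)
      (now + offset_for(discriminator, period)) / period
    end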

The effect

Vox Media switched our main production external-facing web app to this branch, 06787b5 to be precise, earlier this year. It's been running fine, serving hundreds of requests per second, for a month.

Before: Here's what a DDoS attempt from 12 IPs looked like before the offset functionality was enabled. The empty gaps at the top of each minute are where the bot's requests were not being throttled and all 12 attacker IPs were being served actual traffic at once. This concentrated the attack's load on our app servers into the first few seconds of each minute -- the traffic "spikes" show up as the negative space between the 429 bars:

[Screenshot: Screen Shot 2022-01-25 at 1 05 53 PM]

After: And here's a different DDoS attempt, with the offset enabled. Each of the attacker's 2,742 IPs still has the same number of allowed requests per minute, but each IP's counter resets at a different time instead of all landing at once, spreading the load across the whole minute:

[Screenshot: Screen Shot 2022-03-16 at 10 43 13 AM]

(In that time period, almost all of the attack came from just a handful of IPs, with 2,700+ others contributing only a little traffic each. But despite that extreme concentration, the random offset was still sufficient to spread out the load.)

Ensuring custom-responder correctness

So as we just saw, this change should work well, and work transparently, for applications that are using Rack::Attack to throttle public non-authenticated traffic.

However, there are some compatibility issues when it's used to return helpful information for what the README calls "well-behaved" traffic, for example authenticated customers hitting an API.

The problem is that this PR makes Rack::Attack::Cache use a private method to calculate when the end of the period will come, which is the value we want to emit to friendly clients. But for some years the README has been advising applications to calculate it themselves, and has provided sample code that doesn't return the same result as this PR now does.

With this branch's changes, that previous sample code, or any similar code, would now calculate the wrong result. The throttling effect will be the same, but applications' lambdas would be emitting incorrect information.
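
For reference, the README's long-standing suggestion boils down to this calculation (wrapped here in an abbreviated, illustrative responder):

    # Pre-offset README pattern: derive the reset time from epoch_time and
    # period alone. With per-discriminator offsets, this no longer lines up
    # with the actual period boundary.
    Rack::Attack.throttled_responder = lambda do |request|
      match_data = request.env["rack.attack.match_data"]
      now = match_data[:epoch_time]
      retry_after = match_data[:period] - (now % match_data[:period])
      [429, { "Retry-After" => retry_after.to_s }, ["Throttled\n"]]
    end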

Now, I have of course updated the DEFAULT_THROTTLED_RESPONDER lambda to read that value out of match_data. So if an application is just setting throttled_response_retry_after_header = true and using that default, it'll work correctly.

But if an application has written its own custom responder, perhaps based on the previous sample code, this PR disables its own functionality, to make sure the correct answer is returned. Oh well! The throttling behavior doesn't change and the gem works as it did before.

When application authors notice this change, they can update their lambda to be compatible, and when they do, I've provided a config option to restore this PR's functionality: throttled_responder_is_offset_aware = true. I've updated the README's sample code to reflect this.
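
In configuration terms, the two happy paths look roughly like this (the setter spellings follow this PR's option names; the custom responder body is only an illustration):

    # Path 1: rely on the default responder; it reads retry_after out of
    # match_data, so the offset is handled transparently.
    Rack::Attack.throttled_response_retry_after_header = true

    # Path 2: keep a custom responder, but have it read the offset-aware
    # value from match_data and opt in explicitly.
    Rack::Attack.throttled_responder_is_offset_aware = true
    Rack::Attack.throttled_responder = lambda do |request|
      match_data = request.env["rack.attack.match_data"]
      [429, { "Retry-After" => match_data[:retry_after].to_s }, ["Throttled\n"]]
    end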

Because this preserves backwards compatibility, with behavior automatically remaining the same where necessary, I believe this PR could be released in a 6.x version.

I've written tests to ensure correctness in all three cases (default responder, custom not-aware, custom aware). And Vox Media has run each of the latter two cases in production.

Tests improved

A number of tests had been written assuming that they knew how to generate the key they were checking for. That's an implementation detail that this PR now changes. So I changed them to query a class method for that data (via send since it's a private method).

I also added unit tests for that private method's key generation, to make sure that code is separately covered.

This also has the advantage of removing all the non-Timecop-frozen Time.now references from tests, each of which was an unlikely but real potential source of flakiness. I don't promise to have found every place where Timecop could rule out time-related breakage, but I think I got most of them, and the test suite should be a little more reliable now.

Each of the three cases described above is separately tested.

Other considerations and my rationales

Is this too unpredictable for friendlies?

Sometimes apps want to throttle "friendly" traffic, promising paying customers a certain number of requests. They may be using Rack::Attack for this purpose.

So let's consider the case of an app that wants to throttle "well-behaved" customers in a predictable way, promising them x requests per week or month. We might wonder if rolling over that week or month at an unpredictable time would make the customer unhappy.

I consider this unlikely. Currently the period must roll over wherever the epoch time hits a multiple of the period (epoch % period == 0), which for a week is not Sunday midnight but an oddball 7 or 8 PM Eastern every Wednesday, depending on daylight saving time. And this can't be used for month-by-month traffic: there's no way to specify a period matching the exact number of seconds in a month, so nobody's doing that.
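
For instance, with the current fixed rollover, a one-week period resets wherever the Unix epoch happens to land:

    period = 7 * 24 * 3600  # 604_800 seconds
    next_rollover = ((Time.now.to_i / period) + 1) * period
    Time.at(next_rollover).utc
    # => always a Thursday 00:00 UTC (the epoch began on a Thursday),
    #    i.e. Wednesday 7 PM EST or 8 PM EDT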

If any applications are dependent on throttling customer traffic per-day with the expected behavior of rolling over at GMT midnight, this PR encapsulates that logic in Rack::Attack::Cache.offset_for. A targeted monkey-patch could pick any time of day. Maybe their customers would prefer midnight in their local time zone.

But any such applications are probably using a throttled callback to report this information to customers' clients, and if those clients can't figure out what to do that seems like a client problem, not an application problem.

If this behavior change seems like a deal-breaker, I'd be happy to make offset_for a bit more complex and, for example, always return 0 for period == 86400, or period >= 86400, or similar.
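
That tweak would be tiny. A hypothetical sketch only (the real offset_for signature in this branch may differ):

    require "rack/attack"
    require "digest/md5"

    class Rack::Attack::Cache
      # Keep the old fixed GMT-midnight rollover for daily-or-longer
      # periods; shorter periods still get the per-discriminator offset.
      def offset_for(unprefixed_key, period)
        return 0 if period >= 86_400
        Digest::MD5.hexdigest(unprefixed_key).hex % period
      end
    end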

But I don't think this would be necessary, because I think Rack::Attack is already not the right solution for these needs.

1-second as choice of resolution

I round offsets to the nearest second, and consider that sufficient. If someone feels it would be helpful to round to a tenth of a second, I'd have no objection. But it would have to be an extremely high-traffic site to notice the difference, and it'd require rethinking the tests a bit.

MD5 as choice of hash

This PR works only because the discriminator (IP number or what-have-you) gets hashed to a predictable value. I picked Digest::MD5 to generate the hash value. It's fast, and MD5's infamous issues for cryptography don't concern us for this purpose.

Its speed seems fine. On my Chromebook, in a simple benchmark on an IP address, MD5 and modulo run in about 10 microseconds. That's 2 orders of magnitude under a typical round-trip time to the cache, and 3-4 under servicing a typical web request. So I think that's good enough.
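
A quick way to reproduce that micro-benchmark (timings will obviously vary by machine):

    require "benchmark"
    require "digest/md5"

    ip = "203.0.113.42"
    n = 100_000
    elapsed = Benchmark.realtime { n.times { Digest::MD5.hexdigest(ip).hex % 60 } }
    puts format("%.2f microseconds per hash + modulo", elapsed / n * 1_000_000)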

(In case anyone's curious, Zlib.crc32 isn't better, and stripping off a prefix/suffix of the hexdigest before calling String#hex only makes it slower. The gem doesn't support Ruby < 2.4, so we don't have to worry about Fixnum vs. Bignum: it's all just Integer, and % on MD5's 128-bit Integer is no problem for Ruby.)

One more thing

I kept this PR as on-target as possible, but in this case I did want to correct a documentation error. For the last nine years the README has said that match_data annotation happens for:

responses that did not exceed a throttle limit

This is backwards and I changed it to:

responses that exceeded a throttle limit

@johnnyshields

This is a wonderful change! Thank you for contributing it!

@jamiemccarthy
Author

Let me know if there are any questions I can answer about this PR, or if you need me to squash commits or reformat anything. I merged main back in, fixed the one conflict, and if you approve the workflow I think CI should pass. I'm also available to do a Zoom to discuss it, if that'll help.

I know there's a fair bit going on here and the diff is pretty large. But I think it's an improvement worth doing, and most of the diff is in the tests, which isn't as daunting as it looks.

@Nowaker

Nowaker commented Apr 17, 2022

Wonderful work!

@mitchellhenke
Contributor

This PR looks wonderful, and I'd love to see it merged 🙂

@fullmetalengineer

Just want to say this is absolutely great and I hope to see it merged when it's ready. Great contribution!

@dinsley

dinsley commented Apr 1, 2023

+1 This is a great change. It's exactly what we're seeing here and looking into.

Comment on lines -365 to +366
- For responses that did not exceed a throttle limit, Rack::Attack annotates the env with match data:
+ For responses that exceeded a throttle limit, Rack::Attack annotates the env with match data:
Collaborator

santib commented Oct 27, 2023

Actually, I think both of these sentences are somehow inaccurate (or incomplete) 🤔. In spec/rack_attack_throttle_spec.rb we can see that request.env['rack.attack.throttle_data'][name] is always set (when using the throttling feature), except when the provided block returns nil, right?

Collaborator

[screenshot]

so maybe this is just some confusion between throttle_data ('rack.attack.throttle_data') and the match data, which includes 'rack.attack.matched', 'rack.attack.match_discriminator', 'rack.attack.match_type', and 'rack.attack.match_data'.


    data = {
      discriminator: discriminator,
      count: count,
      period: current_period,
      limit: current_limit,
-     epoch_time: cache.last_epoch_time
+     epoch_time: cache.last_epoch_time,
+     retry_after: cache.last_retry_after_time
Collaborator

I think this will set the retry_after value on every request (not just the throttled ones). Is that the intended behavior? or should it be only for throttled requests?

@santib
Collaborator

santib commented Oct 27, 2023

Hi @jamiemccarthy thanks a lot for the well detailed description. I think you brought up an interesting problem, but I'm still not clear on what the solution should be.

I'm curious if there is any reason why it wouldn't be enough to make use of the period as a proc feature to achieve the behavior you are describing. What I mean is that rack-attack users can implement their throttling like:

    period_proc = proc { |req| (Digest::MD5.hexdigest(req.ip).hex % 6) + 1 }
    Rack::Attack.throttle("by ip", limit: 1, period: period_proc) do |request|
      request.ip
    end

and that will distribute them in a similar way, right? I tried it in your test spec/acceptance/customizing_throttled_affects_offset.rb:22 and it seems to work pretty much the same way. Am I missing something?

@jamiemccarthy
Author

any reason why it wouldn't be enough to make use of the period as a proc feature

Hi there! Great question!

That doesn't work because what it's doing is adjusting the size of the period. In your example, the period for some IPs would be 1 second, which would reset their counter every second.

What this PR does is keep a constant period, but adjust the offset into it pseudo-randomly.
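
To make the distinction concrete (a sketch reusing your proc; the IP is made up):

    require "digest/md5"

    # period as a proc: the *length* of the window varies per IP, so an IP
    # hashing to 0 gets a 1-second window and another gets a 6-second one.
    period_proc = proc { |req| (Digest::MD5.hexdigest(req.ip).hex % 6) + 1 }

    # this PR: every IP keeps the same 60-second window; only the window
    # boundary is shifted, by a stable per-IP offset in 0..59.
    offset = Digest::MD5.hexdigest("203.0.113.42").hex % 60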

@santib
Collaborator

santib commented Nov 1, 2023

Thanks for clarifying @jamiemccarthy, it makes sense now. After giving it more thought, I believe that the core of your solution is the right one, and that it would be a nice addition to the gem ❤️ That being said, I do have some general comments and things to discuss if you'd like:

1. Simplifying the enabling/disabling of the offset feature

I think the

use_offset = throttled_responder_is_offset_aware ||
             (!throttled_response && throttled_responder == DEFAULT_THROTTLED_RESPONDER)

conditional adds complexity both to gem maintenance and to users of the gem. I find it a bit confusing that the offset feature can get turned on/off automatically without respecting the flag, just because a custom responder is or isn't set. For example, even if you have the flag set to false, it will still use the offset feature if you have the default responder.

Not saying it's wrong, but WDYT about splitting it into a separate PR so we can further discuss it there, and in this PR going with a more straightforward solution using the flag, like this

use_offset = throttled_responder_is_offset_aware

?

The flag will be off by default so people using custom responders wouldn't face issues, and we can add a warning message in the README to only enable the offset feature if you are not using a custom responder or if you are going to adapt the custom responder along with it.

I think this will also simplify tests, for example not needing to add this line which I find hard to understand without all this context.

2. Move owner of the offset feature + remove use_offset param

What if, instead of having the Cache class know how to generate the offset and all, it's the Throttle class that does it (if the flag is on), and the Cache class just knows about an offset value that needs to be added and that defaults to 0 (thus not affecting allow2ban and fail2ban)?

That way we can get rid of passing around a use_offset flag, right? Instead we would be able to use Rack::Attack.configuration.throttled_responder_is_offset_aware directly in the Throttle class and I think that would help lower the complexity of the PR.
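
Roughly the shape I have in mind, loosely following the gem's existing Cache#count flow (a sketch only; details are illustrative):

    require "rack/attack"

    class Rack::Attack::Cache
      # Offset-agnostic count: callers may pass an offset (default 0), so
      # allow2ban, fail2ban, and other callers are untouched.
      def count(unprefixed_key, period, offset = 0)
        epoch_time = Time.now.to_i
        period_number = (epoch_time + offset) / period
        key = "#{prefix}:#{period_number}:#{unprefixed_key}"
        expires_in = period - ((epoch_time + offset) % period) + 1
        do_count(key, expires_in)
      end
    end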

3. Thread-safety issue

I'm a bit worried about @last_retry_after_time since it can bring thread safety issues. In a common Rails setup with multi-threaded Puma, the Cache instance will be shared between 5 web threads, and if for some reason there is a context switch between the line that sets @last_retry_after_time and the line that uses it, there is a chance that the wrong value is returned. If we also go with the previous recommendation, we can easily return an offset in the match data and throttle data so that users of the gem can use it to calculate the "retry_after" value. Something like:

now = match_data[:epoch_time]
retry_after = match_data[:period] - ((now + match_data[:offset]) % match_data[:period])

which, notice, is exactly the same as the existing suggestion but with the addition of the offset.

I have a diff here just in case it's easier: main...santib:rack-attack:jm-random-period-offset. Let me know your thoughts :)

@jamiemccarthy
Author

Hi @santib , thanks for your interest in this and your thoughtful comments! Sorry I didn't get a chance to respond until this week. I'll reply one point at a time.

  1. Simplifying the enabling/disabling of the offset feature

I agree with you that my throttled_responder_is_offset_aware boolean is complex, and it would be nice to get rid of it.

My goal with this boolean was to let apps upgrade this gem to automatically get the new offset functionality if that can happen transparently, and to continue with the old behavior by default if it can't. That way there's no need to bump the major version number to indicate a breaking change, and people upgrading don't even have to read the changelog. Everything stays compatible and the best (imho) behavior happens automatically.

My hunch is that most installations are using DEFAULT_THROTTLED_RESPONDER and also would appreciate the improved performance, so I was primarily thinking of that use case.

But it's a tradeoff and I totally understand if you want to make that trade a little differently. Maybe it's not super important if we make them go ahead and enable it manually, as described in the changelog and readme. If you're cool with that, great, let's do that instead.

If we do go that way, throttled_responder_is_offset_aware becomes an unnecessarily wordy name for that option. If we're setting all offsets to 0 unless that option is set, maybe call it Rack::Attack.configuration.use_throttle_offsets or something?

@jamiemccarthy
Author

  2. Move owner of the offset feature + remove use_offset param

I have no problem with that, but I'd want to see how it would be done.

I wasn't especially happy with adding so much code to Cache when fail2ban, allow2ban, and safelist simply have no reason to use it. It felt like a separation-of-concerns code smell.

Unfortunately, calculating and using the offset does seem like it has to happen inside cache, and exposing that logic to Throttle feels even more smelly.

Here's an alternative idea. What if my original approach with the use_offset boolean was wrong? I don't think it should be an argument passed down a chain of three methods. It's a different behavior and it should be a subclass.

So what if we create a subclass Rack::Attack::CacheWithOffset < Rack::Attack::Cache that only Throttle uses? Rack::Attack can instantiate it on-demand just like its cache method already does.

I think the subclass only has to override two methods, offset_for and period_number_and_time_into. In fact if we combine them into one method that takes both arguments (unprefixed_key, period), it only has to override that one method, two lines of code total, and the base class gets even simpler.

Tests get slightly cleaner too I bet.

If you'd like to see how that looks, I'm happy to write it and see what you think.
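
In the meantime, a rough sketch of the shape I mean (hypothetical names; the combined helper doesn't exist yet, and the real signatures may differ):

    require "rack/attack"
    require "digest/md5"

    class Rack::Attack::CacheWithOffset < Rack::Attack::Cache
      private

      # The base class would return 0 here; only Throttle instantiates
      # this subclass, so nothing else gains an offset.
      def offset_for(unprefixed_key, period)
        Digest::MD5.hexdigest(unprefixed_key).hex % period
      end
    end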

@jamiemccarthy
Author

  3. Thread-safety issue

Yikes, you're right, good catch! Our app uses Unicorn so I'm embarrassed to say I never even thought of this.

Your solution looks great to me, but I would suggest that asking users of the gem to write

retry_after = match_data[:period] - ((now + match_data[:offset]) % match_data[:period])

is too much copypasta and it locks the gem into that logic in the future. It's cleaner and more future-proof to provide a utility method to do that calculation, right? One that only depends on the match_data passed to it?
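
One possible shape for such a helper (the name and the :offset key are just suggestions):

    def retry_after_for(match_data)
      period = match_data[:period]
      period - ((match_data[:epoch_time] + match_data.fetch(:offset, 0)) % period)
    end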

But however you want to do this is fine. Sorry I submitted a thread-unsafe PR 😞

@jamiemccarthy
Author

I have a diff here just in case it's easier: main...santib:rack-attack:jm-random-period-offset. Let me know your thoughts :)

Hey yeah that looks great. You're right about just adding an offset argument that defaults to 0. My Rack::Attack::CacheWithOffset subclass idea was overkill. I like what you did better.

@santib
Collaborator

santib commented Nov 13, 2023

Hi @jamiemccarthy, no worries about timing, this is OSS, so we do the best that we can 😄 Thanks for continuing such an interesting conversation.

  1. Simplifying the enabling/disabling of the offset feature

Agree that it’s all about a tradeoff. The thing is that I think it’d be easier to convince the maintainers of this gem to merge this improvement if we don’t change the default behavior. If this feature ends up being so successful that everyone wants to have it turned on, then we can consider changing the default in a future version. But taking this first step would let us get feedback before making a major change.

Regarding the name of the config, I like your suggestion Rack::Attack.configuration.use_throttle_offsets 👍

  2. Move owner of the offset feature + remove use_offset param

Unfortunately, calculating and using the offset does seem like it has to happen inside cache, and exposing that logic to Throttle feels even more smelly.

I agree it’s weird having the Throttle class affect the key of the Cache. On the other hand, this is how the gem currently works, the caller of the Cache sends it an unprefixed_key and a period which are then used to build the cache key. In that sense, sending the offset seems appropriate to me. Maybe in a separate PR we could explore extracting the cache key generation into its own class CacheKey? I don’t like to overcomplicate things either, but it’s true that currently the cache key generation is spread between the Cache and its callers.

EDIT: another option to simplify things is to have the offset handled by the cache, regardless of whether it's being called from throttle or another caller. Even if it's a useless feature for fail2ban, allow2ban, and safelist, it wouldn't cause any harm, right?

  3. Thread-safety issue

No worries, it’s hard to keep so many things in mind when developing gems, but this reinforces why I’d be very cautious changing such a widely used gem as rack-attack.

Also agree that:

retry_after = match_data[:period] - ((now + match_data[:offset]) % match_data[:period])

is bad. But I don’t think the offset part made it bad; I think it was already too complex. So maybe:

  • We could have a separate PR that returns match_data[:retry_after] or similar, hiding the actual details of the period and stuff.
  • Maybe we can also hide the offset from devs, and make match_data[:epoch_time] include the offset.

Let me know your thoughts 🙌

@jamiemccarthy
Author

Let me know your thoughts

All that sounds fine.

I'm happy to make this PR into the MVP and then look at other changes in later PRs. Want me to go ahead and do that?

Or if you have a specific idea of what you want to do, I'm happy to step back and let you take a crack at it.

@santib
Collaborator

santib commented Nov 20, 2023

Hi @jamiemccarthy,

I'm happy to make this PR into the MVP and then look at other changes in later PRs. Want me to go ahead and do that?

Yes, please 🙌 Once you are done with the changes I'll review it again.

@jamiemccarthy
Author

Hey, I just wanted to say I haven't forgotten about this, I've just gotten busy this past week or two. I'll get on this soon, I hope.

@santib
Collaborator

santib commented Jan 28, 2024

Just thinking out loud, but another option that would help with the spiky throttles would be optionally removing the period from the key.

Did you consider this option? I guess it'd be a bigger and riskier change, so I'm not suggesting we go down this path, just asking.
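
For context, the cache key currently embeds the period number, which is what synchronizes every discriminator's rollover; the alternative would drop it and lean on the cache TTL instead (key shapes shown only roughly, and the second one is just the idea being floated):

    prefix, name, discriminator, period = "rack::attack", "req/ip", "203.0.113.42", 60

    # today: all discriminators share the same period_number boundary
    current_key = "#{prefix}:#{Time.now.to_i / period}:#{name}:#{discriminator}"

    # floated alternative: no period number; each counter lives for its own
    # TTL, starting whenever its key is first written
    alternative_key = "#{prefix}:#{name}:#{discriminator}"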

@jamiemccarthy
Author

Hi Santiago, thanks for the mention. I'll reply on that PR.

I haven't forgotten about trying to make this thread-safe, I've just gotten really busy, sorry about that. I still intend to come back to it, but I'm not sure when.

@MuhammadAhsan66

Just want to say this is absolutely great and I hope to see it merged when it's ready. Great contribution!
@jamiemccarthy Eagerly waiting for this PR to merge. 😊
