-
Notifications
You must be signed in to change notification settings - Fork 336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor: less-spiky throttles #578
base: main
Are you sure you want to change the base?
Conversation
This is a wonderful change! Thank you for contributing it! |
Let me know if there are any questions I can answer about this PR, or if you need me to squash commits or reformat anything. I merged I know there's a fair bit going on here and the diff is pretty large. But I think it's an improvement worth doing, and most of the diff is in the tests which isn't as daunting as it looks. |
Wonderful work! |
Conflicts: lib/rack/attack/cache.rb Resolved (just a code comment)
This PR looks wonderful, and I'd love to see it merged 🙂 |
Just want to say this is absolutely great and I hope to see it merged when it's ready. Great contribution! |
+1 This is a great change. Exactly what we're seeing here and looking at. |
For responses that did not exceed a throttle limit, Rack::Attack annotates the env with match data: | ||
For responses that exceeded a throttle limit, Rack::Attack annotates the env with match data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think both of these sentences are somehow inaccurate (or incomplete) 🤔. In spec/rack_attack_throttle_spec.rb
we can see that request.env['rack.attack.throttle_data'][name]
is always set (when using the throttling feature), except when the provided block returns nil
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
data = { | ||
discriminator: discriminator, | ||
count: count, | ||
period: current_period, | ||
limit: current_limit, | ||
epoch_time: cache.last_epoch_time | ||
epoch_time: cache.last_epoch_time, | ||
retry_after: cache.last_retry_after_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will set the retry_after
value on every request (not just the throttled ones). Is that the intended behavior? or should it be only for throttled requests?
Hi @jamiemccarthy thanks a lot for the well detailed description. I think you brought up an interesting problem, but I'm still not clear on what the solution should be. I'm curious if there is any reason why it wouldn't be enough to make use of the period as a proc feature to achieve the behavior you are describing. What I mean is that period_proc = proc { |req| (Digest::MD5.hexdigest(req.ip).hex % 6) + 1 }
Rack::Attack.throttle("by ip", limit: 1, period: period_proc) do |request|
request.ip
end and that will distribute them in a similar way, right? I tried it in your test |
Hi there! Great question! That doesn't work because what it's doing is adjusting the size of the period. In your example, the period for some IP's would be 1, which would reset their counter every second. What this PR does is keep a constant period, but adjust the offset into it pseudo-randomly. |
Thanks for clarifying @jamiemccarthy, it makes sense now. After giving it more thought, I believe that the core of your solution is the right one, and that it would be a nice addition to the gem ❤️ That being said, I do have some general comments and things to discuss if you'd like: 1. Simplifying the enabling/disabling of the offset featureI think the use_offset = throttled_responder_is_offset_aware ||
(!throttled_response && throttled_responder == DEFAULT_THROTTLED_RESPONDER) conditional adds complexity both to the gem maintenance and to the users of the gem. I find it a bit confusing that the offset feature can get turned on/off automatically without respecting the flag, just because a custom responder is set or not. For example even if you have the flag set to Not saying it's wrong but maybe WDYT about splitting it in a separate PR so we can further discuss it there, and in this PR going with a more straight forward solution using the flag like this use_offset = throttled_responder_is_offset_aware ? The flag will be off by default so people using custom responders wouldn't face issues, and we can add a warning message in the README to only enable the offset feature if you are not using a custom responder or if you are going to adapt the custom responder along with it. I think this will also simplify tests, for example not needing to add this line which I find hard to understand without all this context. 2. Move owner of the offset feature + remove
|
Hi @santib , thanks for your interest in this and your thoughtful comments! Sorry I didn't get a chance to respond until this week. I'll reply one point at a time.
I agree with you that my My goal with this boolean was to let apps upgrade this gem to automatically get the new offset functionality if that can happen transparently, and to continue with the old behavior by default if it can't. That way there's no need to bump the major version number to indicate a breaking change, and people upgrading don't even have to read the changelog. Everything stays compatible and the best (imho) behavior happens automatically. My hunch is that most installations are using But it's a tradeoff and I totally understand if you want to make that trade a little differently. Maybe it's not super important if we make them go ahead and enable it manually, as described in the changelog and readme. If you're cool with that, great, let's do that instead. If we do go that way, |
I have no problem with that, but I'd want to see how it would be done. I wasn't especially happy with adding so much code to Unfortunately, calculating and using the offset does seem like it has to happen inside cache, and exposing that logic to Here's an alternative idea. What if my original approach with the So what if we create a subclass I think the subclass only has to override two methods, Tests get slightly cleaner too I bet. If you'd like to see how that looks, I'm happy to write it and see what you think. |
Yikes, you're right, good catch! Our app uses Unicorn so I'm embarrassed to say I never even thought of this. Your solution looks great to me, but I would suggest asking users of the gem to write
is too much copypasta and it locks the gem into that logic in the future. It's cleaner and more future-proof to provide a utility method to do that calculation, right? One that only depends on the But however you want to do this is fine. Sorry I submitted a thread-unsafe PR 😞 |
Hey yeah that looks great. You're right about just adding an |
Hi @jamiemccarthy, no worries about timing, this is OSS, so we do the best that we can 😄 Thanks for continuing such an interesting conversation.
Agree that it’s all about a tradeoff. The thing is that I think it’d be easier to convince the maintainers of this gem to merge this improvement if we don’t change the default behavior. If this feature ends up being so successful that everyone wants to have it turned on, then we can consider changing the default in a future version. But taking this first step would let us get feedback before making a major change. Regarding the name of the config, I like your suggestion
I agree it’s weird having the EDIT: another option simplify things is to make the offset stuff be handled by the cache, ignoring if it's being called from throttle or other caller. Even if it's a useless feature for fail2ban, allow2ban, and safelist, it wouldn't cause any harm right?
No worries, it’s hard to keep so many things in mind when developing gems, but this reinforces why I’d be very cautious changing such a widely used gem as Also agree that:
is bad. But I don’t think the offset part made it bad, I think it already is too complex. So maybe:
Let me know your thoughts 🙌 |
All that sounds fine. I'm happy to make this PR into the MVP and then look at other changes in later PRs. Want me to go ahead and do that? Or if you had a specific idea of what you wanted to do, I'm happy to go ahead and let you take a bite at it. |
Hi @jamiemccarthy,
Yes, please 🙌 Once you are done with the changes I'll review it again. |
Hey, I just wanted to say I haven't forgotten about this, I've just gotten busy this past week or two. I'll get on this soon, I hope. |
Just thinking out loud, but another option that would help with the spiky throttles would be optionally removing the period from the key. Did you consider this option? I guess it'd be a bigger and more risky change, so I'm not suggesting to go down this path, but just asking. |
Hi Santiago, thanks for the mention. I'll reply on that PR. I haven't forgotten about trying to make this thread-safe, I've just gotten really busy, sorry about that. I still intend to come back to it, but I'm not sure when. |
Just want to say this is absolutely great and I hope to see it merged when it's ready. Great contribution! |
Overview
My company's been using Rack::Attack for years and it's an important part of how our Rails application controls traffic. But because every IP's 60-second period for its requests rolls over at the same time, we were seeing a spiky pattern to traffic.
At the top of the minute, all the badly-behaving clients' requests would be passed through our application's Rails server. Then over the course of that minute they would gradually get themselves blocked. Near the end of the minute, traffic would be lower. And it would repeat, with Rails getting slammed again at the top of each minute.
It's better for performance purposes to spread that traffic out over the minute as evenly as possible. This is especially important if we want to improve our 99th-percentile performance.
This PR implements that by giving each discriminator (such as an IP number) a pseudo-random offset into the time period. While the same duration is used, in our case 60 seconds, each individual IP's rollover time could be fixed at :00, :59, or anywhere in-between.
Badly-behaving clients can still slam Rails when their period rolls over. But they don't all slam it together.
The change is implemented inside
Rack::Attack::Cache
. Thecount
method now takes an optional argument to enable the period offset, defaulting to the old behavior.Rack::Attack::Throttle
makes use of it, unless there's an incompatible callback (see below). But it wouldn't make sense for safelist, blocklist and track, so their behavior is unaffected.The effect
Vox Media switched our main production external-facing web app to this branch, 06787b5 to be precise, earlier this year. It's been running fine, serving hundreds of requests per second, for a month.
Before: Here's what a DDoS-attempt from 12 IP's looked like before the offset functionality was enabled. The empty gaps at the top of each minute are where the bot's requests were not being throttled and all 12 attacker IPs were being served actual traffic all at once. This concentrated the attack's load on our app servers into those first few seconds of each minute -- the traffic "spikes" show up as the negative space between the 429 bars:
After: And here's a different DDoS-attempt, with the offset enabled. Each of the attacker's 2,742 IPs still has the same number of allowed requests per minute, but each IP's counter resets at a different time instead of landing all at once, spreading out the load across the whole minute:
(In that time period, almost all of the attack came from just a handful of IPs, with 2,700+ others contributing only a little traffic each. But despite that extreme concentration, the random offset was still sufficient to spread out the load.)
Ensuring custom-responder correctness
So as we just saw, this change should work well, and work transparently, for applications that are using Rack::Attack to throttle public non-authenticated traffic.
However, there are some compatibility issues when it's used to return helpful information for what the readme calls "well-behaved" traffic, for example authenticated customers hitting an API.
The problem is that this PR makes
Rack::Attack::Cache
use a private method to calculate when the end of the period will come, which is the value we want to emit to friendly clients. But for some years the README has been advising applications to calculate it themselves, and has provided sample code that doesn't return the same result as this PR now does.With this branch's changes, now that previous sample code, or any similar code, would calculate the wrong result. The throttling effect will be the same, but applications' lambdas would be emitting incorrect information.
Now, I have of course updated the
DEFAULT_THROTTLED_RESPONDER
lambda to read that value out ofmatch_data
. So if an application is just settingthrottled_response_retry_after_header = true
and using that default, it'll work correctly.But if an application has written their own custom responder, perhaps based on previous sample code, this PR disables its own functionality, to make sure the correct answer is returned. Oh well! The throttling behavior doesn't change and the gem works as it did before.
When application authors notice this change, they can update their lambda to be compatible, and when they do, I've provided a config option to restore this PR's functionality:
throttled_responder_is_offset_aware = true
. I've updated the README's sample code to reflect this.Because this ensures backwards compatibility, with functionality automatically remaining the same if necessary, I believe that should allow this PR to be released in a 6.x version.
I've written tests to ensure correctness in all three cases (default responder, custom not-aware, custom aware). And Vox Media has run each of the latter two cases in production.
Tests improved
A number of tests had been written assuming that they knew how to generate the key they were checking for. That's an implementation detail that this PR now changes. So I changed them to query a class method for that data (via
send
since it's a private method).I also added unit tests for that private method's key generation, to make sure that code is separately covered.
This also has the advantage of removing all the non-Timecop-frozen
Time.now
references from tests, each of which is an unlikely but potential failure. I don't promise to have found every case where Timecop can absolutely rule out tests breaking, but I think I got most of them, and the test suite should be a little more reliable now.Each of the three cases described above is separately tested.
Other considerations and my rationales
Is this too unpredictable for friendlies?
Sometimes apps want to throttle "friendly" traffic, promising paying customers a certain amount of requests. They may be using Rack::Attack for this purpose.
So let's consider the case of an app that wants to throttle "well-behaved" customers in a predictable way, promising them x requests per week or month. We might wonder if rolling over that week or month at an unpredictable time would make the customer unhappy.
I consider this unlikely. Currently the period must roll over at
% period == 0
which for a week is not Sunday midnight but an oddball 7 PM or 8 PM Eastern Time every Wednesday depending on daylight savings. And this can't be used for month-by-month traffic: there's no way to specify an exact period for the number of seconds in a month, so nobody's doing that.If any applications are dependent on throttling customer traffic per-day with the expected behavior of rolling over at GMT midnight, this PR encapsulates that logic in
Rack::Attack::Cache.offset_for
. A targeted monkey-patch could pick any time of day. Maybe their customers would prefer midnight in their local time zone.But any such applications are probably using a throttled callback to report this information to customers' clients, and if those clients can't figure out what to do that seems like a client problem, not an application problem.
If this behavior change seems like a deal-breaker, I'd be happy to make
offset_for
a bit more complex and, for example, always return0
forperiod == 86400
, orperiod >= 86400
, or similar.But I don't think this would be necessary, because I think Rack::Attack is already not the right solution for these needs.
1-second as choice of resolution
I round offsets off to the nearest second, and consider that sufficient. If someone feels it would be helpful to round to a tenth of a second, I'd have no objection. But it would have to be an extremely high-traffic site to notice the difference, and it'd require rethinking tests a bit.
MD5 as choice of hash
This PR works only because the discriminator (IP number or what-have-you) gets hashed to a predictable value. I picked
Digest::MD5
to generate the hash value. It's fast, and MD5's infamous issues for cryptography don't concern us for this purpose.Its speed seems fine. On my Chromebook, in a simple benchmark on an IP address, MD5 and modulo run in about 10 microseconds. That's 2 orders of magnitude under a typical round-trip time to the cache, and 3-4 under servicing a typical web request. So I think that's good enough.
(In case anyone's curious,
Zlib.crc32
isn't better, and stripping off a prefix/suffix of thehexdigest
before callingString#hex
only makes it slower. The gem doesn't support Ruby < 2.4, so we don't have to worry about Fixnum vs. Bignum: it's all just Integer, and%
on MD5's 128-bit Integer is no problem for Ruby.)One more thing
I kept this PR as on-target as possible, but in this case I did want to correct a documentation error. For the last nine years the README has said that match_data annotation happens for:
This is backwards and I changed it to: