TPU: Replace governor crate with TokenBucket #8154
alexpyattaev merged 6 commits into anza-xyz:master from
Conversation
```diff
- limiter: DefaultKeyedRateLimiter::keyed(quota),
+ limiter: KeyedRateLimiter::new(
+     CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
+     TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
```
This allows bursts of 3 connection requests, with an average rate of `limit_per_minute as f64 / 60.0` per second.
We used to have a NonZero guard against bad input. Should we ensure `limit_per_minute as f64 / 60.0` is non-zero?
Technically, the data structure operates correctly if this is set to zero. Since the input here is a const anyway, what would enforcing it to be non-zero achieve?
I missed that we are doing floating-point math here. Does the token bucket support fractional rates? For example, 30/minute --> 0.5/second? I was concerned that someone could put a non-zero number in per-minute, expect to see some packets going through, and after division by 60 it becomes 0 and turns into a total ban -- a surprise.
Yes, the token bucket does support fractions, so this will allow packets through for any non-zero per-minute amount. And for a zero amount it would block ingress, as one would expect. I.e., type-system-wise it makes no sense to limit this to a NonZero type.
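For illustration, here is a minimal token-bucket sketch (names, fields, and signatures are assumptions, not the PR's actual `TokenBucket`) showing why fractional refill rates work: tokens are tracked as `f64`, so 30/minute becomes 0.5 tokens accrued per second, while a zero rate never refills and blocks all ingress.

```rust
use std::time::Instant;

/// Minimal token-bucket sketch; not the PR's implementation.
struct SimpleTokenBucket {
    tokens: f64,         // current balance, may be fractional
    max_tokens: f64,     // burst capacity (3 in the diff above)
    refill_per_sec: f64, // e.g. limit_per_minute as f64 / 60.0
    last_update: Instant,
}

impl SimpleTokenBucket {
    fn new(initial: f64, max_tokens: f64, refill_per_sec: f64) -> Self {
        Self { tokens: initial, max_tokens, refill_per_sec, last_update: Instant::now() }
    }

    /// Try to consume `amount` tokens; Err(()) means the caller is rate-limited.
    fn consume_tokens(&mut self, amount: f64) -> Result<(), ()> {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_update).as_secs_f64();
        self.last_update = now;
        // Fractional accrual: at 0.5 tokens/s a whole token appears every 2 s;
        // at 0.0 tokens/s the balance never grows and every request fails.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.max_tokens);
        if self.tokens >= amount {
            self.tokens -= amount;
            Ok(())
        } else {
            Err(())
        }
    }
}
```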
```diff
- /// retain only keys whose rate-limiting start date is within the rate-limiting interval.
- /// Otherwise drop them as inactive
- pub fn retain_recent(&self) {
```
The new rate limiter does its own internal housekeeping, so `retain_recent` is no longer needed.
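A hedged sketch of what such inline housekeeping can look like (the threshold name comes from the diff; everything else is an assumption about the real `KeyedRateLimiter`):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

// Hypothetical per-key bucket: a fully refilled bucket means the key has been
// idle long enough that dropping it loses no rate-limiting state.
struct Bucket { tokens: f64, max_tokens: f64 }
impl Bucket {
    fn is_idle(&self) -> bool { self.tokens >= self.max_tokens }
}

/// Inline housekeeping replacing the old `retain_recent()`: once the map
/// outgrows the cleanup threshold (cf. CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD),
/// idle entries are evicted as a side effect of the lookup, so callers never
/// schedule a separate cleanup pass.
fn get_or_insert(map: &mut HashMap<IpAddr, Bucket>, ip: IpAddr, threshold: usize) -> &mut Bucket {
    if map.len() > threshold {
        map.retain(|_, b| !b.is_idle());
    }
    // New buckets start full, granting the initial burst allowance.
    map.entry(ip).or_insert(Bucket { tokens: 3.0, max_tokens: 3.0 })
}
```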
```diff
+ CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
+ TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
+ 8,
+ ),
```
The choice of 8 shards here is fairly arbitrary; maybe a better number can be found?
I think we should use a higher shard count to ensure good performance in our high-concurrency use case, at the expense of slightly higher memory. This only impacts the HashMap metadata overhead, which is not that large. I would suggest about 256 as a balance.
Yes, this is a good point; the number of shards should be > the number of workers. Fixed in fb3d399.
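To make the sharding point concrete, here is a hypothetical sketch of shard selection (the real `KeyedRateLimiter` layout may differ): each key hashes to one of N independently locked shards, so with more shards than workers, two workers rarely contend on the same lock.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::net::IpAddr;
use std::sync::Mutex;

struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<IpAddr, V>>>, // one lock per shard
    hasher: RandomState,
}

impl<V> ShardedMap<V> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
            hasher: RandomState::new(),
        }
    }

    /// Hash the key to pick its shard; only that shard's Mutex is taken,
    /// so the extra memory cost is just per-shard HashMap metadata.
    fn shard_for(&self, key: &IpAddr) -> &Mutex<HashMap<IpAddr, V>> {
        let mut h = self.hasher.build_hasher();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }
}
```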
```rust
== SendTransactionStatsNonAtomic {
    successfully_sent: 2,
    write_error_connection_lost: 2,
    connection_error_timed_out: 1,
```
This is how it should have been: when the rate limit is exceeded, we should get an error.
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #8154      +/-   ##
==========================================
- Coverage    83.0%    83.0%     -0.1%
==========================================
  Files         826      826
  Lines      362234   362206       -28
==========================================
- Hits       300790   300760       -30
- Misses      61444    61446        +2
```
```diff
+ /// Check if the connection from the said `ip` is allowed.
+ /// Here we assume that only IPs with actual confirmed connections are stored in it,
+ /// since we should only modify server state once the source IP is verified.
```
```diff
- Self {
-     limiter: RateLimiter::direct(quota),
+ // Check if we have records in the rate limiter for the given IP address
+ dbg!(self.limiter.current_tokens(ip));
```
Use the debug logging macro instead of `dbg!` to be consistent.
And log the conditions for debug purposes.
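A sketch of the suggested change, assuming the crate uses the standard `log` facade (the surrounding types are stand-ins, not the PR's API):

```rust
use log::debug;
use std::net::IpAddr;

// Stand-in for the limiter from the diff; only the logging pattern matters.
struct Limiter;
impl Limiter {
    fn current_tokens(&self, _ip: IpAddr) -> f64 { 3.0 }
}

fn log_limiter_state(limiter: &Limiter, ip: IpAddr) {
    // Unlike dbg!, which always prints to stderr, debug! respects the
    // configured log level and matches the crate's existing logging style.
    debug!("current tokens for {ip}: {:?}", limiter.current_tokens(ip));
}
```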
```diff
+ // now that we have observed the handshake we can be certain
+ // that the initiator owns an IP address, we can update rate
+ // limiters on the server
+ if overall_connection_rate_limiter.consume_tokens(1).is_err() {
```
We should not check the overall connection rate limit first. A noisy connection maker will crowd out others and make it harder for them to establish connections. I think we should still check the per-IP-address limit first.
The overall limiter is cheaper to check. Does the order of checks make any difference if we check both anyway?
Yes. Imagine a noisy connector quickly passing through the overall gate and pushing consumption close to the limit. An innocent connector will then have trouble getting through the overall gate, causing a kind of DoS. We used to have the overall limiter first and the keyed limiter after it; we explicitly changed the order due to that concern.
Yes, but if the overall limit is exhausted, we will drop the connection no matter the order of checking.
The conditions for triggering the overall limit will be very different under the two schemes. Let's use an example: suppose the overall limit is 10/s, there are 5 clients (C1 through C5), and the target connection rate per client is 2/s.
C1 is the troublemaker, making connections at a rate >= 10/s, while C2-C5 are good citizens at 2/s.
With the overall limiter first, because C1 alone is already tripping the limit, the legitimate requests from C2-C5 will most likely be denied.
With the per-address limiter first, no matter how noisy C1 is, only 2/s of its connections are allowed to contend with the overall limiter. That makes the overall limiter fairer and less prone to DoS, and it removes the incentive for clients to make noisy connections: nothing is gained by connecting more frequently. That is what we want.
Of course you are right! Apologies for being dense here.
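A condensed sketch of the ordering the thread converges on (trait names and signatures are invented for illustration): check the per-IP limiter first, so a noisy client is capped at its per-IP rate before it can consume any of the shared budget.

```rust
use std::net::IpAddr;

// Invented traits standing in for the PR's keyed and overall limiters.
trait PerIpLimiter { fn try_acquire(&self, ip: IpAddr) -> bool; }
trait OverallLimiter { fn try_acquire(&self) -> bool; }

fn allow_connection(per_ip: &impl PerIpLimiter, overall: &impl OverallLimiter, ip: IpAddr) -> bool {
    // Per-IP check first: C1's excess requests die here and never touch
    // the shared budget that C2-C5 depend on.
    if !per_ip.try_acquire(ip) {
        return false;
    }
    // Only per-IP-compliant traffic contends for the overall tokens.
    overall.try_acquire()
}
```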
This reverts commit 589d58b.
…anza-xyz#8627)
* Revert "TPU: Replace governor crate with TokenBucket (anza-xyz#8154)". This reverts commit 589d58b.
* lock update for unclean revert
Problem
Connection rate limiting in the TPU relies on the external governor crate.

Summary of Changes
Replace the governor-based rate limiters with the in-house TokenBucket and KeyedRateLimiter implementations.