
TPU: Replace governor crate with TokenBucket #8154

Merged
alexpyattaev merged 6 commits into anza-xyz:master from alexpyattaev:rm_governor on Oct 15, 2025

Conversation

alexpyattaev commented Sep 23, 2025

Problem

  • The governor crate has known issues, discussed in new token bucket impl #6893
  • The API of Governor does not allow for the kind of handling we need in Streamer

Summary of Changes

  • Remove governor in favor of TokenBucket
  • Update streamer to check rate limits before replying to handshakes

alexpyattaev marked this pull request as ready for review September 23, 2025 11:57
alexpyattaev changed the title remove governor crate in favor of token bucket Replace governor crate with TokenBucket Sep 23, 2025
Comment thread streamer/src/nonblocking/connection_rate_limiter.rs
limiter: DefaultKeyedRateLimiter::keyed(quota),
limiter: KeyedRateLimiter::new(
CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
alexpyattaev (Author):

Allowing bursts of 3 connection requests, with an average rate of limit_per_minute as f64 / 60.0 per second.

Reviewer:

We used to have a NonZero guard against bad input. Should we ensure limit_per_minute as f64 / 60.0 is non-zero?

alexpyattaev (Author):

Technically, the data structure operates correctly if this is set to zero. Since the input here is a const anyway, what would enforcing it to be non-zero achieve?

lijunwangs (Oct 14, 2025):

I missed that we are doing floating-point math here. Does the token bucket support fractions? For example, 30/minute --> 0.5/second? I was concerned someone might put a non-zero number in per-minute terms expecting to see some packets go through, but after dividing by 60 it becomes 0 and turns into a total ban -- a surprise.

alexpyattaev (Author):

Yes, the token bucket supports fractions, so this will allow packets through for any non-zero amount per minute. And for a zero amount it blocks ingress, as one would expect. I.e., type-system-wise it makes no sense to restrict this to a NonZero type.
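The fractional-rate behavior discussed above can be illustrated with a minimal token bucket sketch. This is not the streamer's actual implementation; the constructor shape mirrors the `TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0)` call in the diff, but the internals are assumptions:

```rust
use std::time::Instant;

/// Minimal token bucket with a fractional refill rate (tokens per second).
/// Illustrative sketch only, not the streamer's actual implementation.
pub struct TokenBucket {
    max_tokens: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(max_tokens: u64, initial_tokens: u64, refill_per_sec: f64) -> Self {
        Self {
            max_tokens: max_tokens as f64,
            tokens: initial_tokens as f64,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Add tokens earned since the last refill, capped at the bucket size.
    fn refill(&mut self) {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.max_tokens);
    }

    /// Try to consume `n` tokens; Err(()) means the caller is rate limited.
    pub fn consume_tokens(&mut self, n: u64) -> Result<(), ()> {
        self.refill();
        if self.tokens >= n as f64 {
            self.tokens -= n as f64;
            Ok(())
        } else {
            Err(())
        }
    }
}

fn main() {
    // 30 connections/minute == 0.5 tokens/second, with a burst budget of 3.
    let mut bucket = TokenBucket::new(3, 3, 30.0 / 60.0);
    assert!(bucket.consume_tokens(1).is_ok()); // within burst capacity
    assert!(bucket.consume_tokens(2).is_ok());
    assert!(bucket.consume_tokens(1).is_err()); // drained; refill is fractional and slow
    // A zero rate with zero initial tokens blocks ingress entirely.
    let mut blocked = TokenBucket::new(0, 0, 0.0);
    assert!(blocked.consume_tokens(1).is_err());
}
```

Because the balance is an `f64`, a rate like 0.5/s accumulates fractional tokens over time rather than rounding down to a permanent ban.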


/// retain only keys whose rate-limiting start date is within the rate-limiting interval.
/// Otherwise drop them as inactive
pub fn retain_recent(&self) {
alexpyattaev (Author):

The new rate limiter does its own internal housekeeping.

CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
8,
),
alexpyattaev (Author):

The choice of 8 shards here is fairly arbitrary; maybe a better number can be found?

Reviewer:

I think we should have more shards to ensure good performance in our high-concurrency use case, at the expense of slightly higher memory. This only impacts the HashMap metadata overhead, which is not that large. I would go with about 256 for a balance.

alexpyattaev (Author):

Yes, this is a good point; the number of shards should be > the number of workers. Fixed in fb3d399
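The sharding trade-off in this thread can be sketched with a toy sharded map: each shard has its own lock, so with more shards than worker threads, two workers touching different keys rarely contend. The names below are illustrative assumptions, not the actual `KeyedRateLimiter` internals:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::{IpAddr, Ipv4Addr};
use std::sync::Mutex;

/// Toy sharded keyed map: lock contention scales down with shard count,
/// while the only extra cost is per-shard HashMap metadata.
pub struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<IpAddr, V>>>,
}

impl<V> ShardedMap<V> {
    pub fn new(num_shards: usize) -> Self {
        // Power-of-two shard counts let us pick a shard with a cheap mask.
        assert!(num_shards.is_power_of_two());
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard_index(&self, key: &IpAddr) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) & (self.shards.len() - 1)
    }

    /// Lock only the shard owning `key` and run `f` on its entry.
    pub fn with_entry<R>(
        &self,
        key: IpAddr,
        default: impl FnOnce() -> V,
        f: impl FnOnce(&mut V) -> R,
    ) -> R {
        let mut shard = self.shards[self.shard_index(&key)].lock().unwrap();
        f(shard.entry(key).or_insert_with(default))
    }
}

fn main() {
    // With shards (256) >= worker threads, workers touching different IPs
    // almost never block on the same mutex.
    let map: ShardedMap<u64> = ShardedMap::new(256);
    let ip = IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1));
    map.with_entry(ip, || 0, |count| *count += 1);
    let count = map.with_entry(ip, || 0, |count| *count);
    assert_eq!(count, 1);
}
```

This is why bumping from 8 to a larger power of two is cheap: only the per-shard map headers grow, not the entries themselves.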

Comment thread streamer/src/nonblocking/quic.rs
== SendTransactionStatsNonAtomic {
successfully_sent: 2,
write_error_connection_lost: 2,
connection_error_timed_out: 1,
alexpyattaev (Author):

This is how it should have been: when the rate limit is exceeded, we should get an error.

codecov-commenter commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 73.77049% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.0%. Comparing base (f88c890) to head (fb3d399).
⚠️ Report is 191 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #8154     +/-   ##
=========================================
- Coverage    83.0%    83.0%   -0.1%     
=========================================
  Files         826      826             
  Lines      362234   362206     -28     
=========================================
- Hits       300790   300760     -30     
- Misses      61444    61446      +2     

KirillLykov previously approved these changes Oct 1, 2025
Comment thread streamer/src/nonblocking/connection_rate_limiter.rs

/// Check if the connection from the said `ip` is allowed.
/// Here we assume that only IPs with actual confirmed connections are stored in it,
/// since we should only modify server state once source IP is verifired
Reviewer:

nits: verified

alexpyattaev (Author):

fixed in dd66eae

Self {
limiter: RateLimiter::direct(quota),
// Check if we have records in the rate limiter for the given IP address
dbg!(self.limiter.current_tokens(ip));
Reviewer:

use the debug logging function to be consistent

Reviewer:

And log the conditions for debug purposes.

alexpyattaev (Author):

Ooh thanks, good catch!

alexpyattaev (Author):

fixed in dd66eae

alexpyattaev changed the title Replace governor crate with TokenBucket TPU: Replace governor crate with TokenBucket Oct 2, 2025
Comment thread streamer/src/nonblocking/quic.rs Outdated
// now that we have observed the handshake we can be certain
// that the initiator owns an IP address, we can update rate
// limiters on the server
if overall_connection_rate_limiter.consume_tokens(1).is_err() {
Reviewer:

We should not do the overall connection rate limit first. A noisy connection maker would crowd out others and make it harder for them to make connections. I think we should still do the per-IP-address limit first.

alexpyattaev (Author):

The overall limiter is cheaper to check. Does the order of checks make any difference if we check both anyway?

Reviewer:

Yes. Imagine a noisy connector quickly passing through the overall gate and bringing it close to its limit. The innocent connectors will then have trouble getting through that overall gate, causing a kind of DoS. We used to have the overall limiter first and the keyed limiter after it; we explicitly changed the order because of that concern.

alexpyattaev (Author):

Yes, but if the overall limit is exhausted we will drop the connection no matter the order of checking.

Reviewer:

The conditions for triggering the overall limit will be very different under these two schemes. Let's use an example: suppose the overall limit is 10/s, I have 5 clients, and the target connection rate per client is 2/s.

C1, C2, C3, C4, C5

C1 is the troublemaker: he makes connections at a rate >= 10/s, while C2-C5 are good citizens at 2/s.

With the overall limiter first, because C1 is already tripping it, C2-C5's legitimate requests will most likely be denied.

With the per-address limiter first, no matter how noisy C1 is, only 2/s of its connections are allowed to contend for the overall limiter, making the overall limiter fairer and less prone to DoS. This reduces the incentive for clients to make noisy connections: nothing is gained by connecting more frequently. That is what we want.

alexpyattaev (Author):

Of course you are right! Apologies for being dense here.

alexpyattaev (Author):

Fixed in a780c9a
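The check ordering agreed in this thread can be sketched with simple counting limiters standing in for the real token buckets (names and internals here are illustrative, not the streamer's code):

```rust
// Sketch of the agreed admission order: per-IP limiter first, so a noisy
// client is clipped to its per-IP budget before it can drain the shared
// overall budget. CountLimiter is a simplified stand-in for a token bucket.
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

struct CountLimiter {
    used: u64,
    limit: u64,
}

impl CountLimiter {
    fn new(limit: u64) -> Self {
        Self { used: 0, limit }
    }
    fn consume(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn admit(
    ip: IpAddr,
    per_ip: &mut HashMap<IpAddr, CountLimiter>,
    per_ip_limit: u64,
    overall: &mut CountLimiter,
) -> bool {
    // Per-IP check first: a noisy IP fails here without touching `overall`.
    let ip_limiter = per_ip
        .entry(ip)
        .or_insert_with(|| CountLimiter::new(per_ip_limit));
    if !ip_limiter.consume() {
        return false;
    }
    overall.consume()
}

fn main() {
    // lijunwangs's example: overall 10/s, per-IP 2/s, C1 attempts 10 connections.
    let mut per_ip = HashMap::new();
    let mut overall = CountLimiter::new(10);
    let c1 = IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1));
    let admitted_c1 = (0..10)
        .filter(|_| admit(c1, &mut per_ip, 2, &mut overall))
        .count();
    assert_eq!(admitted_c1, 2); // clipped to its per-IP budget
    // The good citizens C2-C5 still find 8 overall tokens left for 2/s each.
    for i in 2..=5u8 {
        let ip = IpAddr::V4(Ipv4Addr::new(1, 1, 1, i));
        assert!(admit(ip, &mut per_ip, 2, &mut overall));
        assert!(admit(ip, &mut per_ip, 2, &mut overall));
    }
}
```

Reversing the two checks in `admit` would let C1's ten attempts drain the overall budget before the per-IP check ever ran, which is exactly the starvation the reviewer described.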

Comment thread streamer/src/nonblocking/quic.rs
lijunwangs left a comment:

LGTM!

alexpyattaev added this pull request to the merge queue Oct 15, 2025
Merged via the queue into anza-xyz:master with commit 589d58b Oct 15, 2025
54 checks passed
alexpyattaev deleted the rm_governor branch October 15, 2025 09:21
apfitzge added a commit to apfitzge/agave that referenced this pull request Oct 22, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Oct 22, 2025
* Revert "TPU: Replace governor crate with TokenBucket (#8154)"

This reverts commit 589d58b.

* lock update for unclean revert
alexpyattaev restored the rm_governor branch October 28, 2025 21:44
alexpyattaev deleted the rm_governor branch October 28, 2025 21:45
rustopian pushed a commit to rustopian/agave that referenced this pull request Nov 20, 2025
…anza-xyz#8627)

* Revert "TPU: Replace governor crate with TokenBucket (anza-xyz#8154)"

This reverts commit 589d58b.

* lock update for unclean revert

4 participants