
TPU: Replace governor crate with TokenBucket #8154

Merged
alexpyattaev merged 6 commits into anza-xyz:master from alexpyattaev:rm_governor on Oct 15, 2025

Conversation

alexpyattaev commented Sep 23, 2025

Problem

  • The governor crate has known issues, discussed in new token bucket impl #6893
  • The API of Governor does not allow for the kind of handling we need in Streamer

Summary of Changes

  • Remove governor in favor of TokenBucket
  • Update streamer to check rate limits before replying to handshakes

alexpyattaev marked this pull request as ready for review September 23, 2025 11:57
alexpyattaev changed the title remove governor crate in favor of token bucket Replace governor crate with TokenBucket Sep 23, 2025
Comment thread streamer/src/nonblocking/connection_rate_limiter.rs
limiter: DefaultKeyedRateLimiter::keyed(quota),
limiter: KeyedRateLimiter::new(
CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
alexpyattaev (Author):

Allowing bursts of 3 connection requests, with an average rate of limit_per_minute as f64 / 60.0 per second.

Reviewer:

We used to have a NonZero guard against bad input. Should we ensure limit_per_minute as f64 / 60.0 is non-zero?

alexpyattaev (Author):

Technically, the data structure operates correctly if this is set to zero. Since the input here is a const anyway, what would enforcing it to be non-zero achieve?

lijunwangs (Oct 14, 2025):

I missed that we are doing floating-point math here. Does the token bucket support fractions? For example, 30/minute --> 0.5/second? I was concerned someone might put a non-zero number in per-minute terms expecting to see some packets go through, but after dividing by 60 it becomes 0 and turns into a total ban -- a surprise.

alexpyattaev (Author):

Yes, the token bucket supports fractions, so this will allow packets through for any non-zero amount per minute. And for a zero amount it blocks ingress, as one would expect. I.e., type-system-wise it makes no sense to restrict this to a NonZero type.
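The fractional-rate behavior discussed above can be illustrated with a minimal token bucket sketch. This is not the streamer's actual implementation; the constructor shape mirrors the `TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0)` call in the diff, but the internals are assumptions:

```rust
use std::time::Instant;

/// Minimal token bucket with a fractional refill rate (tokens per second).
/// Illustrative sketch only, not the streamer's actual implementation.
pub struct TokenBucket {
    max_tokens: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(max_tokens: u64, initial_tokens: u64, refill_per_sec: f64) -> Self {
        Self {
            max_tokens: max_tokens as f64,
            tokens: initial_tokens as f64,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Add tokens earned since the last refill, capped at the bucket size.
    fn refill(&mut self) {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.last_refill = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.max_tokens);
    }

    /// Try to consume `n` tokens; Err(()) means the caller is rate limited.
    pub fn consume_tokens(&mut self, n: u64) -> Result<(), ()> {
        self.refill();
        if self.tokens >= n as f64 {
            self.tokens -= n as f64;
            Ok(())
        } else {
            Err(())
        }
    }
}

fn main() {
    // 30 connections/minute == 0.5 tokens/second, with a burst budget of 3.
    let mut bucket = TokenBucket::new(3, 3, 30.0 / 60.0);
    assert!(bucket.consume_tokens(1).is_ok()); // within burst capacity
    assert!(bucket.consume_tokens(2).is_ok());
    assert!(bucket.consume_tokens(1).is_err()); // drained; refill is fractional and slow
    // A zero rate with zero initial tokens blocks ingress entirely.
    let mut blocked = TokenBucket::new(0, 0, 0.0);
    assert!(blocked.consume_tokens(1).is_err());
}
```

Because the balance is an `f64`, a rate like 0.5/s accumulates fractional tokens over time rather than rounding down to a permanent ban.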


/// retain only keys whose rate-limiting start date is within the rate-limiting interval.
/// Otherwise drop them as inactive
pub fn retain_recent(&self) {
alexpyattaev (Author):

The new rate limiter does its own internal housekeeping.

CONNECTION_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD,
TokenBucket::new(3, 3, limit_per_minute as f64 / 60.0),
8,
),
alexpyattaev (Author):

The choice of 8 shards here is fairly arbitrary; maybe a better number can be found?

Reviewer:

I think we should have more shards to ensure good performance in our high-concurrency use case, at the expense of slightly higher memory. This only impacts the HashMap metadata overhead, which is not that large. I would go with about 256 for a balance.

alexpyattaev (Author):

Yes, this is a good point; the number of shards should be > the number of workers. Fixed in fb3d399
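The sharding trade-off in this thread can be sketched with a toy sharded map: each shard has its own lock, so with more shards than worker threads, two workers touching different keys rarely contend. The names below are illustrative assumptions, not the actual `KeyedRateLimiter` internals:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::{IpAddr, Ipv4Addr};
use std::sync::Mutex;

/// Toy sharded keyed map: lock contention scales down with shard count,
/// while the only extra cost is per-shard HashMap metadata.
pub struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<IpAddr, V>>>,
}

impl<V> ShardedMap<V> {
    pub fn new(num_shards: usize) -> Self {
        // Power-of-two shard counts let us pick a shard with a cheap mask.
        assert!(num_shards.is_power_of_two());
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    fn shard_index(&self, key: &IpAddr) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) & (self.shards.len() - 1)
    }

    /// Lock only the shard owning `key` and run `f` on its entry.
    pub fn with_entry<R>(
        &self,
        key: IpAddr,
        default: impl FnOnce() -> V,
        f: impl FnOnce(&mut V) -> R,
    ) -> R {
        let mut shard = self.shards[self.shard_index(&key)].lock().unwrap();
        f(shard.entry(key).or_insert_with(default))
    }
}

fn main() {
    // With shards (256) >= worker threads, workers touching different IPs
    // almost never block on the same mutex.
    let map: ShardedMap<u64> = ShardedMap::new(256);
    let ip = IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1));
    map.with_entry(ip, || 0, |count| *count += 1);
    let count = map.with_entry(ip, || 0, |count| *count);
    assert_eq!(count, 1);
}
```

This is why bumping from 8 to a larger power of two is cheap: only the per-shard map headers grow, not the entries themselves.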

Comment thread streamer/src/nonblocking/quic.rs
== SendTransactionStatsNonAtomic {
successfully_sent: 2,
write_error_connection_lost: 2,
connection_error_timed_out: 1,
alexpyattaev (Author):

This is how it should have been: when the rate limit is exceeded, we should get an error.

codecov-commenter commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 73.77049% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.0%. Comparing base (f88c890) to head (fb3d399).
⚠️ Report is 191 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #8154     +/-   ##
=========================================
- Coverage    83.0%    83.0%   -0.1%     
=========================================
  Files         826      826             
  Lines      362234   362206     -28     
=========================================
- Hits       300790   300760     -30     
- Misses      61444    61446      +2     

KirillLykov previously approved these changes Oct 1, 2025
Comment thread streamer/src/nonblocking/connection_rate_limiter.rs

/// Check if the connection from the said `ip` is allowed.
/// Here we assume that only IPs with actual confirmed connections are stored in it,
/// since we should only modify server state once source IP is verifired
Reviewer:

nits: verified

alexpyattaev (Author):

fixed in dd66eae

Self {
limiter: RateLimiter::direct(quota),
// Check if we have records in the rate limiter for the given IP address
dbg!(self.limiter.current_tokens(ip));
Reviewer:

use the debug logging function to be consistent

Reviewer:

And log the conditions for debug purposes.

alexpyattaev (Author):

Ooh thanks, good catch!

alexpyattaev (Author):

fixed in dd66eae

alexpyattaev changed the title Replace governor crate with TokenBucket TPU: Replace governor crate with TokenBucket Oct 2, 2025
Comment thread streamer/src/nonblocking/quic.rs Outdated
// now that we have observed the handshake we can be certain
// that the initiator owns an IP address, we can update rate
// limiters on the server
if overall_connection_rate_limiter.consume_tokens(1).is_err() {
Reviewer:

We should not do the overall connection rate limit first. A noisy connection maker would crowd out others and make it harder for them to make connections. I think we should still do the per-IP-address limit first.

alexpyattaev (Author):

The overall limiter is cheaper to check. Does the order of checks make any difference if we check both anyway?

Reviewer:

Yes. Imagine a noisy connector quickly passing through the overall gate and bringing it close to its limit. The innocent connectors will then have trouble getting through that overall gate, causing a kind of DoS. We used to have the overall limiter first and the keyed limiter after it; we explicitly changed the order because of that concern.

alexpyattaev (Author):

Yes, but if the overall limit is exhausted we will drop the connection no matter the order of checking.

Reviewer:

The conditions for triggering the overall limit will be very different under these two schemes. Let's use an example: suppose the overall limit is 10/s, I have 5 clients, and the target connection rate per client is 2/s.

C1, C2, C3, C4, C5

C1 is the troublemaker: he makes connections at a rate >= 10/s, while C2-C5 are good citizens at 2/s.

With the overall limiter first, because C1 is already tripping it, C2-C5's legitimate requests will most likely be denied.

With the per-address limiter first, no matter how noisy C1 is, only 2/s of its connections are allowed to contend for the overall limiter, making the overall limiter fairer and less prone to DoS. This reduces the incentive for clients to make noisy connections: nothing is gained by connecting more frequently. That is what we want.

alexpyattaev (Author):

Of course you are right! Apologies for being dense here.

alexpyattaev (Author):

Fixed in a780c9a
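The check ordering agreed in this thread can be sketched with simple counting limiters standing in for the real token buckets (names and internals here are illustrative, not the streamer's code):

```rust
// Sketch of the agreed admission order: per-IP limiter first, so a noisy
// client is clipped to its per-IP budget before it can drain the shared
// overall budget. CountLimiter is a simplified stand-in for a token bucket.
use std::collections::HashMap;
use std::net::{IpAddr, Ipv4Addr};

struct CountLimiter {
    used: u64,
    limit: u64,
}

impl CountLimiter {
    fn new(limit: u64) -> Self {
        Self { used: 0, limit }
    }
    fn consume(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn admit(
    ip: IpAddr,
    per_ip: &mut HashMap<IpAddr, CountLimiter>,
    per_ip_limit: u64,
    overall: &mut CountLimiter,
) -> bool {
    // Per-IP check first: a noisy IP fails here without touching `overall`.
    let ip_limiter = per_ip
        .entry(ip)
        .or_insert_with(|| CountLimiter::new(per_ip_limit));
    if !ip_limiter.consume() {
        return false;
    }
    overall.consume()
}

fn main() {
    // lijunwangs's example: overall 10/s, per-IP 2/s, C1 attempts 10 connections.
    let mut per_ip = HashMap::new();
    let mut overall = CountLimiter::new(10);
    let c1 = IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1));
    let admitted_c1 = (0..10)
        .filter(|_| admit(c1, &mut per_ip, 2, &mut overall))
        .count();
    assert_eq!(admitted_c1, 2); // clipped to its per-IP budget
    // The good citizens C2-C5 still find 8 overall tokens left for 2/s each.
    for i in 2..=5u8 {
        let ip = IpAddr::V4(Ipv4Addr::new(1, 1, 1, i));
        assert!(admit(ip, &mut per_ip, 2, &mut overall));
        assert!(admit(ip, &mut per_ip, 2, &mut overall));
    }
}
```

Reversing the two checks in `admit` would let C1's ten attempts drain the overall budget before the per-IP check ever ran, which is exactly the starvation the reviewer described.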

Comment thread streamer/src/nonblocking/quic.rs
lijunwangs left a comment:

LGTM!

alexpyattaev added this pull request to the merge queue Oct 15, 2025
Merged via the queue into anza-xyz:master with commit 589d58b Oct 15, 2025
54 checks passed
alexpyattaev deleted the rm_governor branch October 15, 2025 09:21
apfitzge added a commit to apfitzge/agave that referenced this pull request Oct 22, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Oct 22, 2025
* Revert "TPU: Replace governor crate with TokenBucket (#8154)"

This reverts commit 589d58b.

* lock update for unclean revert
alexpyattaev restored the rm_governor branch October 28, 2025 21:44
alexpyattaev deleted the rm_governor branch October 28, 2025 21:45
rustopian pushed a commit to rustopian/agave that referenced this pull request Nov 20, 2025
…anza-xyz#8627)

* Revert "TPU: Replace governor crate with TokenBucket (anza-xyz#8154)"

This reverts commit 589d58b.

* lock update for unclean revert

4 participants