
new token bucket impl #6893

Merged
alexpyattaev merged 5 commits into anza-xyz:master from alexpyattaev:ratelimiter
Sep 22, 2025
Conversation

@alexpyattaev

@alexpyattaev commented Jul 9, 2025

Problem

  • We use an external crate for a token bucket, which is overkill
  • The way we implement keyed rate limiting is somewhat inefficient

Summary of Changes

  • Make our own glorious token bucket
  • Also make a concurrent-hashmap-based version of the same that uses LazyLRU logic for cleanup instead of a flat loop
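The core idea can be sketched in a few lines. Below is a hypothetical single-threaded token bucket (names and structure are illustrative only; the PR's actual TokenBucket is lock-free and built on atomics):

```rust
use std::time::Instant;

// Hypothetical sketch of the token-bucket idea, NOT the PR's implementation.
struct SimpleBucket {
    tokens: f64,         // current token balance
    max_tokens: f64,     // bucket capacity
    refill_per_sec: f64, // token refill rate
    last_update: Instant,
}

impl SimpleBucket {
    fn new(initial: f64, max: f64, rate: f64) -> Self {
        Self {
            tokens: initial,
            max_tokens: max,
            refill_per_sec: rate,
            last_update: Instant::now(),
        }
    }

    /// Refill based on elapsed time, then try to spend `n` tokens.
    fn try_acquire(&mut self, n: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_update).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.max_tokens);
        self.last_update = now;
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut b = SimpleBucket::new(2.0, 2.0, 1.0);
    assert!(b.try_acquire(1.0));
    assert!(b.try_acquire(1.0));
    // Bucket drained; an immediate third request must fail.
    assert!(!b.try_acquire(1.0));
    println!("ok");
}
```

The keyed variant in the PR maps keys (e.g. peer addresses) to per-key buckets in a concurrent hashmap and evicts stale entries lazily.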

@alexpyattaev added the noCI Suppress CI on this Pull Request label Jul 9, 2025
@alexpyattaev requested a review from KirillLykov July 9, 2025 11:10
@alexpyattaev added CI Pull Request is ready to enter CI and removed noCI Suppress CI on this Pull Request labels Jul 9, 2025
@anza-team removed the CI Pull Request is ready to enter CI label Jul 9, 2025
@codecov-commenter

codecov-commenter commented Jul 9, 2025

Codecov Report

❌ Patch coverage is 90.78498% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.1%. Comparing base (302ff5e) to head (eec664d).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #6893     +/-   ##
=========================================
- Coverage    83.1%    83.1%   -0.1%     
=========================================
  Files         815      816      +1     
  Lines      358629   358922    +293     
=========================================
+ Hits       298071   298309    +238     
- Misses      60558    60613     +55     

@alexpyattaev force-pushed the ratelimiter branch 3 times, most recently from aa91a89 to 58d47a2 on July 13, 2025 18:12
@alexpyattaev marked this pull request as ready for review July 20, 2025 06:01
@alexpyattaev requested a review from lijunwangs August 1, 2025 13:57
#[allow(clippy::arithmetic_side_effects)]
fn maybe_shrink(&self) {
let mut actual_len = 0;
let target_shard_size = self.target_capacity / self.data.shards().len();

What if self.target_capacity < shards?


We should document it and assert probably.

Author

This is a good catch, thank you! target_size becomes zero and the data structure wipes all records every time :( Will patch.
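The failure mode under discussion is easy to reproduce. This is a hypothetical illustration (the function names are made up, not the PR's code): with integer division, `target_capacity / shards` is 0 whenever the capacity is below the shard count, so every shard evicts down to nothing.

```rust
// Naive per-shard target, mirroring the division discussed above:
// integer division truncates to 0 when capacity < shard count.
fn per_shard_target(target_capacity: usize, num_shards: usize) -> usize {
    target_capacity / num_shards
}

// One possible guard (illustrative only; the PR's actual patch may differ):
// enforce at least one slot per shard.
fn per_shard_target_guarded(target_capacity: usize, num_shards: usize) -> usize {
    (target_capacity / num_shards).max(1)
}

fn main() {
    // 16 entries spread over 64 shards: the naive target is 0,
    // so maybe_shrink would evict every record on every call.
    assert_eq!(per_shard_target(16, 64), 0);
    assert_eq!(per_shard_target_guarded(16, 64), 1);
    println!("ok");
}
```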

@alexpyattaev requested a review from a team as a code owner August 18, 2025 12:45
@@ -0,0 +1,106 @@
#![allow(clippy::arithmetic_side_effects)]

@KirillLykov Sep 12, 2025


Why haven't you used some benchmarking framework? And what are the results of the current benchmarking?


@alexpyattaev (Author) Sep 12, 2025


Benchmarking frameworks are not well suited here, since the bench is multithreaded and requires a peculiar setup to run. Running it in a loop 10,000 times does not really show meaningful perf, since you need thread contention.
Here are results:

Running bench_token_bucket...
Run complete over 5 seconds
Accepted 16667, Rejected: 39887821
processed 39904488 requests, 7980897.5 per second
==========
Running bench_token_bucket_eviction...
Run complete over 5 seconds
Max observed size was 406
processed 17113044 requests, 3422608.8 per second
Rejected: 95951
==========
Running bench_keyed_rate_limiter...
Run complete over 5 seconds
Accepted: 1024000 (target 1024000)
Rejected: 37008846
processed 38032846 requests, 7606569.5 per second

TL;DR: we can process about 7M requests per second per bucket; the KeyedRateLimiter may slow things down to 3M requests per second if there is a lot of churn.


Sounds like much more than we'll ever need

Author

Well, we'd want real code to do things other than token buckets, but I do not know how to make this substantially faster; I'm quite certain we are close to hitting HW limits here.

}

impl Clone for TokenBucket {
fn clone(&self) -> Self {

This clone looks a bit suspicious to me because it is a deep copy, while I typically expect atomics to be shared Arc-style. Do we really need it somewhere?

Author

Yes, this is used in the KeyedRateLimiter to clone buckets. It is generally nice to have around so you can mass-produce buckets from a prototype of some kind. There is no state in them that would become invalid if we access the atomics one at a time, so there is no reason not to implement Clone.
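The pattern being discussed can be sketched roughly like this (hypothetical struct, not the PR's actual TokenBucket): Clone loads each atomic individually, producing an independent snapshot rather than a shared handle, which is what makes prototype-style mass production work.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical struct holding atomics, to illustrate snapshot-style Clone.
struct BucketState {
    tokens: AtomicU64,
    last_update: AtomicU64,
}

impl Clone for BucketState {
    fn clone(&self) -> Self {
        // Each field is loaded separately: the clone is a deep copy
        // (a snapshot), unlike Arc<BucketState>, where clones share state.
        Self {
            tokens: AtomicU64::new(self.tokens.load(Ordering::Acquire)),
            last_update: AtomicU64::new(self.last_update.load(Ordering::Acquire)),
        }
    }
}

fn main() {
    let proto = BucketState {
        tokens: AtomicU64::new(5),
        last_update: AtomicU64::new(0),
    };
    let copy = proto.clone();
    // Mutating the clone leaves the prototype untouched.
    copy.tokens.store(99, Ordering::Release);
    assert_eq!(proto.tokens.load(Ordering::Acquire), 5);
    assert_eq!(copy.tokens.load(Ordering::Acquire), 99);
    println!("ok");
}
```

Note that the snapshot is not atomic across fields; as the author says, that is fine here because no cross-field invariant is violated by reading the atomics one at a time.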

self.update_state(now);
match self
.tokens
.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |tokens| {

SeqCst, although the safest option, might be less performant than a Release/Acquire combination. Have you seen any difference?

Author

Yeah, I was hunting a nasty concurrency bug here for a while; now I have found it and dropped to AcqRel and Acquire where applicable.

@alexpyattaev

@vadorovsky I believe I have finally tracked down the concurrency bug that was eating my brain here, so this should be good to go. Now the system that credits tokens is super braindead and relies on simple CAS logic to gate which thread credits for which time interval.

@KirillLykov previously approved these changes Sep 12, 2025

@KirillLykov left a comment


My main concern here was atomics ordering; if @vadorovsky double-checks that it is ok, this looks good to me now.

Adds TokenBucket and KeyedRateLimiter to replace governor crate
and in general allow for better control over rate-limiting options
@alexpyattaev

My main concern here was atomics ordering, if @vadorovsky double checks that it is ok, to me looks good now.

Added a shuttle test based on @vadorovsky's design, the coast is clear =)

/// depositing new tokens (if appropriate)
fn update_state(&self, now: u64) {
// fetch last update time
let last = self.last_update.load(Ordering::SeqCst);
Member


I think Ordering::Acquire would be sufficient here. But SeqCst is not incorrect. In case you don't see any perf difference, I guess it's fine to leave it as it is.

Author

No perf difference :(

.fetch_add(time_to_return, Ordering::Relaxed);
}
Err(_) => {
// Another thread advanced last_update first → nothing we can do now.
Member

I'm wondering if we should actually do something in that case. There is a slight chance that the current thread's now is still higher than the new value set by another thread. To handle that case, we could retry the update.

    fn update_state(&self, now: u64) {
        // `last` is mutable so the CAS can be retried with a fresh value.
        let mut last = self.last_update.load(Ordering::SeqCst);

        // If time has not advanced, nothing to do.
        while now > last {
            match self.last_update.compare_exchange(
                last,
                now,
                Ordering::AcqRel,  // winner publishes new timestamp
                Ordering::Acquire, // loser observes updates
            ) {
                Ok(_) => {
                    // success case: credit tokens for (now - last), then stop.
                    break;
                }
                // THE DIFFERENCE: if the CAS failed, retry with the
                // freshly observed value.
                Err(observed) => {
                    last = observed;
                }
            }
        }
    }

Author

Not having a loop here was a deliberate choice: it reduces the time spent per request, which we want far more than accuracy (this code gets called per packet, and we have on the order of millions of packets).

In the current version we will just mint the tokens on the next call, which is good enough for the intended use, since the probability that the time another thread failed to credit is enough to mint many tokens is low.
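The loop-free crediting scheme described above can be sketched as follows (a hypothetical simplification; the field names and nanosecond units are assumptions, not the PR's exact code). A single CAS elects the crediting thread for an interval; losers simply skip, and any missed time is credited on a later call:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch of loop-free token crediting.
struct Minter {
    last_update: AtomicU64, // timestamp, assumed to be in nanoseconds
    tokens: AtomicU64,
}

impl Minter {
    /// Returns how many tokens this call minted (0 if time did not advance
    /// or another thread won the CAS).
    fn update_state(&self, now: u64, tokens_per_ns: f64) -> u64 {
        let last = self.last_update.load(Ordering::Acquire);
        if now <= last {
            return 0;
        }
        match self
            .last_update
            .compare_exchange(last, now, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => {
                // This thread won: credit tokens for the elapsed interval.
                let minted = ((now - last) as f64 * tokens_per_ns) as u64;
                self.tokens.fetch_add(minted, Ordering::Relaxed);
                minted
            }
            // Lost the race: do nothing now; a later call credits the rest.
            Err(_) => 0,
        }
    }
}

fn main() {
    let m = Minter {
        last_update: AtomicU64::new(0),
        tokens: AtomicU64::new(0),
    };
    // 1000 ns elapsed at 0.001 tokens/ns -> 1 token minted.
    assert_eq!(m.update_state(1_000, 0.001), 1);
    // Time did not advance -> nothing minted.
    assert_eq!(m.update_state(1_000, 0.001), 0);
    println!("ok");
}
```

The trade-off matches the author's point: at worst a sliver of elapsed time is credited one call late, which is negligible at millions of calls per second.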

Member

Fair enough


@vadorovsky left a comment


:shipit:


@gregcusack left a comment


Can you explain more why we need this new token bucket, please? As we've previously discussed, it sounds like we use the governor crate, but that is bloated and has some bugs; I think one of those bugs was letting in too much traffic. How much of a problem is this? I am just wary of reimplementing something from scratch, especially in a core part of the validator. It seems like we need a lot of testing for this. shuttle_test_token_bucket_race is great!

// much of the testing is impossible outside of real multithreading in release mode.
impl TokenBucket {
/// Allocate a new TokenBucket
pub fn new(initial_tokens: u64, max_tokens: u64, new_tokens_per_second: f64) -> Self {

Can we switch the floating point math to fixed point? Over time, it looks like these small rounding errors could add up and create some inconsistent behavior.

Author

Over a billion requests these would add up to a few milliseconds; it is not really a concern. Switching to fixed point would not eliminate them, just reduce them a bit (since we would still have finite precision), and the code would get really ugly (I have tried it already; it becomes very hard to follow). The perf difference is non-existent.
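The scale of the rounding error is easy to check directly. A hypothetical back-of-the-envelope (numbers chosen for illustration, unrelated to the PR's constants): accumulating a non-representable step a million times stays within a tiny relative drift of the exact product.

```rust
fn main() {
    // 0.1 is not exactly representable in binary floating point,
    // so each addition introduces a small rounding error.
    let step = 0.1_f64;
    let n = 1_000_000_u64;
    let mut acc = 0.0_f64;
    for _ in 0..n {
        acc += step;
    }
    let exact = step * n as f64;
    let relative_drift = ((acc - exact) / exact).abs();
    // Accumulated error remains far below one part per million.
    assert!(relative_drift < 1e-6);
    println!("relative drift: {relative_drift:e}");
}
```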

@lijunwangs

I would like to see some data on the difference this makes, like the variation of input requests and the requests passed through, given some limiting configuration, over a period of time, between this and governor.
Do we know the CPU usage difference?

@alexpyattaev


I have done benchmarks before and Governor compares as follows:

8 threads poking at KeyedRateLimiter:

  START             solana-streamer nonblocking::connection_rate_limiter::test::bench_token_bucket
Run complete over 5 seconds
Accepted: 1024000 (target 1024000)  
Rejected:  33741839 

Same setup with Governor crate:

  START             solana-streamer nonblocking::connection_rate_limiter::test::bench_governor
running 1 test
Run complete over 5 seconds
Accepted: 1330995 (target 1024000) // ~30 % off!
Rejected:  25997438

So the atomic token bucket has better perf and is a fair bit more accurate.

The benches for TokenBucket also ensure that it is limiting consistently over each 100ms interval, not just over a long time interval.


@gregcusack left a comment


lgtm!

@alexpyattaev merged commit cc3f387 into anza-xyz:master Sep 22, 2025
54 checks passed
@alexpyattaev deleted the ratelimiter branch September 22, 2025 20:28