
SWQOS: a better implementation #8880

Closed
alexpyattaev wants to merge 2 commits into anza-xyz:master from alexpyattaev:better_swqos

Conversation

@alexpyattaev

@alexpyattaev alexpyattaev commented Nov 4, 2025

Problems

  • SWQOS introduces separation between staked and unstaked connections with artificial stake thresholds

  • SWQOS uses EMA averaging, whose parameters are not well documented and do not line up well with real loads (see also #10033, "Increase the duration of the EMA smoothing window (STREAM_LOAD_EMA_INTERVAL_COUNT)", which fixed some of that)

  • We want TPS budget unused by connection X to be used by other connections (including the unstaked ones)

  • We want strong guarantees that we will not throttle anyone unless overall load is high enough

  • We want to have no limits on the amount of connections we can have open and idle (currently we "preallocate" bandwidth to unstaked connections).

  • We want to lay the foundations for the streamer to directly control the MAX_STREAM parameter of the connection instead of throttling reads as it does now.

Summary of Changes

  • Adds the new SWQOS implementation (thanks @KirillLykov for help with initial design)

Fixes #8863

Design Description & Overview

  1. Assume all unstaked connections have an implicit stake of 1000 SOL to simplify the math. Choose, for example, maxTPS = 1000000.
  2. Give each connection an atomic token bucket. The number of tokens a bucket can hold is proportional to stake.
  3. Every 20 ms, refill all token buckets. The amount to add to each bucket is X = effective_stake / total_effective_stake * maxTPS * 20 / 1000. X is clamped to (1, MAX_TOKENS) to ensure we always put at least some tokens into every bucket on every iteration.
  4. If a given bucket cannot fit X tokens, whatever does not fit is overflow. effective_stake = stake * tokens_consumed / tokens_allocated, i.e. if all tokens went to overflow, then effective_stake = 0.
  5. Update total_effective_stake as the sum of the effective_stakes of all connections.
  6. During stream consumption, check whether there is a token in the bucket: if yes, consume the stream; if not, sleep until the bucket is refilled. This will be replaced with logic that drives the MAX_STREAM parameter on the connection instead.

This way any unused bandwidth gets redistributed to other peers in a stake-proportional manner. The code is about 1/5 the size of the original SWQOS + EMA.
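To make the refill math above concrete, here is a minimal sketch of one refill pass over all buckets. The constants, struct, and field names are assumptions made for this illustration, not the PR's actual code (which lives in streamer/src/nonblocking/swqos.rs):

```rust
// Illustrative refill pass; constants and names are assumed for this sketch.
const MAX_TPS: u64 = 1_000_000;
const REFILL_INTERVAL_MS: u64 = 20;
const MAX_TOKENS: u64 = 10_000; // assumed clamp ceiling for a single refill

pub struct Bucket {
    pub tokens: u64,
    pub capacity: u64, // proportional to stake
    pub stake: u64,
    pub effective_stake: u64,
}

/// Refill every bucket once and return the new total_effective_stake.
pub fn refill(buckets: &mut [Bucket], total_effective_stake: u64) -> u64 {
    let mut new_total = 0;
    for b in buckets.iter_mut() {
        // X = effective_stake / total_effective_stake * maxTPS * 20 / 1000
        let x = (b.effective_stake as u128 * MAX_TPS as u128 * REFILL_INTERVAL_MS as u128
            / (total_effective_stake.max(1) as u128 * 1000)) as u64;
        let x = x.clamp(1, MAX_TOKENS); // always grant at least one token
        // Tokens that do not fit in the bucket are overflow.
        let consumed = x.min(b.capacity - b.tokens);
        b.tokens += consumed;
        // effective_stake = stake * tokens_consumed / tokens_allocated
        b.effective_stake = b.stake * consumed / x;
        new_total += b.effective_stake;
    }
    new_total
}
```

A peer whose bucket stays full (an idle connection) ends up with effective_stake = 0, so its share of maxTPS is redistributed to active peers on the next round.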

Open questions:

  • Is 20 ms a good number? Should we refill more often?
  • How many tokens should a bucket accumulate to smooth out load peaks?
  • How much TPS should we assign to unstaked connections when the TPU is congested? Is 1000 fake SOL correct?
  • Should we gift new connections a bunch of tokens? Currently we gift 500 tokens so they can send right away without waiting for the refill round.


Test results

5 staked IDs, each with 4 connections, 5ms latency

['EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4', '0']
['CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk', '200000']
['HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR', '400000']
['4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV', '600000']
['Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ', '800000']

Targeting 500K TPS with new code:

Server captured 1508505 transactions (502835 TPS)
EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4: sent=4053 got=4022 lost 31 (1340 TPS) 
CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk: sent=163268 got=163193 lost 75 (54397 TPS)
HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR: sent=321687 got=321456 lost 231 (107152 TPS)
4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV: sent=482394 got=482390 lost 4 (160796 TPS)
Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ: sent=537522 got=537444 lost 78 (179148 TPS)

stake proportionality:

Host:CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk
Stake:10.0%
Transactions:10.5%

Host:HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR
Stake:20.0%
Transactions:20.2%

Host:4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV
Stake:30.0%
Transactions:31.6%

Host:Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ
Stake:40.0%
Transactions:37.6%

Timelapse of transfers:

[TPS-over-time chart]

Legacy streamer (targeting 500K TPS):

Server captured 1750835 transactions (350167 TPS)
EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4: sent=1185 got=1178 lost 7 (235 TPS)
CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk: sent=285115 got=285111 lost 4 (57022 TPS)
HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR: sent=488173 got=488169 lost 4 (97633 TPS)
4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV: sent=488267 got=488263 lost 4 (97652 TPS)
Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ: sent=488118 got=488114 lost 4 (97622 TPS)

Critically:

Host:CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk
Stake:10.0%
Transactions:16.3% // 6% too much

Host:HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR
Stake:20.0%
Transactions:27.9% // 8% too much

Host:4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV
Stake:30.0%
Transactions:27.9% // missing 2% TPS

Host:Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ
Stake:40.0%
Transactions:27.9% // missing 12% TPS

So clearly the legacy streamer could do better at reaching the target TPS.
[TPS-over-time chart]

@alexpyattaev alexpyattaev changed the title draft of a better SWQOS implementation SWQOS: a better implementation Nov 4, 2025
@alexpyattaev alexpyattaev added the noCI Suppress CI on this Pull Request label Nov 4, 2025
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 2 times, most recently from 872c299 to bf5878b Compare November 6, 2025 23:05
@bw-solana

If all buckets are full, how much TPS (or bandwidth or w/e) would that translate into? How many transactions during the next 10ms interval?

I'm a little concerned that overfill represents past underutilization, but it is being given to hungrier connections for future utilization. Past underutilization has already been lost.

In practice, I think we've IBRLed ingest enough that we can rely on the law of averages + large numbers and just oversubscribe a bit and be fine. But theoretically, there could be some burst capacity problems

@alexpyattaev
Author

If all buckets are full, how much TPS (or bandwidth or w/e) would that translate into? How many transactions during the next 10ms interval?

The overflow from last round gets reassigned to every active consumer in the current round, so at any given point in time there is at most 2x tokens floating around. If we had zero usage for a while (all buckets full) and then suddenly TXs start arriving at full speed, we'd allow 2x sustained TPS target for 10ms, then start throttling based on immediate load.

I'm a little concerned that overfill represents past underutilization, but it is being given to hungrier connections for future utilization. Past underutilization has already been lost.

It is one of the simplest ways to achieve O(N) complexity for the refill algorithm across N connection table keys; for 4000 connections this is desirable. An exact solution generally requires O(N^2) compute, so it is out of the question. There are some O(N log N) solutions too, but they are more complex implementation-wise.

In practice, I think we've IBRLed ingest enough that we can rely on the law of averages + large numbers and just oversubscribe a bit and be fine. But theoretically, there could be some burst capacity problems

Yeah I'm looking into some algorithms that do not require overflow but instead estimate true demand per connection and adjust refill for next round based on past demand. So if you did not use your tokens for a while, your refill rate would be zero (but your bucket is full). Once you start using the bandwidth, refill rate gets bumped to reflect your stake at the start of next round.
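That demand-estimation idea could be sketched roughly as follows. All names here are hypothetical (this is not the scheme the PR eventually adopted, just an illustration of the principle): an idle peer keeps a full bucket but gets a zero refill rate, and its weight snaps back to its stake as soon as it starts spending tokens.

```rust
// Hypothetical demand-tracking refill: idle peers keep a full bucket but get
// zero refill rate; active peers are weighted by their full stake.
pub struct Conn {
    pub stake: u64,
    pub used_last_round: u64, // tokens consumed during the previous interval
    pub tokens: u64,
    pub capacity: u64,
}

fn demand_weight(c: &Conn) -> u64 {
    // Count a peer's stake only if it spent tokens last round.
    if c.used_last_round > 0 { c.stake } else { 0 }
}

pub fn refill_by_demand(conns: &mut [Conn], tokens_per_round: u64) {
    let total: u64 = conns.iter().map(demand_weight).sum();
    for c in conns.iter_mut() {
        let grant = if total == 0 {
            0
        } else {
            // Split the round's budget among active peers, stake-proportionally.
            tokens_per_round * demand_weight(c) / total
        };
        c.tokens = (c.tokens + grant).min(c.capacity);
    }
}
```

The full bucket is what lets a previously idle peer burst immediately; the refill rate only matters once that buffer is being drained.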

@alexpyattaev alexpyattaev added CI Pull Request is ready to enter CI and removed noCI Suppress CI on this Pull Request labels Nov 15, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Nov 15, 2025
@codecov-commenter

codecov-commenter commented Nov 15, 2025

Codecov Report

❌ Patch coverage is 91.30435% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.7%. Comparing base (f57c835) to head (2214028).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8880      +/-   ##
==========================================
+ Coverage    69.4%    82.7%   +13.3%     
==========================================
  Files         847      855       +8     
  Lines      256453   321259   +64806     
==========================================
+ Hits       178012   265786   +87774     
+ Misses      78441    55473   -22968     

@alexpyattaev
Author

@bw-solana no more overflow reallocation, now the algorithm predicts future demand based on past use, and allocates correctly in one go. This actually ends up being far more flexible too.

@alexpyattaev alexpyattaev marked this pull request as ready for review November 22, 2025 23:23
@alexpyattaev alexpyattaev requested review from stablebits and removed request for lijunwangs November 22, 2025 23:23

impl QosController<SwQosConnectionContext> for SwQos {
async fn async_init(&mut self) {
tokio::spawn(refill_task(

This API is confusing to me. async_init() sounds like it’s doing some optional setup that’s only needed when the object is used in an async environment, but here it actually starts refill_task(), which looks like a core part of the functionality. Since this is the only place where refill_task() is called, it seems like the object can’t really be used in a sync environment anyway (or refill_task() would need to be called explicitly?). If that’s the case, why not move this into new() or new_async() (this would also remove mut in mut qos : T in run_server())?

Also, does async_init() take &mut self with the intention that the task handle will be stored at some point? As it is now, the task is detached, which makes things even less clear.


@stablebits stablebits Nov 24, 2025


Also, won't this task become a bottleneck (by consistently missing its 10ms deadline) in the presence of 1000s of active connection tasks? And so most of these tasks would go to sleep before their stakes are recalculated...

btw., if we do keep these sleeps for now, it looks like the tasks would only need to sleep until the start of the next 10 ms window and wake up once their tokens were recalculated.

What if instead of sleeping they at least simply yield immediately? (like the first step to get rid of sleep).

Author


async_init is marked async to prevent users from erroneously calling it outside of a tokio context (which would result in a runtime panic): spawning a tokio task can only be done from inside the runtime. new_async is also an option, but it would break the API similarity with SimpleQos.

The refill task needs to loop over at most 6000 connections. Each connection requires a few microseconds (the math is trivial, and it only acquires 2 locks). We could refill less often at the cost of a slower response to load changes; this needs to be tested.

Author


Added benchmarks in 9ddfd1b. One refill over 20K connections takes ~300 microseconds. So we can totally afford it even with the lock contention over connection tables.

@alexpyattaev alexpyattaev force-pushed the better_swqos branch 3 times, most recently from c8d6204 to a6101ad Compare November 30, 2025 12:12
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 2 times, most recently from 47a8ecf to b313448 Compare January 21, 2026 15:46
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 3 times, most recently from 9ddfd1b to 66be57a Compare January 22, 2026 23:38
@stablebits

@anza-xyz/networking
This might be a better implementation of the 'proactive' throttling approach. But I still think that we should first align on the overall design, since there are 2 camps here (or maybe 3):

  1. Proactive throttling (this PR): the TPU enforces a TPS target one way or another (like via the token bucket in this PR), essentially trying to protect the rest of the pipeline from overload
    (which may involve dropping transactions).
  2. Reactive throttling: the TPU reads freely until the pipeline signals back-pressure, then selectively throttles connections (for example, using QUIC's native flow control).
  3. Hybrid: maybe a mix of both proactive (but high limit; safety) and reactive.

I still lean toward the reactive approach 🤔. A static TPS cap will likely always be wrong in some direction — too low and we leave capacity on the table, too high and we're not protecting anything. The actual capacity likely varies with transaction mix and internal state, and hopefully improves with pipeline optimisations over time. With reactive throttling, the pipeline's own processing rate becomes the clock, and there's zero overhead when the system isn't under pressure (which is most of the time).

For back-pressure, the channel between TPU and the processing pipeline has a bounded capacity (Kirill's changes). We could monitor its fill level.
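For illustration only, that fill-level monitoring could be as simple as a shared counter around the bounded channel. The type, method names, and the 75% high-water threshold below are all assumptions for this sketch, not part of any existing design:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 1024;               // assumed bounded-channel capacity
const HIGH_WATER: usize = CAPACITY * 3 / 4; // assumed threshold for engaging throttling

/// Tracks how full the TPU -> pipeline channel is.
pub struct PressureGauge {
    queued: AtomicUsize,
}

impl PressureGauge {
    pub fn new() -> Self {
        Self { queued: AtomicUsize::new(0) }
    }
    pub fn on_enqueue(&self) {
        self.queued.fetch_add(1, Ordering::Relaxed);
    }
    pub fn on_dequeue(&self) {
        // Caller must pair every dequeue with a prior enqueue.
        self.queued.fetch_sub(1, Ordering::Relaxed);
    }
    /// True once the channel fill level crosses the high-water mark,
    /// signalling that selective throttling should kick in.
    pub fn under_pressure(&self) -> bool {
        self.queued.load(Ordering::Relaxed) >= HIGH_WATER
    }
}
```

The streamer would consult under_pressure() before admitting new streams; below the threshold, reads proceed with zero throttling overhead.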

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

In any case, I think it'd be best to put all the design options on the table, involve relevant stakeholders, make a decision, and then execute (and stop arguing about it).

@alexpyattaev
Author

alexpyattaev commented Jan 28, 2026

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

As long as a remote attacker can cause transactions that have been admitted via TPU to get dropped at a later stage, legit users will be forced to send multiple copies to compensate. This was the case when UDP was used, and it was not pretty.

Also, in the case of reactive throttling, we still need swqos to figure out whom to throttle once congestion starts. Even in a fully reactive approach you need accounting of who used how much and a mechanism to enforce throttling.

@stablebits

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

As long as a remote attacker can cause transactions that have been admitted via TPU to get dropped at a later stage, legit users will be forced to send multiple copies to compensate. This was the case when UDP was used, it was not pretty.

Also in case of reactive throttling, we still need swqos to figure. out whom to throttle once congestion starts. Even in a fully reactive approach you need accounting of who used how much and a mechanism to enforce throttling.

Yes, but the fundamental difference is that the artificial TPS cap would not be enforced by TPU in this case.

@alexpyattaev
Author

Yes, but the fundamental difference is that the artificial TPS cap would not be enforced by TPU in this case.

Well let us try to put it together and see how it performs.

@gregcusack

gregcusack commented Jan 29, 2026

I agree with @stablebits that we shouldn't throttle until necessary. We already had this debate with his PR here: #9580. I don't think it is productive to keep going back and forth. @alessandrod has already helped us settle this debate and asked us to implement "(2) Reactive Throttling".

Setting the TPS per client stake is way too finicky and hard to get right. Accept freely and throttle only when there is backpressure. And yes, we can use swqos when we actually want to start throttling.

@alexpyattaev
Author

Setting the TPS per client-stake is wayy too finicky and hard to get right. Accept and throttle only when backpressure. And yes, we can use swqos when we actually want to start throttling

Throttle whom and by how much "when backpressure"? We'd still need logic to figure out whom to throttle once backpressure kicks in. Which logic do we use for that?



Development

Successfully merging this pull request may close these issues.

TPU QUIC Connection Throttling

6 participants