
SWQOS: a better implementation #8880

Closed
alexpyattaev wants to merge 2 commits into anza-xyz:master from alexpyattaev:better_swqos

Conversation

@alexpyattaev

@alexpyattaev alexpyattaev commented Nov 4, 2025

Problems

  • SWQOS introduces separation between staked and unstaked connections with artificial stake thresholds

  • SWQOS uses EMA averaging, whose parameters are not well documented and do not line up well with real loads (see also #10033, "Increase the duration of the EMA smoothing window (STREAM_LOAD_EMA_INTERVAL_COUNT)", which fixed some of that)

  • We want TPS budget unused by connection X to be used by other connections (including the unstaked ones)

  • We want strong guarantees that we will not throttle anyone unless overall load is high enough

  • We want to have no limits on the amount of connections we can have open and idle (currently we "preallocate" bandwidth to unstaked connections).

  • We want to lay the foundations for the streamer to directly control the MAX_STREAM parameter of the connection instead of throttling reads as it does now.

Summary of Changes

  • Adds the new SWQOS implementation (thanks @KirillLykov for help with initial design)

Fixes #8863

Design Description & Overview

  1. Assume all unstaked connections have an implicit stake of 1000 SOL to simplify the math. Choose, for example, maxTPS = 1000000.
  2. Give each connection an atomic token bucket. The number of tokens a bucket can hold is proportional to stake.
  3. Every 20 ms, refill all token buckets. The amount to add to each bucket is X = effective_stake / total_effective_stake * maxTPS * 20 / 1000. X is clamped to (1, MAX_TOKENS) to ensure we always put at least some tokens into every bucket on every iteration.
  4. If a given bucket cannot fit X tokens, whatever does not fit is overflow. effective_stake = stake * tokens_consumed / tokens_allocated, i.e. if all tokens went to overflow, then effective_stake = 0.
  5. Update total_effective_stake as the sum of the effective_stakes of all connections.
  6. During stream consumption, check whether there is a token in the bucket: if yes, consume the stream; if not, sleep until the bucket is refilled. This will be replaced with logic that drives the MAX_STREAM parameter on the connection instead.

This way any unused bandwidth gets redistributed to other peers in a stake-proportional manner. The code is about 1/5 the size of the original SWQOS + EMA.
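To make the refill math above concrete, here is a minimal sketch of one refill pass over all buckets. The constants, struct, and field names are assumptions made for this illustration, not the PR's actual code (which lives in streamer/src/nonblocking/swqos.rs):

```rust
// Illustrative refill pass; constants and names are assumed for this sketch.
const MAX_TPS: u64 = 1_000_000;
const REFILL_INTERVAL_MS: u64 = 20;
const MAX_TOKENS: u64 = 10_000; // assumed clamp ceiling for a single refill

pub struct Bucket {
    pub tokens: u64,
    pub capacity: u64, // proportional to stake
    pub stake: u64,
    pub effective_stake: u64,
}

/// Refill every bucket once and return the new total_effective_stake.
pub fn refill(buckets: &mut [Bucket], total_effective_stake: u64) -> u64 {
    let mut new_total = 0;
    for b in buckets.iter_mut() {
        // X = effective_stake / total_effective_stake * maxTPS * 20 / 1000
        let x = (b.effective_stake as u128 * MAX_TPS as u128 * REFILL_INTERVAL_MS as u128
            / (total_effective_stake.max(1) as u128 * 1000)) as u64;
        let x = x.clamp(1, MAX_TOKENS); // always grant at least one token
        // Tokens that do not fit in the bucket are overflow.
        let consumed = x.min(b.capacity - b.tokens);
        b.tokens += consumed;
        // effective_stake = stake * tokens_consumed / tokens_allocated
        b.effective_stake = b.stake * consumed / x;
        new_total += b.effective_stake;
    }
    new_total
}
```

A peer whose bucket stays full (an idle connection) ends up with effective_stake = 0, so its share of maxTPS is redistributed to active peers on the next round.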

Open questions:

  • Is 20 ms a good number? Should we refill more often?
  • How many tokens should a bucket accumulate to smooth out load peaks?
  • How much TPS should we assign to unstaked connections when the TPU is congested? Is 1000 fake SOL correct?
  • Should we gift new connections a bunch of tokens? Currently we gift 500 tokens so they can send right away without waiting for the refill round.


Test results

5 staked IDs, each with 4 connections, 5ms latency

['EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4', '0']
['CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk', '200000']
['HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR', '400000']
['4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV', '600000']
['Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ', '800000']

Targeting 500K TPS with new code:

Server captured 1508505 transactions (502835 TPS)
EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4: sent=4053 got=4022 lost 31 (1340 TPS) 
CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk: sent=163268 got=163193 lost 75 (54397 TPS)
HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR: sent=321687 got=321456 lost 231 (107152 TPS)
4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV: sent=482394 got=482390 lost 4 (160796 TPS)
Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ: sent=537522 got=537444 lost 78 (179148 TPS)

stake proportionality:

Host:CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk
Stake:10.0%
Transactions:10.5%

Host:HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR
Stake:20.0%
Transactions:20.2%

Host:4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV
Stake:30.0%
Transactions:31.6%

Host:Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ
Stake:40.0%
Transactions:37.6%

Timelapse of transfers:

[TPS-over-time chart]

Legacy streamer (targeting 500K TPS):

Server captured 1750835 transactions (350167 TPS)
EaS2hq3AuHT6yfWaySzcHrmUkxhcf9uoiW1BSGK19Ho4: sent=1185 got=1178 lost 7 (235 TPS)
CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk: sent=285115 got=285111 lost 4 (57022 TPS)
HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR: sent=488173 got=488169 lost 4 (97633 TPS)
4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV: sent=488267 got=488263 lost 4 (97652 TPS)
Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ: sent=488118 got=488114 lost 4 (97622 TPS)

Critically:

Host:CSn1HwUodERYzWyPT7XqEncg3wUREzH41j3VQjdz8wZk
Stake:10.0%
Transactions:16.3% // 6% too much

Host:HZ5BaRKySw3vvnuCd8PXHU9FTW3MM595HKw5myMSdBDR
Stake:20.0%
Transactions:27.9% // 8% too much

Host:4W9ncwL5kD2UwgkbJx6FeLpboQr243iimZsVYSNrqwuV
Stake:30.0%
Transactions:27.9% // missing 2% TPS

Host:Dn9tqdTVnFEHc4sS7Dd8ELT49pSRcGNeLRWV5o9fVXQJ
Stake:40.0%
Transactions:27.9% // missing 12% TPS

So clearly the legacy streamer could do better at reaching the target TPS.
[TPS-over-time chart]

@alexpyattaev alexpyattaev changed the title draft of a better SWQOS implementation SWQOS: a better implementation Nov 4, 2025
@alexpyattaev alexpyattaev added the noCI Suppress CI on this Pull Request label Nov 4, 2025
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 2 times, most recently from 872c299 to bf5878b Compare November 6, 2025 23:05
@bw-solana

If all buckets are full, how much TPS (or bandwidth or w/e) would that translate into? How many transactions during the next 10ms interval?

I'm a little concerned that overfill represents past underutilization, but it is being given to hungrier connections for future utilization. Past underutilization has already been lost.

In practice, I think we've IBRLed ingest enough that we can rely on the law of averages + large numbers and just oversubscribe a bit and be fine. But theoretically, there could be some burst capacity problems

@alexpyattaev
Author

If all buckets are full, how much TPS (or bandwidth or w/e) would that translate into? How many transactions during the next 10ms interval?

The overflow from last round gets reassigned to every active consumer in the current round, so at any given point in time there is at most 2x tokens floating around. If we had zero usage for a while (all buckets full) and then suddenly TXs start arriving at full speed, we'd allow 2x sustained TPS target for 10ms, then start throttling based on immediate load.

I'm a little concerned that overfill represents past underutilization, but it is being given to hungrier connections for future utilization. Past underutilization has already been lost.

It is one of the simplest ways to achieve O(N) complexity for the refill algorithm across N connection table keys; for 4000 connections this is desirable. An exact solution generally requires O(N^2) compute, so it is out of the question. There are some O(N log N) solutions too, but they are more complex implementation-wise.

In practice, I think we've IBRLed ingest enough that we can rely on the law of averages + large numbers and just oversubscribe a bit and be fine. But theoretically, there could be some burst capacity problems

Yeah I'm looking into some algorithms that do not require overflow but instead estimate true demand per connection and adjust refill for next round based on past demand. So if you did not use your tokens for a while, your refill rate would be zero (but your bucket is full). Once you start using the bandwidth, refill rate gets bumped to reflect your stake at the start of next round.
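That demand-estimation idea could be sketched roughly as follows. All names here are hypothetical (this is not the scheme the PR eventually adopted, just an illustration of the principle): an idle peer keeps a full bucket but gets a zero refill rate, and its weight snaps back to its stake as soon as it starts spending tokens.

```rust
// Hypothetical demand-tracking refill: idle peers keep a full bucket but get
// zero refill rate; active peers are weighted by their full stake.
pub struct Conn {
    pub stake: u64,
    pub used_last_round: u64, // tokens consumed during the previous interval
    pub tokens: u64,
    pub capacity: u64,
}

fn demand_weight(c: &Conn) -> u64 {
    // Count a peer's stake only if it spent tokens last round.
    if c.used_last_round > 0 { c.stake } else { 0 }
}

pub fn refill_by_demand(conns: &mut [Conn], tokens_per_round: u64) {
    let total: u64 = conns.iter().map(demand_weight).sum();
    for c in conns.iter_mut() {
        let grant = if total == 0 {
            0
        } else {
            // Split the round's budget among active peers, stake-proportionally.
            tokens_per_round * demand_weight(c) / total
        };
        c.tokens = (c.tokens + grant).min(c.capacity);
    }
}
```

The full bucket is what lets a previously idle peer burst immediately; the refill rate only matters once that buffer is being drained.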

@alexpyattaev alexpyattaev added CI Pull Request is ready to enter CI and removed noCI Suppress CI on this Pull Request labels Nov 15, 2025
@anza-team anza-team removed the CI Pull Request is ready to enter CI label Nov 15, 2025
@codecov-commenter

codecov-commenter commented Nov 15, 2025

Codecov Report

❌ Patch coverage is 91.30435% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.7%. Comparing base (f57c835) to head (2214028).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8880      +/-   ##
==========================================
+ Coverage    69.4%    82.7%   +13.3%     
==========================================
  Files         847      855       +8     
  Lines      256453   321259   +64806     
==========================================
+ Hits       178012   265786   +87774     
+ Misses      78441    55473   -22968     

@alexpyattaev
Author

@bw-solana no more overflow reallocation, now the algorithm predicts future demand based on past use, and allocates correctly in one go. This actually ends up being far more flexible too.

@alexpyattaev alexpyattaev marked this pull request as ready for review November 22, 2025 23:23
@alexpyattaev alexpyattaev requested review from stablebits and removed request for lijunwangs November 22, 2025 23:23

impl QosController<SwQosConnectionContext> for SwQos {
async fn async_init(&mut self) {
tokio::spawn(refill_task(

This API is confusing to me. async_init() sounds like it’s doing some optional setup that’s only needed when the object is used in an async environment, but here it actually starts refill_task(), which looks like a core part of the functionality. Since this is the only place where refill_task() is called, it seems like the object can’t really be used in a sync environment anyway (or refill_task() would need to be called explicitly?). If that’s the case, why not move this into new() or new_async() (this would also remove mut in mut qos : T in run_server())?

Also, does async_init() take &mut self with the intention that the task handle will be stored at some point? As it is now, the task is detached, which makes things even less clear.


@stablebits stablebits Nov 24, 2025


Also, won't this task become a bottleneck (by consistently missing its 10ms deadline) in the presence of 1000s of active connection tasks? And so most of these tasks would go to sleep before their stakes are recalculated...

btw., if we do keep these sleeps for now, it looks like the tasks would only need to sleep until the start of the next 10 ms window and wake up once their tokens were recalculated.

What if instead of sleeping they at least simply yield immediately? (like the first step to get rid of sleep).

Author


async_init is marked async to prevent users from erroneously calling it outside of a tokio context (which would result in a runtime panic): spawning a tokio task can only be done from inside the runtime. new_async is also an option, but it would break the API similarity with SimpleQos.

The refill task needs to loop over at most 6000 connections. Each connection requires a few microseconds (the math is trivial, and it only acquires 2 locks). We could refill less often at the cost of a slower response to load changes; this needs to be tested.

Author


Added benchmarks in 9ddfd1b. One refill over 20K connections takes ~300 microseconds. So we can totally afford it even with the lock contention over connection tables.

@alexpyattaev alexpyattaev force-pushed the better_swqos branch 3 times, most recently from c8d6204 to a6101ad Compare November 30, 2025 12:12
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 2 times, most recently from 47a8ecf to b313448 Compare January 21, 2026 15:46
@alexpyattaev alexpyattaev force-pushed the better_swqos branch 3 times, most recently from 9ddfd1b to 66be57a Compare January 22, 2026 23:38
@stablebits

@anza-xyz/networking
This might be a better implementation of the 'proactive' throttling approach. But I still think that we should first align on the overall design, since there are 2 camps here (or maybe 3):

  1. Proactive throttling (this PR): the TPU enforces a TPS target one way or another (like via the token bucket in this PR), essentially trying to protect the rest of the pipeline from overload
    (which may involve dropping transactions).
  2. Reactive throttling: the TPU reads freely until the pipeline signals back-pressure, then selectively throttles connections (for example, using QUIC's native flow control).
  3. Hybrid: maybe a mix of both proactive (but high limit; safety) and reactive.

I still lean toward the reactive approach 🤔. A static TPS cap will likely always be wrong in some direction — too low and we leave capacity on the table, too high and we're not protecting anything. The actual capacity likely varies with transaction mix and internal state, and hopefully improves with pipeline optimisations over time. With reactive throttling, the pipeline's own processing rate becomes the clock, and there's zero overhead when the system isn't under pressure (which is most of the time).

For back-pressure, the channel between TPU and the processing pipeline has a bounded capacity (Kirill's changes). We could monitor its fill level.
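For illustration only, that fill-level monitoring could be as simple as a shared counter around the bounded channel. The type, method names, and the 75% high-water threshold below are all assumptions for this sketch, not part of any existing design:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 1024;               // assumed bounded-channel capacity
const HIGH_WATER: usize = CAPACITY * 3 / 4; // assumed threshold for engaging throttling

/// Tracks how full the TPU -> pipeline channel is.
pub struct PressureGauge {
    queued: AtomicUsize,
}

impl PressureGauge {
    pub fn new() -> Self {
        Self { queued: AtomicUsize::new(0) }
    }
    pub fn on_enqueue(&self) {
        self.queued.fetch_add(1, Ordering::Relaxed);
    }
    pub fn on_dequeue(&self) {
        // Caller must pair every dequeue with a prior enqueue.
        self.queued.fetch_sub(1, Ordering::Relaxed);
    }
    /// True once the channel fill level crosses the high-water mark,
    /// signalling that selective throttling should kick in.
    pub fn under_pressure(&self) -> bool {
        self.queued.load(Ordering::Relaxed) >= HIGH_WATER
    }
}
```

The streamer would consult under_pressure() before admitting new streams; below the threshold, reads proceed with zero throttling overhead.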

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

In any case, I think it'd be best to put all the design options on the table, involve relevant stakeholders, make a decision, and then execute (and stop arguing about it).

@alexpyattaev
Author

alexpyattaev commented Jan 28, 2026

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

As long as a remote attacker can cause transactions that have been admitted via TPU to get dropped at a later stage, legit users will be forced to send multiple copies to compensate. This was the case when UDP was used, and it was not pretty.

Also, in the case of reactive throttling, we still need swqos to figure out whom to throttle once congestion starts. Even in a fully reactive approach you need accounting of who used how much and a mechanism to enforce throttling.

@stablebits

I understand the concern that without any TPU limit, a burst could in principle overwhelm the pipeline before back-pressure propagates, and transactions would need to be discarded. But this is arguably an extreme case, and the same problem occurs with proactive throttling if the TPS target is set too high for a given load profile.

As long as a remote attacker can cause transactions that have been admitted via TPU to get dropped at a later stage, legit users will be forced to send multiple copies to compensate. This was the case when UDP was used, it was not pretty.

Also in case of reactive throttling, we still need swqos to figure. out whom to throttle once congestion starts. Even in a fully reactive approach you need accounting of who used how much and a mechanism to enforce throttling.

Yes, but the fundamental difference is that the artificial TPS cap would not be enforced by TPU in this case.

@alexpyattaev
Author

Yes, but the fundamental difference is that the artificial TPS cap would not be enforced by TPU in this case.

Well let us try to put it together and see how it performs.

@gregcusack

gregcusack commented Jan 29, 2026

I agree with @stablebits that we shouldn't throttle until necessary. We already had this debate with his PR here: #9580. I don't think it is productive to keep going back and forth. @alessandrod has already helped us settle this debate and asked us to implement "(2) Reactive Throttling".

Setting the TPS per client stake is way too finicky and hard to get right. Accept freely and throttle only when there is backpressure. And yes, we can use swqos when we actually want to start throttling.

@alexpyattaev
Author

Setting the TPS per client-stake is wayy too finicky and hard to get right. Accept and throttle only when backpressure. And yes, we can use swqos when we actually want to start throttling

Throttle whom and by how much "when backpressure"? We'd still need logic to figure out whom to throttle once backpressure kicks in. Which logic do we use for that?



Development

Successfully merging this pull request may close these issues.

TPU QUIC Connection Throttling

6 participants