
use BDP for streamer window size calculations #7954

Closed
alexpyattaev wants to merge 3 commits into anza-xyz:master from alexpyattaev:streamer_bdp

Conversation


@alexpyattaev commented Sep 8, 2025

Problem

SWQOS ignores RTT in some of its calculations. This means that connections with high latency are heavily rate limited, much more than the stake amount would suggest. This happens before the actual stake-based stream throttling even has a chance to kick in, and is thus very counterintuitive to the client.
A deeper version of #7706. These PRs are mutually exclusive.

Some preliminary concepts:

  • The sender cannot have more than receive_window bytes in flight between itself and the server.
  • The sender cannot have more than max_concurrent_streams open streams at any time.
  • Together these limit how many TXs can be “in flight” between client and server and not yet ACKed. Currently, both limits are computed based on stake alone. We need to compute them based on stake and RTT to the client, since a longer RTT means more data must be on the wire before an ACK is seen (see the worked example after this list).
  • Beyond all the limits mentioned above, there is an overall stream limit imposed by the code in stream_throttle.rs. This has no impact on staked nodes (as long as we do not have many of them) but caps unstaked peer TPS at 200.
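
For a rough sense of scale (the individual numbers come from elsewhere in this description; the combination is only illustrative): at 80 Mbps and 150 ms RTT the bandwidth-delay product is 80 Mbit/s × 0.15 s = 12 Mbit ≈ 1.5 MB, i.e. roughly 3000 transactions of 500 bytes would have to be in flight before the first ACK comes back.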

This mechanism is not intended as the actual rate limiter for compliant clients, just as a limit on network buffers and the like. It is designed to allocate more bandwidth than what you'd get today. Note that a client can open up to 8 concurrent connections per identity, allowing each connection to leverage the full bandwidth at the network level (before throttling by the stream_throttle.rs logic). So this PR should not reduce TPU bandwidth in any way, except for clients with extremely low latency to the target validator. If operators want to limit TPS on a per-client basis, they should use the throttling logic for it rather than receive_window.

  • The number of concurrent streams should match the BDP of the link to prevent starvation of the client.
  • In practice, throughput is ultimately limited not only by the window size or the number of streams in flight, but by the throttling logic behind all of this, which operates on a per-staked-identity basis, not per connection.

Summary of Changes

  • Make the number of concurrent streams allowed scale with the RX window size.
  • Set the bandwidth allocation for unstaked/min-staked connections to 4 Mbps (800 TPS for 500-byte TXs).
  • Set the bandwidth allocation for max-staked connections to 80 Mbps (16 KTPS for 500-byte TXs).
  • Bump the number of concurrent streams to match the bandwidth allocations (up to 5000).

Formula for bandwidth allocation per connection:
```
min_rate = 4  # Mbit/s
max_rate = 80  # Mbit/s
min_bit_rate = min_rate + (stake * (max_rate - min_rate) / max_stake)
```

Actual bitrate will depend on RTT (see the sketch after this list):

  • For RTT < 50 ms it will be strictly more than min_bit_rate, to mimic the behavior before the change
  • For 50 ms < RTT < 300 ms it will match min_bit_rate
  • For RTT > 300 ms it will be roughly min_bit_rate * 300 / RTT
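
A minimal Rust sketch of the window calculation described above (function and constant names are illustrative, not the PR's actual identifiers; the constants and the RTT clamp restate the formulas above):

```rust
const MIN_RATE_MBPS: u64 = 4; // unstaked / min-staked allocation
const MAX_RATE_MBPS: u64 = 80; // max-staked allocation
const MAX_ALLOWED_RTT_MS: u64 = 300; // past this, the window stops growing

/// Per-connection target bit rate, linearly interpolated by stake.
fn target_rate_mbps(stake: u64, max_stake: u64) -> u64 {
    MIN_RATE_MBPS + stake * (MAX_RATE_MBPS - MIN_RATE_MBPS) / max_stake.max(1)
}

/// BDP-based receive window: rate * RTT, converted to bytes.
/// 1 Mbit/s * 1 ms = 1000 bits = 125 bytes.
fn receive_window_bytes(stake: u64, max_stake: u64, rtt_ms: u64) -> u64 {
    let rtt_ms = rtt_ms.clamp(1, MAX_ALLOWED_RTT_MS);
    target_rate_mbps(stake, max_stake) * rtt_ms * 125
}

fn main() {
    // A max-staked connection at 150 ms RTT gets a ~1.5 MB window.
    println!("{}", receive_window_bytes(1, 1, 150));
}
```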

The maximal sustainable TPS per connection will depend on the size of transactions. The concurrent_streams allocation is based on a transaction size of 400 bytes and is calculated as
max_streams = receive_window / mean_transaction_size
so if transactions are smaller than that on average, the achievable TPS will be proportionally smaller than the bandwidth allocation would suggest. For example, if your transactions are 200 bytes each, you'd get about half the TPS you would expect given your bandwidth allocation.
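
A sketch of that stream-count formula under the same caveat (names are illustrative; the 400-byte mean and the 5000 cap come from this description):

```rust
const MEAN_TRANSACTION_SIZE_BYTES: u64 = 400; // assumed average TX size
const MAX_CONCURRENT_STREAMS_CAP: u64 = 5000; // upper bound from this PR

/// Streams needed to keep a BDP-sized window full of average-sized TXs.
fn max_concurrent_streams(receive_window_bytes: u64) -> u64 {
    (receive_window_bytes / MEAN_TRANSACTION_SIZE_BYTES).clamp(1, MAX_CONCURRENT_STREAMS_CAP)
}

fn main() {
    // The 1.5 MB window from the previous sketch yields 3750 streams.
    println!("{}", max_concurrent_streams(1_500_000));
}
```

With the 1.5 MB window from the previous sketch this works out to 3750 concurrent streams; smaller real-world transactions then hit the stream limit before the byte limit, which is the proportional TPS loss described above.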

Without this PR, for the same set of nodes each holding 20% of stake:

{'latency': 10, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 610320 transactions (122064 TPS)

Allowing 1 Gbit/s worth of transactions from 5 clients is probably excessive... but we will deal with it in a separate PR

{'latency': 150, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 35930 transactions (7186 TPS)

With 150 ms latency (Auckland-Barcelona link) we lose nearly all our TPS, dropping to about 58 Mbps with the same stake.

With this PR:

{'latency': 10, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 547171 transactions (109434 TPS)

Roughly the same TPS for low-latency nodes.

{'latency': 150, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 132884 transactions (26576 TPS)

But now at very high latency, we still provide ~220 Mbps (4x more than before the change)

Clearly, the PR makes the TPS changes caused by latency variation less severe, but it does not eliminate them completely. Eliminating them completely is out of scope for this PR.

Some pretty plots

Each point on the plots is collected via the run.sh script in the repo with the test utils. The ideal plot would look like a single flat plane rising along the stake axis, with no dependence on latency at all. In reality, very-high-latency connections suffer a bit. Blue = low TPS, yellow = high TPS.

1024 byte TXs

Before:
plot_old_1024_bytes

After:
plot_new_1024_bytes

512 byte TXs

Before:

plot_old_512_bytes

After:

plot_new_512_bytes

176 byte TXs

Before:

plot_old_176_bytes

After:

plot_new_176_bytes

See also #7745 for relevant discussion.

@alexpyattaev marked this pull request as ready for review September 8, 2025 15:14

@alessandrod left a comment


flagging so I don't forget


codecov-commenter commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (3e7bb4a) to head (0b59170).
⚠️ Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #7954     +/-   ##
=========================================
- Coverage    83.2%    83.2%   -0.1%     
=========================================
  Files         861      861             
  Lines      375220   375158     -62     
=========================================
- Hits       312482   312419     -63     
- Misses      62738    62739      +1     

lijunwangs previously approved these changes Sep 11, 2025

@lijunwangs left a comment


LGTM

@alexpyattaev (Author)

> LGTM

@lijunwangs had to rebase due to conflicts

lijunwangs previously approved these changes Sep 16, 2025

@alessandrod left a comment


some initial comments, I haven't tried running the code yet

Comment thread on streamer/src/nonblocking/quic.rs (outdated):

stats.active_streams.fetch_sub(1, Ordering::Relaxed);
stream_load_ema.update_ema_if_needed();
if (stream_number % 128) == 0 {

How was this number chosen? It feels like it's way too often.

@alexpyattaev (Author)


It depends on whether you are staked or not. For staked it may be a bit too often, for unstaked it is once per second (unless they have cheated on the RTT). Do you think this should be based on stake?

Comment thread on streamer/src/nonblocking/quic.rs:
// max(1) is needed on localhost to avoid zero result
// truncate here is safe since u64 millis is an eternity
let millis = (rtt.as_millis() as u64).clamp(1, MAX_ALLOWED_RTT_MS);
let receive_window = (max_receive_rate_kbps * millis) / 8;

we should give a 10-20% boost here to account for ACK coalescing and network blips
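
A minimal sketch of how such headroom could be applied on top of the BDP-derived window (the 15% figure and the constant name are assumptions picked from the suggested 10-20% range, not part of the PR):

```rust
/// Add fixed headroom on top of the BDP-derived window to absorb
/// ACK coalescing and short network blips.
fn with_headroom(receive_window: u64) -> u64 {
    const BDP_HEADROOM_PERCENT: u64 = 15; // assumed, within the suggested 10-20% range
    receive_window + receive_window * BDP_HEADROOM_PERCENT / 100
}

fn main() {
    assert_eq!(with_headroom(1_000_000), 1_150_000);
}
```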

@alexpyattaev force-pushed the streamer_bdp branch 2 times, most recently from de83248 to ed995e1 on September 29, 2025 14:14
@alexpyattaev (Author)

@gregcusack @lijunwangs the issue with the perf for small transactions was due to starvation of concurrent streams on the server side; I had just missed it before. Had to bump the max streams count to 1500. Now the scaling is as we would expect for an average TX mix, with slight degradation if some high-staked node is sending purely small TXs (which is very unlikely).


gregcusack commented Oct 21, 2025

I have tested this a few times with Alex's streamer tester, but I am getting more packet loss with this PR than with the current code. Not sure if it's this PR (I don't see a bug in the code) or a mininet issue. When I don't get packet loss, the distribution of bandwidth across clients by stake is much better than what we currently have. Will try to find some time this week to reproduce across two dev boxes.

@alexpyattaev (Author)

> But I am getting more packet loss with this PR than with the current code. Not sure if it's this PR (don't see a bug in the code) or a mininet issue.

With the old code I could not get good speeds at high latency, which is where the majority of the packet loss is observed.


gregcusack commented Oct 24, 2025

OK, I ran some tests across multiple hosts.
TL;DR: This code provides fairer TPS between nodes with the same stake but different latencies, as intended. The downside is that the overall TPS the server processes is significantly reduced.

Setup: 1 node in LA running the QUIC server, 2 nodes running QUIC clients in Dallas and Osaka.
Experiments ran across multiple packet sizes, but I'll just show the tests with 1000-byte packets.
Tx size: 1000 bytes
Overall TPS reduction: 60%

Data:

THIS PR

Server: 17,970 TPS

Dallas

| Pubkey | TPS |
| --- | --- |
| 4bRcVxpQMjP2VNX1geBJWcQVERTnPJDbaeiZAVwVEG6h | 676 |
| 7kFZrEHzgo8jGgLPzmEYwmiFfYXFv65n5R7Zm4ihnfr6 | 1127 |
| EP6NEYGo8DUzDSEyBtS7rskUhARKfYf52zcqDVfRaqfy | 1562 |
| 22EfiGE3JA6AFS8T7dZf9qmF9h2nzpuQH3ak8588tjjp | 1993 |
| DGC1ARvjkdQZcdnrXbonkbRpA6LY4Vm5TCqF54tjbQR4 | 4504 |

Osaka

| Pubkey | TPS |
| --- | --- |
| 3GC9X8jfraids6Wgr7s11jYsgoZLgoxBjvBSPBVggMap | 631 |
| HFEqs5Qxni1HhmyhxMrw8qPAqneDk1MRQWqWE1KkPsX9 | 980 |
| 4fUgrqKEHBnPUCKhYk37DZp7ZqBsKzb7MgEe1AHChPgS | 1339 |
| Fox7AyrfriBrE3VDme7WNAs25YoAZvXwXPsq3wPzBpam | 1673 |
| K44JtsDAjdYsuEt8KJosZZ9ES1FEosB3wETqyMa41mr | 3481 |

MASTER

Server: 45,704 TPS

Dallas

| Pubkey | TPS |
| --- | --- |
| 4bRcVxpQMjP2VNX1geBJWcQVERTnPJDbaeiZAVwVEG6h | 4365 |
| 7kFZrEHzgo8jGgLPzmEYwmiFfYXFv65n5R7Zm4ihnfr6 | 4827 |
| EP6NEYGo8DUzDSEyBtS7rskUhARKfYf52zcqDVfRaqfy | 7122 |
| 22EfiGE3JA6AFS8T7dZf9qmF9h2nzpuQH3ak8588tjjp | 6926 |
| DGC1ARvjkdQZcdnrXbonkbRpA6LY4Vm5TCqF54tjbQR4 | 13342 |

Osaka

| Pubkey | TPS |
| --- | --- |
| 3GC9X8jfraids6Wgr7s11jYsgoZLgoxBjvBSPBVggMap | 1015 |
| HFEqs5Qxni1HhmyhxMrw8qPAqneDk1MRQWqWE1KkPsX9 | 1398 |
| 4fUgrqKEHBnPUCKhYk37DZp7ZqBsKzb7MgEe1AHChPgS | 1784 |
| Fox7AyrfriBrE3VDme7WNAs25YoAZvXwXPsq3wPzBpam | 1940 |
| K44JtsDAjdYsuEt8KJosZZ9ES1FEosB3wETqyMa41mr | 2982 |

So I think we need to get overall TPS numbers up before this PR can land, imo. These experiments show similar results to the plots Alex provided in the PR description.


alexpyattaev commented Oct 25, 2025

Thanks for checking this Greg!

> So I think we need to get overall TPS numbers up before this PR can land imo.

I am not quite sure which TPS you want us to match: the Dallas or the Osaka numbers (they are latency-dependent in master). If you were to run tests LA-to-LA, you would get even more TPS with the old code.

The basic math for relative resource allocation is the same as in the original PR (linear interpolation between min stake and max stake); the question is just which TPS we target for which amount of stake.

I have bumped the bandwidth allocations 2x in 452a1e3 as an initial step, but we do need to answer a fundamental question: "how much should we allow for X amount of stake?".

@alessandrod

> If you were to run tests LA-to-LA, you would get even more TPS with the old code.

so why are we trying to lower numbers? we want the opposite?

@alexpyattaev (Author)

> If you were to run tests LA-to-LA, you would get even more TPS with the old code.

> so why are we trying to lower numbers? we want the opposite?

We are not trying to lower them; we are trying to choose them to reflect realistic demand. This PR tried to match the bandwidth allocation of a ~30 ms RTT link, which got 2x'd following feedback from Greg. I am not sure on what basis you want to tune this.

@alexpyattaev force-pushed the streamer_bdp branch 4 times, most recently from 8f6f085 to e67cc85 on November 3, 2025 21:40
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied,
  helps high-latency senders (>50ms RTT) get reasonable TPS
* set number of streamer streams based on BDP (for same reason)
* For any RTT below 400ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update RX window every 128 TXs in case someone tries to spoof it
@alexpyattaev (Author)

@alessandrod I've added logic to make sure the bandwidth for close-by clients (<50ms RTT) remains on par with the current case. Bandwidth for far-away clients (>50ms) is strictly higher than before (see plots).

@alexpyattaev (Author)

Turning this to draft for now in favor of #8948

