
use BDP for streamer window size calculations #7954

Closed
alexpyattaev wants to merge 3 commits into anza-xyz:master from alexpyattaev:streamer_bdp

Conversation


@alexpyattaev commented Sep 8, 2025

Problem

SWQOS ignores RTT in some of its calculations. This means that connections with high latency are heavily rate limited, much more than the stake amount would suggest. This happens before the actual stake-based stream throttling even has a chance to kick in, and is thus very counterintuitive to the client.
A deeper version of #7706. These PRs are mutually exclusive.

Some preliminary concepts:

  • The sender cannot have more than receive_window bytes in flight between itself and the server.
  • The sender cannot have more than max_concurrent_streams open streams at any time.
  • Together these limit how many TXs can be “in flight” between client and server and not yet ACKed. Currently, both limits are computed based on stake alone. We need to compute them based on stake and RTT to the client, since a longer RTT means more data must be on the wire before an ACK is seen (see the worked example after this list).
  • Beyond all the limits mentioned above, there is an overall stream limit imposed by the code in stream_throttle.rs. This has no impact on staked nodes (as long as we do not have many of them) but caps unstaked peer TPS at 200.
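
For a rough sense of scale (the individual numbers come from elsewhere in this description; the combination is only illustrative): at 80 Mbps and 150 ms RTT the bandwidth-delay product is 80 Mbit/s × 0.15 s = 12 Mbit ≈ 1.5 MB, i.e. roughly 3000 transactions of 500 bytes would have to be in flight before the first ACK comes back.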

This mechanism is not intended as the actual rate limiter for compliant clients, just as a limit on network buffers and the like. It is designed to allocate more bandwidth than what you'd get today. Note that a client can open up to 8 concurrent connections per identity, allowing each connection to leverage the full bandwidth at the network level (before throttling by the stream_throttle.rs logic). So this PR should not reduce TPU bandwidth in any way, except for clients with extremely low latency to the target validator. If operators want to limit TPS on a per-client basis, they should use the throttling logic for it rather than receive_window.

  • The number of concurrent streams should match the BDP of the link to prevent starvation of the client.
  • In practice, throughput is ultimately limited not only by the window size or the number of streams in flight, but by the throttling logic behind all of this, which operates on a per-staked-identity basis, not per connection.

Summary of Changes

  • Make the number of concurrent streams allowed scale with the RX window size.
  • Set the bandwidth allocation for unstaked/min-staked connections to 4 Mbps (800 TPS for 500-byte TXs).
  • Set the bandwidth allocation for max-staked connections to 80 Mbps (16 KTPS for 500-byte TXs).
  • Bump the number of concurrent streams to match the bandwidth allocations (up to 5000).

Formula for bandwidth allocation per connection:
```
min_rate = 4  # Mbit/s
max_rate = 80  # Mbit/s
min_bit_rate = min_rate + (stake * (max_rate - min_rate) / max_stake)
```

Actual bitrate will depend on RTT (see the sketch after this list):

  • For RTT < 50 ms it will be strictly more than min_bit_rate, to mimic the behavior before the change
  • For 50 ms < RTT < 300 ms it will match min_bit_rate
  • For RTT > 300 ms it will be roughly min_bit_rate * 300 / RTT
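
A minimal Rust sketch of the window calculation described above (function and constant names are illustrative, not the PR's actual identifiers; the constants and the RTT clamp restate the formulas above):

```rust
const MIN_RATE_MBPS: u64 = 4; // unstaked / min-staked allocation
const MAX_RATE_MBPS: u64 = 80; // max-staked allocation
const MAX_ALLOWED_RTT_MS: u64 = 300; // past this, the window stops growing

/// Per-connection target bit rate, linearly interpolated by stake.
fn target_rate_mbps(stake: u64, max_stake: u64) -> u64 {
    MIN_RATE_MBPS + stake * (MAX_RATE_MBPS - MIN_RATE_MBPS) / max_stake.max(1)
}

/// BDP-based receive window: rate * RTT, converted to bytes.
/// 1 Mbit/s * 1 ms = 1000 bits = 125 bytes.
fn receive_window_bytes(stake: u64, max_stake: u64, rtt_ms: u64) -> u64 {
    let rtt_ms = rtt_ms.clamp(1, MAX_ALLOWED_RTT_MS);
    target_rate_mbps(stake, max_stake) * rtt_ms * 125
}

fn main() {
    // A max-staked connection at 150 ms RTT gets a ~1.5 MB window.
    println!("{}", receive_window_bytes(1, 1, 150));
}
```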

The maximal sustainable TPS per connection will depend on the size of transactions. The concurrent_streams allocation is based on a transaction size of 400 bytes and is calculated as
max_streams = receive_window / mean_transaction_size
so if transactions are smaller than that on average, the achievable TPS will be proportionally smaller than the bandwidth allocation would suggest. For example, if your transactions are 200 bytes each, you'd get about half the TPS you would expect given your bandwidth allocation.
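
A sketch of that stream-count formula under the same caveat (names are illustrative; the 400-byte mean and the 5000 cap come from this description):

```rust
const MEAN_TRANSACTION_SIZE_BYTES: u64 = 400; // assumed average TX size
const MAX_CONCURRENT_STREAMS_CAP: u64 = 5000; // upper bound from this PR

/// Streams needed to keep a BDP-sized window full of average-sized TXs.
fn max_concurrent_streams(receive_window_bytes: u64) -> u64 {
    (receive_window_bytes / MEAN_TRANSACTION_SIZE_BYTES).clamp(1, MAX_CONCURRENT_STREAMS_CAP)
}

fn main() {
    // The 1.5 MB window from the previous sketch yields 3750 streams.
    println!("{}", max_concurrent_streams(1_500_000));
}
```

With the 1.5 MB window from the previous sketch this works out to 3750 concurrent streams; smaller real-world transactions then hit the stream limit before the byte limit, which is the proportional TPS loss described above.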

Without this PR, for the same set of nodes each holding 20% of stake:

{'latency': 10, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 610320 transactions (122064 TPS)

Allowing 1 Gbit/s worth of transactions from 5 clients is probably excessive... but we will deal with it in a separate PR

{'latency': 150, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 35930 transactions (7186 TPS)

With 150 ms latency (Auckland-Barcelona link) we lose nearly all our TPS, dropping to about 58 Mbps with the same stake.

With this PR:

{'latency': 10, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 547171 transactions (109434 TPS)

Roughly the same TPS for low-latency nodes.

{'latency': 150, 'clients': 5, 'duration': 5.0, 'tx-size': 1024}
Server captured 132884 transactions (26576 TPS)

But now at very high latency, we still provide ~220 Mbps (4x more than before the change)

Clearly, the PR makes the TPS changes caused by latency variation less severe, but it does not eliminate them completely. Eliminating them completely is out of scope for this PR.

Some pretty plots

Each point on the plots is collected via the run.sh script in the repo with the test utils. The ideal plot would look like a single flat plane rising along the stake axis, with no dependence on latency at all. In reality, very-high-latency connections suffer a bit. Blue = low TPS, yellow = high TPS.

1024 byte TXs

Before:
plot_old_1024_bytes

After:
plot_new_1024_bytes

512 byte TXs

Before:

plot_old_512_bytes

After:

plot_new_512_bytes

176 byte TXs

Before:

plot_old_176_bytes

After:

plot_new_176_bytes

See also #7745 for relevant discussion.

@alexpyattaev marked this pull request as ready for review September 8, 2025 15:14

@alessandrod left a comment


flagging so I don't forget


codecov-commenter commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (3e7bb4a) to head (0b59170).
⚠️ Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #7954     +/-   ##
=========================================
- Coverage    83.2%    83.2%   -0.1%     
=========================================
  Files         861      861             
  Lines      375220   375158     -62     
=========================================
- Hits       312482   312419     -63     
- Misses      62738    62739      +1     

lijunwangs previously approved these changes Sep 11, 2025

@lijunwangs left a comment


LGTM

@alexpyattaev (Author)

> LGTM

@lijunwangs had to rebase due to conflicts

lijunwangs previously approved these changes Sep 16, 2025

@alessandrod left a comment


some initial comments, I haven't tried running the code yet

Comment thread on streamer/src/nonblocking/quic.rs (outdated):

stats.active_streams.fetch_sub(1, Ordering::Relaxed);
stream_load_ema.update_ema_if_needed();
if (stream_number % 128) == 0 {

How was this number chosen? It feels like it's way too often.

@alexpyattaev (Author)


It depends on whether you are staked or not. For staked it may be a bit too often, for unstaked it is once per second (unless they have cheated on the RTT). Do you think this should be based on stake?

Comment thread on streamer/src/nonblocking/quic.rs:
// max(1) is needed on localhost to avoid zero result
// truncate here is safe since u64 millis is an eternity
let millis = (rtt.as_millis() as u64).clamp(1, MAX_ALLOWED_RTT_MS);
let receive_window = (max_receive_rate_kbps * millis) / 8;

we should give a 10-20% boost here to account for ACK coalescing and network blips
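
A minimal sketch of how such headroom could be applied on top of the BDP-derived window (the 15% figure and the constant name are assumptions picked from the suggested 10-20% range, not part of the PR):

```rust
/// Add fixed headroom on top of the BDP-derived window to absorb
/// ACK coalescing and short network blips.
fn with_headroom(receive_window: u64) -> u64 {
    const BDP_HEADROOM_PERCENT: u64 = 15; // assumed, within the suggested 10-20% range
    receive_window + receive_window * BDP_HEADROOM_PERCENT / 100
}

fn main() {
    assert_eq!(with_headroom(1_000_000), 1_150_000);
}
```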

@alexpyattaev force-pushed the streamer_bdp branch 2 times, most recently from de83248 to ed995e1 on September 29, 2025 14:14
@alexpyattaev (Author)

@gregcusack @lijunwangs the issue with the perf for small transactions was due to starvation of concurrent streams on the server side; I had just missed it before. Had to bump the max streams count to 1500. Now the scaling is as we would expect for an average TX mix, with slight degradation if some high-staked node is sending purely small TXs (which is very unlikely).


gregcusack commented Oct 21, 2025

I have tested this a few times with Alex's streamer tester, but I am getting more packet loss with this PR than with the current code. Not sure if it's this PR (I don't see a bug in the code) or a mininet issue. When I don't get packet loss, the distribution of bandwidth across clients by stake is much better than what we currently have. Will try to find some time this week to reproduce across two dev boxes.

@alexpyattaev (Author)

> But I am getting more packet loss with this PR than with the current code. Not sure if it's this PR (don't see a bug in the code) or a mininet issue.

With the old code I could not get good speeds at high latency, which is where the majority of the packet loss is observed.


gregcusack commented Oct 24, 2025

OK, I ran some tests across multiple hosts.
TL;DR: This code provides fairer TPS between nodes with the same stake but different latencies, as intended. The downside is that the overall TPS the server processes is significantly reduced.

Setup: 1 node in LA running the QUIC server, 2 nodes running QUIC clients in Dallas and Osaka.
Experiments ran across multiple packet sizes, but I'll just show the tests with 1000-byte packets.
Tx size: 1000 bytes
Overall TPS reduction: 60%

Data:

THIS PR

Server: 17,970 TPS

Dallas

| Pubkey | TPS |
| --- | --- |
| 4bRcVxpQMjP2VNX1geBJWcQVERTnPJDbaeiZAVwVEG6h | 676 |
| 7kFZrEHzgo8jGgLPzmEYwmiFfYXFv65n5R7Zm4ihnfr6 | 1127 |
| EP6NEYGo8DUzDSEyBtS7rskUhARKfYf52zcqDVfRaqfy | 1562 |
| 22EfiGE3JA6AFS8T7dZf9qmF9h2nzpuQH3ak8588tjjp | 1993 |
| DGC1ARvjkdQZcdnrXbonkbRpA6LY4Vm5TCqF54tjbQR4 | 4504 |

Osaka

| Pubkey | TPS |
| --- | --- |
| 3GC9X8jfraids6Wgr7s11jYsgoZLgoxBjvBSPBVggMap | 631 |
| HFEqs5Qxni1HhmyhxMrw8qPAqneDk1MRQWqWE1KkPsX9 | 980 |
| 4fUgrqKEHBnPUCKhYk37DZp7ZqBsKzb7MgEe1AHChPgS | 1339 |
| Fox7AyrfriBrE3VDme7WNAs25YoAZvXwXPsq3wPzBpam | 1673 |
| K44JtsDAjdYsuEt8KJosZZ9ES1FEosB3wETqyMa41mr | 3481 |

MASTER

Server: 45,704 TPS

Dallas

| Pubkey | TPS |
| --- | --- |
| 4bRcVxpQMjP2VNX1geBJWcQVERTnPJDbaeiZAVwVEG6h | 4365 |
| 7kFZrEHzgo8jGgLPzmEYwmiFfYXFv65n5R7Zm4ihnfr6 | 4827 |
| EP6NEYGo8DUzDSEyBtS7rskUhARKfYf52zcqDVfRaqfy | 7122 |
| 22EfiGE3JA6AFS8T7dZf9qmF9h2nzpuQH3ak8588tjjp | 6926 |
| DGC1ARvjkdQZcdnrXbonkbRpA6LY4Vm5TCqF54tjbQR4 | 13342 |

Osaka

| Pubkey | TPS |
| --- | --- |
| 3GC9X8jfraids6Wgr7s11jYsgoZLgoxBjvBSPBVggMap | 1015 |
| HFEqs5Qxni1HhmyhxMrw8qPAqneDk1MRQWqWE1KkPsX9 | 1398 |
| 4fUgrqKEHBnPUCKhYk37DZp7ZqBsKzb7MgEe1AHChPgS | 1784 |
| Fox7AyrfriBrE3VDme7WNAs25YoAZvXwXPsq3wPzBpam | 1940 |
| K44JtsDAjdYsuEt8KJosZZ9ES1FEosB3wETqyMa41mr | 2982 |

So I think we need to get overall TPS numbers up before this PR can land, imo. These experiments show similar results to the plots Alex provided in the PR description.


alexpyattaev commented Oct 25, 2025

Thanks for checking this Greg!

> So I think we need to get overall TPS numbers up before this PR can land imo.

I am not quite sure which TPS you want us to match: the Dallas or the Osaka numbers (they are latency-dependent in master). If you were to run tests LA-to-LA, you would get even more TPS with the old code.

The basic math for relative resource allocation is the same as in the original PR (linear interpolation between min stake and max stake); the question is just which TPS we target for which amount of stake.

I have bumped the bandwidth allocations 2x in 452a1e3 as an initial step, but we do need to answer a fundamental question: "how much should we allow for X amount of stake?".

@alessandrod

> If you were to run tests LA-to-LA, you would get even more TPS with the old code.

so why are we trying to lower numbers? we want the opposite?

@alexpyattaev (Author)

> If you were to run tests LA-to-LA, you would get even more TPS with the old code.

> so why are we trying to lower numbers? we want the opposite?

We are not trying to lower them; we are trying to choose them to reflect realistic demand. This PR tried to match the bandwidth allocation of a ~30 ms RTT link, which got 2x'd following feedback from Greg. I am not sure on what basis you want to tune this.

@alexpyattaev force-pushed the streamer_bdp branch 4 times, most recently from 8f6f085 to e67cc85 on November 3, 2025 21:40
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied,
  helps high-latency senders (>50ms RTT) get reasonable TPS
* set number of streamer streams based on BDP (for same reason)
* For any RTT below 400ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update RX window every 128 TXs in case someone tries to spoof it
@alexpyattaev (Author)

@alessandrod I've added logic to make sure the bandwidth for close-by clients (<50ms RTT) remains on par with the current case. Bandwidth for far-away clients (>50ms) is strictly higher than before (see plots).

@alexpyattaev (Author)

Turning this to draft for now in favor of #8948

