
Streamer/TPU: scale amount of bytes in flight with peer RTT #7745

Merged
alexpyattaev merged 4 commits into anza-xyz:master from alexpyattaev:autotune_number_of_streams
Sep 8, 2025

Conversation


@alexpyattaev alexpyattaev commented Aug 27, 2025

Problem

SWQOS ignores RTT in some of its calculations. This means that connections with high latency are heavily rate limited, much more so than their stake amount would suggest. This happens before throttling even has a chance to kick in, and is thus very counterintuitive to the client.
A deeper version of #7706. These PRs are mutually exclusive.

  • A sender cannot have more than receive_window bytes in flight between itself and the server.
  • A sender cannot have more than max_concurrent_streams open streams at any time.
  • Together, these limit how many TXs can be “in flight” between client and server and not yet ACKed. Currently, both limits are computed based on stake alone. We need to compute them based on stake and RTT to the client, since a longer RTT means more data must be on the wire before an ACK comes back (a rough sketch of the idea follows this list).
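
As a rough sketch of the BDP idea (the function and constant names here are illustrative assumptions, not the PR's actual code), the receive window could be derived from a stake-scaled target rate and the measured RTT:

use std::time::Duration;

// Hypothetical bounds, for illustration only; the PR's real constants differ.
const MIN_RECEIVE_WINDOW: u64 = 12_320; // room for a few full-size packets
const MAX_RECEIVE_WINDOW: u64 = 8 * 1024 * 1024;

/// Sketch: size the receive window as a bandwidth-delay product, so a peer
/// with a longer RTT is allowed proportionally more unacknowledged bytes.
fn receive_window_bdp(target_bytes_per_second: u64, rtt: Duration) -> u64 {
    // BDP = throughput * RTT; the same target rate needs a larger window
    // when ACKs take longer to come back.
    let bdp = (target_bytes_per_second as f64 * rtt.as_secs_f64()) as u64;
    bdp.clamp(MIN_RECEIVE_WINDOW, MAX_RECEIVE_WINDOW)
}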

This mechanism is not intended as the actual rate limiter for compliant clients; it is just a limit on network buffers and the like. It is designed to allocate more bandwidth than what you would get today. Note that a client can open up to 8 concurrent connections per identity, allowing for up to 16000 TPS at the network level (before throttling), so this should not reduce TPU bandwidth in any way. If operators want to limit TPS on a per-client basis, they should use the throttling logic for it rather than receive_window.

  • The number of concurrent streams should match the BDP of the link to prevent starvation of the client.
  • Throughput is not limited by the window size or the number of streams in flight, but rather by the throttling logic behind all of this, which operates per staked identity, not per connection.

Summary of Changes

  • Make the number of concurrent streams allowed scale with the RX window size (see the sketch below).
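
A minimal sketch of that scaling, assuming the stream budget is derived from the window and the smallest transaction size (the divisor and function name are assumptions; only the MAX_ALLOWED_UNI_STREAMS = 1024 cap appears in the PR itself):

/// Upper bound on streams per connection, as introduced by the PR.
const MAX_ALLOWED_UNI_STREAMS: u64 = 1024;

/// Sketch: grant roughly one stream per smallest transaction that fits in the
/// RX window, so the stream limit tracks the window rather than stake alone.
fn max_concurrent_streams(receive_window_bytes: u64) -> u64 {
    const MIN_TX_BYTES: u64 = 180; // assumed lower bound on transaction size
    (receive_window_bytes / MIN_TX_BYTES).clamp(1, MAX_ALLOWED_UNI_STREAMS)
}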

Without this PR, for the same set of nodes each holding 20% of stake:

{'latency': 10, 'clients': 5, 'duration': 10.0, 'tx-size': 250}
Server captured 468315 transactions (46831 TPS)

{'latency': 200, 'clients': 5, 'duration': 10.0, 'tx-size': 250}
Server captured 57811 transactions (5781 TPS)

With this PR:

{'latency': 10, 'clients': 5, 'duration': 10.0, 'tx-size': 250}
Server captured 196799 transactions (19679 TPS)

{'latency': 200, 'clients': 5, 'duration': 10.0, 'tx-size': 250}
Server captured 135651 transactions (13565 TPS)

Clearly, the PR makes the variation in TPS caused by latency differences far less severe.

Importantly, we are getting > 2000 TPS per staked connection (so up to 16000 TPS per staked identity before throttling).
Mainnet is currently serving ~1000 TPS, so this should never limit the overall TPS.


codecov-commenter commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 98.91304% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.1%. Comparing base (4a3e05a) to head (3cf2ba1).
⚠️ Report is 2358 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7745   +/-   ##
=======================================
  Coverage    83.0%    83.1%           
=======================================
  Files         812      812           
  Lines      356900   356820   -80     
=======================================
- Hits       296578   296529   -49     
+ Misses      60322    60291   -31     

@alexpyattaev alexpyattaev force-pushed the autotune_number_of_streams branch from d1b02ea to 644e3f8 on August 27, 2025 19:35
@alexpyattaev alexpyattaev force-pushed the autotune_number_of_streams branch 2 times, most recently from bf5768c to e267c8a on August 27, 2025 19:44

lvboudre commented Aug 27, 2025

@alexpyattaev

I think we need a max stream limit, because window size alone is not a good proxy for bandwidth.

Window size represents the number of bytes that can be sent without waiting for an acknowledgement.
If someone receives ACKs quickly, the window capacity will never be filled.
This is why the window size needs to be greater for higher-latency peers.

Someone with really low latency, who receives ACKs faster from the remote validator, will be able to send more transactions (and thus use more bandwidth) than another peer with higher stake.

IMO, max concurrent streams is closer to what we call "bandwidth".

EDIT:

I didn't notice compute_receive_window_bdp, which was used to compute the new max_concurrent_streams.

@alexpyattaev

@alexpyattaev

I think we need a max stream limit, because window size alone is not a good proxy for bandwidth.

Unfortunately, the stream limit is an even worse proxy for bandwidth on the wire, since a stream can carry anywhere between 180 and 1232 bytes, giving a factor-of-7 "error" in how many bits per second we allow.

Thus, if we rate limit primarily based on streams and not the RX window, it becomes harder to predict how much bandwidth a given TPU peer will ultimately be able to use. Once 4 KB transactions are available, this variability will be even more pronounced. That would force the server to be more conservative with bandwidth allocations, resulting in less bandwidth for everyone.
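
For illustration only (not code from the PR), the spread a pure stream-count limit implies:

// Illustration only: with a fixed stream budget, the implied amount of data
// in flight varies by the ratio of the largest to the smallest TX size.
const MIN_TX_BYTES: u64 = 180; // roughly the smallest useful transaction
const MAX_TX_BYTES: u64 = 1232; // current packet-size ceiling

fn main() {
    let streams = 512;
    let (lo, hi) = (streams * MIN_TX_BYTES, streams * MAX_TX_BYTES);
    // Prints a ~6.8x spread: the same stream limit maps to very different bandwidths.
    println!("{streams} streams pin between {lo} and {hi} bytes in flight");
}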

  • do not update RX window on every TX, only every 64 TXs
  • bump max RTT to 300ms based on popular request
  • Not adjusting stream count during connection lifetime reduces allocations
@alexpyattaev alexpyattaev force-pushed the autotune_number_of_streams branch from 6f2c114 to 3cf2ba1 on August 29, 2025 08:20
@alexpyattaev alexpyattaev marked this pull request as ready for review August 29, 2025 08:20
@alexpyattaev alexpyattaev changed the title from "autotune number of streamer streams" to "Streamer/TPU: scale receive window and number of streams with peer RTT" on Aug 29, 2025
const MAX_ALLOWED_RTT: Duration = Duration::from_millis(300);

/// Maximal possible amount of streams to allocate per connection
const MAX_ALLOWED_UNI_STREAMS: u64 = 1024;

@alexpyattaev alexpyattaev Sep 3, 2025

This used to max out at 512 for the highest-staked peer. It is not intended for throttling.

@alexpyattaev alexpyattaev changed the title from "Streamer/TPU: scale receive window and number of streams with peer RTT" to "Streamer/TPU: scale amount of bytes in flight with peer RTT" on Sep 3, 2025
@alexpyattaev

Ran this on mds1 for 3 days, no obvious issues found.

@alexpyattaev alexpyattaev merged commit 82716b8 into anza-xyz:master Sep 8, 2025
43 checks passed
@alexpyattaev alexpyattaev deleted the autotune_number_of_streams branch September 8, 2025 12:59
@brooksprumo brooksprumo mentioned this pull request Sep 8, 2025
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 8, 2025
@alexpyattaev alexpyattaev restored the autotune_number_of_streams branch September 8, 2025 14:19
@alexpyattaev alexpyattaev deleted the autotune_number_of_streams branch September 8, 2025 14:19
alexpyattaev added a commit that referenced this pull request Sep 8, 2025
Revert "Streamer/TPU: scale amount of bytes in flight with peer RTT (#7745)" (#7953)

This reverts commit 82716b8.
const CONNECTION_CLOSE_CODE_DISALLOWED: u32 = 2;
const CONNECTION_CLOSE_REASON_DISALLOWED: &[u8] = b"disallowed";

const CONNECTION_CLOSE_CODE_EXCEED_MAX_STREAM_COUNT: u32 = 3;

Why is this removed? We still have max streams, no?

@alexpyattaev (Author)

Yes, we do. It is enforced by quinn.
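
As a rough illustration of what "enforced by quinn" means (this sketch assumes quinn's public TransportConfig API; the parameter names and values are placeholders, not the PR's code):

use std::sync::Arc;
use quinn::{TransportConfig, VarInt};

// Sketch: hand the per-peer budgets to quinn so the QUIC stack itself refuses
// streams and bytes beyond the limits; the server needs no manual counting.
fn per_peer_transport_config(receive_window: u64, max_uni_streams: u64) -> Arc<TransportConfig> {
    let mut cfg = TransportConfig::default();
    cfg.receive_window(VarInt::from_u64(receive_window).unwrap())
        // A client that tries to open more streams simply stalls on flow
        // control instead of being disconnected with an error code.
        .max_concurrent_uni_streams(VarInt::from_u64(max_uni_streams).unwrap())
        .max_concurrent_bidi_streams(0u32.into());
    Arc::new(cfg)
}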

alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 9, 2025
…#7745)

* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 14, 2025
…#7745)

* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 23, 2025
…#7745)

* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 29, 2025
…#7745)

* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms

address review comments from Lijun
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Sep 29, 2025
…#7745)

* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms

address review comments from Lijun
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Oct 16, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Nov 1, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Nov 2, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
do not update RX window on every TX, only every 64 TXs
Set max RTT to 300ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Nov 2, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
* target up to 80 Mbps service rate per max-staked connection
do not update RX window on every TX, only every 64 TXs
Set max RTT to 400ms
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Nov 3, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied,
  helps high-latency senders (>50ms RTT) get reasonable TPS
* set number of streamer streams based on BDP (for same reason)
* For any RTT below 400ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update RX window every 128 TXs in case someone tries to spoof it
alexpyattaev added a commit to alexpyattaev/agave that referenced this pull request Nov 3, 2025
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied,
  helps high-latency senders (>50ms RTT) get reasonable TPS
* set number of streamer streams based on BDP (for same reason)
* For any RTT below 400ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update RX window every 128 TXs in case someone tries to spoof it
