Streamer/TPU: scale amount of bytes in flight with peer RTT #7745
alexpyattaev merged 4 commits into anza-xyz:master
Conversation
Codecov Report
❌ Patch coverage is
@@            Coverage Diff            @@
##           master    #7745    +/-   ##
=========================================
  Coverage    83.0%    83.1%
=========================================
  Files         812      812
  Lines      356900   356820    -80
=========================================
- Hits       296578   296529    -49
+ Misses      60322    60291    -31
Force-pushed from d1b02ea to 644e3f8
Force-pushed from bf5768c to e267c8a
I think we need a max stream limit, because window size alone is not a good proxy for bandwidth. Window size represents the number of bytes that can be sent without waiting for an acknowledgement. Someone with really low latency, who receives ACKs from the remote validator faster, will be able to send more transactions (and thus use more bandwidth) than a higher-staked peer. IMO, max concurrent streams is closer to what we call "bandwidth". EDIT: I didn't notice
Unfortunately, the stream limit is an even worse proxy for bandwidth on the wire, since a stream can carry anywhere between 180 and 1232 bytes, giving a factor-of-7 "error" in how many bits per second we actually allow. Thus, if we rate limit primarily based on streams rather than the RX window, it becomes harder to predict how much bandwidth a given TPU peer will ultimately be able to use. Once 4KB transactions are available, this variability will be even more pronounced, forcing the server to be more conservative with bandwidth allocations and leaving less bandwidth for everyone.
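To make the factor-of-7 point concrete, here is a back-of-the-envelope sketch (the 180/1232 byte bounds are the ones quoted above; the code is purely illustrative, not streamer logic):

```rust
// Hypothetical illustration: the same max-concurrent-streams limit maps to
// very different wire bandwidths depending on how large each stream is.
const MIN_TX_BYTES: u64 = 180; // smallest transaction a stream may carry
const MAX_TX_BYTES: u64 = 1232; // largest transaction under the current packet limit

/// Bytes in flight implied by a stream limit, for a given per-stream size.
fn bytes_in_flight(max_streams: u64, tx_bytes: u64) -> u64 {
    max_streams * tx_bytes
}

fn main() {
    let max_streams = 512;
    let low = bytes_in_flight(max_streams, MIN_TX_BYTES);
    let high = bytes_in_flight(max_streams, MAX_TX_BYTES);
    // ~7x spread for the same stream limit, which is why the receive window
    // (in bytes) is a tighter bandwidth control than a stream count.
    println!("{low}..{high} bytes in flight ({:.1}x spread)", high as f64 / low as f64);
}
```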
do not update RX window on every TX, only every 64 TXs
bump max RTT to 300ms based on popular request
Not adjusting stream count during connection lifetime reduces allocations
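A rough sketch of what "update the RX window only every N transactions" could look like; the struct, names, and placeholder window value are illustrative only (later commits in this PR mention an interval of 128 instead of 64):

```rust
// Assumed batching interval from the commit message above.
const RX_WINDOW_UPDATE_INTERVAL: u64 = 64;

struct ConnectionState {
    tx_count: u64,
    rx_window: u64,
}

impl ConnectionState {
    /// Called once per received transaction; recomputes the receive window
    /// only every RX_WINDOW_UPDATE_INTERVAL transactions to bound the cost
    /// (and the rate at which a spoofed RTT could move the window).
    fn on_transaction(&mut self, compute_window: impl Fn() -> u64) {
        self.tx_count += 1;
        if self.tx_count % RX_WINDOW_UPDATE_INTERVAL == 0 {
            self.rx_window = compute_window();
        }
    }
}

fn main() {
    let mut conn = ConnectionState { tx_count: 0, rx_window: 0 };
    for _ in 0..256 {
        // In the real code the window would come from a BDP calculation;
        // a fixed placeholder value stands in for it here.
        conn.on_transaction(|| 1_000_000);
    }
    assert_eq!(conn.rx_window, 1_000_000);
}
```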
Force-pushed from 6f2c114 to 3cf2ba1
```rust
const MAX_ALLOWED_RTT: Duration = Duration::from_millis(300);

/// Maximal possible amount of streams to allocate per connection
const MAX_ALLOWED_UNI_STREAMS: u64 = 1024;
```
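For readers unfamiliar with BDP sizing, a minimal sketch of how a receive window and stream count could be derived from these constants. The 80 Mbps target rate and 1232-byte max transaction size are taken from figures mentioned elsewhere in this PR; function names and the exact formula are illustrative, not the actual streamer implementation:

```rust
use std::time::Duration;

// Constants from the diff above.
const MAX_ALLOWED_RTT: Duration = Duration::from_millis(300);
const MAX_ALLOWED_UNI_STREAMS: u64 = 1024;

// Assumed for illustration: target service rate per max-staked connection and
// max transaction size; the real code may use different names and values.
const TARGET_RATE_BPS: u64 = 80_000_000;
const MAX_TX_BYTES: u64 = 1232;

/// Sketch: size the receive window to the bandwidth-delay product, so a
/// high-RTT peer can keep enough bytes in flight to reach the target rate.
fn receive_window_bytes(rtt: Duration) -> u64 {
    let rtt = rtt.min(MAX_ALLOWED_RTT);
    // BDP = rate (bytes/s) * RTT (s)
    (TARGET_RATE_BPS / 8) * rtt.as_millis() as u64 / 1000
}

/// Sketch: allow enough uni streams to fill that window with max-size
/// transactions, capped by MAX_ALLOWED_UNI_STREAMS.
fn max_uni_streams(rtt: Duration) -> u64 {
    (receive_window_bytes(rtt) / MAX_TX_BYTES).clamp(1, MAX_ALLOWED_UNI_STREAMS)
}

fn main() {
    // e.g. at 100 ms RTT: 10 MB/s * 0.1 s = 1 MB window, ~811 streams
    let rtt = Duration::from_millis(100);
    println!("window = {} bytes, streams = {}", receive_window_bytes(rtt), max_uni_streams(rtt));
}
```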
This used to max out at 512 for the highest-staked peer. This is not intended for throttling.
Ran this on mds1 for 3 days, no obvious issues found.
…nza-xyz#7745)" This reverts commit 82716b8.
```rust
const CONNECTION_CLOSE_CODE_DISALLOWED: u32 = 2;
const CONNECTION_CLOSE_REASON_DISALLOWED: &[u8] = b"disallowed";

const CONNECTION_CLOSE_CODE_EXCEED_MAX_STREAM_COUNT: u32 = 3;
```
Why is this removed? We still have max streams, no?
Yes, we do. It is enforced by quinn.
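For reference, a sketch of how the stream ceiling can be enforced purely through quinn's configuration, which is why no separate close code is needed in the streamer. This is based on my reading of quinn's API; the exact calls in the agave code may differ:

```rust
use quinn::{Connection, TransportConfig, VarInt};

// Sketch: cap the number of concurrent uni streams at the transport level.
// Peers that try to open more streams than this simply stall until the
// server grants more credit; quinn never lets them exceed the limit.
fn transport_with_stream_cap(max_uni_streams: u32) -> TransportConfig {
    let mut cfg = TransportConfig::default();
    cfg.max_concurrent_uni_streams(VarInt::from_u32(max_uni_streams));
    cfg
}

// The limit can also be adjusted on a live connection, e.g. based on stake/RTT.
fn set_stream_cap(conn: &Connection, max_uni_streams: u32) {
    conn.set_max_concurrent_uni_streams(VarInt::from_u32(max_uni_streams));
}
```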
…#7745)
* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
* do not update RX window on every TX, only every 64 TXs
* set max RTT to 300ms
…#7745)
* use BDP in SWQOS calculations
* set number of streamer streams based on BDP
* do not update RX window on every TX, only every 64 TXs
* set max RTT to 300ms
* address review comments from Lijun
…#7745)
* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
* do not update RX window on every TX, only every 64 TXs
* set max RTT to 300ms
…#7745)
* use BDP to compute the rx window before SWQOS throttling is applied
* set number of streamer streams based on BDP
* target up to 80 Mbps service rate per max-staked connection
* do not update RX window on every TX, only every 64 TXs
* set max RTT to 400ms
…#7745)
* use BDP to compute the rx window before SWQOS throttling is applied; helps high-latency senders (>50ms RTT) get reasonable TPS
* set number of streamer streams based on BDP (for the same reason)
* for any RTT below 400ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update RX window every 128 TXs in case someone tries to spoof it
Problem
SWQOS ignores RTT in some of its calculations. This means that connections with high latency are heavily rate limited, much more than their stake amount would suggest. This happens before throttling even has a chance to kick in, and is thus very counterintuitive to the client.
A deeper version of #7706. These PRs are mutually exclusive.
This mechanism is not intended as the actual rate limiter for compliant clients, just as a limit on network buffers and the like. It is designed to allocate more bandwidth than what you'd get today. Note that a client can open up to 8 concurrent connections per identity, allowing for up to 16000 TPS at the network level (before throttling), so this should not reduce TPU bandwidth in any way. If operators want to limit TPS on a per-client basis, they should use the throttling logic rather than receive_window.
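To illustrate why RTT matters here, a toy calculation with made-up numbers: with a fixed receive window, the best-case rate is bounded by window/RTT, so a distant peer is capped long before any deliberate throttling applies.

```rust
use std::time::Duration;

// Hypothetical numbers, only to illustrate the problem described above:
// the achievable rate with a fixed receive window is at most window / RTT.
fn max_rate_bps(receive_window_bytes: u64, rtt: Duration) -> f64 {
    (receive_window_bytes * 8) as f64 / rtt.as_secs_f64()
}

fn main() {
    let window = 64 * 1024; // assume a fixed 64 KiB window for illustration
    for rtt_ms in [10u64, 50, 150, 300] {
        let rate = max_rate_bps(window, Duration::from_millis(rtt_ms));
        // 64 KiB window: ~52 Mbps at 10 ms RTT, but only ~1.7 Mbps at 300 ms.
        println!("RTT {rtt_ms:>3} ms -> at most {:.1} Mbps", rate / 1e6);
    }
}
```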
Summary of Changes
Without this PR, for the same set of nodes each holding 20% of stake:
With this PR:
Clearly, the PR makes the variation in TPS caused by latency differences much less severe.
Importantly, we are getting > 2000 TPS per staked connection (so up to 16000 TPS per staked identity before throttling)
Currently, mainnet is serving ~1000 TPS, so this should never limit the overall TPS.
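As a quick sanity check on the numbers above (all values are the ones quoted in this description, not measurements):

```rust
// Back-of-the-envelope check of the claimed headroom: 8 connections per
// identity at >2000 TPS each gives the 16000 TPS figure, roughly 16x the
// ~1000 TPS currently served on mainnet.
fn main() {
    let tps_per_connection = 2_000u64; // lower bound quoted above
    let connections_per_identity = 8u64; // connection limit per identity quoted above
    let per_identity_tps = tps_per_connection * connections_per_identity;
    let mainnet_tps = 1_000u64; // approximate current mainnet load
    println!(
        "per-identity ceiling: {per_identity_tps} TPS (~{}x current mainnet load)",
        per_identity_tps / mainnet_tps
    );
}
```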