use BDP for streamer window size calculations #7954
alexpyattaev wants to merge 3 commits into anza-xyz:master from
Conversation
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff            @@
##           master    #7954    +/-   ##
=========================================
- Coverage    83.2%    83.2%    -0.1%
=========================================
  Files         861      861
  Lines      375220   375158      -62
=========================================
- Hits       312482   312419      -63
- Misses      62738    62739       +1
```
Force-pushed from 19d9d77 to b817358.
Force-pushed from 4a712eb to 5d58f33.
@lijunwangs had to rebase due to conflicts.
Force-pushed from 5d58f33 to c21e493.
alessandrod left a comment:

Some initial comments, I haven't tried running the code yet.
```rust
stats.active_streams.fetch_sub(1, Ordering::Relaxed);
stream_load_ema.update_ema_if_needed();
if (stream_number % 128) == 0 {
```
How was this number chosen? It feels like it's way too often.
It depends on whether you are staked or not. For staked connections it may be a bit too often; for unstaked it is once per second (unless they have cheated on the RTT). Do you think this should be based on stake?
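If it were based on stake, one option is to derive the check interval from the connection's expected stream rate so every client refreshes roughly once per second. A sketch with invented names, not code from this PR:

```rust
/// Streams to process between receive-window refreshes: aim for roughly
/// one refresh per second, but never wait more than 1024 streams.
fn refresh_interval(expected_streams_per_second: u64) -> u64 {
    expected_streams_per_second.clamp(1, 1024)
}
```

The fixed `% 128` check above would then become `stream_number % refresh_interval(rate) == 0`.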
```rust
// max(1) is needed on localhost to avoid a zero result
// truncating here is safe since u64 millis is an eternity
let millis = (rtt.as_millis() as u64).clamp(1, MAX_ALLOWED_RTT_MS);
let receive_window = (max_receive_rate_kbps * millis) / 8;
```
we should give a 10-20% boost here to account for ack coalescing and network blips
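A minimal sketch of that suggestion applied to the snippet above, assuming a fixed 15% headroom (the constant and function names are invented here):

```rust
/// BDP-derived receive window, padded ~15% so ack coalescing and short
/// network blips don't stall the sender (the review suggests 10-20%).
fn padded_receive_window(max_receive_rate_kbps: u64, rtt_millis: u64) -> u64 {
    const BDP_HEADROOM_PERCENT: u64 = 15;
    let bdp_bytes = (max_receive_rate_kbps * rtt_millis) / 8;
    bdp_bytes + bdp_bytes * BDP_HEADROOM_PERCENT / 100
}
```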
Force-pushed from de83248 to ed995e1.
Force-pushed from 0a1870a to 0f8e8ab.
@gregcusack @lijunwangs the issue with the performance for small transactions was due to starvation of concurrent streams on the server side; I had just missed it before. Had to bump the max stream count to 1500. Now the scaling is as we would expect for an average TX mix, with slight degradation if some high-staked node is sending purely small TXs (which is very unlikely).
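As a rough illustration of why the stream budget matters for small transactions (back-of-the-envelope arithmetic, not from the PR): 1500 concurrent streams of ~200-byte transactions keep about 300 kB in flight, which over a 30 ms RTT sustains roughly 300 kB / 0.03 s = 10 MB/s = 80 Mbit/s; with a lower stream cap, small transactions exhaust the stream budget long before the receive window fills.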
I have tested this a few times with Alex's streamer tester, but I am getting more packet loss with this PR than with the current code. Not sure if it's this PR (I don't see a bug in the code) or a mininet issue. When I don't get packet loss, the distribution of bandwidth across clients by stake is much better than what we currently have. Will try to find some time this week to reproduce across two dev boxes.
With the old code I could not get good speeds at high latency, which is where the majority of the packet loss is observed.
OK, ran some tests across multiple hosts. Setup: 1 node in LA running the QUIC server, 2 nodes running QUIC clients in Dallas and Osaka. Data:

THIS PR
Server: 17,970 TPS
Dallas
Osaka

MASTER
Server: 45,704 TPS
Dallas
Osaka

So I think we need to get overall TPS numbers up before this PR can land, imo. These experiments show results similar to the plots Alex provided in the PR description.
Thanks for checking this, Greg!
I am not quite sure which TPS you want us to match: the Dallas or the Osaka numbers (as they are latency-dependent in master). If you were to run tests LA-LA, you would get even more TPS with the old code. The basic math for relative resource allocation is the same as in the original PR (linear interpolation between min stake and max stake); the question is just which TPS we target for which amount of stake. I have bumped the bandwidth allocations 2x in 452a1e3 as an initial step, but we do need to answer a fundamental question: "how much should we allow for X amount of stake?"
So why are we trying to lower the numbers? We want the opposite?
We are not trying to lower them; we are trying to choose them to reflect realistic demand. This PR tried to match the bandwidth allocation at ~30 ms RTT, and that got 2x'd following feedback from Greg. I am not sure on what basis you want to tune this.
Force-pushed from 8f6f085 to e67cc85.
…#7745)

* use BDP to compute the rx window before SWQOS throttling is applied; helps high-latency senders (>50 ms RTT) get reasonable TPS
* set the number of streamer streams based on BDP (for the same reason)
* for any RTT below 400 ms, target up to 80 Mbps service rate per max-staked connection
* add a workaround to keep giving higher bandwidth to close-by nodes
* update the RX window every 128 TXs in case someone tries to spoof it
Force-pushed from e67cc85 to 8c81523.
@alessandrod I've added logic to make sure the bandwidth for close-by clients (<50 ms RTT) remains on par with the current case. Bandwidth for far-away clients (>50 ms) is strictly higher than before (see plots).
Turning this into a draft for now in favor of #8948.
Problem
SWQOS ignores RTT in some of its calculations. This means that connections with high latency are heavily rate limited, much more than the stake amount would suggest. This happens before the actual stake-based stream throttling even has a chance to kick in, and is thus very counterintuitive to the client.
A deeper version of #7706. These PRs are mutually exclusive.
Some preliminary concepts:
This mechanism is not intended as the actual rate limiter for compliant clients, just as a limit on network buffers and the like; it is designed to allocate more bandwidth than what you'd get today. Note that a client can open up to 8 concurrent connections per identity, allowing each connection to leverage the full bandwidth on the network level (before throttling by the stream_throttle.rs logic). So this PR should not reduce the TPU bandwidth in any way, except for clients with extremely low latency to the target validator. If operators want to limit TPS on a per-client basis, they should use the throttling logic for it rather than receive_window.
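For reference, the bandwidth-delay product (BDP) in the PR title is the amount of data that must be in flight to sustain a given rate over a given RTT. A minimal sketch, assuming the rate is expressed in kbit/s as in the reviewed snippet:

```rust
use std::time::Duration;

/// Bytes that must be in flight to sustain `rate_kbps` over a path
/// with round-trip time `rtt`: kbit/s * ms = bits, then / 8 = bytes.
fn bdp_bytes(rate_kbps: u64, rtt: Duration) -> u64 {
    rate_kbps * rtt.as_millis() as u64 / 8
}
```

For example, sustaining 80 Mbit/s over a 150 ms path needs 80,000 kbit/s × 150 ms / 8 = 1.5 MB in flight, so a window sized without regard to RTT starves high-latency clients. And with 8 connections per identity at up to 80 Mbit/s each, the network-level ceiling per identity is roughly 640 Mbit/s before stream throttling.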
Summary of Changes
Formula for bandwidth allocation per connection:
```
min_rate = 4   # Mbit/s
max_rate = 80  # Mbit/s
min_bit_rate = min_rate + (stake * (max_rate - min_rate) / max_stake)
```

Actual bitrate will depend on RTT:

- … min_bit_rate to mimic the behavior before the change

The maximal sustainable TPS per connection will depend on the size of transactions. The concurrent_streams allocation is based on a transaction size of 400 bytes and is calculated as

```
max_streams = receive_window / mean_transaction_size
```

so if transactions are smaller on average, the TPS will be proportionally smaller. For example, if your transactions are 200 bytes each, you'd get about half the TPS you would expect given your bandwidth allocation.
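Putting the pieces together, a minimal sketch of the allocation described above (the function name and constants are illustrative, not the exact code in the PR):

```rust
const MIN_RATE_MBPS: u64 = 4;
const MAX_RATE_MBPS: u64 = 80;
const MAX_ALLOWED_RTT_MS: u64 = 400;
const MEAN_TRANSACTION_SIZE: u64 = 400; // bytes

/// Returns (receive_window in bytes, max concurrent streams) for a
/// connection with the given stake and measured RTT. Assumes
/// `max_stake > 0`.
fn connection_limits(stake: u64, max_stake: u64, rtt_ms: u64) -> (u64, u64) {
    // Linear interpolation between min and max rate by stake.
    let rate_mbps = MIN_RATE_MBPS + stake * (MAX_RATE_MBPS - MIN_RATE_MBPS) / max_stake;
    // Clamp RTT: at least 1 ms (localhost), at most 400 ms.
    let rtt_ms = rtt_ms.clamp(1, MAX_ALLOWED_RTT_MS);
    // BDP: Mbit/s * 1000 = kbit/s; kbit/s * ms = bits; / 8 = bytes.
    let receive_window = rate_mbps * 1000 * rtt_ms / 8;
    // Stream budget assumes ~400-byte transactions.
    let max_streams = receive_window / MEAN_TRANSACTION_SIZE;
    (receive_window, max_streams)
}
```

For a max-staked connection at 150 ms RTT this works out to an 80 Mbit/s rate, a 1.5 MB window, and 3750 concurrent streams.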
Without this PR, for the same set of nodes each holding 20% of stake:
Allowing 1 Gbit/s worth of transactions from 5 clients is probably excessive... but we will deal with it in a separate PR
With 150 ms latency (an Auckland-Barcelona link) we lose nearly all our TPS, down to 58 Mbps with the same stake.
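(A back-of-the-envelope check: a window-limited connection can carry at most receive_window / RTT, and 58 Mbit/s at 150 ms corresponds to an effective window of about 58 Mbit × 0.15 s / 8 ≈ 1.1 MB, which is what a fixed, RTT-blind window yields regardless of stake.)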
With this PR:
Roughly same TPS for low-latency nodes
But now at very high latency, we still provide ~220 Mbps (4x more than before the change)
Clearly, the PR makes TPS less sensitive to latency variations, but it does not eliminate the effect completely; eliminating it completely is out of scope for this PR.
Some pretty plots
Each point on the plot is collected via the run.sh script in the repo with the test utils. The ideal plot would look like a single flat plane rising along the stake axis, with no dependence on latency at all. In reality, very high latency connections suffer a bit. Blue = low TPS, yellow = high TPS.
1024 byte TXs
Before:

After:

512 byte TXs
Before:
After:
176 byte TXs
Before:
After:
See also #7745 for relevant discussion.