Adjust receive window to make them linear to the count of streams #33913

lijunwangs · 2023-10-27T21:06:36Z

Problem

It has been found in across region testing, transaction sending to remote regions are some time failing with WriteError(Stopped). This is caused that after a stream is opened, the client is blocked from sending due to the following error in Quinn on blocked by connection level flow control. This is due to limited receive window among the multiple streams opened, some stream spent long time in waiting for the limit to be available. The problem is not as obvious when the latency is low as contenders might release the usage fast enough before the time out is reached. But in the case of wide area network with longer latency the problem becomes more severe. Fundamentally, if we allow that many of streams to be opened, we need to ensure they have enough bandwidth to be processed concurrently.

Summary of Changes

Make connection receive window ratio scale to the streams limit.

Without the change, with the following configuration: 4 nodes in the us-west and 1 node in eu-west, running bench-tps tests with thin-client, with duration set to > 300 seconds, we can observe send transaction failures in quic_client with WriteError(Stopped)., with the fix, the problem is no longer observed.

Fixes #

codecov · 2023-10-27T22:18:34Z

Codecov Report

Merging #33913 (cb65405) into master (24a4670) will increase coverage by 0.0%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #33913   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         809      809           
  Lines      218062   218062           
=======================================
+ Hits       178616   178660   +44     
+ Misses      39446    39402   -44

lijunwangs · 2023-10-31T21:52:34Z

In the bench-tps tests with thin client using the following command

lijun@lijun-dev:~/solana$ ./cargo run --release --bin solana-bench-tps -- -u http://35.233.177.221:8899 --identity /home/lijun/.config/solana/id.json --tx_count 1000 --thread-batch-sleep-ms 0 -t 20 --duration 300 -n 35.233.177.221:8001 --read-client-keys /home/lijun/gce-keypairs.yaml

The cluster has 5 nodes, 4 nodes running in us-west and one node in eu-west.

The results for with and without the changes when the target selected is local to the client are similar.

With the changes:

[2023-10-30T18:37:32.379799783Z INFO solana_bench_tps::bench] Average TPS: 13596.006
[2023-10-30T18:47:45.479273306Z INFO solana_bench_tps::bench] Average TPS: 14842.48
[2023-10-30T22:09:32.425591207Z INFO solana_bench_tps::bench] Average TPS: 13733.603
[2023-10-30T22:52:55.438325354Z INFO solana_bench_tps::bench] Average TPS: 13025.998
[2023-10-30T23:16:19.227599708Z INFO solana_bench_tps::bench] Average TPS: 12082.257

Without:
[2023-10-31T00:23:51.799392269Z INFO solana_bench_tps::bench] Average TPS: 13629.625
[2023-10-31T00:40:32.005345530Z INFO solana_bench_tps::bench] Average TPS: 13446.362
[2023-10-31T01:15:02.894377583Z INFO solana_bench_tps::bench] Average TPS: 15698.498
[2023-10-31T05:15:19.921295026Z INFO solana_bench_tps::bench] Average TPS: 10978.847
[2023-10-31T05:21:02.765027475Z INFO solana_bench_tps::bench] Average TPS: 5734.667

There are material differences when the target selected is the only one of the remote node in the cluster (eu-west).

With the change:
[2023-10-30T18:54:22.452562507Z INFO solana_bench_tps::bench] Average TPS: 1743.3228
[2023-10-30T19:12:23.275733223Z INFO solana_bench_tps::bench] Average TPS: 1781.0348
[2023-10-30T19:20:13.163388833Z INFO solana_bench_tps::bench] Average TPS: 1737.1093

Without the fix:
[2023-10-31T00:11:18.913058858Z INFO solana_bench_tps::bench] Average TPS: 125.85008
[2023-10-31T01:04:35.701763927Z INFO solana_bench_tps::bench] Average TPS: 125.05482
[2023-10-31T21:51:28.900832620Z INFO solana_bench_tps::bench] Average TPS: 125.72437

There are 10 times of difference.

apfitzge · 2023-11-03T05:54:55Z

sdk/src/quic.rs

@@ -26,12 +26,12 @@ pub const QUIC_CONNECTION_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(60);

 /// The receive window for QUIC connection from unstaked nodes is
 /// set to this ratio times [`solana_sdk::packet::PACKET_DATA_SIZE`]
-pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 1;
+pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 128;


Not very familiar with how we use quic, so forgive my long and possibly out of scope questions. I'm trying to understand the effect these constants have.

This means an unstaked peer can send 128 * PACKET_DATA_SIZE bytes before being blocked, right?
PACKET_DATA_SIZE will not include IP, or Quic header bytes, so this means it would be slightly less than 128 max-length transactions?

This is a big jump from 1 to 128 for unstaked peers. Does this 128x the amount of unstaked traffic allowed by the quic-server? Are there concerns about overwhelming the server?

When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight.

Would they get reset on acknowledgement?

What does that mean for my packets if say 1024 packets arrive very close in time to each other? Probably assume the sw quic server would not be fast enough to respond before all packets physically arrive. Would the first 128 be accepted and the rest just dropped?

Correct, the receive window is set to linear to the amount of the the streams allowed. Each connection will be allowed to consume stream count limit x the PACKET_DATA_SIZE. The amounts to about 128*1232 = 150K at maximum for a unstaked connection. We allow 500 unstaked connections -- that amounts to about 73MB max for unstaked connections. I think that is a reasonable number. In any case my thinking is the number of streams and the receive window ratio should be linearly set. Having the large amount of streams with small receive window can 1. reduce throughput because of fragmentation and 2. can actually increase the load of both CPU and memory. For the CPU, the client and server need to maintain the states and doing the pulling on these blocked streams and for memory -- we are actually buffering the data in the quic streamer anyway until the full stream data is received. So for the worst case scenario I do not think it is increasing the max memory consumption. The receive window is dynamically adjusted by the quic protocol layer as data is received and consumed -- it is sliding window.

A good read explaining this mechanism is https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/mobilebasic.

When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight.
Yes, if I client decide to do so -- we allow at maximum concurrently 128 streams. That is not net new change though.

What does that mean for my packets if say 1024 packets arrive very close in time to each other? Probably assume the sw quic server would not be fast enough to respond before all packets physically arrive. Would the first 128 be accepted and the rest just dropped?

QUIC protocol will issue blocked message to the stream if the receive window is being exceeded, the client will be blocked and will need to wait for notifications of the WINDOW_UPDATE. For client maliciously ignoring the receive window mechanism, the connection can be dropped by the server.

It seems this will increase the receive window for all the streams for a given connection. So just to understand, this can potentially increase the RAM usage for a given node. Is that correct? If so, in worst case what's the increase in RAM usage?

In the worst case scenario which is malicious clients intentionally holds off finalizing the streams and the streamer actually fails to pull off the data from quinn protocol layer, the memory consumed will be

128x1232x512/1024^2 = 73MB for unstaked
512x1232x2000/1024^2 = 1203 MB for staked.

But we are very unlikely to run into this situation as we actively polling data from Quinn into streamer's internal buffers. In that case, even the client intentionally stopping finalizing the data it would not actually increase the net worst case memory pressure as the data would have been buffered at the streamer's temporary buffers anyway which are dictated by the stream_per_connection_count * connection_count.

lijunwangs · 2023-11-14T16:13:29Z

@ryleung-solana @pgarg66 -- can you please help review this?

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

mergify · 2024-04-05T08:01:42Z

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

…3913) Adjust receive window to make them linear to the count of streams to reduce fragmentations (cherry picked from commit aa991b6)

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

#595) Adjust receive window to make them linear to the count of streams (solana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

#595) Adjust receive window to make them linear to the count of streams (#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

Try different receive window

cb65405

lijunwangs requested review from apfitzge, pgarg66, ryleung-solana and sakridge October 31, 2023 21:53

apfitzge reviewed Nov 3, 2023

View reviewed changes

pgarg66 approved these changes Nov 14, 2023

View reviewed changes

lijunwangs merged commit aa991b6 into solana-labs:master Nov 14, 2023
42 checks passed

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 18, 2024

Adjust receive window to make them linear to the count of streams (so…

87d3166

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 18, 2024

Adjust receive window to make them linear to the count of streams (so…

81591ee

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 21, 2024

Adjust receive window to make them linear to the count of streams (so…

2b18f0c

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 27, 2024

Adjust receive window to make them linear to the count of streams (so…

27ccc3c

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

willhickey mentioned this pull request Mar 28, 2024

v1.18 commits - please ignore anza-xyz/agave#475

Closed

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 3, 2024

Adjust receive window to make them linear to the count of streams (so…

7b5d8f3

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

lijunwangs added the v1.17 PRs that should be backported to v1.17 label Apr 5, 2024

mergify bot pushed a commit that referenced this pull request Apr 5, 2024

Adjust receive window to make them linear to the count of streams (#3…

b08faa5

…3913) Adjust receive window to make them linear to the count of streams to reduce fragmentations (cherry picked from commit aa991b6)

mergify bot mentioned this pull request Apr 5, 2024

v1.17: Adjust receive window to make them linear to the count of streams (backport of #33913) #35495

Closed

lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 5, 2024

Adjust receive window to make them linear to the count of streams (so…

364e779

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

t-nelson mentioned this pull request Apr 5, 2024

Adjust receive window to make them linear to the count of streams (#3… anza-xyz/agave#595

Merged

jeffwashington pushed a commit to lijunwangs/solana that referenced this pull request Apr 5, 2024

Adjust receive window to make them linear to the count of streams (so…

9b30724

…lana-labs#33913) Adjust receive window to make them linear to the count of streams to reduce fragmentations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjust receive window to make them linear to the count of streams #33913

Adjust receive window to make them linear to the count of streams #33913

lijunwangs commented Oct 27, 2023 •

edited

Loading

codecov bot commented Oct 27, 2023 •

edited

Loading

lijunwangs commented Oct 31, 2023

apfitzge Nov 3, 2023

lijunwangs Nov 6, 2023 •

edited

Loading

pgarg66 Nov 14, 2023

lijunwangs Nov 14, 2023 •

edited

Loading

lijunwangs commented Nov 14, 2023

mergify bot commented Apr 5, 2024

Adjust receive window to make them linear to the count of streams #33913

Adjust receive window to make them linear to the count of streams #33913

Conversation

lijunwangs commented Oct 27, 2023 • edited Loading

Problem

Summary of Changes

codecov bot commented Oct 27, 2023 • edited Loading

Codecov Report

lijunwangs commented Oct 31, 2023

apfitzge Nov 3, 2023

Choose a reason for hiding this comment

lijunwangs Nov 6, 2023 • edited Loading

Choose a reason for hiding this comment

pgarg66 Nov 14, 2023

Choose a reason for hiding this comment

lijunwangs Nov 14, 2023 • edited Loading

Choose a reason for hiding this comment

lijunwangs commented Nov 14, 2023

mergify bot commented Apr 5, 2024

lijunwangs commented Oct 27, 2023 •

edited

Loading

codecov bot commented Oct 27, 2023 •

edited

Loading

lijunwangs Nov 6, 2023 •

edited

Loading

lijunwangs Nov 14, 2023 •

edited

Loading