
fix: mitigate head-of-line blocking in QUIC stream handling #10937

Draft
stablebits wants to merge 1 commit into anza-xyz:master from stablebits:hol

Conversation

@stablebits

stablebits commented Mar 3, 2026

Note: I'll do more measurements, but making it available for general feedback and review in the meantime.

Replace sequential stream processing in handle_connection() with concurrent polling of multiple streams within a single tokio task.

Problem

See https://docs.google.com/document/d/1nS7gsPqHG-2q9_rVgkChCIeZzByB2hBTbR8Tsc5inUk/edit?tab=t.0#heading=h.s49mngh00e8b

Streams are processed one at a time: accept → read all chunks → accept next. When packet loss causes a stream's read to block waiting on missing data, all subsequent streams are stalled — reintroducing HTTP/2-style HOL blocking that QUIC was designed to eliminate.

Moreover, in Quinn (and QUIC doesn't mandate any specific retransmission strategy) retransmissions are stream-local and scheduled in a round-robin fashion along with regular data from other streams. With lots of pending streams on the client side, it may take quite some time (multiple RTTs, depending on the number of pending streams and current CWND) for lost data to be delivered.
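
For reference, the sequential flow in current master looks roughly like this (an illustrative sketch against quinn's public API, not the actual handle_connection() code; timeouts, stats and error handling are simplified, and process_chunk is a made-up placeholder):

use bytes::Bytes;
use quinn::Connection;

// Illustrative sketch of the sequential flow: every stream is drained to
// completion before the next one is accepted, so a stream stuck waiting on a
// retransmitted packet stalls all streams behind it.
async fn handle_connection_sequential(connection: Connection) -> anyhow::Result<()> {
    loop {
        // Accept the next incoming unidirectional stream.
        let mut stream = connection.accept_uni().await?;
        // Read all of its chunks before accepting the next stream. If a chunk
        // is missing due to packet loss, this await blocks here.
        while let Some(chunk) = stream.read_chunk(usize::MAX, true).await? {
            process_chunk(chunk.bytes);
        }
    }
}

// Placeholder for packet reassembly / forwarding.
fn process_chunk(_bytes: Bytes) {}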

Summary of Changes

Replace the sequential one-stream-at-a-time loop with a multi-stream polling mechanism that processes up to 8 (configurable) concurrent streams per connection. The default parallelism of 8 streams should hopefully be sufficient to resolve HoL issues for most practical use cases (i.e. anything short of severe packet loss or reordering affecting many streams).
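
A rough sketch of the idea against quinn's public API (illustrative only: the actual patch differs in slot bookkeeping, per-slot deadlines, stats and chunk handling; MAX_STREAMS_PER_CONNECTION matches the constant mentioned below, while drain_stream and process_stream are made-up placeholders):

use std::{
    future::{poll_fn, Future},
    pin::Pin,
    task::Poll,
};

use bytes::Bytes;
use quinn::{Connection, RecvStream};

const MAX_STREAMS_PER_CONNECTION: usize = 8;

// A slot holds a "drain this stream to completion" future.
type SlotFuture = Pin<Box<dyn Future<Output = Result<Vec<Bytes>, quinn::ReadError>>>>;

// Read all chunks of one stream; missing data blocks only this slot.
async fn drain_stream(mut stream: RecvStream) -> Result<Vec<Bytes>, quinn::ReadError> {
    let mut chunks = Vec::new();
    while let Some(chunk) = stream.read_chunk(usize::MAX, true).await? {
        chunks.push(chunk.bytes);
    }
    Ok(chunks)
}

// Resolves as soon as any active slot finishes: a fixed scan over at most
// MAX_STREAMS_PER_CONNECTION entries, with no extra queue or per-poll allocation.
fn poll_slots(
    slots: &mut [Option<SlotFuture>; MAX_STREAMS_PER_CONNECTION],
) -> impl Future<Output = (usize, Result<Vec<Bytes>, quinn::ReadError>)> + '_ {
    poll_fn(move |cx| {
        for (i, slot) in slots.iter_mut().enumerate() {
            if let Some(fut) = slot {
                if let Poll::Ready(out) = fut.as_mut().poll(cx) {
                    return Poll::Ready((i, out));
                }
            }
        }
        Poll::Pending
    })
}

async fn handle_connection_concurrent(connection: Connection) -> Result<(), quinn::ConnectionError> {
    let mut slots: [Option<SlotFuture>; MAX_STREAMS_PER_CONNECTION] =
        std::array::from_fn(|_| None);
    loop {
        let free_slot = slots.iter().position(|s| s.is_none());
        let has_active = slots.iter().any(|s| s.is_some());
        tokio::select! {
            // Keep accepting new streams while a slot is free, even if other
            // streams are still blocked on retransmitted data.
            res = connection.accept_uni(), if free_slot.is_some() => {
                let fut: SlotFuture = Box::pin(drain_stream(res?));
                slots[free_slot.unwrap()] = Some(fut);
            }
            // Make progress on whichever in-flight stream is ready next.
            (idx, done) = poll_slots(&mut slots), if has_active => {
                slots[idx] = None;
                if let Ok(chunks) = done {
                    process_stream(chunks);
                }
            }
        }
    }
}

// Placeholder for packet reassembly / forwarding.
fn process_stream(_chunks: Vec<Bytes>) {}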

Compared to a possible alternative using FuturesUnordered (a sketch of that variant follows this list):

  • Current code does a fixed scan of at most 8 slots. FuturesUnordered avoids scanning but adds queue bookkeeping overhead.
  • Current code is virtually allocation-free after connection setup. FuturesUnordered usually allocates per pushed future/task node.
  • FuturesUnordered is better if fairness/ready-queue behavior is a hard requirement. In our case (short streams, send_fairness disabled on sender), this is arguably not required.
  • If we later want to raise concurrent streams per connection to dozens/hundreds, FuturesUnordered might become more compelling.
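
For comparison, a minimal sketch of the FuturesUnordered variant (same caveats as above: illustrative only, reusing the same placeholder names):

use bytes::Bytes;
use futures::{stream::FuturesUnordered, StreamExt};
use quinn::Connection;

const MAX_STREAMS_PER_CONNECTION: usize = 8;

async fn handle_connection_unordered(connection: Connection) -> Result<(), quinn::ConnectionError> {
    // Every pushed future gets its own node inside the queue (per-push
    // allocation), but only futures that were actually woken get polled.
    let mut in_flight = FuturesUnordered::new();
    loop {
        let can_accept = in_flight.len() < MAX_STREAMS_PER_CONNECTION;
        let has_in_flight = !in_flight.is_empty();
        tokio::select! {
            res = connection.accept_uni(), if can_accept => {
                let mut stream = res?;
                in_flight.push(async move {
                    let mut chunks = Vec::new();
                    // Error handling omitted: stop reading on any stream error.
                    while let Ok(Some(chunk)) = stream.read_chunk(usize::MAX, true).await {
                        chunks.push(chunk.bytes);
                    }
                    chunks
                });
            }
            Some(chunks) = in_flight.next(), if has_in_flight => {
                process_stream(chunks);
            }
        }
    }
}

// Placeholder for packet reassembly / forwarding.
fn process_stream(_chunks: Vec<Bytes>) {}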

At MAX_STREAMS_PER_CONNECTION = 8 and short stream sizes (even with 4K transactions later), the current approach is likely a bit more efficient in the hot path. To be measured, though.

@codecov-commenter

codecov-commenter commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 87.02703% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.1%. Comparing base (578e52e) to head (851776c).

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #10937    +/-   ##
========================================
  Coverage    83.0%    83.1%            
========================================
  Files         839      839            
  Lines      317370   317498   +128     
========================================
+ Hits       263721   263843   +122     
- Misses      53649    53655     +6     

@alexpyattaev

FYI no point tagging us individually, networking gets tagged automagically by github.

gregcusack assigned gregcusack and unassigned gregcusack Mar 5, 2026
@gregcusack

> FYI no point tagging us individually, networking gets tagged automagically by github.

personally i like getting tagged individually. so i know that i need to look at it instead of just anyone on network team

@alexpyattaev

Scale of the problem in current master:

image

We are delaying some streams' reassembly by 160 ms on average

image

Here by 280 ms on average

@alexpyattaev

With this PR it actually somehow got worse though:

image

Could be that we are not dropping as many blocked streams and reassembling them eventually instead? We get way more delayed streams than before... Will need more data.

@stablebits
Author

stablebits commented Mar 6, 2026

@alexpyattaev

> Could be that we are not dropping as many blocked streams and reassembling them eventually instead? We get way more delayed streams than before... Will need more data.

Timed-out streams aren't accounted for in either case. For the current master to record them, it would have needed something like this:

diff --git a/streamer/src/nonblocking/quic.rs b/streamer/src/nonblocking/quic.rs
index 7da1162ac3..12aa9c5631 100644
--- a/streamer/src/nonblocking/quic.rs
+++ b/streamer/src/nonblocking/quic.rs
@@ -657,6 +657,16 @@ async fn handle_connection<Q, C>(
                     stats
                         .total_stream_read_timeouts
                         .fetch_add(1, Ordering::Relaxed);
+
+                    let total_latency = accum.start_time.elapsed();
+                    debug!("Stream reassembly cancelled {}", total_latency.as_millis());
+                    stats
+                        .reassembly_delayed_streams
+                        .fetch_add(1, Ordering::Relaxed);
+                    stats
+                        .reassembly_delayed_streams_cumulative_delay_us
+                        .fetch_add(total_latency.as_micros() as usize, Ordering::Relaxed);
+
                     break;
                 }
             };

One more interesting question is whether 2s timeouts may cause some clients to reset their connections (but not sure yet whether this could have any impact on our new metrics).

Yet another possibility is that we encountered an outlier sample or (perhaps more likely) that there is a bug in this implementation.

Can we enable these messages:

--- a/streamer/src/nonblocking/quic.rs
+++ b/streamer/src/nonblocking/quic.rs
@@ -796,7 +796,7 @@ async fn handle_connection<Q, C>(
                 let now = Instant::now();
                 for (i, slot) in slots.iter_mut().enumerate() {
                     if slot.is_active() && now >= slot.deadline {
-                        debug!("Timeout in receiving on stream slot {i}");
+                        info!("Timeout in receiving on stream slot {i}");
                         slot.deactivate();
                         active_count -= 1;
                         stats.active_streams.fetch_sub(1, Ordering::Relaxed);
@@ -946,7 +946,7 @@ fn handle_chunks(
     let packet_size = packet.meta().size;
     let total_latency = accum.start_time.elapsed();
     if total_latency > rtt.mul_f32(LATE_REASSEMBLY_THRESHOLD) {
-        debug!("Stream reassembly dealyed {}", total_latency.as_millis());
+        info!("Stream reassembly dealyed {}", total_latency.as_millis());
         stats
             .reassembly_delayed_streams
             .fetch_add(1, Ordering::Relaxed);

and collect logs?

Updated:

We have stream_read_timeouts in both cases and the max counts are low in all cases (<10).

Screenshot 2026-03-06 at 10 02 55

@stablebits
Author

stablebits commented Mar 6, 2026

But another aspect is that the new 'reassembly' metrics can't show the positive effect of the HoL mitigation: stream reassembly time itself won't improve. We'd also need to somehow measure the time during which accept_uni() couldn't be called, or even better, how long already-pending streams in quinn could not be accepted.
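
One crude way to approximate the first of these (a hypothetical sketch only, not part of this PR; the ACCEPT_BLOCKED_TIME_US counter is made up) would be to time how long the sequential loop spends draining a stream, i.e. how long it cannot call accept_uni() again:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Instant;

// Hypothetical counter; not an existing field in the streamer stats.
static ACCEPT_BLOCKED_TIME_US: AtomicUsize = AtomicUsize::new(0);

async fn handle_connection_sequential_timed(
    connection: quinn::Connection,
) -> Result<(), quinn::ConnectionError> {
    loop {
        let mut stream = connection.accept_uni().await?;
        // While this stream is being drained, accept_uni() cannot run, so any
        // stream quinn has already opened sits unaccepted for this long.
        let busy_since = Instant::now();
        while let Ok(Some(_chunk)) = stream.read_chunk(usize::MAX, true).await {}
        ACCEPT_BLOCKED_TIME_US
            .fetch_add(busy_since.elapsed().as_micros() as usize, Ordering::Relaxed);
    }
}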

Let's get back to this topic after the new SWQoS. Converted to Draft.

Updated:

Just to complete the data analysis (possible root cause).

So the most likely explanation is that there are cases where multiple (likely consecutive) packets are lost. Let's consider this scenario, assuming every stream fits in a single packet.

Packets: 1, 2, 3, 4, 5 with each packet hosting a single stream.

  1. Packet 1 arrived
  2. accept_uni() returns stream_id=1 -> the highest_accepted_id that we've seen is 1.
  3. Packets 2 and 3 are lost.
  4. Packet 4 arrives with stream_id=4

Now, the way quinn (and QUIC in general) works is that the arrival of stream_id=N will trigger the opening of all streams from highest_accepted_id up to and including N.

  1. accept_uni() returns stream_id=2 -> we add it to the pending list and start measuring delay time.
  2. accept_uni() returns stream_id=3 -> same as above; now we have 2 pending streams
  3. accept_uni() returns stream_id=4 -> data is available
  4. ...

When the data lost in packets 2 and 3 is recovered, the new code records reassembly delay for 2 streams (IDs 2 and 3).

The current master code may record it for stream_id=2 only, because if data for both streams is recovered at the same time (likely), it will accept_uni() stream_id=3 only after this recovery, when its data is already available.

stablebits marked this pull request as draft March 6, 2026