netstacklat - Handle TCP head-of-line blocking #133

simosund · 2025-09-29T13:48:43Z

Currently netstacklat may report latency for head-of-line (HOL) blocked TCP reads. TCP HOL-blocking can occur when there is packet loss or reordering, when the TCP stack has to delay the delivery of a packet to the socket receive queue until the preceding TCP segments have been received.

As this HOL-blocking is dependent on factors external to the local host (i.e. the packet loss and latency of the entire network path from the sender), this is not the type of latency we typically want to highlight with netstacklat. This PR addresses this issue by not recording the latency for reads that may experience HOL-blocking.

See the commit messages for further details on the how this is accomplished and some limitations.

simosund · 2025-09-29T15:43:37Z

In the following test I demonstrate the issue this PR attempts to fix and what the impact of this patch is. Here I've fetching some 100kB files over a link with 34ms latency and loss varying between 0%, 0.1%, 0.5%, 1%, 2%, 5%, 10% and finally back to 0% again.

Interactive versions of the graphs below together with some additional data is available in this Grafana snapshot (the "before" data corresponds with the earlier half in the snapshot, and "after" is the later half).

NOTE: "Before" is actually still from this PR, just with the HOL-blocking filter it introduces disabled, roughly corresponding to passing the new --include-tcp-hol-delay option. It sets a constant in the rodata, so the JIT compiler should dead-code eliminate the whole thing.

Impact on latency distribution

Before

As can be seen, as soon as there's some packet loss we start seeing a small fraction of significantly higher latencies, several of them starting in the 16ms+ bucket (that bucket ends just below the configured 34ms link latency). As the packet loss is increased, we see a roughly proportional growth in the fraction of high latency values caused by HOL blocking. At higher packet loss rates we even see some latencies streching into the 500ms+ range (likely from timeouts or the same segment needing to be retransmitted multiple times)

After

It successfully avoids recording any latency values which suffer from considerable HOL-blocking. We can still see a little bit of a shift in the latency distribution for the non-HOL-blocked reads though.

Number of measurements

NOTE: I've updated this section, as in the initial posting I accidentally used the wrong hook (tcp-socket-enqueued) and then found no difference in number of measurement between before and after (which is to be expected, as this PR has no affect on the tcp-socket-enqueued hook). I've now updated this with correct figures for tcp-socket-read.

Before:

netstacklat_tcpread_count_withoutholfilt

After:

As expected we see a small drop in the number of measurements when we exclude the potentially HOL-blocked reads. As the packet loss increases the relative reduction in samples grows. Overall, the reduction is pretty small though, suggesting that we're not discarding a large amount of valid samples.

eBPF program runtime

Before

After

It does clearly add a bit of overhead for the netstacklat_tcp_recv_timestamp program (which is the one recording the tcp-socket-read latency). When there's no packet loss, the difference appears to be quite small, with both before and after showing around 250ns per invocation. The overhead does however seem to grow as the packet loss increases, reaching around 400ns at 1% packet loss (vs around 280ns before). However, past 1% the overhead drops down to around 330ns (vs 280-320ns before).

Clearly there is some variation in the measured runtime here and the numbers are not super exact. Overall though, I would say that the additional overhead is measurable but most of the time not too problematic. As this shows the average overhead, we get a bit of a tradeoff from the additional work of looking up the ooo-range and comparing it to the last byte read, vs lower overhead in the instances we exclude the measurement and then don't have do the hashmap lookups to update the histogram.

Here we can also see the netstacklat_tcp_queue_ofo() program, which is responsible for tracking the upper ooo sequence range whenever there's an out-of-order packet. In the before case it's basically a NOP (but still measures in at 30ns here). In the after case it still only takes between 70-100ns. As it only runs when there are out-of-order packets, the overall overhead from this is fairly low. In this these runs, it only ran about 2.5% as frequently as netstacklat_tcp_recv_timestamp at 0.1% packet loss, and about 25% as frequently at 10% packet loss.

Total (estimated) eBPF CPU overhead

Before

After

This is the estimated CPU overhead for all eBPF programs from netstacklat (based on the cumulative runtime for each eBPF program invocation as measured by kernel.bpf_stats_enabled). The largest difference is at 0.1% packet loss, where the total CPU utilization increased by around 10% (from ~0.175% to ~0.195%). If we focus on only the affected programs, netstacklat_tcp_recv_timestamp and new netstacklat_tcp_data_queue_ofo, this increases to about 35% (from ~0.033% to 0.045%). For all the other cases the relative increase in overhead is lower, with around 25% at 1% packet loss, 10% at 10% loss, and 5% at 0% loss.

simosund · 2025-09-29T16:29:09Z

When you have time @netoptimizer, could you check if tcp_data_queue_ofo() is traceable on your systems. I.e that you can attach an fentry program to it.

$ sudo bpftrace -lv 'fentry:tcp_data_queue_ofo'
kfunc:vmlinux:tcp_data_queue_ofo
    struct sock * sk
    struct sk_buff * skb

Discovered that on some systems I could not trace tcp_data_queue_ofo(), presumably because it's been inlined. If tcp_data_queue_ofo() is unavailable I'm not really sure how to solve this issue. Could possibly try to kprobe on tcp_data_queue() with the offset to where tcp_data_queue_ofo() is called (the only callsite), but kprobe with offsets seems cumbersome (based on Toke's blog post) and hardly portable.

The filter_ifindex() and filter_network_ns() functios are often used together. Therefore, add a filter_network() function that budles these two together. Signed-off-by: Simon Sundberg <[email protected]>

simosund · 2025-10-17T17:58:45Z

So apparently tcp_data_queue_ofo() not being traceable is unfortunately not entirely uncommon, and therefore my solution here won't work.

So I have prepared two alternative solutions as well. They both still build on the idea to track an upper range for the out-of-order sequence and avoid any reads below this sequence number. But they use some different mechanisms to track this ooo-range.

The first one is pretty similar to this one, but uses an fexit hook at tcp_data_queue() (the sole call site of tcp_data_queue_ofo()) rather than tcp_data_queue_ofo() directly. It is available in my netstacklat-ooo-alt2 branch.
The other one is a bit different, as it does not use any additional hook to track the ooo-range. Instead it checks for differences in the number of ooo-segments using tcp_sock.rcv_ooopack when our tcp-socket-read hook runs and figures out the ooo-range whenever there is a difference in the counter (indicating that new ooo-segments have arrived). It is available in the netstacklat-ooo-v3 branch.

I suspect both of these approaches will have a bit higher overhead than my original solution, and will rerun my test with both of them to see which of them is the better option.

simosund · 2025-10-20T16:24:16Z

I have now repeated my previous test with to compare the 3 considered solutions. A grafana dashboard with the results is available here.

The 4 major time-groups that can be seen represent the different solutions:

No ooo-filtering (current main)
Tracking ooo-range by tracing fentry:tcp_data_queue_ofo() (branch netstacklat-ooo-alt1)
Tracking ooo-range by tracing fexit:tcp_data_queue() (branch netstacklat-ooo-alt2)
Tracking ooo-range by checking for difference in tcp_sock.rcv_ooopack (branch netstacklat-ooo-alt3)

Within each of those 4 groups, there are 8 smaller groups that (like previously) show the result when running with 0, 0.1%, 0.5%, 1%, 2%, 5%, 10%, and 0% (again) packet loss.

Latency recorded

All solutions appear to be working correctly, removing the group with high-latency measurements that appear without ooo-filtering when there's packet loss.

Measurement count

All solutions seem to result in a similar (small) loss in measurement samples. In many cases the loss is minor enough that it likely fall within the error margins. In relative reduction in measurement samples is (as expected) most pronounced at high packet loss (>= 5%), where we get upwards of ~20% less samples.

Per-invocation overhead

For the second and third groups (alt1 and alt2), the second line shows the runtime overhead for the additional probe (at tcp_data_queue_ofo or tcp_data_queue). The spikes up to around 600ns occurs between the tests when there's very little traffic (so the programs are triggered infrequently by e.g. ssh traffic).

The numbers are not entirely stable throughout the tests, but without the ooo-filtering the overhead is around 280ns per invocation. With ooo-filtering we're looking at around or just below 400ns for alt1 and alt2, while alt3 is (somewhat surprisingly) a bit lower at around 350ns per invocation.

While the per-invocation runtime for tcp_data_queue_ofo (~150ns) for alt1 appears higher than the one for tcp_data_queue (<100ns) for alt2, it's likely largely a side-effect of tcp_data_queue_ofo running much less frequently.

Total overhead (est. CPU util, effectively per-invocation overhead * frequency)

All 3 alternatives seem to result in roughly similar additional overhead for the tcp-socket-read probe itself (~40% higher at 0.1% packet loss), although alt3 is seems to be slightly lower. However, alt1 and alt2 also add additional overhead from tracing extra functions, and especially for alt2 tracing the frequently called tcp_data_queue this makes up for a fair bit of extra overhead (resulting in around twice to total overhead at 0.1% packet loss).

simosund · 2025-10-20T16:33:37Z

Based on the discussion from the meeting today, I will go ahead with the "alt3" solution, as it overall has the lowest overhead and does not need any additional probes. I will therefore repoint this PR to the alt3 branch (instead of the alt1 like it is now).

It is unfortunately also the solution that might be most prone due to (probe) reading socket members outside of the socket lock. To try and reduce the risk for concurrent updates to the socket members we're reading cuasing issues for the ooo-filter, I will add some sanity checks for the read values. Later down the line we might also add some error counters to get a better sense of how common these race conditions are (although will not be able to perfectly detect all issues).

The 'tcp-socket-read' currently reports the latency for the skb containing the last TCP segment read from the socket. However, this segment might have been head of line (HOL) blocked by a previous segment missing. In this case, netstacklat's reported latency will include HOL blocking periods that is dependent on external factors (such as network packet loss, and network latency impacts retransmission time). As netstacklat is primarily intended to identify issues within the local host (in the network stack or receiving applications), by default filter out any socket reads were the last read SKB might have experienced HOL-blocking. Add the new -y/--include-tcp-hol-delay option to retain the old behavior of reporting latency for all reads, including those that are HOL-blocked. This may be useful in some scenarios when you still want to be aware of latency issues caused by HOL-blocking, even though it is caused by external components. For example, in a data center context were you have full control over the network, it may still be relevant to monitor HOL-based caused by the network. To exclude HOL-blocked reads, detect if any new ooo-segments have arrived by checking for differences in the number of ooo-packets in tcp_sock->rcv_ooopack. If any new ooo-segments have arrived, exclude the latency sample from the current read and set a limit for the next safe sequence number to read where the current ooo-packets must have been passed so segments can no longer be HOL-blocked. If there are skbs in the ooo-queue, set the limit to the end of the ooo-queue. Otherwise, set the limit to the current rcv_nxt (as if the ooo-queue is empty the detected ooo-segments must already have been merged into the receive queue and rcv_nxt must have advanced past them). If the read is past the safe sequence limit and no new ooo-segments have arrived, it's safe to start including the latency samples again. For sockets were some ooo-segments have been observed, keep the ooo-range state in socket storage (BPF_MAP_TYPE_SK_STORAGE). Skip protecting this state with a spin-lock, as it should only be concurrently accessed if there are concurrent reads on the same TCP socket, which is assumed to be very rare as applications attempting that cannot know which part of the data each of their concurrent reads will get. There are some scenarios that may cause this ooo-filtering to fail. - If multiple reads are done to the socket concurrently, we may not correctly track the last read byte. The kernel does not keep a lock on the TCP socket at the time our hooked function tcp_recv_timestamp() runs. If two reads are done in parallel, it's therefore possible that for both reads we will check the last read byte (tcp_sock.copied_seq) after the second read has updated it. We may then incorrectly conclude that the first read was ahead of the ooo-range when it was not, and record its latency when we should have excluded it. In practice I belive this issue should be quite rare, as most applications will probably not attempt to perform multiple concurrent reads to a single connected TCP socket in parallel (as then you cannot know which part of the payload the parallel reads will return). - As tcp_recv_timestamp() runs outside of the socket lock, the various state members we access may concurrently be updated as we're attempting to read them. An especially problematic one is tcp_sock.ooo_last_skb, which keeps a pointer to an SKB that is only valid while the ooo-queue is non-empty. It is possible that between our check for if the ooo-queue is non-empty and following the ooo_last_skb pointer, the ooo-queue is cleared and the ooo_last_skb pointer may end up pointing towards a freed SKB. If the socket members we access are updated before or while we read them, it can break the filtering in numerous ways, e.g. result in includes samples that should have been excluded (due to e.g. copied_seq being updated before our read) or excluding a large amount of valid samples (due to e.g. setting a sequence limit based on garbage in a freed SKB). Signed-off-by: Simon Sundberg <[email protected]>

The logic for excluding samples from TCP reads that may have been delayed by HOL blocking relies on reading a number of fields from the TCP socket outside of the socket lock. This may be prone to errors due to the socket state being updated at another place in the kernel while our eBPF program is running. To reduce the risk that data races causes issues for our HoL detection logic, add a number of sanity checks to the read values. The most problematic of the read fields is ooo_last_skb, as that is a pointer to another skb. This pointer is only valid as long as the out_of_order_queue is non-empty. Due to a data race, we may check that the ooo-queue is non-empty while there are still SKBs in it, then the ooo-queue is cleared by the kernel, and then we attempt to read the contents of the ooo_last_skb SKB, which may at this point have been freed and/or recycled. This may result in incorrect values being used for the sequence limit used to exclude future reads of ooo-segments. The faulty sequence limit may both cause reads of HOL-blocked segments to be included or the exclusion of an unnecessarily large amount of future reads (up to 2 GB). To reduce the risk that the garbage data from an invalid SKB is used, introduce two sanity checks for end_seq in the ooo_last_skb. First check if the sequence number is zero, if so assume it is invalid (even though it can be a valid sequence number). Even though we will get an error code if reading the data from this SKB fails altogether, we may still succeed reading from a no longer valid SKB, in which case there is a high risk the data will have been zeroed. If it's non-zero, also check that it is within the current receive window (if not, clamp it to the receive window). Also introduce sanity checks for rcv_nxt and copied_seq in the tcp_sock, ensuring that they monotonically increase. To enable this, the last read (sane) value is tracked together with the ooo-state in the socket storage. For these, do not consider sequence 0 invalid, as these fields should be valid (although possibly updated in-between) as long as reading them succeeds (and failure to read is detected through the returned error code of bpf_core_read()). Skip adding similar monotonic growth validity checks for the rcv_wup field that now may need to be probed to compute the receive window as a compromise to not have to unconditionally probe and update its state every time. For the rcv_wnd field, also needed to calculate the receive window, I am not aware of any simple validity checks to perform. Signed-off-by: Simon Sundberg <[email protected]>

simosund · 2025-10-22T21:42:00Z

@netoptimizer In my last commit I've added some sanity checks like we talked about on the meeting. I am however not sure if these sanity checks are a very good idea, as they do come with a noticeable increase in overhead and still offer somewhat limited protection.

Performance impact

I've rerun my test with various degrees of packet loss, and the results are available in this dashboard. The first run is with current master (no ooo-filter at all), the second with "alt3" as of commit dff6bc5 (without sanity checks, same as alt3 in my previous test), and the last is alt3 with the sanity checks added (in commit 59e923d).

Looking at performance for the tcp-socket-read hook (the only one that should be affected), it looks like this:

Program runtime

NOTE: The spikes to ~700ns here are from periods of very little traffic, so focus on the parts between the spikes.

Runtime without ooo-filter varies from around 250 - 290ns, runtime with ooo-filter varies from around 300 - 390ns, and runtime with ooo-filter + added sanity checks varies from around 350 - 470ns. So the sanity checks add about as much additional overhead as the entire filtering mechanism itself. Or phrased a differently, the sanity checks roughly double the overhead of the entire filtering mechanism as a whole.

Overall CPU util

The largest CPU utilization overall seems to occur at the test with 0.1% packet loss (second spike in each group). For this test the CPU utilization without ooo-filtering is about 0.34%, about 0.46% with ooo-filtering and around 0.57% with ooo-filtering + sanity checks. So once again, the sanity checks roughly double the overhead of the ooo-filtering.

Limitations with the sanity checks

The sanity checks are described in the commit message (and the code). There's basically two types of them.

A sanity check that the sequence number (skb->cb.end_seq) read from tp->ooo_last_skb is within the current receive window
Sanity checks that tp->rcv_nxt and tp->copied_seq did not decrease since last read

The sanity check that the ooo sequence number is within the receive window adds some overhead as it requires probing the tp->rcv_wup and tp->rcv_wnd fields as well. It also adds a potential new error source as now we need to trust that they are right, although it's less likely they're catastrophically wrong than the skb read (as they don't risk being read from a freed skb).

I'm also not entirely sure that the end of the ooo-sequence number must always be within the current receive window, even if that should usually be the case. Partly because depending on system configuration it appears that the receive window can shrink (see the tcp_select_window()), in which the current receive window might actually be smaller than the receive window was when the skb was added to the ooo-queue. And partly because as far as I can tell from the code in tcp_data_queue() it only drops packets where the starting sequence (cb.seq) is ahead of the receive window, meaning that segments where the end sequence (cb.end_seq) is ahead of the receive window may still be added to the error queue.

I'm also not sure how much the "doesn't decrease between reads" checks on rcv_nxt and copied_seq really help, as even if we get a complete garbage value (which we generally shouldn't here though, as the socket should still be valid while our eBPF program runs and 32-bit writes I assume are "atomic" on most architectures) there's only about a 50% chance we will detect the error. Or put another way, it won't detect errors below ~2GB. It might be possible to reel in these limits further if there's some good way to determine the maximum amount of data that can be data that can be read in a single TCP socket read, but even if possible that will likely require probe reading additional socket members (adding additional overhead and error sources).

Conclusion

The sanity check on the data in the ooo_last_skb would be really good to have (as that is read from a potentially invalid pointer), but it does add some overhead (and new potential error sources) due to having to probe additional fields, and I'm not sure the concept behind it (comparing it against the current receive window) is entirely sound.

The sanity checks on rcv_nxt and coped_seq are quite weak. They also add some overhead as now they always have to be probed to keep track of their state, while before they only needed to be probed conditionally when needed.

Minor refactor of network filtering

cc17c68

The filter_ifindex() and filter_network_ns() functios are often used together. Therefore, add a filter_network() function that budles these two together. Signed-off-by: Simon Sundberg <[email protected]>

simosund force-pushed the netstacklat-handleretrans branch 2 times, most recently from a7182a6 to 50ee793 Compare October 16, 2025 15:46

simosund force-pushed the netstacklat-handleretrans branch from 50ee793 to fca192b Compare October 21, 2025 18:49

simosund force-pushed the netstacklat-handleretrans branch from fca192b to 59e923d Compare October 22, 2025 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

netstacklat - Handle TCP head-of-line blocking #133

netstacklat - Handle TCP head-of-line blocking #133

Uh oh!

simosund commented Sep 29, 2025

Uh oh!

simosund commented Sep 29, 2025 •

edited

Loading

Uh oh!

simosund commented Sep 29, 2025

Uh oh!

simosund commented Oct 17, 2025 •

edited

Loading

Uh oh!

simosund commented Oct 20, 2025

Uh oh!

simosund commented Oct 20, 2025

Uh oh!

simosund commented Oct 22, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

netstacklat - Handle TCP head-of-line blocking #133

Are you sure you want to change the base?

netstacklat - Handle TCP head-of-line blocking #133

Uh oh!

Conversation

simosund commented Sep 29, 2025

Uh oh!

simosund commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Impact on latency distribution

Before

After

Number of measurements

Before:

After:

eBPF program runtime

Before

After

Total (estimated) eBPF CPU overhead

Before

After

Uh oh!

simosund commented Sep 29, 2025

Uh oh!

simosund commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simosund commented Oct 20, 2025

Latency recorded

Measurement count

Per-invocation overhead

Total overhead (est. CPU util, effectively per-invocation overhead * frequency)

Uh oh!

simosund commented Oct 20, 2025

Uh oh!

simosund commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance impact

Program runtime

Overall CPU util

Limitations with the sanity checks

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

simosund commented Sep 29, 2025 •

edited

Loading

simosund commented Oct 17, 2025 •

edited

Loading

simosund commented Oct 22, 2025 •

edited

Loading