
Conversation

@simosund
Contributor

Currently netstacklat may report latency for head-of-line (HOL) blocked TCP reads. TCP HOL-blocking can occur when there is packet loss or reordering, as the TCP stack then has to delay the delivery of a segment to the socket receive queue until the preceding TCP segments have been received.

As this HOL-blocking is dependent on factors external to the local host (i.e. the packet loss and latency of the entire network path from the sender), this is not the type of latency we typically want to highlight with netstacklat. This PR addresses this issue by not recording the latency for reads that may experience HOL-blocking.

See the commit messages for further details on how this is accomplished and some limitations.

@simosund
Contributor Author

simosund commented Sep 29, 2025

In the following test I demonstrate the issue this PR attempts to fix and what the impact of this patch is. Here I'm fetching some 100 kB files over a link with 34 ms latency and loss varying between 0%, 0.1%, 0.5%, 1%, 2%, 5%, 10% and finally back to 0% again.

Interactive versions of the graphs below together with some additional data is available in this Grafana snapshot (the "before" data corresponds with the earlier half in the snapshot, and "after" is the later half).

NOTE: "Before" is actually still from this PR, just with the HOL-blocking filter it introduces disabled, roughly corresponding to passing the new --include-tcp-hol-delay option. It sets a constant in the rodata, so the JIT compiler should dead-code eliminate the whole thing.

Impact on latency distribution

Before

netstacklat_tcpread_withoutholfilt

As can be seen, as soon as there's some packet loss we start seeing a small fraction of significantly higher latencies, several of them starting in the 16ms+ bucket (that bucket ends just below the configured 34ms link latency). As the packet loss is increased, we see a roughly proportional growth in the fraction of high latency values caused by HOL-blocking. At higher packet loss rates we even see some latencies stretching into the 500ms+ range (likely from timeouts or the same segment needing to be retransmitted multiple times).

After

netstacklat_tcpread_withholfilt

The filter successfully avoids recording any latency values that suffer from considerable HOL-blocking. We can still see a little bit of a shift in the latency distribution for the non-HOL-blocked reads, though.

Number of measurements

NOTE: I've updated this section, as in the initial posting I accidentally used the wrong hook (tcp-socket-enqueued) and then found no difference in the number of measurements between before and after (which is to be expected, as this PR has no effect on the tcp-socket-enqueued hook). I've now updated this with correct figures for tcp-socket-read.

Before:

netstacklat_tcpread_count_withoutholfilt

After:

netstacklat_tcpread_count_withholfilt

As expected, we see a small drop in the number of measurements when we exclude the potentially HOL-blocked reads. As the packet loss increases, the relative reduction in samples grows. Overall the reduction is pretty small, though, suggesting that we're not discarding a large amount of valid samples.

eBPF program runtime

Before

netstacklat_runtime_withoutholfilt

After

netstacklat_runtime_withholfilt

It does clearly add a bit of overhead for the netstacklat_tcp_recv_timestamp program (which is the one recording the tcp-socket-read latency). When there's no packet loss, the difference appears to be quite small, with both before and after showing around 250ns per invocation. The overhead does however seem to grow as the packet loss increases, reaching around 400ns at 1% packet loss (vs around 280ns before). However, past 1% the overhead drops down to around 330ns (vs 280-320ns before).

Clearly there is some variation in the measured runtime here and the numbers are not super exact. Overall though, I would say that the additional overhead is measurable but most of the time not too problematic. As this shows the average overhead, we get a bit of a tradeoff between the additional work of looking up the ooo-range and comparing it to the last byte read, vs the lower overhead in the instances where we exclude the measurement and then don't have to do the hashmap lookups to update the histogram.

Here we can also see the netstacklat_tcp_queue_ofo() program, which is responsible for tracking the upper ooo sequence range whenever there's an out-of-order packet. In the before case it's basically a NOP (but still measures in at 30ns here). In the after case it still only takes between 70-100ns. As it only runs when there are out-of-order packets, the overall overhead from this is fairly low. In these runs, it only ran about 2.5% as frequently as netstacklat_tcp_recv_timestamp at 0.1% packet loss, and about 25% as frequently at 10% packet loss.

Total (estimated) eBPF CPU overhead

Before

netstacklat_estimatedcpu_withoutholfilt

After

netstacklat_estimatedcpu_withholfilt

This is the estimated CPU overhead for all eBPF programs from netstacklat (based on the cumulative runtime for each eBPF program invocation as measured by kernel.bpf_stats_enabled). The largest difference is at 0.1% packet loss, where the total CPU utilization increased by around 10% (from ~0.175% to ~0.195%). If we focus on only the affected programs, netstacklat_tcp_recv_timestamp and the new netstacklat_tcp_data_queue_ofo, the increase is about 35% (from ~0.033% to ~0.045%). For all the other cases the relative increase in overhead is lower, with around 25% at 1% packet loss, 10% at 10% loss, and 5% at 0% loss.

@simosund
Contributor Author

When you have time @netoptimizer, could you check whether tcp_data_queue_ofo() is traceable on your systems, i.e. that you can attach an fentry program to it?

$ sudo bpftrace -lv 'fentry:tcp_data_queue_ofo'
kfunc:vmlinux:tcp_data_queue_ofo
    struct sock * sk
    struct sk_buff * skb

I discovered that on some systems I could not trace tcp_data_queue_ofo(), presumably because it's been inlined. If tcp_data_queue_ofo() is unavailable I'm not really sure how to solve this issue. I could possibly try a kprobe on tcp_data_queue() with the offset to where tcp_data_queue_ofo() is called (the only callsite), but kprobes with offsets seem cumbersome (based on Toke's blog post) and hardly portable.

The filter_ifindex() and filter_network_ns() functions are often used
together. Therefore, add a filter_network() function that bundles
these two together.
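
A minimal sketch of the bundled helper (argument types are assumed
here; the real filter_ifindex()/filter_network_ns() may take
different parameters):

    static __always_inline bool filter_network(struct sk_buff *skb,
                                               struct sock *sk)
    {
            /* Both existing filters must pass for the sample to be kept. */
            return filter_ifindex(skb) && filter_network_ns(sk);
    }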

Signed-off-by: Simon Sundberg <[email protected]>
@simosund simosund force-pushed the netstacklat-handleretrans branch 2 times, most recently from a7182a6 to 50ee793 Compare October 16, 2025 15:46
@simosund
Contributor Author

simosund commented Oct 17, 2025

So apparently tcp_data_queue_ofo() not being traceable is unfortunately not entirely uncommon, and therefore my solution here won't work.

So I have prepared two alternative solutions as well. They both still build on the idea of tracking an upper range for the out-of-order sequence and excluding any reads below this sequence number, but they use different mechanisms to track this ooo-range.

  • The first one is pretty similar to this one, but uses an fexit hook at tcp_data_queue() (the sole call site of tcp_data_queue_ofo()) rather than tcp_data_queue_ofo() directly. It is available in my netstacklat-ooo-alt2 branch.
  • The other one is a bit different, as it does not use any additional hook to track the ooo-range. Instead it checks for differences in the number of ooo-segments using tcp_sock.rcv_ooopack when our tcp-socket-read hook runs and figures out the ooo-range whenever there is a difference in the counter (indicating that new ooo-segments have arrived). It is available in the netstacklat-ooo-v3 branch.

I suspect both of these approaches will have a bit higher overhead than my original solution, and I will rerun my test with both of them to see which is the better option.

@simosund
Contributor Author

I have now repeated my previous test to compare the 3 considered solutions. A Grafana dashboard with the results is available here.

The 4 major time-groups that can be seen represent the different solutions:

  • No ooo-filtering (current main)
  • Tracking ooo-range by tracing fentry:tcp_data_queue_ofo() (branch netstacklat-ooo-alt1)
  • Tracking ooo-range by tracing fexit:tcp_data_queue() (branch netstacklat-ooo-alt2)
  • Tracking ooo-range by checking for difference in tcp_sock.rcv_ooopack (branch netstacklat-ooo-alt3)

Within each of those 4 groups, there are 8 smaller groups that (like previously) show the result when running with 0, 0.1%, 0.5%, 1%, 2%, 5%, 10%, and 0% (again) packet loss.

Latency recorded

retrans_test_tcpread_latency

All solutions appear to be working correctly, removing the group of high-latency measurements that appears without ooo-filtering when there's packet loss.

Measurement count

retrans_test_tcpread_count

All solutions seem to result in a similar (small) loss in measurement samples. In many cases the loss is minor enough that it likely falls within the error margins. The relative reduction in measurement samples is (as expected) most pronounced at high packet loss (>= 5%), where we get upwards of ~20% fewer samples.

Per-invocation overhead

retrans_test_tcpread_runtime

For the second and third groups (alt1 and alt2), the second line shows the runtime overhead for the additional probe (at tcp_data_queue_ofo or tcp_data_queue). The spikes up to around 600ns occur between the tests when there's very little traffic (so the programs are triggered infrequently, by e.g. ssh traffic).

The numbers are not entirely stable throughout the tests, but without the ooo-filtering the overhead is around 280ns per invocation. With ooo-filtering we're looking at around or just below 400ns for alt1 and alt2, while alt3 is (somewhat surprisingly) a bit lower at around 350ns per invocation.

While the per-invocation runtime for tcp_data_queue_ofo (~150ns) for alt1 appears higher than the one for tcp_data_queue (<100ns) for alt2, it's likely largely a side-effect of tcp_data_queue_ofo running much less frequently.

Total overhead (est. CPU util, effectively per-invocation overhead * frequency)

retrans_test_tcpread_overhead

All 3 alternatives seem to result in roughly similar additional overhead for the tcp-socket-read probe itself (~40% higher at 0.1% packet loss), although alt3 seems to be slightly lower. However, alt1 and alt2 also add overhead from tracing extra functions, and especially for alt2, which traces the frequently called tcp_data_queue, this makes up a fair bit of extra overhead (resulting in around twice the total overhead at 0.1% packet loss).

@simosund
Contributor Author

Based on the discussion from the meeting today, I will go ahead with the "alt3" solution, as it has the lowest overall overhead and does not need any additional probes. I will therefore repoint this PR to the alt3 branch (instead of alt1 as it is now).

It is unfortunately also the solution that might be most error-prone due to (probe) reading socket members outside of the socket lock. To try to reduce the risk that concurrent updates to the socket members we're reading cause issues for the ooo-filter, I will add some sanity checks for the read values. Later down the line we might also add some error counters to get a better sense of how common these race conditions are (although they will not be able to perfectly detect all issues).

The 'tcp-socket-read' hook currently reports the latency for the skb
containing the last TCP segment read from the socket. However, this
segment might have been head-of-line (HOL) blocked by a previous
segment missing. In this case, netstacklat's reported latency will
include HOL-blocking periods that depend on external factors (such as
network packet loss, and the network latency which impacts
retransmission time). As netstacklat is primarily intended to identify
issues within the local host (in the network stack or receiving
applications), by default filter out any socket reads where the last
read SKB might have experienced HOL-blocking.

Add the new -y/--include-tcp-hol-delay option to retain the old
behavior of reporting latency for all reads, including those that are
HOL-blocked. This may be useful in scenarios where you still want to
be aware of latency issues caused by HOL-blocking, even though it is
caused by external components. For example, in a data center context
where you have full control over the network, it may still be
relevant to monitor HOL-blocking caused by the network.

To exclude HOL-blocked reads, detect if any new ooo-segments have
arrived by checking for differences in the number of ooo-packets in
tcp_sock->rcv_ooopack. If any new ooo-segments have arrived, exclude
the latency sample from the current read and set a limit for the next
safe sequence number to read, past which the current ooo-packets must
have been delivered so that segments can no longer be HOL-blocked. If
there are skbs in the ooo-queue, set the limit to the end of the
ooo-queue. Otherwise, set the limit to the current rcv_nxt (as if the
ooo-queue is empty the detected ooo-segments must already have been
merged into the receive queue and rcv_nxt must have advanced past
them). Once the read is past the safe sequence limit and no new
ooo-segments have arrived, it's safe to start including the latency
samples again.
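
A rough sketch of this logic (the ooo_state struct, helper names and
the simplified error handling are illustrative; the tcp_sock field
names are the real ones, and the usual CO-RE includes are assumed):

    struct ooo_state {
            __u32 last_ooopack; /* tp->rcv_ooopack seen at the previous read */
            __u32 safe_seq;     /* reads at/after this seq are HOL-safe again */
            bool active;        /* currently excluding samples */
    };

    /* Wrap-safe "a < b" for 32-bit TCP sequence numbers */
    static __always_inline bool seq_before(__u32 a, __u32 b)
    {
            return (__s32)(a - b) < 0;
    }

    static bool read_may_be_hol_blocked(struct tcp_sock *tp, struct ooo_state *st)
    {
            __u32 ooopack = BPF_CORE_READ(tp, rcv_ooopack);
            __u32 copied_seq = BPF_CORE_READ(tp, copied_seq);
            __u32 limit = 0;

            if (ooopack != st->last_ooopack) {
                    struct sk_buff *last = BPF_CORE_READ(tp, ooo_last_skb);

                    st->last_ooopack = ooopack;

                    if (BPF_CORE_READ(tp, out_of_order_queue.rb_node) && last)
                            /* end of the ooo-queue; end_seq lives in skb->cb */
                            bpf_probe_read_kernel(&limit, sizeof(limit),
                                                  &((struct tcp_skb_cb *)last->cb)->end_seq);
                    else
                            /* ooo-queue already drained; rcv_nxt has passed it */
                            limit = BPF_CORE_READ(tp, rcv_nxt);

                    st->safe_seq = limit;
                    st->active = true;
                    return true;
            }

            if (st->active && !seq_before(copied_seq, st->safe_seq))
                    st->active = false; /* read is past the ooo-range */

            return st->active;
    }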

For sockets where some ooo-segments have been observed, keep the
ooo-range state in socket storage (BPF_MAP_TYPE_SK_STORAGE). Skip
protecting this state with a spin-lock, as it should only be
concurrently accessed if there are concurrent reads on the same TCP
socket, which is assumed to be very rare as applications attempting
that cannot know which part of the data each of their concurrent reads
will get.
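
The socket storage could look roughly like this, reusing the
ooo_state struct from the sketch above (map name is illustrative, and
the real code may create the storage only once ooo-segments have been
observed):

    struct {
            __uint(type, BPF_MAP_TYPE_SK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct ooo_state);
    } netstacklat_ooo_state SEC(".maps");

    /* In the tcp-socket-read program: */
    struct ooo_state *st = bpf_sk_storage_get(&netstacklat_ooo_state, sk, NULL,
                                              BPF_SK_STORAGE_GET_F_CREATE);
    if (!st)
            return 0; /* no storage available; skip the filter (or the sample) */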

There are some scenarios that may cause this ooo-filtering to fail.
- If multiple reads are done to the socket concurrently, we may not
  correctly track the last read byte. The kernel does not keep a lock
  on the TCP socket at the time our hooked function
  tcp_recv_timestamp() runs. If two reads are done in parallel, it's
  therefore possible that for both reads we will check the last read
  byte (tcp_sock.copied_seq) after the second read has updated it. We
  may then incorrectly conclude that the first read was ahead of the
  ooo-range when it was not, and record its latency when we should
  have excluded it. In practice I believe this issue should be quite
  rare, as most applications will probably not attempt to perform
  multiple concurrent reads to a single connected TCP socket in
  parallel (as then you cannot know which part of the payload the
  parallel reads will return).

- As tcp_recv_timestamp() runs outside of the socket lock, the various
  state members we access may concurrently be updated as we're
  attempting to read them. An especially problematic one is
  tcp_sock.ooo_last_skb, which keeps a pointer to an SKB that is only
  valid while the ooo-queue is non-empty. It is possible that between
  our check for if the ooo-queue is non-empty and following the
  ooo_last_skb pointer, the ooo-queue is cleared and the ooo_last_skb
  pointer may end up pointing towards a freed SKB. If the socket
  members we access are updated before or while we read them, it can
  break the filtering in numerous ways, e.g. result in including
  samples that should have been excluded (due to e.g. copied_seq being
  updated before our read) or excluding a large amount of valid
  samples (due to e.g. setting a sequence limit based on garbage in a
  freed SKB).

Signed-off-by: Simon Sundberg <[email protected]>
@simosund simosund force-pushed the netstacklat-handleretrans branch from 50ee793 to fca192b Compare October 21, 2025 18:49
The logic for excluding samples from TCP reads that may have been
delayed by HOL-blocking relies on reading a number of fields from the
TCP socket outside of the socket lock. This may be prone to errors due
to the socket state being updated elsewhere in the kernel while our
eBPF program is running. To reduce the risk that data races cause
issues for our HOL detection logic, add a number of sanity checks for
the read values.

The most problematic of the read fields is ooo_last_skb, as that is a
pointer to another skb. This pointer is only valid as long as the
out_of_order_queue is non-empty. Due to a data race, we may check that
the ooo-queue is non-empty while there are still SKBs in it, then the
ooo-queue is cleared by the kernel, and then we attempt to read the
contents of the ooo_last_skb SKB, which may at this point have been
freed and/or recycled. This may result in incorrect values being used
for the sequence limit used to exclude future reads of
ooo-segments. The faulty sequence limit may either cause reads of
HOL-blocked segments to be included or exclude an unnecessarily large
amount of future reads (up to 2 GB).

To reduce the risk that garbage data from an invalid SKB is used,
introduce two sanity checks for end_seq in the ooo_last_skb. First,
check if the sequence number is zero; if so, assume it is invalid
(even though zero can be a valid sequence number). While we will get
an error code if reading the data from this SKB fails altogether, we
may still succeed in reading from a no longer valid SKB, in which
case there is a high risk the data will have been zeroed. If it is
non-zero, also check that it is within the current receive window (if
not, clamp it to the receive window).
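
As a sketch, the two checks could look like this (helper name is
mine; seq_before() is the wrap-safe comparison from the earlier
sketch, and win_end follows the usual rcv_wup + rcv_wnd definition of
the right window edge):

    /* Returns a sanitized end_seq, or 0 if the value looks invalid. */
    static __u32 sanitize_ooo_end_seq(__u32 end_seq, __u32 rcv_wup, __u32 rcv_wnd)
    {
            __u32 win_end = rcv_wup + rcv_wnd; /* right edge of the receive window */

            if (end_seq == 0)
                    return 0; /* zeroed data suggests a freed/recycled skb */

            if (seq_before(win_end, end_seq))
                    return win_end; /* clamp to the current receive window */

            return end_seq;
    }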

Also introduce sanity checks for rcv_nxt and copied_seq in the
tcp_sock, ensuring that they monotonically increase. To enable this,
the last read (sane) value is tracked together with the ooo-state in
the socket storage. For these, do not consider sequence 0 invalid, as
these fields should be valid (although possibly updated in-between) as
long as reading them succeeds (and failure to read is detected through
the returned error code of bpf_core_read()).
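
A sketch of such a monotonicity check (names are mine; the last sane
value is assumed to be kept in the per-socket state, and seq_before()
is the wrap-safe comparison from the earlier sketch):

    /* True unless new_val appears to have moved backwards (wrap-safe). */
    static __always_inline bool seq_is_monotonic(__u32 prev, __u32 new_val)
    {
            return !seq_before(new_val, prev);
    }

    /* Usage, e.g. for rcv_nxt:
     *
     *     if (!seq_is_monotonic(st->last_rcv_nxt, rcv_nxt))
     *             return -EINVAL; // treat like a failed read
     *     st->last_rcv_nxt = rcv_nxt;
     */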

Skip adding similar monotonic growth validity checks for the rcv_wup
field, which may now need to be probed to compute the receive window,
as a compromise to avoid having to unconditionally probe and update
its state every time. For the rcv_wnd field, which is also needed to
calculate the receive window, I am not aware of any simple validity
checks to perform.

Signed-off-by: Simon Sundberg <[email protected]>
@simosund simosund force-pushed the netstacklat-handleretrans branch from fca192b to 59e923d Compare October 22, 2025 15:03
@simosund
Contributor Author

simosund commented Oct 22, 2025

@netoptimizer In my last commit I've added some sanity checks like we talked about at the meeting. I am however not sure these sanity checks are a very good idea, as they come with a noticeable increase in overhead and still offer somewhat limited protection.

Performance impact

I've rerun my test with various degrees of packet loss, and the results are available in this dashboard. The first run is with current master (no ooo-filter at all), the second with "alt3" as of commit dff6bc5 (without sanity checks, same as alt3 in my previous test), and the last is alt3 with the sanity checks added (in commit 59e923d).

Looking at performance for the tcp-socket-read hook (the only one that should be affected), it looks like this:

Program runtime

retrans_checksoverhead_runtime

NOTE: The spikes to ~700ns here are from periods of very little traffic, so focus on the parts between the spikes.

Runtime without the ooo-filter varies from around 250-290ns, runtime with the ooo-filter varies from around 300-390ns, and runtime with the ooo-filter + added sanity checks varies from around 350-470ns. So the sanity checks add about as much additional overhead as the entire filtering mechanism itself; phrased differently, they roughly double the overhead of the filtering mechanism as a whole.

Overall CPU util

retrans_checksoverhead_cpu

The largest CPU utilization overall seems to occur at the test with 0.1% packet loss (second spike in each group). For this test the CPU utilization without ooo-filtering is about 0.34%, about 0.46% with ooo-filtering and around 0.57% with ooo-filtering + sanity checks. So once again, the sanity checks roughly double the overhead of the ooo-filtering.

Limitations with the sanity checks

The sanity checks are described in the commit message (and the code). There are basically two types of them.

  • A sanity check that the sequence number (skb->cb.end_seq) read from tp->ooo_last_skb is within the current receive window
  • Sanity checks that tp->rcv_nxt and tp->copied_seq did not decrease since last read

The sanity check that the ooo sequence number is within the receive window adds some overhead, as it requires probing the tp->rcv_wup and tp->rcv_wnd fields as well. It also adds a potential new error source, as we now need to trust that those values are right, although they're less likely to be catastrophically wrong than the skb read (as they don't risk being read from a freed skb).

I'm also not entirely sure that the end of the ooo sequence range must always be within the current receive window, even if that should usually be the case. Partly because, depending on system configuration, it appears that the receive window can shrink (see tcp_select_window()), in which case the current receive window might actually be smaller than the receive window was when the skb was added to the ooo-queue. And partly because, as far as I can tell from the code in tcp_data_queue(), it only drops packets where the starting sequence (cb.seq) is ahead of the receive window, meaning that segments where the end sequence (cb.end_seq) is ahead of the receive window may still be added to the ooo-queue.

I'm also not sure how much the "doesn't decrease between reads" checks on rcv_nxt and copied_seq really help, as even if we get a complete garbage value (which we generally shouldn't get here, as the socket should still be valid while our eBPF program runs, and 32-bit writes I assume are "atomic" on most architectures) there's only about a 50% chance we will detect the error. Or put another way, the check cannot detect garbage values that land within ~2GB ahead of the previous value. It might be possible to reel in these limits further if there's some good way to determine the maximum amount of data that can be read in a single TCP socket read, but even if that's possible it will likely require probe reading additional socket members (adding additional overhead and error sources).
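
To make the ~50%/~2GB point concrete: TCP sequence comparisons use wrap-safe 32-bit arithmetic, so "did not decrease" really means "lies within the 2^31 (~2 GB) window ahead of the previous value", and a uniformly random garbage value lands in that window about half the time. A minimal illustration:

    /* True for 0 <= (now - prev) mod 2^32 < 2^31, i.e. roughly half of all
     * possible 32-bit values pass for any given prev. */
    static bool seq_not_decreased(__u32 prev, __u32 now)
    {
            return (__s32)(now - prev) >= 0;
    }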

Conclusion

The sanity check on the data in the ooo_last_skb would be really good to have (as that is read from a potentially invalid pointer), but it does add some overhead (and new potential error sources) due to having to probe additional fields, and I'm not sure the concept behind it (comparing it against the current receive window) is entirely sound.

The sanity checks on rcv_nxt and copied_seq are quite weak. They also add some overhead, as now they always have to be probed to keep track of their state, while before they only needed to be probed conditionally when needed.
