-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TCP connection stall, avoided by disabling AutoTuneSndBuf #10243
Comments
Thanks for the report. Let me see whether we can repro this. Also: I don't think I knew that gVisor's netstack was used by Docker on Mac. That's awesome. |
I can repro this on Mac. On Linux it runs for the full hour without issue. I'll look at whether there's anything obviously platform-dependent, but it would be really helpful if I could build docker for mac to enable debug logs and add bits that'll help me track down the issue. Is there any way for me to do so? |
A couple first observations about the pcap:
|
I misunderstood the setup. gVisor is sending packets that are originally from the client, and the server is running in a Linux VM. So it's not weird that messing with the send buffer affects behavior. That means that the server is advertising a zero window, which is strange. |
And another question: is the pcap between the client and proxy, or the proxy and iperf server? |
Hi Kevin, thanks for taking a look!
The only way I can suggest is if there are particular things you'd like to add or enable, if you point me at them I can publish a developer build via our internal CI. (Sorry I realise this is very inconvenient!)
Yes that's it -- the I generated the pcap on the Linux side with
On Linux by default Docker Desktop is using qemu's builtin networking. I'd like to switch this to gvisor by default but there's a different bug in play (which I suspect is a lower-level qemu virtio-net problem) If you'd like to play with it, you can edit Back on Mac, the default ethernet layer is a Thanks again for all your help so far |
Another data point: I shared a build with the |
A nice thing about this kind of bug is that it gives me an opportunity to increase debuggability a bit. I've created a dev build of Docker Desktop Mac arm64 Mac amd64 which, if this file exists:
and then the app is started, it will record the last 1024 packets sent to the VM in a ring. When the app quits it writes the ring to disk. So this gives the trace more from the gvisor point of view (rather than using It looks similar to the previous trace Edit: the pcap is has a bug where the frame length == capture length == 100 |
[ Sorry for the spam, here's another trace: docker-desktop-gvisor.zip taken on the host with the frame length field fixed, so Wireshark can see the payload length. I'll refresh the DD builds in case that's useful. ] |
Can you try a build with this patch? My current theory is that the proxy is blocked waiting to write out some data, but the send buffer shrinks and we forget to notify the proxy that it can write again. That would show up in the pcap as a a receive buffer zero window because packets sent by iperf would queue up at the proxy/netstack waiting for the send. I have a theory+test for how |
I refreshed to
It still stalled with a zero window :( Here's a pcap of the last ~1000 packets from the host's point of view: docker-desktop-gvisor.zip |
The part I don't understand from the pcap, (looking at the latest one), Linux (receiving the
assuming the window scaling is 7, it should be fine to send up to
which ends at
so gvisor sends a series of ACKs (window probes?)
but Linux never opens the window. I don't see any more traffic about this connection from Linux in fact. Maybe it's really some kind of connection tracking / firewall issue on my end preventing Linux from opening the window? I'll try to rule that out just in case... |
Agreed that those are window probes sent with exponential backoff. And thanks for looking into the conntrack/firewall stuff.
|
@kevinGC why are TCP checksums zero when sending from gvisor? If we don't have GSO offload then they should be set by gvisor? Is it possible that somehow window probes are being sent with a zero checksum incorrectly? If so Linux would drop them and never see the zero window probe resulting in this behavior. Given that the proxy is a go program not running gvisor maybe the stack is not being setup the same way as gvisor so could behave differently. |
gvisor/pkg/tcpip/transport/tcp/snd.go Line 908 in 5b30c4a
|
gvisor/pkg/tcpip/transport/tcp/connect.go Line 839 in 5b30c4a
|
@kevinGC thanks for your thoughts. I'll dig in a bit more on my side (once I'm back from PTO next week). Thanks again for all your help so far. FYI the checksums are off because I've set |
If Linux is configured to ignore checksums it should not matter. But worth giving it a shot anyway. That said somethings about how send/receive tuning works. Send buffer tuning relies on how much data is acked by receiver to grow send buffers. Receive buffer tuning depends on how quickly the data is read by the application in this case your proxy code. If your receive buffers are not growing it's probably because the proxy is not reading fast enough. If send buffers are not growing it's probably because you are not getting timely acks which move the sliding window forward fast enough. If Linux is not opening it's receive buffer it would mean that iperf is not reading the data which seems highly unlikely. The other reason it could go to zero is if there are too many out of order packets being received and gvisor isnt sending the packet that closes the gap. In such a case then out of order packets will fill up the receive buffer on the Linux side resulting in a stall till the packet that moves the window forward is sent by gVisor. |
Something I missed. I noticed you are using a SOCK_DGRAM unix socket, I would change it to SOCK_SEQPACKET which guarantees in order delivery. |
Just saw the pcap the whole ACKing a packet not seen seems to indicate Out of Order delivery of packets. Your issue might be the use of SOCK_DGRAM vs SOCK_SEQPACKET. |
@hbhasker thanks, I’ll investigate and report back :) Also since there’s no evidence of gvisor malfunctioning (the link to the auto tune buffer option seems more like a coincidence now) feel free to close this ticket in the meantime! |
Thanks for all the help Bhasker. A few thoughts:
My understanding is that ACKing unseen packets usually comes from capture not keeping up with transmission. I'm not sure I follow how unseen packet ACKs indicates out of order delivery.
gVisor is sending window probes and getting nothing back; if if there's a gap we should be getting back an ACK for a sent-and-dropped packet. Even with a zero window, we should be getting some kind of response. And I'll just leave this open until I hear whether SOCK_SEQPACKET does the trick. |
Oh, here's a thought: our zero window probes are broken? IIUC they should be sending the next byte in the probe. From RFC 9293 3.8.6.1:
Per the pcap, we're sending zero-length probes, which IIUC Linux doesn't have to ACK (why would it, it's just a duplicate and up-to-date ACK): This seems to be supported by Wireshark calling the probes So the theory is we're filling up the Linux receive buffer, getting an ACK+zerowindow packet, and never realizing that the window has reopened because we don't properly probe and thus never know that iperf has consumed its bytes. |
Per [RFC 9293 3.8.6.1], zero window probes need at least one byte of payload. Currently we send empty packets, which the receiver sees as an up-to-date, but useless, ACK. This can result in a stall when the receiver never sends data (only ACKs), as netstack will probe forever and the receiver won't know to respond. See #10243. [RFC 9293 3.8.6.1]: https://datatracker.ietf.org/doc/html/rfc9293#name-zero-window-probing PiperOrigin-RevId: 622301779
That's a good point. I wonder how we missed that. Maybe worth making a
quick fix and seeing?
…On Fri, Apr 5, 2024, 3:18 PM Kevin Krakauer ***@***.***> wrote:
Oh, here's a thought: our zero window probes are broken? IIUC they should
be sending the next byte in the probe. From RFC 9293 3.8.6.1
<https://datatracker.ietf.org/doc/html/rfc9293#name-zero-window-probing>:
The sending TCP peer must regularly transmit at least one octet of new
data (if available), or retransmit to the receiving TCP peer even if the
send window is zero, in order to "probe" the window. This retransmission is
essential to guarantee that when either TCP peer has a zero window the
reopening of the window will be reliably reported to the other.
Per the pcap, we're sending zero-length probes, which IIUC Linux doesn't
have to ACK (why would it, it's just a duplicate and up-to-date ACK):
Screenshot.2024-04-05.at.3.07.12.PM.png (view on web)
<https://github.com/google/gvisor/assets/1718925/be6127ce-75f1-4778-84d1-906790dc330c>
This seems to be supported by Wireshark calling the probes TCP Dup ACK
instead of TCP ZeroWindowProbe, which it seems to do elsewhere (picture
from unrelated wireshark Q&A
<https://osqa-ask.wireshark.org/questions/21230/tcp-acked-unseen-segment-wireshark-bug/>
post):
[image: TCP ZeroWindowProbe]
<https://camo.githubusercontent.com/a76d255879fa5de4eb8565af696a12ab503df7953e0fb039d2603fc8f57076aa/68747470733a2f2f6f7371612d61736b2e77697265736861726b2e6f72672f757066696c65732f436170747572655f31372e504e47>
—
Reply to this email directly, view it on GitHub
<#10243 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFSYNOBMDMJMLQ6KF7YF43Y34PJNAVCNFSM6AAAAABFTOAPCGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBQG4YDAMZUGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
#10252 has the change, although I'm still wrestling with tests. |
Wait but that's sending snduna-1. So sending the last acked byte should be
fine too since it will elicit a dup ack.
…On Fri, Apr 5, 2024, 4:49 PM Kevin Krakauer ***@***.***> wrote:
#10252 <#10252> has the change,
although I'm still wrestling with tests.
—
Reply to this email directly, view it on GitHub
<#10243 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFSYNK7RUXLHL6LKIECBHTY342B5AVCNFSM6AAAAABFTOAPCGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBQG43TSNJVHE>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
@hbhasker looks like Linux doesn't follow the RFC either: it basically sends a keepalive, which has the same effect. I.e. it sends a byte with an already-ACKed sequence number, not the next byte. That seems easier, so I'll do that when I get the chance. But I think #10252 should also fix the problem for now. |
Good news -- with #10252 I can't reproduce the stall any more! Regarding
(on Linux it looks ok) I also found a comment in the chromium source
which is a shame. However the bug looks fixed to me which is great news. Thanks, both, for all your help! |
Glad that worked! We're going to go with #10258 (don't need to segment split, more Linux-like, fewer test changes). Feel free to try it or wait for it to get merged. It should also fix the issue. |
A few things I think we should verify. a) I read the linux code and it doesn't seem to send any data at all. It's behaviour is only different when the urgent pointer is set where it sends two probes instead of 1. gVisor doesn't implement urgent pointer so we can just ignore that case. b) As per RFC https://datatracker.ietf.org/doc/html/rfc9293#section-3.10.7.4 sending a packet with an ACK bit set and an out of window sequence number will generate an ACK even if there is no payload. I would write a small linux packet impact test to verify this. Also see (https://sourcegraph.com/github.com/torvalds/linux@fec50db7033ea478773b159e0e2efb135270e3b7/-/blob/net/ipv4/tcp_input.c?L5893) which will always send an out of window dup ack as long as its not rate limited. c) The implementation in 0b36b49 is not what linux does and introduces a payload in the stream that is not even part of the original data. I would not do this as this would be very strange in TCP packet captures where suddenly ZWP carry a different byte for the same sequence number in probes vs a regular retransmit of data. I think we are missing something else here. A packetimpact test should help figure it out better. |
Good catches. a: Agreed, and I ran a little manual test to check. Linux does not send any data; it basically just sends a keepalive. My Mac does send a byte, and the internet says Windows does too. b: Right. In my manual test Linux does respond to the payload-less packet, so I expect we'll see the same from a packetimpact test. And poking around the kernel code it certainly looks like it should send an ACK. c: I think this is harmless because the byte must be discarded for being out-of-window. Keepalives are even explicitly allowed to send garbage. I'll modify the packetimpact tests to verify the payload size of ZWPs. I expect we'll see zero-length probes. But as far as fixing this issue is concerned, I think there are two options:
|
Reg 1. If it was an iptables issue we would observe same stall when using Linux as the client.
I would also check if somehow the checksum for probes is wrong. I know it's unlikely but worth ruling it out. Also the sending garbage payload is a MAY and only to support non compliant TCP implementations. I would suggest just store the last byte before the first unacked byte in the TCP struct separately and use that rather send junk. |
I mean iptables in the particular spin+config of Linux that Docker Mac is using. I don't think we have a way to see "Linux as the client" because the proxy never used Linux's stack, it used Mirage's before gVisor. I will check (heh) the checksum. And if we're going to avoid the garbage byte, I'd rather just do #10252:
Certainly open to discussion, but IMO one way has most of the advantages. |
So I think the garbage byte is a fine option to avoid splitting the segment etc. That said I am still curious on why it works for linux to send a zero byte probe but not for gvisor. Is the sequence number sent by linux any different? From reading the code it should be sndUna-1 but would be good to confirm. Anyway +1 to using the garbage byte for now but maybe keep the issue open to investigate when you have time to see why it works for linux vs not for us. Should be easy to replicate a ZW in gvisor by just not reading the data at all and let it fill and see what linux does vs what a gvisor sender does in a packet impact test. |
Description
[First of all, thanks for writing this awesome piece of software!]
Using the gvisor tcpip library on macOS (across a range of versions, including the latest I could find i.e.
gvisor.dev/gvisor v0.0.0-20240331093445-9d995324d058
) I see a stall where:iperf3 -c localhost -t 3600
(the 3600 is aspirational)localhost:5201
which is accepted by a proxy which calls gonet.DialContextTCPiperf3 -s
is running (in a docker container)Typically after a few minutes the stream stalls (usually < 3 minutes). If I apply this patch it doesn't seem to stall:
I can also achieve the same effect by creating my endpoint manually (not using gonet) and using:
Attached is a .pcap of a TCP session which was transferring data quickly until it stalled:
gvisor-report-s100.zip
Let me know what you think! Thanks again.
Related to docker/for-mac#7207
Steps to reproduce
The bug reproduces with Docker Desktop 4.28 on Mac (which uses gvisor internally).
In one terminal:
In another terminal:
runsc version
No response
docker version (if using docker)
No response
uname
Darwin foo.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:41 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8103 arm64
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
No response
The text was updated successfully, but these errors were encountered: