TCP collapse in full duplex connections with MPTCP enabled #261

fciaccia · 2018-05-24T15:45:18Z

TL;DR
I found that MPTCP behaves differently than legacy TCP for duplex TCP connections: under some circumstances the connection completely stalls when sending data in both directions.

Hi all,

I have a specific use case where I am transferring a large amount of data over the same TCP connection in both directions. Once both directions reach high throughput, the connection collapses completely and never recovers.

A consistent reduction in throughput is expected when both directions are carrying full payload data (because ACKs are piggybacked more slowly), and the TCP flow is more likely to collapse when the rmem and wmem maximum values are set higher than the Linux defaults. Specifically, on both endpoints I raised the rmem max value to 8MB and the wmem max value to 6MB (from the default values of 6MB and 4MB respectively).

So far, so good: this behaviour is expected. However, when enabling MPTCP with identical conditions (single subflow, same windows and CC algorithm) it exhibits different symptoms: throughput drops to zero and the connection stalls.

My impression is that MPTCP is not handling this use case (i.e., full duplex TCP) correctly.

The environment I am testing consists of two Debian Jessie VMs, only one of which is multi-homed, with two access links. The connection goes over the Internet, and the RTT is quite high, ~130ms. VM1 is deployed on a KVM host, VM2 is an Amazon EC2 instance. Aggregated capacity in VM1 is 1.2Gbps (1x200Mbps link + 1x1Gbps link), while the Amazon instance can get up to 2Gbps over a single link. Both interfaces in VM1 have a public IPv4 address configured. VM2 in Amazon is behind a NAT.
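
For context, the bandwidth-delay product of the paths described above can be estimated with a few lines of C. This is only a back-of-the-envelope sketch: the helper name `bdp_bytes` is made up for illustration, and the calculation assumes the nominal link rates and the ~130ms RTT quoted above, ignoring protocol overhead.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes.
 * rate is in bit/s and RTT in milliseconds, so dividing by
 * 8 (bits per byte) * 1000 (ms per second) yields bytes in flight. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    return bits_per_sec * rtt_ms / 8000;
}
```

Under those assumptions, `bdp_bytes(200000000, 130)` gives 3250000 bytes for the 200Mbps link, `bdp_bytes(1000000000, 130)` gives 16250000 bytes for the 1Gbps link, and the aggregated 1.2Gbps gives 19500000 bytes.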

I can reproduce the misbehaviour with both MPTCP kernel v0.94 and v0.93, installed directly from the apt repo. I can reproduce the collapse with both the fullmesh path manager (which opens two subflows) and the default path manager. I am using CUBIC as the congestion control algorithm.

uname -a:

Linux mptcp-box 4.14.24.mptcp #9 SMP Fri Mar 9 19:13:05 UTC 2018 x86_64 GNU/Linux

The kernel log with mptcp_debug enabled shows the connection establishment and nothing more until I force-close the connection by interrupting the client/server (public IPs mangled on purpose; the following log was taken while using the default path manager):

[13188.687390] mptcp_alloc_mpcb: created mpcb with token 0x4bf4afb9
[13188.687448] mptcp_add_sock: token 0x4bf4afb9 pi 1, src_addr:1.2.3.4:53676 dst_addr:3.4.5.6:1234, cnt_subflows now 1

On force close:
[13529.223071] mptcp_close: Close of meta_sk with tok 0x4bf4afb9

My sysctl configuration:

net.ipv4.tcp_rmem=4096 87380 8388608 
net.ipv4.tcp_wmem=4096 16384 6291456
net.mptcp.mptcp_path_manager=default
net.ipv4.tcp_congestion_control=cubic
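
To double-check which limits are actually in effect on each endpoint, the `tcp_rmem` / `tcp_wmem` triples (min, default, max, in bytes) can be parsed with a small helper. This is a minimal sketch: the struct and function names are made up for illustration, and reading the values from `/proc/sys/net/ipv4/tcp_rmem` is an assumption about where the running values are exposed on the test machines.

```c
#include <stdio.h>

/* One "min default max" triple as used by tcp_rmem and tcp_wmem (bytes). */
struct tcp_mem_triple {
    unsigned long min, def, max;
};

/* Parse a string such as "4096 87380 8388608". Returns 0 on success. */
int parse_tcp_mem(const char *text, struct tcp_mem_triple *out)
{
    int n = sscanf(text, "%lu %lu %lu", &out->min, &out->def, &out->max);
    return n == 3 ? 0 : -1;
}

/* Read and parse a sysctl file, e.g. /proc/sys/net/ipv4/tcp_rmem. */
int read_tcp_mem(const char *path, struct tcp_mem_triple *out)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int rc = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof line, f))
        rc = parse_tcp_mem(line, out);
    fclose(f);
    return rc;
}
```

With the configuration above, the parsed `max` fields are 8388608 (8 x 1024 x 1024) for tcp_rmem and 6291456 (6 x 1024 x 1024) for tcp_wmem.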

I am attaching a simple C program which reproduces the kind of traffic that triggers the problem. It sends 2GB of random data in both directions over the same connection.

To compile:
gcc -Wall -O2 -pthread -o duplextcp duplextcp.c

Usage server:
./duplextcp -s localIP localPort

Usage client:
./duplextcp destIP destPort

Any help would be much appreciated.
Thank you very much!

duplextcp.zip

matttbe · 2018-05-30T13:00:27Z

Hi @fciaccia,

Thank you for this detailed bug report!
I will try to find time to reproduce it, but it doesn't look very simple to fix as there is not much info from the kernel :)

Matt

cpaasch · 2018-07-28T00:17:43Z

Writing down some notes on this issue here as I am looking into it:

There is packet-loss happening and from that moment on the connection is going to stall. What happens is that the server is retransmitting a lot of out-of-order data and there is ultimately only one segment missing. However, this last one is never getting acknowledged, although it is being retransmitted.

This means that in the TCP-input path somewhere we are dropping it. nstat counters should indicate where, but I can't find which one that is. tcp_try_rmem_schedule should not be the cause as otherwise the PRUNED counters would have increased. It could be added to the backlog-queue and the backlog-queue might have never been scheduled. That is an option as the client is working on two threads for sending/receiving.

Besides that, maybe some other drop is taking place.
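
For anyone who wants to watch the relevant drop counters while reproducing: on Linux they are exposed as a pair of `TcpExt:` lines in /proc/net/netstat, one with the counter names and one with the values in the same order. Below is a minimal sketch of looking up a single counter from such a pair. The function names are made up for illustration, and the caller is assumed to have already read the two `TcpExt:` lines from the file.

```c
#include <stdlib.h>
#include <string.h>

/* Return the next whitespace-delimited token in *p (NULL if none),
 * store its length in *len, and advance *p past it. */
static const char *next_token(const char **p, size_t *len)
{
    const char *s = *p + strspn(*p, " \t\n");

    *len = strcspn(s, " \t\n");
    *p = s + *len;
    return *len ? s : NULL;
}

/* names:  the "TcpExt: Name1 Name2 ..." header line
 * values: the matching "TcpExt: v1 v2 ..." line
 * Returns 0 and stores the value in *out if `want` is found, else -1. */
int netstat_lookup(const char *names, const char *values,
                   const char *want, unsigned long long *out)
{
    size_t nlen, vlen, wlen = strlen(want);
    const char *n, *v;

    /* Walk both lines in lockstep; the i-th name owns the i-th value. */
    while ((n = next_token(&names, &nlen)) != NULL &&
           (v = next_token(&values, &vlen)) != NULL) {
        if (nlen == wlen && memcmp(n, want, wlen) == 0) {
            *out = strtoull(v, NULL, 10);
            return 0;
        }
    }
    return -1;
}
```

Sampling counters such as `PruneCalled`, `RcvPruned`, `OfoPruned` and `TCPBacklogDrop` before and after the stall and comparing the values would show whether any of them moved.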

pRiVi · 2018-09-25T15:15:34Z

I think the main problem here is that there is no (known? documented?) possibility to debug what's happening and what the internal states of mptcp are.

I played with mptcp for two days and saw only strange behaviour, which looked the same as all the bugs reported around here, but I found no way to relate it to internal states or anything like that.

Without this, neither debugging nor helpful bug reports are possible, don't you think?

matttbe added the bug label May 30, 2018

dweb32 mentioned this issue Jul 18, 2018

Available subflows aren't used anymore in long-lasting connections with kernel version 4 #273

Closed
