MPTCP client gets only half the throughput it should receive with 802.11n #150
Hi, did you try with net.mptcp.mptcp_checksum=0?
Hi, sure, net.mptcp.mptcp_checksum=0 is set.
It would be nice to check what is causing the degradation: CPU on the client/server side, more losses on the client/server side, which path manager is used, the number of subflows, the TCP congestion control used, etc.
CPU does not seem to be the case; usually it's less than 15% on both the server and the client.
Strange. Christoph certainly has more ideas, but just to be sure, could you try with cubic instead of balia?
This sounds like we don't fill the WiFi channel enough to get the desired throughput. It's rather a low-level issue. Just as a shot in the dark, can you increase the sysctl tcp_limit_output_bytes? If you have a packet trace, it would be great!
Yeah, looking at the trace it really seems like tcp_limit_output_bytes is the culprit. Please let us know if it helped. If yes, we have to fix something in that area.
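For context, tcp_limit_output_bytes feeds the TCP Small Queues (TSQ) check in tcp_write_xmit(). The following is a paraphrased sketch of that logic in kernels of this era (net/ipv4/tcp_output.c), not the literal source:

```c
/* Paraphrased TCP Small Queues check from tcp_write_xmit()
 * (not the literal kernel source). If too many bytes are still
 * queued below TCP (qdisc/driver), the sender throttles itself
 * even though the congestion window would allow more.
 */
limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

if (atomic_read(&sk->sk_wmem_alloc) > limit) {
	set_bit(TSQ_THROTTLED, &tp->tsq_flags);
	break;	/* stop sending until TX completions drain the queue */
}
```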
Hi, if you want to test whether this is the case, please try the following:
Hello @jaehyun-hwang, your analysis seems to confirm that tcp_limit_output_bytes is the "problem". Can someone who can reproduce this issue try increasing tcp_limit_output_bytes? (I don't have a machine with a WiFi interface at hand.) @jaehyun-hwang, you mention:
Why wouldn't the other 2 packets be scheduled? The only reason why this could happen, from what I see, is if the rate-limiting from TCP kicks in.
Hi, it looks like tcp_limit_output_bytes is not the cause; changing it from 128K to 512K didn't improve throughput.
Thanks!
I will take a closer look at your packet traces.
Hi Christoph, (1) The MPTCP sender receives an ack that acks packets 1-2, and now we can send two new packets as cwnd_quota becomes 2. At this moment, suppose another ack that acks packets 3-4 is received, updating cwnd_quota to 4. If this were regular TCP, it could send a total of 4 packets in this tcp_write_xmit loop, consuming all of cwnd_quota. But in our case, because there are only two packets to send (5 and 6) in the subflow's send queue, only 2 of the quota can be consumed. We have to wait for the next mptcp_write_xmit() to insert packets 7-8 into the subflow's send queue in order to consume the remaining cwnd_quota.
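To make the scenario concrete, here is a heavily simplified sketch of the tcp_write_xmit() send loop being described (paraphrased from net/ipv4/tcp_output.c, not the literal kernel code):

```c
/* Simplified sketch of the tcp_write_xmit() loop (paraphrase, not the
 * literal kernel source). The loop ends as soon as the subflow's send
 * queue is empty, even if cwnd_quota has not been fully consumed.
 */
while ((skb = tcp_send_head(sk)) != NULL) {	/* next queued segment */
	cwnd_quota = tcp_cwnd_test(tp, skb);	/* cwnd minus in-flight */
	if (!cwnd_quota)
		break;				/* cwnd full: stop */

	/* ... receive-window, Nagle and TSO checks elided ... */

	tcp_transmit_skb(sk, skb, 1, gfp);
}
/* In the MPTCP case above, packets 7-8 are not yet in this subflow's
 * queue, so the loop exits with 2 units of cwnd_quota unused; they can
 * only be consumed after the next mptcp_write_xmit() refills the queue.
 */
```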
@jaehyun-hwang Thanks for the explanation (and sorry for the delay). When the "another ack 3-4" comes in at TCP, it should schedule a call to ...
@Cherep07: Can you do a new run of experiments? I would like to see one run with regular TCP, and one with MPTCP (mptcp_checksum disabled) on a single interface, both using cubic congestion control. What I see in the trace is that the sender is increasing its rate up to the point where there is a longer idle period. After that the sender stops increasing the rate. So I am wondering why with MPTCP we stop increasing, while with TCP we still do. Having both traces allows me to compare them. Thanks!
@cpaasch I'm happy to investigate this problem further with you. I think it would be better to share my kernel logs to explain the problem clearly. I need some time to do this, as I have to set up my PCs at home again to reproduce this phenomenon. I will get back to you in a couple of days.
Hello, in my system iperf reaches up to 120-130 Mbps with a regular TCP flow, but remains around 60 Mbps with an MPTCP subflow, so the phenomenon is reproduced. I couldn't find the main cause yet, but one clue is that when I set the iperf client's (i.e., the sender's) TCP window to 64K with "-w 64k", it achieves 120 Mbps. I looked at some internal variables:
Since the cwnd is very high in the default case (but shows low throughput), its queuing delay is also high. (Actually, we can fully utilize the WiFi link with only 100 cwnd, as seen in the second case.) Another observation is that in the second case the acks (and data) are a bit bursty and the tso value is above 10, but in the default case it shows a 2-packet ping-pong pattern: it receives 2 acks and sends 2 data packets, resulting in a small tso value of 2 or 3. Any idea? I will continue to look at it and let you know if there's any new observation. @Cherep07, can you try again with "-w 64k" at the iperf client side?
@jaehyun-hwang Thanks for all this info! When you say "tso value is above 10", what do you mean by "tso value"? With the 2-packet ping-pong pattern, does this mean that the flight-size is only 2? I think we need to try to narrow down the possible reasons for what is going wrong here.
Also, if you have a packet trace, can you share it so we can compare with @Cherep07's trace? Thanks a lot again!
One thing is that, when disabling TSO/GSO, it shows 60 Mbps even with the "-w 64k" option. Autocorking seems not to be effective in any case. Sorry, unfortunately the packet traces cannot be shared. Below are snapshots of "ss -i" for both cases. Case 1 - 60 Mbps) Case 2 - 120 Mbps with -w 64k) For your better understanding, my topology is:
@jaehyun-hwang Thanks for the quick turnaround. As we are filling the congestion window, it means that the MPTCP scheduler is properly pushing enough traffic on the link. Can you check in the traces (it would be easier if you could share them ;)) what the packet size on the sender is compared to the DSS data-length in the MPTCP option? If the packet size is smaller than the DSS data-length, it means that the packet gets split in tcp_write_xmit. If not, then the packet gets split earlier in the MPTCP code. Then, please check what the segment size in tcp_sendmsg is (after the call to tcp_send_mss, check the variable mss_now). Finally, check the segment size in mptcp_write_xmit. Thanks!
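A minimal sketch of the kind of instrumentation this asks for, assuming a kernel you can rebuild (the printk lines are hypothetical additions of mine, not part of any patch in this thread):

```c
/* Hypothetical debug instrumentation; not part of this thread's patches.
 * In tcp_sendmsg() (net/ipv4/tcp.c), right after the MSS is computed:
 */
mss_now = tcp_send_mss(sk, &size_goal, flags);
printk(KERN_DEBUG "tcp_sendmsg: mss_now=%d size_goal=%d\n",
       mss_now, size_goal);

/* Log the segment size seen in mptcp_write_xmit()
 * (net/mptcp/mptcp_output.c) the same way to compare the two values.
 */
```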
Also, do you see the same issue in v0.90?
@cpaasch I agree with your thoughts.
@jaehyun-hwang Yes, it seems to me that in the scheduler ... This shows that always scheduling only up to the cwnd is not good. For better performance, we probably have to schedule more than what the cwnd allows on a subflow ("scheduling" means queuing it in the write queue, not yet sending it on the wire). It's a difficult balance to find, because we don't want to queue too much on a slow subflow. I can imagine that calling tcp_tso_autosize() there could help. Can you try out the following (completely untested; it is based on the latest mptcp_v0.91 branch):
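As an illustration of the idea only (a hypothetical sketch, not the actual patch from this comment), the per-subflow scheduling limit could be sized the way tcp_tso_autosize() sizes TSO bursts, rather than capped at the cwnd:

```c
/* Hypothetical sketch only -- not the actual patch from this thread.
 * Idea: let the MPTCP scheduler queue more than the cwnd allows on a
 * subflow, sizing the chunk like tcp_tso_autosize() does, so that
 * tcp_write_xmit() never runs dry while cwnd_quota is still available.
 * Mirrors tcp_tso_autosize() from net/ipv4/tcp_output.c: roughly 1 ms
 * worth of data at the current pacing rate.
 */
static u32 mptcp_sched_autosized_limit(const struct sock *sk,
				       unsigned int mss_now)
{
	u32 bytes, segs;

	bytes = min(sk->sk_pacing_rate >> 10,
		    (u32)(sk->sk_gso_max_size - 1 - MAX_TCP_HEADER));

	/* At least 2 segments, so the subflow never stalls on a
	 * single-MSS quota. */
	segs = max_t(u32, bytes / mss_now, 2);

	return min_t(u32, segs, sk->sk_gso_max_segs) * mss_now;
}
```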
@cpaasch Thanks for the patch! From the kernel logs, I now confirm that in mptcp_write_xmit the skb is not split by "sublimit" (set by mptcp_sched.c) but by "limit". Since v0.90 doesn't have tcp_tso_autosize(), I simply added it inside mptcp_sched.c, as it has no dependency on other functions.
@jaehyun-hwang Hmmm... I see. I think there are two solutions:
Can you try out the first option? And please update to v0.91 for the testing here. God knows if what you are observing is an artifact of some incompatibility between the ...
@cpaasch For the first option, could you give a bit more detail about the right place to put tcp_tso_should_defer? Let me upgrade my systems to v0.91 on both sides, and I will get back to you tomorrow.
@cpaasch I have upgraded my systems to v0.91.
@jaehyun-hwang Sorry for not coming back earlier. I'm quite busy with other stuff at the moment, porting mptcp_trunk to v4.4 right now, and it's very tricky. I will try to come back to you by tomorrow.
@cpaasch That's perfectly fine. We can discuss further next week. Have a nice weekend!
Any updates? This issue affects almost any configuration with a wireless link.
@rstanislav: Can you please try setting tcp_adv_win_scale to 2 on the client side? What's the throughput reached?
Are you aware of MPTCP throughput degradation over 802.11n?
In my experiments I have used Ubuntu 16.04.
The same degradation over 802.11n is also seen with several MPTCP connections.
I've found the same observations at http://ebsis.info.uaic.ro/school2016/niculescu.mptcp.wireless.ebsis.pdf (page 69).
Is this expected? Are there any workarounds that allow us to see MPTCP benefits over 802.11n?
Thanks in advance