
MPTCP client gets only half throughput it should receive with 802.11n #150

artyomcherepanov opened this issue Nov 2, 2016 · 32 comments


@artyomcherepanov

Are you aware of MPTCP throughput degradation over 802.11n?

In my experiments I have used Ubuntu 16.04.

  • With the default Linux kernel (without MPTCP), throughput over a single 802.11n interface is ~140 Mbps.
  • With the MPTCP v0.91 kernel and MPTCP disabled, throughput over a single 802.11n interface is ~140 Mbps.
  • With the MPTCP v0.91 kernel and MPTCP enabled, throughput over a single 802.11n interface is ~69 Mbps.
    The same degradation over 802.11n is also seen with several MPTCP connections.

I've found the same observations at http://ebsis.info.uaic.ro/school2016/niculescu.mptcp.wireless.ebsis.pdf (page 69).

Is this expected? Are there any workarounds that would allow the benefits of MPTCP to be seen over 802.11n?

Thanks in advance

@artyomcherepanov artyomcherepanov changed the title MPTCP clients gets only half throughput it should receive with 802.11n MPTCP client gets only half throughput it should receive with 802.11n Nov 2, 2016
@matttbe (Member) commented Nov 2, 2016

Hi,

Did you try with net.mptcp.mptcp_checksum=0? If it is enabled, some operations will have to be done by the CPU, not in hardware.

@artyomcherepanov (Author)

Hi,

Sure, net.mptcp.mptcp_checksum is set to 0.
Also, MPTCP over all other interfaces (LAN, 4G, 802.11a) works fine; this issue occurs only on 802.11n.

@matttbe (Member) commented Nov 2, 2016

It would be nice to check what is causing the degradation: CPU load on the client/server side, more losses on the client/server side, which path manager, the number of subflows, the TCP congestion control used, etc.

@artyomcherepanov (Author)

CPU does not seem to be the issue; usage is usually below 15% on both server and client.
The path manager is fullmesh. The issue is reproduced even if only one subflow is used through the single 802.11n interface. The TCP congestion control mechanism is balia; however, changing it to any *lia has no noticeable effect on throughput.
My configuration is:

  1. Linksys WiFi AP (router) with 802.11n support
  2. Laptop (client) with embedded WiFi, connected to the router over-the-air
  3. x86 server, connected to the router via Ethernet, and also connected to the client through an Ethernet switch.
    Both server and client have the same MPTCP configuration.
    In my experiments I enabled MPTCP globally but disabled it for the Ethernet connection from server to client, so only the WiFi connection is MPTCP-enabled.
    I see low TCP throughput (<70 Mbps) with this configuration (UDP throughput is not affected).
    If MPTCP is also disabled on the WiFi connection, throughput reaches the expected value (twice as high as with MPTCP enabled).
    For traffic generation I use iperf with default TCP parameters.

@matttbe (Member) commented Nov 2, 2016

Strange. Christoph certainly has more ideas, but just to be sure, could you try with cubic instead of balia?
Could you also check with ss/wireshark whether you have a lot of losses?
Not sure it could happen, but maybe the router doesn't enable some hardware acceleration (which would be strange, because it doesn't need much information to forward traffic).

@cpaasch (Member) commented Nov 4, 2016

This sounds like we don't fill the WiFi channel enough to get the desired throughput. It's rather a low-level issue.

Just as a shot in the dark, can you increase the sysctl tcp_limit_output_bytes? (For example, double it.)

If you have a packet-trace, it would be great!

@artyomcherepanov (Author)

Hi,
Changing the congestion control to cubic has no positive effect on throughput.
I'll try increasing tcp_limit_output_bytes, but I will only be able to do it on Monday.
The packet trace is attached (collected with balia):
[baliasrv.zip](https://github.com/multipath-tcp/mptcp/files/572328/baliasrv.zip)

@cpaasch (Member) commented Nov 4, 2016

Yeah, looking at the trace it really seems like tcp_limit_output_bytes is the culprit. Please let us know if it helped. If yes, we have to fix something in that area.

@jaehyun-hwang (Contributor)

Hi,
I had a similar problem recently, only with 802.11n; it worked well with LTE, LAN, 802.11ac, etc.
Through some deeper analysis, I found that there can be some unexpected interaction between MPTCP scheduling and TCP transmission when 802.11n is used. More specifically: suppose an ACK arrives at the MPTCP sender just after some data packets (allowed by cwnd) have been scheduled to a subflow, and the subflow is about to transmit them in tcp_write_xmit(). At this point, although the subflow has more cwnd quota thanks to the new ACK, it can only send the data already scheduled by MPTCP.
Let me give you a simple example: cwnd is 4 and in_flight is also 4. Then an ACK that covers 2 packets arrives. Now MPTCP will schedule 2 packets from the mptcp-send-queue to the subflow-send-queue, and the subflow tries to send that data in tcp_write_xmit(). At that moment, another ACK arrives and opens 2 more packets of cwnd. The subflow should be able to send 4 packets in total, but it can send only 2 because the latter two packets are not scheduled yet. We have to wait for the next packet scheduling, and this leads to performance degradation. Regular TCP works fine because its send queue always has enough data, as no packet scheduling is needed. (See the sketch at the end of this comment.)

If you want to test whether this is the case, please try the following:
In mptcp_write_xmit() (mptcp_output.c), comment out the below two lines:
/*
if (sublimit)
limit = min(limit, sublimit);
*/
By doing this, we avoid MPTCP-level fragmentation: it will schedule slightly more data than the cwnd allows, and the skb fragmentation will finally be performed with an up-to-date cwnd in tcp_write_xmit(). This can alleviate the above problem.
I know this is not a fundamental solution, and we need to find a better way if my analysis is correct. One possibility is to re-run MPTCP scheduling in tcp_write_xmit() if the mptcp-send-queue has more (unscheduled) data than the subflow's (cwnd_quota - new packets in the subflow-send-queue), but that does not look easy. It would be good to discuss how to solve it.
Lastly, I don't know why 802.11n generates this kind of ACK pattern. What I found is that 802.11n performs frame aggregation, which aggregates data packets for about 2.5 ms to improve transmission efficiency, and this probably results in an unusual ACK pattern. I expect some WiFi experts may be able to give us a better explanation.
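
To make the timing issue concrete, here is a small, self-contained userspace sketch (purely an illustration with the numbers from the example above, not kernel code) contrasting a regular TCP send queue with an MPTCP subflow whose send queue only holds what the scheduler queued before the second ACK arrived:

```c
#include <stdio.h>

/* Toy model of one tcp_write_xmit() pass: send as many packets as both
 * the congestion-window quota and the local send queue allow.
 */
static int write_xmit(int cwnd, int in_flight, int queued)
{
	int quota = cwnd - in_flight;		/* cwnd_quota */
	return queued < quota ? queued : quota;
}

int main(void)
{
	int cwnd = 4, in_flight = 4;

	/* First ACK frees 2 packets; MPTCP schedules exactly 2 packets
	 * (5 and 6) onto the subflow's send queue.
	 */
	in_flight -= 2;
	int subflow_queued = 2;	/* only what the scheduler pushed */
	int tcp_queued = 4;	/* regular TCP: plenty of data already queued */

	/* Second ACK (packets 3-4) arrives before tcp_write_xmit() runs. */
	in_flight -= 2;

	printf("regular TCP sends %d packets\n",
	       write_xmit(cwnd, in_flight, tcp_queued));	/* prints 4 */
	printf("MPTCP subflow sends %d packets\n",
	       write_xmit(cwnd, in_flight, subflow_queued));	/* prints 2; quota left unused */
	return 0;
}
```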

@cpaasch (Member) commented Nov 7, 2016

Hello @jaehyun-hwang,

Your analysis seems to me to confirm that tcp_limit_output_bytes is the "problem". Can someone who can reproduce this issue try increasing tcp_limit_output_bytes? (I don't have a machine with a WiFi interface at hand.)

@jaehyun-hwang, you mention:

Let me give you a simple example: cwnd is 4 and in_flight is also 4. Then an ACK that covers 2 packets arrives. Now MPTCP will schedule 2 packets from the mptcp-send-queue to the subflow-send-queue, and the subflow tries to send that data in tcp_write_xmit(). At that moment, another ACK arrives and opens 2 more packets of cwnd. The subflow should be able to send 4 packets in total, but it can send only 2 because the latter two packets are not scheduled yet. We have to wait for the next packet scheduling, and this leads to performance degradation. Regular TCP works fine because its send queue always has enough data, as no packet scheduling is needed.

Why wouldn't the other 2 packets be scheduled? From what I see, the only reason this could happen is if TCP's rate limiting kicks in.

@artyomcherepanov (Author)

Hi,

Looks like tcp_limit_output_bytes is not the cause; changing it from 128K to 512K didn't improve throughput.

@cpaasch (Member) commented Nov 7, 2016

Thanks!

@cpaasch (Member) commented Nov 7, 2016

I will take a closer look at your packet-traces.

@jaehyun-hwang (Contributor)

Hi Christoph,
What I meant is that the other 2 packets are not yet inserted into the subflow's send-queue at that moment.
Let me explain a bit more.
Let's assume:
cwnd_quota = cwnd - in_flight
The mptcp-send-queue has data packets 1 to 8.
The subflow's cwnd is 4 and in_flight is also 4, meaning cwnd_quota is 0.

(1) The MPTCP sender receives an ACK that acks packets 1-2, and now we can send two new packets as cwnd_quota becomes 2.
(2) In mptcp_write_xmit(), packets 5-6 are inserted from the mptcp-send-queue into the subflow's send-queue.
(3) In tcp_write_xmit(), the subflow tries to send the new packets in its send-queue, as much as cwnd_quota allows.

At this moment, suppose another ACK that acks packets 3-4 is received, updating cwnd_quota to 4. Regular TCP could send 4 packets in total in this tcp_write_xmit() loop, consuming all of the cwnd_quota. But in our case, because there are only two packets to send (5 and 6) in the subflow's send-queue, only 2 are sent and a cwnd_quota of 2 is left over. We have to wait for the next mptcp_write_xmit() to insert packets 7-8 into the subflow's send-queue in order to consume the remaining cwnd_quota.
I have found that in the normal case cwnd_quota usually becomes 0 after tcp_write_xmit() is called, but with 802.11n I observed it is often non-zero, which means the link is not fully utilized in time. I suspect this is the main reason for the degradation, although I still don't know why it happens only with 802.11n.
Please let me know if anything is unclear or missing.

@cpaasch (Member) commented Nov 21, 2016

@jaehyun-hwang Thanks for the explanation (and sorry for the delay). When the "another ACK 3-4" comes in at TCP, it should schedule a call to mptcp_write_xmit(), which should trigger the sending of 7-8. The fact that it does not get sent probably means that we are limiting the subflow somewhere else, e.g., TCP small queues...
I don't have a means to test on an 802.11n device. Can you investigate further and figure out why data doesn't get sent on the subflow? We must be limiting the sending somewhere; I would start by looking at mptcp_write_xmit() and the scheduler function mptcp_next_segment().

@cpaasch (Member) commented Nov 21, 2016

@Cherep07: Can you do a new run of experiments? I would like to see one run with regular TCP and one with MPTCP (mptcp_checksum disabled) on a single interface, both using cubic congestion control.

What I see in the trace is that the sender is increasing its rate up to the point where there is a longer idle period. After that the sender stops increasing the rate. So, I am wondering why with MPTCP we stop increasing, while with TCP we still do so. Having both traces allows me to compare them.

Thanks!

@jaehyun-hwang (Contributor)

@cpaasch I'm happy to investigate this problem further with you. I think it would be better to share my kernel logs to explain the problem clearly. I need some time to do this, as I have to re-set up my PCs at home to reproduce this phenomenon. I will get back to you in a couple of days.

@jaehyun-hwang (Contributor)

Hello,

In my system, iperf reaches up to 120-130 Mbps with a regular TCP flow, but stays around 60 Mbps with an MPTCP subflow, so the phenomenon is reproduced.

I haven't found the main cause yet, but one clue is that when I set the iperf client's (i.e., sender's) TCP window to 64k with "-w 64k", it achieves 120 Mbps.

I looked at some internal variables:
(mptcp_enabled=1, mptcp_checksum=0, tcp algo is cubic)

  1. Default (auto-tuning):
  • cwnd: quickly reaches 1100
  • sk_sndbuf (both meta_sk and subsk): 4194304 (equivalent to tcp_wmem[2])
  2. With the "-w 64k" option only at the sender:
  • cwnd: 108
  • sk_sndbuf: 131072 (64k * 2)

Since the cwnd is very high in the default case (while throughput is low), the queuing delay is also high. (Actually we can utilize the WiFi link with a cwnd of only ~100, as seen in the second case.)

Another observation is that in the second case the ACKs (and data) are a bit bursty and the TSO value is above 10. But in the default case there is a 2-packet ping-pong pattern: receive an ACK for 2 packets, send 2 data packets, and so on, resulting in a small TSO value of 2 or 3.
Lastly, I observed that TSQ does not limit the speed.

Any ideas? I will continue to look at it and let you know if there are any new observations.

@Cherep07 can you try again with "-w 64k" at the iperf client side?

@cpaasch (Member) commented Nov 28, 2016

@jaehyun-hwang Thanks for all this info!

When you say "tso value is above 10", what do you mean by "tso value"? With the 2-packet ping-pong pattern, does this mean that the flight size is only 2?

I think we need to try to narrow down the possible reasons for what is going wrong here.
For case 1, can you run different tests, each disabling a certain setting:

  1. Disable TSO/GSO (with ethtool)
  2. Disable TCP auto-corking (sysctl tcp_autocorking)

Also, if you have a packet-trace, can you share it so we can compare it with @Cherep07's trace?
Finally, it might give some info to trace the TCP subflow with ss -i. That way we could see whether MPTCP queues enough data in the subflow and it is the subflow itself that decides not to send it in tcp_write_xmit(). I think this ss info would really help us narrow it down.

Thanks a lot again!

@jaehyun-hwang (Contributor)

@cpaasch

  • "tso value" means the value of the "tso_segs" variable in the tcp_write_xmit() loop.

  • Regarding the 2-packet pattern, sorry, my wording was a bit confusing. It means (cwnd - in_flight) is 2. For example, cwnd and in_flight are 1000, an ACK covering 2 packets comes in, in_flight becomes 998, it sends 2 data packets, and so on.

  • Disabling TSO/GSO: still 60 Mbps

  • Disabling autocorking: still 60 Mbps

One thing to note: when TSO/GSO is disabled, it shows 60 Mbps even with the "-w 64k" option. Autocorking seems to have no effect in any case.

Sorry, unfortunately the packet traces cannot be shared. Below are snapshots of "ss -i" for both cases.

Case 1 - 60Mbps)
cubic wscale:7,7 rto 421 rtt:220.517/0.196 mss:1428 cwnd:1116 ssthresh:17 send 57.8Mbps lastrcv:279218722 pacing_rate 115.6Mbps unacked:1104 rcv_space:29200

Case 2 - 120Mbps with -w 64k)
cubic wscale:7,7 rto 208 rtt:7.749/1.021 mss:1428 cwnd:110 ssthresh:26 send 162.2Mbps lastrcv:279348589 pacing_rate 324.3Mbps unacked:92 rcv_space:29200

For clarity, my topology is:
(MPTCP sender)-(eth0)-----(AP)-----(wlan0)-(MPTCP receiver)
The sender uses a wired link to the AP while the receiver uses a WiFi interface to the AP. (The above logs are taken on the sender.)
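
As a sanity check on those two ss -i snapshots, the send rates follow directly from cwnd * mss / rtt; a small throwaway C program with the numbers above (purely illustrative arithmetic, nothing MPTCP-specific):

```c
#include <stdio.h>

/* Throughput implied by a congestion window: cwnd * mss bytes per RTT. */
static double rate_mbps(double cwnd, double mss_bytes, double rtt_ms)
{
	return cwnd * mss_bytes * 8.0 / (rtt_ms / 1000.0) / 1e6;
}

int main(void)
{
	/* Numbers taken from the ss -i output above. */
	printf("case 1: %.1f Mbps (cwnd 1116, rtt 220.5 ms)\n",
	       rate_mbps(1116, 1428, 220.517));	/* ~57.8 Mbps */
	printf("case 2: %.1f Mbps (cwnd 110, rtt 7.75 ms)\n",
	       rate_mbps(110, 1428, 7.749));	/* ~162 Mbps */

	/* Rough BDP at ~8 ms RTT and ~130 Mbps: the cwnd actually needed. */
	double bdp_pkts = 130e6 / 8.0 * 0.008 / 1428.0;
	printf("~%.0f packets fill the link at 130 Mbps / 8 ms RTT\n", bdp_pkts);
	return 0;
}
```

In other words, the huge cwnd in case 1 only shows up as queuing delay (220 ms RTT), while roughly 90-100 packets in flight are enough to fill this link.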

@cpaasch (Member) commented Nov 28, 2016

@jaehyun-hwang Thanks for the quick turnaround.

As we are filling the congestion window, it means that the MPTCP scheduler is properly pushing enough traffic onto the link.
However, because the TSO size is 2, there is something going wrong in the TSO-size selection. These small packets increase the ACK traffic (because the receiver can't do any coalescing), and that is the reason for the low throughput: the WiFi channel gets congested. This also explains why we can't reproduce it on a purely wired link, where we have full duplex.

Can you check in the traces (would be easier if you could share them ;)) on the sender what the packet-size is compared to the DSS data-length in the MPTCP-option? If the packet-size is smaller than the DSS data-length it means that the packet gets split in tcp_write_xmit. If not, then the packet gets split earlier in the MPTCP-code.

Then, please check what the segment-size in tcp_sendmsg is (after the call to tcp_send_mss, check the variable mss_now). Finally, check the segment-size in mptcp_write_xmit.

Thanks!

@cpaasch (Member) commented Nov 28, 2016

Also, do you see the same issue in v0.90?

@jaehyun-hwang (Contributor)

@cpaasch I agree with your thoughts.

  • Regarding the comparison between packet size and DSS data-length: I didn't look at the traces, but through kernel logs I could confirm that the skb is always split in mptcp_write_xmit(), and not in tcp_write_xmit().
    -> I tried skipping "mptcp_fragment" in mptcp_write_xmit() by setting a very large value for the "limit" variable (so that the call to mptcp_fragment is bypassed), and this change improves throughput to around 110 Mbps, increasing the TSO size to 9. We may have to look around this area.

  • mss_now was 1428 in both functions.

  • Actually, I have used v0.90 on both sides.

@cpaasch (Member) commented Nov 29, 2016

@jaehyun-hwang Yes, it seems to me that in the scheduler mptcp_next_segment() the problem is that we size the segment so that it fits exactly into the congestion window. As 2 packets have been acked, we will again send a 2-packet burst.

This shows that always scheduling only up to the cwnd is not good. For better performance, we probably have to schedule more than what the cwnd allows on a subflow ("scheduling" means queuing it in the write-queue, not yet sending it on the wire). It's a difficult balance to find, because we don't want to queue too much on a slow subflow. I can imagine that calling tcp_tso_autosize() in mptcp_next_segment() might be a good solution, as it takes the subflow's rate into account (a rough sketch of this sizing idea follows the patch below).

Can you try out the following (completely untested; it is based on the latest mptcp_v0.91-branch):

diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
index 2e1d08a6e930..a2e3bb189907 100644
--- a/net/mptcp/mptcp_sched.c
+++ b/net/mptcp/mptcp_sched.c
@@ -420,31 +420,7 @@ static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
 	if (skb->len <= mss_now)
 		return skb;
 
-	/* The following is similar to tcp_mss_split_point, but
-	 * we do not care about nagle, because we will anyways
-	 * use TCP_NAGLE_PUSH, which overrides this.
-	 *
-	 * So, we first limit according to the cwnd/gso-size and then according
-	 * to the subflow's window.
-	 */
-
-	gso_max_segs = (*subsk)->sk_gso_max_segs;
-	if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-		gso_max_segs = 1;
-	max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-	if (!max_segs)
-		return NULL;
-
-	max_len = mss_now * max_segs;
-	window = tcp_wnd_end(subtp) - subtp->write_seq;
-
-	needed = min(skb->len, window);
-	if (max_len <= skb->len)
-		/* Take max_win, which is actually the cwnd/gso-size */
-		*limit = max_len;
-	else
-		/* Or, take the window */
-		*limit = needed;
+	*limit = mss_now * tcp_tso_autosize(subsk, mss_now);
 
 	return skb;
 }
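
For context, tcp_tso_autosize() sizes a burst at roughly 1 ms worth of data at the socket's pacing rate, with a floor of 2 segments. The following is a simplified userspace sketch of that sizing idea from memory of the mainline logic of that era (it omits details such as the MAX_TCP_HEADER cap, so treat it as approximate), fed with the pacing rates from the ss output earlier in the thread:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative only: roughly how tcp_tso_autosize() picks a burst size.
 * It aims at ~1 ms worth of data at the current pacing rate
 * (pacing_rate >> 10), never below 2 segments, never above gso_max_segs.
 */
static uint32_t tso_autosize_sketch(uint64_t pacing_rate_bytes_per_sec,
				    uint32_t mss_now, uint32_t gso_max_segs)
{
	uint64_t bytes = pacing_rate_bytes_per_sec >> 10;	/* ~1 ms of data */
	uint32_t segs = (uint32_t)(bytes / mss_now);

	if (segs < 2)			/* tcp_min_tso_segs default */
		segs = 2;
	if (segs > gso_max_segs)
		segs = gso_max_segs;
	return segs;
}

int main(void)
{
	/* pacing_rate values from the ss -i output above, in bytes/s. */
	uint64_t rate1 = 115600000ULL / 8;	/* 115.6 Mbps */
	uint64_t rate2 = 324300000ULL / 8;	/* 324.3 Mbps */

	printf("case 1: %u segs per burst\n",
	       tso_autosize_sketch(rate1, 1428, 65535 / 1428));	/* ~9 */
	printf("case 2: %u segs per burst\n",
	       tso_autosize_sketch(rate2, 1428, 65535 / 1428));	/* ~27 */
	return 0;
}
```

So, instead of sending whatever the last ACK happened to free (the 2-packet bursts), the burst size scales with the subflow's measured rate.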

@jaehyun-hwang (Contributor)

@cpaasch Thanks for the patch!
With the patch it now achieves 105 Mbps, as the TSO value increases to 9. But it goes back down to 60 Mbps after 12-14 seconds.

From the kernel logs I now confirm that in mptcp_write_xmit() the skb is no longer split because of "sublimit" (set by mptcp_sched.c), but it is split because of "limit".
"limit" becomes lower than "sublimit" after the call to tcp_mss_split_point() in mptcp_write_xmit() after 12-13 seconds, decreasing the TSO size to 2 again. It seems we need to do something about "limit" as well.

Since v0.90 doesn't have tcp_tso_autosize(), I simply added it inside mptcp_sched.c, as it has no dependency on other functions.

@cpaasch (Member) commented Nov 29, 2016

@jaehyun-hwang Hmmm... I see.

I think there are two solutions:

  1. Either we bring back tcp_tso_should_defer() into mptcp_write_xmit(). That way we won't schedule the skb on the subflow if waiting a bit would allow us to send a bigger segment.
  2. Or we don't do the tcp_mss_split_point() in mptcp_write_xmit() and leave it up to the subflow. But that might make us send beyond the window. I will have to check that.

Can you try out the first option?

And please update to v0.91 for the testing here. God knows if what you are observing is an artifact of some incompatibility from backporting tcp_tso_autosize to v0.90.

@jaehyun-hwang (Contributor)

@cpaasch For the first option, could you give a bit more detail on where the right place is to put tcp_tso_should_defer()?
For the second option, it should work, as I have already tested it, but yes, it might push slightly more data than the window allows.

Let me upgrade my systems to v0.91 on both sides, and I will get back to you tomorrow.

@jaehyun-hwang (Contributor)

@cpaasch I have upgraded my systems to v0.91.

  • v0.91 + the above patch shows the same result as v0.90 + the patch.
  • I tried to bring tcp_tso_should_defer() back into mptcp_write_xmit() as you suggested, but failed. The v0.91 code of mptcp_write_xmit() is quite different from v0.88. Any suggestions?
  • I downgraded the sender-side system to v0.88, which includes tcp_tso_should_defer() in mptcp_write_xmit(). In this case, MPTCP shows 95 Mbps while regular TCP achieves 120 Mbps.

@cpaasch (Member) commented Dec 2, 2016

@jaehyun-hwang Sorry for not coming back earlier. I'm quite busy with other stuff at the moment, porting mptcp_trunk to v4.4 right now, and it's very tricky. I will try to come back to you by tomorrow.

@jaehyun-hwang (Contributor)

@cpaasch That's perfectly fine. We can discuss further next week. Have a nice weekend!

@rstanislav

Any updates? This issue affects almost any configuration with a wireless link.

@madhanraj

@rstanislav: Can you please try setting tcp_adv_win_scale to 2 on the client side? What throughput is reached?
