[v2022.1+] TP-Link WDR4300 hangs during reboot #2904

Open
grische opened this issue May 13, 2023 · 20 comments · Fixed by #3156 · May be fixed by #3397
Labels
0. type: bug (This is a bug)
9. meta: upstream issue (Issue pertains to an upstream project)

Comments

@grische
Contributor

grische commented May 13, 2023

Bug report

What is the problem?
Occasionally, devices (>10% of all devices) hang after an autoupdate and need a manual power cycle to come back up.

I managed to reproduce this while a serial cable was attached:

root@ffmuc-e894f6d457e8:/# autoupdater -f
Retrieving manifest from http://[2001:678:ed0:f000::]/experimental/sysupgrade/experimental.manifest ...
Stopping cron...
Stopping urngd...
Stopping micrond...
Stopping sysntpd...
Stopping gluon-radvd...
Stopping uhttpd...
Stopping sse-multiplexd...
Stopping gluon-respondd...
[  350.261277] sysctl (6540): drop_caches: 3
vm.drop_caches = 3
Downloading image:  6144 / 6144 KiB
[  372.830625] device client1 left promiscuous mode
[  372.835595] br-client: port 4(client1) entered disabled state
[  373.170806] device client0 left promiscuous mode
[  373.175746] br-client: port 5(client0) entered disabled state
Stopping network...
[  374.316737] batman_adv: bat0: Interface deactivated: mesh1
[  374.564283] batman_adv: bat0: Removing interface: mesh1
[  374.587411] batman_adv: bat0: Interface deactivated: mesh0
[  374.782898] batman_adv: bat0: Removing interface: mesh0
[  376.235548] br-client: port 3(bat0) entered disabled state
[  376.241674] br-client: port 2(local-port) entered disabled state
[  376.247889] br-client: port 1(eth0.1) entered disabled state
[  376.336981] device bat0 left promiscuous mode
[  376.341546] br-client: port 3(bat0) entered disabled state
[  376.393905] device eth0.1 left promiscuous mode
[  376.398752] br-client: port 1(eth0.1) entered disabled state
[  376.498386] device local-port left promiscuous mode
[  376.503522] br-client: port 2(local-port) entered disabled state
[  376.807475] batman_adv: bat0: Interface deactivated: primary0
[  376.813326] batman_adv: bat0: Removing interface: primary0
[  376.819582] batman_adv: bat0: Interface deactivated: mesh-vpn
[  376.825424] batman_adv: bat0: Removing interface: mesh-vpn
Cannot find device "bat0"
autoupdater: warning: execution of /usr/lib/autoupdater/upgrade.d/10stop-network exited with status code 1
[  379.953731] sysctl (6840): drop_caches: 3
vm.drop_caches = 3
Sat May 13 12:33:01 CEST 2023 upgrade: Saving config files...
Sat May 13 12:33:02 CEST 2023 upgrade: Commencing upgrade. Closing all shell sessions.
Watchdog handover: fd=3
- watchdog -
Watchdog did not previously reset the system
Sat May 13 12:33:04 CEST 2023 upgrade: Sending TERM to remaining processes ...
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to udhcpc (2029)
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to odhcp6c (2034)
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to dnsmasq (2883)
Sat May 13 12:33:08 CEST 2023 upgrade: Sending KILL to remaining processes ...
[  400.548946] stage2 (7004): drop_caches: 3
Sat May 13 12:33:15 CEST 2023 upgrade: Switching to ramdisk...
mount: mounting /dev/mtdblock4 on /overlay failed: Resource busy
[  404.592005] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds.  Have a nice day...
Sat May 13 10:33:18 UTC 2023 upgrade: Performing system upgrade...
[  404.686640] do_stage2 (7004): drop_caches: 3
Unlocking firmware ...

Writing from <stdin> to firmware ...
Appending jffs2 data from /tmp/sysupgrade.tgz to firmware..                                                                                                  
Sat May 13 10:33:49 UTC 2023 upgrade: Upgrade completed
Sat May 13 10:33:50 UTC 2023 upgrade: Rebooting system...
umount: can't unmount /dev: Resource busy
umount: can't unmount /tmp: Resource [  436.218456] reboot: Restarting system
▒

I am not sure if this is related to #185, but we were not able to reproduce it (yet) with a reboot.

What is the expected behaviour?
That the WDR4300 comes back up after an update.

Gluon Version:
v2022.1.2 and v2022.1.3
Probably also earlier v2022.x

We experienced similar behaviour during the initial v2022.1 deployment, but dismissed it as "random".
It was more severe with the v2022.1.3 deployment (probably just by chance), and I was able to reproduce it with a serial cable attached when upgrading from v2022.1.3 to v2022.1.4.

Site Configuration:
https://github.com/freifunkMUC/site-ffm/blob/833829e68f97e4781f175bdd688d7f498a7efe53/site.conf

Custom patches:
https://github.com/freifunkMUC/site-ffm/tree/833829e68f97e4781f175bdd688d7f498a7efe53/patches

@rotanid
Member

rotanid commented May 14, 2023

Does this also happen with the very similar WDR3600?

@grische
Contributor Author

grische commented Jul 7, 2023

> does this also happen with the very similar WDR3600 ?

Probably. We had a few isolated cases where a WDR3600 needed a power cycle after an upgrade but it is not clear if this is at all related to the problem described here. We don't have enough (failing) devices to have a confident answer.

@grische
Contributor Author

grische commented Jul 7, 2023

It might be worth mentioning that the special symbol at the end of the log is printed during a normal boot as well. I'm not sure whether it is printed before or after the bootloader has loaded, though.

EDIT:
I was able to reproduce the hang without the special symbol appearing. It looks as if the device got stuck during the reboot:

[  226.555153] br-client: port 4(client0) entered disabled state
Watchdog handover: fd=3
- watchdog -
Watchdog did not previously reset the system
[  226.596451] device client1 left promiscuous mode
[  226.601297] br-client: port 5(client1) entered disabled state
Wed Jul 12 21:19:50 CEST 2023 upgrade: Sending TERM to remaining processes ...
Wed Jul 12 21:19:51 CEST 2023 upgrade: Sending signal TERM to sse-multiplexd (2340)
Wed Jul 12 21:19:51 CEST 2023 upgrade: Sending signal TERM to dnsmasq (2782)
Wed Jul 12 21:19:55 CEST 2023 upgrade: Sending KILL to remaining processes ...
[  237.363401] stage2 (5324): drop_caches: 3
Wed Jul 12 21:20:01 CEST 2023 upgrade: Switching to ramdisk...
mount: mounting /dev/mtdblock4 on /overlay failed: Resource busy
[  241.391167] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds.  Have a nice day...
Wed Jul 12 19:20:05 UTC 2023 upgrade: Performing system upgrade...
[  241.489244] do_stage2 (5324): drop_caches: 3
Unlocking firmware ...

Writing from <stdin> to firmware ...
Wed Jul 12 19:20:23 UTC 2023 upgrade: Upgrade completed
Wed Jul 12 19:20:24 UTC 2023 upgrade: Rebooting system...
umount: can't unmount /dev: Resource busy
umount: can't unmount /tmp: Resource [  261.048575] reboot: Restarting system

@Grotax

Grotax commented Aug 10, 2023

We also had reports in our community when I rolled out 2022.1 but thought it was random, and we didn't have proper logs or anything else. #2655

@smoe
Contributor

smoe commented Aug 13, 2023

We observed this when transitioning from 2022.1.2 to 2022.1.4 on the WDR4300 and, more frequently, on the Ubiquiti AC Lite. In our observation, the update went fine when the machine had been rebooted just prior to the update, which may suggest an out-of-memory issue.

@grische
Contributor Author

grische commented Nov 21, 2023

@smoe Just to clarify, we were able to reproduce the issue on a freshly booted device as well.
I assume the WDRs and the AC Lite are different issues here.

@blocktrron
Member

@grische

One thing that comes to mind is the use of the newer ar934x SPI controller driver. At least, no device reported in this issue uses the older ar71xx driver.

This driver was first shipped with OpenWrt 21.02, which matches the observation that releases based on OpenWrt 19.07 and older do not break.

openwrt/openwrt@ebf0d8d

If you are still able to reproduce this issue, you can modify the ar934x DTSI to use the compatible string for the ar71xx SPI controller. Ping me if I should provide you with a patch. If this fixes the reboot issue, we have a better idea of where to look next.

@grische
Contributor Author

grische commented Dec 12, 2023

@blocktrron thank you for looking into this. To avoid misunderstandings, are you suggesting making this change in OpenWrt?

diff --git a/target/linux/ath79/dts/ar934x.dtsi b/target/linux/ath79/dts/ar934x.dtsi
index d88c7bfabc..15201b197e 100644
--- a/target/linux/ath79/dts/ar934x.dtsi
+++ b/target/linux/ath79/dts/ar934x.dtsi
@@ -199,15 +199,17 @@
                };

                spi: spi@1f000000 {
-                       compatible = "qca,ar934x-spi";
-                       reg = <0x1f000000 0x1c>;
+                       compatible = "qca,ar7240-spi",
+                                       "qca,ar7100-spi";
+                       reg = <0x1f000000 0x10>;

                        clocks = <&pll ATH79_CLK_AHB>;
+                       clock-names = "ahb";
+
+                       status = "disabled";

                        #address-cells = <1>;
                        #size-cells = <0>;
-
-                       status = "disabled";
                };
        };

@blocktrron
Member

@grische Almost. Just revert this commit in the file:

openwrt/openwrt@ebf0d8d#diff-45ad725f9ec8cc2da88738047b1d5c4d1e69df19194bd22394d3736e03093613

@grische
Contributor Author

grische commented Jan 7, 2024

@blocktrron I was able to reproduce a hang after reboot even with the above commit reverted using Gluon v2023.1:
https://gist.github.com/grische/27e4e780530f9a0795d96afaf749a4ed

Here is the respective branch: https://github.com/grische/site-ffm/commits/test/revert-ath79-add-new-ar934x-spi-driver/

@blocktrron
Member

@grische Are these hangs only reproducible after writing an upgrade image, or does a regular reboot invocation also trigger a spurious hang?

@grische
Contributor Author

grische commented Jan 7, 2024

I have a test WDR4300 device where I can reproduce the hangs during a reboot every other time. Surprisingly often actually.
This device has a manually installed serial port and a serial cable attached to the port.

@grische
Contributor Author

grische commented Jan 8, 2024

On the exact same setup, I tested it with

  • gluon-v2021.1.2: no hang was observed in ~150 reboots.
  • gluon-v2022.1.4: hangs were observed after as few as 2 and as many as 5 reboots.
  • gluon-v2023.2: the first hang was observed after 3 reboots.
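
To put the ~150 clean reboots on gluon-v2021.1.2 into perspective, here is a rough sketch that computes how likely a run of clean reboots would be if every reboot hung independently with a fixed probability. Both the independence and the per-reboot probability are assumptions made for illustration; the thread only reports a handful of observations, and `prob_no_hang` is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Probability of n consecutive clean reboots, assuming (hypothetically)
 * that each reboot hangs independently with probability p_hang.
 */
static double prob_no_hang(double p_hang, unsigned int n)
{
    double p = 1.0;

    assert(p_hang >= 0.0 && p_hang <= 1.0);

    /* Multiply the per-reboot survival probability n times. */
    for (unsigned int k = 0; k < n; k++)
        p *= 1.0 - p_hang;

    return p;
}
```

If the per-reboot hang probability were around 1/3, as the newer releases loosely suggest, `prob_no_hang(1.0 / 3.0, 150)` comes out vanishingly small, so a clean run of that length would be very unlikely to be luck under this simple model.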

@grische changed the title from "TP-Link WDR4300 hangs after autoupgrade" to "[v2022.1+] TP-Link WDR4300 hangs during reboot" on Jan 8, 2024
grische pushed a commit to grische/openwrt that referenced this issue Jan 10, 2024
Add a cache-barrier after the reset-register write. This fixes spurious
reboot issues on TP-Link WDR3600 and WDR4300 devices with Zentel DDR2
DRAM chips.

This issue was fixed in the past, but switching to the reset-driver
specific implementation removed the cache barrier which was previously
implicitly added by reading back the register in question.

Link: freifunk-gluon/gluon#2904
Link: openwrt#13043
Link: https://dev.archive.openwrt.org/ticket/17839

Signed-off-by: David Bauer <[email protected]>
blocktrron added a commit to blocktrron/openwrt that referenced this issue Jan 10, 2024
Read back the reset register in order to flush the cache. This fixes
spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel
DRAM chips.

This issue was fixed in the past, but switching to the reset-driver
specific implementation removed the cache barrier which was previously
implicitly added by reading back the register in question.

Link: freifunk-gluon/gluon#2904
Link: openwrt#13043
Link: https://dev.archive.openwrt.org/ticket/17839
Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart")

Signed-off-by: David Bauer <[email protected]>
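
The read-back described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel driver: `struct fake_bus`, `fake_writel`, `fake_readl`, `fake_reset_assert` and the bit position in `FAKE_FULL_CHIP_RESET` are all hypothetical, and the model simply assumes that a write can sit in a buffer until a read of the same register forces it out. Whether that is what actually happens on these devices is not established in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a bus that may hold one write in a buffer. */
struct fake_bus {
    uint32_t device_reg;   /* value the device has actually received */
    uint32_t pending_val;  /* write still sitting in the buffer */
    bool     has_pending;  /* true while a write has not reached the device */
};

/* Bit position chosen for illustration only. */
#define FAKE_FULL_CHIP_RESET (1u << 24)

/* A write only reaches the buffer; the device does not see it yet. */
static void fake_writel(struct fake_bus *bus, uint32_t val)
{
    bus->pending_val = val;
    bus->has_pending = true;
}

/* Reading the register forces any buffered write out to the device first. */
static uint32_t fake_readl(struct fake_bus *bus)
{
    if (bus->has_pending) {
        bus->device_reg = bus->pending_val;
        bus->has_pending = false;
    }
    return bus->device_reg;
}

/* Set the reset bit, then read the register back so the write is flushed. */
static uint32_t fake_reset_assert(struct fake_bus *bus)
{
    assert(bus != NULL);
    fake_writel(bus, bus->device_reg | FAKE_FULL_CHIP_RESET);
    return fake_readl(bus);
}
```

In this model, calling `fake_writel` alone leaves `device_reg` unchanged, while `fake_reset_assert` returns a value with the reset bit set, which is the ordering guarantee the read-back is meant to provide.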
grische pushed a commit to grische/openwrt that referenced this issue Jan 10, 2024
openwrt-bot pushed a commit to openwrt/openwrt that referenced this issue Jan 11, 2024
openwrt-bot pushed a commit to openwrt/openwrt that referenced this issue Jan 11, 2024
(cherry picked from commit 2fe8ecd)
grische pushed a commit to grische/openwrt that referenced this issue Jan 11, 2024
@grische
Contributor Author

grische commented Jan 11, 2024

The bug was fixed upstream in

openwrt-bot pushed a commit to openwrt/openwrt that referenced this issue Jan 11, 2024
Read back the reset register in order to flush the cache. This fixes
spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel
DRAM chips.

This issue was fixed in the past, but switching to the reset-driver
specific implementation removed the cache barrier which was previously
implicitly added by reading back the register in question.

Link: freifunk-gluon/gluon#2904
Link: #13043
Link: https://dev.archive.openwrt.org/ticket/17839
Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart")

Signed-off-by: David Bauer <[email protected]>
(cherry picked from commit 2fe8ecd)
@nrbffs
Contributor

nrbffs commented Dec 15, 2024

We have lost several WDR3600 on a recent upgrade to 2023.2.4.

I attached a WDR3600 to a serial console and used this script to reboot it in a loop: https://gitlab.freifunk-stuttgart.de/-/snippets/8

I was able to observe failing reboots after 5, 20 and 250 tries.

With a patch like this, I have >1500 successful reboots now:

--- a/drivers/reset/reset-ath79.c
+++ b/drivers/reset/reset-ath79.c
@@ -79,8 +79,12 @@ static int ath79_reset_restart_handler(s
 {
 	struct ath79_reset *ath79_reset =
 		container_of(nb, struct ath79_reset, restart_nb);
+	unsigned int i = 0;
 
-	ath79_reset_assert(&ath79_reset->rcdev, FULL_CHIP_RESET);
+	while (1) {
+		ath79_reset_assert(&ath79_reset->rcdev, FULL_CHIP_RESET);
+		printk("reset: still alive after %u tries\n", i);
+	}
 
 	return NOTIFY_DONE;
 }

The printk can never be seen, but I suppose that's because there is never a chance to flush out the buffer to the console.

It's not clear to me why this works, but neither is the solution of reading back the register (ioremap should already disable cache).
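
The retry idea in the patch above can be modelled in userspace as follows. This is a sketch under assumptions, not the proposed kernel change: `struct fake_chip`, `fake_chip_reset_assert` and `reset_with_retry` are hypothetical names, the chip model simply ignores a configurable number of reset requests, and the loop is bounded by `max_tries` only so that the model terminates (the patch loops forever). Unlike the patch as posted, the counter `i` is incremented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chip model: ignores the first `ignored` reset requests. */
struct fake_chip {
    unsigned int ignored;   /* number of requests that have no effect */
    unsigned int requests;  /* requests received so far */
    bool         in_reset;  /* true once a request has taken effect */
};

/* One reset request; it only takes effect after `ignored` earlier ones. */
static void fake_chip_reset_assert(struct fake_chip *chip)
{
    chip->requests++;
    if (chip->requests > chip->ignored)
        chip->in_reset = true;
}

/*
 * Keep asserting the reset until it takes effect or max_tries is used up.
 * Returns the number of attempts made.
 */
static unsigned int reset_with_retry(struct fake_chip *chip,
                                     unsigned int max_tries)
{
    unsigned int i = 0;

    assert(chip != NULL);

    while (!chip->in_reset && i < max_tries) {
        fake_chip_reset_assert(chip);
        i++; /* count every attempt */
    }

    return i;
}
```

With a chip model that ignores two requests, `reset_with_retry` returns 3; with one that never ignores a request it returns 1, which corresponds to the single `ath79_reset_assert` call in the unpatched handler.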

@Djfe
Contributor

Djfe commented Dec 15, 2024

I'm pretty certain we have seen this on a few WDR4300 in Aachen, too.
It's kind of sad that the fix only takes effect after one more reboot, so we will temporarily lose a couple more devices first (it only helps for future updates).

thanks for reporting a fix ❤️

@Djfe
Contributor

Djfe commented Dec 15, 2024

oh wait, I'm wrong. This code is run on boot so it fixes it immediately, right? (not before rebooting)

@blocktrron
Member

@nrbffs can you provide the exact gluon commit of the upgrade failures start version (before upgrading) and commit hashes of the automatic reboots you have seen failing?

@blocktrron blocktrron reopened this Dec 16, 2024
@smoe
Contributor

smoe commented Dec 16, 2024

> oh wait, I'm wrong. This code is run on boot so it fixes it immediately, right? (not before rebooting)

At least the "i" should be incremented :-)

@nrbffs nrbffs linked a pull request Dec 17, 2024 that will close this issue
@nrbffs
Contributor

nrbffs commented Dec 17, 2024

> can you provide the exact gluon commit of the upgrade failures start version (before upgrading)

ec72498 (v2023.2.3)
plus these two patches by our community:

> and commit hashes of the automatic reboots you have seen failing?

I have seen the failures also on the commit above as well as v2023.2.4

My proposed fix is in #3397 (tested on main, 1487 successful reboots)

@rotanid added the "0. type: bug" and "9. meta: upstream issue" labels on Dec 18, 2024
7 participants