[v2022.1+] TP-Link WDR4300 hangs during reboot #2904
Comments
Does this also happen with the very similar WDR3600? |
Probably. We had a few isolated cases where a WDR3600 needed a power cycle after an upgrade but it is not clear if this is at all related to the problem described here. We don't have enough (failing) devices to have a confident answer. |
It might be worth mentioning that the special symbol at the end of the log is printed during a boot as well. I'm not sure if this is printed before or after the bootloader loaded though. EDIT:
|
We also had reports in our community when I rolled out 2022.1 but thought it was random, and we didn't have proper logs or anything else. #2655 |
We observed this when transitioning from 2022.1.2 to 2022.1.4 on WDR4300 and, more frequently, on Ubiquiti AC lite. In our observation, the update went fine when the machine had been rebooted just before the update, which may suggest an out-of-memory issue. |
@smoe Just to clarify, we were able to reproduce the issue on a freshly booted device as well. |
One thing that comes to mind is the use of the newer ar934x SPI controller driver; at least, no device reported in this issue uses the older ar71xx driver. This driver was first shipped with OpenWrt 21.02, which matches the observation that the hang does not occur with releases based on OpenWrt 19.07 and older. If you are still able to reproduce this issue, you can modify the ar934x DTSI to use the compatible for the ar71xx SPI controller. Ping me if I should provide you with a patch. If this fixes the reboot issue, we have a better idea of where to look next. |
@blocktrron thank you for looking into this. To avoid misunderstandings: are you suggesting making this change here in OpenWrt?
diff --git a/target/linux/ath79/dts/ar934x.dtsi b/target/linux/ath79/dts/ar934x.dtsi
index d88c7bfabc..15201b197e 100644
--- a/target/linux/ath79/dts/ar934x.dtsi
+++ b/target/linux/ath79/dts/ar934x.dtsi
@@ -199,15 +199,17 @@
};
spi: spi@1f000000 {
- compatible = "qca,ar934x-spi";
- reg = <0x1f000000 0x1c>;
+ compatible = "qca,ar7240-spi",
+ "qca,ar7100-spi";
+ reg = <0x1f000000 0x10>;
clocks = <&pll ATH79_CLK_AHB>;
+ clock-names = "ahb";
+
+ status = "disabled";
#address-cells = <1>;
#size-cells = <0>;
-
- status = "disabled";
};
}; |
@grische Almost. Just revert this commit in the file: openwrt/openwrt@ebf0d8d#diff-45ad725f9ec8cc2da88738047b1d5c4d1e69df19194bd22394d3736e03093613 |
@blocktrron I was able to reproduce a hang after reboot even with the above commit reverted using Gluon v2023.1: Here is the respective branch: https://github.com/grische/site-ffm/commits/test/revert-ath79-add-new-ar934x-spi-driver/ |
@grische Are these hangs only reproducible after writing an upgrade image, or does a regular reboot invocation also trigger a spurious hang? |
I have a test WDR4300 device where I can reproduce the hangs during a reboot every other time. Surprisingly often actually. |
On the exact same setup, I tested it with
|
Add a cache-barrier after the reset-register write. This fixes spurious reboot issues on TP-Link WDR3600 and WDR4300 devices with Zentel DDR2 DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Signed-off-by: David Bauer <[email protected]>
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <[email protected]>
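One way to picture the "read back" in the commit message above is a toy model in which a register write can sit in a buffer until a later read of the device forces it out. The sketch below is only that toy model, written in Python for illustration. It is not the kernel patch, and the thread does not establish that this is the actual mechanism on these boards. `PostedWriteBus`, `request_reset`, `RESET_REG` and `RESET_BIT` are hypothetical names and values.

```python
class PostedWriteBus:
    """Toy bus: writes are queued and only reach the device when drained."""

    def __init__(self):
        self.device = {}    # register values the "hardware" has actually seen
        self.pending = []   # writes that were issued but not yet delivered

    def write(self, addr, value):
        # A posted write returns immediately; the device has not seen it yet.
        self.pending.append((addr, value))

    def read(self, addr):
        # A read has to return the device's value, so every earlier write
        # is delivered first. This is the effect the read-back relies on.
        for pending_addr, pending_value in self.pending:
            self.device[pending_addr] = pending_value
        self.pending.clear()
        return self.device.get(addr, 0)


RESET_REG = 0x1C       # hypothetical register offset, for illustration only
RESET_BIT = 1 << 24    # hypothetical reset bit, for illustration only


def request_reset(bus, read_back):
    """Write the reset bit; optionally read the register back afterwards.

    Returns True if the toy device has actually received the reset request.
    """
    bus.write(RESET_REG, RESET_BIT)
    if read_back:
        bus.read(RESET_REG)
    return bus.device.get(RESET_REG, 0) == RESET_BIT
```

In this model, `request_reset(PostedWriteBus(), read_back=False)` leaves the write pending, while `read_back=True` delivers it before the function returns.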
The bug was fixed upstream in
|
Read back the reset register in order to flush the cache. This fixes spurious reboot hangs on TP-Link TL-WDR3600 and TL-WDR4300 with Zentel DRAM chips. This issue was fixed in the past, but switching to the reset-driver specific implementation removed the cache barrier which was previously implicitly added by reading back the register in question. Link: freifunk-gluon/gluon#2904 Link: openwrt#13043 Link: https://dev.archive.openwrt.org/ticket/17839 Link: f8a7bfe1cb2c ("MIPS: ath79: fix system restart") Signed-off-by: David Bauer <[email protected]> (cherry picked from commit 2fe8ecd880396b5ae25fe9583aaa1d71be0b8468)
We have lost several WDR3600 on a recent upgrade to 2023.2.4. I attached a WDR3600 to a serial console and used this script to reboot it in a loop: https://gitlab.freifunk-stuttgart.de/-/snippets/8 I was able to observe failing reboots after 5, 20 and 250 tries. With a patch like this, I have >1500 successful reboots now:
The printk can never be seen, but I suppose that's because there is never a chance to flush out the buffer to the console. It's not clear to me why this works, but neither is the solution of reading back the register (ioremap should already disable cache). |
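The reboot-loop test described above can be sketched roughly as follows. This is not the linked script, whose contents are not reproduced in this thread. It is a minimal Python sketch of the counting logic only, assuming some callable that reboots the device and reports whether it came back up. `count_reboots_until_hang`, `reboot_and_wait` and `make_fake_device` are hypothetical names, and the fake device stands in for a real serial-console check.

```python
def count_reboots_until_hang(reboot_and_wait, max_tries):
    """Call reboot_and_wait() up to max_tries times.

    reboot_and_wait must return True if the device came back up and
    False if it hung. Returns (successful_reboots, hung).
    """
    for attempt in range(max_tries):
        if not reboot_and_wait():
            # 'attempt' reboots succeeded before this one hung.
            return attempt, True
    return max_tries, False


def make_fake_device(hang_on_try):
    """Stand-in for a real device: hangs on the given 1-based try.

    Passing 0 yields a device that never hangs.
    """
    state = {"tries": 0}

    def reboot_and_wait():
        state["tries"] += 1
        return state["tries"] != hang_on_try

    return reboot_and_wait
```

With a real device, `reboot_and_wait` would trigger the reboot and then watch the serial console for a boot marker within a timeout. A device that hangs on its fifth try yields `(4, True)`.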
I'm pretty certain we have seen this on a few WDR4300s in Aachen, too. Thanks for reporting a fix ❤️ |
Oh wait, I'm wrong. This code is run on boot, so it fixes it immediately, right? (Not before rebooting.) |
@nrbffs Can you provide the exact Gluon commit of the version the devices were running before the failed upgrade, and the commit hashes of the versions on which you have seen automatic reboots fail? |
At least the "i" should be incremented :-) |
ec72498 (v2023.2.3)
I have also seen the failures on the commit above as well as on v2023.2.4. My proposed fix is in #3397 (tested on main, 1487 successful reboots). |
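To get a feeling for how strong the evidence from 1487 consecutive successful reboots is, one can do a rough back-of-the-envelope calculation. The sketch below assumes that the earlier observation (failures after 5, 20 and 250 tries) describes three separate runs that each ended with one hang, and that reboots are independent. Both are assumptions, not statements from the thread. The function names are hypothetical.

```python
def pooled_failure_rate(tries_until_failure):
    """Estimate the per-reboot hang rate from runs that each ended in one hang."""
    return len(tries_until_failure) / sum(tries_until_failure)


def prob_all_succeed(failure_rate, reboots):
    """Chance of seeing no hang in 'reboots' independent reboots."""
    return (1.0 - failure_rate) ** reboots


# Assumed reading of the report: three runs, hanging on try 5, 20 and 250.
rate = pooled_failure_rate([5, 20, 250])   # 3 hangs in 275 reboots

# Chance that an unfixed device with this rate survives 1487 reboots in a row.
chance = prob_all_succeed(rate, 1487)
```

Under these assumptions the estimated hang rate is about 1%, and `chance` comes out far below one in a million, so a streak of that length would be very unlikely without the fix having an effect.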
Bug report
What is the problem?
Occasionally, WDR4300 devices (>10% of all devices) hang after an autoupdate and need a manual power cycle to reboot.
I managed to reproduce this while a serial cable was attached:
I am not sure if this is related to #185, but we were not able to reproduce it (yet) with a reboot.
What is the expected behaviour?
That the WDR4300 comes back up after an update.
Gluon Version:
v2022.1.2 and v2022.1.3
Probably also earlier v2022.x
We experienced similar behaviour during the initial v2022.1 deployment, but dismissed it as "random".
It was more severe with the v2022.1.3 deployment (probably just by chance), and I was able to reproduce it with a serial cable attached when upgrading from v2022.1.3 to v2022.1.4.
Site Configuration:
https://github.com/freifunkMUC/site-ffm/blob/833829e68f97e4781f175bdd688d7f498a7efe53/site.conf
Custom patches:
https://github.com/freifunkMUC/site-ffm/tree/833829e68f97e4781f175bdd688d7f498a7efe53/patches