mediatek-filogic: weird TQ on WR3000 - wifi instability after a few minutes #3305
Comments
Can you check if the tx retries / tx failed counters from |
They are slightly increasing, but most of the time they are constant.
(Graphs: tx failed, tx retries)
batctl p (ping) towards some mesh partner often does not work either, with packet losses above 90%. |
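One way to read the per-station tx retries / tx failed counters asked about above is the station dump of iw. The helper below is a minimal sketch, assuming iw is installed on the node and that the mesh interface is named mesh1 (the name used later in this issue); tx_counters is a hypothetical name, not part of Gluon, and it assumes iw prints the "tx retries" line before the "tx failed" line for each station.

```shell
#!/bin/sh
# tx_counters: hypothetical helper that reads `iw dev <iface> station dump`
# output on stdin and prints one line per station with its tx retries and
# tx failed counters.
# Example: iw dev mesh1 station dump | tx_counters
tx_counters() {
	awk '
		# A new station block starts: remember its MAC, reset the retry value
		/^Station/ { sta = $2; retries = "?" }
		# Lines look like "<TAB>tx retries:<TAB>N", so the number is field 3
		/tx retries:/ { retries = $3 }
		# "tx failed" follows "tx retries", so emit the station line here
		/tx failed:/ { print sta, "retries=" retries, "failed=" $3 }
	'
}
```

Running this every few minutes and keeping the output would show whether the counters jump at the same time as the TQ drops, which is easier to correlate with the Grafana graphs than a single snapshot.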
I just tested the MTK patch: it looked good until I reloaded the driver at about 7:20. The logread output still does not hint at anything useful. |
A driver hiccup is probably the most likely cause. But I still wanted to ask, as it's not clear to me from these graphs alone: Has external traffic causing these losses been ruled out? Is there an airtime graph available? Does it correlate with route changes in batman-adv? (Years ago, in a test of a specific setup that still used 802.11g, I saw odd route flapping and TQ changes/breakdowns caused by unicast traffic, due to a hidden node problem: https://www.open-mesh.org/projects/batman-adv/wiki/Bcast-hidden-node. There the route would oscillate between the good two-hop route and a bad, direct one-hop route, even with RTS/CTS enabled for unicast traffic. Traffic over the two-hop route would interfere with the batman-adv OGM broadcasts, causing a breakdown in TQ and a switch to the one-hop route. Then the TQ would improve and things would switch back to two hops. Rinse and repeat. Usually the hidden node problem should be quite rare, though. Maybe it is even less likely with newer 802.11 revisions and their improvements?) |
Thanks for looking into this, @T-X. In our test setup we experienced the same problems with a WAX220. There seems to be another test installation with a WR3000 and a WAX220 which, according to its Grafana, does not have such issues. Remotely, I can only debug by checking the TQ to mesh partners (if there are any). The actual symptom is that the wifi is unusable for clients connected to the node during the periods in which the device has only 3-4% TQ to its neighbors (even though it might have 100% TQ to mesh-vpn, so it surely is the wifi driver). I don't think it is a flapping route in batman-adv, though I cannot rule this out completely. |
First of all, the issue is still present in the latest firmware with updated OpenWrt, as well as on OpenWrt master. I just noticed that a few firmware iterations ago, both the max TQ and the min TQ were fluctuating.
The latest v2023.2.x firmware does have a stable max TQ, but the min TQ still varies, which seems broken. The stable max TQ can be seen here (WAX220). The same bad mesh symptoms were found on the NWA55AXE as well.
OpenWrt master
I built firmware from OpenWrt master to test, though it sometimes did not load the wifi driver at all (on the WR3000), and it showed the behavior described above on the WAX220 as well. So the problem is still not solved in OpenWrt master as of August 2024 (commit 5d2a008670122f3f69eb3ab4f776d9fe9b6d76dd). |
General instability has been seen on MediaTek Filogic devices using the mt7915e driver, especially the WR3000, the WAX220 and others.
Note that some devices work better than others. Heavy wifi mesh usage seems to make the situation worse.
What is the problem?
An example of this behavior is this device:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1718344052951&to=1718403869219&viewPanel=13
which shows a strongly varying TQ.
The latest finding is this:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1720175532350&to=1720193698710&var-select_hostname=ffac-seilpforte-wr3000&var-hostname=ffac-seilpforte-wr3000&var-saveinterval=1m&var-nodetolink=0c0e76cf5d5e&viewPanel=13
At point 1, I restarted the wifi driver using
rmmod mt7915e && modprobe mt7915e
At point 2, I added another mesh device with which this device could mesh on mesh1. This triggered the timeout issue, and the device was no longer able to reload the firmware.
At point 3, I rebooted the device, as nothing else helped.
Afterwards, the oddly changing TQ can be seen again, moving in strange waves.
The current workaround includes reloading the mt7915e driver and rebooting the device once the mt7915e bug from #3154 occurs.
A package for this can be found here: https://github.com/ffac/gluon-packages/tree/main/ffac-mt7915-hotfix/files/lib/gluon/mt7915
As @nrbffs also noted on IRC, other people have reported instability with these devices as well. Currently, reloading the wifi driver twice a day seems to help in this situation.
This issue is not about #3154, but about the oddly changing TQ, which leads to bad mesh and wifi quality.
What is the expected behaviour?
Mesh and wifi quality should be stable on mediatek filogic devices such as the WR3000.
Further steps
ls /sys/kernel/debug/ieee80211/phy*/mt76
to look for anything useful in the mt76 debugfs entries.
TX_Stats
I found that on other devices
cat /sys/kernel/debug/ieee80211/phy1/mt76/tx_stats
only shows values for 1 to 4, while the affected WR3000 has values for 1 to 8. I do not really know whether this is related; it is just a finding.
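To compare the tx_stats output of an affected WR3000 with that of a working device, it can help to dump the file for every phy at once instead of picking phy1 by hand. This is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the mt76 directory layout matches the paths above; dump_tx_stats is a hypothetical helper name, not part of Gluon or mt76.

```shell
#!/bin/sh
# dump_tx_stats: hypothetical helper that prints the mt76 tx_stats debugfs
# file of every phy, each preceded by a header line with the phy name.
# The optional argument overrides the base directory; it defaults to the
# debugfs path used above (assumes debugfs is mounted there).
dump_tx_stats() {
	base=${1:-/sys/kernel/debug/ieee80211}
	for stats in "$base"/phy*/mt76/tx_stats; do
		# If the glob matched nothing it stays literal, so skip unreadable paths
		[ -r "$stats" ] || continue
		phy=${stats#"$base"/}   # strip the base dir -> "phyN/mt76/tx_stats"
		phy=${phy%%/*}          # keep only "phyN"
		echo "=== $phy ==="
		cat "$stats"
	done
}
```

Saving the output of dump_tx_stats on an affected node and on an unaffected one and diffing the two files makes differences such as the 1 to 4 versus 1 to 8 entries easy to spot.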
Gluon Version:
v2023.2.3
Site Configuration:
ffac @ v2023.2.3-2
Custom patches:
see site