
The client can get stuck in a high-frequency retry loop despite working brokers #149

Open
RalfJung opened this issue Aug 10, 2020 · 10 comments


@RalfJung
Member

Sometimes a client seems to be stuck in a high-frequency retry loop, establishing a new connection every 2-5s and immediately abandoning it. #143 is resolved, which means the servers do not have heavy load from such a loop any more, but the question remains what is going on with those clients.

Unfortunately, so far I was not able to acquire a log from one of the affected nodes. @kaechele you said you also experienced this problem; were/are you able to get a logfile from the problematic node?

@RalfJung RalfJung changed the title The client can get stuck in a high-frequency retry loop The client can get stuck in a high-frequency retry loop despite working brokers Aug 10, 2020
@kaechele
Contributor

kaechele commented Sep 6, 2020

Unfortunately, Dr. Jung (:tada:), I have been unable to capture a log in this situation. But I would not be surprised given the current state of the client that there is some case in which the state machine goes berserk.

@RalfJung
Member Author

RalfJung commented Sep 6, 2020

Thanks. :) (However you found out about that.^^)

Yeah, I have fairly low confidence in that state machine code... but without a log I also don't really know where to start debugging this.

@valcryst

valcryst commented Jun 11, 2022

We have been facing this issue for some months now, but we weren't able to get access to one of the affected routers until now.
We observed that this symptom always starts after the "daily" reboot and affects around 70-80 routers out of 1000.

[screenshot omitted]

When this happens, you see the Tunneldigger broker instantly running the "down-hook" after creating the tunnel.
This is what we have seen from the Tunneldigger client on the router.

Sat Jun 11 13:09:25 2022 daemon.debug td-client: [broker2.ff-en.de:10122] Broker usage: 4587
Sat Jun 11 13:09:25 2022 daemon.info td-client: Selected broker1.ff-en.de:10122 as the best broker.
Sat Jun 11 13:09:25 2022 daemon.info td-client: [broker1.ff-en.de:10122] Tunnel successfully established.
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is enabled
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Network device 'mesh-vpn' link is up
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' has link connectivity
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is setting up now
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is now up
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.162513] batman_adv: bat0: Adding interface: mesh-vpn
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.168110] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1420) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.192795] batman_adv: bat0: Interface activated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Unable to create local L2TP tunnel!
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Connection lost.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.348378] batman_adv: bat0: Interface deactivated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.notice netifd: Network device 'mesh-vpn' link is down

But this `ps wwwww` output is more interesting.

root@FF-WTT-Citymanagement-Wetter:~# ps wwwww
  PID USER       VSZ STAT COMMAND
    1 root      1568 S    /sbin/procd
    2 root         0 SW   [kthreadd]
    4 root         0 IW<  [kworker/0:0H]
    6 root         0 IW<  [mm_percpu_wq]
    7 root         0 SW   [ksoftirqd/0]
   43 root         0 SW   [oom_reaper]
   90 root         0 IW<  [writeback]
   92 root         0 SW   [kcompactd0]
   93 root         0 IW<  [crypto]
   95 root         0 IW<  [kblockd]
  116 root         0 IW<  [watchdogd]
  129 root         0 IW   [kworker/0:1]
  138 root         0 SW   [kswapd0]
  197 root         0 SW   [spi0]
  292 root         0 IW<  [ipv6_addrconf]
  300 root         0 IW<  [kworker/0:1H]
  379 root         0 SWN  [jffs2_gcd_mtd4]
  446 root      1216 S    /sbin/ubusd
  447 root       924 S    /sbin/askfirst /usr/libexec/login.sh
  464 root      1028 S    /sbin/urngd
  635 root         0 IW<  [cfg80211]
  659 root         0 IW<  [bat_events]
  769 root      1296 S    /sbin/logd -S 64
  821 dnsmasq   1348 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
  836 root         0 IW   [kworker/0:2]
  855 root      1080 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1022 root       932 S    /usr/sbin/gluon-arp-limiter
 1069 root      1764 S    /sbin/netifd
 1148 root       932 S    /usr/sbin/uradvd -i local-node -p 2a11:6c6:6000:feed::/64
 1388 root      1212 S    udhcpc -p /var/run/udhcpc-br-wan.pid -s /lib/netifd/dhcp.script -f -t 0 -i br-wan -x hostname:FF-WTT-Citymanagement-Wetter -C -O 121
 1391 root      1052 S    odhcp6c -s /lib/netifd/dhcpv6.script -t120 br-wan
 1396 root      1052 S    odhcp6c -s /lib/netifd/dhcpv6.script -L -m30 -t120 br-client
 1531 root      1124 S    /usr/sbin/gluon-radv-filterd -i br-client -c RADV_FILTER -t 20
 1587 root       932 S    /usr/sbin/micrond /usr/lib/micron.d
 1664 root       928 S    /usr/sbin/sse-multiplexd
 1760 root      1364 S    /usr/sbin/uhttpd -f -h /lib/gluon/status-page/www -r FF-WTT-Citymanagement-Wetter -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 32 -N 100 -R -p 0.0.0.0:80 -p [::]:80
 1930 root      1352 S    /usr/sbin/dnsmasq -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv.conf
 1941 root      1068 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 1950 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 1951 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 2049 root      1584 S    /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
 2073 root      1216 S<   /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 2a11:6c6:6000:feed:ff::1 -p 2a11:6c6:6000:feed:ff::2 -p 91.212.181.1
 4095 root         0 IW   [kworker/u2:0]
 4204 root      2324 S    /usr/bin/respondd -d /usr/lib/respondd -p 1001 -g ff02::2:1001 -i mesh0 -g ff05::2:1001 -i br-client -t 10
 4519 root      1148 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 4563 root      1220 S    -ash
 4979 root      1068 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 4980 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 4981 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 5038 root      1212 R    ps wwwww
28583 root         0 IW   [kworker/u2:3]
32214 root         0 IW   [kworker/u2:2]
root@FF-WTT-Citymanagement-Wetter:~#

It looks like it gets started twice, and this lasts until the next reboot (eventually).
I suspect that some change in how the Tunneldigger watchdog kills the process introduced this?

Kind regards,
Matthias
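For reference, the duplicate-instance condition described above can be checked directly from `ps` output. A hedged shell sketch (the helper name is made up for illustration; the bracketed first character keeps the grep pattern from matching its own command line when used in a live pipeline):

```shell
#!/bin/sh
# Count tunneldigger entries in `ps` output read from stdin.
count_td() {
    grep -c '[/]usr/bin/tunneldigger'
}
```

Running `ps w | count_td` on the output pasted above would report 6, twice the 3 entries seen on a healthy node later in this thread.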

@RalfJung
Member Author

RalfJung commented Jul 9, 2022

> It looks like it gets started twice

Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)

Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it.
(In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)

This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses `/sbin/start-stop-daemon` though, so it'd be a bug in there.
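The work-around mentioned above (reboot when more than the expected number of instances is seen) could be sketched roughly like this. The threshold of 3 matches the count on a healthy device in this thread; function names and the log tag are made up for illustration:

```shell
#!/bin/sh
# Hedged sketch of the reboot work-around; not the actual ffsaar watchdog.
MAX_INSTANCES=3

too_many_instances() {
    # $1 = observed process count, $2 = allowed maximum
    [ "$1" -gt "$2" ]
}

check_and_reboot() {
    count=$(ps w | grep -c '[/]usr/bin/tunneldigger')
    if too_many_instances "$count" "$MAX_INSTANCES"; then
        logger -t td-watchdog "found $count tunneldigger processes, rebooting"
        reboot
    fi
}
```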

@valcryst

> Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)

Indeed, it's the main process plus one thread for each broker in the broker list.

> Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. (In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)

Well, I stopped building firmware for our community long ago, and someone else is doing it now.
I am referring to the old watchdog that was inside tunneldigger before it was accepted into the Gluon repo.

Old Watchdog
New Watchdog

It included a `killall`, which I assume covered cases like this; the newer one only touches matching PIDs, as it seems.

> This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses `/sbin/start-stop-daemon` though, so it'd be a bug in there.

You mean this?
https://github.com/freifunk-gluon/packages/blob/master/net/tunneldigger/files/tunneldigger.init#L79

I don't know how routers end up with two instances of tunneldigger,
but it could be a starting point to investigate why it happens randomly after the daily reboots.
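One hypothetical failure mode (pure speculation, not confirmed anywhere in this thread): if the pidfile left by a previous run is stale, `start-stop-daemon --stop` finds no matching process to kill, and the subsequent `--start` spawns a second instance next to the still-running old one. A minimal check for that condition might look like this; the helper and its arguments are assumptions for illustration, not the real init-script logic:

```shell
#!/bin/sh
# Hypothetical helper: decide whether a recorded pid is stale, i.e.
# no longer belongs to a live tunneldigger process.
pidfile_is_stale() {
    # $1 = pid read from the pidfile (documentation only here)
    # $2 = command name currently running under that pid, "" if none
    [ "$2" != "tunneldigger" ]
}
```

On a real device, the second argument could be read from `/proc/<pid>/comm`.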

@RalfJung
Member Author

> Indeed, it's the main process plus one thread for each broker in the broker list.

That's not what I see. We have 4 brokers and I see 3 processes total.

# ps | grep tunneldigger
 1972 root      1068 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
 1973 nobody    1040 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
 1974 nobody    1040 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a

@valcryst

OK, then this assumption was wrong.
Today I have more than 100 routers affected by this problem, and the only way to keep
the gateways running while they get hammered is to block them with a rate limiter.

This is how it looks on one of the gateways.

[screenshot omitted]


I think I will modify the tunneldigger watchdog in Gluon and add a `killall` to see if that fixes this mess.

@valcryst

valcryst commented Dec 14, 2022

Hey @RalfJung, I just wanted to let you know that we have overcome this issue by patching
the tunneldigger watchdog that ships with Gluon. We couldn't investigate deeper and needed
a quick solution, because lots of our refugee installations were hit by this problem.
It seems that `killall -KILL tunneldigger` was our saviour here.

Kind regards

#!/usr/bin/lua

local uci = require('simple-uci').cursor()

local function restart_tunneldigger()
	os.execute('logger -t tunneldigger-watchdog "Restarting Tunneldigger."')
	os.execute('/etc/init.d/tunneldigger stop')
	os.execute('sleep 1')
	os.execute('killall -KILL tunneldigger')
	os.execute('rm -f /var/run/tunneldigger.mesh-vpn.pid')
	os.execute('sleep 5')
	os.execute('/etc/init.d/tunneldigger start')
end

local function read_pid_file()
	local pid_file = io.open('/var/run/tunneldigger.mesh-vpn.pid', 'r')
	if not pid_file then
		return nil
	end
	local pid = pid_file:read('*l')
	pid_file:close()
	return pid
end

local function has_mesh_vpn_neighbours()
	local handle = io.popen('batctl o', 'r')
	if not handle then
		return false
	end
	for line in handle:lines() do
		if line:find('mesh%-vpn') then
			handle:close()
			return true
		end
	end
	handle:close()
	return false
end

if uci:get_bool('tunneldigger', 'mesh_vpn', 'enabled') then
	if io.popen('pgrep -x /usr/bin/tunneldigger'):read('*l') ~= read_pid_file() then
		os.execute('logger -t tunneldigger-watchdog "Process-Pid does not match with pid-File."')
		restart_tunneldigger()
		return
	end
	if not has_mesh_vpn_neighbours() then
		os.execute('logger -t tunneldigger-watchdog "No vpn-mesh neighbours found."')
		restart_tunneldigger()
		return
	end
end
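A watchdog script like this only helps if it runs periodically. A hedged sketch of a wrapper that micrond (visible in the `ps` output earlier in the thread) could invoke once per minute; the install path is an assumption for illustration and is not taken from the thread:

```shell
#!/bin/sh
# Hypothetical periodic wrapper around the Lua watchdog above.
WATCHDOG=/usr/lib/gluon/tunneldigger-watchdog.lua

run_watchdog() {
    # Skip (with non-zero status) if the script is not installed.
    [ -r "$WATCHDOG" ] && /usr/bin/lua "$WATCHDOG"
}
```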

@RalfJung
Member Author

@valcryst have you upstreamed this change into Gluon itself, or do you plan to?

In Saarland we have patched our own secondary tunneldigger watchdog to count the number of tunneldigger processes and issue a reboot if there are too many of them. That also seems to help.

@valcryst

@RalfJung I've sent my feedback, but I'm no coder; maybe the experts on this will find
the real cause that triggers this unwanted behaviour of multiple running tunneldiggers.
