The client can get stuck in a high-frequency retry loop despite working brokers #149
Unfortunately, Dr. Jung (:tada:), I have been unable to capture a log in this situation. But given the current state of the client, I would not be surprised if there is some case in which the state machine goes berserk.
Thanks. :) (However you found out about that.^^) Yeah, I have fairly low confidence in that state machine code... but without a log I also don't really know where to start debugging this.
We have been facing this issue for some months now, but we weren't able to get access to one of the affected routers until now. When this happens, you see the Tunneldigger broker instantly running the "down-hook" after creating the tunnel.
But more interesting is this `ps w` output.
It looks like it gets started twice, and this lasts until the next reboot (eventually). Kind regards
Oh, yeah, that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger in your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?) Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. This would have to be a bug in `/etc/init.d/tunneldigger`, I think? That just uses `/sbin/start-stop-daemon` though, so it'd be a bug in there.
Indeed, it's the main process and one thread for each broker in the broker list.
Well, I stopped building firmware for our community long ago and someone else is doing it now. The old build included a `killall` which I assume covered cases like this; the newer one only touches matching PIDs, as it seems.
You mean this? I don't know how it comes about that routers end up with two instances of tunneldigger.
That's not what I see. We have 4 brokers and I see 3 processes total.
OK, then this assumption was wrong. This is how it looks on one of the gateways. I think I will modify that tunneldigger watchdog in gluon and add a `killall` to see if that fixes this mess.
Hey @RalfJung, I just wanted to let you know that we have overcome this issue by patching. Kind regards
@valcryst have you upstreamed, or do you plan to upstream, this change into gluon itself? In Saarland we have patched our own secondary tunneldigger watchdog to count the number of tunneldigger processes and issue a reboot if there are too many of them. That also seems to help.
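For illustration, the process-count check described above could be sketched roughly like this. This is a hypothetical sketch, not the actual gluon-ffsaar patch; the threshold of 3 and the helper name are assumptions based on the "one main process plus one thread per broker" observation earlier in this thread.

```shell
#!/bin/sh
# Hypothetical watchdog helper (assumed names, not the real patch):
# flag an abnormal number of tunneldigger processes.
MAX_PROCS=3  # assumption: normal count observed on a healthy device

too_many_tunneldiggers() {
    # $1: observed process count, e.g. from: pgrep -c tunneldigger
    [ "$1" -gt "$MAX_PROCS" ]
}

# A watchdog cron job might then do something like:
#   if too_many_tunneldiggers "$(pgrep -c tunneldigger)"; then
#       logger -t td-watchdog "too many tunneldigger processes, rebooting"
#       reboot
#   fi
```

The reboot is deliberately left in a comment here; the actual action (reboot vs. `killall` plus service restart) is the point under discussion in this thread.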
@RalfJung I've sent my feedback, but I'm no coder; maybe the experts on this will figure it out.
Sometimes a client seems to be stuck in a high-frequency retry loop, establishing a new connection every 2-5s and immediately abandoning it. #143 is resolved, which means the servers no longer come under heavy load from such a loop, but the question remains what is going on with those clients.
Unfortunately, so far I have not been able to acquire a log from one of the affected nodes. @kaechele you said you also experienced this problem; were/are you able to get a logfile from the problematic node?