
iot-hub-device-client unable to connect to IoT-Hub when switched from network #548

Closed
robbinvandamme opened this issue Apr 23, 2020 · 22 comments


@robbinvandamme commented Apr 23, 2020

Context

  • OS and version used: Ubuntu 18
  • Python version: 3.7.5
  • pip version: 19.0.3
  • cloned repo: azure-iot-device==2.1.1

Description of the issue

When switching networks while the application is running and connected, the connection to IoT Hub fails on the new network.
Steps to reproduce:

  • connect the iot-hub-device-client to the IoT Hub via network1
  • switch your device to network2 --> the connection drops and is not able to recover
  • switch back to network1 --> the connection succeeds

Code sample exhibiting the issue

No code sample needed; just make sure iot-hub-device-client.connect() is called while on network1.

AB#7366704

@robbinvandamme
Author

To be clear:
I'm 100% sure both networks have internet access.
I'm able to reproduce this 100% of the time, doesn't matter which network I start from.

@BertKleewein
Member

How are you switching networks? Is this done by switching cables, bringing interfaces up and down, or by some other method?

@robbinvandamme
Author

@BertKleewein Yes, switching cables.

@robbinvandamme
Author

@BertKleewein Are you able to reproduce this?

@robbinvandamme
Author

@BertKleewein
I think this is related to #606 .
DNS resolve issue?

@BertKleewein
Member

It sure feels like a DNS issue to me, but I'm not 100% confident on that. I'd be curious to see the output from ifconfig, ping YOUR_HUB.azure-devices.net, and traceroute YOUR_HUB.azure-devices.net when you're on network 1 (and the client works) and then again when you're on network 2 (and the client isn't working).

@robbinvandamme
Author

I'll rebuild that setup and let you know.

@robbinvandamme
Author

Hi,
Hope this helps.
info.txt

@robbinvandamme
Author

@BertKleewein
Did you find the logs useful and have you been able to reproduce it?

@electrozen

Hello, I have a similar issue running the 2.2.0 version of the SDK. The "connection refused" error (as in issue #606), followed by the inability to reconnect, is triggered by at least two situations:

  1. Sustained CPU overload
  2. Long sleep / wake up of the device
    Enclosed is a log for case 2. Note that the stack is still alive (see the last messages) but has stopped all attempts to reconnect.
    azure-bug-150.txt

@BertKleewein
Member

@electrozen - I'm not sure if your issue is related to this one. Are you switching between two networks? If this isn't related to switching between networks, can you open a new issue please?

For situation number 1, can you provide more information, or even better, a log? Simply overloading the CPU doesn't seem to be sufficient to cause problems, and I suspect there is something more specific happening.

In your log for situation #2, it looks like the SAS credentials that are built from your symmetric key are expiring while the system is asleep. Then, when the code tries to automatically reconnect, the connection is refused because the credentials are old. The client treats "UnauthorizedError" as a condition which requires user intervention, so it stops the automatic reconnection logic.

Unfortunately, I don't think calling the 'connect' method to force a reconnect would work either because the token renewal is based on a timer countdown and not clock time. In other words, when you connect, the client connects using credentials which are valid for about an hour of clock time (give-or-take). Then the client starts counting down, and after 50 minutes, it renews the credentials. Normally this means that the client renews the connection credentials before they expire. However, when you put the computer to sleep, the count-down pauses but the credentials still expire. If you go to sleep when the token still has 20 minutes before it needs to be renewed, the library assumes it still has 20 minutes of validity after it wakes up.
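The countdown-versus-clock race described above can be sketched in a few lines. This is a minimal illustration, not SDK code; the TTL and renewal numbers are the illustrative ones from this explanation, not the SDK's exact values:

```python
# Sketch (not SDK code) of why a countdown-based renewal timer misses
# tokens that expire during sleep. Numbers are illustrative assumptions.
TOKEN_TTL = 3600        # wall-clock seconds the SAS token stays valid
RENEWAL_AT = 3000       # countdown renews after this many *running* seconds

def token_state(wall_elapsed, timer_elapsed):
    """wall_elapsed: real seconds since connect (keeps advancing during sleep).
    timer_elapsed: seconds the renewal countdown actually ran (pauses during sleep)."""
    renewed = timer_elapsed >= RENEWAL_AT
    expired = wall_elapsed >= TOKEN_TTL
    if expired and not renewed:
        return "unauthorized"   # server rejects stale credentials; client stops retrying
    return "renewed" if renewed else "valid"

# Awake the whole time: renewal fires before expiry.
print(token_state(3000, 3000))          # renewed
# Sleep 2 h starting at t=1800: the clock passes the TTL, the countdown never fires.
print(token_state(1800 + 7200, 1800))   # unauthorized
```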

@electrozen

Thanks. Indeed, the issue does not appear specifically when changing networks. I will open another issue with logs for a CPU-overload situation. Overall, my use case involves user devices sleeping, changing networks, hitting high CPU, etc. So far I'm unable to have device connections reliably last more than a couple of days on my test fleet. Based on your explanation, I will have to set up a specific Azure client status check and process restart in my separate watcher service.

@BertKleewein
Member

@robbinvandamme, I was finally able to reproduce this and I'm afraid that I can't do anything to fix it.

Before I go deeper, I need to explain that it takes our MQTT transport library (Paho) about 2 minutes to detect a broken connection on Linux. This is because the broken connection doesn't cause the socket to fail on Linux like it does on Windows. Instead, Paho has to detect the break by sending a PINGREQ packet to the server and waiting for a PINGRESP packet to return. It only sends this PINGREQ if 60 seconds elapse with no packets from the server, and then it waits 60 seconds for the PINGRESP to return. This 60 seconds is called the MQTT "keepalive" value.

This means that you can unplug your network cable for up to 2 minutes without the "dropped connection" code being executed. After that 2 minutes, it closes the socket and tries to re-open it again.
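The two-minute window is just twice the keepalive value; a tiny sketch of that arithmetic, with a hedged note on tuning it (the keep_alive kwarg is an assumption about newer SDK releases, so check your version):

```python
# Worst-case time for Paho to notice a dead link, per the behavior above:
# up to `keepalive` seconds of silence before a PINGREQ goes out, plus
# `keepalive` more seconds waiting for a PINGRESP that never arrives.
def worst_case_detection_s(keepalive_s):
    return 2 * keepalive_s

print(worst_case_detection_s(60))   # 120 -> the ~2 minutes described above

# Newer azure-iot-device releases accept a keep_alive kwarg (an assumption;
# verify against your SDK version) that would shrink this window, e.g.:
#   client = IoTHubDeviceClient.create_from_connection_string(conn_str, keep_alive=20)
```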

Back to your bug, it looks like an OS component, maybe Network Manager, can get into a broken state where it doesn't recognize when Ethernet cables are unplugged. Once I was able to reproduce this bug once, I could reproduce it every time. I assume it won't reproduce once I reboot.

I'm not sure exactly how it got into a broken state, but I just kept reproducing your issue with different timings (sometimes < 2 minutes, sometimes > 2 minutes). After about 10 or 15 times, it reproduced.

When broken:

  1. I could not ping www.google.com because the name wouldn't resolve.
  2. Running nmcli would show the IP and DNS for the old network. In this example, I'm physically connected to my 192.168.x network, but it's showing me addresses from my 10.x network:
bertk@bertk-hp:/var/lib/dhcp$ nmcli
eno2: connected to Wired connection 2
        "Intel I210 Gigabit"
        ethernet (igb), 9C:7B:EF:22:6B:A7, hw, mtu 1500
        ip4 default
        inet4 10.0.0.174/24
        route4 0.0.0.0/0
        route4 10.0.0.0/24
        inet6 fe80::a60a:5390:61b5:849d/64
        route6 ff00::/8
        route6 fe80::/64
        route6 fe80::/64

<<< other interfaces removed >>>

DNS configuration:
        servers: 10.0.0.1
        interface: eno2

If I watch syslog (tail -f /var/log/syslog), I can see Network Manager slowly start to realize that the address is wrong:

Jan 15 12:14:49 bertk-hp NetworkManager[1036]: <info>  [1610741689.3271] device (eno2): scheduling DHCPv4 restart in 120 seconds, 1 tries left (reason: lease expired)

I see this message 3 times, for 3 tries, and then the connection starts working. I assume Network Manager does something to repair the connection but I didn't dig any deeper.

I was able to find discussions on similar problems by searching for "scheduling DHCPv4 restart" but I didn't find any fixes or workarounds (besides waiting 6-ish minutes for the connection to be repaired.)

@robbinvandamme
Author

@BertKleewein
Thanks for the investigation!

You mention an OS issue, e.g. NetworkManager going into a broken state.
But as you can see in the info.txt a few posts ago, that was not the case for me.

Switching networks did not break my internet connection.
From the moment I was connected to the new network, I was able to ping the Google DNS, and a traceroute to the IoT Hub was also working.
Only the application using iot-hub-device failed to connect to the IoT Hub after switching to another network.

@robbinvandamme
Author

@BertKleewein

I think we had the same issue in one of our C++ applications.
We called res_init() to get name resolution working correctly again.
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxbd00/resinit.htm
saltstack/salt#21397

Maybe it has something to do with this...?

@robbinvandamme
Author

@BertKleewein if needed, I'm willing to prepare a test-setup and have a video call.

@BertKleewein
Member

@robbinvandamme - it sounds like the getaddrinfo function is just fundamentally broken. There is another workaround mentioned here (along with using res_init). Maybe using nscd can fix your problem?
https://stackoverflow.com/a/19930436/6262999

@BertKleewein
Member

@robbinvandamme - I'm sorry, I am unable to do anything to help you with this issue. Almost everything about this bug points to a problem with the underlying network stack, and the workaround calling res_init supports this. I can say with reasonable confidence that we're not doing anything wrong. We don't interact directly with the DNS resolver so there's no way to change how we use it. The best we could do is somehow detect this situation and mitigate it by calling something like res_init. We're not in any better position to do this than you are. In fact, we're in a much worse position because you only need to worry about fixing your scenario, but we need to worry about not breaking things for all of our other customers.

All of the reading and experimenting that I've done points to the conclusion that the DNS resolver landscape is a mess. Most recently, it's this article ( https://tailscale.com/blog/sisyphean-dns-client-linux/ ) and a conversation that goes along with it ( https://news.ycombinator.com/item?id=26821298 ). This article talks about the challenging task of getting a single resolver configuration correct. Trying to come up with a solution that fixes a bug in a particular configuration without affecting any other configurations is an impossible task.

You have been so incredibly patient with us and with me on this issue. You have endured my slow response. You have offered your time and assistance. I appreciate all of this more than I can say. I only wish I could help more (or at all), and I hope this doesn't cause you more pain.

(For what it's worth, I wonder if the reason that ping and traceroute resolve correctly is that you launch a new process every time, while your azure-iot-device process fails because it is trying to connect from an old (now corrupted) process. I believe that res_init is a function in one of the C runtime libraries, so I assume this workaround reinitializes something about the state of the running process rather than something about the system as a whole. But this is still just a guess…)

@robbinvandamme
Author

robbinvandamme commented Jan 26, 2022

@BertKleewein
I have an issue that could be related to this one.
It's an issue we have at a customer, I was not able to reproduce it myself (yet).

I'm using azure-iot-device==2.9.0.

What happens:

  • the application is working fine
  • the setup gets rebooted (I think including the router and related equipment)
  • our gateway boots faster than the router, I suppose, so at startup no internet connection is available
  • the application tries some REST calls to some servers and they fail --> ssl:default [Name or service not known]"} (expected behavior)
  • the IoT Hub connection fails --> ConnectionFailedError(None) (expected behavior)
  • after some time the router has booted and a network connection is available
  • the application tries some REST calls to some servers and they SUCCEED (expected behavior)
  • iothub connect() gets called --> still fails with ConnectionFailedError(None) and never recovers from this.

I don't have many logs for the moment.
Now I've enabled paho and azure DEBUG-level logging, and I'll ask the customer to power down his devices and start them the same way, in the hope we get into the same state and collect some more logs.
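For anyone trying to capture the same traces, enabling DEBUG logging can be as simple as the sketch below; the logger names are assumptions based on the package names, not something confirmed in this thread:

```python
import logging

# Root DEBUG catches everything; then make sure the two stacks of
# interest are at DEBUG as well. Logger names are assumptions based
# on the package names ("azure.iot.device" and "paho").
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("azure.iot.device").setLevel(logging.DEBUG)
logging.getLogger("paho").setLevel(logging.DEBUG)
```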

I still think this issue has something to do with the resolver not getting updated in the application/SDK.
It looks a bit like golang/go#21083.

Something I don't get is: why are the REST calls to other servers recovering, but the iot-hub-device SDK is not?

Kind regards,

Robbin

@robbinvandamme
Author

@BertKleewein
I do see that ares_init() gets called in a few places in the Azure device client C SDK.
However, I'll try to call res_init() myself before calling connect() and see how it goes.
It could be that this solves the initial issue of this thread, as well as the one in my last post.

I'll come back with the results :-).

@robbinvandamme
Author

@BertKleewein
I was able to reproduce this by switching between networks.
I also got it fixed (or at least worked around) by calling __res_init() right before calling connect().

When the connection drops due to a network switch, I check the connection state and, if it's disconnected, I call __res_init() again.

That did the trick, but I'm not convinced the user of the SDK should have to keep the resolver/DNS settings up to date.

Could you share your opinion on that?

I get hold of __res_init() like this, and I call it right before connect() when we are disconnected:

import ctypes
# Load glibc and call its (private) __res_init symbol, which re-reads
# /etc/resolv.conf for the current process
libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = getattr(libc, "__res_init")
res_init()
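A slightly more defensive variant of that snippet, sketched as a reusable helper. This is an untested editorial sketch, not an SDK-sanctioned API: it uses find_library to locate libc and falls back gracefully where the private __res_init symbol is not exported (e.g. musl-based systems, where this workaround does not apply):

```python
import ctypes
import ctypes.util

def refresh_resolver():
    """Re-read /etc/resolv.conf in this process via glibc's private
    __res_init symbol. Returns True on success, False if the symbol
    is unavailable (non-glibc systems) or the call fails."""
    libc_path = ctypes.util.find_library("c") or "libc.so.6"
    libc = ctypes.CDLL(libc_path)
    try:
        res_init = libc.__res_init
    except AttributeError:
        return False            # symbol not exported; workaround not applicable
    return res_init() == 0      # glibc returns 0 on success

# Hypothetical usage around the SDK calls (client is an IoTHubDeviceClient):
#     if not client.connected:
#         refresh_resolver()
#         client.connect()
```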

I think this will also fix the scenario where the router and the application boot together and the router is not fully up and running at the time we call our first connect().

Kind regards,

Robbin

@robbinvandamme
Author

@BertKleewein do you know how they handled this in the C SDK?
