
iot-hub-device-client unable to connect to IoT-Hub when switched from network #548

Closed
robbinvandamme opened this issue Apr 23, 2020 · 22 comments


@robbinvandamme commented Apr 23, 2020

Context

  • OS and version used: Ubuntu 18
  • Python version: 3.7.5
  • pip version: 19.0.3
  • cloned repo: azure-iot-device==2.1.1

Description of the issue

When switching networks while the application is running and connected, the connection to IoT Hub fails on the new network.
Steps to reproduce:

  • connect the iot-hub-device-client to the IoT Hub via network1
  • switch your device to network2 --> the connection drops and is not able to recover
  • switch back to network1 --> the connection succeeds

Code sample exhibiting the issue

No code sample needed; just make sure iot-hub-device-client.connect() is called while on network1.

AB#7366704

@robbinvandamme
Author

To be clear:
I'm 100% sure both networks have internet access.
I'm able to reproduce this 100% of the time, doesn't matter which network I start from.

@BertKleewein
Member

How are you switching networks? Is this done by switching cables, bringing interfaces up and down, or by some other method?

@robbinvandamme
Author

@BertKleewein Yes, switching cables.

@robbinvandamme
Author

@BertKleewein Are you able to reproduce this?

@robbinvandamme
Author

@BertKleewein
I think this is related to #606 .
DNS resolve issue?

@BertKleewein
Member

It sure feels like a DNS issue to me, but I'm not 100% confident on that. I'd be curious to see the output from ifconfig, ping YOUR_HUB.azure-devices.net, and traceroute YOUR_HUB.azure-devices.net when you're on network 1 (and the client works) and then again when you're on network 2 (and the client isn't working).

@robbinvandamme
Author

I'll rebuild that setup and let you know.

@robbinvandamme
Author

Hi,
Hope this helps.
info.txt

@robbinvandamme
Author

@BertKleewein
Did you find the logs useful and have you been able to reproduce it?

@electrozen

Hello, I have a similar issue running the 2.2.0 version of the SDK. The "connection refused" error (as in issue #606), followed by the inability to reconnect, is triggered by at least two situations:

  1. Sustained CPU overload
  2. Long sleep / wake up of the device
    Enclosed is a log for case 2. Note that the stack is still alive (see the last messages) but has stopped all attempts to reconnect.
    azure-bug-150.txt

@BertKleewein
Member

@electrozen - I'm not sure if your issue is related to this one. Are you switching between two networks? If this isn't related to switching between networks, can you open a new issue please?

For situation number 1, can you provide more information, or even better, a log? Simply overloading the CPU doesn't seem to be sufficient to cause problems, and I suspect there is something more specific happening.

In your log for situation #2, it looks like the SAS credentials that are built from your symmetric key are expiring while the system is asleep. Then, when the code tries to automatically reconnect, the connection is refused because the credentials are old. The client treats "UnauthorizedError" as a condition which requires user intervention, so it stops the automatic reconnection logic.

Unfortunately, I don't think calling the 'connect' method to force a reconnect would work either because the token renewal is based on a timer countdown and not clock time. In other words, when you connect, the client connects using credentials which are valid for about an hour of clock time (give-or-take). Then the client starts counting down, and after 50 minutes, it renews the credentials. Normally this means that the client renews the connection credentials before they expire. However, when you put the computer to sleep, the count-down pauses but the credentials still expire. If you go to sleep when the token still has 20 minutes before it needs to be renewed, the library assumes it still has 20 minutes of validity after it wakes up.
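The countdown-versus-clock race described above can be sketched in a few lines. This is a minimal illustration, not SDK code; the TTL and renewal numbers are the illustrative ones from this explanation, not the SDK's exact values:

```python
# Sketch (not SDK code) of why a countdown-based renewal timer misses
# tokens that expire during sleep. Numbers are illustrative assumptions.
TOKEN_TTL = 3600        # wall-clock seconds the SAS token stays valid
RENEWAL_AT = 3000       # countdown renews after this many *running* seconds

def token_state(wall_elapsed, timer_elapsed):
    """wall_elapsed: real seconds since connect (keeps advancing during sleep).
    timer_elapsed: seconds the renewal countdown actually ran (pauses during sleep)."""
    renewed = timer_elapsed >= RENEWAL_AT
    expired = wall_elapsed >= TOKEN_TTL
    if expired and not renewed:
        return "unauthorized"   # server rejects stale credentials; client stops retrying
    return "renewed" if renewed else "valid"

# Awake the whole time: renewal fires before expiry.
print(token_state(3000, 3000))          # renewed
# Sleep 2 h starting at t=1800: the clock passes the TTL, the countdown never fires.
print(token_state(1800 + 7200, 1800))   # unauthorized
```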

@electrozen

Thanks. Indeed, the issue does not appear specifically when changing networks. I will open another issue with logs for a CPU-overload situation. Overall, my use case involves user devices sleeping, changing networks, hitting high CPU, etc. So far I'm unable to have device connections reliably last more than a couple of days on my test fleet. Based on your explanation, I will have to set up a specific Azure client status check and process restart in my separate watcher service.

@BertKleewein
Member

@robbinvandamme, I was finally able to reproduce this and I'm afraid that I can't do anything to fix it.

Before I go deeper, I need to explain that it takes our MQTT transport library (Paho) about 2 minutes to detect a broken connection on Linux. This is because the broken connection doesn't cause the socket to fail on Linux like it does on Windows. Instead, Paho has to detect the break by sending a PINGREQ packet to the server and waiting for a PINGRESP packet to return. It only sends this PINGREQ if 60 seconds elapse with no packets from the server, and then it waits 60 seconds for the PINGRESP to return. This 60 seconds is called the MQTT "keepalive" value.

This means that you can unplug your network cable for up to 2 minutes without the "dropped connection" code being executed. After that 2 minutes, it closes the socket and tries to re-open it again.
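The two-minute window is just twice the keepalive value; a tiny sketch of that arithmetic, with a hedged note on tuning it (the keep_alive kwarg is an assumption about newer SDK releases, so check your version):

```python
# Worst-case time for Paho to notice a dead link, per the behavior above:
# up to `keepalive` seconds of silence before a PINGREQ goes out, plus
# `keepalive` more seconds waiting for a PINGRESP that never arrives.
def worst_case_detection_s(keepalive_s):
    return 2 * keepalive_s

print(worst_case_detection_s(60))   # 120 -> the ~2 minutes described above

# Newer azure-iot-device releases accept a keep_alive kwarg (an assumption;
# verify against your SDK version) that would shrink this window, e.g.:
#   client = IoTHubDeviceClient.create_from_connection_string(conn_str, keep_alive=20)
```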

Back to your bug, it looks like an OS component, maybe Network Manager, can get into a broken state where it doesn't recognize when Ethernet cables are unplugged. Once I was able to reproduce this bug once, I could reproduce it every time. I assume it won't reproduce once I reboot.

I'm not sure exactly how it got into a broken state, but I just kept reproducing your issue with different timings (sometimes < 2 minutes, sometimes > 2 minutes). After about 10 or 15 times, it reproduced.

When broken:

  1. I could not ping www.google.com because the name wouldn't resolve.
  2. Running nmcli would show the IP and DNS for the old network. In this example, I'm physically connected to my 192.168.x network, but it's showing me addresses from my 10.x network:
bertk@bertk-hp:/var/lib/dhcp$ nmcli
eno2: connected to Wired connection 2
        "Intel I210 Gigabit"
        ethernet (igb), 9C:7B:EF:22:6B:A7, hw, mtu 1500
        ip4 default
        inet4 10.0.0.174/24
        route4 0.0.0.0/0
        route4 10.0.0.0/24
        inet6 fe80::a60a:5390:61b5:849d/64
        route6 ff00::/8
        route6 fe80::/64
        route6 fe80::/64

<<< other interfaces removed >>>

DNS configuration:
        servers: 10.0.0.1
        interface: eno2

If I watch syslog (tail -f /var/log/syslog), I can see Network Manager slowly start to realize that the address is wrong:

Jan 15 12:14:49 bertk-hp NetworkManager[1036]: <info>  [1610741689.3271] device (eno2): scheduling DHCPv4 restart in 120 seconds, 1 tries left (reason: lease expired)

I see this message 3 times, for 3 tries, and then the connection starts working. I assume Network Manager does something to repair the connection but I didn't dig any deeper.

I was able to find discussions on similar problems by searching for "scheduling DHCPv4 restart" but I didn't find any fixes or workarounds (besides waiting 6-ish minutes for the connection to be repaired.)

@robbinvandamme
Author

@BertKleewein
Thanks for the investigation!

You mention an OS issue, e.g. NetworkManager going into a broken state.
But as you can see in the info.txt a few posts ago, that was not the case for me.

Switching networks did not break my internet connection.
From the moment I was connected to the new network, I was able to ping the Google DNS, and a traceroute to the IoT Hub was also working.
Only the application using iot-hub-device failed to connect to the IoT Hub after switching to another network.

@robbinvandamme
Author

@BertKleewein

I think we had the same issue in one of our C++ applications.
We called res_init() to get name resolution working correctly again.
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxbd00/resinit.htm
saltstack/salt#21397

Maybe it has something to do with this...?

@robbinvandamme
Author

@BertKleewein if needed, I'm willing to prepare a test-setup and have a video call.

@BertKleewein
Member

@robbinvandamme - it sounds like the getaddrinfo function is just fundamentally broken. There is another workaround mentioned here (along with using res_init). Maybe using nscd can fix your problem?
https://stackoverflow.com/a/19930436/6262999

@BertKleewein
Member

@robbinvandamme - I'm sorry, I am unable to do anything to help you with this issue. Almost everything about this bug points to a problem with the underlying network stack, and the workaround calling res_init supports this. I can say with reasonable confidence that we're not doing anything wrong. We don't interact directly with the DNS resolver so there's no way to change how we use it. The best we could do is somehow detect this situation and mitigate it by calling something like res_init. We're not in any better position to do this than you are. In fact, we're in a much worse position because you only need to worry about fixing your scenario, but we need to worry about not breaking things for all of our other customers.

All of the reading and experimenting that I've done points to the conclusion that the DNS resolver landscape is a mess. Most recently, it's this article ( https://tailscale.com/blog/sisyphean-dns-client-linux/ ) and a conversation that goes along with it ( https://news.ycombinator.com/item?id=26821298 ). This article talks about the challenging task of getting a single resolver configuration correct. Trying to come up with a solution that fixes a bug in a particular configuration without affecting any other configurations is an impossible task.

You have been so incredibly patient with us and with me on this issue. You have endured my slow response. You have offered your time and assistance. I appreciate all of this more than I can say. I only wish I could help more (or at all), and I hope this doesn't cause you more pain.

(For what it's worth, I wonder if the reason that ping and traceroute resolve correctly is that you launch a new process every time, while your azure-iot-device process fails because it is trying to connect from an old (now corrupted) process. I believe that res_init is a function in one of the C runtime libraries, so I assume this workaround reinitializes something about the state of the running process rather than something about the system as a whole. But this is still just a guess…)

@robbinvandamme
Author

robbinvandamme commented Jan 26, 2022

@BertKleewein
I have an issue that could be related to this one.
It's an issue we have at a customer, I was not able to reproduce it myself (yet).

I'm using azure-iot-device==2.9.0.

What happens:

  • the application is working fine
  • the setup gets rebooted (I think including the router and related equipment)
  • our gateway boots faster than the router, I suppose, so at startup no internet connection is available
  • the application tries some REST calls to some servers and they fail --> ssl:default [Name or service not known]"} (expected behavior)
  • the IoT Hub connection fails --> ConnectionFailedError(None) (expected behavior)
  • after some time the router has booted and a network connection is available
  • the application tries some REST calls to some servers and they SUCCEED (expected behavior)
  • iothub connect() gets called --> still fails with ConnectionFailedError(None) and never recovers from this.

I don't have many logs for the moment.
Now I've enabled paho and azure DEBUG-level logging, and I'll ask the customer to power down his devices and start them the same way, in the hope we get into the same state and collect some more logs.
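For anyone trying to capture the same traces, enabling DEBUG logging can be as simple as the sketch below; the logger names are assumptions based on the package names, not something confirmed in this thread:

```python
import logging

# Root DEBUG catches everything; then make sure the two stacks of
# interest are at DEBUG as well. Logger names are assumptions based
# on the package names ("azure.iot.device" and "paho").
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("azure.iot.device").setLevel(logging.DEBUG)
logging.getLogger("paho").setLevel(logging.DEBUG)
```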

I still think this issue has something to do with the resolver not getting updated in the application/SDK.
It looks a bit like golang/go#21083.

Something I don't get is: why are the REST calls to other servers recovering, but the iot-hub-device SDK is not?

Kind regards,

Robbin

@robbinvandamme
Author

@BertKleewein
I do see that ares_init() gets called in a few places in the Azure device client C SDK.
However, I'll try to call res_init() myself before calling connect() and see how it goes.
It could be that this solves the initial issue of this thread, as well as the one in my last post.

I'll come back with the results :-).

@robbinvandamme
Author

@BertKleewein
I was able to reproduce this by switching between networks.
I also got it fixed (or at least worked around) by calling __res_init() right before calling connect().

When the connection drops due to a network switch, I check the connection state and, if it's disconnected, I call __res_init() again.

That did the trick, but I'm not convinced the user of the SDK should have to keep the resolver/DNS settings up to date.

Could you share your opinion on that?

I get hold of __res_init() like this, and I call it right before connect() when we are disconnected:

import ctypes
# Load glibc and call its (private) __res_init symbol, which re-reads
# /etc/resolv.conf for the current process
libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = getattr(libc, "__res_init")
res_init()
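A slightly more defensive variant of that snippet, sketched as a reusable helper. This is an untested editorial sketch, not an SDK-sanctioned API: it uses find_library to locate libc and falls back gracefully where the private __res_init symbol is not exported (e.g. musl-based systems, where this workaround does not apply):

```python
import ctypes
import ctypes.util

def refresh_resolver():
    """Re-read /etc/resolv.conf in this process via glibc's private
    __res_init symbol. Returns True on success, False if the symbol
    is unavailable (non-glibc systems) or the call fails."""
    libc_path = ctypes.util.find_library("c") or "libc.so.6"
    libc = ctypes.CDLL(libc_path)
    try:
        res_init = libc.__res_init
    except AttributeError:
        return False            # symbol not exported; workaround not applicable
    return res_init() == 0      # glibc returns 0 on success

# Hypothetical usage around the SDK calls (client is an IoTHubDeviceClient):
#     if not client.connected:
#         refresh_resolver()
#         client.connect()
```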

I think this will also fix the scenario where the router and the application boot together and the router is not fully up and running at the time we call our first connect().

Kind regards,

Robbin

@robbinvandamme
Author

@BertKleewein do you know how they handled this in the C SDK?
