Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

salt-minion getaddrinfo in dns_check() never gets updated nameservers because of glibc caching #21397

Closed
tjstansell opened this issue Mar 7, 2015 · 14 comments
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists
Milestone

Comments

@tjstansell
Copy link
Contributor

There can be a race condition when the minion starts where it might start before the network is fully initialized (perhaps DHCP is slow or was down). In these cases, I've seen salt-minion stuck querying 127.0.0.1 for DNS. Even after the network does come online, it never picks up the changes because glibc caches the resolv.conf forever. Supposedly, debian/ubuntu and opensuse actually patch glibc to stat /etc/resolv.conf for changes (I have not verified this, just saw a reference online somewhere), but this doesn't happen on redhat. Without a fix in glibc, the current solution is to restart the salt-minion if the resolv.conf changes. However, salt could also trigger a res_init() to force glibc to reload the resolv.conf configuration.

I've tested the following by copying the dns_check() and hacking on it a bit. With the res_init() in there, it does pick up changes to resolv.conf:

import time
import socket
import ctypes
libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = libc.__res_init

def dns_check(addr, safe=False, ipv6=False):
    res_init()
    print time.time(),": Checking", addr
    try:
        hostnames = socket.getaddrinfo(
            addr, None, socket.AF_UNSPEC, socket.SOCK_STREAM
        )
        if not hostnames:
            error = True
            print "FAILURE: no result"
        else:
            addr = False
            print "SUCCESS: ", hostnames
    except TypeError:
        err = ('Attempt to resolve address \'{0}\' failed. Invalid or unresolveable address').format(addr)
        print err
    except socket.error as msg:
        #error = True
        print "ERROR: socket error: ", msg

for i in range(10):
    dns_check('salt')
    print "Sleeping for 10 seconds ..."
    time.sleep(10)

I tested by changing /etc/resolv.conf to point to a bogus nameserver, then during the loop, changing it back and it shows it able to correctly resolve the address.

Obviously, this isn't the correct patch (It's probably not portable and likely not running res_init() in the right places), but it is a workable solution to a very annoying problem of my minion spitting these out and never being able to fix itself (because it's looking for a nameserver at 127.0.0.1)

2015-03-07 01:22:05,757 [salt.utils       ][ERROR   ] DNS lookup of 'salt01' failed.
2015-03-07 01:22:05,757 [salt.minion      ][ERROR   ] Master hostname: 'salt01' not found. Retrying in 30 seconds

This might become more important if #15122 and/or #10032 is addressed.

I'm running 2014.7.2, fyi.

@tjstansell tjstansell changed the title salt-minion getaddrinfo in check_dns() never gets updated nameservers because of glibc caching salt-minion getaddrinfo in dns_check() never gets updated nameservers because of glibc caching Mar 7, 2015
@jfindlay jfindlay added Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists labels Mar 9, 2015
@jfindlay jfindlay added this to the Approved milestone Mar 9, 2015
@jfindlay
Copy link
Contributor

jfindlay commented Mar 9, 2015

@tjstansell, thanks for the info.

@tjstansell
Copy link
Contributor Author

I have tested the following patch, which does seem to work on my CentOS 7 test box, but I'm unsure if this would be sufficient for all other platforms. I wrapped it in a try/except block so theoretically it should work everywhere. If folks think this is good, I can submit an actual PR. Since the only thing that uses dns_check() is resolve_dns() in minion.py when the minion is trying to resolve the master's IP, it seems fine to simply call res_init() every time (assuming it's available).

diff --git a/salt/utils/__init__.py b/salt/utils/__init__.py
index 3b89774..c646200 100644
--- a/salt/utils/__init__.py
+++ b/salt/utils/__init__.py
@@ -96,6 +96,14 @@ try:
 except ImportError:
     HAS_SETPROCTITLE = False

+try:
+    import ctypes
+    libc = ctypes.cdll.LoadLibrary('libc.so.6')
+    res_init = libc.__res_init
+    HAS_RESINIT = True
+except:
+    HAS_RESINIT = False
+
 # Import salt libs
 import salt._compat
 import salt.exitcodes
@@ -518,6 +526,9 @@ def dns_check(addr, safe=False, ipv6=False):
     '''
     error = False
     try:
+       # issue #21397: force glibc to re-read resolv.conf
+       if HAS_RESINIT:
+           res_init()
         hostnames = socket.getaddrinfo(
             addr, None, socket.AF_UNSPEC, socket.SOCK_STREAM
         )

@jfindlay
Copy link
Contributor

@tjstansell, this is an interesting approach, and may be similar to an eventual solution. Would it be better to locally read resolv.conf? I don't feel qualified to make a decision about this, though perhaps @cachedout or @whiteinge have better ideas.

@tjstansell
Copy link
Contributor Author

If you plan to continue using socket.getaddrinfo() to do the actual DNS resolution, there's no point in reading the resolv.conf yourself. In all the research I've done on this, it's a glibc problem that affects a lot of long-running processes. In one post I read, folks talked about Firefox being the only thing that doesn't have this problem because they go out of their way to force the res_init() like I'm suggesting here.

@jfindlay
Copy link
Contributor

@tjstansell, thanks for bringing your research and experience to this issue.

@jfindlay
Copy link
Contributor

What do you think about this, assuming there's some way to locate the name of the system library? whether or not that needs to be hand crafted into salt itself may not be an issue.

diff --git a/salt/utils/__init__.py b/salt/utils/__init__.py
index 0cd14e9..da07105 100644
--- a/salt/utils/__init__.py
+++ b/salt/utils/__init__.py
@@ -541,6 +541,20 @@ def ip_bracket(addr):
     return addr


+def reload_dns():
+    '''
+    cause the system to reload DNS info
+    '''
+    libc_name = figure_out_libc_name()
+    try:
+        import ctypes
+        libc = ctypes.cdll.LoadLibrary(libc_name)
+        libc.__res_init()
+        return True
+    except ImportError:
+        return False
+
+
 def dns_check(addr, safe=False, ipv6=False):
     '''
     Return the ip resolved by dns, but do not exit on failure, only raise an
@@ -548,6 +562,7 @@ def dns_check(addr, safe=False, ipv6=False):
     '''
     error = False
     try:
+        reload_dns()
         hostnames = socket.getaddrinfo(
             addr, None, socket.AF_UNSPEC, socket.SOCK_STREAM
         )

@tjstansell
Copy link
Contributor Author

In that case, it would try to import ctypes and load the library every time it calls reload_dns(), right? Wouldn't it be better to simply do that once at startup rather than every time? That said, I'm not sure how to determine the correct library name. The example shown on https://docs.python.org/2/library/ctypes.html uses libc.so.6, which is why I chose it. It will either work or it won't. Additionally, in your example, I don't think it should be except ImportError: because what if the ctypes import works, but the LoadLibrary() fails? I think just a generic except: is sufficient, but I'm also fairly new to python, so take all this with a grain of salt ;)

@jfindlay
Copy link
Contributor

That's true, but we also try to keep utils/__init__.py as light as possible.

@tjstansell
Copy link
Contributor Author

So after more searching, I found we can use the following to find the correct library name:

>>> import ctypes.util
>>> clib = ctypes.util.find_library("c")
>>> print clib
libc.so.6

@jfindlay
Copy link
Contributor

Of course, that makes sense, thanks @tjstansell. I suggest you submit a pull request with the changes you've proposed.

tjstansell added a commit to tjstansell/salt that referenced this issue Mar 26, 2015
@jfindlay jfindlay self-assigned this Mar 26, 2015
@jfindlay jfindlay removed their assignment Mar 26, 2015
thatch45 added a commit that referenced this issue Apr 1, 2015
fix #21397 - force glibc to re-read resolv.conf
@rvora
Copy link

rvora commented Jul 31, 2015

There are many places in Salt where socket.getaddrinfo is used. All these places need to call res_init() for this fix to work? Or can we have all these places call saltutil.dns_check?

Currently I am running into this same issue with network.connect module. During a recent highstate, /etc/resolv.conf was changed. Now network.connect can not resolve some local hostnames. A new nameserver was added to /etc/resolv.conf so this doesn't work until I restart the minion.

cloudyhead-2:salt rvora$ grep -r getaddrinfo .
./grains/core.py:        info = socket.getaddrinfo(hostname()['fqdn'], None, socket.AF_INET)
./grains/core.py:        info = socket.getaddrinfo(hostname()['fqdn'], None, socket.AF_INET6)
./modules/dnsutil.py:            addresses = [sock[4][0] for sock in socket.getaddrinfo(host, None, socket.AF_INET, 0, socket.SOCK_RAW)]
./modules/dnsutil.py:            addresses = [sock[4][0] for sock in socket.getaddrinfo(host, None, socket.AF_INET6, 0, socket.SOCK_RAW)]
./modules/network.py:         _address) = socket.getaddrinfo(address, port, __family, 0, __proto)[0]
./modules/win_network.py:         _address) = socket.getaddrinfo(address, port, __family, 0, __proto)[0]
./utils/__init__.py:        hostnames = socket.getaddrinfo(
./utils/network.py:        family, socktype, proto, canonname, sockaddr = socket.getaddrinfo(
./utils/network.py:    # try socket.getaddrinfo
./utils/network.py:        addrinfo = socket.getaddrinfo(
./utils/network.py:    # try socket.getaddrinfo
./utils/network.py:        addrinfo = socket.getaddrinfo(
./utils/verify.py:        hostnames = socket.getaddrinfo(

@jfindlay
Copy link
Contributor

@rvora, I think the appropriate solution is to at least call salt.utils.dns_check to make sure we are running on current DNS info, if not use that function in place of getaddrinfo. I can't say without looking at this closer what the best solution would be, but I agree that something like you suggest should be done in these uses of getaddrinfo.

@rvora
Copy link

rvora commented Jul 31, 2015

Thanks for your quick response. Thinking a bit more about this, since dns_check does not accept all the parameters getaddrinfo accepts, I think you kind of have to stick with using getaddrinfo for most of these situations.

May be this is a good compromise.

utils/__init__.py:
def resinit():
    if HAS_RESINIT:
        res_init()

def dns_check(.....):
    resinit()
.... continue ....
modules/network.py:

def connectI(....):
    salt.utils.resinit()

    ..... continue ....

@jfindlay
Copy link
Contributor

@rvora, that seems like a good idea to me, you are welcome to create a pull request against 2015.5, since this would be a bug fix, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior severity-low 4th level, cosemtic problems, work around exists
Projects
None yet
Development

No branches or pull requests

3 participants