Sudden connection error in pipeline services #506

Open
JenySadadia opened this issue Mar 27, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@JenySadadia
Collaborator

After starting the API and pipeline services, everything worked fine for some time. Then the monitor, tarball, and scheduler-k8s services suddenly stopped. The other pipeline and API services kept running while this issue was observed.

Error logs:

today at 10:13:0403/27/2024 04:43:04 AM UTC [ERROR] Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
today at 10:13:04    sock = connection.create_connection(
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
today at 10:13:04    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
today at 10:13:04    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04socket.gaierror: [Errno -5] No address associated with hostname
today at 10:13:04
today at 10:13:04The above exception was the direct cause of the following exception:
today at 10:13:04
today at 10:13:04Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
today at 10:13:04    response = self._make_request(
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
today at 10:13:04    raise new_e
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
today at 10:13:04    self._validate_conn(conn)
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
today at 10:13:04    conn.connect()
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 616, in connect
today at 10:13:04    self.sock = sock = self._new_conn()
today at 10:13:04                       ^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
today at 10:13:04    raise NameResolutionError(self.host, self, e) from e
today at 10:13:04urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)
today at 10:13:04
today at 10:13:04The above exception was the direct cause of the following exception:
today at 10:13:04
today at 10:13:04Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
today at 10:13:04    resp = conn.urlopen(
today at 10:13:04           ^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
today at 10:13:04    retries = retries.increment(
today at 10:13:04              ^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
today at 10:13:04    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
today at 10:13:04    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='staging.kernelci.org', port=9000): Max retries exceeded with url: /latest/listen/18845 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)"))
today at 10:13:04
today at 10:13:04During handling of the above exception, another exception occurred:
today at 10:13:04
today at 10:13:04Traceback (most recent call last):
today at 10:13:04  File "/home/kernelci/pipeline/base.py", line 69, in run
today at 10:13:04    status = self._run(context)
today at 10:13:04             ^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/home/kernelci/./pipeline/monitor.py", line 60, in _run
today at 10:13:04    event = self._api.receive_event(sub_id)
today at 10:13:04            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/kernelci/api/latest.py", line 138, in receive_event
today at 10:13:04    resp = self._get(path)
today at 10:13:04           ^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/kernelci/api/__init__.py", line 66, in _get
today at 10:13:04    resp = requests.get(
today at 10:13:04           ^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 73, in get
today at 10:13:04    return request("get", url, params=params, **kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 59, in request
today at 10:13:04    return session.request(method=method, url=url, **kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
today at 10:13:04    resp = self.send(prep, **send_kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
today at 10:13:04    r = adapter.send(request, **kwargs)
today at 10:13:04        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 519, in send
today at 10:13:04    raise ConnectionError(e, request=request)
today at 10:13:04requests.exceptions.ConnectionError: HTTPSConnectionPool(host='staging.kernelci.org', port=9000): Max retries exceeded with url: /latest/listen/18845 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)"))
today at 10:13:04
today at 10:24:09Container stopped

It seems like something is blocking the pipeline services from accessing the API. Maybe it's a sysadmin-related issue? @nuclearcat
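The traceback above ends in a `requests.exceptions.ConnectionError`, but the first exception in the chain is a `socket.gaierror`, which points at name resolution rather than at the API itself. A minimal sketch of how a service could tell the two apart by walking the exception chain; the helper names `chained_exceptions` and `is_dns_failure` are hypothetical and not part of the pipeline code:

```python
import socket


def chained_exceptions(exc):
    """Yield exc and every exception chained beneath it."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        # "direct cause" links are stored in __cause__,
        # "during handling of" links are stored in __context__.
        exc = exc.__cause__ or exc.__context__


def is_dns_failure(exc):
    """Return True if a socket.gaierror appears anywhere in the chain."""
    return any(isinstance(e, socket.gaierror) for e in chained_exceptions(exc))
```

Calling `is_dns_failure(err)` inside an `except` block would let the log message say "DNS lookup failed" instead of a generic connection error.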

@nuclearcat
Member

I noticed that DNS resolution has been unreliable for the last few days on Azure services in general; it is even affecting the deploy scripts. Unfortunately there is not much we can do yet. We might add more DNS servers in the network config.

@r-c-n
Contributor

r-c-n commented Mar 27, 2024

It seems to be happening again. If these services are meant to be long-lived, could we introduce some mechanism to relaunch them before we move to production? It's not a good idea at this moment, though, since some of them are still under development and could exit due to a programming error, and we don't want to keep relaunching them in those cases.
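One way to address both concerns is to retry only on errors known to be transient and let everything else propagate, so a programming error still stops the service. This is only a sketch under assumptions: `call_with_retry` and its parameters are hypothetical, not existing pipeline code, and the default `transient` tuple uses the built-in `ConnectionError` so the example runs without third-party packages.

```python
import time


def call_with_retry(func, transient=(ConnectionError,), attempts=5,
                    base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying only on the exception types in `transient`.

    Any other exception (for example a TypeError caused by a bug)
    propagates immediately, so the service still exits on programming errors.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except transient:
            if attempt == attempts:
                raise  # out of attempts: re-raise the last transient error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In the monitor loop from the traceback, the wiring might look like `call_with_retry(lambda: self._api.receive_event(sub_id), transient=(requests.exceptions.ConnectionError,))`, assuming the event subscription is still valid after the outage.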

@nuclearcat
Member

I added 3 more resolver entries on the staging host, but I'm not sure they will help with the Docker services. I will investigate more now.
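To see whether the new resolver entries are picked up inside a container, a small check could be run there with the same `socket.getaddrinfo` call that fails in the traceback. A minimal sketch; the function name `can_resolve` is hypothetical:

```python
import socket


def can_resolve(host, port=443):
    """Return True if `host` resolves, False if the resolver reports an error."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        # Same exception class as "[Errno -5] No address associated with hostname"
        return False
```

Running `can_resolve("staging.kernelci.org", 9000)` periodically from inside one of the affected containers and logging the result would show whether lookups still fail there after the host change.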

@JenySadadia JenySadadia added the bug Something isn't working label Apr 16, 2024