Operator handler shows connection reset by peer or sometimes just hangs and doesn't respond #204
This issue was created in the times when synchronous API clients were used. Now, the underlying client is fully async — it is aiohttp. Nevertheless, the issue does not seem to have gone away.
On the good side (depending on how you look at it), it does not seem to be an issue in Kopf but rather in Kubernetes: as reported several times after that, it stops streaming the updates while the TCP/HTTP connection remains alive. A reconnect of any kind helps: either by the client-side or the server-side timeout on the watching requests.
There is currently no default value because it is difficult to determine what would be a good default here, as the issue happens only in some environments: too rare reconnects (10-60 mins) can delay the processing too much, but overly frequent reconnects (1-5 mins) can be a problem in large clusters. Setting one of these timeouts usually helped (#232 (comment), #698 (comment)).
So, I close it.
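For illustration, here is a minimal sketch of how such reconnect timeouts can be set in recent Kopf versions, assuming the settings.watching options from Kopf's configuration documentation (the concrete values are arbitrary examples, not recommendations):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Ask the API server to end each watch request after ~10 minutes
    # (Kopf then re-opens the watch stream)...
    settings.watching.server_timeout = 10 * 60
    # ...and cut the connection client-side slightly later as a safety net,
    # so that a silently dead connection cannot stall the stream forever.
    settings.watching.client_timeout = 11 * 60
```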
@nolar What if the TCP connection dies silently (no FIN packet)?
Expected Behavior
The pod should respond immediately
Actual Behavior
When the operator pod has been running for a long time and we send a create request for a custom resource, it just hangs and doesn't respond; we have to restart the pod, and then it comes back alive.
A "connection reset by peer" error appears in the middle of creating resources.
Steps to Reproduce the Problem
Note: sometimes it gives "connection reset by peer" instead of hanging.
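For context, a minimal sketch of the kind of handler involved, as inferred from the traceback below; the resource group/version/plural and the service body are illustrative placeholders, not the reporter's actual code:

```python
import kopf
import kubernetes

# Hypothetical custom resource; only the general shape matters here.
@kopf.on.create('example.com', 'v1', 'myresources')
def create_fn(namespace, **kwargs):
    v1_client = kubernetes.client.CoreV1Api()
    svc_body = {
        'metadata': {'name': 'my-svc'},
        'spec': {'ports': [{'port': 80}]},
    }
    # The synchronous kubernetes-client call where "connection reset by peer"
    # shows up in the traceback below, or where the handler simply hangs.
    obj = v1_client.create_namespaced_service(namespace, svc_body)
    return {'service': obj.metadata.name}
```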
Specifications
kubectl version: 1.15
python --version: 3.5.2
pip freeze --all:

Connection reset by peer traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 387, in _execute
lifecycle=lifecycle, # just a default for the sub-handlers, not used directly.
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 478, in _call_handler
**kwargs,
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/invocation.py", line 79, in invoke
result = await loop.run_in_executor(config.WorkersConfig.get_syn_executor(), real_fn)
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/handlers.py", line 159, in create_fn
obj = v1_client.create_namespaced_service(namespace, SVC_NOSELECTOR_TEMPLATE)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6934, in create_namespaced_service
(data) = self.create_namespaced_service_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 7025, in create_namespaced_service_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 166, in request
headers=headers)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 326, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context)
File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 368, in ssl_wrap_socket
return context.wrap_socket(sock)
File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
@amolkavitkar Sorry for the delayed answer.
First of all, I see Python 3.5.2 mentioned. Kopf uses some features of Python 3.7 which did not exist earlier (such as contextvars, but also a few more), so, in theory, it cannot run on 3.5.2 (I didn't try, though). Are you sure this is the right version of Python?
Second, can you please provide the Kopf version used, and a code sample with this issue reproduced (with the proprietary and unnecessary code removed)? Without the code, it is unclear what is happening and why, and what can cause the problems. In the stack trace, I see that you call the kubernetes client library with create_namespaced_service(), but this seems like an issue in that client library rather than in Kopf itself. Or, to be more specific, in the low-level SSL/HTTPS sub-libraries used by the kubernetes client library.
Third, it is worth mentioning that the API requests for the creation of the custom resources are, to my knowledge, non-blocking — i.e. k8s creates the resources and finishes the API request, and starts the handling in the operators after that — i.e., the operators cannot block the resource creation. If it is blocked, it is for some other reason (an assumption).
Also, some information on the scale of the operator & cluster would be helpful to reproduce it: how many handled resources do you have (just the order of magnitude: 1s, 10s, 100s, 1000s?), and how often do they change (times per second/minute/hour, also a rough approximation)? High-scale operators can have their own issues with connection- & thread-pools, which have to be taken into account.
I would appreciate it if this information were provided here in the issue, so that I could reproduce the problem or guess what happens internally.
One of the mentioned effects, or the one that looks like it —operators not exiting properly— has been known for some time, is well described in #152, and is going to be fixed with asyncio/aiohttp for all HTTPS operations (drafted in #176). This happens with any sync HTTPS client library inside of an async app, so there is no good way to solve it now. The proper aiohttp way requires a few hops to implement the prerequisite changes as separate PRs (in focus right now). Currently, it is "kind-of-solved" by Kubernetes using SIGKILL on the pod after it does not exit properly with SIGTERM.
We've also been having the same problem with our deployments, using kopf v0.20, although we are not experiencing any exceptions or dropped connections. In our case the pod simply stops receiving updates from the kubernetes API: nothing is sent from the kubernetes API when a custom resource is created or updated, and logs are no longer emitted.
Upon further investigation we've discovered that the connection between the kubernetes API and the operator is still in the 'ESTABLISHED' state, and there are no clear issues in the logs which would suggest that the API and the operator can no longer communicate. However, there is no longer any data being exchanged on the connection, as the octets sent and received by the kopf process remain steady (according to /proc/1/net/netstat).
I'd like to get you some sample code so you can investigate this further. However, since we need this working ASAP, I've developed a liveness check based on my investigations which checks to ensure that there is some kind of communication occurring between the API and the operator:
liveness-probe.sh -- To be added to the operator container
snippet of liveness probe
The code works well with the python:3.7 docker image which we're using. If you're not using that, you'll need to make sure you have at least coreutils and grep installed.
These liveness probe values also work well for us, but they can mean there are 'unhealthy' events in the operator, as the period is less than the frequency of communication I've seen between the operator and the API. The period can be bumped to 4 minutes and the failure threshold reduced to 1 to clean up your events as well.
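The original liveness-probe.sh is not included in this thread as captured here. Purely as an illustration of the same idea, here is a minimal Python sketch that fails the probe when the octet counters of PID 1 stop moving between two probe runs; the file paths are hypothetical, and this is not the poster's actual script:

```python
#!/usr/bin/env python3
"""Illustrative liveness check: exit non-zero if no traffic since the last run."""
import pathlib
import sys

STATE_FILE = pathlib.Path("/tmp/last-octets")        # hypothetical state location
NETSTAT_FILE = pathlib.Path("/proc/1/net/netstat")   # counters of the operator (PID 1)

def read_octets() -> str:
    lines = NETSTAT_FILE.read_text().splitlines()
    # /proc/net/netstat consists of header/value line pairs; the IpExt pair
    # carries the cumulative InOctets/OutOctets counters.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("IpExt:"):
            fields = dict(zip(header.split()[1:], values.split()[1:]))
            return f"{fields['InOctets']} {fields['OutOctets']}"
    raise RuntimeError("IpExt counters not found")

current = read_octets()
previous = STATE_FILE.read_text() if STATE_FILE.exists() else ""
STATE_FILE.write_text(current)
# Healthy (exit 0) only if the counters moved since the previous probe run.
sys.exit(1 if current == previous else 0)
```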
I also observed this strange behaviour once (accidentally) — when the operator didn't see the custom resource created. Surprisingly, it saw other resources and reacted to them. But I couldn't catch and reproduce this situation after the restart.
Meanwhile, there is a new pre-release: kopf==0.23rc1 (0.23 will be released in a couple of days if the live testing goes well).
Among many other changes, this version fully switches Kopf to asyncio/aiohttp. Previously, there were synchronous pykube-ng -> requests -> urllib3 -> socket calls, which are now completely gone. And they already caused issues with the sync socket reads being stuck at exit (see #152).
It is highly probable that the issue will not reappear, since the whole I/O machinery is now changed. I'll be looking for this case more carefully. Please do so too, and report if it does reappear. I will keep this issue open for a few weeks, but then will close it if everything goes smoothly.
Also —though not directly related to the reported issue— there is now a liveness endpoint possible with kopf run --liveness http://:8080/healthz, with optional @kopf.on.probe() handlers to collect the data for reporting as a response to the liveness probe. See the health-checks documentation. Maybe that will be useful for you.
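For reference, a minimal sketch of such a probe handler, following the pattern from the health-checks documentation (the handler id and the returned value are illustrative):

```python
import datetime
import kopf

# Run with:  kopf run handlers.py --liveness=http://0.0.0.0:8080/healthz
@kopf.on.probe(id='now')
def get_current_timestamp(**kwargs):
    # The returned value is reported in the liveness endpoint's response.
    return datetime.datetime.utcnow().isoformat()
```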
@nolar We experience something very similar with the latest kopf release still (minus the connection reset by peer part):
After leaving the controller running for some time (like overnight), it stops handling changes of the resources it watches: there is complete silence in the handler logs, and the kopf annotation on the resource is not updated.
We've enabled liveness probes too, and they are constantly passing (per the logs, and there are no container restarts), but resource handling is just not happening; the controller does not seem to receive updates from K8s.
Only a controller restart helps.
Platform: Ubuntu 18.04
Kubernetes: 1.14 / Docker Enterprise
Python: 3.7.5
Python libs (relevant ones): kopf 0.25, aiohttp 3.6.2
After some netstatting and tcpdumping, it seems that the TCP connection is there, but the HTTP one is gone, yet the controller does not notice.
May be a manifestation of kubernetes/kubernetes#42552
The current suspect is the timeout=aiohttp.ClientTimeout(total=None) arg passed to the GET ..?watch=true request here:
kopf/kopf/clients/watching.py, line 204 in 51a3a70
While I understand why it may be preferable to do so for such watch requests, it seems that (bad?) networking conditions / blips can cause such hangups when no strict timeout is set.
Our preliminary testing shows that setting some specific timeout in this place does not stop normal controller operations, but in the case of a lost connection the controller crashes and is restarted by K8s, which usually alleviates the problem.
I'll push a PR to make this timeout configurable (with the current default behaviour preserved).
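To illustrate the effect of a finite timeout on such a long-polling request (this is a standalone aiohttp sketch, not Kopf's actual watching code; the URL and the timeout value are placeholders):

```python
import asyncio
import aiohttp

async def watch_once(url: str, timeout_seconds: float = 300.0) -> None:
    # A finite total timeout means the request cannot hang forever if the
    # server goes silent while the TCP connection stays ESTABLISHED.
    timeout = aiohttp.ClientTimeout(total=timeout_seconds)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(url, params={"watch": "true"}) as response:
                async for line in response.content:   # one JSON event per line
                    print(line)
        except asyncio.TimeoutError:
            # Treat the timeout as a cue to re-establish the watch connection.
            print("watch timed out; reconnecting")
```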
I too saw a connection reset by peer today, which caused the thread to hang. Here was the output: