Operator handler shows connection reset by peer or sometimes just hangs and doesn't respond #204
This issue was created in the times when synchronous API clients were used. Now, the underlying client is fully async — it is aiohttp. Nevertheless, the issue does not seem to have gone away.
On the good side (depending on how you look at it), it does not seem to be an issue in Kopf but rather in Kubernetes: as reported several times after that, it stops streaming the updates while the TCP/HTTP connection remains alive. A reconnect of any kind helps: either by the client-side or the server-side timeout on the watching requests.
There is currently no default value because it is difficult to determine what would be a good default here, as the issue happens only in some environments: too rare reconnects (10-60 mins) can delay the processing too much, but overly frequent reconnects (1-5 mins) can be a problem in large clusters. Setting one of these timeouts usually helped (#232 (comment), #698 (comment)).
So, I close it.
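For illustration, here is a minimal sketch of how such reconnect timeouts can be set in recent Kopf versions, assuming the settings.watching options from Kopf's configuration documentation (the concrete values are arbitrary examples, not recommendations):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Ask the API server to end each watch request after ~10 minutes
    # (Kopf then re-opens the watch stream)...
    settings.watching.server_timeout = 10 * 60
    # ...and cut the connection client-side slightly later as a safety net,
    # so that a silently dead connection cannot stall the stream forever.
    settings.watching.client_timeout = 11 * 60
```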
@nolar What if the TCP connection dies silently (no FIN packet)?
Expected Behavior
The pod should respond immediately
Actual Behavior
When the operator pod has been running for a long time and we send a create request for a custom resource, it just hangs and doesn't respond; we have to restart the pod, and then it comes back alive.
A "connection reset by peer" error appears in the middle of creating resources.
Steps to Reproduce the Problem
Note: sometimes it gives "connection reset by peer" instead of hanging.
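For context, a minimal sketch of the kind of handler involved, as inferred from the traceback below; the resource group/version/plural and the service body are illustrative placeholders, not the reporter's actual code:

```python
import kopf
import kubernetes

# Hypothetical custom resource; only the general shape matters here.
@kopf.on.create('example.com', 'v1', 'myresources')
def create_fn(namespace, **kwargs):
    v1_client = kubernetes.client.CoreV1Api()
    svc_body = {
        'metadata': {'name': 'my-svc'},
        'spec': {'ports': [{'port': 80}]},
    }
    # The synchronous kubernetes-client call where "connection reset by peer"
    # shows up in the traceback below, or where the handler simply hangs.
    obj = v1_client.create_namespaced_service(namespace, svc_body)
    return {'service': obj.metadata.name}
```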
Specifications
kubectl version: 1.15
python --version: 3.5.2
pip freeze --all:

Connection reset by peer traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 387, in _execute
lifecycle=lifecycle, # just a default for the sub-handlers, not used directly.
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 478, in _call_handler
**kwargs,
File "/usr/local/lib/python3.7/site-packages/kopf/reactor/invocation.py", line 79, in invoke
result = await loop.run_in_executor(config.WorkersConfig.get_syn_executor(), real_fn)
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/handlers.py", line 159, in create_fn
obj = v1_client.create_namespaced_service(namespace, SVC_NOSELECTOR_TEMPLATE)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6934, in create_namespaced_service
(data) = self.create_namespaced_service_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 7025, in create_namespaced_service_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
body=body)
File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 166, in request
headers=headers)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 72, in request
**urlopen_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 150, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 326, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context)
File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 368, in ssl_wrap_socket
return context.wrap_socket(sock)
File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
session=session
File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
self.do_handshake()
File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
@amolkavitkar Sorry for the delayed answer.
First of all, I see Python 3.5.2 mentioned. Kopf uses some features of Python 3.7 which did not exist earlier (such as contextvars, but also a few more), so, in theory, it cannot run on 3.5.2 (I didn't try, though). Are you sure this is the right version of Python?
Second, can you please provide the Kopf version used, and a code sample with this issue reproduced (with the proprietary and unnecessary code removed)? Without the code, it is unclear what is happening and why, and what can cause the problems. In the stack trace, I see that you call the kubernetes client library with create_namespaced_service(), but this seems like an issue in that client library rather than in Kopf itself. Or, to be more specific, in the low-level SSL/HTTPS sub-libraries used by the kubernetes client library.
Third, it is worth mentioning that the API requests for the creation of the custom resources are, to my knowledge, non-blocking — i.e. k8s creates the resources and finishes the API request, and starts the handling in the operators after that — i.e., the operators cannot block the resource creation. If it is blocked, it is for some other reason (an assumption).
Also, some information on the scale of the operator & cluster would be helpful to reproduce it: how many handled resources do you have (just the order of magnitude: 1s, 10s, 100s, 1000s?), and how often do they change (times per second/minute/hour, also a rough approximation)? High-scale operators can have their own issues with connection- & thread-pools, which have to be taken into account.
I would appreciate it if this information were provided here in the issue, so that I could reproduce the problem or guess what happens internally.
One of the mentioned effects, or the one that looks like it —operators not exiting properly— has been known for some time, is well described in #152, and is going to be fixed with asyncio/aiohttp for all HTTPS operations (drafted in #176). This happens with any sync HTTPS client library inside of an async app, so there is no good way to solve it now. The proper aiohttp way requires a few hops to implement the prerequisite changes as separate PRs (in focus right now). Currently, it is "kind-of-solved" by Kubernetes using SIGKILL on the pod after it does not exit properly with SIGTERM.
We've also been having the same problem with our deployments, using kopf v0.20, although we are not experiencing any exceptions or dropped connections. In our case the pod simply stops receiving updates from the kubernetes API: nothing is sent from the kubernetes API when a custom resource is created or updated, and logs are no longer emitted.
Upon further investigation we've discovered that the connection between the kubernetes API and the operator is still in the 'ESTABLISHED' state, and there are no clear issues in the logs which would suggest that the API and the operator can no longer communicate. However, there is no longer any data being exchanged on the connection, as the octets sent and received by the kopf process remain steady (according to /proc/1/net/netstat).
I'd like to get you some sample code so you can investigate this further. However, since we need this working ASAP, I've developed a liveness check based on my investigations which checks to ensure that there is some kind of communication occurring between the API and the operator:
liveness-probe.sh -- To be added to the operator container
snippet of liveness probe
The code works well with the python:3.7 docker image which we're using. If you're not using that, you'll need to make sure you have at least coreutils and grep installed.
These liveness probe values also work well for us, but they can mean there are 'unhealthy' events in the operator, as the period is less than the frequency of communication I've seen between the operator and the API. The period can be bumped to 4 minutes and the failure threshold reduced to 1 to clean up your events as well.
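The original liveness-probe.sh is not included in this thread as captured here. Purely as an illustration of the same idea, here is a minimal Python sketch that fails the probe when the octet counters of PID 1 stop moving between two probe runs; the file paths are hypothetical, and this is not the poster's actual script:

```python
#!/usr/bin/env python3
"""Illustrative liveness check: exit non-zero if no traffic since the last run."""
import pathlib
import sys

STATE_FILE = pathlib.Path("/tmp/last-octets")        # hypothetical state location
NETSTAT_FILE = pathlib.Path("/proc/1/net/netstat")   # counters of the operator (PID 1)

def read_octets() -> str:
    lines = NETSTAT_FILE.read_text().splitlines()
    # /proc/net/netstat consists of header/value line pairs; the IpExt pair
    # carries the cumulative InOctets/OutOctets counters.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("IpExt:"):
            fields = dict(zip(header.split()[1:], values.split()[1:]))
            return f"{fields['InOctets']} {fields['OutOctets']}"
    raise RuntimeError("IpExt counters not found")

current = read_octets()
previous = STATE_FILE.read_text() if STATE_FILE.exists() else ""
STATE_FILE.write_text(current)
# Healthy (exit 0) only if the counters moved since the previous probe run.
sys.exit(1 if current == previous else 0)
```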
I also observed this strange behaviour once (accidentally) — when the operator didn't see the custom resource created. Surprisingly, it saw other resources and reacted to them. But I couldn't catch and reproduce this situation after the restart.
Meanwhile, there is a new pre-release: kopf==0.23rc1 (0.23 will be released in a couple of days if the live testing goes well).
Among many other changes, this version fully switches Kopf to asyncio/aiohttp. Previously, there were synchronous pykube-ng -> requests -> urllib3 -> socket calls, which are now completely gone. And they already caused issues with the sync socket reads being stuck at exit (see #152).
It is highly probable that the issue will not reappear, since the whole I/O machinery is now changed. I'll be looking for this case more carefully. Please do so too, and report if it does reappear. I will keep this issue open for a few weeks, but then will close it if everything goes smoothly.
Also —though not directly related to the reported issue— there is now a liveness endpoint possible with kopf run --liveness http://:8080/healthz, with optional @kopf.on.probe() handlers to collect the data for reporting as a response to the liveness probe. See the health-checks documentation. Maybe that will be useful for you.
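For reference, a minimal sketch of such a probe handler, following the pattern from the health-checks documentation (the handler id and the returned value are illustrative):

```python
import datetime
import kopf

# Run with:  kopf run handlers.py --liveness=http://0.0.0.0:8080/healthz
@kopf.on.probe(id='now')
def get_current_timestamp(**kwargs):
    # The returned value is reported in the liveness endpoint's response.
    return datetime.datetime.utcnow().isoformat()
```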
@nolar We experience something very similar with the latest kopf release still (minus the connection reset by peer part):
After leaving the controller running for some time (like overnight), it stops handling changes of the resources it watches: there is complete silence in the handler logs, and the kopf annotation on the resource is not updated.
We've enabled liveness probes too, and they are constantly passing (per the logs, and there are no container restarts), but resource handling is just not happening; the controller does not seem to receive updates from K8s.
Only a controller restart helps.
Platform: Ubuntu 18.04
Kubernetes: 1.14 / Docker Enterprise
Python: 3.7.5
Python libs (relevant ones): kopf 0.25, aiohttp 3.6.2
After some netstatting and tcpdumping, it seems that the TCP connection is there, but the HTTP one is gone, yet the controller does not notice.
May be a manifestation of kubernetes/kubernetes#42552
The current suspect is the timeout=aiohttp.ClientTimeout(total=None) arg passed to the GET ..?watch=true request here:
kopf/kopf/clients/watching.py, line 204 in 51a3a70
While I understand why it may be preferable to do so for such watch requests, it seems that (bad?) networking conditions / blips can cause such hangups when no strict timeout is set.
Our preliminary testing shows that setting some specific timeout in this place does not stop normal controller operations, but in the case of a lost connection the controller crashes and is restarted by K8s, which usually alleviates the problem.
I'll push a PR to make this timeout configurable (with the current default behaviour preserved).
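To illustrate the effect of a finite timeout on such a long-polling request (this is a standalone aiohttp sketch, not Kopf's actual watching code; the URL and the timeout value are placeholders):

```python
import asyncio
import aiohttp

async def watch_once(url: str, timeout_seconds: float = 300.0) -> None:
    # A finite total timeout means the request cannot hang forever if the
    # server goes silent while the TCP connection stays ESTABLISHED.
    timeout = aiohttp.ClientTimeout(total=timeout_seconds)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(url, params={"watch": "true"}) as response:
                async for line in response.content:   # one JSON event per line
                    print(line)
        except asyncio.TimeoutError:
            # Treat the timeout as a cue to re-establish the watch connection.
            print("watch timed out; reconnecting")
```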
I too saw a connection reset by peer today, which caused the thread to hang. Here was the output: