Small amount of 502s during rollouts with NEG even with "mitigations" #769
Comments
@freehan Can you take a look?
@autrilla
Also, what is the GKE master version? There have been multiple improvements in the programming pipeline.
I forget. I think I ran into this on GKE 1.11, but I'm not sure. I'll see if I can get some time to reproduce it again and give you better data if the issue is still there.
I'm seeing this as well. 502s after every rolling deployment. This doesn't seem production ready?
@mackoftrack wondering if you have set up graceful termination like #769 (comment)
@freehan Yes I have. Same issue. GKE master version 1.12.7-gke.25
@mackoftrack could you provide more detail regarding your setup? Anonymized yaml if possible?
@freehan So you are saying you are not able to reproduce? Basically it's just a typical Deployment with terminationGracePeriodSeconds: 60, a RollingUpdate deployment strategy, a typical Ingress object, and a Service object with: metadata:
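For reference, a minimal sketch of the kind of setup being described (all names and values below are placeholders, not the reporter's actual manifests) might look like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-app
        image: my-app:latest        # placeholder image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'   # NEG annotation for container-native load balancing
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080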
The rolling update scenario is tested continuously in the K8s E2E suite. The yamls we used: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/testing-manifests/ingress/neg-clusterip/rc.yaml
In your setup, can you try switching to the container image used in the test and see if you still see 502s?
This was probably it. I used a Python application with aiohttp as an HTTP server. I don't think it handled SIGTERM properly.
@freehan one difference I'm seeing is that my Service is type: NodePort and yours is type: ClusterIP. Could that be the issue? I followed these tutorials: https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer and https://cloud.google.com/kubernetes-engine/docs/concepts/ingress
@mackoftrack That should be irrelevant.
@freehan I couldn't reproduce the issue using your test yamls. I'm wondering if I'm hitting this: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption I haven't tested with the test yamls in our production cluster, but I will try that today. Since our production cluster is a regional GKE cluster and some of these deployments have only a single replica, there are 0 network endpoints in 2 out of 3 zones. Could that be the issue?
@mackoftrack Based on my testing, the bug has been fixed. You can try it as well.
Step 4 should take only a few seconds, not minutes.
@freehan I believe this is in fact the issue. I was able to reproduce using your test yamls on our regional GKE cluster in the us-west1 region. This may in fact be a new bug.
@mackoftrack how many backends do you have?
I'm using your test yamls, which create a single backend.
FYI, the rolling update test itself scales the deployment up to multiple replicas. If there is only 1 replica, then it is basically a race between removing the old endpoint and adding the new one. If you want to avoid 502s when rolling out a 1-replica deployment, you need to set maxUnavailable to 0 in the rolling update strategy:
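A minimal sketch of that strategy (the name and image are placeholders; the maxSurge value is one common choice, not taken from this thread):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0         # never remove the old pod before its replacement is Ready
      maxSurge: 1               # bring the new pod up first
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-app
        image: my-app:latest    # placeholder image
With maxUnavailable: 0 the Deployment controller has to wait for the new pod to become Ready before tearing down the old one, which removes the single-replica race described above.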
Sorry, you're right. Setting maxUnavailable to 0, I didn't see the 502s.
I'm still seeing problems with this in GKE (1.13.7-gke.8, regional deployment). I have terminationGracePeriodSeconds: 180 and minReadySeconds: 180. My deployment has 2 replicas. I see 502s if I redeploy or do a rolling restart.

The problem seems to happen at the moment when a pod is stopped. I see a log indicating the endpoint has been removed from the endpoint group, and at around the same instant my pod receives a SIGTERM. 10s later the associated pod is deleted and I see a group of 502s. The logs for these indicate they're targeted at the endpoint that was removed from the endpoint group 10s previously. The error is "failed to connect to backend".

I presumed that the point of the large terminationGracePeriodSeconds and minReadySeconds was to give the LB time to react to changes in the endpoint group. Should I be seeing a larger gap between the endpoint group change and pod termination? Is the pod sent a SIGTERM before the endpoint is removed from the LB?

https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace indicates my app should shut down when it receives a SIGTERM, which I take to mean stop accepting new work and terminate when existing work is complete. I don't see how updating terminationGracePeriodSeconds will help unless my app essentially ignores SIGTERM.

I finally found the source code for the test app from the tests mentioned above. It does nothing on receipt of SIGTERM except sleep for 60s: https://github.com/kubernetes/kubernetes/blob/master/test/images/agnhost/serve-hostname/serve_hostname.go. Previous versions used to panic immediately as far as I can tell. The file has moved around a lot so it's difficult to find any history or reasons for the change.

Anyway, it looks like either the test is invalid, as the tested app doesn't do the advised termination, or the advice is wrong. Any thoughts?
Here's a simple nginx-based repro for the problem. This is the config I run:
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  minReadySeconds: 60
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command:
              - /usr/sbin/nginx
              - -s
              - quit
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
  labels:
    app: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - name: nginx
    port: 80
    protocol: TCP
    targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx
spec:
  backend:
    serviceName: nginx
    servicePort: 80
Yeah, now that @philpearl mentioned running on a regional GKE cluster, I tend to think that this problem is specific to regional clusters, as that is where I've been seeing these issues as well.
@mackoftrack not so sure - I did that repro above in a single-zone cluster with a single machine.
I should make it a bit more clear. When a pod is deleted (deletion timestamp added), these things happen in parallel:
After the Endpoints resource is updated (the pod endpoint removed from the ready addresses), service programming starts:
There is a small time gap between step 1 and steps 3 & 4, but programming iptables is generally faster than programming LBs, hence the gap is more visible. To avoid service disruption during pod deletion, the key is to keep serving requests during graceful termination. That leaves enough time for the LB or iptables to get fully programmed. Ideally, K8s would remove the pod from the service backend first and then send SIGTERM to the containers (assuming most containers do not handle SIGTERM properly), but currently there is no such mechanism in K8s and the service life cycle is loosely coupled with the pod life cycle. Hence, it is recommended to have proper SIGTERM handling on pods.
@philpearl I am not sure what /usr/sbin/nginx -s quit exactly does. I would imagine it would stop responding to new requests?
I believe so. But what is "proper handling of SIGTERM"? From this explanation, the only sensible thing to do appears to be to completely ignore it and keep serving. Am I missing something?
@philpearl something like kubernetes/ingress-nginx#322 (comment)
For a generic HTTP server, yes, it means keep serving existing requests, but it should also send Connection: close on responses so clients have to reconnect.
OK, this all makes sense, but the documentation is actively confusing. For a basic web service the advice seems to be as follows: when you get SIGTERM, keep serving, send Connection: close on responses, and only actually exit once the grace period is up.
If you're using something off-the-shelf like nginx you're probably out of luck. Best to sleep in the pre-stop hook to avoid SIGTERM reaching the server, and just hope there are no issues with long-standing HTTP connections. https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace at least strongly hints that you should do a "graceful shutdown" - and that's certainly not what you should do.
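A minimal sketch of that pre-stop-sleep approach, applied to the nginx repro above (assuming the image ships a sleep binary; the 60s value is illustrative and should exceed your observed NEG programming time):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  minReadySeconds: 60
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 90   # must be longer than the pre-stop sleep
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sleep                # nginx keeps serving during the sleep; SIGTERM arrives only afterwards
              - "60"
Because the kubelet only sends SIGTERM after the preStop hook finishes, the pod keeps accepting traffic while the load balancer is reprogrammed, at the cost of slower rollouts.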
@mackoftrack I don't think anyone is denying this is all a bit hacky. It seems like k8s is missing the synchronisation mechanisms needed to make this clean. I'm grumbling because the docs aren't clear - probably because clear documentation would make the hackiness very apparent! old man shakes fist at cloud
@bowei, the idea to send it is interesting. There may be some edge cases if the client did not send any requests in between.
@freehan Could you please explain more about how the HLB (NEG) takes a pod out of service when SIGTERM is sent to it? I did some tests: I just tailed nginx's access log and then deleted that pod, and in this case requests kept arriving for almost 20 seconds. I think it is not related to the readiness config of the pod, which is also applied to the HLB's health check, right? I tested this with both HTTP keep-alive on and off.
So when the pod gets deleted, 2 things happen in parallel:
A. Kubelet sends SIGTERM to the containers.
B. The pod's endpoint is removed from the Endpoints resource and the NEG/LB gets reprogrammed.
B takes more time, as A usually happens very fast. So it is recommended to configure the pod to do 2 things when SIGTERM is received:
1. Keep serving requests for the duration of the graceful-termination period instead of shutting down immediately.
2. Send Connection: close on responses so clients reconnect to a backend that is still in the NEG.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten @freehan is this still relevant?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another alternative is to add minReadySeconds to deployments. This would force the Deployment controller to slow down the rollout and leave enough time for NEG programming. For example:
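A minimal sketch (the 60s value is only illustrative and should be comfortably larger than the observed NEG programming time; the name and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 2
  minReadySeconds: 60           # a new pod must stay Ready this long before the rollout moves on
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-app
        image: my-app:latest    # placeholder image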
From #583:
For the NEG Programming Latency issue, even with minReadySeconds and terminationGracePeriodSeconds set to 180, I'm still seeing a small amount of 502s during rollouts. Is this expected? My test was during a rollout of 11 pods, sending 100 requests per second to them overall. I saw 96 502s over the course of the rollout.

I'd like to understand the cause of this issue. My current guess is that Kubernetes terminates the pod and stops routing traffic to it, but the NEG is updated after the pod is terminated. If so, could we perhaps solve this with a pre-stop hook that detaches the pod from the NEG before terminating?