
Small amount of 502s during rollouts with NEG even with "mitigations" #769

Closed
autrilla opened this issue Jun 13, 2019 · 45 comments
Labels:
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@autrilla

From #583:

For the NEG Programming Latency issue, even with minReadySeconds and terminationGracePeriodSeconds set to 180, I'm still seeing a small amount of 502s during rollouts. Is this expected? My test was during a rollout of 11 pods, sending 100 requests per second to them overall. I saw 96 502s over the course of the rollout.

I'd like to understand the cause of this issue. My current guess is that Kubernetes terminates the pod and stops routing traffic to it, but the NEG is updated after the pod is terminated. If so, could we perhaps solve this with a pre-stop hook that detaches the pod from the NEG before terminating?

@rramkumar1
Contributor

@freehan Can you take a look?

@freehan
Contributor

freehan commented Jun 13, 2019

@autrilla
Have you set up graceful termination on the backend pods? The LB will still send traffic for a short amount of time after the pod becomes NotReady. And when a pod gets deleted, it receives SIGTERM; the container needs to handle SIGTERM and shut down gracefully while still accepting traffic for a while.
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
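
For reference, a minimal sketch of this pattern in Go, assuming a 60-second drain window (an assumption; tune it to the observed NEG programming latency): keep serving after SIGTERM so the LB has time to remove the endpoint, then shut down.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	go srv.ListenAndServe()

	// Wait for SIGTERM from the kubelet.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Keep accepting traffic while the LB deprograms this endpoint.
	// 60s is an assumed drain window.
	time.Sleep(60 * time.Second)

	// Finish in-flight requests, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}

With this pattern, terminationGracePeriodSeconds has to be larger than the sleep plus the shutdown timeout, otherwise the kubelet sends SIGKILL before the drain finishes.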

@freehan
Contributor

freehan commented Jun 13, 2019

Also, what is the GKE master version? There have been multiple improvements in the programming pipeline.

@autrilla
Author

I forget. I think I ran into this on GKE 1.11, but I'm not sure. I'll see if I can get some time and reproduce it again and give you better data if the issue is still there.

@RafiGreenberg

I'm seeing this as well. 502s after every rolling deployment. This doesn't seem production ready?

@freehan
Contributor

freehan commented Jul 17, 2019

@mackoftrack wondering if you have set up graceful termination like #769 (comment)

@RafiGreenberg

RafiGreenberg commented Jul 17, 2019

@freehan Yes I have. Same issue. GKE master version 1.12.7-gke.25

@freehan
Contributor

freehan commented Jul 17, 2019

@mackoftrack could you provide more detail regarding your setup? Anonymized yaml if possible?

@RafiGreenberg

@freehan So you are saying you are not able to reproduce?

Basically it's just a typical Deployment, with:

terminationGracePeriodSeconds: 60
minReadySeconds: 60

With a RollingUpdate deployment strategy

A typical Ingress object, and a Service object with:

metadata:
  annotations:
    cloud.google.com/neg: '{"ingress": true}'

@autrilla
Author

autrilla commented Jul 19, 2019

@mackoftrack wondering if you have set up graceful termination like #769 (comment)

This was probably it. I used a Python application with aiohttp as an HTTP server. I don't think it handled SIGTERM this way.

@RafiGreenberg

@freehan one difference I'm seeing is that my Service is type: NodePort and yours is type: ClusterIP

Could that be the issue? I followed these tutorials: https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer and https://cloud.google.com/kubernetes-engine/docs/concepts/ingress

@freehan
Contributor

freehan commented Jul 19, 2019

@mackoftrack That should be irrelevant.

@RafiGreenberg

RafiGreenberg commented Jul 22, 2019

@freehan I couldn't reproduce the issue using your test yamls. I'm wondering if I'm hitting this: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption

I haven't tested with the test yamls in our production cluster, but I will try that today.

As our production cluster is a regional GKE cluster, and some of these deployments have only a single replica, there are 0 network endpoints in 2 out of 3 zones. Could that be the issue?

@freehan
Contributor

freehan commented Jul 22, 2019

@mackoftrack Based on my testing, the bug has been fixed. You can try it as well.

  1. Scale backend to 0
  2. Send traffic, observe 502s and wait for a few minutes
  3. Scale backend to 1
  4. Send traffic and observe how long it takes for the LB to respond with 200.

Step 4 should take only a few seconds, not minutes.
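
For step 4, one rough way to measure this is a small poller that reports how long it takes until the first 200; the URL below is a placeholder for the Ingress IP.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const url = "http://<lb-ip>/" // placeholder: replace with the Ingress IP
	start := time.Now()
	for {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				fmt.Printf("first 200 after %v\n", time.Since(start))
				return
			}
		}
		time.Sleep(1 * time.Second)
	}
}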

@RafiGreenberg

@freehan I believe this is in fact the issue. I was able to reproduce it using your test yamls on our regional GKE cluster in the us-west1 region. This may be a new bug.

  1. Deploy your test yamls on a GKE regional cluster (in us-west1 this results in 3 NEGs, one in each of the zones us-west1-a, us-west1-b, us-west1-c)
  2. Do a rolling update of the deployment
  3. Observe 502s during the deployment (there should be no 502s during a rolling update)

@freehan
Contributor

freehan commented Jul 22, 2019

@mackoftrack how many backends do you have?

@RafiGreenberg

I'm using your test yamls, which create a single backend

@freehan
Contributor

freehan commented Jul 22, 2019

FYI, the rolling update test itself scales the deployment up to multiple replicas.

If there is only 1 replica, then it is basically a race between removing the old endpoint and adding the new one. If you want to avoid 502s when rolling-updating a single replica, you need to tune the rolling update strategy so that maxUnavailable is 0:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: hostname
  name: hostname
spec:
  minReadySeconds: 60
  selector:
    matchLabels:
      run: hostname
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        run: hostname
    spec:
      containers:
      - image: gcr.io/kubernetes-e2e-test-images/serve-hostname:1.1
        imagePullPolicy: IfNotPresent
        name: hostname
      terminationGracePeriodSeconds: 120

@RafiGreenberg

Sorry, you're right. After setting

    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

I no longer see the 502s.

@philpearl

philpearl commented Jul 25, 2019

I'm still seeing problems with this in GKE (1.13.7-gke.8, regional deployment). I have terminationGracePeriodSeconds: 180 and minReadySeconds: 180. My deployment has 2 replicas. I see 502s if I redeploy or do a rolling restart. The problem seems to happen at the moment when a pod is stopped.

I see a log indicating the endpoint has been removed from the endpoint group, and at around the same instant my pod receives a SIGTERM. 10s later the associated pod is deleted and I see a group of 502s. The logs for these indicate they're targeted at the endpoint that was removed from the endpoint group 10s previously. The error is "failed to connect to backend"

I presumed that the point of the large terminationGracePeriodSeconds and minReadySeconds was to give the LB time to react to changes in the endpoint group. Should I be seeing a larger gap between the endpoint group change and pod termination? Is the pod sent a SIGTERM before the endpoint is removed from the LB? https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace indicates my app should shut down when it receives a SIGTERM, which I take to mean stop accepting new work and terminate when existing work is complete. I don't see how increasing terminationGracePeriodSeconds will help unless my app essentially ignores SIGTERM.

I finally found the source code for the test app from the tests mentioned above. It does nothing on receipt of SIGTERM except sleep for 60s: https://github.com/kubernetes/kubernetes/blob/master/test/images/agnhost/serve-hostname/serve_hostname.go. Previous versions used to panic immediately, as far as I can tell. The file has moved around a lot, so it's difficult to find any history or reasons for the change.

Anyway, it looks like either the test is invalid as the tested app doesn't do the advised termination, or the advice is wrong. Any thoughts?

@philpearl

Here's a simple nginx-based repro for the problem. With this config, I run ab -n 100000 -c 10 http://<ipaddr>/ and kubectl rollout restart deploy/nginx, then I get some 500s. If I replace the lifecycle preStop with a 60s sleep it doesn't 500 (at least not in the few times I've tried it).

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  minReadySeconds: 60
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: nginx
          image: nginx:1.7.9
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command:
                  - /usr/sbin/nginx
                  - -s
                  - quit
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
  labels:
    app: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
      targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx
spec:
  backend:
    serviceName: nginx
    servicePort: 80

@RafiGreenberg

Yeah, now that @philpearl mentioned running on a regional GKE cluster, I tend to think that this problem is specific to regional clusters, as that is where I've been seeing these issues as well.

@philpearl

@mackoftrack not so sure - I did that repro above in a single-zone cluster with a single machine

@freehan
Contributor

freehan commented Jul 25, 2019

I should make it a bit more clear. When a pod is deleted (i.e. its deletion timestamp is set), these things happen in parallel:

  1. kubelet sends SIGTERM to the pod's containers
  2. the endpoint controller updates the Endpoints for the corresponding Services (if the pod is selected by a Service)

After the Endpoints resource is updated (the pod's endpoint is removed from the ready addresses), service programming starts:

  3. kube-proxy programs iptables to remove the pod from the Service backends
  4. other LB providers react and remove the pod from the LB backends

There is a small time gap between step 1 and steps 3 and 4, but programming iptables is generally faster than programming LBs, so the gap is more visible with LBs.

To avoid service disruption during pod deletion, the key is to keep serving requests during graceful termination. That leaves enough time for the LB or iptables to get fully programmed.

Ideally, K8s should remove the pod from the service backends first and then send SIGTERM to the containers (assuming most containers do not handle SIGTERM properly). But currently there is no such mechanism in K8s, and the service lifecycle is loosely coupled with the pod lifecycle. Hence, it is recommended to have proper SIGTERM handling in pods.

@freehan
Contributor

freehan commented Jul 25, 2019

@philpearl
I am not sure what /usr/sbin/nginx -s quit does exactly. I would imagine it stops responding to new requests?

@philpearl

philpearl commented Jul 25, 2019 via email

@aledbf
Member

aledbf commented Jul 25, 2019

@philpearl something like kubernetes/ingress-nginx#322 (comment)
As @freehan mentioned, this is not an issue with ingress-gce itself.

@bowei
Member

bowei commented Jul 25, 2019

For a generic HTTP server, yes, it means keep serving existing requests, but it should send Connection: close on responses so clients have to reconnect.
Servers that have "lame duck" state handling and can actively reject existing client sessions would behave differently.
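
With Go's net/http, for example, one way to get that behavior (a sketch, not necessarily what any service in this thread does) is to disable keep-alives once SIGTERM arrives, so subsequent responses carry Connection: close while the server keeps answering requests.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	go srv.ListenAndServe()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Keep serving, but tell clients to reconnect elsewhere:
	// responses now include "Connection: close".
	srv.SetKeepAlivesEnabled(false)

	// Assumed drain window before shutting down for good.
	time.Sleep(60 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}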

@philpearl

philpearl commented Jul 26, 2019

OK, this all makes sense, but the documentation is actively confusing.

For a basic web service the advice seems to be as follows

  • handle SIGTERM - you certainly shouldn't exit as soon as you receive it. You probably can't quite ignore it.
  • don't do a "graceful shutdown" in the traditional sense. Don't stop serving new requests. Don't call Shutdown() on a Go http server.
  • start sending "Connection": "close" headers in HTTP responses (and your HTTP server library should drop connections after sending responses with these headers)
  • try not to feel bad about the inevitable windows where someone might re-use a pre-existing but idle http connection just as you finally die.

If you're using something off-the-shelf like nginx you're probably out of luck. Best to sleep in the pre-stop hook to avoid SIGTERM reaching the server, and just hope there are no issues with long-standing HTTP connections.

https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace is at least strongly hinting that you should do a "graceful shutdown" - and that's certainly not what you should do.

@philpearl

@mackoftrack I don't think anyone is denying this is all a bit hacky. It seems like k8s is missing the synchronisation mechanisms needed to make this clean. I'm grumbling because the docs aren't clear - probably because clear documentation would make the hackiness very apparent! old man shakes fist at cloud

@axot

axot commented Sep 3, 2019

@bowei, the idea of sending Connection: close when a pod receives SIGTERM is interesting, since it could tear down HTTP keep-alive connections. How do you implement this? It can also be achieved at the application layer, but if it could be done in middleware like nginx, the solution would be more general.

There may be some edge cases: if the client does not send any requests during the terminationGracePeriodSeconds period, the keep-alive connection may still be open afterwards. Setting a timeout in the Keep-Alive header would help with this case.
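
A related server-side hedge for that edge case (not the Keep-Alive response header itself, which nginx or the application would have to emit) is to cap how long idle keep-alive connections may live at all; in Go, for example, http.Server exposes an IdleTimeout for this, with the value below being purely illustrative.

package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		// Close keep-alive connections that sit idle this long, so a
		// client cannot reuse a stale connection to a pod that has
		// since been terminated. The value is illustrative.
		IdleTimeout: 30 * time.Second,
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv.ListenAndServe()
}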

@axot

axot commented Sep 3, 2019

@freehan Could you please explain more about how the HTTP LB (NEG) takes a pod out of service when SIGTERM is sent to the pod?

I did some tests: I tailed nginx's access log and then deleted that pod, and requests kept arriving for almost 20 seconds. I think it is not related to the readiness config of the pod, which is also applied to the HLB's health check, right? I tested this with both HTTP keep-alive on and off.

@freehan
Contributor

freehan commented Sep 20, 2019

So when the pod gets deleted, 2 things happen in parallel:

A. Kubelet sends SIGTERM to the containers.
B. The endpoint gets deprogrammed from the LB.

B takes more time, as A usually happens very fast. So it is recommended to configure the pod to do 2 things when SIGTERM is received:

  1. Fail the LB health check. (This tells the LB to stop sending traffic whether or not B has completed.)
  2. Keep responding to traffic.
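
A minimal sketch of those two points in Go, assuming the LB health check hits a /healthz path (the path and the 60-second drain window are assumptions): on SIGTERM the health endpoint starts returning 503 while the main handler keeps serving, and the process only exits after the drain window.

package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var terminating atomic.Bool

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// 1. Fail the health check so the LB stops sending new traffic.
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// 2. Keep responding to traffic throughout termination.
		w.Write([]byte("hello"))
	})

	go http.ListenAndServe(":8080", nil)

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	terminating.Store(true)
	// Assumed drain window; terminationGracePeriodSeconds must exceed it.
	time.Sleep(60 * time.Second)
}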

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2019
@bowei
Member

bowei commented Jan 9, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 8, 2020
@bowei
Member

bowei commented May 9, 2020

/remove-lifecycle rotten

@freehan is this still relevant?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 9, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@freehan
Contributor

freehan commented Mar 2, 2022

Another alternative is to set minReadySeconds on the Deployment. This forces the deployment controller to slow down the rollout and leaves enough time for NEG programming. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: foo-backend
  name: foo-backend
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 100%
  minReadySeconds: 60
  replicas: 1
  selector:
    matchLabels:
      run: foo-backend
  template:
    metadata:
      labels:
        run: foo-backend
    spec:
      containers:
      - image: gcr.io/google_containers/serve-hostname-amd64:1.5
        imagePullPolicy: IfNotPresent
        name: hostname
