
Small amount of 502s during rollouts with NEG even with "mitigations" #769

Closed
autrilla opened this issue Jun 13, 2019 · 45 comments
Labels:
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@autrilla

From #583:

For the NEG Programming Latency issue, even with minReadySeconds and terminationGracePeriodSeconds set to 180, I'm still seeing a small amount of 502s during rollouts. Is this expected? My test was during a rollout of 11 pods, sending 100 requests per second to them overall. I saw 96 502s over the course of the rollout.

I'd like to understand the cause of this issue. My current guess is that Kubernetes terminates the pod and stops routing traffic to it, but the NEG is updated after the pod is terminated. If so, could we perhaps solve this with a pre-stop hook that detaches the pod from the NEG before terminating?

@rramkumar1
Contributor

@freehan Can you take a look?

@freehan
Contributor

freehan commented Jun 13, 2019

@autrilla
Have you set up graceful termination on the backend pods? The LB will still send traffic for a short amount of time after the pod becomes NotReady. And when a pod gets deleted, it receives SIGTERM; the container needs to handle SIGTERM and shut down gracefully while still accepting traffic for a while.
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
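
For reference, a minimal sketch of this pattern in Go, assuming a 60-second drain window (an assumption; tune it to the observed NEG programming latency): keep serving after SIGTERM so the LB has time to remove the endpoint, then shut down.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	go srv.ListenAndServe()

	// Wait for SIGTERM from the kubelet.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Keep accepting traffic while the LB deprograms this endpoint.
	// 60s is an assumed drain window.
	time.Sleep(60 * time.Second)

	// Finish in-flight requests, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}

With this pattern, terminationGracePeriodSeconds has to be larger than the sleep plus the shutdown timeout, otherwise the kubelet sends SIGKILL before the drain finishes.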

@freehan
Contributor

freehan commented Jun 13, 2019

Also, what is the GKE master version? There have been multiple improvements in the programming pipeline.

@autrilla
Author

I forget. I think I ran into this on GKE 1.11, but I'm not sure. I'll see if I can get some time and reproduce it again and give you better data if the issue is still there.

@RafiGreenberg

I'm seeing this as well. 502s after every rolling deployment. This doesn't seem production ready?

@freehan
Contributor

freehan commented Jul 17, 2019

@mackoftrack wondering if you have set up graceful termination like #769 (comment)

@RafiGreenberg

RafiGreenberg commented Jul 17, 2019

@freehan Yes I have. Same issue. GKE master version 1.12.7-gke.25

@freehan
Contributor

freehan commented Jul 17, 2019

@mackoftrack could you provide more detail regarding your setup? Anonymized yaml if possible?

@RafiGreenberg

@freehan So you are saying you are not able to reproduce?

Basically it's just a typical Deployment, with:

terminationGracePeriodSeconds: 60
minReadySeconds: 60

With a RollingUpdate deployment strategy

A typical Ingress object, and a Service object with:

metadata:
  annotations:
    cloud.google.com/neg: '{"ingress": true}'

@autrilla
Author

autrilla commented Jul 19, 2019

@mackoftrack wondering if you have set up graceful termination like #769 (comment)

This was probably it. I used a Python application with aiohttp as an HTTP server. I don't think it handled SIGTERM this way.

@RafiGreenberg

@freehan one difference I'm seeing is that my Service is type: NodePort and yours is type: ClusterIP

Could that be the issue? I followed these tutorials: https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer and https://cloud.google.com/kubernetes-engine/docs/concepts/ingress

@freehan
Contributor

freehan commented Jul 19, 2019

@mackoftrack That should be irrelevant.

@RafiGreenberg

RafiGreenberg commented Jul 22, 2019

@freehan I couldn't reproduce the issue using your test yamls. I'm wondering if I'm hitting this: https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption

I haven't tested with the test yamls in our production cluster, but I will try that today.

As our production cluster is a regional GKE cluster, and some of these deployments have only a single replica, there are 0 network endpoints in 2 out of 3 zones. Could that be the issue?

@freehan
Contributor

freehan commented Jul 22, 2019

@mackoftrack Based on my testing, the bug has been fixed. You can try it as well.

  1. Scale backend to 0
  2. Send traffic, observe 502s and wait for a few minutes
  3. Scale backend to 1
  4. Send traffic and observe how long it takes for the LB to respond with 200.

Step 4 should take only a few seconds, not minutes.
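
For step 4, one rough way to measure this is a small poller that reports how long it takes until the first 200; the URL below is a placeholder for the Ingress IP.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const url = "http://<lb-ip>/" // placeholder: replace with the Ingress IP
	start := time.Now()
	for {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				fmt.Printf("first 200 after %v\n", time.Since(start))
				return
			}
		}
		time.Sleep(1 * time.Second)
	}
}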

@RafiGreenberg

@freehan I believe this is in fact the issue. I was able to reproduce it using your test yamls on our regional GKE cluster in the us-west1 region. This may be a new bug.

  1. Deploy your test yamls on a GKE regional cluster (in us-west1 this results in 3 NEGs, one in each of the zones us-west1-a, us-west1-b, us-west1-c)
  2. Do a rolling update of the deployment
  3. Observe 502s during the deployment (there should be no 502s during a rolling update)

@freehan
Contributor

freehan commented Jul 22, 2019

@mackoftrack how many backends do you have?

@RafiGreenberg

I'm using your test yamls, which create a single backend

@freehan
Contributor

freehan commented Jul 22, 2019

FYI, the rolling update test itself scales the deployment up to multiple replicas.

If there is only 1 replica, then it is basically a race between removing the old endpoint and adding the new one. If you want to avoid 502s when rolling-updating a single replica, you need to tune the rolling update strategy so that maxUnavailable is 0:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: hostname
  name: hostname
spec:
  minReadySeconds: 60
  selector:
    matchLabels:
      run: hostname
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        run: hostname
    spec:
      containers:
      - image: gcr.io/kubernetes-e2e-test-images/serve-hostname:1.1
        imagePullPolicy: IfNotPresent
        name: hostname
      terminationGracePeriodSeconds: 120

@RafiGreenberg

Sorry, you're right. After setting

    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

I no longer see the 502s.

@philpearl

philpearl commented Jul 25, 2019

I'm still seeing problems with this in GKE (1.13.7-gke.8, regional deployment). I have terminationGracePeriodSeconds: 180 and minReadySeconds: 180. My deployment has 2 replicas. I see 502s if I redeploy or do a rolling restart. The problem seems to happen at the moment when a pod is stopped.

I see a log indicating the endpoint has been removed from the endpoint group, and at around the same instant my pod receives a SIGTERM. 10s later the associated pod is deleted and I see a group of 502s. The logs for these indicate they're targeted at the endpoint that was removed from the endpoint group 10s previously. The error is "failed to connect to backend"

I presumed that the point of the large terminationGracePeriodSeconds and minReadySeconds was to give the LB time to react to changes in the endpoint group. Should I be seeing a larger gap between the endpoint group change and pod termination? Is the pod sent a SIGTERM before the endpoint is removed from the LB? https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace indicates my app should shut down when it receives a SIGTERM, which I take to mean stop accepting new work and terminate when existing work is complete. I don't see how increasing terminationGracePeriodSeconds will help unless my app essentially ignores SIGTERM.

I finally found the source code for the test app from the tests mentioned above. It does nothing on receipt of SIGTERM except sleep for 60s: https://github.com/kubernetes/kubernetes/blob/master/test/images/agnhost/serve-hostname/serve_hostname.go. Previous versions used to panic immediately, as far as I can tell. The file has moved around a lot, so it's difficult to find any history or reasons for the change.

Anyway, it looks like either the test is invalid as the tested app doesn't do the advised termination, or the advice is wrong. Any thoughts?

@philpearl

Here's a simple nginx-based repro for the problem. With this config, I run ab -n 100000 -c 10 http://<ipaddr>/ and kubectl rollout restart deploy/nginx, then I get some 500s. If I replace the lifecycle preStop with a 60s sleep it doesn't 500 (at least not in the few times I've tried it).

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  minReadySeconds: 60
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: nginx
          image: nginx:1.7.9
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command:
                  - /usr/sbin/nginx
                  - -s
                  - quit
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
  labels:
    app: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
      targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx
spec:
  backend:
    serviceName: nginx
    servicePort: 80

@RafiGreenberg

Yeah, now that @philpearl mentioned running on a regional GKE cluster, I tend to think that this problem is specific to regional clusters, as that is where I've been seeing these issues as well.

@philpearl

@mackoftrack not so sure - I did that repro above in a single-zone cluster with a single machine

@freehan
Contributor

freehan commented Jul 25, 2019

I should make it a bit more clear. When a pod is deleted (i.e. its deletion timestamp is set), these things happen in parallel:

  1. kubelet sends SIGTERM to the pod's containers
  2. the endpoint controller updates the Endpoints for the corresponding Services (if the pod is selected by a Service)

After the Endpoints resource is updated (the pod's endpoint is removed from the ready addresses), service programming starts:

  3. kube-proxy programs iptables to remove the pod from the Service backends
  4. other LB providers react and remove the pod from the LB backends

There is a small time gap between step 1 and steps 3 and 4, but programming iptables is generally faster than programming LBs, so the gap is more visible with LBs.

To avoid service disruption during pod deletion, the key is to keep serving requests during graceful termination. That leaves enough time for the LB or iptables to get fully programmed.

Ideally, K8s should remove the pod from the service backends first and then send SIGTERM to the containers (assuming most containers do not handle SIGTERM properly). But currently there is no such mechanism in K8s, and the service lifecycle is loosely coupled with the pod lifecycle. Hence, it is recommended to have proper SIGTERM handling in pods.

@freehan
Contributor

freehan commented Jul 25, 2019

@philpearl
I am not sure what /usr/sbin/nginx -s quit does exactly. I would imagine it stops responding to new requests?

@philpearl

philpearl commented Jul 25, 2019 via email

@aledbf
Member

aledbf commented Jul 25, 2019

@philpearl something like kubernetes/ingress-nginx#322 (comment)
As @freehan mentioned, this is not an issue with ingress-gce itself.

@bowei
Member

bowei commented Jul 25, 2019

For a generic HTTP server, yes, it means keep serving existing requests, but it should send Connection: close on responses so clients have to reconnect.
Servers that have "lame duck" state handling and can actively reject existing client sessions would behave differently.
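
With Go's net/http, for example, one way to get that behavior (a sketch, not necessarily what any service in this thread does) is to disable keep-alives once SIGTERM arrives, so subsequent responses carry Connection: close while the server keeps answering requests.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	go srv.ListenAndServe()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Keep serving, but tell clients to reconnect elsewhere:
	// responses now include "Connection: close".
	srv.SetKeepAlivesEnabled(false)

	// Assumed drain window before shutting down for good.
	time.Sleep(60 * time.Second)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}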

@philpearl

philpearl commented Jul 26, 2019

OK, this all makes sense, but the documentation is actively confusing.

For a basic web service the advice seems to be as follows

  • handle SIGTERM - you certainly shouldn't exit as soon as you receive it. You probably can't quite ignore it.
  • don't do a "graceful shutdown" in the traditional sense. Don't stop serving new requests. Don't call Shutdown() on a Go http server.
  • start sending "Connection": "close" headers in HTTP responses (and your HTTP server library should drop connections after sending responses with these headers)
  • try not to feel bad about the inevitable windows where someone might re-use a pre-existing but idle http connection just as you finally die.

If you're using something off-the-shelf like nginx you're probably out of luck. Best to sleep in the pre-stop hook to avoid SIGTERM reaching the server, and just hope there are no issues with long-standing HTTP connections.

https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace is at least strongly hinting that you should do a "graceful shutdown" - and that's certainly not what you should do.

@philpearl

@mackoftrack I don't think anyone is denying this is all a bit hacky. It seems like k8s is missing the synchronisation mechanisms needed to make this clean. I'm grumbling because the docs aren't clear - probably because clear documentation would make the hackiness very apparent! old man shakes fist at cloud

@axot

axot commented Sep 3, 2019

@bowei, the idea of sending Connection: close when a pod receives SIGTERM is interesting, since it could tear down HTTP keep-alive connections. How do you implement this? It can also be achieved at the application layer, but if it could be done in middleware like nginx, the solution would be more general.

There may be some edge cases: if the client does not send any requests during the terminationGracePeriodSeconds period, the keep-alive connection may still be open afterwards. Setting a timeout in the Keep-Alive header would help with this case.
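
A related server-side hedge for that edge case (not the Keep-Alive response header itself, which nginx or the application would have to emit) is to cap how long idle keep-alive connections may live at all; in Go, for example, http.Server exposes an IdleTimeout for this, with the value below being purely illustrative.

package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		// Close keep-alive connections that sit idle this long, so a
		// client cannot reuse a stale connection to a pod that has
		// since been terminated. The value is illustrative.
		IdleTimeout: 30 * time.Second,
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv.ListenAndServe()
}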

@axot

axot commented Sep 3, 2019

@freehan Could you please explain more about how the HTTP LB (NEG) takes a pod out of service when SIGTERM is sent to the pod?

I did some tests: I tailed nginx's access log and then deleted that pod, and requests kept arriving for almost 20 seconds. I think it is not related to the readiness config of the pod, which is also applied to the HLB's health check, right? I tested this with both HTTP keep-alive on and off.

@freehan
Contributor

freehan commented Sep 20, 2019

So when the pod gets deleted, 2 things happen in parallel:

A. Kubelet sends SIGTERM to the containers.
B. The endpoint gets deprogrammed from the LB.

B takes more time, as A usually happens very fast. So it is recommended to configure the pod to do 2 things when SIGTERM is received:

  1. Fail the LB health check. (This tells the LB to stop sending traffic whether or not B has completed.)
  2. Keep responding to traffic.
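
A minimal sketch of those two points in Go, assuming the LB health check hits a /healthz path (the path and the 60-second drain window are assumptions): on SIGTERM the health endpoint starts returning 503 while the main handler keeps serving, and the process only exits after the drain window.

package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var terminating atomic.Bool

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// 1. Fail the health check so the LB stops sending new traffic.
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// 2. Keep responding to traffic throughout termination.
		w.Write([]byte("hello"))
	})

	go http.ListenAndServe(":8080", nil)

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	terminating.Store(true)
	// Assumed drain window; terminationGracePeriodSeconds must exceed it.
	time.Sleep(60 * time.Second)
}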

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2019
@bowei
Member

bowei commented Jan 9, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 8, 2020
@bowei
Member

bowei commented May 9, 2020

/remove-lifecycle rotten

@freehan is this still relevant?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 9, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@freehan
Contributor

freehan commented Mar 2, 2022

Another alternative is to set minReadySeconds on the Deployment. This forces the deployment controller to slow down the rollout and leaves enough time for NEG programming. For example:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: foo-backend
  name: foo-backend
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 100%
  minReadySeconds: 60
  replicas: 1
  selector:
    matchLabels:
      run: foo-backend
  template:
    metadata:
      labels:
        run: foo-backend
    spec:
      containers:
      - image: gcr.io/google_containers/serve-hostname-amd64:1.5
        imagePullPolicy: IfNotPresent
        name: hostname
