
Experiencing downtime when updating hosts backend in ingress controller #116

Closed
znorris opened this issue Jan 26, 2018 · 12 comments

@znorris

znorris commented Jan 26, 2018

Issue

Why would I experience downtime when I update more than one backend service at a time, but not when I update a single backend? (This may not be the correct question or issue summary, but at the moment I'm not clear on why this is happening. It could have to do with the old backend being completely dereferenced from my ingress config.)

Reproduce

I've created a repo that includes pretty much everything one would need to reproduce this issue. However, I think the example below illustrates it well enough that you won't need the repo.
https://github.com/znorris/gce_ingress_troubleshoot

Example

Initially, I had a single app and service handling requests for several hosts.

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-z         NodePort    10.0.0.1        <none>        80:30001/TCP     1d

I then added a new app/deployment and service for each of the three hosts.

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-a         NodePort    10.0.0.1        <none>        80:30002/TCP     1d
app-b         NodePort    10.0.0.1        <none>        80:30003/TCP     1d
app-c         NodePort    10.0.0.1        <none>        80:30004/TCP     1d
app-z         NodePort    10.0.0.1        <none>        80:30001/TCP     1d

I verified that the apps were responding to health checks (HCs) and that the NodePorts were working.
Once that was complete, I updated a single host's backend in the ingress (a.host).

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-a
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80

I waited for the load balancer to update and for the new service/app to start responding to this traffic. I also verified in the cloud console that the health check associated with this new backend was passing. This is how I would expect everything to work, and there was zero downtime.

I then decided that I had 5 more hosts to update on the ingress controller, and that it would be faster to update the remaining hosts all at once.

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-a
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-b
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-c
            servicePort: 80

Once that was done, I began to see failing requests (HTTP 502 from the load balancer) for roughly 5 minutes for the hosts I had changed in bulk. After those 5 minutes, requests were OK. During that time the load balancer was logging these 502s:

jsonPayload: {
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_connect_to_backend"
}

I then checked that the appropriate service/app was responding to requests for all hosts in the ingress. Everything looked good. In the cloud console, I verified that the app-z backend was no longer present and that its health checks had been cleaned up as well. They had been, so I removed the old service and deployment/app.

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-a         NodePort    10.0.0.1        <none>        80:30002/TCP     1d
app-b         NodePort    10.0.0.1        <none>        80:30003/TCP     1d
app-c         NodePort    10.0.0.1        <none>        80:30004/TCP     1d

I'm now in my desired state and everything is working as expected.

@znorris changed the title from "Experiencing downtime when updating updating backend" to "Experiencing downtime when updating backend" on Jan 26, 2018
@znorris changed the title from "Experiencing downtime when updating backend" to "Experiencing downtime when updating hosts backend in ingress controller" on Jan 26, 2018
@nicksardo
Contributor

Let me get this straight: you're editing an ingress spec to point to multiple services which weren't previously used for ingress? I wouldn't be surprised to see 502s. It could be that the GCE health checks haven't passed yet and the load balancer is failing closed. It could also be that the firewall rule change hasn't propagated yet (most likely the case here).

I would be irritated if an existing host/backend experienced 502s when a new backend is added; however, I don't see a small startup blip as an issue.

@znorris
Author

znorris commented Jan 30, 2018

@nicksardo No, I don't believe your summary of the issue is correct. When I update the backend of a single existing host rule (a.host in the example above) to a new backend, I do not get any downtime. If it were about pointing to a new service that wasn't previously used for ingress, as you suggest, I would have seen downtime for that update as well.

I don't know the inner workings of the ingress controller, so I can only speculate as to why this is happening. It only appears to be an issue when I'm completely removing the reference to the old backend.

@nicksardo
Contributor

On the contrary, I do get 502s when updating a single backend to a new service, even when the previous backend still exists for another ingress. I can observe them with a simple curl loop: while true; do echo $(curl -I http://xxx.xxx.xxx.xxx/ 2> /dev/null | head -n 1 | cut -d ' ' -f 2); done. When I update an ingress so that the old backend is discarded entirely, I also see quite a few 502s.

In any case, minimizing 502s while changing targeted services is not a priority right now.

@znorris
Author

znorris commented Jan 31, 2018

@nicksardo You're correct. I wasn't testing at a high enough rate. Updating an existing rule to utilize a new backend will result in downtime.

@rramkumar1
Contributor

Closing this for now. @znorris reopen if you are still having issues.
/close

@infernix

infernix commented Dec 7, 2018

In any case, minimizing 502s while changing targeted services is not a priority right now.

I'd like to reopen this. What I'm seeing is similar. We're basically updating a Deployment to a newer version. We've already set things like minReadySeconds: 60, maxSurge: 6, maxUnavailable: 0, and adjusted the health checks to period=5, timeout=4, success=1, failure=2, initialDelay=60.
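Roughly, that corresponds to a Deployment fragment like the one below. This is only a sketch: the names, image, and probe path are placeholders, and it assumes the health check values above refer to the pod's readinessProbe rather than the GCE health check.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 6
  minReadySeconds: 60
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2       # placeholder image
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz     # placeholder path
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 5
          timeoutSeconds: 4
          successThreshold: 1
          failureThreshold: 2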

So when we push the exact same Deployment with a newer container image version, I get no 502s during the startup of the 6 new pods (12 total at that point with replicas=6), but as soon as the old containers are being torn down, 502s start to happen. This lasts for the entire teardown of each old pod, within the failure*period window. So if the health checks are conservative (say period=60, timeout=10, success=1, failure=5), this takes quite a long time.
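With those conservative numbers, failure * period = 5 * 60 = 300 seconds, i.e. roughly five minutes of potential 502s per old pod.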

Given that there's really nothing more we can tune on our side, and this kind of defeats the purpose of a rolling update, I'd like to see what can be done to address it.

@lucasfais

@infernix Are you using nginx within each pod? If so, this might help: https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
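In case it is useful to others: the general technique that post describes is delaying container shutdown with a preStop hook, so the load balancer stops routing to the pod before it exits. A minimal sketch of a Deployment pod template fragment follows; the container name, image, grace period, and sleep duration are placeholders, not values taken from the post.

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # placeholder; must exceed the preStop delay
      containers:
      - name: my-app                      # placeholder
        image: my-app:v2                  # placeholder
        lifecycle:
          preStop:
            exec:
              # Keep serving while endpoint removal and load balancer
              # reprogramming propagate, instead of exiting on SIGTERM immediately.
              command: ["sh", "-c", "sleep 30"]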

@samwightt

This should be reopened; I'm experiencing the exact same behavior that @infernix describes, almost two years later. Very disappointing that something as serious as downtime hasn't been resolved.

@AkselAllas

This should definitely be reopened. We are using GKE Autopilot and hit the exact same scenario @infernix described: updating a Deployment to a newer version.

@AkselAllas

AkselAllas commented Mar 8, 2022

I managed to solve the exact downtime described by @infernix by adding the following to my deployment:

spec:
  minReadySeconds: 45

This is a new feature of k8s 1.22, and it is useful for situations where cloud load balancers need extra time to set up after the k8s endpoints are ready (which is the case here).

EDIT: minReadySeconds seems to work for me only with an nginx readiness check, but not with a Java Spring Actuator readiness check. 🤔

This is definitely an issue that needs to be fixed, so this issue should be reopened, or at least a new one created. @swetharepakula @freehan

It seems that my pods without 502s have the following in kubectl describe pod:

  Normal  LoadBalancerNegReady     42m   neg-readiness-reflector                
Pod has become Healthy in NEG "Key{\"k8s1-eqa3dbe-default-frontend-80\", zone: \"europe-west1-a\"}" 
attached to BackendService "Key{\"k8s1-eqa3dbe-default-frontend-80\"}". 
Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

And pods with 502s had:

  Normal   LoadBalancerNegWithoutHealthCheck  3m26s                   neg-readiness-reflector                
  Pod is in NEG "Key{\"k8s1-eqa3dbe-default-frontend-80\", zone: \"europe-west1-a\"}". 
NEG is not attached to any BackendService with health checking. 
Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

It seems that, to get 502s, all pods must have LoadBalancerNegWithoutHealthCheck.

It seems to me that LoadBalancerNegWithoutHealthCheck gets set by the neg-readiness-reflector a non-uniform time after the pod starts, and it doesn't account for the time it takes for the load balancer configuration to propagate. As such, the state described in the pod descriptions can be outdated.
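The same condition can also be read directly from the pod status, e.g. via kubectl get pod <name> -o yaml. An illustrative excerpt; only the condition type is taken from the events above, the other fields and values will differ:

status:
  conditions:
  # Readiness-gate condition set by the neg-readiness-reflector.
  - type: cloud.google.com/load-balancer-neg-ready
    status: "True"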

@samwightt @renescheepers @znorris @alex-v-mihai @javihgil @Rizal-I @Turee

@AkselAllas

AkselAllas commented Mar 9, 2022

The downtime also seems to happen more when the load balancer backend gets swapped to a new zone with no active load balancer backends, e.g.

europe-west1-a -> europe-west1-a gives no errors,

but

europe-west1-a -> europe-west1-b gives errors.

During load balancer cold start we get 502 errors.

@mikegin

mikegin commented May 16, 2022

Also experiencing this issue, any help would be appreciated!

govargo added a commit to govargo/gcp-manifests that referenced this issue Jul 14, 2024