
Experiencing downtime when updating hosts backend in ingress controller #116

Closed
znorris opened this issue Jan 26, 2018 · 12 comments

@znorris

znorris commented Jan 26, 2018

Issue

Why would I experience downtime when I update more than one backend service at a time, but not when I update a single backend? (This may not be the correct question or issue summary, but at the moment I'm not clear on why this is happening. It could have to do with the old backend being completely dereferenced from my ingress config.)

Reproduce

I've created a repo that includes pretty much everything one would need to reproduce this issue. However, I think the example below illustrates it well enough that you won't need the repo.
https://github.com/znorris/gce_ingress_troubleshoot

Example

Initially, I had a single app and service handling requests for several hosts.

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-z         NodePort    10.0.0.1        <none>        80:30001/TCP     1d

I then added a new app/deployment and service for each of the three hosts.

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-a         NodePort    10.0.0.1        <none>        80:30002/TCP     1d
app-b         NodePort    10.0.0.1        <none>        80:30003/TCP     1d
app-c         NodePort    10.0.0.1        <none>        80:30004/TCP     1d
app-z         NodePort    10.0.0.1        <none>        80:30001/TCP     1d

I verified that the apps were responding to health checks (HCs) and that the NodePorts were working.
Once that was complete, I updated a single host's backend in the ingress (a.host).

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-a
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-z
            servicePort: 80

I waited for the load balancer to update and for the new service/app to start responding to this traffic. I also verified in the cloud console that the health check associated with this new backend was passing. This is how I would expect everything to work, and there was zero downtime.

I then decided that I had 5 more hosts to update on the ingress controller, and that it would be faster to update the remaining hosts all at once.

apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  spec:
    rules:
    - host: a.host
      http:
        paths:
        - backend:
            serviceName: app-a
            servicePort: 80
    - host: b.host
      http:
        paths:
        - backend:
            serviceName: app-b
            servicePort: 80
    - host: c.host
      http:
        paths:
        - backend:
            serviceName: app-c
            servicePort: 80

Once that was done, I began to see failing requests (HTTP 502 from the load balancer) for roughly 5 minutes for the hosts I had changed in bulk. After those 5 minutes, requests were OK. During that time the load balancer was logging these 502s:

jsonPayload: {
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_connect_to_backend"
}

I then checked that the appropriate service/app was responding to requests for all hosts in the ingress. Everything looked good. In the cloud console, I verified that the app-z backend was no longer present and that its health checks had been cleaned up as well. They had been, so I removed the old service and deployment/app.

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
app-a         NodePort    10.0.0.1        <none>        80:30002/TCP     1d
app-b         NodePort    10.0.0.1        <none>        80:30003/TCP     1d
app-c         NodePort    10.0.0.1        <none>        80:30004/TCP     1d

I'm now in my desired state and everything is working as expected.

@znorris changed the title from "Experiencing downtime when updating updating backend" to "Experiencing downtime when updating backend" on Jan 26, 2018
@znorris changed the title from "Experiencing downtime when updating backend" to "Experiencing downtime when updating hosts backend in ingress controller" on Jan 26, 2018
@nicksardo
Contributor

Let me get this straight: you're editing an ingress spec to point to multiple services which weren't previously used for ingress? I wouldn't be surprised to see 502s. It could be that the GCE health checks haven't passed yet and the load balancer is failing closed. It could also be that the firewall rule change hasn't propagated yet (most likely the case here).

I would be irritated if an existing host/backend experienced 502s when a new backend is added; however, I don't see a small startup blip as an issue.

@znorris
Author

znorris commented Jan 30, 2018

@nicksardo No, I don't believe your summary of the issue is correct. When I update the backend of a single existing host rule (a.host in the example above) to a new backend, I do not get any downtime. If it were about pointing to a new service that wasn't previously used for ingress, as you suggest, I would have seen downtime for that update as well.

I don't know the inner workings of the ingress controller, so I can only speculate as to why this is happening. It only appears to be an issue when I'm completely removing the reference to the old backend.

@nicksardo
Contributor

On the contrary, I do get 502s when updating a single backend to a new service, even when the previous backend still exists for another ingress. I can observe them with a simple curl loop: while true; do echo $(curl -I http://xxx.xxx.xxx.xxx/ 2> /dev/null | head -n 1 | cut -d ' ' -f 2); done. When I update an ingress so that the old backend is discarded entirely, I also see quite a few 502s.

In any case, minimizing 502s while changing targeted services is not a priority right now.

@znorris
Author

znorris commented Jan 31, 2018

@nicksardo You're correct. I wasn't testing at a high enough rate. Updating an existing rule to utilize a new backend will result in downtime.

@rramkumar1
Contributor

Closing this for now. @znorris reopen if you are still having issues.
/close

@infernix

infernix commented Dec 7, 2018

In any case, minimizing 502s while changing targeted services is not a priority right now.

I'd like to reopen this. What I'm seeing is similar. We're basically updating a Deployment to a newer version. We've already set things like minReadySeconds: 60, maxSurge: 6, maxUnavailable: 0, and adjusted the health checks to period=5, timeout=4, success=1, failure=2, initialDelay=60.
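Roughly, that corresponds to a Deployment fragment like the one below. This is only a sketch: the names, image, and probe path are placeholders, and it assumes the health check values above refer to the pod's readinessProbe rather than the GCE health check.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 6
  minReadySeconds: 60
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2       # placeholder image
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /healthz     # placeholder path
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 5
          timeoutSeconds: 4
          successThreshold: 1
          failureThreshold: 2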

So when we push the exact same Deployment with a newer container image version, I get no 502s during the startup of the 6 new pods (12 total at that point with replicas=6), but as soon as the old containers are being torn down, 502s start to happen. This lasts for the entire teardown of each old pod, within the failure*period window. So if the health checks are conservative (say period=60, timeout=10, success=1, failure=5), this takes quite a long time.
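With those conservative numbers, failure * period = 5 * 60 = 300 seconds, i.e. roughly five minutes of potential 502s per old pod.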

Given that there's really nothing more we can tune on our side, and this kind of defeats the purpose of a rolling update, I'd like to see what can be done to address it.

@lucasfais

@infernix Are you using nginx within each pod? If so, this might help: https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
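In case it is useful to others: the general technique that post describes is delaying container shutdown with a preStop hook, so the load balancer stops routing to the pod before it exits. A minimal sketch of a Deployment pod template fragment follows; the container name, image, grace period, and sleep duration are placeholders, not values taken from the post.

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # placeholder; must exceed the preStop delay
      containers:
      - name: my-app                      # placeholder
        image: my-app:v2                  # placeholder
        lifecycle:
          preStop:
            exec:
              # Keep serving while endpoint removal and load balancer
              # reprogramming propagate, instead of exiting on SIGTERM immediately.
              command: ["sh", "-c", "sleep 30"]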

@samwightt

This should be reopened; I'm experiencing the exact same behavior that @infernix describes, almost two years later. Very disappointing that something as serious as downtime hasn't been resolved.

@AkselAllas

This should definitely be reopened. We are using GKE Autopilot and hit the exact same scenario @infernix described: updating a Deployment to a newer version.

@AkselAllas

AkselAllas commented Mar 8, 2022

I managed to solve the exact downtime described by @infernix by adding the following to my deployment:

spec:
  minReadySeconds: 45

This is a new feature of k8s 1.22, and it is useful for situations where cloud load balancers need extra time to set up after the k8s endpoints are ready (which is the case here).

EDIT: minReadySeconds seems to work for me only with an nginx readiness check, but not with a Java Spring Actuator readiness check. 🤔

This is definitely an issue that needs to be fixed, so this issue should be reopened, or at least a new one created. @swetharepakula @freehan

It seems that my pods without 502s have the following in kubectl describe pod:

  Normal  LoadBalancerNegReady     42m   neg-readiness-reflector                
Pod has become Healthy in NEG "Key{\"k8s1-eqa3dbe-default-frontend-80\", zone: \"europe-west1-a\"}" 
attached to BackendService "Key{\"k8s1-eqa3dbe-default-frontend-80\"}". 
Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

And pods with 502s had:

  Normal   LoadBalancerNegWithoutHealthCheck  3m26s                   neg-readiness-reflector                
  Pod is in NEG "Key{\"k8s1-eqa3dbe-default-frontend-80\", zone: \"europe-west1-a\"}". 
NEG is not attached to any BackendService with health checking. 
Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

It seems that, to get 502s, all pods must have LoadBalancerNegWithoutHealthCheck.

It seems to me that LoadBalancerNegWithoutHealthCheck gets set by the neg-readiness-reflector a non-uniform time after the pod starts, and it doesn't account for the time it takes for the load balancer configuration to propagate. As such, the state described in the pod descriptions can be outdated.
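The same condition can also be read directly from the pod status, e.g. via kubectl get pod <name> -o yaml. An illustrative excerpt; only the condition type is taken from the events above, the other fields and values will differ:

status:
  conditions:
  # Readiness-gate condition set by the neg-readiness-reflector.
  - type: cloud.google.com/load-balancer-neg-ready
    status: "True"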

@samwightt @renescheepers @znorris @alex-v-mihai @javihgil @Rizal-I @Turee

@AkselAllas

AkselAllas commented Mar 9, 2022

The downtime also seems to happen more when the load balancer backend gets swapped to a new zone with no active load balancer backends, e.g.

europe-west1-a -> europe-west1-a gives no errors,

but

europe-west1-a -> europe-west1-b gives errors.

During load balancer cold start we get 502 errors.

@mikegin

mikegin commented May 16, 2022

Also experiencing this issue, any help would be appreciated!

govargo added a commit to govargo/gcp-manifests that referenced this issue Jul 14, 2024