Experiencing downtime when updating a host's backend in ingress controller #116
Comments
Let me get this straight: you're editing an ingress spec to point to multiple services which weren't previously used for ingress? I wouldn't be surprised about receiving 502s. It could be that the GCE health checks haven't passed yet and it's failing closed, or it could be that the firewall rule change hasn't propagated yet (most likely the case here). I would be bothered if an existing host/backend experienced 502s when a new backend is added; however, I don't see a small startup blip as an issue.
@nicksardo No, I don't believe your summary of the issue is correct. When I update the backend of a single existing host rule, I don't see any 502s. I don't know the inner workings of the ingress controller, so I can only speculate as to why this is happening, but it appears to only be an issue when I'm completely removing the reference to the old backend.
On the contrary, I do get 502s when updating a single backend to a new service, even when the previous backend still exists for another ingress. I can observe them with a simple curl loop. In any case, minimizing 502s while changing targeted services is not a priority right now.
@nicksardo You're correct. I wasn't testing at a high enough rate. Updating an existing rule to utilize a new backend will result in downtime.
Closing this for now. @znorris, reopen if you are still having issues.
I'd like to reopen this. What I'm seeing is similar. We're basically updating a Deployment with a newer version. We've already set minReadySeconds=60, maxSurge=6, maxUnavailable=0, and adjusted the health checks to period=5, timeout=4, success=1, failure=2, initialDelay=60. So when we push the exact same Deployment with a newer container image version, I get no 502s during the startup of the 6 new pods (12 in total at that point, with replicas=6), but as soon as the old containers are being torn down, 502s start to happen. This lasts for the entire duration of the teardown of each old pod, plus the failure*period window, so if the health checks are conservative (say period=60, timeout=10, success=1, failure=5) it takes quite a long time. Given that there's really nothing more we can tune on our side, and that this kind of defeats the purpose of a rolling upgrade, I'd like to see what can be done to address it.
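For reference, the tuning described in the comment above corresponds roughly to a Deployment along these lines; a minimal sketch with placeholder names and image, using the values quoted in the comment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 6
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:v2    # placeholder image
        readinessProbe:
          httpGet:
            path: /healthz                    # placeholder path
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 5
          timeoutSeconds: 4
          successThreshold: 1
          failureThreshold: 2
```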
@infernix Are you using nginx within each pod? If so, this might help: https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304 |
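The approach in that post amounts to delaying container shutdown (for example with a preStop sleep) so the load balancer can deregister the endpoint before the pod stops serving. A minimal sketch of that idea, with placeholder names and an arbitrary delay, as a pod template fragment inside a Deployment:

```yaml
spec:
  terminationGracePeriodSeconds: 90          # should exceed the preStop delay below
  containers:
  - name: nginx                              # placeholder name
    image: nginx:1.25                        # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 60"]  # arbitrary delay so the LB can deregister the endpoint first
```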
This should be reopened; I'm experiencing the exact same behavior that @infernix describes, almost two years later. Very disappointed that something as serious as downtime hasn't been resolved.
Should definitely be reopened. We are using GKE Autopilot and hit the exact scenario @infernix described when updating a Deployment to a newer version.
I managed to solve the exact downtime described by @infernix by adding the following to my deployment:
This is a new feature of k8s 1.22, and it is useful for situations where cloud load balancers need extra time to set up after the Kubernetes endpoints are ready (which is the case here). EDIT: minReadySeconds seems to work for me only with an nginx readiness check, but not with a Java Spring Actuator readiness check. 🤔 This is definitely an issue that needs to be fixed, so this issue should be reopened, or at least a new one created. @swetharepakula @freehan

It seems that my pods without 502s have the following:

And pods with 502s had:

It seems that to get 502s, all pods must have the latter. @samwightt @renescheepers @znorris @alex-v-mihai @javihgil @Rizal-I @Turee
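Going by the EDIT above, the field in question appears to be minReadySeconds on the Deployment; a minimal sketch of where it sits, with placeholder name and an arbitrary value:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # placeholder name
spec:
  minReadySeconds: 60         # arbitrary value: a newly ready pod does not count as available
                              # until it has been ready this long, buying the cloud load
                              # balancer time to program the new endpoint
  # ...replicas, selector, and pod template as usual
```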
The downtime also seems to happen more when the load balancer gets swapped out to a new zone with no active load balancer, e.g. europe-west1-a -> europe-west1-a gives no errors, but europe-west1-a -> europe-west1-b does. During a load balancer cold start we get 502 errors.
Also experiencing this issue, any help would be appreciated!
Issue
Why would I experience downtime when I update more than one backend service at a time, but not when I update a single backend? (This may not be the correct question or issue summary; at the moment I'm not clear on why this is happening. It could have to do with the old backend being completely dereferenced from my ingress config.)
Reproduce
I've created a repo that includes pretty much everything one would need to reproduce this issue. However, I think the example below illustrates it well enough that you won't need the repo.
https://github.com/znorris/gce_ingress_troubleshoot
Example
For instance, I had a single app & service handling requests for several hosts.
I then added a new app/deployment and service for each of the three hosts.
I verified that the apps were responding to HCs and that the NodePorts were working.
Once that was complete, I updated a single host's backend in the ingress (`a.host`). I waited for the load balancer to update and for the new service/app to respond to this traffic. I also verified in the cloud console that the health check associated with this new backend was passing. This is how I would expect everything to work, and it had zero downtime.
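To make the change concrete: "updating a single host's backend" here means editing the ingress rule for `a.host` to point at the new service. A minimal sketch with a hypothetical new service name, using the networking.k8s.io/v1 schema:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress        # placeholder name
spec:
  rules:
  - host: a.host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-a        # hypothetical new service replacing the old shared app-z backend
            port:
              number: 80
  # ...the remaining host rules still point at the old shared backend at this step
```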
I then decided that I had 5 more hosts to update on the ingress controller, and that it would be faster to update the remaining hosts all at once.
Once that was done, I began to see failing requests (HTTP 502 from the load balancer) for roughly 5 minutes for the hosts I had changed in bulk. After those 5 minutes, requests were OK. During that time, the load balancer was logging these 502s:
I then checked that the appropriate service/app was responding to requests for all hosts in the ingress controller. Everything looked good. I then went into the cloud console and verified that the `app-z` backend was no longer present and that its HCs had been cleaned up as well. They were, so I then removed the old service and deployment/app. I'm now in my desired state and everything is working as expected.