
Point gce ingress health checks at the node for onlylocal services #17

Closed
bowei opened this issue Oct 11, 2017 · 12 comments
Labels
backend/gce lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


bowei commented Oct 11, 2017

From @bprashanth on November 18, 2016 22:35

We now have a new beta annotation on Services, external-traffic (http://kubernetes.io/docs/user-guide/load-balancer/#loss-of-client-source-ip-for-external-traffic). With this annotation set to onlyLocal, NodePort Services proxy only to local endpoints; if there are no local endpoints, iptables is configured to drop packets. Currently, sticking an onlyLocal Service behind an Ingress works, but in a suboptimal way.
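
For reference, here is a minimal sketch of a NodePort Service carrying that annotation. The annotation key and value are taken from the beta docs linked above; the name, selector, and port are hypothetical, and the Go types are the current k8s.io/api ones rather than anything from this issue:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "my-app", // hypothetical name
			Annotations: map[string]string{
				// With OnlyLocal, kube-proxy forwards NodePort traffic only to
				// endpoints on the receiving node and DROPs it when none exist.
				"service.beta.kubernetes.io/external-traffic": "OnlyLocal",
			},
		},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "my-app"},
			Ports:    []corev1.ServicePort{{Port: 80}},
		},
	}
	fmt.Printf("%s: %v\n", svc.Name, svc.Annotations)
}
```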

The issue is that, currently, the best way to configure LB health checks is to set a high failure threshold, so that we detect nodes with bad networking but don't flake on bad endpoints. With this approach, if all endpoints evacuate a node, it takes e.g. 10 health checks * 10 seconds per health check ≈ 100 seconds (over 1.5 minutes) to mark that node unhealthy, but the node will start DROPing packets for the NodePort immediately. If we pointed the LB health check at the healthcheck-nodeport (a nodePort managed by kube-proxy), it would fail in < 10s even with the high thresholds described above.

@thockin

Copied from original issue: kubernetes/ingress-nginx#19


bowei commented Oct 11, 2017

From @thockin on February 22, 2017 16:37

@MrHohn Found it!


bowei commented Oct 11, 2017

From @MrHohn on February 28, 2017 1:28

Sorry, it's a bit late to ask, but I'm lost on some of the arguments here:

With this approach, if all endpoints evacuate a node, it takes e.g. 10 health checks * 10 seconds per health check ≈ 100 seconds (over 1.5 minutes) to mark that node unhealthy, but the node will start DROPing packets for the NodePort immediately.

Not sure what 'if all endpoints evacuate a node' means. If it means this node is broken and all endpoints get rescheduled to other nodes, why would it take 10*10 seconds to mark this node unhealthy? I just checked the default setup for the ingress health check: the unhealthy threshold is set to 10 consecutive failures and the interval is set to 1 second. So I assume it would take 10*1 seconds for this node to be marked unhealthy for that specific backend group (also assuming health checks are sent to every instance in the group)?

Adding @nicksardo, who might also have the answer.


bowei commented Oct 11, 2017

From @thockin on March 1, 2017 8:29

Not sure what 'if all endpoints evacuate a node' means

If, for some reason, all the endpoints for a Service get moved to a different node.

Why would it take 10*10 seconds to mark this node unhealthy

The way the healthchecks function today. I think the point is that this is bad.

The way I think it should work is:

If OnlyLocal {
    HC targets HC-NodePort
    HC gets 200 if there are local backends, else 404 or whatever
} else {
    HC targets kube-proxy healthz
    HC gets 200 if kube-proxy is online, else fail
}

For OnlyLocal Services, this will result in a node dropping out when there are no local backends - correct behavior.

For Regular Services, this will result in a node dropping out if kube-proxy is dead - arguably correct behavior, especially considering the recent kube-proxy eviction fiasco.

What we do today - actually HC'ing the Service - isn't right. It will pick a potentially different backend every time, so a dead backend won't reliably be detected this way, and a flaky backend could cause any number of whole nodes to drop out.

The proposal above is not robust to, e.g., iptables failures. Maybe we can make it touch that path too, but I think what we have today is just wrong.
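
For illustration, a rough sketch of the OnlyLocal branch above (an assumption about the intended behavior, not kube-proxy's actual implementation): the healthcheck-nodeport answers 200 while the node has local endpoints and a failure status once it has none, so the LB drops the node within a few probe intervals.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// localEndpoints would be kept up to date from the endpoints watch;
// here it is just a placeholder counter.
var localEndpoints int64

func healthCheck(w http.ResponseWriter, r *http.Request) {
	if n := atomic.LoadInt64(&localEndpoints); n > 0 {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintf(w, "%d local endpoints\n", n)
		return
	}
	// No local endpoints: fail the check so the LB stops routing to this node.
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/healthz", healthCheck)
	// The port here is hypothetical; in practice it would be the Service's
	// assigned healthCheckNodePort.
	http.ListenAndServe(":30999", nil)
}
```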



bowei commented Oct 11, 2017

From @MrHohn on March 1, 2017 18:27

The way the healthchecks function today. I think the point is that this is bad.

Thanks for clarifying. Yeah, I think this is really the point. It seems like the current ingress health check setup is totally broken --- I don't think bad endpoints can be easily detected when good endpoints also exist.

If OnlyLocal {
    HC targets HC-NodePort
    HC gets 200 if there are local backends, else 404 or whatever
} else {
    HC targets kube-proxy healthz
    HC gets 200 if kube-proxy is online, else fail
}

This seems feasible. We may also want to take kubernetes/kubernetes#14661 into account: if, in the future, we also have node-level health checks from LBs for non-OnlyLocal services, we could expand the health check mechanism above to non-OnlyLocal LoadBalancer services as well.

We may also need to think about NodePort services. ESIPP does not assign a healthCheckNodePort to OnlyLocal NodePort services for now, which means the health check mechanism above cannot be used for them. To expand this to NodePort services, we would need to assign a healthCheckNodePort to NodePort services as well. But then we may start to worry about allocating too many healthCheckNodePorts --- considering potential port exhaustion and the overhead in kube-proxy of holding too many health check servers.

More to think about: if we are expanding this mechanism to non-OnlyLocal LoadBalancer services, what about non-OnlyLocal NodePort services? Feels like we need a concrete proposal :)


bowei commented Oct 11, 2017

From @MrHohn on March 1, 2017 18:34

Oh, I made a mistake above. For OnlyLocal NodePort services the original health check already works: traffic sent to nodes that do not have endpoints on them will be dropped.


bowei commented Oct 11, 2017

From @sanderploegsma on June 22, 2017 13:55

Why would it take 10*10 seconds to mark this node unhealthy? Just checked the default setup for ingress healthcheck, the unhealthy threshold is set to 10 consecutive failures and the interval is set to 1 second.

This does not seem to be correct; the current source code explicitly mentions an interval of 60 seconds. Even better: creating an ingress connected to a service that has a readiness probe will combine these intervals, so if the readiness probe is set up to check at an interval of 10 seconds, the resulting health check will have an interval of 70 seconds. With the 10-failure threshold, it will then take over 10 minutes before a node stops receiving traffic that it will drop because of the annotation.

We can of course change these parameters by hand, but it looks like the documentation fails to mention this behaviour. We found out the hard way in our production environment, after almost 15 minutes of seemingly random outage following the deployment of a new version of one of our applications...
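
As a back-of-the-envelope check of those numbers (threshold of 10 consecutive failures from the earlier comments, 60s default interval, 70s once a 10s readiness probe is folded in), a small sketch:

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseDetection returns how long the LB needs to mark a node unhealthy:
// `threshold` consecutive failed probes, one per `interval`.
func worstCaseDetection(threshold int, interval time.Duration) time.Duration {
	return time.Duration(threshold) * interval
}

func main() {
	fmt.Println(worstCaseDetection(10, 60*time.Second)) // default interval: 10m0s
	fmt.Println(worstCaseDetection(10, 70*time.Second)) // with 10s readiness probe folded in: 11m40s
}
```

The second figure matches the "over 10 minutes" observed above.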


bowei commented Oct 11, 2017

From @nicksardo on June 22, 2017 17:14

I wouldn't recommend lowering the interval, @sanderploegsma. Since health checks hit the service's nodePort on every node, if you run only a few pods in a cluster with many nodes, those pods will receive a significant amount of health check traffic. As thockin said earlier,

What we do today - actually HC'ing the Service - isn't right. It will pick a potentially different backend every time, so a dead backend won't reliably be detected this way, and a flaky backend could cause any number of whole nodes to drop out.

Therefore, you shouldn't rely on LB health checks for monitoring pod or node health at this time.


bowei commented Oct 11, 2017

From @sanderploegsma on June 22, 2017 18:15

Right, that makes sense. So there currently is no way for us to combine the external-traffic annotation with an Ingress? There must be a way to correctly determine to which nodes traffic should be directed, right?


bowei commented Oct 11, 2017

From @nicksardo on June 22, 2017 18:38

I don't know of anyone else who has tried using that annotation with ingress. Be aware that it's not a documented/supported annotation for services used by GCE ingress. Also, I don't recommend manually modifying health checks: GLBC 0.9.3 and 0.9.4 will drop these settings when migrating to a newer health check type. This was made seamless in 0.9.5.

There must be a way to correctly determine to which nodes traffic should be directed, right?

Better solutions are being worked on.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Jan 9, 2018
nicksardo removed their assignment Jan 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label Feb 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
