📖 CAEP: Machine health checking a.k.a node auto repair proposal #1684

k8s-ci-robot merged 1 commit into kubernetes-sigs:master
Conversation
Hi @enxebre. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> # Title
> - Machine health checking a.k.a node auto repair
Suggested change: "Machine health checking a.k.a node auto repair" → "Machine remediation a.k.a node auto repair"
> ## Glossary
>
> Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
It might be helpful to add a definition for Machine/Node remediation.
>   maxUnhealthy: "40%"
> status:
>   currentHealthy: 5
>   expectedMachines: 5
How is expectedMachines determined? Would it be better to call this totalMachines?
expectedMachines is given by spec.selector; I'm good with totalMachines.
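The interplay of these status fields with the maxUnhealthy short-circuit can be sketched in a few lines. This is an illustrative Python model, not the actual controller (which is written in Go); the function names are hypothetical, and it assumes maxUnhealthy may be either an absolute int or a percentage string, with the unhealthy count derived as expectedMachines minus currentHealthy:

```python
def max_unhealthy_allowed(max_unhealthy, expected_machines):
    """Resolve maxUnhealthy (int or percentage string) to an absolute count."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy.rstrip("%"))
        return expected_machines * percent // 100
    return int(max_unhealthy)

def remediation_allowed(max_unhealthy, expected_machines, current_healthy):
    """Short-circuit: no further remediation once unhealthy machines
    reach the maxUnhealthy threshold."""
    unhealthy = expected_machines - current_healthy
    return unhealthy < max_unhealthy_allowed(max_unhealthy, expected_machines)
```

With the example manifest above (maxUnhealthy "40%", 5 expected machines), the threshold resolves to 2 machines: remediation proceeds while 0 or 1 machines are unhealthy and stops at 2.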
/kind proposal
/ok-to-test
> For a node notFound or a failed machine, the machine is considered unrecoverable, so remediation can be triggered right away.
> ### Remediation:
> - A deletion request for the machine is sent to the API server.
Should this not allow for a maxSurge like the RollingUpdate MachineDeploymentStrategyType?
If remediation does not use or coordinate with the MachineDeployment, then a concurrent RollingUpdate and remediation could end up removing more nodes than the application can tolerate.
To expand on this, I worry that cluster-api is walking into a failed design. If there are numerous subsystems each trying to drain and/or delete nodes independently from each other, it is likely that they will occasionally overstress applications, causing databases to drop below quorum and the like.
There should be one controller responsible for doing rolling updates of nodes. Other subsystems could then independently nominate/mark nodes for the rolling update controller to get rid of.
This is a good point. This proposal is actually trying to respect the right boundaries/responsibilities between controllers, using consumable semantics that allow composability when manipulating machine resources and signalling other controllers.
The MHC is just nominating a machine for deletion by signalling to the API server a desire for a machine resource to be deleted; this could also be requested at any time by e.g. a user. As soon as the deletionTimestamp is set, any watcher can react as it sees fit.
Replicas reconciliation is thus delegated to where it belongs: the controller owning the machine, e.g. machineDeployment, machineSet, or other.
The actual deletion process is likewise delegated to where it belongs: the machine controller, and this is the layer where application disruption toleration must be enforced via draining and PDBs.
It is not clear to me that this mechanism would respect the maxSurge and maxUnavailable settings of the machineDeployment. I believe the MDC should use a mechanism that respects those settings, especially if the machine has workloads on it that may still be functioning.
This might be more of a machineDeployment/machineSet issue, but I can see a distinction between requesting voluntary and involuntary eviction of workloads. Deletion seems like it would tend towards the involuntary side of things.
Suggested change: "A deletion request for the machine is sent to the API server." → "The MDC requests deletion of the machine by placing a `cluster.k8s.io/delete-machine` annotation on it."

though I may misunderstand the semantics of that annotation.
There's no such annotation. The process is the same as for any other entity signalling a deletion desire. The machine health check sends a delete request to the API server, i.e. client.Delete(). Everything else happens out of band:
- The machine is given a deletionTimestamp.
- Any watcher can adjust based on the deletionTimestamp (usually by filtering).
- The machine controller sees the deletionTimestamp and enforces the workload availability policy by draining the node and honouring pod PDBs.
- The machine controller removes the machine finalizer.
- The API server removes the machine object from etcd.
You can also still set the maxUnhealthy machine health checker field to short-circuit and be less tolerant than the deployment maxUnhealthy.
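The out-of-band flow enumerated above can be modelled end to end in a short sketch. This is hypothetical Python (the real actors are the Kubernetes API server and the Go machine controller, and the finalizer name is illustrative):

```python
from datetime import datetime, timezone

class Machine:
    def __init__(self, name):
        self.name = name
        self.deletion_timestamp = None                 # set when deletion is requested
        self.finalizers = ["machine.cluster.k8s.io"]   # keeps the object alive until cleanup

def request_delete(machine):
    # All the MHC does: signal a deletion desire (i.e. client.Delete()).
    # The API server reacts by stamping deletionTimestamp.
    if machine.deletion_timestamp is None:
        machine.deletion_timestamp = datetime.now(timezone.utc)

def machine_controller_reconcile(machine, store):
    # The machine controller sees the deletionTimestamp, drains the node
    # while honouring pod PDBs (elided here), then removes its finalizer.
    if machine.deletion_timestamp is None:
        return
    machine.finalizers = [f for f in machine.finalizers if f != "machine.cluster.k8s.io"]
    if not machine.finalizers:
        # With no finalizers left, the API server removes the object from etcd.
        store.pop(machine.name, None)
```

A run-through: `request_delete` only stamps the timestamp; the machine survives until the controller's reconcile drops the finalizer, at which point the object disappears from the store. Any other watcher in between can filter on the timestamp and react as it sees fit.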
There is such an annotation, at
Ah, got you: that's for the machineSet to prioritise deletion during a scale-down operation. A machine with a deletionTimestamp won't even make it to the list of machines being prioritised; it will be filtered out and dismissed when reconciling towards the expected number of replicas.
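That distinction can be illustrated with a small sketch. Hypothetical Python (the real machineSet controller is Go): machines already carrying a deletionTimestamp are dropped from the candidate list entirely, while the `cluster.k8s.io/delete-machine` annotation only prioritises among the remaining candidates:

```python
DELETE_ANNOTATION = "cluster.k8s.io/delete-machine"

def scale_down_candidates(machines):
    """Order machines for a machineSet scale-down, per the thread above."""
    # Machines already being deleted are filtered out up front...
    live = [m for m in machines if m.get("deletionTimestamp") is None]
    # ...then annotated machines are preferred for deletion
    # (False sorts before True, so annotated machines come first).
    return sorted(live, key=lambda m: DELETE_ANNOTATION not in m.get("annotations", {}))
```

So a machine with both a deletionTimestamp and the annotation never appears in the list at all; the annotation only matters for live machines.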
> start;
> :Machine Health Check controller;
> repeat
> repeat
This nested repeat block doesn't render as I think you probably want in the image. Is there some change you could make?
mm, I couldn't find a better way by looking here: http://plantuml.com/guide
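One workaround when nested `repeat` blocks render poorly is to flatten the inner loop into an `if` guard inside a single loop. A hedged PlantUML sketch, with activity names that are illustrative rather than taken from the proposal:

```
@startuml
start
:Machine Health Check controller;
repeat
  :pick next target machine;
  if (unhealthy past timeout?) then (yes)
    :remediate machine;
  endif
repeat while (more targets?)
stop
@enduml
```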
> e2e testing as part of the cluster-api e2e test suite.
>
> For failing early, we could consider a test suite leveraging kubemark as a provider to simulate healthy/unhealthy nodes in a cloud-agnostic manner without needing to bring up a real instance.
@thebsdbox had built a fake provider for testing recently, we could potentially leverage that for this type of testing.
LGTM. I hear what @johngmyers is saying about the MHC potentially fighting with the MD/MS controllers, or not taking max surge and/or unavailable values into consideration, but I'd like to see an implementation of the MHC. Maybe it works flawlessly, or maybe it has some conflicts. Either way, let's write some code & test it! 😄
michaelgugino left a comment:
We should define some mechanism to disable remediation on a particular node/machine as well.
> #### MachineHealthCheck CRD:
> - Enable watching a group of machines (based on a label selector).
> - Enable defining unhealthy node criteria (based on a list of node conditions).
> - Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold, no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.
This seems like it could easily be misconfigured. We should look at the total number of unhealthy nodes, not just the nodes in the 'target pool'. Also, we're watching a group of machines, not a group of nodes. Also, it's unclear what should happen if a node/machine is covered in multiple 'groups of machines' mentioned above.
I'm not sure I agree here. I would expect users to likely have groups of multiple MachineDeployments and to leverage those as pools of Nodes with different scheduling requirements (GPU availability, public/private app, etc), and as such I would expect to be able to define pertinent health checks against each of these separately.
What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.
Also, it's still unclear what should happen if a node is in multiple groups.
> What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.

Yes, I understand that. I'm saying that for non-trivial use cases (i.e. one MachineDeployment for all worker nodes) I think this is not the view that we care about. We care more about the interruption of the pool of nodes that are serving individual scheduling concerns rather than the full set of nodes in the cluster.

I suspect this is something that we will need to make sure has plenty of clear documentation around, though.

> Also, it's still unclear what should happen if a node is in multiple groups.

This is probably something that should be clarified.
> Also, it's still unclear what should happen if a node is in multiple groups.

A node that happens to be covered by more than one MHC is liable to be remediated by any of them satisfying the requirements.
Later on we could discuss things like rejecting multiple groups, rate limiting instead of fully short-circuiting, maybe adding zone awareness and a cluster-wide view, etc. But unless there are strong objections I'd prefer to keep this proposal simple and follow up with RFEs as we start gathering feedback based on an initial tangible implementation.
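The "any matching MHC may remediate" behaviour follows directly from label-selector semantics. A hypothetical Python sketch of matchLabels-only selector matching (real Kubernetes selectors also support matchExpressions, omitted here; function names are illustrative):

```python
def selector_matches(match_labels, machine_labels):
    # matchLabels semantics: every selector key/value pair must be
    # present on the machine's labels.
    return all(machine_labels.get(k) == v for k, v in match_labels.items())

def covering_health_checks(machine_labels, health_checks):
    # A machine matched by several MachineHealthChecks can be remediated
    # by any of them once its unhealthy criteria are satisfied.
    return [name for name, sel in health_checks.items()
            if selector_matches(sel, machine_labels)]
```

For example, a machine labelled both `role: worker` and `gpu: "true"` would be covered by a worker-pool MHC and a GPU-pool MHC simultaneously, which is exactly the multiple-groups case raised above.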
>   namespace: machine-api
> spec:
>   selector:
>     matchLabels:
Above it says we're looking at machine labels, but the 'role' label seems to be specific to a node. Is this intended to match a machine object's labels?
This is an arbitrary label that matches a group of machines.
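For illustration, a Machine carrying such a label might look like the following. Field values here are hypothetical, and the apiVersion reflects the v1alpha1 API of the time:

```yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: worker-0
  namespace: machine-api
  labels:
    role: worker   # arbitrary label; matched by the MachineHealthCheck's spec.selector.matchLabels
```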
I'll take a final pass by end of week, looking good so far!
Lazy consensus starts now. Expires in 1 week on 12/18.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, ncdc

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
What this PR does / why we need it:
Add machine health checking a.k.a node auto repair proposal