📖 CAEP: Machine health checking a.k.a node auto repair proposal #1684

k8s-ci-robot merged 1 commit into kubernetes-sigs:master
Conversation
Hi @enxebre. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> # Title
> - Machine health checking a.k.a node auto repair
Suggested change: "Machine health checking a.k.a node auto repair" → "Machine remediation a.k.a node auto repair"
> ## Glossary
>
> Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
It might be helpful to add a definition for Machine/Node remediation.
>   maxUnhealthy: "40%"
> status:
>   currentHealthy: 5
>   expectedMachines: 5
How is expectedMachines determined? Would it be better to call this totalMachines?
expectedMachines is given by spec.selector; I'm good with totalMachines.
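The interplay of these status fields with the maxUnhealthy short-circuit can be sketched in a few lines. This is an illustrative Python model, not the actual controller (which is written in Go); the function names are hypothetical, and it assumes maxUnhealthy may be either an absolute int or a percentage string, with the unhealthy count derived as expectedMachines minus currentHealthy:

```python
def max_unhealthy_allowed(max_unhealthy, expected_machines):
    """Resolve maxUnhealthy (int or percentage string) to an absolute count."""
    if isinstance(max_unhealthy, str) and max_unhealthy.endswith("%"):
        percent = int(max_unhealthy.rstrip("%"))
        return expected_machines * percent // 100
    return int(max_unhealthy)

def remediation_allowed(max_unhealthy, expected_machines, current_healthy):
    """Short-circuit: no further remediation once unhealthy machines
    reach the maxUnhealthy threshold."""
    unhealthy = expected_machines - current_healthy
    return unhealthy < max_unhealthy_allowed(max_unhealthy, expected_machines)
```

With the example manifest above (maxUnhealthy "40%", 5 expected machines), the threshold resolves to 2 machines: remediation proceeds while 0 or 1 machines are unhealthy and stops at 2.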
/kind proposal
/ok-to-test
> For a node notFound or a failed machine, the machine is considered unrecoverable, so remediation can be triggered right away.
> ### Remediation:
> - A deletion request for the machine is sent to the API server.
Should this not allow for a maxSurge like the RollingUpdate MachineDeploymentStrategyType?
If remediation does not use or coordinate with the MachineDeployment, then a concurrent RollingUpdate and remediation could end up removing more nodes than the application can tolerate.
To expand on this, I worry that cluster-api is walking into a failed design. If there are numerous subsystems each trying to drain and/or delete nodes independently from each other, it is likely that they will occasionally overstress applications, causing databases to drop below quorum and the like.
There should be one controller responsible for doing rolling updates of nodes. Other subsystems could then independently nominate/mark nodes for the rolling update controller to get rid of.
This is a good point. This proposal is actually trying to respect the right boundaries/responsibilities between controllers, using consumable semantics that allow composability when manipulating machine resources and signalling other controllers.
The MHC is just nominating a machine for deletion by signalling to the API server a desire for a machine resource to be deleted; this could also be requested at any time by e.g. a user. As soon as the deletionTimestamp is set, any watcher can react as it sees fit.
Replicas reconciliation is thus delegated to where it belongs: the controller owning the machine, e.g. machineDeployment, machineSet, or other.
The actual deletion process is likewise delegated to where it belongs: the machine controller, and this is the layer where application disruption toleration must be enforced via draining and PDBs.
It is not clear to me that this mechanism would respect the maxSurge and maxUnavailable settings of the machineDeployment. I believe the MDC should use a mechanism that respects those settings, especially if the machine has workloads on it that may still be functioning.
This might be more of a machineDeployment/machineSet issue, but I can see a distinction between requesting voluntary and involuntary eviction of workloads. Deletion seems like it would tend towards the involuntary side of things.
Suggested change: "A deletion request for the machine is sent to the API server." → "The MDC requests deletion of the machine by placing a `cluster.k8s.io/delete-machine` annotation on it."

though I may misunderstand the semantics of that annotation.
There's no such annotation. The process is the same as for any other entity signalling a deletion desire. The machine health check sends a delete request to the API server, i.e. client.Delete(). Everything else happens out of band:
- The machine is given a deletionTimestamp.
- Any watcher can adjust based on the deletionTimestamp (usually by filtering).
- The machine controller sees the deletionTimestamp and enforces the workload availability policy by draining the node and honouring pod PDBs.
- The machine controller removes the machine finalizer.
- The API server removes the machine object from etcd.
You can also still set the maxUnhealthy machine health checker field to short-circuit and be less tolerant than the deployment maxUnhealthy.
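The out-of-band flow enumerated above can be modelled end to end in a short sketch. This is hypothetical Python (the real actors are the Kubernetes API server and the Go machine controller, and the finalizer name is illustrative):

```python
from datetime import datetime, timezone

class Machine:
    def __init__(self, name):
        self.name = name
        self.deletion_timestamp = None                 # set when deletion is requested
        self.finalizers = ["machine.cluster.k8s.io"]   # keeps the object alive until cleanup

def request_delete(machine):
    # All the MHC does: signal a deletion desire (i.e. client.Delete()).
    # The API server reacts by stamping deletionTimestamp.
    if machine.deletion_timestamp is None:
        machine.deletion_timestamp = datetime.now(timezone.utc)

def machine_controller_reconcile(machine, store):
    # The machine controller sees the deletionTimestamp, drains the node
    # while honouring pod PDBs (elided here), then removes its finalizer.
    if machine.deletion_timestamp is None:
        return
    machine.finalizers = [f for f in machine.finalizers if f != "machine.cluster.k8s.io"]
    if not machine.finalizers:
        # With no finalizers left, the API server removes the object from etcd.
        store.pop(machine.name, None)
```

A run-through: `request_delete` only stamps the timestamp; the machine survives until the controller's reconcile drops the finalizer, at which point the object disappears from the store. Any other watcher in between can filter on the timestamp and react as it sees fit.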
There is such an annotation, at
Ah, got you: that's for the machineSet to prioritise deletion during a scale-down operation. A machine with a deletionTimestamp won't even make it to the list of machines being prioritised; it will be filtered out and dismissed when reconciling towards the expected number of replicas.
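That distinction can be illustrated with a small sketch. Hypothetical Python (the real machineSet controller is Go): machines already carrying a deletionTimestamp are dropped from the candidate list entirely, while the `cluster.k8s.io/delete-machine` annotation only prioritises among the remaining candidates:

```python
DELETE_ANNOTATION = "cluster.k8s.io/delete-machine"

def scale_down_candidates(machines):
    """Order machines for a machineSet scale-down, per the thread above."""
    # Machines already being deleted are filtered out up front...
    live = [m for m in machines if m.get("deletionTimestamp") is None]
    # ...then annotated machines are preferred for deletion
    # (False sorts before True, so annotated machines come first).
    return sorted(live, key=lambda m: DELETE_ANNOTATION not in m.get("annotations", {}))
```

So a machine with both a deletionTimestamp and the annotation never appears in the list at all; the annotation only matters for live machines.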
> start;
> :Machine Health Check controller;
> repeat
> repeat
This nested repeat block doesn't render as I think you probably want in the image. Is there some change you could make?
mm, I couldn't find a better way by looking here: http://plantuml.com/guide
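One workaround when nested `repeat` blocks render poorly is to flatten the inner loop into an `if` guard inside a single loop. A hedged PlantUML sketch, with activity names that are illustrative rather than taken from the proposal:

```
@startuml
start
:Machine Health Check controller;
repeat
  :pick next target machine;
  if (unhealthy past timeout?) then (yes)
    :remediate machine;
  endif
repeat while (more targets?)
stop
@enduml
```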
> e2e testing as part of the cluster-api e2e test suite.
>
> For failing early, we could consider a test suite leveraging kubemark as a provider to simulate healthy/unhealthy nodes in a cloud-agnostic manner without needing to bring up a real instance.
@thebsdbox had built a fake provider for testing recently, we could potentially leverage that for this type of testing.
LGTM. I hear what @johngmyers is saying about the MHC potentially fighting with the MD/MS controllers, or not taking max surge and/or unavailable values into consideration, but I'd like to see an implementation of the MHC. Maybe it works flawlessly, or maybe it has some conflicts. Either way, let's write some code & test it! 😄
michaelgugino left a comment:
We should define some mechanism to disable remediation on a particular node/machine as well.
> #### MachineHealthCheck CRD:
> - Enable watching a group of machines (based on a label selector).
> - Enable defining unhealthy node criteria (based on a list of node conditions).
> - Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold, no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.
This seems like it could easily be misconfigured. We should look at the total number of unhealthy nodes, not just the nodes in the 'target pool'. Also, we're watching a group of machines, not a group of nodes. Also, it's unclear what should happen if a node/machine is covered in multiple 'groups of machines' mentioned above.
I'm not sure I agree here. I would expect users to likely have groups of multiple MachineDeployments and to leverage those as pools of Nodes with different scheduling requirements (GPU availability, public/private app, etc), and as such I would expect to be able to define pertinent health checks against each of these separately.
What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.
Also, it's still unclear what should happen if a node is in multiple groups.
> What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.

Yes, I understand that. I'm saying that for non-trivial use cases (i.e. one MachineDeployment for all worker nodes) I think this is not the view that we care about. We care more about the interruption of the pool of nodes that are serving individual scheduling concerns rather than the full set of nodes in the cluster.

I suspect this is something that we will need to make sure has plenty of clear documentation around, though.

> Also, it's still unclear what should happen if a node is in multiple groups.

This is probably something that should be clarified.
> Also, it's still unclear what should happen if a node is in multiple groups.

A node that happens to be covered by more than one MHC is liable to be remediated by any of them satisfying the requirements.
Later on we could discuss things like rejecting multiple groups, rate limiting instead of fully short-circuiting, maybe adding zone awareness and a cluster-wide view, etc. But unless there are strong objections I'd prefer to keep this proposal simple and follow up with RFEs as we start gathering feedback based on an initial tangible implementation.
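The "any matching MHC may remediate" behaviour follows directly from label-selector semantics. A hypothetical Python sketch of matchLabels-only selector matching (real Kubernetes selectors also support matchExpressions, omitted here; function names are illustrative):

```python
def selector_matches(match_labels, machine_labels):
    # matchLabels semantics: every selector key/value pair must be
    # present on the machine's labels.
    return all(machine_labels.get(k) == v for k, v in match_labels.items())

def covering_health_checks(machine_labels, health_checks):
    # A machine matched by several MachineHealthChecks can be remediated
    # by any of them once its unhealthy criteria are satisfied.
    return [name for name, sel in health_checks.items()
            if selector_matches(sel, machine_labels)]
```

For example, a machine labelled both `role: worker` and `gpu: "true"` would be covered by a worker-pool MHC and a GPU-pool MHC simultaneously, which is exactly the multiple-groups case raised above.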
>   namespace: machine-api
> spec:
>   selector:
>     matchLabels:
Above it says we're looking at machine labels, but the 'role' label seems to be specific to a node. Is this intended to match a machine object's labels?
This is an arbitrary label that matches a group of machines.
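For illustration, a Machine carrying such a label might look like the following. Field values here are hypothetical, and the apiVersion reflects the v1alpha1 API of the time:

```yaml
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: worker-0
  namespace: machine-api
  labels:
    role: worker   # arbitrary label; matched by the MachineHealthCheck's spec.selector.matchLabels
```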
I'll take a final pass by end of week, looking good so far!
Lazy consensus starts now. Expires in 1 week on 12/18.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, ncdc

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
What this PR does / why we need it:
Add machine health checking a.k.a node auto repair proposal