
📖 CAEP: Machine health checking a.k.a node auto repair proposal #1684

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:master from enxebre:mhc-proposal
Dec 18, 2019

Conversation

@enxebre
Member

@enxebre enxebre commented Oct 30, 2019

What this PR does / why we need it:
Add machine health checking a.k.a node auto repair proposal

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 30, 2019
@k8s-ci-robot
Contributor

Hi @enxebre. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 30, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 30, 2019
@enxebre enxebre changed the title 📖 Add machine health checking a.k.a node auto repair proposal Oct 30, 2019
---

# Title
- Machine health checking a.k.a node auto repair
Contributor

Suggested change:
- Machine health checking a.k.a node auto repair
+ Machine remediation a.k.a node auto repair


## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
Contributor

It might be helpful to add a definition for Machine/Node remediation

maxUnhealthy: "40%"
status:
  currentHealthy: 5
  expectedMachines: 5
Contributor

How is expectedMachines determined? Would it be better to call this totalMachines?

Member Author

expectedMachines is given by spec.selector; I'm good with totalMachines.

@vincepri vincepri changed the title 📖 Add machine health checking a.k.a node auto repair proposal 📖 CAEP: Machine health checking a.k.a node auto repair proposal Oct 30, 2019
@vincepri
Member

/kind proposal

@ncdc
Contributor

ncdc commented Oct 31, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 31, 2019
For a node notFound or a failed machine, the machine is considered unrecoverable, so remediation can be triggered right away.

### Remediation:
- A deletion request for the machine is sent to the API server.


Should this not allow for a maxSurge like the RollingUpdate MachineDeploymentStrategyType?

If remediation does not use or coordinate with the MachineDeployment, then a concurrent RollingUpdate and remediation could end up removing more nodes than the application can tolerate.


To expand on this, I worry that cluster-api is walking into a failed design. If there are numerous subsystems each trying to drain and/or delete nodes independently from each other, it is likely that they will occasionally overstress applications, causing databases to drop below quorum and the like.

There should be one controller responsible for doing rolling updates of nodes. Other subsystems could then independently nominate/mark nodes for the rolling update controller to get rid of.

Member Author

This is a good point. This proposal is actually trying to respect the right boundaries/responsibilities between controllers, using consumable semantics that allow composability when manipulating machine resources and signalling other controllers.
The MHC is just nominating a machine for deletion by signalling to the API server a desire for a machine resource to be deleted; this could also be requested at any time by e.g. a user. As soon as the deletionTimestamp is set, any watcher can react as it sees fit.
Replica reconciliation is thus delegated to where it belongs: the controller owning the machine, e.g. a machineDeployment, machineSet, or other.
The actual deletion process is likewise delegated to where it belongs: the machine controller, which is the layer where application disruption toleration must be enforced via draining and PDBs.


It is not clear to me that this mechanism would respect the maxSurge and maxUnavailable settings of the machineDeployment. I believe the MDC should use a mechanism that respects those settings, especially if the machine has workloads on it that may still be functioning.

This might be more of a machineDeployment/machineSet issue, but I can see a distinction between requesting voluntary and involuntary eviction of workloads. Deletion seems like it would tend towards the involuntary side of things.

@johngmyers Nov 13, 2019

Suggested change:
- A deletion request for the machine is sent to the API server.
+ The MDC requests deletion of the machine by placing a `cluster.k8s.io/delete-machine` annotation on it.

though I may misunderstand the semantics of that annotation.

Member Author

There's no such annotation. The process is the same as for any other entity signalling a deletion desire: the machine health check sends a request to the API server, i.e. client.Delete(). Everything else happens out of band:

- The machine is given a deletionTimestamp.
- Any watcher can adjust based on the deletionTimestamp (usually by filtering).
- The machine controller sees the deletionTimestamp and enforces workload availability policy by draining and pod PDBs.
- The machine controller removes the machine finalizer.
- The API server removes the machine object from etcd.

You can also still set the maxUnhealthy machine health checker field to short circuit and be less tolerant than the deployment maxUnhealthy.
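The out-of-band flow above can be sketched in miniature. This is an illustrative model only, not the real Cluster API types or controllers: `Machine`, `requestDeletion`, `activeMachines`, and `reconcileDeletion` are simplified stand-ins invented for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Machine is a simplified stand-in for the real Cluster API machine type.
type Machine struct {
	Name              string
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// requestDeletion mimics what the API server does when the MHC (or a user)
// issues a delete: the object is not removed, it is marked with a timestamp.
func requestDeletion(m *Machine) {
	now := time.Now()
	m.DeletionTimestamp = &now
}

// activeMachines is the filtering any watcher (machineSet, machineDeployment)
// applies: machines being deleted no longer count toward replicas.
func activeMachines(machines []*Machine) []*Machine {
	var out []*Machine
	for _, m := range machines {
		if m.DeletionTimestamp == nil {
			out = append(out, m)
		}
	}
	return out
}

// reconcileDeletion sketches the machine controller's role: once the
// deletionTimestamp is set, drain the node (respecting PDBs, elided here),
// then remove the finalizer so the API server can delete the object.
func reconcileDeletion(m *Machine) {
	if m.DeletionTimestamp == nil {
		return
	}
	// ... drain node, honouring PodDisruptionBudgets ...
	m.Finalizers = nil // finalizer removed; the object can now be garbage-collected
}

func main() {
	pool := []*Machine{
		{Name: "worker-0", Finalizers: []string{"machine.cluster.x-k8s.io"}},
		{Name: "worker-1", Finalizers: []string{"machine.cluster.x-k8s.io"}},
	}
	requestDeletion(pool[0]) // the MHC nominates worker-0 for deletion
	fmt.Println(len(activeMachines(pool)))
	reconcileDeletion(pool[0])
	fmt.Println(len(pool[0].Finalizers))
}
```

The point of the sketch is the division of labour: the MHC only marks; the owning controller only counts unmarked machines; the machine controller alone drains and releases the finalizer.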


There is such an annotation, at

DeleteNodeAnnotation = "cluster.k8s.io/delete-machine"

Member Author
@enxebre enxebre Nov 14, 2019

Ah got you, that's for the machineSet to prioritise deletion during a scale-down operation. A machine with a deletionTimestamp won't even make it to the list of machines being prioritised; it will be filtered out and dismissed when reconciling towards the expected number of replicas.

@enxebre enxebre force-pushed the mhc-proposal branch 6 times, most recently from 33539d6 to cc415c3 on November 7, 2019 18:03
start;
:Machine Health Check controller;
repeat
repeat
Contributor

This nested repeat block doesn't render as I think you probably want in the image. Is there some change you could make?

Member Author

mm I couldn't find a better way by looking here http://plantuml.com/guide


e2e testing as part of the cluster-api e2e test suite.

To fail early, we could consider a test suite leveraging kubemark as a provider to simulate healthy/unhealthy nodes in a cloud-agnostic manner, without the need to bring up a real instance.
Contributor

@thebsdbox had built a fake provider for testing recently, we could potentially leverage that for this type of testing.

@enxebre
Member Author

enxebre commented Nov 26, 2019

@detiber @ncdc how's this looking to you so far, any objections to move forward?

@ncdc
Contributor

ncdc commented Nov 26, 2019

LGTM. I hear what @johngmyers is saying about the MHC potentially fighting with the MD/MS controllers, or not taking max surge and/or unavailable values into consideration, but I'd like to see an implementation of the MHC. Maybe it works flawlessly, or maybe it has some conflicts - either way, let's write some code & test it! 😄

Contributor
@michaelgugino left a comment

We should define some mechanism to disable remediation on a particular node/machine as well.

#### MachineHealthCheck CRD:
- Enable watching a group of machines (based on a label selector).
- Enable defining unhealthy node criteria (based on a list of node conditions).
- Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.
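The short-circuit threshold in the last bullet can be sketched as follows, assuming the "at or above this threshold" semantics quoted above. `maxUnhealthyAllowed` and `remediationAllowed` are hypothetical names invented for this sketch; a real implementation would likely use apimachinery's intstr utilities to parse the int-or-percentage field.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxUnhealthyAllowed parses a maxUnhealthy value that may be either a
// plain int ("2") or a percentage of the total targets in the pool ("40%").
func maxUnhealthyAllowed(maxUnhealthy string, total int) (int, error) {
	if strings.HasSuffix(maxUnhealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnhealthy, "%"))
		if err != nil {
			return 0, err
		}
		return total * pct / 100, nil
	}
	return strconv.Atoi(maxUnhealthy)
}

// remediationAllowed applies the short circuit: if the number of unhealthy
// targets is at or above the threshold, no further remediation takes place.
func remediationAllowed(currentHealthy, total int, maxUnhealthy string) bool {
	threshold, err := maxUnhealthyAllowed(maxUnhealthy, total)
	if err != nil {
		return false // fail safe on a malformed threshold
	}
	return total-currentHealthy < threshold
}

func main() {
	// Matches the example earlier in the thread: 5 expected machines, 40%.
	fmt.Println(remediationAllowed(5, 5, "40%")) // all healthy: remediation proceeds
	fmt.Println(remediationAllowed(3, 5, "40%")) // 2 unhealthy, at threshold: short circuit
}
```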
Contributor

This seems like it could easily be misconfigured. We should look at the total number of unhealthy nodes, not just the nodes in the 'target pool'. Also, we're watching a group of machines, not a group of nodes. Also, it's unclear what should happen if a node/machine is covered in multiple 'groups of machines' mentioned above.

Contributor

I'm not sure I agree here. I would expect users to likely have groups of multiple MachineDeployments and to leverage those as pools of Nodes with different scheduling requirements (GPU availability, public/private app, etc), and as such I would expect to be able to define pertinent health checks against each of these separately.

Contributor

What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.

Also, it's still unclear what should happen if a node is in multiple groups.

Contributor

What I'm talking about is a cluster-wide view of unhealthy nodes rather than a view of just a subset of the nodes when making the determination of whether or not to remediate problematic nodes.

Yes, I understand that. I'm saying that for non-trivial use cases (i.e. one MachineDeployment for all worker nodes) I think this is not the view that we care about. We care more about the interruption of the pool of nodes that are serving individual scheduling concerns rather than the full set of nodes in the cluster.

I suspect this is something that we will need to make sure has plenty of clear documentation around, though.

Also, it's still unclear what should happen if a node is in multiple groups.

This is probably something that should be clarified.

Member Author
@enxebre enxebre Dec 3, 2019

Also, it's still unclear what should happen if a node is in multiple groups.

A node that happens to be covered by more than one MHC is liable to be remediated by any of them satisfying the requirements.

Later on we could discuss things like rejecting multiple groups, rate limiting instead of fully short-circuiting, maybe adding zone awareness and a cluster-wide view, etc. But unless there are strong objections I'd prefer to keep this proposal simple and follow up with RFEs as we start gathering feedback based on an initial tangible implementation.

  namespace: machine-api
spec:
  selector:
    matchLabels:
Contributor

Above it says we're looking at machine labels, but the 'role' label seems to be specific to a node. Is this intended to match a machine object's labels?

Member Author

This is an arbitrary label that matches a group of machines.

@vincepri
Member

vincepri commented Dec 3, 2019

I'll take a final pass by end of week, looking good so far!

@ncdc
Contributor

ncdc commented Dec 11, 2019

Lazy consensus starts now. Expires in 1 week on 12/18.

@ncdc
Contributor

ncdc commented Dec 18, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2019
@k8s-ci-robot k8s-ci-robot merged commit e7652cc into kubernetes-sigs:master Dec 18, 2019