
Define health check strategy for MachineSet #47

Closed
rsdcastro opened this issue Apr 12, 2018 · 20 comments
@rsdcastro

From @rsdcastro on March 1, 2018 21:10

This is to track discussion and documentation on how health checking will be done for machines in a set.

Copied from original issue: kubernetes-retired/kube-deploy#632

@rsdcastro
Author

cc @p0lyn0mial @mrIncompetent

@rsdcastro
Copy link
Author

From @mrIncompetent on March 5, 2018 6:13

Could you explain a bit what "health checking" means in this case?
Is this about defining when and how to update MachineSet.Status?

@rsdcastro
Author

The overall discussion would be similar to pod health checking:

  • A mechanism to health check the machine from time to time to make sure it's up and running well. This could mean that we utilize something specific to the infrastructure provider.
  • For determining when a machine is unhealthy, we could have defaults but allow the user to override them.

Once machines are deemed unhealthy, their status should be reported, and then a system needs to replace them if the user so chooses - an auto-repair functionality.
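
As one strawman for what such user-overridable knobs might look like, here is a minimal sketch in Go; every type and field name is hypothetical and purely illustrative, not a concrete API proposal:

```go
package healthcheck

import "time"

// HealthCheckPolicy is a hypothetical, user-overridable policy capturing
// "when is a machine unhealthy" and "what do we do about it".
type HealthCheckPolicy struct {
	// CheckInterval is how often each machine is probed.
	CheckInterval time.Duration
	// UnhealthyTimeout is how long probes may fail before the machine
	// is declared unhealthy.
	UnhealthyTimeout time.Duration
	// AutoRepair controls whether unhealthy machines are replaced
	// automatically or only surfaced in status.
	AutoRepair bool
}

// defaultPolicy is an example default that a user could override.
var defaultPolicy = HealthCheckPolicy{
	CheckInterval:    1 * time.Minute,
	UnhealthyTimeout: 5 * time.Minute,
	AutoRepair:       false,
}
```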

There are different ways to pursue this, and the item is deliberately vague: we need to list the architecture options here to decide what might be incorporated into the controller and what might not.

Let me know your thoughts on this, especially based on your experience managing machines on other platforms.

@rsdcastro
Author

From @mrIncompetent on March 5, 2018 18:38

I would propose that we rely on Machine.Status.NodeRef & the node conditions in the first place.
The Node object already has health checking in place; if that's not enough for us, we might simply improve on that instead of creating a new way of health checking instances/nodes.
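
A minimal sketch of that approach, assuming the caller has already resolved Machine.Status.NodeRef to the backing corev1.Node with whatever client the controller uses:

```go
package health

import corev1 "k8s.io/api/core/v1"

// nodeReady reports whether the Node's NodeReady condition is True,
// reusing the health checking the node controller already performs
// rather than inventing a new probe.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	// No Ready condition recorded yet: treat as not (yet) healthy.
	return false
}
```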

@rsdcastro
Author

I agree with the suggestion to build on node health. It might not be enough, though.

Quoting https://kubernetes.io/docs/concepts/architecture/nodes/:

The second is keeping the node controller’s internal list of nodes up to date with the cloud provider’s list of available machines. When running in a cloud environment, whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.

I'd like to understand what "asks the cloud provider if the VM for that node is still available" means and how that could be related to our work.

The third is monitoring the nodes’ health. The node controller is responsible for updating the NodeReady condition of NodeStatus to ConditionUnknown when a node becomes unreachable (i.e. the node controller stops receiving heartbeats for some reason, e.g. due to the node being down), and then later evicting all the pods from the node (using graceful termination) if the node continues to be unreachable. (The default timeouts are 40s to start reporting ConditionUnknown and 5m after that to start evicting pods.) The node controller checks the state of each node every --node-monitor-period seconds.

In the case where the node is unhealthy (unreachable), the node controller is smart enough to evict its pods. That also means that, in a machine set, we'll have one fewer machine than the user intended.

Do we do anything automatically at that point? Or do we have user settings to determine when to do something? Thoughts?

@rsdcastro rsdcastro added this to the cluster-api-beta-implementation milestone Apr 13, 2018
@k4leung4
Contributor

The basic MachineSet status is now being populated (PR #180), leveraging the NodeRef and Node conditions.
As mentioned by @rsdcastro, this might not be enough.

@hardikdr
Member

We recently had a discussion around the machine-health strategy in the wg call.

With regard to that, I am wondering whether it would be better for the MachineSet controller to rely only on the Machine object for machine-health status, and not fetch the Node object at all. Basically, we would put the NodeConditions into MachineStatus (see the sketch after this list).

A couple of supporting points:

  • Separation of responsibilities:

    • MachineDeployment --talks to--> MachineSet --> Machine --> Node/Cloud-provider.
    • Otherwise the MachineDeployment/MachineSet controllers, or any future higher-level controllers, would need to talk to the Node object directly.
      • Essentially, if the machine-api stack and the Node objects live in 2 different clusters, 2 API endpoints [kubeconfigs] would otherwise need to be fed to the controller-manager.
  • MachineStatus already contains a ProviderStatus field (link). MachineSet and other higher-level controllers would in the future also want to look at this ProviderStatus field to get signals from cloud-provider-specific errors/status. The Machine object could then become a one-stop shop for all higher-level controllers if it contains the NodeConditions as well.

  • Temporary unavailability of Node objects:

    • If I understand correctly, kube-controller-manager takes care of not deregistering the Node object if the corresponding machine is rebooted or unavailable for temporary reasons - but that only covers cloud providers supported by kcm.
    • For instance, in the bare-metal case, if the Node object disappears for a small time period [a reboot?], the MachineSet should not recreate the Machine but rather wait for a configurable timeout - which is possible only if we have a last-seen status [via NodeConditions] on the Machine object.

In general, for the machine-health strategy, we can categorize health problems as permanent and temporary, as NPD does:

  • Permanent problems could be DiskPressure, KubeletNotReady for X minutes, and so on.
  • Temporary problems could be docker/kubelet/systemd restarts - we could draw on NPD's problem domains and take in-place actions there in the future.
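
A rough sketch of what mirroring the conditions into MachineStatus could look like; the NodeConditions and LastUpdated fields are illustrative, not the actual cluster-api types:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// MachineStatus as sketched here mirrors Node conditions so that
// higher-level controllers only ever need to read Machine objects.
type MachineStatus struct {
	// NodeRef points at the backing Node, as today.
	NodeRef *corev1.ObjectReference `json:"nodeRef,omitempty"`
	// ProviderStatus carries provider-specific status, as today.
	ProviderStatus *runtime.RawExtension `json:"providerStatus,omitempty"`
	// NodeConditions (hypothetical) mirrors the latest observed Node
	// conditions, so MachineSet/MachineDeployment never need a second
	// API endpoint to reach the Node object.
	NodeConditions []corev1.NodeCondition `json:"nodeConditions,omitempty"`
	// LastUpdated (hypothetical) records when the mirror was refreshed,
	// giving a "last seen" signal even if the Node briefly disappears.
	LastUpdated *metav1.Time `json:"lastUpdated,omitempty"`
}
```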

@roberthbailey
Contributor

Thanks @hardikdr -- that is a great summary. A couple of points of clarification:

  • I like the idea of the MachineSet controller only needing to watch Machines and not have dependencies on Nodes (for the reasons you mentioned).
  • I dislike the idea of the MachineSet controller needing to decode ProviderStatus on Machines, because that makes it dependent on specific implementations of Machine; instead, if there are fields that need to be part of the common Machine API, they should be promoted (see the sketch after this list).
  • The NodeController (currently in kube-controller-manager but probably moving to cloud-controller-manager once that exists) deletes Node objects if the corresponding virtual machine is removed (in environments where this behavior doesn't exist, the Node object is left indefinitely). If the machine is rebooted or becomes unavailable the Node object won't be deleted because the virtual machine will still exist when querying the cloud provider.
  • The MachineSet controller won't know if the underlying virtual machine is still listed in the cloud provider, since it only knows about Machines. The MachineController should be responsible for making sure a virtual machine exists that matches the machine spec. I'm wondering if the waiting logic would be a better fit in the MachineController rather than the MachineSet controller.
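
To illustrate the "promote common fields" point: with a hypothetical promoted phase field on the common Machine API (the real API may end up looking different), the MachineSet controller's health decision stays provider-agnostic:

```go
package controller

// MachinePhase is a hypothetical promoted field on the common Machine
// API, filled in by each provider's machine controller.
type MachinePhase string

const (
	MachinePhaseRunning MachinePhase = "Running"
	MachinePhaseFailed  MachinePhase = "Failed"
)

// needsReplacement is all the MachineSet controller would have to
// evaluate; it never decodes ProviderStatus or touches Node objects.
func needsReplacement(phase MachinePhase) bool {
	return phase == MachinePhaseFailed
}
```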

@timothysc timothysc modified the milestones: cluster-api-beta-implementation, Next, v1alpha1 Jan 10, 2019
@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 10, 2019
@timothysc
Member

@hardikdr ping

@hardikdr
Member

Thanks @roberthbailey @timothysc for the comments.
I see the following major objectives to cover as part of a basic machine-health strategy.

  1. Replace unhealthy machines after a configurable timeout period [~X mins].

    • Basically, the machine controller declares the machine as Failed if the kubelet does not respond for more than X mins.
    • The MachineSet controller creates a fresh Machine object and deletes the old one.
    • E.g. the user accidentally deletes the machine from the cloud provider, or the machine becomes unhealthy for other reasons.
  2. Replace machines with high disk pressure after a configurable timeout period [~Y mins].

    • The machine controller declares the machine as Failed if disk pressure has been high on a machine for the last Y mins.
    • As above, the MachineSet controller replaces the old one with a fresh Machine object.
    • E.g. a noisy application fills up the disk, and new applications cannot run properly there.
      Generally Y > X, as we might want to give applications some time to garbage-collect the mess, where implemented. Probable defaults could be X=10min, Y=30min.

Both of these health issues can be detected from the NodeConditions in MachineStatus; a sketch of the timeout logic follows below. In the future we might want to deal with machine melt-down situations, but we can keep that for later iterations.
If we find both cases suitable, we could already implement the basic feature with the current machine-api.
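
A minimal sketch of both objectives, assuming the NodeConditions are mirrored into MachineStatus as discussed above; the function name and the X/Y defaults are illustrative:

```go
package health

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	notReadyTimeout     = 10 * time.Minute // X: kubelet unresponsive
	diskPressureTimeout = 30 * time.Minute // Y: sustained disk pressure
)

// shouldMarkFailed decides whether the machine controller should declare
// the machine Failed, prompting the MachineSet controller to replace it.
func shouldMarkFailed(conds []corev1.NodeCondition, now time.Time) bool {
	for _, c := range conds {
		switch c.Type {
		case corev1.NodeReady:
			// Objective 1: kubelet not Ready (False or Unknown) for > X.
			if c.Status != corev1.ConditionTrue &&
				now.Sub(c.LastTransitionTime.Time) > notReadyTimeout {
				return true
			}
		case corev1.NodeDiskPressure:
			// Objective 2: disk pressure sustained for > Y.
			if c.Status == corev1.ConditionTrue &&
				now.Sub(c.LastTransitionTime.Time) > diskPressureTimeout {
				return true
			}
		}
	}
	return false
}
```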

@roberthbailey
Contributor

Basically, the machine controller declares the machine as Failed if the kubelet does not respond for more than X mins.

Many of the cloud providers do something like this already, where they delete the Node object (and kill the pods) if a kubelet stops sending heartbeats. I think the threshold on GCE, at least, is set to 5 minutes. Which means this would never get hit: nodes would be deleted before they could become 10 minutes stale.

@hardikdr
Member

We discussed this in the last meeting.
It appears we would need a way to eventually delete the machine from the cloud provider in any case.
The conclusion seems to be to implement both of the objectives mentioned above for the first cut of the health strategy.

@ncdc
Contributor

ncdc commented Mar 5, 2019

Are there any action items for this issue for v1alpha1?

@ncdc
Contributor

ncdc commented Mar 6, 2019

As discussed in the Cluster API meeting today, this is not required for v1alpha1.

/milestone Next

@k8s-ci-robot k8s-ci-robot modified the milestones: v1alpha1, Next Mar 6, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@detiber
Member

detiber commented Jun 4, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@vincepri
Member

/area health

@timothysc
Member

@vincepri - could you please re-eval when noderef lands.

@vincepri
Member

This can probably be closed, given that #1011 has been merged and #1052 is up for review

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

This can probably be closed, given that #1011 has been merged and #1052 is up for review

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

chuckha pushed a commit to chuckha/cluster-api that referenced this issue Oct 2, 2019: Add issue and pr templates from capa (…pr-templates)