
Define health check strategy for MachineSet #47

Closed
rsdcastro opened this issue Apr 12, 2018 · 20 comments
@rsdcastro

From @rsdcastro on March 1, 2018 21:10

This is to track discussion and documentation on how health checking will be done for machines in a set.

Copied from original issue: kubernetes-retired/kube-deploy#632

@rsdcastro
Author

cc @p0lyn0mial @mrIncompetent

@rsdcastro
Copy link
Author

From @mrIncompetent on March 5, 2018 6:13

Could you explain a bit what "health checking" means in this case?
Is this about defining when and how to update MachineSet.Status?

@rsdcastro
Author

The overall discussion would be similar to pod health checking:

  • A mechanism to health check the machine from time to time to make sure it's up and running well. This could mean that we utilize something specific to the infrastructure provider.
  • For determining when a machine is unhealthy, we could have defaults but allow the user to override them.

Once machines are deemed unhealthy, their status should be reported, and then a system needs to replace them if the user so chooses - an auto-repair functionality.
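
As one strawman for what such user-overridable knobs might look like, here is a minimal sketch in Go; every type and field name is hypothetical and purely illustrative, not a concrete API proposal:

```go
package healthcheck

import "time"

// HealthCheckPolicy is a hypothetical, user-overridable policy capturing
// "when is a machine unhealthy" and "what do we do about it".
type HealthCheckPolicy struct {
	// CheckInterval is how often each machine is probed.
	CheckInterval time.Duration
	// UnhealthyTimeout is how long probes may fail before the machine
	// is declared unhealthy.
	UnhealthyTimeout time.Duration
	// AutoRepair controls whether unhealthy machines are replaced
	// automatically or only surfaced in status.
	AutoRepair bool
}

// defaultPolicy is an example default that a user could override.
var defaultPolicy = HealthCheckPolicy{
	CheckInterval:    1 * time.Minute,
	UnhealthyTimeout: 5 * time.Minute,
	AutoRepair:       false,
}
```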

There are different ways to pursue this, and the item is deliberately vague: we need to list the architecture options here to decide what might be incorporated into the controller and what might not.

Let me know your thoughts on this, especially based on your experience managing machines on other platforms.

@rsdcastro
Author

From @mrIncompetent on March 5, 2018 18:38

I would propose that we rely on Machine.Status.NodeRef & the node conditions in the first place.
The Node object already has health checking in place; if that's not enough for us, we might simply improve on that instead of creating a new way of health checking instances/nodes.
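
A minimal sketch of that approach, assuming the caller has already resolved Machine.Status.NodeRef to the backing corev1.Node with whatever client the controller uses:

```go
package health

import corev1 "k8s.io/api/core/v1"

// nodeReady reports whether the Node's NodeReady condition is True,
// reusing the health checking the node controller already performs
// rather than inventing a new probe.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	// No Ready condition recorded yet: treat as not (yet) healthy.
	return false
}
```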

@rsdcastro
Author

I agree with the suggestion to build on node health. It might not be enough, though.

Quoting https://kubernetes.io/docs/concepts/architecture/nodes/:

The second is keeping the node controller’s internal list of nodes up to date with the cloud provider’s list of available machines. When running in a cloud environment, whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.

I'd like to understand what "asks the cloud provider if the VM for that node is still available" means and how that could be related to our work.

The third is monitoring the nodes’ health. The node controller is responsible for updating the NodeReady condition of NodeStatus to ConditionUnknown when a node becomes unreachable (i.e. the node controller stops receiving heartbeats for some reason, e.g. due to the node being down), and then later evicting all the pods from the node (using graceful termination) if the node continues to be unreachable. (The default timeouts are 40s to start reporting ConditionUnknown and 5m after that to start evicting pods.) The node controller checks the state of each node every --node-monitor-period seconds.

In the case where the node is unhealthy (unreachable), the node controller is smart enough to evict its pods. That also means that, in a machine set, we'll have one fewer machine than the user intended.

Do we do anything automatically at that point? Or do we have user settings to determine when to do something? Thoughts?

@rsdcastro rsdcastro added this to the cluster-api-beta-implementation milestone Apr 13, 2018
@k4leung4
Contributor

The basic MachineSet status is now being populated (PR #180), leveraging the NodeRef and Node conditions.
As mentioned by @rsdcastro, this might not be enough.

@hardikdr
Member

We recently had a discussion around the machine-health strategy in the wg call.

With regard to that, I am wondering whether it would be better for the MachineSet controller to rely only on the Machine object for machine-health status, and not fetch the Node object at all. Basically, we would put the NodeConditions into MachineStatus (see the sketch after this list).

A couple of supporting points:

  • Separation of responsibilities:

    • MachineDeployment --talks to--> MachineSet --> Machine --> Node/Cloud-provider.
    • Otherwise the MachineDeployment/MachineSet controllers, or any future higher-level controllers, would need to talk to the Node object directly.
      • Essentially, if the machine-api stack and the Node objects live in 2 different clusters, 2 API endpoints [kubeconfigs] would otherwise need to be fed to the controller-manager.
  • MachineStatus already contains a ProviderStatus field (link). MachineSet and other higher-level controllers would in the future also want to look at this ProviderStatus field to get signals from cloud-provider-specific errors/status. The Machine object could then become a one-stop shop for all higher-level controllers if it contains the NodeConditions as well.

  • Temporary unavailability of Node objects:

    • If I understand correctly, kube-controller-manager takes care of not deregistering the Node object if the corresponding machine is rebooted or unavailable for temporary reasons - but that only covers cloud providers supported by kcm.
    • For instance, in the bare-metal case, if the Node object disappears for a small time period [a reboot?], the MachineSet should not recreate the Machine but rather wait for a configurable timeout - which is possible only if we have a last-seen status [via NodeConditions] on the Machine object.

In general, for the machine-health strategy, we can categorize health problems as permanent and temporary, as NPD does:

  • Permanent problems could be DiskPressure, KubeletNotReady for X minutes, and so on.
  • Temporary problems could be docker/kubelet/systemd restarts - we could draw on NPD's problem domains and take in-place actions there in the future.
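
A rough sketch of what mirroring the conditions into MachineStatus could look like; the NodeConditions and LastUpdated fields are illustrative, not the actual cluster-api types:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// MachineStatus as sketched here mirrors Node conditions so that
// higher-level controllers only ever need to read Machine objects.
type MachineStatus struct {
	// NodeRef points at the backing Node, as today.
	NodeRef *corev1.ObjectReference `json:"nodeRef,omitempty"`
	// ProviderStatus carries provider-specific status, as today.
	ProviderStatus *runtime.RawExtension `json:"providerStatus,omitempty"`
	// NodeConditions (hypothetical) mirrors the latest observed Node
	// conditions, so MachineSet/MachineDeployment never need a second
	// API endpoint to reach the Node object.
	NodeConditions []corev1.NodeCondition `json:"nodeConditions,omitempty"`
	// LastUpdated (hypothetical) records when the mirror was refreshed,
	// giving a "last seen" signal even if the Node briefly disappears.
	LastUpdated *metav1.Time `json:"lastUpdated,omitempty"`
}
```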

@roberthbailey
Contributor

Thanks @hardikdr -- that is a great summary. A couple of points of clarification:

  • I like the idea of the MachineSet controller only needing to watch Machines and not have dependencies on Nodes (for the reasons you mentioned).
  • I dislike the idea of the MachineSet controller needing to decode ProviderStatus on Machines, because that makes it dependent on specific implementations of Machine; instead, if there are fields that need to be part of the common Machine API, they should be promoted (see the sketch after this list).
  • The NodeController (currently in kube-controller-manager but probably moving to cloud-controller-manager once that exists) deletes Node objects if the corresponding virtual machine is removed (in environments where this behavior doesn't exist, the Node object is left indefinitely). If the machine is rebooted or becomes unavailable the Node object won't be deleted because the virtual machine will still exist when querying the cloud provider.
  • The MachineSet controller won't know if the underlying virtual machine is still listed in the cloud provider, since it only knows about Machines. The MachineController should be responsible for making sure a virtual machine exists that matches the machine spec. I'm wondering if the waiting logic would be a better fit in the MachineController rather than the MachineSet controller.
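
To illustrate the "promote common fields" point: with a hypothetical promoted phase field on the common Machine API (the real API may end up looking different), the MachineSet controller's health decision stays provider-agnostic:

```go
package controller

// MachinePhase is a hypothetical promoted field on the common Machine
// API, filled in by each provider's machine controller.
type MachinePhase string

const (
	MachinePhaseRunning MachinePhase = "Running"
	MachinePhaseFailed  MachinePhase = "Failed"
)

// needsReplacement is all the MachineSet controller would have to
// evaluate; it never decodes ProviderStatus or touches Node objects.
func needsReplacement(phase MachinePhase) bool {
	return phase == MachinePhaseFailed
}
```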

@timothysc timothysc modified the milestones: cluster-api-beta-implementation, Next, v1alpha1 Jan 10, 2019
@timothysc timothysc added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 10, 2019
@timothysc
Member

@hardikdr ping

@hardikdr
Member

Thanks @roberthbailey @timothysc for the comments.
I see the following major objectives to cover as part of a basic machine-health strategy.

  1. Replace unhealthy machines after a configurable timeout period [~X mins].

    • Basically, the machine controller declares the machine as Failed if the kubelet does not respond for more than X mins.
    • The MachineSet controller creates a fresh Machine object and deletes the old one.
    • E.g. the user accidentally deletes the machine from the cloud provider, or the machine becomes unhealthy for other reasons.
  2. Replace machines with high disk pressure after a configurable timeout period [~Y mins].

    • The machine controller declares the machine as Failed if disk pressure has been high on a machine for the last Y mins.
    • As above, the MachineSet controller replaces the old one with a fresh Machine object.
    • E.g. a noisy application fills up the disk, and new applications cannot run properly there.
      Generally Y > X, as we might want to give applications some time to garbage-collect the mess, where implemented. Probable defaults could be X=10min, Y=30min.

Both of these health issues can be detected from the NodeConditions in MachineStatus; a sketch of the timeout logic follows below. In the future we might want to deal with machine melt-down situations, but we can keep that for later iterations.
If we find both cases suitable, we could already implement the basic feature with the current machine-api.
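
A minimal sketch of both objectives, assuming the NodeConditions are mirrored into MachineStatus as discussed above; the function name and the X/Y defaults are illustrative:

```go
package health

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const (
	notReadyTimeout     = 10 * time.Minute // X: kubelet unresponsive
	diskPressureTimeout = 30 * time.Minute // Y: sustained disk pressure
)

// shouldMarkFailed decides whether the machine controller should declare
// the machine Failed, prompting the MachineSet controller to replace it.
func shouldMarkFailed(conds []corev1.NodeCondition, now time.Time) bool {
	for _, c := range conds {
		switch c.Type {
		case corev1.NodeReady:
			// Objective 1: kubelet not Ready (False or Unknown) for > X.
			if c.Status != corev1.ConditionTrue &&
				now.Sub(c.LastTransitionTime.Time) > notReadyTimeout {
				return true
			}
		case corev1.NodeDiskPressure:
			// Objective 2: disk pressure sustained for > Y.
			if c.Status == corev1.ConditionTrue &&
				now.Sub(c.LastTransitionTime.Time) > diskPressureTimeout {
				return true
			}
		}
	}
	return false
}
```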

@roberthbailey
Contributor

Basically, the machine controller declares the machine as Failed if the kubelet does not respond for more than X mins.

Many of the cloud providers do something like this already, where they delete the Node object (and kill the pods) if a kubelet stops sending heartbeats. I think the threshold on GCE, at least, is set to 5 minutes. Which means this would never get hit: nodes would be deleted before they could become 10 minutes stale.

@hardikdr
Member

We discussed this in the last meeting.
It appears we would need a way to eventually delete the machine from the cloud provider in any case.
The conclusion seems to be to implement both of the objectives mentioned above for the first cut of the health strategy.

@ncdc
Contributor

ncdc commented Mar 5, 2019

Are there any action items for this issue for v1alpha1?

@ncdc
Contributor

ncdc commented Mar 6, 2019

As discussed in the Cluster API meeting today, this is not required for v1alpha1.

/milestone Next

@k8s-ci-robot k8s-ci-robot modified the milestones: v1alpha1, Next Mar 6, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@detiber
Member

detiber commented Jun 4, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 4, 2019
@vincepri
Member

/area health

@timothysc
Member

@vincepri - could you please re-eval when noderef lands.

@vincepri
Member

This can probably be closed, given that #1011 has been merged and #1052 is up for review

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

This can probably be closed, given that #1011 has been merged and #1052 is up for review

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

chuckha pushed a commit to chuckha/cluster-api that referenced this issue Oct 2, 2019: Add issue and pr templates from capa (…pr-templates)