Define health check strategy for MachineSet #47
Comments
From @mrIncompetent on March 5, 2018 6:13 Could you explain a bit what "health checking" means in this case?
The overall discussion would be similar to pod health checking:
Once machines are unhealthy, status should be reported and then a system needs to replace them if the user so chooses - an auto-repair functionality. There are different ways to pursue this, and the item is vague on purpose as we need to list the architecture options here to decide what might be incorporated into the controller and what might not. Let me know your thoughts on this, especially based on your experience managing machines on other platforms.
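To make the auto-repair idea concrete, here is a minimal sketch in Go, assuming a hypothetical opt-in setting on the MachineSet and an unhealthy-timeout. None of these types or fields are the actual API; they only illustrate one possible shape of a remediation loop.

```go
// Hypothetical sketch of an auto-repair loop: report unhealthy machines and,
// if the user opted in, return the ones that should be deleted so the
// MachineSet controller recreates them. All types and fields are placeholders.
package main

import (
	"fmt"
	"time"
)

type Machine struct {
	Name           string
	Healthy        bool
	LastTransition time.Time // when the machine last changed health state
}

type MachineSet struct {
	Name       string
	AutoRepair bool // hypothetical user setting to opt into remediation
	Machines   []Machine
}

// remediate reports unhealthy machines and returns the names of those that
// have been unhealthy longer than the timeout and should be replaced.
func remediate(ms MachineSet, unhealthyTimeout time.Duration, now time.Time) []string {
	var toDelete []string
	for _, m := range ms.Machines {
		if m.Healthy {
			continue
		}
		fmt.Printf("machine %s unhealthy since %s\n", m.Name, m.LastTransition.Format(time.RFC3339))
		if ms.AutoRepair && now.Sub(m.LastTransition) > unhealthyTimeout {
			toDelete = append(toDelete, m.Name)
		}
	}
	return toDelete
}

func main() {
	ms := MachineSet{
		Name:       "workers",
		AutoRepair: true,
		Machines: []Machine{
			{Name: "workers-abc", Healthy: true, LastTransition: time.Now()},
			{Name: "workers-def", Healthy: false, LastTransition: time.Now().Add(-15 * time.Minute)},
		},
	}
	fmt.Println("replace:", remediate(ms, 10*time.Minute, time.Now()))
}
```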
From @mrIncompetent on March 5, 2018 18:38 I would propose that we rely on the node's health status.
I agree with the suggestion to build on node health. It might not be enough, though. Quoting https://kubernetes.io/docs/concepts/architecture/nodes/:
I'd like to understand what "asks the cloud provider if the VM for that node is still available" means and how that could be related to our work (a rough sketch of one interpretation follows after this comment).
In the case the node is unhealthy (unreachable), the scheduler is smart enough to evict pods. That also means that, in a MachineSet, we'll have one fewer machine than the user intended. Do we do anything automatically at that point? Or do we have user settings to determine when to do something? Thoughts?
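Not the actual node-controller code, but a rough sketch of what "ask the cloud provider whether the VM still exists" could look like. The CloudInstances interface and fakeCloud are hypothetical stand-ins for the real cloud-provider API.

```go
// Rough sketch: if a node is not ready and the provider reports that the
// backing VM is gone, the Node object should be removed. CloudInstances and
// fakeCloud are hypothetical, not the real cloud-provider interface.
package main

import "fmt"

// CloudInstances is a hypothetical abstraction over a provider API.
type CloudInstances interface {
	// InstanceExists reports whether the backing VM for a node still exists.
	InstanceExists(providerID string) (bool, error)
}

type fakeCloud struct{ existing map[string]bool }

func (f fakeCloud) InstanceExists(providerID string) (bool, error) {
	return f.existing[providerID], nil
}

// shouldDeleteNode mirrors the documented behavior: only delete the Node
// object when it is not ready and the provider says the VM no longer exists.
func shouldDeleteNode(cloud CloudInstances, providerID string, nodeReady bool) (bool, error) {
	if nodeReady {
		return false, nil
	}
	exists, err := cloud.InstanceExists(providerID)
	if err != nil {
		return false, err
	}
	return !exists, nil
}

func main() {
	cloud := fakeCloud{existing: map[string]bool{"gce://proj/zone/vm-1": true}}
	del, _ := shouldDeleteNode(cloud, "gce://proj/zone/vm-2", false) // VM gone, node not ready
	fmt.Println("delete node:", del)
}
```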
The basic MachineSet status is now being populated (PR #180), leveraging the NodeRef and Node conditions.
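For illustration, a hedged sketch of how a controller could derive a ready-replica count by following the NodeRef to the Node's Ready condition. The Machine type below is a simplified stand-in for the real API type; only the corev1 types are the real Kubernetes ones.

```go
// A hedged sketch: count ready replicas by following each Machine's NodeRef
// to the referenced Node's Ready condition. Machine is a simplified stand-in
// for the real API type.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Machine is a simplified stand-in for the Cluster API Machine object.
type Machine struct {
	Name    string
	NodeRef *corev1.ObjectReference // nil until the machine is linked to a node
}

// nodeIsReady checks the Node's Ready condition.
func nodeIsReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// countReady returns how many machines reference a Ready node.
func countReady(machines []Machine, nodesByName map[string]*corev1.Node) int {
	ready := 0
	for _, m := range machines {
		if m.NodeRef == nil {
			continue // node not linked yet, treat as not ready
		}
		if node, ok := nodesByName[m.NodeRef.Name]; ok && nodeIsReady(node) {
			ready++
		}
	}
	return ready
}

func main() {
	node := &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
		{Type: corev1.NodeReady, Status: corev1.ConditionTrue},
	}}}
	machines := []Machine{{Name: "m-0", NodeRef: &corev1.ObjectReference{Name: "node-0"}}}
	fmt.Println("ready replicas:", countReady(machines, map[string]*corev1.Node{"node-0": node}))
}
```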
We recently had a discussion around the machine-health strategy in the wg call. With regard to that, I am wondering whether it would be better if the MachineSet controller relied only on the Machine object for machine-health status and did not fetch the Node object. Basically, we would put the NodeConditions into MachineStatus. A couple of supporting pointers could be:
In general, for the machine-health strategy we can categorize health problems as permanent and temporary, as the Node Problem Detector (NPD) does (a sketch of this follows below).
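A minimal sketch, assuming the proposal above: the Machine's status carries a copy of the Node's conditions so the MachineSet controller never fetches the Node object, with a naive permanent/temporary split loosely modeled on NPD-style conditions. The MachineStatus type, condition names, and classification rules are illustrative, not actual defaults.

```go
// Minimal sketch of the proposal: MachineStatus embeds the Node's conditions,
// and problems are split into permanent vs. temporary (NPD-style). The
// condition names and the classification policy are illustrative only.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// MachineStatus is an illustrative status that embeds the Node's conditions.
type MachineStatus struct {
	NodeConditions []corev1.NodeCondition
}

// isPermanentProblem is a naive classification: conditions that normally do
// not resolve on their own are treated as permanent, pressure conditions as
// temporary. A real policy would likely be configurable.
func isPermanentProblem(c corev1.NodeCondition) bool {
	switch c.Type {
	case "KernelDeadlock", "ReadonlyFilesystem": // NPD-style custom conditions
		return true
	case corev1.NodeMemoryPressure, corev1.NodeDiskPressure:
		return false
	default:
		return false
	}
}

// classify returns the active (Status=True) problem conditions, split into
// permanent and temporary buckets. Ready=True is healthy and skipped.
func classify(s MachineStatus) (permanent, temporary []corev1.NodeConditionType) {
	for _, c := range s.NodeConditions {
		if c.Status != corev1.ConditionTrue || c.Type == corev1.NodeReady {
			continue
		}
		if isPermanentProblem(c) {
			permanent = append(permanent, c.Type)
		} else {
			temporary = append(temporary, c.Type)
		}
	}
	return permanent, temporary
}

func main() {
	s := MachineStatus{NodeConditions: []corev1.NodeCondition{
		{Type: "KernelDeadlock", Status: corev1.ConditionTrue},
		{Type: corev1.NodeMemoryPressure, Status: corev1.ConditionTrue},
	}}
	p, t := classify(s)
	fmt.Println("permanent:", p, "temporary:", t)
}
```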
Thanks @hardikdr -- that is a great summary. A couple of points of clarification:
@hardikdr ping
Thanks @roberthbailey @timothysc for comments.
Both of these health issues can be learned from the NodeConditions in MachineStatus. In the future we might want to deal with machine melt-down situations, but we could keep that for later iterations.
Many of the cloud providers do something like this already, where they delete a Node object (and kill the pods) if a kubelet stops sending heartbeats. I think the threshold on GCE, at least, is set to 5 minutes, which means that this would never get hit: nodes would be deleted before they could become 10 minutes stale.
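Just to spell out the timing argument, a tiny sketch with the thresholds from the comment above (they are taken from the discussion, not from actual product defaults):

```go
// Illustrative only: with a ~5 minute heartbeat-based node deletion and a
// 10 minute "stale" rule, the stale rule never fires because the node is
// deleted first. Thresholds mirror the numbers discussed above.
package main

import (
	"fmt"
	"time"
)

func main() {
	cloudDeleteThreshold := 5 * time.Minute // node deleted once heartbeats stop this long
	staleThreshold := 10 * time.Minute      // hypothetical "machine is stale" rule

	lastHeartbeat := time.Now().Add(-7 * time.Minute)
	age := time.Since(lastHeartbeat)
	fmt.Println("deleted by cloud integration:", age > cloudDeleteThreshold) // true at 7m
	fmt.Println("marked stale:", age > staleThreshold)                       // false: deleted first
}
```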
We discussed this in the last meeting.
Are there any action items for this issue for v1alpha1?
As discussed in the Cluster API meeting today, this is not required for v1alpha1.
/milestone Next
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
/area health
@vincepri - could you please re-evaluate this when NodeRef lands?
@vincepri: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
From @rsdcastro on March 1, 2018 21:10
This is to track discussion and documentation on how health checking will be done for machines in a set.
Copied from original issue: kubernetes-retired/kube-deploy#632