This repository has been archived by the owner on May 22, 2020. It is now read-only.

Define health check strategy for MachineSet #632

Closed
rsdcastro opened this issue Mar 1, 2018 · 6 comments

Comments

@rsdcastro

This issue tracks discussion and documentation on how health checking will be done for machines in a MachineSet.

@rsdcastro rsdcastro added this to the cluster-api-beta milestone Mar 1, 2018
@rsdcastro
Author

cc @p0lyn0mial @mrIncompetent

@mrIncompetent
Contributor

Could you explain a bit what "health checking" means in this case?
Is this about defining when and how to update MachineSet.Status?

@rsdcastro
Author

The overall discussion would be similar to pod health checking:

  • A mechanism to health check each machine from time to time to make sure it's up and running. This could mean using something specific to the infrastructure provider.
  • Criteria for determining that a machine is unhealthy; we could have defaults but allow the user to override them.

Once machines are deemed unhealthy, their status should be reported, and then a system needs to replace them if the user so chooses (an auto-repair functionality).
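
As a rough illustration of the defaults-plus-overrides idea, here is a minimal sketch (all names and values are hypothetical, not an existing API) of what a per-MachineSet health-check policy and the auto-repair decision could look like:

```go
// Hypothetical sketch only: user-overridable health-check settings for a
// MachineSet, with defaults applied when fields are left unset.
package main

import (
	"fmt"
	"time"
)

// HealthCheckPolicy holds the knobs a user could override per MachineSet.
type HealthCheckPolicy struct {
	// UnhealthyTimeout is how long a machine may stay unhealthy before we act.
	UnhealthyTimeout time.Duration
	// AutoRepair controls whether unhealthy machines are replaced automatically.
	AutoRepair bool
}

// withDefaults fills in defaults for fields the user left unset.
func (p HealthCheckPolicy) withDefaults() HealthCheckPolicy {
	if p.UnhealthyTimeout == 0 {
		p.UnhealthyTimeout = 5 * time.Minute // illustrative default
	}
	return p
}

// needsRepair decides whether a machine that became unhealthy at the given
// time should be replaced now.
func needsRepair(p HealthCheckPolicy, unhealthySince, now time.Time) bool {
	p = p.withDefaults()
	return p.AutoRepair && now.Sub(unhealthySince) >= p.UnhealthyTimeout
}

func main() {
	policy := HealthCheckPolicy{AutoRepair: true}
	since := time.Now().Add(-10 * time.Minute)
	fmt.Println(needsRepair(policy, since, time.Now())) // true: unhealthy longer than the default timeout
}
```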

There are different ways to pursue this, and the item is intentionally vague, as we need to list the architecture options here and decide what should be incorporated into the controller and what should not.

Let me know your thoughts on this, especially based on your experience managing machines on other platforms.

@mrIncompetent
Contributor

I would propose that we rely on Machine.Status.NodeRef and the node conditions in the first place.
The Node object already has health checking in place; if that's not enough for us, we could improve on that instead of creating a new way of health checking instances/nodes.
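
A minimal sketch of that direction, assuming the controller has already resolved Machine.Status.NodeRef to the corresponding Node object (the lookup itself is omitted):

```go
// Sketch: derive a machine's health purely from the Node object referenced by
// Machine.Status.NodeRef, using the conditions the node controller maintains.
package health

import corev1 "k8s.io/api/core/v1"

// nodeIsReady reports whether the node's Ready condition is currently True.
// A missing Ready condition is treated as not ready.
func nodeIsReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```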

@rsdcastro
Author

I agree with the suggestion to build on node health. It might not be enough, though.

Quoting https://kubernetes.io/docs/concepts/architecture/nodes/:

The second is keeping the node controller’s internal list of nodes up to date with the cloud provider’s list of available machines. When running in a cloud environment, whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.

I'd like to understand what "asks the cloud provider if the VM for that node is still available" means and how that could be related to our work.

The third is monitoring the nodes’ health. The node controller is responsible for updating the NodeReady condition of NodeStatus to ConditionUnknown when a node becomes unreachable (i.e. the node controller stops receiving heartbeats for some reason, e.g. due to the node being down), and then later evicting all the pods from the node (using graceful termination) if the node continues to be unreachable. (The default timeouts are 40s to start reporting ConditionUnknown and 5m after that to start evicting pods.) The node controller checks the state of each node every --node-monitor-period seconds.

In the case where the node is unhealthy (unreachable), the node controller is smart enough to evict its pods. That also means that, in a MachineSet, we'll have one fewer working machine than the user intended.

Do we do anything automatically at that point? Or do we have user settings to determine when to do something? Thoughts?
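
To make the question concrete, here is a sketch of one option (the grace period and function names are hypothetical, not a settled design): reuse the NodeReady condition the node controller already maintains, and only replace the machine once the node has been unreachable longer than a configurable grace period.

```go
// Sketch: decide whether a node that the node controller has marked Unknown
// has been unreachable long enough for the MachineSet controller to act.
package health

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// unreachableFor returns how long the node's Ready condition has been
// Unknown, or zero if the node is not currently marked Unknown.
func unreachableFor(node *corev1.Node, now time.Time) time.Duration {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionUnknown {
			return now.Sub(cond.LastTransitionTime.Time)
		}
	}
	return 0
}

// shouldReplace is one possible user-overridable policy: only act after the
// node has been unreachable longer than the given grace period.
func shouldReplace(node *corev1.Node, gracePeriod time.Duration, now time.Time) bool {
	return unreachableFor(node, now) > gracePeriod
}
```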

@rsdcastro
Author

This issue was moved to kubernetes-sigs/cluster-api#47
