machine controller should sense the event that infra obj turns into notReady #1622
Comments
I believe this is handled in a different place. The reconcile flow is: [...] At this point, the machine's [...] However, as best I can tell, once the machine's [...]
Sorry, I might not have expressed myself clearly. If the infra machine enters a permanent failure condition, the controller on the infra side should set errMsg/errReason; I agree with that. But if the infra machine is recoverable (e.g. rebooting), the controller on the infra side (in my implementation) would just move the infra machine from ready to not ready, and the machine controller in cluster-api could not sense it and would stay in the running state.
I also discussed this situation with @detiber on Slack, and he agreed with me.
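To make the scenario concrete, here is a minimal sketch of the provider-side behavior being described (the types and state names are illustrative stand-ins, not actual cluster-api or provider code): a recoverable event only clears the infra object's ready flag, while a genuinely terminal event sets ErrorReason/ErrorMessage.

```go
package provider

// InfraMachineStatus is an illustrative subset of the status an
// infrastructure provider exposes to cluster-api under the v1alpha2
// contract: a ready flag plus terminal error fields.
type InfraMachineStatus struct {
	Ready        bool
	ErrorReason  *string
	ErrorMessage *string
}

// InfraMachine is a stand-in for a provider-specific machine object.
type InfraMachine struct {
	Status InfraMachineStatus
}

// reconcileInstanceState maps a cloud instance state onto the infra
// object's status. Recoverable states (e.g. a reboot) only clear Ready;
// only terminal states set ErrorReason/ErrorMessage.
func reconcileInstanceState(m *InfraMachine, instanceState string) {
	switch instanceState {
	case "running":
		m.Status.Ready = true
	case "rebooting", "stopping":
		// Recoverable: not ready, but no terminal error is recorded.
		m.Status.Ready = false
	case "terminated":
		// Permanent failure: surface it so cluster-api marks the Machine failed.
		reason, msg := "InstanceTerminated", "instance was terminated externally"
		m.Status.ErrorReason = &reason
		m.Status.ErrorMessage = &msg
	}
}
```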
Ok, what you're pointing out makes sense. My apologies for being slightly confused 😄.
/help
For context, in v1alpha2 we chose not to recover from a failure case, to avoid loops or going back and forth between the states outlined in the machine proposal. This behavior aligns with the assumption that Machines are immutable: if you get a failure, the action the system expects an operator to take is to delete and recreate the Machine.
@vincepri There are currently plenty of recoverable errors that we encounter today; we just don't necessarily bubble them up in any way, for example if the AWS API is unavailable for any reason. I don't think we should say that any error should be unrecoverable, nor should we say that we should resolve errors in the controllers.

I think the case that is more grey (and potentially where this issue comes into play) is how to bubble up recoverable errors (that are resolved on the cloud provider side) to users without turning them into non-recoverable errors (ErrorMessage/ErrorReason). In the specific case of reboot, if an instance is triggered into a 'Reboot' state by the cloud provider somehow and we don't need to take remedial action to resolve it, then why shouldn't we allow it to recover gracefully?
Yes, we need to distinguish (somehow) between controller-retryable errors and failures.
It'll be up to the infrastructure provider to make sure a machine that's being rebooted isn't marked as failed. Never mind: the code at https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller_phases.go#L60-L63 actually doesn't check for node readiness.
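For reference, this is roughly what reading that contract field looks like; a minimal sketch assuming the v1alpha2 `status.ready` convention, not the actual code at the linked lines:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// infraReady reads the v1alpha2 contract field status.ready from an
// arbitrary infrastructure object. A missing field is treated the same
// as ready == false.
func infraReady(infra *unstructured.Unstructured) (bool, error) {
	ready, found, err := unstructured.NestedBool(infra.Object, "status", "ready")
	if err != nil {
		return false, err
	}
	return found && ready, nil
}
```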
@vincepri to look up what we wrote in the v1a2 proposal around ready -> notReady transitions.
One thing to note, I don't think we necessarily need to reset the Machine Phase back to a previous state; we could potentially introduce a new Phase that could be used to indicate formerly ready but not currently ready.
If we allow flipping from ready -> notReady, the phase would go back to Provisioned. I'm not super opposed to this behavior, but not in total favor either. I'd rather have the Machine controller detect that the linked node is not in a Ready state.
I'd like to avoid, at least for now, keeping track of previous states and introducing something like [...]
👍 for checking the node ref and flipping it back to provisioned for the time being, and doing a separate proposal around conditions (hint hint @timothysc 😄)
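A minimal sketch of that interim behavior, using simplified stand-in types rather than the real cluster-api ones: Running stops being sticky when the infrastructure reports not-ready, while terminal errors still take precedence.

```go
package controllers

// MachineStatus is a simplified stand-in for the cluster-api type,
// reduced to the fields this discussion is about.
type MachineStatus struct {
	Phase        string
	ErrorReason  *string
	ErrorMessage *string
	HasNodeRef   bool
}

// reconcilePhase sketches the agreed interim fix: if the infra object
// flips back to not-ready, the phase drops from Running to Provisioned
// instead of staying Running; terminal errors are never reversed.
func reconcilePhase(s *MachineStatus, infraReady bool) {
	switch {
	case s.ErrorReason != nil || s.ErrorMessage != nil:
		s.Phase = "Failed"
	case infraReady && s.HasNodeRef:
		s.Phase = "Running"
	case !infraReady && s.Phase == "Running":
		s.Phase = "Provisioned"
	}
}
```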
/assign
I'll work on this issue if nobody has any concerns.
I'm wholly against reversing machine states. The flow of the machine-controller and associated infra controller should be one-way. If some machine that was previously ready (e.g., became a node) has now failed, we'll see that indication on the node, and we remediate: [...] Immutable infrastructure.
For me this is less about reversing states as much as giving users visibility into the underlying resources we are managing, from a user's point of view (not taking into account automated remediation for the moment). If I'm looking at Cluster API and it is telling me that my Machine is "Ready" (as a proxy for both the Node being present and the underlying infrastructure being in a running state), but the underlying infra is not in a running state, I no longer have a sense for what or how to start remediating the issue.

I don't think we should be thinking of this from the perspective of reversing the state machine as much as providing visibility into the actual state of the world with respect to the resources we are managing for the user. From a troubleshooting perspective, I'd ideally like to get a signal from the [...]
A node is an overlying resource relative to a machine. That's where a user's focus should be: the health of nodes. If we're talking about disposable, immutable machines, then when a node goes unready/unhealthy, you get rid of it. You don't have to care what the layer underneath is doing; that's the entire point.
Yeah, "Ready" doesn't make sense for a machine. Probably the terminal state should be "Complete" after it becomes a node.
Then we should sync some aggregate health information from the nodes to the cluster object. Machines are just an implementation detail. There's also room for syncing some info about machines that haven't received a nodeRef after X time.
I think it's more than just the nodes, though. If we expect someone to take manual remediation efforts, we need to point them to the resource that is causing the trouble; otherwise we are requiring them to know all the implementation details of how cluster-api and providers work before they could even start to troubleshoot and fix the issue.
Remediation always needs to start at the node, automated or otherwise. If you have an unhealthy node, you delete the backing machine object. If the resolution is any more complicated than that, they'll need to know those provider details either way. We could build logic into the machine-controller to delete an associated machine if the node object is deleted, but there are other tradeoffs with that (e.g., users might not drain the node first).

Let's take Kubernetes out of the equation: say you have an application deployed across 3 instances in the cloud, behind a load balancer. The load balancer has healthchecks; it detects whether or not the application is up, not the backing instance. You could build automation to watch which endpoints are healthy in the load balancer, and replace those instances when they fail.

Back to Kubernetes: in our case, the application we care about is the node. Our application can go down for a variety of reasons, but that's our primary signal. If the instance dies, the application dies with it. If the application can't start due to corruption, the instance might be alive and well as far as the cloud provider is concerned. It doesn't make any sense to label the cloud VM itself with the node's status, thus it doesn't make any sense to label the machine object with the node's status.
I think this has been fixed on master / v1alpha3. We reconcile the ready value whenever it changes. Can this be closed? /cc @ncdc
Closing, fixed in v1alpha3 |
If the infra object turns from ready to notReady and has not been patched with any ErrorMsg/ErrReason, we only requeue the machine object and it stays in the running state.
cluster-api/controllers/machine_controller_phases.go, line 232 (at commit 0da8bde)
If the infra machine is recoverable (e.g. rebooting), I don't think the infra controller should set any ErrorMsg/ErrReason on the infra resource.
/kind bug