TalosControlPlane unable to scale down replaced node #193
Comments
We don't recommend hosting CAPI components in the cluster managed by the same CAPI setup. It is going to cause various issues.
@smira Thanks for the reply. Is that specifically mentioned somewhere in the docs?
If your management cluster goes down for whatever reason, there is no easy way to recover. You can try this setup, but I would never recommend it.
Well, sure. But that's a general design flaw of CAPI. It's even worse than this, because kubernetes-sigs/cluster-api#7061 exists and it doesn't seem like there will be a fix for it anytime soon.
I think the issue could be fixed by deleting the machine prior to …
I can confirm that the issue seems to be exactly that: the controller is waiting for etcd to become healthy on 2 nodes (a single-control-plane scenario in this case), which is only the case for a very short time. If the controller reconciles exactly during that window, the upgrade process continues. Otherwise it gets stuck waiting for two nodes to become healthy while the old one is already being shut down.
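To make that race concrete, here is a minimal Go sketch (not the actual provider code; etcdMember and canScaleDown are hypothetical names) of a scale-down gate that requires every etcd member to be healthy. With a single control-plane node, the gate only opens during the brief window in which both the old and the new member are healthy:

```go
package main

import "fmt"

// etcdMember is a simplified, hypothetical view of an etcd member as the
// control plane controller might see it during a rolling update.
type etcdMember struct {
	name    string
	healthy bool
}

// canScaleDown models the gate described above: the old Machine may only be
// removed while every etcd member reports healthy.
func canScaleDown(members []etcdMember) bool {
	for _, m := range members {
		if !m.healthy {
			return false
		}
	}
	return true
}

func main() {
	// Old node already shutting down, new node healthy: the gate stays closed,
	// so the controller never deletes the old Machine.
	members := []etcdMember{
		{name: "old-control-plane", healthy: false},
		{name: "new-control-plane", healthy: true},
	}
	fmt.Println("can scale down:", canScaleDown(members)) // prints: can scale down: false
}
```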
kubernetes-sigs/cluster-api#2651: it seems that the Kubeadm Control Plane Provider had the same issue, but they fixed it (by, as far as I understand, marking control-plane nodes where etcd was stopped as healthy, so that when the loop is triggered again, the machine gets deleted).
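Following that idea, a sketch of such a fix under the same hypothetical assumptions: members whose Machine is already being deleted no longer count against the health check, so a later reconcile can still remove the old Machine:

```go
package main

import "fmt"

// member is again a hypothetical simplification: health status plus whether
// the owning Machine is already marked for deletion.
type member struct {
	healthy  bool
	deleting bool
}

// canScaleDown ignores members that are already being torn down, so scale-down
// is no longer blocked by the old node's stopped etcd.
func canScaleDown(members []member) bool {
	for _, m := range members {
		if m.deleting {
			continue // expected to disappear, do not count it as unhealthy
		}
		if !m.healthy {
			return false
		}
	}
	return true
}

func main() {
	members := []member{
		{healthy: false, deleting: true}, // old control-plane node, shutting down
		{healthy: true},                  // new control-plane node
	}
	fmt.Println("can scale down:", canScaleDown(members)) // prints: can scale down: true
}
```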
I noticed today that this problem occurs whenever the … It doesn't matter which workload cluster is being rolled out.
Probably fixed in 0.5.8.
@smira Could you please elaborate on what was changed so this is considered fixed?
Some other comments on this issue: running CAPI in a self-managed cluster, as in the original report, is neither supported nor recommended.
When doing a rolling update, under certain conditions the update will never finish.

Steps to reproduce:
- Trigger a rolling update of the TalosControlPlane resource

What happens:
- TalosControlPlane starts a rolling update by creating a new Machine
- The Machine is created by whatever infrastructure provider is used
- The TalosControlPlane resource is unable to scale down to 1 and never deletes the old Machine of the old control-plane node

How to solve the problem:
Manually delete the Machine of the old control-plane node. The used infrastructure provider will then handle the deletion of the node, and the TalosControlPlane resource will scale down to 1 and become ready again.
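For illustration, a minimal Go sketch of this manual workaround, assuming the v1beta1 Cluster API types and controller-runtime; the Machine name and namespace are placeholders, and the same effect can be achieved by deleting the Machine object with kubectl:

```go
package main

import (
	"context"
	"log"

	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Register the Cluster API types and build a client from the current
	// kubeconfig context (the management cluster).
	scheme := runtime.NewScheme()
	if err := clusterv1.AddToScheme(scheme); err != nil {
		log.Fatal(err)
	}
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder name and namespace of the stuck old control-plane Machine.
	machine := &clusterv1.Machine{}
	machine.Name = "old-control-plane-machine"
	machine.Namespace = "default"

	// Deleting the Machine lets the infrastructure provider tear down the node,
	// after which the TalosControlPlane scales down and becomes ready again.
	if err := c.Delete(context.Background(), machine); err != nil {
		log.Fatal(err)
	}
}
```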
What should happen:
The TalosControlPlane controller should delete the Machine resource of the old control-plane node.

Note: this issue only happens if two conditions are met: