-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Do not update KCP and MS status when unable to get workload cluster #10229
🐛 Do not update KCP and MS status when unable to get workload cluster #10229
Conversation
Welcome @jessehu! |
Hi @jessehu. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on!
/ok-to-test
@@ -43,8 +43,7 @@ func (r *KubeadmControlPlaneReconciler) updateStatus(ctx context.Context, contro | |||
|
|||
// set basic data that does not require interacting with the workload cluster | |||
controlPlane.KCP.Status.Replicas = replicas | |||
controlPlane.KCP.Status.ReadyReplicas = 0 | |||
controlPlane.KCP.Status.UnavailableReplicas = replicas | |||
controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This change avoids setting controlPlane.KCP.Status.ReadyReplicas from none 0 to 0 when GetWorkloadCluster(ctx) hits ErrClusterLocked or other error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW:https://github.com/kubernetes-sigs/cluster-api/blob/release-1.5/controlplane/kubeadm/internal/controllers/status.go#L93 uses replicas rather than desiredReplicas to calculate UnavailableReplicas. Not sure if it's a bug or on purpose. cc @fabriziopandini please help take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be odd to have two different ways to calculate the unavailable replicas within the same status update function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @neolit123, actually it uses the same way to calculate the unavailable replicas. The line 46 is only for covering the case that KCP.Status.ReadyReplicas is 0 since the default value of UnavailableReplicas is 0 when KCP is newly created. If ReadyReplicas is not 0, UnavailableReplicas should equal to replicas - controlPlane.KCP.Status.ReadyReplicas
.
line 45-46:
controlPlane.KCP.Status.Replicas = replicas
controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas
line 91-92:
controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes
controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @neolit123 @fabriziopandini could you please help take some time to move this forward? We need this patch to fix the issue hit in our project based on CAPI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbueringer Got your point. If doing it this way, we may also need to move controlPlane.KCP.Status.Replicas = replicas
to before line 96 controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes
, otherwise only Status.Replicas increases or decreases by 1 and other Replicas fields unchanged (e.g. UpdatedReplicas, ReadyReplicas, UnavailableReplicas etc. which seems unreasonable). Even if it's moved, the conditions are updated while no Replicas fields are updated (which is unreasonable but maybe acceptedable since it will be corrected in the coming reconciles soon ?).
If updating both Status.Replicas and Status.UnavailableReplicas before getting ClusterStatus like this PR currently behaviors is not expected:
- it sounds unreasonable when replicas changes from 3 to 4 (when rolling out KCP): Replicas -> 4 & UnavailableReplicas -> 1 (incorrect)
- it sounds unreasonable when replicas changes from 3 to 2 (when manually deleted a KCP Mahcine ?): Replicas -> 2 & UnavailableReplicas -> -1 (should be 1)
The root cause is line 93controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes
(and line 47 as well) should becontrolPlane.KCP.Status.UnavailableReplicas = desiredReplicas- status.ReadyNodes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let's do the following:
- in the create case: let's set .status.replicas, .status.unavailableReplicas, .status.readyReplicas (we'll set readyReplicas to 0 but let's do it anyway for clarity)
- Then we set .stauts.version & conditions (l.55-82)
- Then let's get the cluster status and only afterwards modify the status replica fields (let's also move down the update of status updatedReplicas)
I think it makes sense to update the conditions in 2 as they are all based on information from the mgmt cluster.
Overall we're just missing a way to signal that we can't communicate with the wl cluster anymore, but we can follow-up on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbueringer Got it. I pushed a commit to use Status.UnavailableReplicas = desiredReplicas- status.ReadyNodes
which should solve the problems we discussed above. Please take a look if it is. If not, I will make the changes as you described, but not sure why we need to move down the update of status updatedReplicas because it doesn't depend on the workload cluste status: controlPlane.KCP.Status.UpdatedReplicas = int32(len(controlPlane.UpToDateMachines()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea would have been to not update any status replica fields if we can't communicate with the workload cluster to not get them into an inconsistent state with each other. I'll think a bit about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree that the idea way would have been to not update any status replica fields and conditions fields if we can't communicate with the workload cluster.
/area machineset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general idea for the change LGTM
approval is up to CAPI maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
leaving lgtm/approve to active maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
LGTM label has been added. Git tree hash: d4abd1501740aab050ed6f27e26ac81af5433520
|
/ok-to-test |
New changes are detected. LGTM label has been removed. |
This comment was marked as outdated.
This comment was marked as outdated.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
g.Expect(kcp.Status.ReadyReplicas).To(BeEquivalentTo(3)) | ||
g.Expect(kcp.Status.UnavailableReplicas).To(BeEquivalentTo(0)) | ||
g.Expect(kcp.Status.Ready).To(BeTrue()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an additional call to updateStatus here:
- drop one controlPlane.machines (or add another one)
- call updateStatus
- check that status.Replicas changed and the other ones stayed the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will add the case after the discussion is resolved: #10229 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jessehu sorry but I don't think we should rush this fix in in the current state for a couple of reasons:
- the problem description in this issue is about the MD controller handling cluster cacher tracker lock while restarting the controllers.
- the problem description in the issue also talks about kcp flicking from 3 to 2 and back to 3 while restarting controllers. This seems unrelated to the lock above (there is only one reconciler running in KCP, so there are no other reconcilers that could hold the lock for the same cluster); it means this is probably a different issue, but probably we don't have enough data to triage
- the discussion on the issue lead to a more wide problem about how to handle "we cannot get the node state", and somehow shifted to the semantic of the replica fields, but this IMO requires more thinking / design.
I admit this is not ideal. We should try to do better as a maintainer when providing guidance.
Unfortunately, the effect of this lack of analysis/scope creeping on the issue, is that the current PR might introduce more confusion (and not fix the original issue). I have tried to explain this in the comments below, but probably it is more productive to think about a way forward.
My suggestion is to focus on the specific issue about the MD controller handling cluster cacher tracker lock while restarting the controllers.
We can probably mitigate this with a simple change I have discussed with @sbueringer (at least for the happy case where there is connectivity to the workload cluster)
@@ -956,6 +960,9 @@ func (r *Reconciler) getMachineNode(ctx context.Context, cluster *clusterv1.Clus | |||
} | |||
node := &corev1.Node{} | |||
if err := remoteClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil { | |||
if apierrors.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why are we ignoring not found.
Is that for the case when someone manually deletes a node? If yes, please add a comment (but it also seems unrelated to the issue we are trying to fix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarification.
The idea behind this one was to signal to the calling function that the node doesn't exist (vs. that it's just an error). Because not finding the Node is also a valuable information (which basically comes down to that the node is not ready)
This was added to preserve the previous behavior one level above (where we only logged the error before but still considered a not found node a node that is not ready)
But yeah it becomes pretty had to understand
@@ -882,8 +887,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluste | |||
|
|||
node, err := r.getMachineNode(ctx, cluster, machine) | |||
if err != nil && machine.GetDeletionTimestamp().IsZero() { | |||
log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node)) | |||
continue | |||
return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the issue
We probably also want to have a timeout on this behaviour. If we haven't seen the Node in x time then we assume its status is unknown
But if I'm not wrong with the current implementation we are going to freeze the replica count indefinetly, which could be confusing or eventually also wrong, given that the number of replicas/machine might change in the meantime.
Frankly speaking, in order to properly fix this issue I think that we first need to figure out how to get/store the "last seen" information for each node.
Given that info, we can decide if it is ok to flip the replica status or if to wait (but I don't have a concrete idea on how to do so, I need some time to research into it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important point that I discussed with Fabrizio. We're not only freezing the replica fields, we would also freeze other status fields like ObservedGeneration and conditions.
@@ -844,7 +848,8 @@ func (r *Reconciler) shouldAdopt(ms *clusterv1.MachineSet) bool { | |||
} | |||
|
|||
// updateStatus updates the Status field for the MachineSet | |||
// It checks for the current state of the replicas and updates the Status of the MachineSet. | |||
// It checks for the current state of the replicas and updates the Status field of the MachineSet. | |||
// When unable to retrieve the Node status, it returns error and won't update the Status field of the MachineSet. | |||
func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here a proposed solution to the ErrClusterLocked issue: (that Fabrizio was referring to)
// Retry getting a remote client.
// Note: This is to ensure that we don't run into errors here just
// because multiple reconcilers try to create the client at the same time
// (for the case where the workload cluster is reachable).
var canCommunicateWithWorkloadCluster bool
var remoteClient client.Client
err = retry.OnError(wait.Backoff{
Steps: 5,
Duration: 200 * time.Millisecond,
Factor: 1.0,
}, func(err error) bool {
// Retry as long as we get remote.ErrClusterLocked errors.
return errors.Is(err, remote.ErrClusterLocked)
}, func() error {
remoteClient, err = r.Tracker.GetClient(ctx, util.ObjectKey(cluster))
if err != nil {
return err
}
canCommunicateWithWorkloadCluster = true
return nil
})
if err != nil {
log.Error(err, "Unable to retrieve status of Nodes: failed to create remote client")
}
for _, machine := range filteredMachines {
log := log.WithValues("Machine", klog.KObj(machine))
if templateLabel.Matches(labels.Set(machine.Labels)) {
fullyLabeledReplicasCount++
}
if machine.Status.NodeRef == nil {
log.V(4).Info("Waiting for the machine controller to set status.nodeRef on the Machine")
continue
}
if !canCommunicateWithWorkloadCluster {
// Skip the rest of the for loop if we can't communicate with the workload cluster.
continue
}
node, err := r.getMachineNode(ctx, remoteClient, machine)
if err != nil && machine.GetDeletionTimestamp().IsZero() {
log.Error(err, "Unable to retrieve Node status", "Node", klog.KObj(node))
continue
}
if noderefutil.IsNodeReady(node) {
readyReplicasCount++
if noderefutil.IsNodeAvailable(node, ms.Spec.MinReadySeconds, metav1.Now()) {
availableReplicasCount++
}
} else if machine.GetDeletionTimestamp().IsZero() {
log.V(4).Info("Waiting for the Kubernetes node on the machine to report ready state")
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that when the workload cluster is reachable, we only get ErrClusterLocked for a very short amount of time (the time it takes to create a client). For this case it is good enough to simply retry creating the client.
We will fallback to the previous behavior only if the workload cluster is actually not reachable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!I'm not quite sure if it's worthy to add this bulk of code to resolve the temporary inconsistency of some status replicas and conditions fields caused by ErrClusterLocked error. The simple change in current PR can solve this problem while there are inconsistency but acceptable and won't cause issues per my thinking (maybe I missed some cases).
cc @fabriziopandini for awareness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbueringer and I are in total agreement, and we are trying to help in re-focusing the PR to the original issue. We also took some additional steps to help in finding a way forward by proposing an alternative solution.
With regards to the current PR, I already tried to explain my concern and I will be happy to add more if you have any doubts (and I already answered one).
But given the concern above, I'm personally -1 to merge the PR in the current state.
Instead, we both think the change proposed by @sbueringer solves the original issues, but ultimately it is up to you to accept it or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fabriziopandini for the details. Here are my two cents:
- Should this new if block calls
return err
instead ofcontinue
since canCommunicateWithWorkloadCluster won't change in the for loop.
if !canCommunicateWithWorkloadCluster {
// Skip the rest of the for loop if we can't communicate with the workload cluster.
continue
}
- There are many places that return ErrClusterLocked (not only in MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here to resolve the original status updating issue?
- Current PR code returns err in the for loop in updateStatus() in MS controller when hitting ErrClusterLocked or other error. In this case, it won't update MS.Status because the status update code is after the for loop at lin 903
newStatus.Replicas = int32(len(filteredMachines))
.
However KCP controller is different and already updates some status fields before hitting ErrClusterLocked.
machineset_controller.go:
if err != nil && machine.GetDeletionTimestamp().IsZero() {
- log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
- continue
+ return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: not sure if canCommunicateWithWorkloadCluster == true
can 100% guarentee getMachineNode() won't hit ErrClusterLocked since they are two serial func calls but not atom calls ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this new if block calls return err instead of continue since canCommunicateWithWorkloadCluster won't change in the for loop.
Our main concern is basically that if we return an error here as soon as we hit ErrClusterLocked we don't update the status at all anymore. This should be okay in the happy path, which is that we can actually communicate with the workload cluster and it's just a matter of waiting until another goroutine successfully created the client so we can use it. The problem we see is if the workload cluster is actually not reachable. Because in that case we will just continuously return the error forever. In this case we "freeze" the values of status: FullyLabeledReplicas, ReadyReplicas, AvailableReplicas, ObservedGeneration and a few conditions.
The concern Fabrizio raised (and I didn't have on my radar before) is that if we freeze the status indefinitely in these cases (or rather until the workload cluster is reachable) this can be pretty confusing for users. So for this case we should actually have a more extensive solution which also covers signalling to users that we can't communicate with the workload cluster anymore.
What we were trying to suggest was a mitigation for the issue for the happy path, where we actually can talk to the workload cluster, but replicas are flipping only because of the way we create the client initially for a cluster.
There are many places that return ErrClusterLocked (not only in MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here to resolve the original status updating issue?
I'm not really sure about this one. Basically we introduced the current "TryLock" over a "Lock" to make sure we don't block all reconcile for a cluster when a workload cluster is not reachable. The "Lock" we had before lead to really problematic performance issues when some workload clusters (with a lot of Machines) were unreachable.
So I think we should be careful with introducing these retries as they can lead to serious performance degradation (basically by just introducing this here, every reconcile of a MachineSet of an unreachable cluster will take 1 second).
Just trying to summarize the overall direction of my and Fabrizio's recent comments:
Sorry for the confusion overall, it's just a fairly complex topic and we're trying to figure out what the right thing to do is |
@sbueringer Thanks very much for the summary. I totally agree on bullet 1/2/3.
|
Yup absolutely no question that if we hit ErrClusterLocked, this happens. What I don't understand is how we can hit ErrClusterLocked. This should only be possible by concurrent access to the ClusterCacheTracker for the same cluster. But we only have one reconciler (KubeadmControlPlaneReconciler) in this controller and this one can't run concurrently for the same KCP/Cluster object. I think what could help to figure this out is to add a log to print the call stack whenever we enter ClusterCacheTracker.GetClient (+ for which cluster GetClient is called) Q: Are you also getting ErrClusterLocked printed in the KCP controller logs? (IIRC the issue had a log from the MachineSet controller) |
I observed the issue when upgrading CAPI components. Would it be able to produce the ClusterCacheTracker lock scenario when rolling update KCP controller deployment? |
For MachineSet specifically: It definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes the ready and available replicas were dropping to 0 when the communication broke down. Probably a matter of preference if in case of the communication breakdown we prefer to keep ready and available replicas as is or drop them to 0. But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions) |
No the lock is purely in-memory, basically it just ensures that only one worker (within a controller) at a time can call |
I'm sorry for the confusion that the forked repo was deleted by others today by mistake. I will create a new PR to handle the MS status update issue first as suggested. |
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #10195
Test: