
Apply machine deployment changes only to latest machine set, never both (e.g. during update) #754

Closed
vlerenc opened this issue Oct 24, 2022 · 2 comments · Fixed by #765
Labels
kind/bug (Bug), priority/critical (Needs to be resolved soon, because it impacts users negatively), priority/1 (Priority; lower number equals higher priority), status/closed (Issue is closed, either delivered or triaged)

Comments

@vlerenc
Member

vlerenc commented Oct 24, 2022

What happened:

While investigating issues with pending workload during a rolling update that looked as if drain started on nodes without sufficient new capacity, we found that changes to the machine deployment (e.g. cluster-autoscaler scale-out or scale-in) are applied to both machine sets, the old one and the new one. This is a bug in itself, and it is also the most likely culprit for the pending workload:

  • If the cluster scales out, this bug may lead to one additional node being added (to the old machine set) that shouldn't be. This node will become Ready and will be tainted immediately, but it may confuse the progress tracking and may result in the wrong node being drained.
  • If the cluster scales in, this bug may lead to one additional node being removed (from the old machine set) that shouldn't be. This node will be drained regardless of the ongoing rolling update and its workload becomes pending.

Scale-Out

3 zones with 3 nodes each; 2 zones completed their rolling update, 1 zone hangs; maxSurge is 6 / 3 = 2 per zone (maxUnavailable is 0):
(screenshot: 18-pods)

After adding more workload, all machine deployments scale 3 -> 4, but for the blocked zone both backing machine sets are incremented, leading to 4 new machines instead of the 3 requested via the machine deployments:
(screenshot: 24-pods)

Scale-In

Same situation as above, now manually reducing the machine deployment replica count 3 -> 2:
(screenshot: At-3-Scaling-To-2)

Outcome after the 3 -> 2 change: the old machine set replica count was reduced, violating the rolling update contract (only 2 instead of the previous 3 nodes are Ready, even though no new machine has joined the cluster yet). Now further reducing the machine deployment replica count 2 -> 1:
(screenshot: At-2-Scaling-To-1)

Outcome after the 2 -> 1 change: now even both machine set replica counts were reduced:
(screenshot: At-1)

What was expected:

Changes to the machine deployment should only be rolled out to the new machine set and never to the old machine set while a rolling update is under way. This is definitely true for scale-out. Scale-in is something we should generally avoid while performing a rolling update; that's why we annotate nodes, I believe (I am not sure whether we do that also for new nodes). Either way, as an additional safeguard, we should at a minimum only apply the change to the new machine set, if at all.
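
Below is a minimal sketch of the expected behavior (the types and the function are hypothetical, not the actual MCM code): while a rolling update is in progress, a replica change is applied only to the new machine set.

```go
// Hypothetical sketch (not the actual MCM code) of the expected behavior:
// while a rollout is in progress, a replica change on the MachineDeployment
// is applied only to the newest MachineSet; old MachineSets stay untouched.
package main

import "fmt"

// machineSet is a simplified stand-in for the MCM MachineSet resource.
type machineSet struct {
	name     string
	replicas int32
}

// applyReplicaDelta applies a scale change: during a rolling update only the
// new machine set is changed, never the old one.
func applyReplicaDelta(oldMS, newMS *machineSet, delta int32, rollingUpdate bool) {
	if rollingUpdate {
		newMS.replicas += delta
		return
	}
	// Outside a rolling update there is only one active machine set anyway.
	oldMS.replicas += delta
}

func main() {
	oldMS := &machineSet{name: "old", replicas: 3}
	newMS := &machineSet{name: "new", replicas: 1}

	applyReplicaDelta(oldMS, newMS, +1, true) // scale-out during a rollout

	fmt.Printf("%s=%d %s=%d\n", oldMS.name, oldMS.replicas, newMS.name, newMS.replicas)
	// Prints: old=3 new=2 -- the old machine set is not touched.
}
```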

How to reproduce it:

Have a cluster with a worker pool. Either block new machine creation (verified) or have blocking PDBs (not verified) that slow down the drain. Then initiate a cluster update (e.g. a new Kubernetes minor version or OS version). While the process is in progress, modify the machine deployment (scale-out can also be tested indirectly with more workload that needs capacity). You will see that both machine sets are modified.

Anything else we need to know?

As this may break sensitive workload during rolling updates, the priority is critical.

@vlerenc added the kind/bug, priority/critical, and priority/1 labels on Oct 24, 2022
@elankath
Contributor

elankath commented Nov 22, 2022

Still under analysis for the root cause of the bug. The current code distributes replicas across the old and new machine sets in dc.scale.

This logic appears to be correct in itself: it was done to reduce the number of unavailable machines and was apparently introduced to follow Scale Deployments Proportionally.
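
For illustration, here is a rough sketch of what such proportional scaling does (modelled loosely on the linked upstream behavior; the function and names are made up, not the actual dc.scale code): the replica delta is split across all machine sets in proportion to their current size, which is why both the old and the new machine set get changed.

```go
// Toy illustration of proportional scaling (names are made up, not MCM's):
// a replica delta on the deployment is split across ALL machine sets in
// proportion to their current size, so the old machine set is changed too.
package main

import "fmt"

func scaleProportionally(replicas []int32, delta int32) []int32 {
	var total int32
	for _, r := range replicas {
		total += r
	}
	scaled := make([]int32, len(replicas))
	var distributed int32
	for i, r := range replicas {
		share := delta * r / total // proportional share, truncated towards zero
		scaled[i] = r + share
		distributed += share
	}
	// Give any leftover caused by integer truncation to the largest set.
	if leftover := delta - distributed; leftover != 0 {
		largest := 0
		for i, r := range replicas {
			if r > replicas[largest] {
				largest = i
			}
		}
		scaled[largest] += leftover
	}
	return scaled
}

func main() {
	// Old machine set has 3 machines, the new one has 17 (20 in total);
	// the deployment is scaled down by 2.
	fmt.Println(scaleProportionally([]int32{3, 17}, -2)) // [3 15]
}
```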

@himanshu-kun
Contributor

After internal discussion and going through the code, we found that the double scale-up/scale-down happens because two MCM logics run one after another:

  • First logic -> the scaling logic, which does proportional scaling when the deployment's replica count increases/decreases
  • Second logic -> the rolling update logic, which tries to keep the number of machines within the range [minAvailable, replicas+maxSurge] (at least minAvailable Ready machines, at most replicas+maxSurge machines in total, including NotReady ones)

Individually, both logics respond correctly to the state of the system, but the problem is that they both depend on d.Replicas: reducing it changes the range mentioned above, and then the rolling update removes a machine (in the scale-down case) even though there isn't a spare machine to take over its workload.
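
To make the dependency on d.Replicas concrete, here is a minimal sketch of the bounds described above (the type and field names are illustrative, not the actual MCM code); note how shrinking d.Replicas shrinks both bounds at once.

```go
// Minimal sketch of the rolling-update bounds described above (the type and
// field names are illustrative, not the actual MCM code).
package main

import "fmt"

type rolloutBounds struct {
	minAvailable int32 // at least this many Ready machines
	maxTotal     int32 // at most this many machines overall (incl. NotReady)
}

func boundsFor(replicas, maxSurge, maxUnavailable int32) rolloutBounds {
	return rolloutBounds{
		minAvailable: replicas - maxUnavailable,
		maxTotal:     replicas + maxSurge,
	}
}

func main() {
	// With d.Replicas=18, maxSurge=2, maxUnavailable=0 the range is [18, 20].
	fmt.Println(boundsFor(18, 2, 0)) // {18 20}
	// Dropping d.Replicas to 16 moves the range to [16, 18]: 18 running
	// machines now sit at the upper bound, so old machines get removed.
	fmt.Println(boundsFor(16, 2, 0)) // {16 18}
}
```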

This has caused problems, as we know, with the way the autoscaler asks to remove machines from a machineDeployment.
Consider the following rolling update scenario as an example
(O -> old machineSet's machines, N -> new machineSet's machines; all machines are assumed to be Running):

Consider 20 Running machines in total under the machineDeployment:
 d.Replicas=18, 3 (O), 17 (N), maxSurge=2, maxUnavailable=0

Now the autoscaler wants to remove 2 machines so that the total becomes 18, but it scales down d.Replicas:
  18 -> 16 (range 16 --- 18)

Proportional scaling happens first:
  17 (N) -> 15 (N) (because of scaling) -----> d.Replicas=16, 3 (O), 15 (N) ---> now we have 18 machines, which was the state the autoscaler desired.

But,...

Since the total number of machines is now 18, which is the maximum of the range, the rolling update logic removes machines from the old machineSet. This is a problem because the new machineSet's machines are already full, so the workload on the removed old machines has nowhere to go:

  3 (O) -> 1 (O) (because of rolloutRolling)

Final state:

d.Replicas=16, 1 (O), 15 (N)
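
Putting the two steps together, a toy calculation that reproduces the final state above (illustrative only; these are not the actual MCM functions):

```go
// Toy end-to-end calculation reproducing the scenario above.
package main

import "fmt"

func main() {
	// Initial state during the rollout: 20 Running machines in total.
	oldMS, newMS := int32(3), int32(17)
	replicas, maxUnavailable := int32(18), int32(0)

	// The autoscaler wants 2 machines gone and lowers d.Replicas 18 -> 16.
	replicas -= 2

	// Step 1: proportional scaling removes the 2 replicas from the larger
	// (new) machine set: 17 -> 15, total is now 18 as the autoscaler wanted.
	newMS -= 2

	// Step 2: the rolling-update logic now sees more machines than
	// minAvailable (= replicas - maxUnavailable = 16) and trims the OLD set.
	minAvailable := replicas - maxUnavailable
	oldMS -= (oldMS + newMS) - minAvailable // 18 - 16 = 2, so 3 -> 1

	fmt.Printf("d.Replicas=%d, %d (O), %d (N)\n", replicas, oldMS, newMS)
	// Prints: d.Replicas=16, 1 (O), 15 (N) -- the final state shown above.
}
```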

The above problem of workload not having a node to be placed on has 2 major causes:

  • If the autoscaler is the actor and it just wants to remove 2 machines out of the initial 20, it should NOT reduce d.Replicas by 2; this changes the range for the rolling update, and the rolling update logic then causes the damage.
  • The MCM rolling update logic currently relies only on the range and doesn't take into account whether the pods would still be schedulable after draining. This is mostly because it is the autoscaler's job to make these calculations, so it wasn't included in MCM.

To deal with the above, for now we have done the following:

  • on scale-up of d.Replicas during a rolling update, MCM will increase only the new MachineSet's replicas
  • the autoscaler will not scale down a machineDeployment during a rolling update

Users of open-source MCM who want the desired result after reducing a machineDeployment's replicas should reduce d.Replicas in a way that doesn't affect the range (for that, they might have to tweak maxSurge and maxUnavailable, as these are used to calculate the range).
