
Apply machine deployment changes only to latest machine set, never both (e.g. during update) #754

Closed
vlerenc opened this issue Oct 24, 2022 · 2 comments · Fixed by #765
Labels
kind/bug (Bug), priority/critical (Needs to be resolved soon, because it impacts users negatively), priority/1 (Priority; lower number equals higher priority), status/closed (Issue is closed, either delivered or triaged)

Comments

@vlerenc
Member

vlerenc commented Oct 24, 2022

What happened:

While investigating issues with pending workload during a rolling update that looked as if drain started on nodes without sufficient new capacity, we found that changes to the machine deployment (e.g. cluster-autoscaler scale-out or scale-in) are applied to both machine sets, the old one and the new one. This is a bug in itself, and it is also the most likely culprit for the pending workload:

  • If the cluster scales out, this bug may lead to one additional node being added (to the old machine set) that shouldn't be. This node will become Ready and will be tainted immediately, but it may confuse the progress tracking and may result in the wrong node being drained.
  • If the cluster scales in, this bug may lead to one additional node being removed (from the old machine set) that shouldn't be. This node will be drained regardless of the ongoing rolling update and its workload becomes pending.

Scale-Out

3 zones with 3 nodes each; 2 zones completed their rolling update, 1 zone hangs; maxSurge is 6 / 3 = 2 per zone (maxUnavailable is 0):
(screenshot: 18-pods)

After adding more workload, all machine deployments scale 3 -> 4, but for the blocked zone both backing machine sets are incremented, leading to 4 new machines instead of the 3 requested via the machine deployments:
(screenshot: 24-pods)

Scale-In

Same situation as above, now manually reducing the machine deployment replica count 3 -> 2:
(screenshot: At-3-Scaling-To-2)

Outcome after the 3 -> 2 change: the old machine set replica count was reduced, violating the rolling update contract (only 2 instead of the previous 3 nodes are Ready, even though no new machine has joined the cluster yet). Now further reducing the machine deployment replica count 2 -> 1:
(screenshot: At-2-Scaling-To-1)

Outcome after the 2 -> 1 change: now even both machine set replica counts were reduced:
(screenshot: At-1)

What was expected:

Changes to the machine deployment should only be rolled out to the new machine set and never to the old machine set while a rolling update is under way. This is definitely true for scale-out. Scale-in is something we should generally avoid while performing a rolling update; that's why we annotate nodes, I believe (I am not sure whether we do that also for new nodes). Either way, as an additional safeguard, we should at a minimum only apply the change to the new machine set, if at all.
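
Below is a minimal sketch of the expected behavior (the types and the function are hypothetical, not the actual MCM code): while a rolling update is in progress, a replica change is applied only to the new machine set.

```go
// Hypothetical sketch (not the actual MCM code) of the expected behavior:
// while a rollout is in progress, a replica change on the MachineDeployment
// is applied only to the newest MachineSet; old MachineSets stay untouched.
package main

import "fmt"

// machineSet is a simplified stand-in for the MCM MachineSet resource.
type machineSet struct {
	name     string
	replicas int32
}

// applyReplicaDelta applies a scale change: during a rolling update only the
// new machine set is changed, never the old one.
func applyReplicaDelta(oldMS, newMS *machineSet, delta int32, rollingUpdate bool) {
	if rollingUpdate {
		newMS.replicas += delta
		return
	}
	// Outside a rolling update there is only one active machine set anyway.
	oldMS.replicas += delta
}

func main() {
	oldMS := &machineSet{name: "old", replicas: 3}
	newMS := &machineSet{name: "new", replicas: 1}

	applyReplicaDelta(oldMS, newMS, +1, true) // scale-out during a rollout

	fmt.Printf("%s=%d %s=%d\n", oldMS.name, oldMS.replicas, newMS.name, newMS.replicas)
	// Prints: old=3 new=2 -- the old machine set is not touched.
}
```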

How to reproduce it:

Have a cluster with a worker pool. Either block new machine creation (verified) or have blocking PDBs (not verified) that slow down the drain. Then initiate a cluster update (e.g. a new Kubernetes minor version or OS version). While the process is in progress, modify the machine deployment (scale-out can also be tested indirectly with more workload that needs capacity). You will see that both machine sets are modified.

Anything else we need to know?

As this may break sensitive workload during rolling updates, the priority is critical.

@vlerenc added the kind/bug, priority/critical, and priority/1 labels on Oct 24, 2022
@elankath
Contributor

elankath commented Nov 22, 2022

Still under analysis for the root cause of the bug. The current code distributes replicas across the old and new machine sets in dc.scale.

This logic appears to be correct in itself: it was done to reduce the number of unavailable machines and was apparently introduced to follow Scale Deployments Proportionally.
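
For illustration, here is a rough sketch of what such proportional scaling does (modelled loosely on the linked upstream behavior; the function and names are made up, not the actual dc.scale code): the replica delta is split across all machine sets in proportion to their current size, which is why both the old and the new machine set get changed.

```go
// Toy illustration of proportional scaling (names are made up, not MCM's):
// a replica delta on the deployment is split across ALL machine sets in
// proportion to their current size, so the old machine set is changed too.
package main

import "fmt"

func scaleProportionally(replicas []int32, delta int32) []int32 {
	var total int32
	for _, r := range replicas {
		total += r
	}
	scaled := make([]int32, len(replicas))
	var distributed int32
	for i, r := range replicas {
		share := delta * r / total // proportional share, truncated towards zero
		scaled[i] = r + share
		distributed += share
	}
	// Give any leftover caused by integer truncation to the largest set.
	if leftover := delta - distributed; leftover != 0 {
		largest := 0
		for i, r := range replicas {
			if r > replicas[largest] {
				largest = i
			}
		}
		scaled[largest] += leftover
	}
	return scaled
}

func main() {
	// Old machine set has 3 machines, the new one has 17 (20 in total);
	// the deployment is scaled down by 2.
	fmt.Println(scaleProportionally([]int32{3, 17}, -2)) // [3 15]
}
```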

@himanshu-kun
Contributor

After internal discussion and going through the code, we found that the double scale-up/scale-down happens because two MCM logics run one after another:

  • First logic -> the scaling logic, which does proportional scaling when the deployment's replica count increases/decreases
  • Second logic -> the rolling update logic, which tries to keep the number of machines within the range [minAvailable, replicas+maxSurge] (at least minAvailable Ready machines, at most replicas+maxSurge machines in total, including NotReady ones)

Individually, both logics respond correctly to the state of the system, but the problem is that they both depend on d.Replicas: reducing it changes the range mentioned above, and then the rolling update removes a machine (in the scale-down case) even though there isn't a spare machine to take over its workload.
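
To make the dependency on d.Replicas concrete, here is a minimal sketch of the bounds described above (the type and field names are illustrative, not the actual MCM code); note how shrinking d.Replicas shrinks both bounds at once.

```go
// Minimal sketch of the rolling-update bounds described above (the type and
// field names are illustrative, not the actual MCM code).
package main

import "fmt"

type rolloutBounds struct {
	minAvailable int32 // at least this many Ready machines
	maxTotal     int32 // at most this many machines overall (incl. NotReady)
}

func boundsFor(replicas, maxSurge, maxUnavailable int32) rolloutBounds {
	return rolloutBounds{
		minAvailable: replicas - maxUnavailable,
		maxTotal:     replicas + maxSurge,
	}
}

func main() {
	// With d.Replicas=18, maxSurge=2, maxUnavailable=0 the range is [18, 20].
	fmt.Println(boundsFor(18, 2, 0)) // {18 20}
	// Dropping d.Replicas to 16 moves the range to [16, 18]: 18 running
	// machines now sit at the upper bound, so old machines get removed.
	fmt.Println(boundsFor(16, 2, 0)) // {16 18}
}
```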

This has caused problems, as we know, with the way the autoscaler asks to remove machines from a machineDeployment.
Consider the following rolling update scenario as an example
(O -> old machineSet's machines, N -> new machineSet's machines; all machines are assumed to be Running):

Consider 20 Running machines in total under the machineDeployment:
 d.Replicas=18, 3 (O), 17 (N), maxSurge=2, maxUnavailable=0

Now the autoscaler wants to remove 2 machines so that the total becomes 18, but it scales down d.Replicas:
  18 -> 16 (range 16 --- 18)

Proportional scaling happens first:
  17 (N) -> 15 (N) (because of scaling) -----> d.Replicas=16, 3 (O), 15 (N) ---> now we have 18 machines, which was the state the autoscaler desired.

But,...

Since the total number of machines is now 18, which is the maximum of the range, the rolling update logic removes machines from the old machineSet. This is a problem because the new machineSet's machines are already full, so the workload on the removed old machines has nowhere to go:

  3 (O) -> 1 (O) (because of rolloutRolling)

Final state:

d.Replicas=16, 1 (O), 15 (N)
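
Putting the two steps together, a toy calculation that reproduces the final state above (illustrative only; these are not the actual MCM functions):

```go
// Toy end-to-end calculation reproducing the scenario above.
package main

import "fmt"

func main() {
	// Initial state during the rollout: 20 Running machines in total.
	oldMS, newMS := int32(3), int32(17)
	replicas, maxUnavailable := int32(18), int32(0)

	// The autoscaler wants 2 machines gone and lowers d.Replicas 18 -> 16.
	replicas -= 2

	// Step 1: proportional scaling removes the 2 replicas from the larger
	// (new) machine set: 17 -> 15, total is now 18 as the autoscaler wanted.
	newMS -= 2

	// Step 2: the rolling-update logic now sees more machines than
	// minAvailable (= replicas - maxUnavailable = 16) and trims the OLD set.
	minAvailable := replicas - maxUnavailable
	oldMS -= (oldMS + newMS) - minAvailable // 18 - 16 = 2, so 3 -> 1

	fmt.Printf("d.Replicas=%d, %d (O), %d (N)\n", replicas, oldMS, newMS)
	// Prints: d.Replicas=16, 1 (O), 15 (N) -- the final state shown above.
}
```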

The above problem of workload not having a node to be placed on has 2 major causes:

  • If the autoscaler is the actor and it just wants to remove 2 machines out of the initial 20, it should NOT reduce d.Replicas by 2; this changes the range for the rolling update, and the rolling update logic then causes the damage.
  • The MCM rolling update logic currently relies only on the range and doesn't take into account whether the pods would still be schedulable after draining. This is mostly because it is the autoscaler's job to make these calculations, so it wasn't included in MCM.

To deal with the above, for now we have done the following:

  • on scale-up of d.Replicas during a rolling update, MCM will increase only the new MachineSet's replicas
  • the autoscaler will not scale down a machineDeployment during a rolling update

Users of open-source MCM who want the desired result after reducing a machineDeployment's replicas should reduce d.Replicas in a way that doesn't affect the range (for that, they might have to tweak maxSurge and maxUnavailable, as these are used to calculate the range).
