
Add new operator controller MachineSet #1532

Closed
cadenmarchese wants to merge 6 commits into Azure:master from cadenmarchese:machineset-controller

Conversation

Collaborator

@cadenmarchese cadenmarchese commented Jun 7, 2021

Which issue this PR addresses:

User Story 8749435 - cluster upgrade is blocked if there is only one worker VM, because the ingress and console operators are Degraded. We need to ensure that there are at least two worker replicas in place to avoid this, as well as an additional third worker replica to meet Microsoft's support policies.

What this PR does / why we need it:

This controller watches MachineSet objects for changes and tallies worker replicas. If Spec.Replicas on an object matching our infraID is reduced such that the total worker replica count falls below three, the operator reverts the change and scales the MachineSet back up.
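A minimal sketch of that reconcile decision (the struct and function names here are illustrative stand-ins, not the actual operator code):

```go
package main

import (
	"fmt"
	"strings"
)

// machineSet is an illustrative stand-in for the worker MachineSet object;
// only the fields this sketch needs are included.
type machineSet struct {
	Name     string
	Replicas *int32 // mirrors Spec.Replicas, which may be nil
}

// totalReplicas tallies Spec.Replicas across all worker MachineSets,
// custom ones included.
func totalReplicas(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		if ms.Replicas != nil {
			total += *ms.Replicas
		}
	}
	return total
}

// replicasToRestore returns how many replicas the modified MachineSet needs
// back so the cluster-wide total meets min; 0 means no action. Only
// MachineSets whose name contains the infra ID are ever scaled back up.
func replicasToRestore(sets []machineSet, modified, infraID string, min int32) int32 {
	if !strings.Contains(modified, infraID) {
		return 0 // a custom MachineSet changed: leave it alone
	}
	if t := totalReplicas(sets); t < min {
		return min - t
	}
	return 0
}

func main() {
	one, zero := int32(1), int32(0)
	sets := []machineSet{
		{Name: "aro-xyz-worker-1", Replicas: &one},
		{Name: "aro-xyz-worker-2", Replicas: &zero},
	}
	// Total is 1 and the minimum is 3, so the scaled-down default set gets 2 back.
	fmt.Println(replicasToRestore(sets, "aro-xyz-worker-2", "aro-xyz", 3))
}
```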

Test plan for issue:

  • Added E2E coverage to confirm that the object is modified as expected.
  • Tested the controller in a development cluster with a custom MachineSet in place, to confirm that the controller functioned as intended and that the custom MachineSet was never modified.
  • Confirmed that the controller only reconciles on changes to MachineSet objects.

Is there any documentation that needs to be updated for this PR?

Likely not needed, as the ARO support policy already includes a line on maintaining a minimum of three worker nodes. That said, it may also help to note that the operator will attempt to scale worker MachineSets back up once this merges.

@cadenmarchese cadenmarchese requested review from bennerv and jewzaam June 7, 2021 16:55
@troy0820
Contributor

troy0820 commented Jun 7, 2021

You could add an e2e test to see if your reconciler scales the nodes automatically. We already do scaling, but we don't automatically resolve the scaling events.

Contributor

@troy0820 troy0820 left a comment

This is looking good, just a few comments.

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 6445579 to 4072ed0 Compare July 12, 2021 21:23
@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jul 13, 2021
@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 4072ed0 to 5ef8aa4 Compare July 13, 2021 16:22
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Jul 13, 2021
@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jul 15, 2021
@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 5ef8aa4 to 7f83543 Compare July 15, 2021 20:52
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Jul 15, 2021
Contributor

@nilsanderselde nilsanderselde left a comment

Nit: Going forward, these clients should be alphabetized. There's a PR open to standardize it for the existing controllers.

Contributor

What if this happens?

  1. Customer creates a cluster with 3 masters and 3 workers
  2. Creates a new machine set without the infra ID in the name, with 3 replicas
  3. Scales down the default machine sets (those which contain the infra ID) to 0. At this point we still have 3 masters and 3 workers
  4. Scales down the custom machine sets to 0.

Collaborator Author

The controller won't intervene in this case as long as total worker replica count is 3, because the List operation here counts all worker replicas towards replica count, not just ones matching our infra ID: https://github.com/cadenmarchese/ARO-RP/blob/machineset-controller/pkg/operator/controllers/machineset/machineset_controller.go#L49

If the user were to scale down our replicas before making their own, then the controller would intervene.

Contributor

@m1kola m1kola Jul 20, 2021

@cadenmarchese yes, I think the controller won't intervene in the above case, but the customer ends up with 3 masters and 0 workers after step 4, no?

Collaborator Author

You're right, the controller wouldn't scale ours back up, because they aren't the objects being modified at that time. I'm struggling to think of how to account for this case without changing other assumptions (for example, that custom machinesets should count towards the total, and that we never modify custom objects). I'm open to suggestions.

Contributor

@m1kola m1kola Jul 22, 2021

We can:

  1. Scale customer-created machine sets back to meet the requirement (technically the easiest solution, but might be more confusing to the customer)
  2. Always scale our default worker pool, no matter which machine set was removed/modified. This is a bit more complicated, as we have to take multi-AZ setups into account: there will be cases where we have 3 machine sets (multi-AZ setup) and cases where we have only one AZ (no AZ support in the region).

Contributor

@mjudeikis so what you are saying is: only watch the original machine set, try to scale up if needed, and ignore everything else, right?

Collaborator Author

So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.

From comments above, I am understanding:

  • We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)
  • We watch for changes to machineset objects and revert them if needed (no need to do AZ sorting, because we are just watching for scale down)
  • We are moving to 2 minSupportedReplicas instead of 3 (Add new operator controller MachineSet #1532 (comment)), which is fine in the context of our upgrades and the user story

Does this sound accurate @m1kola @mjudeikis?

Contributor

We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)

No. My understanding of MJ's idea is that we should do this:

  • If the reconciliation request is for a default worker machine set (with the infra ID), we proceed with reconciliation and check the total number of worker nodes
  • If the request is for anything else, we bail out

Everything else sounds good to me, but I would like MJ to confirm that we understood his idea correctly.

Collaborator Author

These two parts:

bail out any time we deviate from the default configuration

If the customer scaled down the original MachineSet AND there is another machine set (regardless of AZ or available replicas) - bail out

To me, this sounds like we are bailing out in any case where there are custom machinesets (not just when a custom machineset triggers reconcile). Something like this:

for _, machineset := range machinesets.Items {
	if !strings.Contains(machineset.Name, instance.Spec.InfraID) {
		// a custom MachineSet exists: bail out without requeueing
		return reconcile.Result{}, nil
	}
	if machineset.Spec.Replicas != nil {
		replicaCount += int(*machineset.Spec.Replicas)
	}
}

So we would still reconcile each time an object is changed, but this way we're covered if the second point above is true. Let me know if I am not understanding correctly.

Collaborator Author

I caught up with MJ about this, and we want to bail out in either case. @m1kola do you mind reviewing the controller code changes? From there we can wrap up tests.

Contributor

@m1kola m1kola left a comment

We are heading in the right direction, but we still need to handle 404 and non-404 errors on Get, decide what we want to do in each case, and adjust the tests accordingly.

I suggest you put test changes on hold until we settle on how we scale machine sets back. Otherwise I suspect you will spend a lot of time adjusting tests.

Comment on lines +267 to +317
Contributor

Nit: you can just do something like this

Suggested change:

	request := ctrl.Request{}
	request.Name = tt.objectName
	request.Namespace = machineSetsNamespace

becomes:

	request := ctrl.Request{
		Name:      tt.objectName,
		Namespace: machineSetsNamespace,
	}

Collaborator Author

I wasn't able to get this to work - it looks like ctrl.Request{} will only take NamespacedName. Let me know if I am missing something.

Contributor

It's because it embeds another type. I'm happy to leave it as is because that's less cumbersome, but it's a good exercise to figure out how to make it work ;)
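The compile error comes from struct embedding: ctrl.Request embeds types.NamespacedName rather than declaring Name/Namespace fields of its own. A stdlib-only mimic of those two types (field names copied from the real API, everything else simplified) shows the working composite literal:

```go
package main

import "fmt"

// NamespacedName mimics k8s.io/apimachinery/pkg/types.NamespacedName.
type NamespacedName struct {
	Namespace string
	Name      string
}

// Request mimics ctrl.Request, which embeds NamespacedName.
type Request struct {
	NamespacedName
}

func main() {
	// The embedded field must be named explicitly in a composite literal...
	req := Request{NamespacedName: NamespacedName{
		Namespace: "openshift-machine-api",
		Name:      "aro-worker",
	}}
	// ...even though the promoted fields read naturally afterwards.
	fmt.Println(req.Name, req.Namespace)
}
```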

@cadenmarchese
Collaborator Author

Azure/azure-cli#18950 seems to still be a blocker for the checks here, but they are passing locally.

Contributor

@mjudeikis mjudeikis left a comment

my 2 cents

Contributor

I think our goal was to keep it simple, and bail out any time we deviate from the default configuration.

  1. If the customer deleted the original worker MachineSets - bail out
  2. If the customer scaled down the original MachineSet AND there are no other machine sets available - scale up.
  3. If the customer scaled down the original MachineSet AND there is another machine set (regardless of AZ or available replicas) - bail out, as we can't cover ALL customer ideas and the thinking behind them.

So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.
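Those three rules could be condensed into a single decision helper (illustrative names only, not the actual controller code):

```go
package main

import "fmt"

// shouldScaleUp encodes the three rules above: bail out if the original
// worker MachineSet was deleted (rule 1) or if any other machine set
// exists (rule 3); otherwise scale the original back up (rule 2).
func shouldScaleUp(originalExists bool, otherMachineSets int) bool {
	if !originalExists {
		return false // rule 1: original deleted, bail out
	}
	if otherMachineSets > 0 {
		return false // rule 3: customer has their own machine sets, bail out
	}
	return true // rule 2: default-only cluster, safe to scale up
}

func main() {
	fmt.Println(shouldScaleUp(true, 0))  // default-only cluster: scale up
	fmt.Println(shouldScaleUp(true, 2))  // custom machine sets present: bail out
	fmt.Println(shouldScaleUp(false, 0)) // original deleted: bail out
}
```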

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from e9c6231 to 18aeafb Compare July 30, 2021 20:01
@cadenmarchese
Collaborator Author

@m1kola @troy0820 @mjudeikis @nilsanderselde To tidy up this PR a bit, I have squashed commits and resolved comments that were outdated or had been addressed. If there's something I missed, feel free to unresolve it and tag me.

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 18aeafb to fe52c21 Compare July 30, 2021 21:07

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from f78571f to eae44aa Compare August 5, 2021 18:02