
Add new operator controller MachineSet #1532

Closed
cadenmarchese wants to merge 6 commits into Azure:master from cadenmarchese:machineset-controller

Conversation

Collaborator

@cadenmarchese cadenmarchese commented Jun 7, 2021

Which issue this PR addresses:

User Story 8749435 - cluster upgrade is blocked if there is only one worker VM, because the ingress and console operators are Degraded. We need to ensure that there are at least two worker replicas in place to avoid this, as well as an additional third worker replica to meet Microsoft's support policies.

What this PR does / why we need it:

This controller watches MachineSet objects for changes and tallies worker replicas. If Spec.Replicas on an object matching our infraID is reduced such that the total worker replica count falls below three, the operator reverts the change and scales the MachineSet back up.
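A minimal sketch of that reconcile decision (the struct and function names here are illustrative stand-ins, not the actual operator code):

```go
package main

import (
	"fmt"
	"strings"
)

// machineSet is an illustrative stand-in for the worker MachineSet object;
// only the fields this sketch needs are included.
type machineSet struct {
	Name     string
	Replicas *int32 // mirrors Spec.Replicas, which may be nil
}

// totalReplicas tallies Spec.Replicas across all worker MachineSets,
// custom ones included.
func totalReplicas(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		if ms.Replicas != nil {
			total += *ms.Replicas
		}
	}
	return total
}

// replicasToRestore returns how many replicas the modified MachineSet needs
// back so the cluster-wide total meets min; 0 means no action. Only
// MachineSets whose name contains the infra ID are ever scaled back up.
func replicasToRestore(sets []machineSet, modified, infraID string, min int32) int32 {
	if !strings.Contains(modified, infraID) {
		return 0 // a custom MachineSet changed: leave it alone
	}
	if t := totalReplicas(sets); t < min {
		return min - t
	}
	return 0
}

func main() {
	one, zero := int32(1), int32(0)
	sets := []machineSet{
		{Name: "aro-xyz-worker-1", Replicas: &one},
		{Name: "aro-xyz-worker-2", Replicas: &zero},
	}
	// Total is 1 and the minimum is 3, so the scaled-down default set gets 2 back.
	fmt.Println(replicasToRestore(sets, "aro-xyz-worker-2", "aro-xyz", 3))
}
```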

Test plan for issue:

  • Added E2E coverage to confirm that the object is modified as expected.
  • Tested the controller in a development cluster with a custom MachineSet in place, to confirm that the controller functioned as intended and that the custom MachineSet was never modified.
  • Confirmed that the controller only reconciles on changes to MachineSet objects.

Is there any documentation that needs to be updated for this PR?

Likely not needed, as the ARO support policy already includes a line on maintaining a minimum of three worker nodes. That said, it may also help to note that the operator will attempt to scale worker MachineSets back up once this merges.

@cadenmarchese cadenmarchese requested review from bennerv and jewzaam June 7, 2021 16:55
@troy0820
Contributor

troy0820 commented Jun 7, 2021

You could add an e2e test to see if your reconciler scales the nodes automatically. We already do scaling, but we don't automatically resolve the scaling events.

Contributor

@troy0820 troy0820 left a comment

This is looking good, just a few comments.

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 6445579 to 4072ed0 Compare July 12, 2021 21:23
@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jul 13, 2021
@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 4072ed0 to 5ef8aa4 Compare July 13, 2021 16:22
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Jul 13, 2021
@github-actions

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jul 15, 2021
@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 5ef8aa4 to 7f83543 Compare July 15, 2021 20:52
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Jul 15, 2021
Contributor

@nilsanderselde nilsanderselde left a comment

Nit: Going forward, these clients should be alphabetized. There's a PR open to standardize it for the existing controllers.

Contributor

What if this happens?

  1. Customer creates a cluster with 3 masters and 3 workers
  2. Creates a new machine set without the infra ID in the name, with 3 replicas
  3. Scales down the default machine sets (those which contain the infra ID) to 0. At this point we still have 3 masters and 3 workers
  4. Scales down the custom machine sets to 0.

Collaborator Author

The controller won't intervene in this case as long as total worker replica count is 3, because the List operation here counts all worker replicas towards replica count, not just ones matching our infra ID: https://github.com/cadenmarchese/ARO-RP/blob/machineset-controller/pkg/operator/controllers/machineset/machineset_controller.go#L49

If the user were to scale down our replicas before making their own, then the controller would intervene.

Contributor

@m1kola m1kola Jul 20, 2021

@cadenmarchese yes, I think the controller won't intervene in the above case, but the customer ends up with 3 masters and 0 workers after step 4, no?

Collaborator Author

You're right, the controller wouldn't scale ours back up, because they aren't the objects being modified at that time. I'm struggling to think of how to account for this case without changing other assumptions (for example, that custom machinesets should count towards the total, and that we never modify custom objects). I'm open to suggestions.

Contributor

@m1kola m1kola Jul 22, 2021

We can:

  1. Scale customer-created machine sets back to meet the requirement (technically the easiest solution, but might be more confusing to the customer)
  2. Always scale our default worker pool, no matter which machine set was removed/modified. This is a bit more complicated, as we have to take multi-AZ setups into account: there will be cases where we have 3 machine sets (multi-AZ setup) and cases where we have only one AZ (no AZ support in the region).

Contributor

@mjudeikis so what you are saying is: only watch the original machine set, try to scale up if needed, and ignore everything else, right?

Collaborator Author

So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.

From comments above, I am understanding:

  • We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)
  • We watch for changes to machineset objects and revert them if needed (no need to do AZ sorting, because we are just watching for scale down)
  • We are moving to 2 minSupportedReplicas instead of 3 (Add new operator controller MachineSet #1532 (comment)), which is fine in the context of our upgrades and the user story

Does this sound accurate @m1kola @mjudeikis?

Contributor

We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)

No. My understanding of MJ's idea is that we should do this:

  • If the reconciliation request is for a default worker machine set (with the infra ID), we proceed with reconciliation and check the total number of worker nodes
  • If the request is for anything else, we bail out

Everything else sounds good to me, but I would like MJ to confirm that we understood his idea correctly.

Collaborator Author

These two parts:

bail out any time we deviate from the default configuration

If the customer scaled down the original MachineSet AND there is another machine set (regardless of AZ or available replicas) - bail out

To me, this sounds like we are bailing out in any case where there are custom machinesets (not just when a custom machineset triggers reconcile). Something like this:

for _, machineset := range machinesets.Items {
	if !strings.Contains(machineset.Name, instance.Spec.InfraID) {
		// a custom MachineSet exists: bail out without requeueing
		return reconcile.Result{}, nil
	}
	if machineset.Spec.Replicas != nil {
		replicaCount += int(*machineset.Spec.Replicas)
	}
}

So we would still reconcile each time an object is changed, but this way we're covered if the second point above is true. Let me know if I am not understanding correctly.

Collaborator Author

I caught up with MJ about this, and we want to bail out in either case. @m1kola do you mind reviewing the controller code changes? From there we can wrap up tests.

Contributor

@m1kola m1kola left a comment

We are heading in the right direction, but we still need to handle 404 and non-404 errors on Get, decide what we want to do in each case, and adjust the tests accordingly.

I suggest you put test changes on hold until we settle on how we scale machine sets back. Otherwise I suspect you will spend a lot of time adjusting tests.

Comment on lines +267 to +317
Contributor

Nit: you can just do something like this

Suggested change:

	request := ctrl.Request{}
	request.Name = tt.objectName
	request.Namespace = machineSetsNamespace

becomes:

	request := ctrl.Request{
		Name:      tt.objectName,
		Namespace: machineSetsNamespace,
	}

Collaborator Author

I wasn't able to get this to work - it looks like ctrl.Request{} will only take NamespacedName. Let me know if I am missing something.

Contributor

It's because it embeds another type. I'm happy to leave it as is because that's less cumbersome, but it's a good exercise to figure out how to make it work ;)
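The compile error comes from struct embedding: ctrl.Request embeds types.NamespacedName rather than declaring Name/Namespace fields of its own. A stdlib-only mimic of those two types (field names copied from the real API, everything else simplified) shows the working composite literal:

```go
package main

import "fmt"

// NamespacedName mimics k8s.io/apimachinery/pkg/types.NamespacedName.
type NamespacedName struct {
	Namespace string
	Name      string
}

// Request mimics ctrl.Request, which embeds NamespacedName.
type Request struct {
	NamespacedName
}

func main() {
	// The embedded field must be named explicitly in a composite literal...
	req := Request{NamespacedName: NamespacedName{
		Namespace: "openshift-machine-api",
		Name:      "aro-worker",
	}}
	// ...even though the promoted fields read naturally afterwards.
	fmt.Println(req.Name, req.Namespace)
}
```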

@cadenmarchese
Collaborator Author

Azure/azure-cli#18950 seems to still be a blocker for the checks here, but they are passing locally.

Contributor

@mjudeikis mjudeikis left a comment

my 2 cents

Contributor

I think our goal was to keep it simple, and bail out any time we deviate from the default configuration.

  1. If the customer deleted the original worker MachineSets - bail out
  2. If the customer scaled down the original MachineSet AND there are no other machine sets available - scale up.
  3. If the customer scaled down the original MachineSet AND there is another machine set (regardless of AZ or available replicas) - bail out, as we can't cover ALL customer ideas and the thinking behind them.

So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.
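Those three rules could be condensed into a single decision helper (illustrative names only, not the actual controller code):

```go
package main

import "fmt"

// shouldScaleUp encodes the three rules above: bail out if the original
// worker MachineSet was deleted (rule 1) or if any other machine set
// exists (rule 3); otherwise scale the original back up (rule 2).
func shouldScaleUp(originalExists bool, otherMachineSets int) bool {
	if !originalExists {
		return false // rule 1: original deleted, bail out
	}
	if otherMachineSets > 0 {
		return false // rule 3: customer has their own machine sets, bail out
	}
	return true // rule 2: default-only cluster, safe to scale up
}

func main() {
	fmt.Println(shouldScaleUp(true, 0))  // default-only cluster: scale up
	fmt.Println(shouldScaleUp(true, 2))  // custom machine sets present: bail out
	fmt.Println(shouldScaleUp(false, 0)) // original deleted: bail out
}
```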

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from e9c6231 to 18aeafb Compare July 30, 2021 20:01
@cadenmarchese
Collaborator Author

@m1kola @troy0820 @mjudeikis @nilsanderselde To tidy up this PR a bit, I have squashed commits and resolved comments that were outdated or had been addressed. If there's something I missed, feel free to unresolve it and tag me.

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from 18aeafb to fe52c21 Compare July 30, 2021 21:07

@cadenmarchese cadenmarchese force-pushed the machineset-controller branch from f78571f to eae44aa Compare August 5, 2021 18:02