Add new operator controller MachineSet #1532
cadenmarchese wants to merge 6 commits into Azure:master from
Conversation
You could add an e2e test to see if your reconciler scales the nodes automatically. We already do scaling, but we don't automatically resolve the scaling events.
Force-pushed from 2949495 to 325c0ad
Force-pushed from c687fdb to 8d53364
troy0820
left a comment
There was a problem hiding this comment.
This is looking good, just a few comments.
Force-pushed from 6445579 to 4072ed0
Please rebase pull request.
Force-pushed from 4072ed0 to 5ef8aa4
Please rebase pull request.
Force-pushed from 5ef8aa4 to 7f83543
nilsanderselde
left a comment
There was a problem hiding this comment.
Nit: Going forward, these clients should be alphabetized. There's a PR open to standardize it for the existing controllers.
There was a problem hiding this comment.
What if this happens?
- Customer creates a cluster with 3 masters and 3 workers
- Creates a new machine set without infra id in the name and with 3 replicas
- Scales down the default machine sets (those which contain the infra id) to 0. At this point we still have 3 masters and 3 workers
- Scales down the custom machine sets to 0.
There was a problem hiding this comment.
The controller won't intervene in this case as long as the total worker replica count is 3, because the List operation here counts all worker replicas towards the total, not just the ones matching our infra ID: https://github.com/cadenmarchese/ARO-RP/blob/machineset-controller/pkg/operator/controllers/machineset/machineset_controller.go#L49
If the user were to scale down our replicas before making their own, then the controller would intervene.
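For reference, a rough sketch of that tally (assuming a controller-runtime client and the machine-api v1beta1 types; this is illustrative, not the controller's exact code):

machinesets := &machinev1beta1.MachineSetList{}
err := r.client.List(ctx, machinesets, client.InNamespace(machineSetsNamespace))
if err != nil {
	return reconcile.Result{}, err
}

replicaCount := 0
for _, machineset := range machinesets.Items {
	// Every worker MachineSet in the namespace counts towards the total,
	// whether or not its name contains our infra ID.
	if machineset.Spec.Replicas != nil {
		replicaCount += int(*machineset.Spec.Replicas)
	}
}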
There was a problem hiding this comment.
@cadenmarchese yes, I think the controller won't intervene in the above case, but the customer ends up with 3 masters and 0 workers after step 4, no?
There was a problem hiding this comment.
You're right, the controller wouldn't scale ours back up, because they aren't the objects being modified at that time. I am struggling to think of how to account for this use case without changing other assumptions (for example, that custom machinesets should count, and that we aren't modifying custom objects). I'm happy for suggestions.
There was a problem hiding this comment.
We can:
- Scale customer-created machine sets back up to meet the requirement (technically the easiest solution, but might be more confusing to the customer)
- Always scale our default worker pool no matter which machine set was removed/modified. It is a bit more complicated, as we have to take multi-AZ setups into account: there will be cases where we have 3 machine sets (multi-AZ setup) and cases where we have only one (no AZ support in the region).
There was a problem hiding this comment.
@mjudeikis so what you are saying is: only watch for original machine set and try to scale up if needed and ignore everything else, right?
There was a problem hiding this comment.
So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.
From comments above, I am understanding:
- We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)
- We watch for changes to machineset objects and revert them if needed (no need to do AZ sorting, because we are just watching for scale down)
- We are moving to a minSupportedReplicas of 2 instead of 3 (#1532 (comment)), which is fine in the context of our upgrades and the user story
Does this sound accurate @m1kola @mjudeikis?
There was a problem hiding this comment.
We bail out if any custom machinesets are found in the list of machinesets (we can't cover all use cases here, so assume positive intent and leave it alone)
No. My understanding of MJ's idea is that we should do this:
- If the reconciliation request is for a default worker machine set (with the infra ID), we proceed with reconciliation and check the total number of worker nodes
- If the request is for something else - we bail out
Everything else sounds good to me, but I would like MJ to confirm that we understood his idea correctly.
There was a problem hiding this comment.
These two parts:
bail out any time we deviate from the default configuration
If customer scaled down the original machineSet AND there is another machineset (don't care about AZ or available replicas) - bail out
To me, this sounds like we are bailing out in any case where there are custom machinesets (not just when a custom machineset triggers reconcile). Something like this:
for _, machineset := range machinesets.Items {
	if !strings.Contains(machineset.Name, instance.Spec.InfraID) {
		return reconcile.Result{}, nil // don't do anything, and don't requeue
	}
	if machineset.Spec.Replicas != nil {
		replicaCount += int(*machineset.Spec.Replicas)
	}
}
So we would still reconcile each time an object is changed, but this way we're covered if the second point above is true. Let me know if I am not understanding correctly.
There was a problem hiding this comment.
I caught up with MJ about this, and we want to bail out in either case. @m1kola do you mind reviewing the controller code changes? From there we can wrap up tests.
pkg/operator/controllers/machineset/machineset_controller_test.go: outdated review threads (resolved)
m1kola
left a comment
There was a problem hiding this comment.
We are heading in the right direction, but we still need to handle 404 and non-404 errors on Get, decide what we want to do there, and adjust the tests accordingly.
I suggest that you put the test changes on hold until we settle on how we scale back machine sets. Otherwise I suspect you will spend a lot of time adjusting tests.
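For example, the usual shape of that Get error handling (a sketch assuming the k8s.io/apimachinery errors helpers; not the final implementation):

machineset := &machinev1beta1.MachineSet{}
err := r.client.Get(ctx, request.NamespacedName, machineset)
if kerrors.IsNotFound(err) {
	// 404: the MachineSet is gone, nothing to revert, don't requeue.
	return reconcile.Result{}, nil
}
if err != nil {
	// Any other error: return it so controller-runtime requeues the request.
	return reconcile.Result{}, err
}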
There was a problem hiding this comment.
Nit: you can just do something like this
Current:
	request := ctrl.Request{}
	request.Name = tt.objectName
	request.Namespace = machineSetsNamespace
Suggested:
	request := ctrl.Request{
		Name: tt.objectName,
		Namespace: machineSetsNamespace
	}
There was a problem hiding this comment.
I wasn't able to get this to work - it looks like ctrl.Request{} will only take NamespacedName. Let me know if I am missing something.
There was a problem hiding this comment.
It's because it embeds another type. I'm happy to leave it as is because it is going to be less cumbersome, but it is a good exercise to figure out how to make it work ;)
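For the record, a version that does compile (sketch only; ctrl.Request embeds types.NamespacedName from k8s.io/apimachinery/pkg/types, so the embedded field has to be named explicitly):

request := ctrl.Request{
	NamespacedName: types.NamespacedName{
		Name:      tt.objectName,
		Namespace: machineSetsNamespace,
	},
}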
pkg/operator/controllers/machineset/machineset_controller_test.go: outdated review threads (resolved)
Azure/azure-cli#18950 seems to still be a blocker for the checks here, but they are passing locally.
There was a problem hiding this comment.
I think our goal was to keep it simple, and bail out any time we deviate from the default configuration.
- If customer deleted the original worker MachineSets - bail out.
- If customer scaled down the original machineSet AND there are no other machinesets available - scale up.
- If customer scaled down the original machineSet AND there is another machineset (don't care about AZ or available replicas) - bail out, as we can't cover ALL customer ideas and the thinking behind them.
So in theory we are covering only the basic scenario where the customer didn't do any magic in MachineSets and just got the idea to scale down an existing one.
Force-pushed from e9c6231 to 18aeafb
@m1kola @troy0820 @mjudeikis @nilsanderselde To tidy up this PR a bit, I have squashed commits and resolved comments that were outdated or had been addressed. If there's something I missed, feel free to unresolve it and tag me.
Force-pushed from 18aeafb to fe52c21
Force-pushed from f78571f to eae44aa
apiversions: changed apiversion to 2021-01-15
Force-pushed from 47cbfb4 to a347f3c
Which issue this PR addresses:
User Story 8749435 - cluster upgrade is blocked if there is only one worker VM, because the ingress and console operators are Degraded. We need to ensure that there are at least two worker replicas in place to avoid this, as well as an additional third worker replica to meet Microsoft's support policies.
What this PR does / why we need it:
This controller watches MachineSet objects for changes and tallies worker replicas. If Spec.Replicas in an object matching our infraID is reduced such that the total replica count falls below three, the operator reverts the change and scales the MachineSet back up.
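For illustration, a rough sketch of that revert step (assuming a controller-runtime client and the minSupportedReplicas value discussed in the review above; not the exact controller code):

if replicaCount < minSupportedReplicas {
	// Add back however many replicas are missing from the minimum,
	// on the default MachineSet that was just scaled down.
	newReplicas := int32(minSupportedReplicas - replicaCount)
	if machineset.Spec.Replicas != nil {
		newReplicas += *machineset.Spec.Replicas
	}
	machineset.Spec.Replicas = &newReplicas
	if err := r.client.Update(ctx, machineset); err != nil {
		return reconcile.Result{}, err
	}
}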
Test plan for issue:
Is there any documentation that needs to be updated for this PR?
Likely not needed, as the ARO support policy already includes a line on maintaining a minimum of three worker nodes. That said, it may help to also note that the operator will now attempt to scale worker replicas back up when/if this merges.