-
Notifications
You must be signed in to change notification settings - Fork 70
Set Baremetal CO to Disabled before Reconciler runs #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sadasu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
stbenjam
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the dockerfile to uncomment the release label, so we see if e2e-agnostic passes with the CBO as part of the release payload?
main.go
Outdated
| // set ClusterOperator status to disabled=true, available=true | ||
| err = controllers.SetCOInDisabledState(osClient, releaseVersion) | ||
| if err != nil { | ||
| setupLog.Error(err, "unable to Baremetal CO to disabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is a verb missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it ok to exit in such case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the verb.
Regarding whether to continue with setup of the operator or exit was not a clear choice for me.
If the CO is not updated correctly, the CVO considers the Operator to be in a bad state anyways. And in the case of non-Baremetal platforms, this code will run only once so there is no path where this will correct itself (which can happen when we update the CO in the reconcile methond).
Please check the "Althernate approach" that I listed above and let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My feeling is that the alternate approach could be a little bit more clear. The resource will be amended within a reconcile loop, which could be a better place than the main entry point of the operator.
f090c8b to
6da3e47
Compare
|
/test unit |
| } | ||
|
|
||
| conds := defaultStatusConditions() | ||
| v1helpers.SetStatusCondition(&conds, setStatusCondition(OperatorDisabled, osconfigv1.ConditionTrue, string(ReasonUnsupported), "Nothing to do on this Platform")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
/test e2e-agnostic |
| func (r *ProvisioningReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) { | ||
| //log := r.Log.WithValues("provisioning", req.NamespacedName) | ||
|
|
||
| enabled, err := r.isEnabled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought, what about keeping just a safety check to prevent the reconcile loop running, in case the Provisioning resource will be deployed by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread your comment earlier. I think we need to add this check back because the controller now watches the CO resource too. (https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239)
|
I don't think setting the ClusterOperator status in main() is going to be sufficient to address the problem. According to https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#how-clusterversionoperator-handles-clusteroperator-in-release-image, the cluster-version-operator creates the CO resource initially but each operator is responsible for re-creating its own ClusterOperator resource if it is deleted. So, if we only set the status to Disabled in main() then there will be no code to handle that delete case. Since we can't ensure our Provisioning API resource is always present, I think we're going to actually need another controller added to this process to manage the disabled case. If we create a controller that reconciles on the ClusterOperator resource then it can detect the disabled state and set the CO values accordingly. It could also detect when the CO is deleted, and then schedule a reconcile after that to recreate the resource (we will be able to tell that case because we're in the reconcile loop looking for a resource that does not exist). The cluster-authentication-operator has something like this, using the controller defined in library-go (see https://github.com/openshift/cluster-authentication-operator/blob/master/vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go). I don't know if we will be able to use that implementation directly. The existing controller that is reconciling on the Provisioning resource should continue to check if the operator is enabled and do nothing in that case (by restoring the isEnabled() call that Andrea commented on). |
|
@dhellmann these were my concerns also. This PR just takes care of the successful install of CVO in non-Baremetal environments and doesn't really take care of the steady state scenarios. I think #40 brings in functionality that can be used maintaining the CO in the correct state after initial bringup. Specifically, https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239. With that change, the CO is another resource that is being watched. So, the reconcile loop would be called for any changes to the "baremetal" CO. With this change in place, it makes sense to add back the isEnabled() check within the Reconciler and the updateCOState() already takes care of creating the baremetal CO if it doesn't exist. |
|
/test e2e-metal-ipi |
|
/retest |
|
/test e2e-agnostic |
|
/test e2e-metal-ipi |
OK, good, I hadn't made the connection between that work and re-creating the ClusterOperator. |
dhellmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
This looks good. We can iterate on the messages (see comments inline) in another PR.
/hold
Let's wait to merge it until our morning in case something breaks and we have to revert quickly again.
| ReasonEmpty StatusReason = "" | ||
|
|
||
| // ReasonEmpty is an empty StatusReason | ||
| ReasonExpected StatusReason = "AsExpected" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does not match the variable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
main.go
Outdated
| // Check the Platform Type to determine the state of the CO | ||
| enabled, err := controllers.IsEnabled(osClient) | ||
| if err != nil { | ||
| setupLog.Error(err, "unable to determine Infrastructure Platform type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message assumes knowledge of what controllers.IsEnabled() is doing. Here we only know we are checking if we are enabled, and we can't figure that out but don't actually know why. I don't think it's worth holding up this PR over it, but we should plan a clean up patch in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
main.go
Outdated
| //Set ClusterOperator status to disabled=true, available=true | ||
| err = controllers.SetCOInDisabledState(osClient, releaseVersion) | ||
| if err != nil { | ||
| setupLog.Error(err, "unable to set Baremetal CO to disabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone reading the log may not know what a "CO" is. We should spell it out. The name is also "baremetal" not "Baremetal", right? Again, not worth holding this up, so we can fix it in the same PR that fixes the other message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
/test e2e-agnostic |
| // Remove our finalizer from the list and update it. | ||
| baremetalConfig.ObjectMeta.Finalizers = removeString(baremetalConfig.ObjectMeta.Finalizers, | ||
| metal3iov1alpha1.ProvisioningFinalizer) | ||
| if err := r.Client.Update(context.Background(), baremetalConfig); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as this update is applied the API is going to delete the resource. Is that what we want here? Or is there some other logic we want to apply first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had additional logic to apply, this is the place to do it as you suggest. We may want to delete the metal3 deployment when the provisioning CR is deleted but it is not necessary to specific that now.
Yes, removing the finalizer annotation lets OpenShift know that this resource and some internal data-structures it maintains for this resource can all be garbage collected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Normally the reason for adding a finalizer is to give the controller a chance to do something before the resource is deleted. The way this is implemented, we will be notified, but we won't actually make the API server wait to delete the resource until we've done what we wanted. In the other controllers I've seen, there is at least a check to decide if the controller is ready to have the finalizer removed. What sort of similar check could we apply here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks prime to be moved into a separate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhellmann agreed. And, we should be looking into adding new logic to probably delete a metal3 deployment when the provisioning resource is deleted. Deliberately not adding a lot of new functionality in here as part of this PR. Just making sure that the functionality that is already part of the CBO works in both metal3 and agnostic CIs.
|
@asalkeld I needed to update 0000_31_cluster-baremetal-operator_05_rbac.yaml manually for this PR and generate-check doesn't like it. How can I fix that? |
| rules: | ||
| - resources: | ||
| - events | ||
| - apiGroups: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect all of these changes need to be made in files under config/rbac instead of directly to this generated file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if that is the complete picture because "deployments" are not specified in any yaml files in config/rbac but is present in manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that in config/rbac/role.yaml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't realized #52 had merged overnight. That explains all that I was seeing.
| Owns(&osconfigv1.ClusterOperator{}). | ||
| Complete(r) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are libraries that do this try https://github.com/thoas/go-funk
funk.Contains()
| return false | ||
| } | ||
|
|
||
| // Helper function to remove string from a slice of strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see Prune() in the lib above
| } | ||
| if !enabled { | ||
| //Set ClusterOperator status to disabled=true, available=true | ||
| err = controllers.SetCOInDisabledState(osClient, releaseVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should move the content of SetCOInDisabledState() into updateCOStatus() - if there is anything missing
| err = controllers.SetCOInDisabledState(osClient, releaseVersion) | |
| err = controllers.updateCOStatus(...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, you could do this within SetupWithManager so that you have the reconcile object
|
/hold cancel |
|
The implementation looks OK as a first pass. I agree we can probably simplify, but since this works and we do have other work blocked on having some version of this, I think we should move ahead. The commit history in the PR looks like it's going to be hard to follow later. There's a merge from master into the working branch and there are a bunch of separate RBAC commits. Could you rebase to get rid of the merge commit and squash the RBAC changes so they are 1 commit per type with the output of |
The controller-runtime client's cache is not initialized that early, so use the OpenShift client instead.
|
/lgtm |
Current implementation sets the Baremetal CO to Disabled in Platforms other than BareMetal before starting the Reconciler.
Alternate approach: (re-worded based on @andfasano 's feedback below)
The CO manifest can be defined with the CO in Disabled state. So, when CVO installs all the manifests in the /manifest directory, the Baremetal CO starts in the Disabled state. On the BareMetal platform, the CO resource would be amended within the reconcile loop.