Skip to content

Conversation

@sadasu
Copy link
Contributor

@sadasu sadasu commented Nov 4, 2020

Current implementation sets the Baremetal CO to Disabled in Platforms other than BareMetal before starting the Reconciler.

Alternate approach: (re-worded based on @andfasano 's feedback below)
The CO manifest can be defined with the CO in Disabled state. So, when CVO installs all the manifests in the /manifest directory, the Baremetal CO starts in the Disabled state. On the BareMetal platform, the CO resource would be amended within the reconcile loop.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2020
Copy link
Member

@stbenjam stbenjam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the dockerfile to uncomment the release label, so we see if e2e-agnostic passes with the CBO as part of the release payload?

main.go Outdated
// set ClusterOperator status to disabled=true, available=true
err = controllers.SetCOInDisabledState(osClient, releaseVersion)
if err != nil {
setupLog.Error(err, "unable to Baremetal CO to disabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a verb missing here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it ok to exit in such case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the verb.

Regarding whether to continue with setup of the operator or exit was not a clear choice for me.
If the CO is not updated correctly, the CVO considers the Operator to be in a bad state anyways. And in the case of non-Baremetal platforms, this code will run only once so there is no path where this will correct itself (which can happen when we update the CO in the reconcile methond).

Please check the "Althernate approach" that I listed above and let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that the alternate approach could be a little bit more clear. The resource will be amended within a reconcile loop, which could be a better place than the main entry point of the operator.

@sadasu sadasu force-pushed the CO-Disabled branch 3 times, most recently from f090c8b to 6da3e47 Compare November 4, 2020 19:41
@sadasu
Copy link
Contributor Author

sadasu commented Nov 4, 2020

/test unit

}

conds := defaultStatusConditions()
v1helpers.SetStatusCondition(&conds, setStatusCondition(OperatorDisabled, osconfigv1.ConditionTrue, string(ReasonUnsupported), "Nothing to do on this Platform"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking I tried to incorporate some of your suggestions from #61.

@sadasu
Copy link
Contributor Author

sadasu commented Nov 5, 2020

/test e2e-agnostic

@sadasu sadasu changed the title WIP: Set Baremetal CO to Disabled before Reconciler runs Set Baremetal CO to Disabled before Reconciler runs Nov 5, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2020
@stbenjam
Copy link
Member

stbenjam commented Nov 5, 2020

This looks good to me, and e2e-agnostic is passing.

@wking Would you mind taking a look at this?

/assign @wking

func (r *ProvisioningReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
//log := r.Log.WithValues("provisioning", req.NamespacedName)

enabled, err := r.isEnabled()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, what about keeping just a safety check to prevent the reconcile loop running, in case the Provisioning resource will be deployed by mistake?

Copy link
Contributor Author

@sadasu sadasu Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misread your comment earlier. I think we need to add this check back because the controller now watches the CO resource too. (https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239)

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2020
@dhellmann
Copy link
Contributor

I don't think setting the ClusterOperator status in main() is going to be sufficient to address the problem.

According to https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#how-clusterversionoperator-handles-clusteroperator-in-release-image, the cluster-version-operator creates the CO resource initially but each operator is responsible for re-creating its own ClusterOperator resource if it is deleted. So, if we only set the status to Disabled in main() then there will be no code to handle that delete case.

Since we can't ensure our Provisioning API resource is always present, I think we're going to actually need another controller added to this process to manage the disabled case. If we create a controller that reconciles on the ClusterOperator resource then it can detect the disabled state and set the CO values accordingly. It could also detect when the CO is deleted, and then schedule a reconcile after that to recreate the resource (we will be able to tell that case because we're in the reconcile loop looking for a resource that does not exist). The cluster-authentication-operator has something like this, using the controller defined in library-go (see https://github.com/openshift/cluster-authentication-operator/blob/master/vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go). I don't know if we will be able to use that implementation directly.

The existing controller that is reconciling on the Provisioning resource should continue to check if the operator is enabled and do nothing in that case (by restoring the isEnabled() call that Andrea commented on).

@sadasu
Copy link
Contributor Author

sadasu commented Nov 6, 2020

@dhellmann these were my concerns also. This PR just takes care of the successful install of CVO in non-Baremetal environments and doesn't really take care of the steady state scenarios.

I think #40 brings in functionality that can be used maintaining the CO in the correct state after initial bringup.

Specifically, https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239. With that change, the CO is another resource that is being watched. So, the reconcile loop would be called for any changes to the "baremetal" CO.

With this change in place, it makes sense to add back the isEnabled() check within the Reconciler and the updateCOState() already takes care of creating the baremetal CO if it doesn't exist.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2020
@sadasu
Copy link
Contributor Author

sadasu commented Nov 7, 2020

/test e2e-metal-ipi

@sadasu
Copy link
Contributor Author

sadasu commented Nov 9, 2020

/retest

@sadasu
Copy link
Contributor Author

sadasu commented Nov 9, 2020

/test e2e-agnostic

@sadasu
Copy link
Contributor Author

sadasu commented Nov 9, 2020

/test e2e-metal-ipi

@dhellmann
Copy link
Contributor

@dhellmann these were my concerns also. This PR just takes care of the successful install of CVO in non-Baremetal environments and doesn't really take care of the steady state scenarios.

I think #40 brings in functionality that can be used maintaining the CO in the correct state after initial bringup.

OK, good, I hadn't made the connection between that work and re-creating the ClusterOperator.

Copy link
Contributor

@dhellmann dhellmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

This looks good. We can iterate on the messages (see comments inline) in another PR.

/hold

Let's wait to merge it until our morning in case something breaks and we have to revert quickly again.

ReasonEmpty StatusReason = ""

// ReasonEmpty is an empty StatusReason
ReasonExpected StatusReason = "AsExpected"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not match the variable here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

main.go Outdated
// Check the Platform Type to determine the state of the CO
enabled, err := controllers.IsEnabled(osClient)
if err != nil {
setupLog.Error(err, "unable to determine Infrastructure Platform type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message assumes knowledge of what controllers.IsEnabled() is doing. Here we only know we are checking if we are enabled, and we can't figure that out but don't actually know why. I don't think it's worth holding up this PR over it, but we should plan a clean up patch in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

main.go Outdated
//Set ClusterOperator status to disabled=true, available=true
err = controllers.SetCOInDisabledState(osClient, releaseVersion)
if err != nil {
setupLog.Error(err, "unable to set Baremetal CO to disabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone reading the log may not know what a "CO" is. We should spell it out. The name is also "baremetal" not "Baremetal", right? Again, not worth holding this up, so we can fix it in the same PR that fixes the other message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2020
@sadasu
Copy link
Contributor Author

sadasu commented Nov 11, 2020

/test e2e-agnostic

// Remove our finalizer from the list and update it.
baremetalConfig.ObjectMeta.Finalizers = removeString(baremetalConfig.ObjectMeta.Finalizers,
metal3iov1alpha1.ProvisioningFinalizer)
if err := r.Client.Update(context.Background(), baremetalConfig); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as this update is applied the API is going to delete the resource. Is that what we want here? Or is there some other logic we want to apply first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had additional logic to apply, this is the place to do it as you suggest. We may want to delete the metal3 deployment when the provisioning CR is deleted but it is not necessary to specific that now.

Yes, removing the finalizer annotation lets OpenShift know that this resource and some internal data-structures it maintains for this resource can all be garbage collected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Normally the reason for adding a finalizer is to give the controller a chance to do something before the resource is deleted. The way this is implemented, we will be notified, but we won't actually make the API server wait to delete the resource until we've done what we wanted. In the other controllers I've seen, there is at least a check to decide if the controller is ready to have the finalizer removed. What sort of similar check could we apply here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks prime to be moved into a separate function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhellmann agreed. And, we should be looking into adding new logic to probably delete a metal3 deployment when the provisioning resource is deleted. Deliberately not adding a lot of new functionality in here as part of this PR. Just making sure that the functionality that is already part of the CBO works in both metal3 and agnostic CIs.

@sadasu
Copy link
Contributor Author

sadasu commented Nov 11, 2020

@asalkeld I needed to update 0000_31_cluster-baremetal-operator_05_rbac.yaml manually for this PR and generate-check doesn't like it. How can I fix that?

rules:
- resources:
- events
- apiGroups:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect all of these changes need to be made in files under config/rbac instead of directly to this generated file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if that is the complete picture because "deployments" are not specified in any yaml files in config/rbac but is present in manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that in config/rbac/role.yaml?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't realized #52 had merged overnight. That explains all that I was seeing.

Owns(&osconfigv1.ClusterOperator{}).
Complete(r)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are libraries that do this try https://github.com/thoas/go-funk

funk.Contains()

return false
}

// Helper function to remove string from a slice of strings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see Prune() in the lib above

}
if !enabled {
//Set ClusterOperator status to disabled=true, available=true
err = controllers.SetCOInDisabledState(osClient, releaseVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should move the content of SetCOInDisabledState() into updateCOStatus() - if there is anything missing

Suggested change
err = controllers.SetCOInDisabledState(osClient, releaseVersion)
err = controllers.updateCOStatus(...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you could do this within SetupWithManager so that you have the reconcile object

@dhellmann
Copy link
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@dhellmann
Copy link
Contributor

dhellmann commented Nov 12, 2020

The implementation looks OK as a first pass. I agree we can probably simplify, but since this works and we do have other work blocked on having some version of this, I think we should move ahead.

The commit history in the PR looks like it's going to be hard to follow later. There's a merge from master into the working branch and there are a bunch of separate RBAC commits. Could you rebase to get rid of the merge commit and squash the RBAC changes so they are 1 commit per type with the output of make generate manifests included at each step?

@dhellmann
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2020
@openshift-merge-robot openshift-merge-robot merged commit a7f05f2 into openshift:master Nov 12, 2020
@sadasu sadasu deleted the CO-Disabled branch January 7, 2021 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants