Set Baremetal CO to Disabled before Reconciler runs #63

sadasu · 2020-11-04T01:21:59Z

Current implementation sets the Baremetal CO to Disabled in Platforms other than BareMetal before starting the Reconciler.

Alternate approach (reworded based on @andfasano's feedback below):
The CO manifest can be defined with the CO in Disabled state. So, when CVO installs all the manifests in the /manifest directory, the Baremetal CO starts in the Disabled state. On the BareMetal platform, the CO resource would be amended within the reconcile loop.

openshift-ci-robot · 2020-11-04T01:22:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [sadasu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stbenjam

Can you update the dockerfile to uncomment the release label, so we see if e2e-agnostic passes with the CBO as part of the release payload?

stbenjam · 2020-11-04T11:56:43Z

main.go

+		// set ClusterOperator status to disabled=true, available=true
+		err = controllers.SetCOInDisabledState(osClient, releaseVersion)
+		if err != nil {
+			setupLog.Error(err, "unable to Baremetal CO to disabled")


is a verb missing here?

Also, is it ok to exit in such case?

Fixed the verb.

Whether to continue with setup of the operator or exit was not a clear choice for me.
If the CO is not updated correctly, the CVO considers the Operator to be in a bad state anyway. And on non-Baremetal platforms, this code runs only once, so there is no path where this will correct itself (which can happen when we update the CO in the reconcile method).

Please check the "Alternate approach" that I listed above and let me know what you think.

My feeling is that the alternate approach could be a little clearer. The resource would be amended within a reconcile loop, which could be a better place than the main entry point of the operator.

sadasu · 2020-11-04T20:22:10Z

/test unit

sadasu · 2020-11-05T14:26:30Z

controllers/clusteroperator.go

+	}
+
+	conds := defaultStatusConditions()
+	v1helpers.SetStatusCondition(&conds, setStatusCondition(OperatorDisabled, osconfigv1.ConditionTrue, string(ReasonUnsupported), "Nothing to do on this Platform"))


@wking I tried to incorporate some of your suggestions from #61.

sadasu · 2020-11-05T15:26:14Z

/test e2e-agnostic

stbenjam · 2020-11-05T17:20:18Z

This looks good to me, and e2e-agnostic is passing.

@wking Would you mind taking a look at this?

/assign @wking

andfasano · 2020-11-06T08:38:31Z

controllers/provisioning_controller.go

 func (r *ProvisioningReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
 	//log := r.Log.WithValues("provisioning", req.NamespacedName)

-	enabled, err := r.isEnabled()


On second thought, what about keeping a safety check to prevent the reconcile loop from running in case the Provisioning resource is deployed by mistake?

I misread your comment earlier. I think we need to add this check back because the controller now watches the CO resource too. (https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239)

dhellmann · 2020-11-06T17:29:40Z

I don't think setting the ClusterOperator status in main() is going to be sufficient to address the problem.

According to https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#how-clusterversionoperator-handles-clusteroperator-in-release-image, the cluster-version-operator creates the CO resource initially but each operator is responsible for re-creating its own ClusterOperator resource if it is deleted. So, if we only set the status to Disabled in main() then there will be no code to handle that delete case.

Since we can't ensure our Provisioning API resource is always present, I think we're going to actually need another controller added to this process to manage the disabled case. If we create a controller that reconciles on the ClusterOperator resource then it can detect the disabled state and set the CO values accordingly. It could also detect when the CO is deleted, and then schedule a reconcile after that to recreate the resource (we will be able to tell that case because we're in the reconcile loop looking for a resource that does not exist). The cluster-authentication-operator has something like this, using the controller defined in library-go (see https://github.com/openshift/cluster-authentication-operator/blob/master/vendor/github.com/openshift/library-go/pkg/operator/status/status_controller.go). I don't know if we will be able to use that implementation directly.

The existing controller that is reconciling on the Provisioning resource should continue to check whether the operator is enabled and do nothing if it is not (by restoring the isEnabled() call that Andrea commented on).
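
To make the proposal above concrete, here is a minimal sketch of a reconcile step keyed on the ClusterOperator: it recreates the resource when it is missing and marks it Disabled on platforms other than BareMetal. Everything in it is an assumption for illustration: the types, the in-memory `coStore`, the function names, and the reason strings are stand-ins, not the operator's real code or the library-go status controller.

```go
package main

import (
	"errors"
	"fmt"
)

// All types below are simplified, hypothetical stand-ins for the
// OpenShift API types; none of these names come from the operator.

var errNotFound = errors.New("clusteroperator not found")

type condition struct {
	Type   string
	Status string
	Reason string
}

type clusterOperator struct {
	Name       string
	Conditions []condition
}

// coStore is an in-memory replacement for the API server.
type coStore map[string]*clusterOperator

func (s coStore) get(name string) (*clusterOperator, error) {
	co, ok := s[name]
	if !ok {
		return nil, errNotFound
	}
	return co, nil
}

// setCondition replaces the condition with the same Type or appends it.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// reconcileCO recreates the ClusterOperator if it is missing and, on
// platforms other than BareMetal, marks it Disabled and Available.
func reconcileCO(s coStore, name, platform string) error {
	co, err := s.get(name)
	if errors.Is(err, errNotFound) {
		co = &clusterOperator{Name: name}
		s[name] = co
	} else if err != nil {
		return fmt.Errorf("getting ClusterOperator %q: %w", name, err)
	}
	if platform != "BareMetal" {
		co.Conditions = setCondition(co.Conditions,
			condition{Type: "Disabled", Status: "True", Reason: "UnsupportedPlatform"})
		co.Conditions = setCondition(co.Conditions,
			condition{Type: "Available", Status: "True", Reason: "AsExpected"})
	}
	return nil
}

func main() {
	store := coStore{}
	// The first pass creates the resource; after a delete, the next pass recreates it.
	_ = reconcileCO(store, "baremetal", "AWS")
	delete(store, "baremetal")
	_ = reconcileCO(store, "baremetal", "AWS")
	fmt.Printf("%+v\n", *store["baremetal"])
}
```

Because the reconcile step is idempotent, it does not matter whether it is triggered by a delete event, a status change, or operator start-up; a one-time call in main() cannot offer that.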

sadasu · 2020-11-06T20:54:34Z

@dhellmann these were my concerns also. This PR just takes care of the successful install of CVO in non-Baremetal environments and doesn't really take care of the steady state scenarios.

I think #40 brings in functionality that can be used for maintaining the CO in the correct state after initial bringup.

Specifically, https://github.com/openshift/cluster-baremetal-operator/blob/master/controllers/provisioning_controller.go#L239. With that change, the CO is another resource that is being watched. So, the reconcile loop would be called for any changes to the "baremetal" CO.

With this change in place, it makes sense to add back the isEnabled() check within the Reconciler and the updateCOState() already takes care of creating the baremetal CO if it doesn't exist.
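
A rough sketch of the guard being discussed, assuming a trimmed-down reconciler: once the controller also watches the ClusterOperator, events can arrive on any platform, so the reconcile function returns early when the operator is disabled. The `reconciler` and `result` types and the `platform` field are hypothetical stand-ins, not the real ProvisioningReconciler.

```go
package main

import "fmt"

// result mirrors the idea of ctrl.Result; it is a simplified stand-in.
type result struct{ Requeue bool }

// reconciler is a hypothetical, trimmed-down ProvisioningReconciler.
type reconciler struct {
	platform string // assumed to come from the Infrastructure resource
	deployed bool   // records whether the provisioning work ran
}

// isEnabled reports whether the operator should act on this platform.
func (r *reconciler) isEnabled() bool {
	return r.platform == "BareMetal"
}

// Reconcile returns early when disabled, so events from the watched
// ClusterOperator never trigger provisioning work on other platforms.
func (r *reconciler) Reconcile() (result, error) {
	if !r.isEnabled() {
		return result{}, nil
	}
	r.deployed = true // placeholder for the real provisioning logic
	return result{}, nil
}

func main() {
	for _, p := range []string{"AWS", "BareMetal"} {
		r := &reconciler{platform: p}
		_, _ = r.Reconcile()
		fmt.Println(p, r.deployed)
	}
}
```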

sadasu · 2020-11-07T17:59:37Z

/test e2e-metal-ipi

sadasu · 2020-11-09T17:55:31Z

/retest

sadasu · 2020-11-09T19:13:01Z

/test e2e-agnostic

sadasu · 2020-11-09T21:12:15Z

/test e2e-metal-ipi

dhellmann · 2020-11-09T21:27:13Z

> @dhellmann these were my concerns also. This PR just takes care of the successful install of CVO in non-Baremetal environments and doesn't really take care of the steady state scenarios.
>
> I think #40 brings in functionality that can be used maintaining the CO in the correct state after initial bringup.

OK, good, I hadn't made the connection between that work and re-creating the ClusterOperator.

dhellmann

/lgtm

This looks good. We can iterate on the messages (see comments inline) in another PR.

/hold

Let's wait to merge it until our morning in case something breaks and we have to revert quickly again.

dhellmann · 2020-11-09T21:29:33Z

controllers/clusteroperator.go

 	ReasonEmpty StatusReason = ""

+	// ReasonEmpty is an empty StatusReason
+	ReasonExpected StatusReason = "AsExpected"


The comment does not match the variable here.

dhellmann · 2020-11-09T21:46:36Z

main.go

+	// Check the Platform Type to determine the state of the CO
+	enabled, err := controllers.IsEnabled(osClient)
+	if err != nil {
+		setupLog.Error(err, "unable to determine Infrastructure Platform type")


This error message assumes knowledge of what controllers.IsEnabled() is doing. At this level we only know that we are checking whether we are enabled; when that check fails, we don't actually know why. I don't think it's worth holding up this PR over it, but we should plan a cleanup patch in another PR.

dhellmann · 2020-11-09T21:47:39Z

main.go

+		//Set ClusterOperator status to disabled=true, available=true
+		err = controllers.SetCOInDisabledState(osClient, releaseVersion)
+		if err != nil {
+			setupLog.Error(err, "unable to set Baremetal CO to disabled")


Someone reading the log may not know what a "CO" is. We should spell it out. The name is also "baremetal" not "Baremetal", right? Again, not worth holding this up, so we can fix it in the same PR that fixes the other message.
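
One possible shape for the follow-up cleanup, sketched under assumptions: the callee adds the detail it knows about (the Infrastructure lookup), the caller logs only what it knows (that the enabled check failed), and the ClusterOperator is named in full. The `isEnabled` and `describe` functions and the message wording are hypothetical, not the text that was eventually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// errLookup stands in for whatever failure the real check might return.
var errLookup = errors.New("infrastructure lookup failed")

// isEnabled is a hypothetical stand-in for controllers.IsEnabled. It
// wraps the cause with its own context, so callers need not know how
// the check is implemented.
func isEnabled(fail bool) (bool, error) {
	if fail {
		return false, fmt.Errorf("unable to determine Infrastructure platform type: %w", errLookup)
	}
	return true, nil
}

// describe builds the caller-side log message; it states only what the
// caller knows and appends the wrapped detail from the callee.
func describe(err error) string {
	return fmt.Sprintf("unable to determine whether the operator is enabled: %v", err)
}

func main() {
	if _, err := isEnabled(true); err != nil {
		fmt.Println(describe(err))
	}
	// Spelling out the resource kind and using its actual name.
	fmt.Println(`unable to set "baremetal" ClusterOperator to Disabled`)
}
```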

sadasu · 2020-11-11T12:01:58Z

/test e2e-agnostic

dhellmann · 2020-11-11T16:46:18Z

controllers/provisioning_controller.go

+			// Remove our finalizer from the list and update it.
+			baremetalConfig.ObjectMeta.Finalizers = removeString(baremetalConfig.ObjectMeta.Finalizers,
+				metal3iov1alpha1.ProvisioningFinalizer)
+			if err := r.Client.Update(context.Background(), baremetalConfig); err != nil {


As soon as this update is applied the API is going to delete the resource. Is that what we want here? Or is there some other logic we want to apply first?

If we had additional logic to apply, this is the place to do it, as you suggest. We may want to delete the metal3 deployment when the provisioning CR is deleted, but it is not necessary to specify that now.

Yes, removing the finalizer lets OpenShift know that this resource and some internal data structures it maintains for this resource can all be garbage collected.

OK. Normally the reason for adding a finalizer is to give the controller a chance to do something before the resource is deleted. The way this is implemented, we will be notified, but we won't actually make the API server wait to delete the resource until we've done what we wanted. In the other controllers I've seen, there is at least a check to decide if the controller is ready to have the finalizer removed. What sort of similar check could we apply here?

this looks prime to be moved into a separate function

@dhellmann agreed. And, we should be looking into adding new logic to probably delete a metal3 deployment when the provisioning resource is deleted. Deliberately not adding a lot of new functionality in here as part of this PR. Just making sure that the functionality that is already part of the CBO works in both metal3 and agnostic CIs.
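
As an illustration of the readiness check asked about in this thread, here is a small sketch that keeps the finalizer until a cleanup predicate reports success and asks for a requeue otherwise. The `resource` type, the finalizer string, and the `cleanupDone` predicate (for example, "the metal3 deployment is gone") are all assumptions; the merged code removes the finalizer unconditionally.

```go
package main

import "fmt"

// provisioningFinalizer is a hypothetical finalizer name.
const provisioningFinalizer = "example.io/provisioning"

// resource is a simplified stand-in for the Provisioning CR.
type resource struct {
	Deleting   bool // true once a deletion timestamp has been set
	Finalizers []string
}

// handleDeletion removes our finalizer only after cleanup has finished.
// It returns true when the caller should requeue and try again later.
func handleDeletion(r *resource, cleanupDone func() bool) (requeue bool) {
	if !r.Deleting {
		return false
	}
	if !cleanupDone() {
		// Keep the finalizer so the API server keeps the resource around.
		return true
	}
	kept := make([]string, 0, len(r.Finalizers))
	for _, f := range r.Finalizers {
		if f != provisioningFinalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
	return false
}

func main() {
	r := &resource{Deleting: true, Finalizers: []string{provisioningFinalizer}}
	fmt.Println(handleDeletion(r, func() bool { return false }), r.Finalizers)
	fmt.Println(handleDeletion(r, func() bool { return true }), r.Finalizers)
}
```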

sadasu · 2020-11-11T20:35:14Z

@asalkeld I needed to update 0000_31_cluster-baremetal-operator_05_rbac.yaml manually for this PR and generate-check doesn't like it. How can I fix that?

dhellmann · 2020-11-11T20:37:28Z

manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml

 rules:
- resources:
-  - events
+- apiGroups:


I expect all of these changes need to be made in files under config/rbac instead of directly to this generated file.

I am not sure if that is the complete picture because "deployments" are not specified in any yaml files in config/rbac but is present in manifests/0000_31_cluster-baremetal-operator_05_rbac.yaml.

Isn't that in config/rbac/role.yaml?

I hadn't realized #52 had merged overnight. That explains all that I was seeing.

asalkeld · 2020-11-12T00:19:15Z

controllers/provisioning_controller.go

 		Owns(&osconfigv1.ClusterOperator{}).
 		Complete(r)
 }
+


There are libraries that do this; try https://github.com/thoas/go-funk

funk.Contains()

asalkeld · 2020-11-12T00:20:04Z

controllers/provisioning_controller.go

+	return false
+}
+
+// Helper function to remove string from a slice of strings


see Prune() in the lib above
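
For comparison with the library suggestion, a dependency-free sketch of the two helpers under review. `removeString` is the name used in the diff; `containsString` and both function bodies are assumptions written for illustration, not the code from this PR.

```go
package main

import "fmt"

// containsString reports whether s is present in list.
func containsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a new slice with every occurrence of s removed;
// the input slice is left unmodified.
func removeString(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, item := range list {
		if item != s {
			out = append(out, item)
		}
	}
	return out
}

func main() {
	finalizers := []string{"a", "b"}
	fmt.Println(containsString(finalizers, "a"))
	fmt.Println(removeString(finalizers, "a"))
}
```

Helpers this small trade a few lines of code against an extra vendored dependency; either choice is defensible.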

asalkeld · 2020-11-12T00:21:23Z

main.go

+	}
+	if !enabled {
+		//Set ClusterOperator status to disabled=true, available=true
+		err = controllers.SetCOInDisabledState(osClient, releaseVersion)


I think you should move the content of SetCOInDisabledState() into updateCOStatus() - if there is anything missing

Suggested change

err = controllers.SetCOInDisabledState(osClient, releaseVersion)

err = controllers.updateCOStatus(...)

also, you could do this within SetupWithManager so that you have the reconcile object

dhellmann · 2020-11-12T12:57:48Z

/hold cancel

dhellmann · 2020-11-12T13:42:20Z

The implementation looks OK as a first pass. I agree we can probably simplify, but since this works and we do have other work blocked on having some version of this, I think we should move ahead.

The commit history in the PR looks like it's going to be hard to follow later. There's a merge from master into the working branch and there are a bunch of separate RBAC commits. Could you rebase to get rid of the merge commit and squash the RBAC changes so they are 1 commit per type with the output of make generate manifests included at each step?

The controller-runtime client's cache is not initialized that early, so use the OpenShift client instead.

dhellmann · 2020-11-12T17:31:40Z

/lgtm

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2020

openshift-ci-robot requested review from andfasano and stbenjam November 4, 2020 01:22

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2020

stbenjam requested changes Nov 4, 2020

stbenjam reviewed Nov 4, 2020

sadasu force-pushed the CO-Disabled branch 3 times, most recently from f090c8b to 6da3e47 Compare November 4, 2020 19:41

sadasu commented Nov 5, 2020

sadasu changed the title ~~WIP: Set Baremetal CO to Disabled before Reconciler runs~~ Set Baremetal CO to Disabled before Reconciler runs Nov 5, 2020

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2020

openshift-ci-robot assigned wking Nov 5, 2020

andfasano reviewed Nov 6, 2020

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2020

sadasu force-pushed the CO-Disabled branch from 48948da to 7a7715c Compare November 6, 2020 22:53

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2020

dhellmann reviewed Nov 9, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2020

sadasu force-pushed the CO-Disabled branch from 92df85c to b309a6b Compare November 11, 2020 05:35

sadasu force-pushed the CO-Disabled branch from 7e8490d to 09a1a65 Compare November 11, 2020 16:11

dhellmann reviewed Nov 11, 2020

asalkeld reviewed Nov 12, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020

sadasu added 10 commits November 12, 2020 09:48

Set Baremetal CO to Disabled before Reconciler runs

a93a6e0

Use an OpenShift client to access the Infrastructure resource

16b5b0d

The controller-runtime client's cache is not initialized that early, so use the OpenShift client instead.

Set RelatedObjects for the CO even when Disabled

ac92114

Update Reason strings and tests for CO in Disabled state

1b7c189

Fix and improve error messages and comments

95ab938

Add finalizer for Provisioning CR

6e25848

Update generated rbac after rebase

b1fa5bf

Fixed rbac strings for resources that don't have a group value

3611c3a

Update rbac verbs for provisioning at Cluster scope

525f114

Add CBO as part of the release payload

8b1f7bd

sadasu force-pushed the CO-Disabled branch from c0b7d0b to 8b1f7bd Compare November 12, 2020 15:07

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 12, 2020

openshift-merge-robot merged commit a7f05f2 into openshift:master Nov 12, 2020

dhellmann mentioned this pull request Nov 12, 2020

move logic for managing disabled state back inside the reconciler #69

Merged

sadasu mentioned this pull request Nov 19, 2020

Bug 1906935: Delete resources when Provisioning CR is deleted #71

Merged

sadasu deleted the CO-Disabled branch January 7, 2021 13:44
