Bug 1906935: Delete resources when Provisioning CR is deleted #71
Conversation
@asalkeld I tried to incorporate the comments provided as part of #63. I tried using the https://github.com/thoas/go-funk library based on comments #63 (comment) and #63 (comment). I found funk.Contains() but not funk.Prune(), although the GoDoc does say that Prune() exists: https://godoc.org/github.com/thoas/go-funk#Prune.
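For reference, a minimal, self-contained sketch of the go-funk helper that was found (funk.Contains); the secret names below are made up for illustration and are not taken from this PR:

package main

import (
	"fmt"

	funk "github.com/thoas/go-funk"
)

func main() {
	// Illustrative slice of secret names.
	secrets := []string{"metal3-ironic-password", "metal3-ironic-rpc-password"}

	// funk.Contains works generically on slices, maps and strings.
	fmt.Println(funk.Contains(secrets, "metal3-ironic-password"))  // true
	fmt.Println(funk.Contains(secrets, "metal3-mariadb-password")) // false
}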
0c51016 to 6467f40 (Compare)
/retitle Bug 1906935: Delete resources when Provisioning CR is deleted
@sadasu: This pull request references Bugzilla bug 1906935, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
6467f40 to 6366bdf (Compare)
/test e2e-agnostic
/test e2e-agnostic
Failure seemed related to the kube version. Now that we have moved to 1.20, I expect this CI to pass too.
kirankt
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kirankt, sadasu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Please review the full test history for this PR and help us cut down flakes.
Delete all Secrets and the metal3 Deployment object when the Provisioning CR is deleted.
/retest
Please review the full test history for this PR and help us cut down flakes.
6366bdf to 87f3a47 (Compare)
// Provisioning configuration is not valid.
// Requeue request.
r.Log.Error(err, "invalid contents in images Config Map")
co_err := r.updateCOStatus(ReasonInvalidConfiguration, err.Error(), "invalid contents in images Config Map")
All these calls to updateCOStatus() can fail if the CO does not exist. Can you please move ensureClusterOperator() above the first possible call to updateCOStatus()?
Updated. Please let me know if you are fine with the current amount of refactoring done within Reconcile() so that it doesn't get bloated.
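For context, a rough sketch of the ordering the review asks for; the helper names, signatures and exact error handling here are approximations rather than the code that merged:

// Ensure the "baremetal" ClusterOperator exists before any path that might
// call updateCOStatus(). (ensureClusterOperator's real signature may differ.)
if err := r.ensureClusterOperator(nil); err != nil {
	return ctrl.Result{}, err
}

// Only after that do we read and validate the Provisioning CR / images
// ConfigMap, so any failure can be reported safely on the ClusterOperator.
if err := r.readAndValidateConfig(); err != nil { // hypothetical helper
	if coErr := r.updateCOStatus(ReasonInvalidConfiguration, err.Error(), "invalid contents in images Config Map"); coErr != nil {
		return ctrl.Result{}, coErr
	}
	return ctrl.Result{}, err
}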
if err == nil && deleted {
	err = r.updateCOStatus(ReasonComplete, "all Metal3 resources deleted", "")
	if err != nil {
		return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
Suggested change:
- return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
+ return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", clusterOperatorName, err)
I believe that when wrapping errors we should be using %w, as this allows the error to be unwrapped (please check in the rest of the PR too).
Done.
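As a small, self-contained illustration of the point above (not code from this PR): wrapping with %w keeps the error chain intact so errors.Is/errors.As can see the original error, while %v only embeds its text:

package main

import (
	"errors"
	"fmt"
)

// errNotFound is a made-up sentinel error for illustration.
var errNotFound = errors.New("not found")

func main() {
	wrapped := fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", "baremetal", errNotFound)
	flattened := fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", "baremetal", errNotFound)

	fmt.Println(errors.Is(wrapped, errNotFound))   // true: the chain is preserved
	fmt.Println(errors.Is(flattened, errNotFound)) // false: only the text was kept
}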
//Delete Secrets and the Metal3 Deployment objects
func (r *ProvisioningReconciler) deleteMetal3Resources(info *provisioning.ProvisioningInfo) error {
	failedSecrets := provisioning.DeleteAllSecrets(info)
I think this should return an error.
Do you think deleteMetal3Resources() should return an error when DeleteAllSecrets() returns error(s), or just log it and proceed with deleting the other resources as it does now?
provisioning/baremetal_secrets.go
Outdated
}

func DeleteAllSecrets(info *ProvisioningInfo) []string {
	deleteFailures := []string{}
errs := []error
Fixed.
provisioning/baremetal_secrets.go
Outdated
		deleteFailures = append(deleteFailures, ironicrpcSecretName)
	}

	return deleteFailures
return errors.NewAggregate(errs)
I didn't want to import a new errors package, so I achieved the same result differently.
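One way to get that effect without pulling in an aggregate-errors package is to collect the failed names and build a single error at the end. This is a sketch of that approach, not necessarily the code that merged; the secret names are placeholders, and imports such as context, fmt, strings, apierrors and metav1 are assumed from the surrounding file:

func DeleteAllSecrets(info *ProvisioningInfo) error {
	deleteFailures := []string{}
	// Placeholder secret names standing in for whatever the operator created.
	for _, name := range []string{ironicSecretName, ironicrpcSecretName} {
		err := info.Client.CoreV1().Secrets(info.Namespace).Delete(context.Background(), name, metav1.DeleteOptions{})
		if err != nil && !apierrors.IsNotFound(err) {
			deleteFailures = append(deleteFailures, name)
		}
	}
	if len(deleteFailures) > 0 {
		return fmt.Errorf("failed to delete secrets: %s", strings.Join(deleteFailures, ", "))
	}
	return nil
}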
	if err != nil {
		return errors.Wrap(err, "failed to delete metal3 service")
	}
	err = provisioning.DeleteImageCache(info)
What do you think about this? We have a lot of repeated functions that just check for notFound:

err = provisioning.IgnoreNotFound(info.Client.AppsV1().DaemonSets(info.Namespace).Delete(context.Background(), imageCacheService, metav1.DeleteOptions{}))

func IgnoreNotFound(err error) error {
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
Not really going to help the Reconcile method a whole lot. We currently call apierrors.IsNotFound() twice, and the number of return values is different.
In other places in the provisioning package, we use the return value of apierrors.IsNotFound() to decide whether we need to create something, not to return nil.
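For illustration, the kind of pattern that reply refers to, where the IsNotFound result drives a create rather than being swallowed; the helper and resource names here are hypothetical:

_, err := info.Client.CoreV1().Secrets(info.Namespace).Get(context.Background(), ironicSecretName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	// The secret is missing: create it rather than treating NotFound as benign.
	return createIronicSecret(info) // hypothetical helper
}
if err != nil {
	return err
}
// The secret already exists; nothing to do.
return nil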
sadasu
left a comment
I am assuming you are referring to the 2nd call to ensureClusterOperator(). And yes, this time it will add the baremetal config to the OwnerReference for the CO.
The 1st time ensureClusterOperator() is called, we pass in nil because the baremetalConfig has not yet been read and verified to be correct at that point. And the only time we update the CO before reading/verifying baremetalConfig is when the platform != BareMetal, so passing in nil and not setting the OwnerReference should be OK.
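A rough sketch of what the two calls could translate to; the package alias, GroupVersion symbol and helper name below are assumptions for illustration, not copied from the repo:

func ownerReferences(baremetalConfig *metal3iov1alpha1.Provisioning) []metav1.OwnerReference {
	if baremetalConfig == nil {
		// 1st call: the Provisioning CR has not been read/validated yet,
		// so the ClusterOperator gets no OwnerReference.
		return nil
	}
	// 2nd call: point the ClusterOperator at the Provisioning CR.
	return []metav1.OwnerReference{{
		APIVersion: metal3iov1alpha1.GroupVersion.String(),
		Kind:       "Provisioning",
		Name:       baremetalConfig.Name,
		UID:        baremetalConfig.UID,
	}}
}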
Includes some other generic improvements to reporting wrapped errors.
/test e2e-agnostic
Errors unrelated to CBO.
/lgtm
@sadasu: All pull requests linked via external trackers have merged: Bugzilla bug 1906935 has been moved to the MODIFIED state.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's running before the leader lease has been acquired. We don't want to be writing to anything before we acquire the lease, because that could cause contention with an actual lease-holder that is managing those resources. We can still perform read-only prechecks in SetupWithManager, and then wait patiently until we have the lease and update ClusterOperator (and anything else we manage) in the Reconcile() function.

This partially rolls back 2e9d117 (Ensure baremetal CO is completely setup before Reconcile, 2020-11-30, openshift#81), but that addition predated ensureClusterOperator being added early in Reconcile in 4f2d314 (Make sure ensureClusterOperator() is called before its status is updated, 2020-12-15, openshift#71):

    $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
    468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
    506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer needed.

And this partially rolls back 8798044 (Handle case when Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81). That "we're enabled, but there isn't a Provisioning custom resource yet" handling happens continually in Reconcile (where the write will be protected by the operator holding the lease).

Among other improvements, this change will prevent a nominally-successful install where the operator never acquired a lease [1]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
    $ head -n4 cbo.log
    I1222 01:05:34.274563 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
    I1222 01:05:34.318283 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
    I1222 01:05:34.403202 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
    I1222 01:05:34.430552 1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
    $ tail -n2 cbo.log
    E1222 02:36:57.323869 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
    E1222 02:37:00.248442 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
    2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
    2023-12-22T01:05:34Z Degraded=False :
    2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
    2023-12-22T01:05:34Z Upgradeable=True :
    2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my controllers" doesn't seem very Available=True to me, and with this commit, we won't touch the ClusterOperator and the install will time out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
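For readers less familiar with controller-runtime, a minimal sketch of the lifecycle that commit message describes; the lease ID and namespace below are assumptions for illustration, not copied from the operator's main.go:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "cluster-baremetal-operator", // assumed lease name
		LeaderElectionNamespace: "openshift-machine-api",      // assumed namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// SetupWithManager(mgr) would be called here, before mgr.Start(), i.e.
	// before the lease is acquired, so it should stay read-only.

	// mgr.Start blocks; controllers (and their Reconcile writes) only run
	// once this manager holds the leader lease.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}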
Delete all Secrets and the metal3 Deployment object created by the CBO when the Provisioning CR is deleted.
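As a closing illustration of the behaviour this PR describes, a hedged sketch of the Reconcile path when the Provisioning CR disappears; the helper names (deleteMetal3Resources, updateCOStatus) echo the snippets above, while the Get call, the info value and the exact status handling are approximations rather than the merged code:

baremetalConfig := &metal3iov1alpha1.Provisioning{}
err := r.Get(ctx, req.NamespacedName, baremetalConfig)
if apierrors.IsNotFound(err) {
	// The Provisioning CR was deleted: remove the Secrets and the metal3
	// Deployment that the operator created for it. `info` is assumed to
	// have been built earlier in Reconcile.
	if err := r.deleteMetal3Resources(info); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, r.updateCOStatus(ReasonComplete, "all Metal3 resources deleted", "")
}
if err != nil {
	return ctrl.Result{}, err
}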