
Conversation

@sadasu
Contributor

@sadasu sadasu commented Nov 19, 2020

Delete all Secrets and the metal3 Deployment object created by the CBO when the Provisioning CR is deleted.
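For readers skimming the change, a minimal sketch of that teardown using client-go is below; the namespace, Deployment name, and Secret names are illustrative placeholders, not necessarily the identifiers the PR itself uses.

// Sketch only: delete the CBO-managed metal3 Deployment and Secrets,
// treating NotFound as "already gone". Resource names are placeholders.
package provisioning

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func teardownMetal3(ctx context.Context, c kubernetes.Interface, ns string) error {
	// Remove the metal3 Deployment first.
	if err := c.AppsV1().Deployments(ns).Delete(ctx, "metal3", metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// Then remove the Secrets the operator created for it.
	for _, name := range []string{"metal3-mariadb-password", "metal3-ironic-password"} {
		if err := c.CoreV1().Secrets(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}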

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2020
@sadasu
Contributor Author

sadasu commented Nov 19, 2020

@asalkeld I tried to incorporate comments provided as part of #63.

I tried using the https://github.com/thoas/go-funk library based on comments #63 (comment) and #63 (comment).

I found funk.Contains() but not funk.Prune(), although the GoDoc does say that Prune() exists: https://godoc.org/github.com/thoas/go-funk#Prune.

@sadasu
Contributor Author

sadasu commented Dec 11, 2020

/retitle Bug 1906935: Delete resources when Provisioning CR is deleted

@openshift-ci-robot openshift-ci-robot changed the title Delete resources when Provisioning CR is deleted Bug 1906935: Delete resources when Provisioning CR is deleted Dec 11, 2020
@openshift-ci-robot
Contributor

@sadasu: This pull request references Bugzilla bug 1906935, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1906935: Delete resources when Provisioning CR is deleted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Dec 11, 2020
@sadasu
Contributor Author

sadasu commented Dec 14, 2020

/test e2e-metal-ipi

@sadasu
Contributor Author

sadasu commented Dec 14, 2020

/test e2e-metal-ipi-ovn-ipv6

@sadasu
Contributor Author

sadasu commented Dec 15, 2020

/test e2e-agnostic

@sadasu
Contributor Author

sadasu commented Dec 15, 2020

/test e2e-agnostic

Failure seemed related to the kube version. Now that we have moved to 1.20, I expect this CI to pass too.

Contributor

@kirankt kirankt left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kirankt, sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

Delete all Secrets and the metal3 Deployment object when the
Provisioning CR is deleted.
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2020
// Provisioning configuration is not valid.
// Requeue request.
r.Log.Error(err, "invalid contents in images Config Map")
co_err := r.updateCOStatus(ReasonInvalidConfiguration, err.Error(), "invalid contents in images Config Map")
Contributor

All these calls to updateCOStatus() can fail if the CO does not exist. Can you please move the ensureClusterOperator() call above the first possible call to updateCOStatus()?

Contributor Author

Updated. Please let me know if you are fine with the current amount of refactoring done within Reconcile() so that it doesn't get bloated.

if err == nil && deleted {
err = r.updateCOStatus(ReasonComplete, "all Metal3 resources deleted", "")
if err != nil {
return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
Contributor

Suggested change
return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", clusterOperatorName, err)

Contributor

I believe that when wrapping errors we should be using %w, as this allows the error to be unwrapped (please check the rest of the PR too).
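For reference, a small self-contained example of the difference (standard library behaviour, not code from this PR): %w keeps the wrapped error reachable by errors.Is/errors.As, while %v flattens it to text.

// Demonstrates why %w is preferred when callers may need to unwrap.
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

func main() {
	_, err := os.Open("/does/not/exist")

	wrapped := fmt.Errorf("unable to read config: %w", err)
	flattened := fmt.Errorf("unable to read config: %v", err)

	// The %w-wrapped error still matches the underlying sentinel...
	fmt.Println(errors.Is(wrapped, fs.ErrNotExist)) // true
	// ...while the %v-formatted one kept only the message text.
	fmt.Println(errors.Is(flattened, fs.ErrNotExist)) // false
}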

Contributor Author

Done.


// Delete Secrets and the Metal3 Deployment objects
func (r *ProvisioningReconciler) deleteMetal3Resources(info *provisioning.ProvisioningInfo) error {
failedSecrets := provisioning.DeleteAllSecrets(info)
Contributor

I think this should return an error.

Contributor Author

@sadasu sadasu Dec 16, 2020

Do you think deleteMetal3Resources() should return an error when DeleteAllSecrets() returns error(s), or just log it and proceed with deleting the other resources, as it does now?

}

func DeleteAllSecrets(info *ProvisioningInfo) []string {
deleteFailures := []string{}
Contributor

errs := []error{}

Contributor Author

Fixed.

deleteFailures = append(deleteFailures, ironicrpcSecretName)
}

return deleteFailures
Contributor

return errors.NewAggregate(errs)

Contributor Author

Didn't want to import a new errors package, so I achieved the same result differently.
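For comparison, here is a minimal sketch of the aggregate-error approach the reviewer suggested, using k8s.io/apimachinery's utilerrors.NewAggregate; the Secret names and client wiring are illustrative, not the operator's actual code.

// Sketch: collect per-Secret delete errors and return them as one error,
// rather than returning a []string of failed Secret names.
package provisioning

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
	"k8s.io/client-go/kubernetes"
)

func deleteSecrets(ctx context.Context, c kubernetes.Interface, ns string, names []string) error {
	var errs []error
	for _, name := range names {
		err := c.CoreV1().Secrets(ns).Delete(ctx, name, metav1.DeleteOptions{})
		if err != nil && !apierrors.IsNotFound(err) {
			errs = append(errs, err)
		}
	}
	// NewAggregate returns nil for an empty slice, so callers only check
	// a single error value.
	return utilerrors.NewAggregate(errs)
}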

if err != nil {
return errors.Wrap(err, "failed to delete metal3 service")
}
err = provisioning.DeleteImageCache(info)
Contributor

What do you think about this? We have a lot of repeated functions that just check for NotFound:

err = provisioning.IgnoreNotFound(info.Client.AppsV1().DaemonSets(info.Namespace).Delete(context.Background(), imageCacheService, metav1.DeleteOptions{}))
func IgnoreNotFound(err error) error {
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}

Contributor Author

Not really going to help the Reconcile method a whole lot. We currently call apierrors.IsNotFound() twice, and the number of return values is different.
In other places in the provisioning package, we use the return value of apierrors.IsNotFound() to decide if we need to create something and not return nil.
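To illustrate that asymmetry with a standalone sketch (names are illustrative, not the operator's code): in the delete path a NotFound error can simply be swallowed, while in the ensure path the same NotFound result is what triggers creation, so a blanket IgnoreNotFound helper only fits the first case.

// Sketch of the two different ways apierrors.IsNotFound() gets used.
package provisioning

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Delete path: NotFound just means there is nothing left to delete.
func deleteDeployment(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	err := c.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}

// Ensure path: NotFound is the signal to create the object, not an error to ignore.
func ensureDeployment(ctx context.Context, c kubernetes.Interface, ns string, want *appsv1.Deployment) error {
	_, err := c.AppsV1().Deployments(ns).Get(ctx, want.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = c.AppsV1().Deployments(ns).Create(ctx, want, metav1.CreateOptions{})
	}
	return err
}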

Contributor Author

@sadasu sadasu left a comment

I am assuming you are referring to the 2nd call to ensureClusterOperator(). And, yes, this time it will add the baremetal config in the OwnerReference for the CO.
The 1st time ensureClusterOperator() is called, we pass in nil because the baremetalConfig is not yet read and verified to be correct at that time. And the only time we update the CO before reading/verifying baremetalConfig is when the platform != BareMetal, so passing in nil and not setting the OwnerReference should be OK.

Includes some other generic improvements to reporting wrapped errors.
@sadasu
Contributor Author

sadasu commented Dec 16, 2020

/test e2e-agnostic

Errors unrelated to CBO.

@asalkeld
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 17, 2020
@openshift-merge-robot openshift-merge-robot merged commit 084bb70 into openshift:master Dec 17, 2020
@openshift-ci-robot
Contributor

@sadasu: All pull requests linked via external trackers have merged:

Bugzilla bug 1906935 has been moved to the MODIFIED state.


In response to this:

Bug 1906935: Delete resources when Provisioning CR is deleted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sadasu sadasu deleted the delete-metal3 branch January 7, 2021 13:44
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 22, 2023
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's
running before the leader lease has been acquired.  We don't want to
be writing to anything before we acquire the lease, because that could
cause contention with an actual lease-holder that is managing those
resources.  We can still perform read-only prechecks in
SetupWithManager.  And then we wait patiently until we get the lease,
and update ClusterOperator (and anything else we manage) in the
Reconcile() function.

This partially rolls back 2e9d117 (Ensure baremetal CO is
completely setup before Reconcile, 2020-11-30, openshift#81), but that addition
predated ensureClusterOperator being added early in Reconcile in
4f2d314 (Make sure ensureClusterOperator() is called before its
status is updated, 2020-12-15, openshift#71):

  $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
  468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
  506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer
needed.

And this partially rolls back 8798044 (Handle case when
Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81).
That "we're enabled, but there isn't a Provisioning custom resource
yet" handling happens continually in Reconcile (where the write will
be protected by the operator holding the lease).

Among other improvements, this change will prevent a
nominally-successful install where the operator never acquired a lease
[1]:

 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
  $ head -n4 cbo.log
  I1222 01:05:34.274563       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
  I1222 01:05:34.318283       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
  I1222 01:05:34.403202       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
  I1222 01:05:34.430552       1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
  $ tail -n2 cbo.log
  E1222 02:36:57.323869       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
  E1222 02:37:00.248442       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
  2023-12-22T01:05:34Z Degraded=False :
  2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
  2023-12-22T01:05:34Z Upgradeable=True :
  2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my
controllers" doesn't seem very Available=True to me, and with this
commit, we won't touch the ClusterOperator and the install will time
out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
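The lease-then-write ordering described above is what controller-runtime's leader election provides: SetupWithManager runs before mgr.Start(), but the controllers (and any writes from Reconcile) only run once the lease is held. A minimal sketch of that wiring, with placeholder option values rather than the operator's real configuration:

// Sketch: enable leader election on the manager so reconcilers, and any
// writes they perform, only run after the lease is acquired. Values are
// placeholders, not cluster-baremetal-operator's actual configuration.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "cluster-baremetal-operator",
		LeaderElectionNamespace: "openshift-machine-api",
	})
	if err != nil {
		os.Exit(1)
	}

	// SetupWithManager would be called here, before Start(), so it should
	// stay read-only: the leader lease has not been acquired yet.

	// Start() blocks; leader-election-gated runnables (the controllers)
	// only begin after the lease is acquired.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}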
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 22, 2023
… SetupWithManager

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-baremetal-operator that referenced this pull request Feb 7, 2024
… SetupWithManager
