Bug 1906935: Delete resources when Provisioning CR is deleted #71
Conversation
@asalkeld I tried to incorporate the comments provided as part of #63. I tried using the https://github.com/thoas/go-funk library based on comments #63 (comment) and #63 (comment). I found funk.Contains() but not funk.Prune(), although the GoDoc does say that Prune() exists: https://godoc.org/github.com/thoas/go-funk#Prune.
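For reference, a minimal, self-contained sketch of the go-funk helper that was found (funk.Contains); the secret names below are made up for illustration and are not taken from this PR:

package main

import (
	"fmt"

	funk "github.com/thoas/go-funk"
)

func main() {
	// Illustrative slice of secret names.
	secrets := []string{"metal3-ironic-password", "metal3-ironic-rpc-password"}

	// funk.Contains works generically on slices, maps and strings.
	fmt.Println(funk.Contains(secrets, "metal3-ironic-password"))  // true
	fmt.Println(funk.Contains(secrets, "metal3-mariadb-password")) // false
}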
0c51016 to 6467f40 (Compare)
/retitle Bug 1906935: Delete resources when Provisioning CR is deleted
@sadasu: This pull request references Bugzilla bug 1906935, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-ipv6
6467f40 to 6366bdf (Compare)
/test e2e-agnostic
/test e2e-agnostic
Failure seemed related to the kube version. Now that we have moved to 1.20, I expect this CI to pass too.
kirankt
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kirankt, sadasu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Please review the full test history for this PR and help us cut down flakes.
Delete all Secrets and the metal3 Deployment object when the Provisioning CR is deleted.
/retest
Please review the full test history for this PR and help us cut down flakes.
6366bdf to 87f3a47 (Compare)
// Provisioning configuration is not valid.
// Requeue request.
r.Log.Error(err, "invalid contents in images Config Map")
co_err := r.updateCOStatus(ReasonInvalidConfiguration, err.Error(), "invalid contents in images Config Map")
All these calls to updateCOStatus() can fail if the CO does not exist. Can you please move ensureClusterOperator() above the first possible call to updateCOStatus()?
Updated. Please let me know if you are fine with the current amount of refactoring done within Reconcile() so that it doesn't get bloated.
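For context, a rough sketch of the ordering the review asks for; the helper names, signatures and exact error handling here are approximations rather than the code that merged:

// Ensure the "baremetal" ClusterOperator exists before any path that might
// call updateCOStatus(). (ensureClusterOperator's real signature may differ.)
if err := r.ensureClusterOperator(nil); err != nil {
	return ctrl.Result{}, err
}

// Only after that do we read and validate the Provisioning CR / images
// ConfigMap, so any failure can be reported safely on the ClusterOperator.
if err := r.readAndValidateConfig(); err != nil { // hypothetical helper
	if coErr := r.updateCOStatus(ReasonInvalidConfiguration, err.Error(), "invalid contents in images Config Map"); coErr != nil {
		return ctrl.Result{}, coErr
	}
	return ctrl.Result{}, err
}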
if err == nil && deleted {
	err = r.updateCOStatus(ReasonComplete, "all Metal3 resources deleted", "")
	if err != nil {
		return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
Suggested change:
- return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", clusterOperatorName, err)
+ return ctrl.Result{}, fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", clusterOperatorName, err)
I believe that when wrapping errors we should be using %w, as this allows the error to be unwrapped (please check in the rest of the PR too).
Done.
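As a small, self-contained illustration of the point above (not code from this PR): wrapping with %w keeps the error chain intact so errors.Is/errors.As can see the original error, while %v only embeds its text:

package main

import (
	"errors"
	"fmt"
)

// errNotFound is a made-up sentinel error for illustration.
var errNotFound = errors.New("not found")

func main() {
	wrapped := fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", "baremetal", errNotFound)
	flattened := fmt.Errorf("unable to put %q ClusterOperator in Available state: %v", "baremetal", errNotFound)

	fmt.Println(errors.Is(wrapped, errNotFound))   // true: the chain is preserved
	fmt.Println(errors.Is(flattened, errNotFound)) // false: only the text was kept
}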
//Delete Secrets and the Metal3 Deployment objects
func (r *ProvisioningReconciler) deleteMetal3Resources(info *provisioning.ProvisioningInfo) error {
	failedSecrets := provisioning.DeleteAllSecrets(info)
I think this should return an error.
Do you think deleteMetal3Resources() should return an error when DeleteAllSecrets() returns error(s), or just log it and proceed with deleting the other resources as it does now?
provisioning/baremetal_secrets.go
Outdated
}

func DeleteAllSecrets(info *ProvisioningInfo) []string {
	deleteFailures := []string{}
errs := []error
Fixed.
provisioning/baremetal_secrets.go
Outdated
		deleteFailures = append(deleteFailures, ironicrpcSecretName)
	}

	return deleteFailures
return errors.NewAggregate(errs)
I didn't want to import a new errors package, so I achieved the same result differently.
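One way to get that effect without pulling in an aggregate-errors package is to collect the failed names and build a single error at the end. This is a sketch of that approach, not necessarily the code that merged; the secret names are placeholders, and imports such as context, fmt, strings, apierrors and metav1 are assumed from the surrounding file:

func DeleteAllSecrets(info *ProvisioningInfo) error {
	deleteFailures := []string{}
	// Placeholder secret names standing in for whatever the operator created.
	for _, name := range []string{ironicSecretName, ironicrpcSecretName} {
		err := info.Client.CoreV1().Secrets(info.Namespace).Delete(context.Background(), name, metav1.DeleteOptions{})
		if err != nil && !apierrors.IsNotFound(err) {
			deleteFailures = append(deleteFailures, name)
		}
	}
	if len(deleteFailures) > 0 {
		return fmt.Errorf("failed to delete secrets: %s", strings.Join(deleteFailures, ", "))
	}
	return nil
}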
	if err != nil {
		return errors.Wrap(err, "failed to delete metal3 service")
	}
	err = provisioning.DeleteImageCache(info)
What do you think about this? We have a lot of repeated functions that just check for notFound:

err = provisioning.IgnoreNotFound(info.Client.AppsV1().DaemonSets(info.Namespace).Delete(context.Background(), imageCacheService, metav1.DeleteOptions{}))

func IgnoreNotFound(err error) error {
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
Not really going to help the Reconcile method a whole lot. We currently call apierrors.IsNotFound() twice, and the number of return values is different.
In other places in the provisioning package, we use the return value of apierrors.IsNotFound() to decide whether we need to create something, not to return nil.
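For illustration, the kind of pattern that reply refers to, where the IsNotFound result drives a create rather than being swallowed; the helper and resource names here are hypothetical:

_, err := info.Client.CoreV1().Secrets(info.Namespace).Get(context.Background(), ironicSecretName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	// The secret is missing: create it rather than treating NotFound as benign.
	return createIronicSecret(info) // hypothetical helper
}
if err != nil {
	return err
}
// The secret already exists; nothing to do.
return nil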
sadasu
left a comment
I am assuming you are referring to the 2nd call to ensureClusterOperator(). And yes, this time it will add the baremetal config to the OwnerReference for the CO.
The 1st time ensureClusterOperator() is called, we pass in nil because the baremetalConfig has not yet been read and verified to be correct at that point. And the only time we update the CO before reading/verifying baremetalConfig is when the platform != BareMetal, so passing in nil and not setting the OwnerReference should be OK.
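A rough sketch of what the two calls could translate to; the package alias, GroupVersion symbol and helper name below are assumptions for illustration, not copied from the repo:

func ownerReferences(baremetalConfig *metal3iov1alpha1.Provisioning) []metav1.OwnerReference {
	if baremetalConfig == nil {
		// 1st call: the Provisioning CR has not been read/validated yet,
		// so the ClusterOperator gets no OwnerReference.
		return nil
	}
	// 2nd call: point the ClusterOperator at the Provisioning CR.
	return []metav1.OwnerReference{{
		APIVersion: metal3iov1alpha1.GroupVersion.String(),
		Kind:       "Provisioning",
		Name:       baremetalConfig.Name,
		UID:        baremetalConfig.UID,
	}}
}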
Includes some other generic improvements to reporting wrapped errors.
/test e2e-agnostic
Errors unrelated to CBO.
/lgtm
@sadasu: All pull requests linked via external trackers have merged: Bugzilla bug 1906935 has been moved to the MODIFIED state.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's running before the leader lease has been acquired. We don't want to be writing to anything before we acquire the lease, because that could cause contention with an actual lease-holder that is managing those resources. We can still perform read-only prechecks in SetupWithManager, and then wait patiently until we have the lease and update ClusterOperator (and anything else we manage) in the Reconcile() function.

This partially rolls back 2e9d117 (Ensure baremetal CO is completely setup before Reconcile, 2020-11-30, openshift#81), but that addition predated ensureClusterOperator being added early in Reconcile in 4f2d314 (Make sure ensureClusterOperator() is called before its status is updated, 2020-12-15, openshift#71):

    $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
    468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
    506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer needed.

And this partially rolls back 8798044 (Handle case when Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81). That "we're enabled, but there isn't a Provisioning custom resource yet" handling happens continually in Reconcile (where the write will be protected by the operator holding the lease).

Among other improvements, this change will prevent a nominally-successful install where the operator never acquired a lease [1]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
    $ head -n4 cbo.log
    I1222 01:05:34.274563 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
    I1222 01:05:34.318283 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
    I1222 01:05:34.403202 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
    I1222 01:05:34.430552 1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
    $ tail -n2 cbo.log
    E1222 02:36:57.323869 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
    E1222 02:37:00.248442 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
    2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
    2023-12-22T01:05:34Z Degraded=False :
    2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
    2023-12-22T01:05:34Z Upgradeable=True :
    2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my controllers" doesn't seem very Available=True to me, and with this commit, we won't touch the ClusterOperator and the install will time out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
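For readers less familiar with controller-runtime, a minimal sketch of the lifecycle that commit message describes; the lease ID and namespace below are assumptions for illustration, not copied from the operator's main.go:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "cluster-baremetal-operator", // assumed lease name
		LeaderElectionNamespace: "openshift-machine-api",      // assumed namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// SetupWithManager(mgr) would be called here, before mgr.Start(), i.e.
	// before the lease is acquired, so it should stay read-only.

	// mgr.Start blocks; controllers (and their Reconcile writes) only run
	// once this manager holds the leader lease.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}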
Delete all Secrets and the metal3 Deployment object created by the CBO when the Provisioning CR is deleted.
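As a closing illustration of the behaviour this PR describes, a hedged sketch of the Reconcile path when the Provisioning CR disappears; the helper names (deleteMetal3Resources, updateCOStatus) echo the snippets above, while the Get call, the info value and the exact status handling are approximations rather than the merged code:

baremetalConfig := &metal3iov1alpha1.Provisioning{}
err := r.Get(ctx, req.NamespacedName, baremetalConfig)
if apierrors.IsNotFound(err) {
	// The Provisioning CR was deleted: remove the Secrets and the metal3
	// Deployment that the operator created for it. `info` is assumed to
	// have been built earlier in Reconcile.
	if err := r.deleteMetal3Resources(info); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, r.updateCOStatus(ReasonComplete, "all Metal3 resources deleted", "")
}
if err != nil {
	return ctrl.Result{}, err
}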