Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform #81
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sadasu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
namespacedName := types.NamespacedName{Name: BaremetalProvisioningCR}
baremetalConfig, err := r.readProvisioningCR(namespacedName)
if err != nil || baremetalConfig == nil {
	err = r.updateCOStatus(ReasonComplete, "nothing to do in assisted install", "")
I think we want to not mention assisted. Conceivably, someone could install a compact 3-node baremetal cluster without provisioning services by deleting the CR.
So maybe something like "Provisioning CR not found on baremetal platform; marking operator as available"?
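The branch under discussion could then look something like the following minimal, runnable sketch. The stub types and the `handleProvisioning` helper are hypothetical stand-ins (the real operator wires `readProvisioningCR` and `updateCOStatus` through controller-runtime and the Provisioning CRD); only the control flow and the suggested status message follow the diff and the review comment above.

```go
package main

import "fmt"

// Provisioning is a stub for the operator's Provisioning CRD type.
type Provisioning struct{ Name string }

// reconciler simulates the operator; store stands in for the API server,
// where a missing key means the Provisioning CR is absent.
type reconciler struct {
	store map[string]*Provisioning
}

// readProvisioningCR returns nil, nil when the CR is absent: absence is a
// valid state on the baremetal platform, not an error.
func (r *reconciler) readProvisioningCR(name string) (*Provisioning, error) {
	return r.store[name], nil
}

// updateCOStatus is a stub for the ClusterOperator status writer.
func (r *reconciler) updateCOStatus(reason, message string) string {
	return fmt.Sprintf("reason=%s message=%q", reason, message)
}

// handleProvisioning mirrors the diff: if the CR is absent (or unreadable),
// mark the operator available instead of erroring out, using the
// platform-neutral message suggested in review.
func (r *reconciler) handleProvisioning(name string) string {
	cr, err := r.readProvisioningCR(name)
	if err != nil || cr == nil {
		return r.updateCOStatus("Complete",
			"Provisioning CR not found on BareMetal Platform; marking operator as available")
	}
	return r.updateCOStatus("Complete", "provisioning configured")
}

func main() {
	r := &reconciler{store: map[string]*Provisioning{}}
	fmt.Println(r.handleProvisioning("provisioning-configuration"))
}
```

Treating "no CR" as a terminal Available state (rather than an error) is what lets a cluster that deliberately runs without provisioning services report a healthy baremetal operator.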
I did not like that string. Thanks for the suggestion.
789fc79 to 94011dc (force-push)
/test e2e-metal-assisted
This is going to conflict with #69. I don't mind rebasing, but we could also combine them into one PR if I rebase this on top of the other. Let me know what you think.
Yes, it will. Let's combine this into one PR as soon as we use this PR to iron out our CI issues. (I didn't want to add more noise to #69.)
Is there any chance we could land this one as-is? I'd prefer to unblock assisted as quickly as possible, since some other work is held up on it.
/lgtm
Failure looks related:
This could happen during the assisted installation scenario.
94011dc to 8798044 (force-push)
/test e2e-metal-assisted
/test e2e-metal-assisted
/test e2e-agnostic
cc: @hexfusion @YuviGold. assisted-installer passed here.
/lgtm
/retitle Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform
@sadasu: This pull request references Bugzilla bug 1901301, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@sadasu: All pull requests linked via external trackers have merged: Bugzilla bug 1901301 has been moved to the MODIFIED state.
/bugzilla refresh
@stbenjam: Bugzilla bug 1901301 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state.
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's running before the leader lease has been acquired. We don't want to be writing to anything before we acquire the lease, because that could cause contention with an actual lease-holder that is managing those resources. We can still perform read-only prechecks in SetupWithManager. Then we patiently wait until we have the lease in the Reconcile() function, and update ClusterOperator (and anything else we manage) there.

This partially rolls back 2e9d117 (Ensure baremetal CO is completely setup before Reconcile, 2020-11-30, openshift#81), but that addition predated ensureClusterOperator being added early in Reconcile in 4f2d314 (Make sure ensureClusterOperator() is called before its status is updated, 2020-12-15, openshift#71):

    $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
    468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
    506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer needed.

And this partially rolls back 8798044 (Handle case when Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81). That "we're enabled, but there isn't a Provisioning custom resource yet" handling happens continually in Reconcile (where the write will be protected by the operator holding the lease).

Among other improvements, this change will prevent a nominally-successful install where the operator never acquired a lease [1]:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
    $ head -n4 cbo.log
    I1222 01:05:34.274563 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
    I1222 01:05:34.318283 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
    I1222 01:05:34.403202 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
    I1222 01:05:34.430552 1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
    $ tail -n2 cbo.log
    E1222 02:36:57.323869 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
    E1222 02:37:00.248442 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
    2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
    2023-12-22T01:05:34Z Degraded=False :
    2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
    2023-12-22T01:05:34Z Upgradeable=True :
    2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my controllers" doesn't seem very Available=True to me, and with this commit, we won't touch the ClusterOperator and the install will time out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352