
Conversation

@sadasu (Contributor) commented Nov 30, 2020

No description provided.

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2020
namespacedName := types.NamespacedName{Name: BaremetalProvisioningCR}
baremetalConfig, err := r.readProvisioningCR(namespacedName)
if err != nil || baremetalConfig == nil {
err = r.updateCOStatus(ReasonComplete, "nothing to do in assisted install", "")
Member
I think we shouldn't mention assisted here. Conceivably, someone could install a compact 3-node baremetal cluster without provisioning services by deleting the CR.

So maybe something like "Provisioning CR not found on baremetal platform; marking operator as available"?
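For concreteness, a sketch of how the snippet above might read with that wording applied; readProvisioningCR, updateCOStatus, and ReasonComplete come from the diff context, while the rest is an assumption rather than the PR's actual change:

    namespacedName := types.NamespacedName{Name: BaremetalProvisioningCR}
    baremetalConfig, err := r.readProvisioningCR(namespacedName)
    if err != nil || baremetalConfig == nil {
        // No Provisioning CR to act on: report the operator as healthy.
        err = r.updateCOStatus(ReasonComplete,
            "Provisioning CR not found on baremetal platform; marking operator as available", "")
    }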

@sadasu (Contributor, Author)

I did not like that string. Thanks for the suggestion.

@sadasu sadasu force-pushed the handle-assisted-install branch from 789fc79 to 94011dc on November 30, 2020 19:42
@sadasu sadasu changed the title from "Handle assisted install case when Provisioning CR is absent" to "Handle case when Provisioning CR is absent on BareMetal Platform" Nov 30, 2020
@sadasu (Contributor, Author) commented Nov 30, 2020

/test e2e-metal-assisted

@dhellmann (Contributor)

This is going to conflict with #69. I don't mind rebasing, but we could also combine them into 1 PR if I rebase this on top of the other. Let me know what you think.

@sadasu (Contributor, Author) commented Nov 30, 2020

> This is going to conflict with #69. I don't mind rebasing, but we could also combine them into 1 PR if I rebase this on top of the other. Let me know what you think.

Yes, it will. Let's combine this into 1 PR as soon as we use this PR to iron out our CI issues. (I didn't want to add more noise to #69.)

@stbenjam (Member)

> This is going to conflict with #69. I don't mind rebasing, but we could also combine them into 1 PR if I rebase this on top of the other. Let me know what you think.

> Yes, it will. Let's combine this into 1 PR (#69) as soon as we use this PR to iron out our CI issues. (I didn't want to add more noise to #69.)

Is there any chance we could land this one as-is? I'd prefer to unblock assisted as quickly as possible as some other work is held up on it.

@dhellmann (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2020
@stbenjam (Member)

Failure looks related:

oc explain should contain proper spec+status for CRDs [Suite:openshift/conformance/parallel] (36s)

fail [github.com/openshift/origin/test/extended/cli/explain.go:442]: Unexpected error:
    <*errors.errorString | 0xc002860a50>: {
        s: "oc explain [\"provisionings\" \"--api-version=metal3.io/v1alpha1\"] result {KIND:     Provisioning\nVERSION:  metal3.io/v1alpha1\n\nDESCRIPTION:\n     <empty>} doesn't match pattern {(?s)DESCRIPTION:.*FIELDS:.*spec.*<.*>.*(status.*<.*>.*)?}",
    }

This could happen during the assisted installation scenario.
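(For reference, the failing assertion is easy to reproduce standalone; this sketch uses only the regexp pattern and the oc explain output quoted in the log above:)

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // The conformance test requires a non-empty DESCRIPTION and a FIELDS
        // section (spec, and optionally status) in the `oc explain` output.
        pattern := regexp.MustCompile(`(?s)DESCRIPTION:.*FIELDS:.*spec.*<.*>.*(status.*<.*>.*)?`)
        got := "KIND:     Provisioning\nVERSION:  metal3.io/v1alpha1\n\nDESCRIPTION:\n     <empty>"
        // Prints false: DESCRIPTION is "<empty>" and FIELDS is missing entirely.
        fmt.Println(pattern.MatchString(got))
    }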
@sadasu sadasu force-pushed the handle-assisted-install branch from 94011dc to 8798044 on November 30, 2020 22:52
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2020
@sadasu (Contributor, Author) commented Dec 1, 2020

/test e2e-metal-assisted

@sadasu (Contributor, Author) commented Dec 1, 2020

/test e2e-metal-assisted

@romfreiman

/test e2e-agnostic

@romfreiman

cc: @hexfusion @YuviGold. assisted-installer passed here.

@stbenjam (Member) commented Dec 1, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 1, 2020
@stbenjam (Member) commented Dec 1, 2020

/retitle Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform

@openshift-ci-robot openshift-ci-robot changed the title from "Handle case when Provisioning CR is absent on BareMetal Platform" to "Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform" Dec 1, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Dec 1, 2020
@openshift-ci-robot (Contributor)

@sadasu: This pull request references Bugzilla bug 1901301, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 1, 2020
@openshift-merge-robot openshift-merge-robot merged commit e85ba8b into openshift:master Dec 1, 2020
@openshift-ci-robot (Contributor)

@sadasu: All pull requests linked via external trackers have merged:

Bugzilla bug 1901301 has been moved to the MODIFIED state.


In response to this:

Bug 1901301: Handle case when Provisioning CR is absent on BareMetal Platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member) commented Dec 1, 2020

/bugzilla refresh

@openshift-ci-robot (Contributor)

@stbenjam: Bugzilla bug 1901301 is in an unrecognized state (MODIFIED) and will not be moved to the MODIFIED state.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sadasu sadasu deleted the handle-assisted-install branch January 7, 2021 13:44
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 22, 2023
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's
running before the leader lease has been acquired.  We don't want to
be writing to anything before we acquire the lease, because that could
cause contention with an actual lease-holder that is managing those
resources.  We can still perform read-only prechecks in
SetupWithManager. Instead, we wait patiently until we get the lease, and
update ClusterOperator (and anything else we manage) in the Reconcile()
function.
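
(A minimal sketch of that pattern with controller-runtime, not the operator's actual wiring; the reconciler, watched type, and lease names below are assumptions. The key behavior is that the manager starts controllers, and hence first calls Reconcile, only after winning the leader election:)

  package main

  import (
      "context"
      "os"

      corev1 "k8s.io/api/core/v1"
      ctrl "sigs.k8s.io/controller-runtime"
  )

  type ProvisioningReconciler struct{} // stand-in for the real reconciler

  func (r *ProvisioningReconciler) SetupWithManager(mgr ctrl.Manager) error {
      // Runs before mgr.Start(), i.e. before the lease: read-only prechecks only.
      return ctrl.NewControllerManagedBy(mgr).
          For(&corev1.ConfigMap{}). // placeholder; the real operator watches Provisioning
          Complete(r)
  }

  func (r *ProvisioningReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
      // Safe to write ClusterOperator (or anything else we manage) here:
      // Reconcile is only called once the manager holds the leader lease.
      return ctrl.Result{}, nil
  }

  func main() {
      mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
          LeaderElection:          true,
          LeaderElectionID:        "cluster-baremetal-operator", // assumed lease name
          LeaderElectionNamespace: "openshift-machine-api",
      })
      if err != nil {
          os.Exit(1)
      }
      if err := (&ProvisioningReconciler{}).SetupWithManager(mgr); err != nil {
          os.Exit(1)
      }
      // Blocks; controllers start only after the leader election is won.
      if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
          os.Exit(1)
      }
  }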

This partially rolls back 2e9d117 (Ensure baremetal CO is
completely setup before Reconcile, 2020-11-30, openshift#81), but that addition
predated ensureClusterOperator being added early in Reconcile in
4f2d314 (Make sure ensureClusterOperator() is called before its
status is updated, 2020-12-15, openshift#71):

  $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
  468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
  506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer
needed.

And this partially rolls back 8798044 (Handle case when
Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81).
That "we're enabled, but there isn't a Provisioning custom resource
yet" handling happens continually in Reconcile (where the write will
be protected by the operator holding the lease).

Among other improvements, this change will prevent a
nominally-successful install where the operator never acquired a lease
[1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
  $ head -n4 cbo.log
  I1222 01:05:34.274563       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
  I1222 01:05:34.318283       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
  I1222 01:05:34.403202       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
  I1222 01:05:34.430552       1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
  $ tail -n2 cbo.log
  E1222 02:36:57.323869       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
  E1222 02:37:00.248442       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
  2023-12-22T01:05:34Z Degraded=False :
  2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
  2023-12-22T01:05:34Z Upgradeable=True :
  2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my
controllers" doesn't seem very Available=True to me, and with this
commit, we won't touch the ClusterOperator and the install will time
out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-baremetal-operator that referenced this pull request Feb 7, 2024
… SetupWithManager
