
Conversation

@andfasano (Contributor)

This PR adds the manifests required by the CVO to deploy CBO, as specified in the related CVO documentation.
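
For orientation, the manifests in question (the commit title is "Add sa, deployment and co manifests") are a ServiceAccount, the operator Deployment, and a ClusterOperator resource (the last is sketched further down in the review thread). A compressed sketch of the first two, with names taken from elsewhere in this thread and almost all fields trimmed:

  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: cluster-baremetal-operator
    namespace: openshift-machine-api
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cluster-baremetal-operator
    namespace: openshift-machine-api
  # pod template omitted; it includes the RELEASE_VERSION env var discussed below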

@openshift-ci-robot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Sep 29, 2020
@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Sep 29, 2020
@andfasano changed the title from "[WIP] Allow CVO to deploy CBO" to "Allow CVO to deploy CBO" Sep 30, 2020
@openshift-ci-robot removed the do-not-merge/work-in-progress label Sep 30, 2020
@sadasu (Contributor) left a comment


I have a generic question inline. Other than that, this looks good.

@openshift-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Oct 3, 2020
@openshift-ci-robot removed the needs-rebase label Oct 6, 2020
@@ -0,0 +1,8 @@
apiVersion: config.openshift.io/v1
Contributor

This should probably come before the deployment, since the deployment uses it.

Contributor

I'm not sure whether it matters. Shouldn't CBO create the CO deployment if it's not found anyway? FYI, MAO uses the same runlevel.

Contributor

Based on the note in this section of the docs, I think the ClusterOperator resource has to come after the Deployment. The CVO doesn't use the manifest to actually create the resource; it only uses it to know which CO resources it should wait for before continuing with the next step of the install/upgrade. So, I think this is correct.
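
For reference, such a ClusterOperator manifest is tiny, and the apply order is driven by the manifest filenames (lexical order within a runlevel). A minimal sketch, with the resource name (baremetal) taken from the audit logs quoted later in this thread:

  apiVersion: config.openshift.io/v1
  kind: ClusterOperator
  metadata:
    name: baremetal
  spec: {}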

- name: cluster-baremetal-operator
env:
- name: RELEASE_VERSION
value: "0.0.1-snapshot"
Contributor

Does something in the build process change this value?

Contributor

Not 100% sure. But this value is used in the CO to signal upgrades.

Contributor

This section of the docs includes an example with a comment that the value "is substituted in the manifests when the payload is built", so I guess something about building the release image modifies the value.
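
Putting those two comments together, the usual flow would be: the release tooling rewrites the placeholder at payload-build time, the operator reads RELEASE_VERSION at runtime, and it reports that version back through the ClusterOperator status, which is what the CVO checks during an upgrade. Purely illustrative (the real string depends on the release being built):

  status:
    versions:
    - name: operator
      version: "4.6.0"   # from RELEASE_VERSION; stays "0.0.1-snapshot" until a payload is built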

@dhellmann (Contributor)

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, dhellmann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [andfasano,dhellmann]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kirankt (Contributor) commented Oct 6, 2020

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Oct 6, 2020
@openshift-merge-robot merged commit c2d3bc8 into openshift:master Oct 6, 2020
@andfasano deleted the cvo-deploy branch October 7, 2020 05:43
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 21, 2023
…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.
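
Under the hood, with current controller-runtime defaults the leader lock is a
coordination.k8s.io/v1 Lease that the manager acquires and periodically renews;
roughly the object below, with the name/namespace matching the IDs discussed
further down and every value illustrative:

  apiVersion: coordination.k8s.io/v1
  kind: Lease
  metadata:
    name: cluster-baremetal-operator      # LeaderElectionID
    namespace: openshift-machine-api      # LeaderElectionNamespace
  spec:
    holderIdentity: cluster-baremetal-operator-574577fbcb-z8nd4_<random-suffix>
    leaseDurationSeconds: 15
    renewTime: "2023-12-20T06:08:21.000000Z"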

I'm also setting a Recreate strategy [2], because:

1. Incoming pod surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.
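
Concretely, the Deployment side of that change is just the strategy field
(sketch; other fields unchanged):

  spec:
    strategy:
      type: Recreate   # the default RollingUpdate would surge a second pod alongside the old one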

To get the leader election working, I'm also setting LeaderElectionID
and LeaderElectionNamespace, pattern-matching from [3].  Using
cluster-baremetal-operator as the ID avoids collisions with
baremetal-operator [4].  COMPONENT_NAMESPACE is from:

  - name: COMPONENT_NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace

using the pattern documented in [5].  We've been asking for that
environment variable since f368a4f (Add sa, deployment and co
manifests, 2020-09-29, openshift#25), although grepping through history, I
don't see anything consuming it.  But with this commit, we now have a
consumer :).

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 21, 2023
…ader election

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-baremetal-operator that referenced this pull request Feb 7, 2024
…ader election
