
Conversation

@wking (Member) commented Oct 4, 2023

We'd dropped the last of these in 856aab2 (#33005) and 5e746a7 (#39897). There's now renewed interest in how these sorts of rollbacks look, so I'm reviving them for recent releases. I expect the issues with these rollbacks will at least include the cluster-version operator losing the ability to write to ClusterVersion, because the older CRD's enum rejects the capabilities added in the new release:

$ cd openshift/api
$ git diff origin/release-4.13..origin/release-4.14 -- config/v1/types_cluster_version.go | grep kubebuilder:validation:Enum

generating:

-// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning
+// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning;MachineAPI;Build;DeploymentConfig;ImageRegistry
-// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;vCurrent
+// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;v4.14;vCurrent

So a cluster updating from 4.13 to 4.14 will enable (possibly implicitly) MachineAPI and other newly-labeled-in-4.14 capabilities. And then, when the 4.13 ClusterVersion CRD is pushed during the rollback, those values become illegal, and the Kubernetes API server will reject the cluster-version operator's attempts to write ClusterVersion status with errors complaining about the unrecognised MachineAPI and other capability strings:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/941/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1671502401497993216/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7fd84b7b99-8b2qk_cluster-version-operator.log | grep 'ClusterVersion.config.openshift.io "version" is invalid' | tail -n1
I0621 16:45:41.154360       1 cvo.go:601] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: status.capabilities.enabledCapabilities[3]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"
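
As a quick cross-check while triaging one of these rollbacks, you can see exactly which capability enum the currently-installed CRD is enforcing. A minimal sketch, assuming a cluster-admin kubeconfig and the standard clusterversions.config.openshift.io CRD name:

$ # show any enum blocks baked into the live ClusterVersion CRD schema
$ oc get crd clusterversions.config.openshift.io -o yaml | grep -A 12 'enum:'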

openshift-ci bot requested review from neisw and vrutkovs October 4, 2023 18:55
openshift-ci-robot added the rehearsals-ack label Oct 4, 2023
openshift-ci bot added the approved label Oct 4, 2023
@wking force-pushed the revive-cross-minor-rollbacks branch from 8a02da6 to 923834f October 4, 2023 18:59
openshift-ci-robot removed the rehearsals-ack label Oct 4, 2023

openshift-ci bot commented Oct 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…y-4.14-upgrade-from-stable-4.13: Restore cross-minor rollbacks

We'd dropped the last of these in 856aab2
(ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10:
Drop failing rollback jobs, 2022-10-11, openshift#33005) and 5e746a7
(ci-operator/config/openshift/release: Drop cross-minor rollback jobs,
2023-06-07, openshift#39897).  There's now renewed interest in how these sorts
of rollbacks look, so I'm reviving them for recent releases.  I expect
the issues with these rollbacks will at least include the
cluster-version operator losing the ability to write to
ClusterVersion, because the older CRD's enum rejects the capabilities
added in the new release:

  openshift/api $ git diff origin/release-4.13..origin/release-4.14 -- config/v1/types_cluster_version.go | grep kubebuilder:validation:Enum
  -// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning
  +// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning;MachineAPI;Build;DeploymentConfig;ImageRegistry
  -// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;vCurrent
  +// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;v4.14;vCurrent

So a cluster updating from 4.13 to 4.14 will enable (possibly
implicitly) MachineAPI and other newly-labeled-in-4.14 capabilities.
And then when the 4.13 ClusterVersion CRD is pushed during the
rollback, those values become illegal, and the Kubernetes API server
will reject the cluster-version operator's attempts to write
ClusterVersion status with errors complaining about the unrecognised
MachineAPI and other capability strings [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/941/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1671502401497993216/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7fd84b7b99-8b2qk_cluster-version-operator.log | grep 'ClusterVersion.config.openshift.io "version" is invalid' | tail -n1
  I0621 16:45:41.154360       1 cvo.go:601] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: status.capabilities.enabledCapabilities[3]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"

[1]: openshift/cluster-version-operator#941 (review)
@wking force-pushed the revive-cross-minor-rollbacks branch from 923834f to f2d70d2 October 4, 2023 19:04
@openshift-ci-robot commented

[REHEARSALNOTIFIER]
@wking: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name: periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-rollback
Repo: N/A
Type: periodic
Reason: Periodic changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 10 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 20 rehearsals
Comment: /pj-rehearse max to run up to 35 rehearsals
Comment: /pj-rehearse auto-ack to run up to 10 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@wking (Member, Author) commented Oct 4, 2023

/pj-rehearse

openshift-ci bot commented Oct 5, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/rehearse/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-rollback
Commit: f2d70d2
Required: unknown
Rerun command: /pj-rehearse periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-rollback


@wking (Member, Author) commented Oct 5, 2023

The rehearsal failed to gather pod logs in gather-extra, but with a kubeconfig for the running job you can see the expected enum issues:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail 50 | grep 'ClusterVersion.config.openshift.io "version" is invalid:' | tail -n1
I1005 01:41:28.165931       1 cvo.go:600] Dropping "openshift-cluster-version/version" out of the queue &{0xc00039eae0 0xc0004142e8}: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.knownCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"]
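
For a quick read on whether the currently-installed CRD would accept a given capability value at all, a server-side dry-run write can help. This is only a sketch, assuming an oc new enough to support --subresource on patch; nothing is persisted, and with the 4.13 schema in place the request should fail with the same "Unsupported value" error:

$ # ask the API server to validate (but not persist) a status write that enables MachineAPI
$ oc patch clusterversion version --subresource=status --type=merge --dry-run=server -p '{"status":{"capabilities":{"enabledCapabilities":["MachineAPI"]}}}'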

openshift-ovn-kubernetes also seems to be having some trouble, with the ovnkube-master-* pods all triggering KubePodCrashLooping. I grabbed an inspect of that namespace, but I'm not sure quite what was failing:

$ oc adm inspect namespace/openshift-ovn-kubernetes
$ tail -n2 inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*/*/logs/previous.log 
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:38:30.251308118Z 2023-10-05T01:38:30.251Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:39:33.646046973Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_nb_ovsdb

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/northd/northd/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:41:23.741727226Z I1005 01:41:23.741713       1 reflector.go:227] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:150
2023-10-05T01:41:23.748127082Z I1005 01:41:23.748105       1 ovnkube.go:376] No longer leader; exiting

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:39:51.105932998Z 2023-10-05T01:39:51.105Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:40:14.150330913Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-sb-cluster-remote-port=9644 --db-sb-cluster-remote-addr=${init_ip} --db-sb-cluster-remote-proto=ssl --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_sb_ovsdb

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:41:05.065005478Z 2023-10-05T01:41:05.064Z|00047|raft|INFO|term 7: 10894 ms timeout expired, starting election
2023-10-05T01:41:11.468711173Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" ${election_timer} run_nb_ovsdb

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/northd/northd/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:42:00.855984335Z I1005 01:42:00.855968       1 metrics.go:504] Stopping metrics server 127.0.0.1:29102
2023-10-05T01:42:00.860873890Z I1005 01:42:00.860850       1 ovnkube.go:376] No longer leader; exiting

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:41:28.891165874Z 2023-10-05T01:41:28.891Z|00018|memory|INFO|atoms:29 cells:39 monitors:0 n-weak-refs:0 raft-log:3 txn-history:2 txn-history-atoms:12
2023-10-05T01:41:59.983157155Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" ${election_timer} run_sb_ovsdb

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:40:00.414759707Z 2023-10-05T01:40:00.414Z|00038|ovsdb_jsonrpc_server|ERR|pssl:9641: listen failed: Address already in use
2023-10-05T01:40:12.071012110Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_nb_ovsdb

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/northd/northd/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:40:53.349933768Z I1005 01:40:53.349925       1 reflector.go:227] Stopping reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:131
2023-10-05T01:40:53.355518493Z I1005 01:40:53.355497       1 ovnkube.go:376] No longer leader; exiting

==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:40:29.512154920Z 2023-10-05T01:40:29.512Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:40:52.501826012Z [1]+  Done                    exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-sb-cluster-remote-port=9644 --db-sb-cluster-remote-addr=${init_ip} --db-sb-cluster-remote-proto=ssl --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_sb_ovsdb

There could certainly be other issues going on beyond those as well.

@wking (Member, Author) commented Oct 6, 2023

I'll close this for now, but it's easy to re-open or float similar changes in the future if folks want to take another run at it :)

@wking closed this Oct 6, 2023
@wking deleted the revive-cross-minor-rollbacks branch October 6, 2023 02:43
wking added a commit to wking/openshift-api that referenced this pull request Oct 16, 2023
… capabilities

The cluster-version operator should be the only actor writing to
ClusterVersion status.  Trust it to pick appropriate values, instead
of restricting it with an enum.

When updating into new capabilities, the current timeline is:

1. Outgoing CVO accepts the new target, and figures out verified and
   implicit capabilities.
2. Outgoing CVO starts trying to write to status, but because the
   new capability is not part of the outgoing CRD's capability
   enum, this fails.
3. Outgoing CVO pushes the incoming ClusterVersion CRD in runlevel 0
   index 1 [1].
4. Status sync attempts, if any, start working again.
5. Outgoing CVO pushes the incoming CVO Deployment in runlevel 0 index
   3 [2] (the manifest ordering is sketched just after this list).
6. Deployment controller TERMs the outgoing CVO process.
7. Outgoing CVO wraps up the manifest syncing.
8. Outgoing CVO attempts a final status sync [3].
9. Outgoing CVO releases the leader lock [4].
10. Outgoing CVO exits.
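
The runlevel-0 ordering behind steps 3 and 5 can be spot-checked by
extracting the payload manifests.  A rough sketch, assuming oc adm
release extract and an example 4.14 release pullspec (the exact
manifest filenames may differ):

  $ # pull the payload manifests and list the runlevel-0 CVO entries
  $ oc adm release extract --to=/tmp/manifests quay.io/openshift-release-dev/ocp-release:4.14.0-x86_64
  $ ls /tmp/manifests/0000_00_cluster-version-operator_*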

With the status enum removal in this commit, ClusterVersion status
syncing will remain working the whole time, reducing the risk of the
outgoing CVO shutting down before it has recorded important
information in status.

The enum relaxation will also help with rollbacks that drop
capabilities.  Those currently fail on the same status-sync issue [5]:

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail 50 | grep 'ClusterVersion.config.openshift.io "version" is invalid:' | tail -n1
  I1005 01:41:28.165931       1 cvo.go:600] Dropping "openshift-cluster-version/version" out of the queue &{0xc00039eae0 0xc0004142e8}: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.knownCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"]

but unlike the roll-forward case, where that is a temporary issue, on
rollbacks the additional capabilities never become acceptable again,
and the cluster-version operator is permanently blocked from writing
to ClusterVersion status.  With the status enum removal in this
commit, ClusterVersion status syncing will keep working, even though
it will still list as enabled the capabilities that were added during
the earlier roll-forward.

[1]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/Dockerfile.rhel#L11
[2]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/install/0000_00_cluster-version-operator_03_deployment.yaml
[3]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/pkg/cvo/cvo.go#L421-L423
[4]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/pkg/start/start.go#L258-L278
[5]: openshift/release#43984 (comment)
wking added a commit to wking/openshift-api that referenced this pull request Oct 16, 2023
… capabilities
