ci-operator/config/openshift/release/openshift-release-master__nightly-4.14-upgrade-from-stable-4.13: Restore cross-minor rollbacks #43984
Force-pushed from 8a02da6 to 923834f.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
ci-operator/config/openshift/release/openshift-release-master__nightly-4.14-upgrade-from-stable-4.13: Restore cross-minor rollbacks

We'd dropped the last of these in 856aab2 (ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10: Drop failing rollback jobs, 2022-10-11, openshift#33005) and 5e746a7 (ci-operator/config/openshift/release: Drop cross-minor rollback jobs, 2023-06-07, openshift#39897). There's now renewed interest in how these sorts of rollbacks look, so I'm reviving them for recent releases.

I expect the issues with these rollbacks will at least include the cluster-version operator losing the ability to write to ClusterVersion, as the older CRD's enum rejects the capabilities added in the new release:

openshift/api $ git diff origin/release-4.13..origin/release-4.14 -- config/v1/types_cluster_version.go | grep kubebuilder:validation:Enum
-// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning
+// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning;MachineAPI;Build;DeploymentConfig;ImageRegistry
-// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;vCurrent
+// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;v4.14;vCurrent

So a cluster updating from 4.13 to 4.14 will enable (possibly implicitly) MachineAPI and other newly-labeled-in-4.14 capabilities. And then, when the 4.13 ClusterVersion CRD is pushed during the rollback, those values become illegal, and the Kubernetes API server will reject the cluster-version operator's attempts to write ClusterVersion status with errors complaining about the unrecognised MachineAPI and other capability strings [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/941/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1671502401497993216/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7fd84b7b99-8b2qk_cluster-version-operator.log | grep 'ClusterVersion.config.openshift.io "version" is invalid' | tail -n1
I0621 16:45:41.154360       1 cvo.go:601] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: status.capabilities.enabledCapabilities[3]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"

[1]: openshift/cluster-version-operator#941 (review)
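To see concretely which values are at stake before attempting such a rollback, here is a small illustrative sketch (not part of this PR) that uses the openshift/client-go clientset to list a cluster's enabled capabilities; on a 4.14 cluster these are exactly the names the restored 4.13 CRD's enum would reject:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/clientcmd"

    configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
    // Load kubeconfig from the default location (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := configclient.NewForConfigOrDie(cfg)

    // ClusterVersion is cluster-scoped and conventionally named "version".
    cv, err := client.ConfigV1().ClusterVersions().Get(context.Background(), "version", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // These are the values the API server validates against the
    // installed CRD's capability enum.
    for _, c := range cv.Status.Capabilities.EnabledCapabilities {
        fmt.Println(c)
    }
}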
Force-pushed from 923834f to f2d70d2.
[REHEARSALNOTIFIER]

Interacting with pj-rehearse: once you are satisfied with the results of the rehearsals, comment: /pj-rehearse
@wking: The following test failed, say /retest to rerun all failed tests.
Rehearsal failed. Gathering pod logs:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail 50 | grep 'ClusterVersion.config.openshift.io "version" is invalid:' | tail -n1
I1005 01:41:28.165931 1 cvo.go:600] Dropping "openshift-cluster-version/version" out of the queue &{0xc00039eae0 0xc0004142e8}: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.knownCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"]
$ oc adm inspect namespace/openshift-ovn-kubernetes
$ tail -n2 inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-*/*/*/logs/previous.log
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:38:30.251308118Z 2023-10-05T01:38:30.251Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:39:33.646046973Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_nb_ovsdb
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/northd/northd/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:41:23.741727226Z I1005 01:41:23.741713 1 reflector.go:227] Stopping reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:150
2023-10-05T01:41:23.748127082Z I1005 01:41:23.748105 1 ovnkube.go:376] No longer leader; exiting
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-9jvqc/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:39:51.105932998Z 2023-10-05T01:39:51.105Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:40:14.150330913Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-sb-cluster-remote-port=9644 --db-sb-cluster-remote-addr=${init_ip} --db-sb-cluster-remote-proto=ssl --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_sb_ovsdb
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:41:05.065005478Z 2023-10-05T01:41:05.064Z|00047|raft|INFO|term 7: 10894 ms timeout expired, starting election
2023-10-05T01:41:11.468711173Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" ${election_timer} run_nb_ovsdb
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/northd/northd/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:42:00.855984335Z I1005 01:42:00.855968 1 metrics.go:504] Stopping metrics server 127.0.0.1:29102
2023-10-05T01:42:00.860873890Z I1005 01:42:00.860850 1 ovnkube.go:376] No longer leader; exiting
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-bn428/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:41:28.891165874Z 2023-10-05T01:41:28.891Z|00018|memory|INFO|atoms:29 cells:39 monitors:0 n-weak-refs:0 raft-log:3 txn-history:2 txn-history-atoms:12
2023-10-05T01:41:59.983157155Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" ${election_timer} run_sb_ovsdb
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/kube-rbac-proxy/kube-rbac-proxy/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/nbdb/nbdb/logs/previous.log <==
2023-10-05T01:40:00.414759707Z 2023-10-05T01:40:00.414Z|00038|ovsdb_jsonrpc_server|ERR|pssl:9641: listen failed: Address already in use
2023-10-05T01:40:12.071012110Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-nb-cluster-remote-port=9643 --db-nb-cluster-remote-addr=${init_ip} --db-nb-cluster-remote-proto=ssl --ovn-nb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_nb_ovsdb
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/northd/northd/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/ovn-dbchecker/ovn-dbchecker/logs/previous.log <==
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/ovnkube-master/ovnkube-master/logs/previous.log <==
2023-10-05T01:40:53.349933768Z I1005 01:40:53.349925 1 reflector.go:227] Stopping reflector *v1.EgressIP (0s) from github.com/openshift/ovn-kubernetes/go-controller/pkg/crd/egressip/v1/apis/informers/externalversions/factory.go:131
2023-10-05T01:40:53.355518493Z I1005 01:40:53.355497 1 ovnkube.go:376] No longer leader; exiting
==> inspect.local.4926021517889636450/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-t8phs/sbdb/sbdb/logs/previous.log <==
2023-10-05T01:40:29.512154920Z 2023-10-05T01:40:29.512Z|00016|memory|INFO|atoms:15 cells:20 monitors:0 n-weak-refs:0
2023-10-05T01:40:52.501826012Z [1]+ Done exec /usr/share/ovn/scripts/ovn-ctl ${OVN_ARGS} --db-sb-cluster-remote-port=9644 --db-sb-cluster-remote-addr=${init_ip} --db-sb-cluster-remote-proto=ssl --ovn-sb-log="-vconsole:${OVN_LOG_LEVEL} -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m" run_sb_ovsdb

Could certainly be other issues going on beyond those as well.
I'll close this for now, but it's easy to re-open or float similar changes in the future if folks want to take another run at it :)
… capabilities
The cluster-version operator should be the only actor writing to
ClusterVersion status. Trust it to pick appropriate values, instead
of restricting it with an enum.
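As a compressed sketch of what that relaxation could look like in the Go API types (an illustration of the shape only, not the verbatim openshift/api source):

package v1

// ClusterVersionCapability names an optional cluster capability.
//
// Note the absence of a +kubebuilder:validation:Enum marker: the
// cluster-version operator is the only writer of ClusterVersion
// status, so the schema no longer rejects capability names that were
// added in a newer release than the installed CRD.
type ClusterVersionCapability string

// ClusterVersionCapabilitiesStatus describes the state of optional
// cluster capabilities in ClusterVersion status.
type ClusterVersionCapabilitiesStatus struct {
    // enabledCapabilities lists all the capabilities that are
    // currently managed.
    // +listType=atomic
    // +optional
    EnabledCapabilities []ClusterVersionCapability `json:"enabledCapabilities,omitempty"`

    // knownCapabilities lists all the capabilities known to the
    // current cluster.
    // +listType=atomic
    // +optional
    KnownCapabilities []ClusterVersionCapability `json:"knownCapabilities,omitempty"`
}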
When updating into new capabilities, the current timeline is:
1. Outgoing CVO accepts the new target, and figures out verified and
implicit capabilities.
2. Outgoing CVO starts trying to write to status, but this fails
because the new capability is not part of the outgoing CRD's
capability enum.
3. Outgoing CVO pushes the incoming ClusterVersion CRD in runlevel 0
index 1 [1].
4. Status sync attempts, if any, start working again.
5. Outgoing CVO pushes the incoming CVO Deployment in runlevel 0 index
3 [2].
6. Deployment controller TERMs the outgoing CVO process.
7. Outgoing CVO wraps up the manifest syncing.
8. Outgoing CVO attempts a final status sync [3].
9. Outgoing CVO releases the leader lock [4].
10. Outgoing CVO exits.
With the status enum removal in this commit, ClusterVersion status
syncing will keep working the whole time, reducing the risk of the
outgoing CVO shutting down before it has recorded important
information in status.
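A rough, hypothetical sketch of the tail of that timeline (steps 6 through 10); finalStatusSync and releaseLeaderLock are illustrative stand-ins, not the real CVO functions:

package main

import (
    "context"
    "fmt"
    "os/signal"
    "syscall"
    "time"
)

// finalStatusSync and releaseLeaderLock are hypothetical stand-ins
// for the CVO's real shutdown work (steps 8 and 9).
func finalStatusSync(ctx context.Context) error {
    fmt.Println("final ClusterVersion status sync")
    return nil
}

func releaseLeaderLock() {
    fmt.Println("leader lock released")
}

func main() {
    // Step 6: the Deployment controller sends SIGTERM to the
    // outgoing CVO once the incoming Deployment is pushed.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
    defer stop()

    <-ctx.Done() // step 7 would wrap up manifest syncing here

    // Step 8: one bounded, final status write. If the schema still
    // rejected the capability values at this point, whatever the CVO
    // wanted to record would be lost, which is the risk the enum
    // removal reduces.
    syncCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := finalStatusSync(syncCtx); err != nil {
        fmt.Printf("final status sync failed: %v\n", err)
    }

    // Steps 9-10: release the lock so the incoming CVO can take
    // over, then exit.
    releaseLeaderLock()
}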
The enum relaxation will also help with rollbacks that drop
capabilities. Those currently fail on the same status-sync issue [5]:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail 50 | grep 'ClusterVersion.config.openshift.io "version" is invalid:' | tail -n1
I1005 01:41:28.165931 1 cvo.go:600] Dropping "openshift-cluster-version/version" out of the queue &{0xc00039eae0 0xc0004142e8}: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.knownCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.knownCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[0]: Unsupported value: "Build": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[3]: Unsupported value: "DeploymentConfig": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[4]: Unsupported value: "ImageRegistry": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", status.capabilities.enabledCapabilities[6]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"]
but unlike the roll-forward case, where that is a temporary issue, on
rollbacks the additional capabilities never become acceptable again,
and the cluster-version operator is permanently blocked from writing
to ClusterVersion status. With the status enum removal in this
commit, ClusterVersion status syncing will keep working, even though
it will list as enabled the capabilities that had been added during
the earlier roll-forward.
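For completeness, a hedged sketch of how the rejection surfaces to a client: re-writing status against a CRD whose enum lacks the recorded values comes back as an Invalid API error (illustrative only, using the standard openshift/client-go clientset, not actual CVO code):

package main

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/clientcmd"

    configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := configclient.NewForConfigOrDie(cfg)
    ctx := context.Background()

    cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }

    // Writing status back unchanged is enough to trip schema
    // validation when status already holds capability names the
    // installed (older) CRD's enum does not list.
    if _, err := client.ConfigV1().ClusterVersions().UpdateStatus(ctx, cv, metav1.UpdateOptions{}); err != nil {
        if apierrors.IsInvalid(err) {
            fmt.Printf("schema rejected the status write: %v\n", err)
            return
        }
        panic(err)
    }
    fmt.Println("status write accepted")
}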
[1]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/Dockerfile.rhel#L11
[2]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/install/0000_00_cluster-version-operator_03_deployment.yaml
[3]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/pkg/cvo/cvo.go#L421-L423
[4]: https://github.com/openshift/cluster-version-operator/blob/baf7ba7b45852ad2b95e9e498585fbee35111eef/pkg/start/start.go#L258-L278
[5]: openshift/release#43984 (comment)