ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10: Drop failing rollback jobs #33005
…1-upgrade-from-stable-4.10: Drop failing rollback jobs

Like 2d73374 (ci-operator/config/openshift/release: Drop failing minor rollback tests, 2022-02-28, openshift#26629), but for the 4.10-to-4.11-to-4.10 rollbacks. This time both the OVN and SDN rollback jobs are perma-failing [1,2], and in both cases the issue is sticking on [3,4]:

  INFO: cluster upgrade is Progressing: Working towards 4.10.35: 614 of 773 done (79% complete), waiting on openshift-controller-manager

with that operator crash-looping on [5,6]:

  F1010 09:51:56.918590 1 cmd.go:138] open /var/run/configmaps/config/config.yaml: permission denied

I haven't dug in more deeply to try to understand that failure, but as 2d73374 points out:

> Since we don't support minor rollbacks, or really rollbacks of any
> sort [12], I'm dropping these jobs instead of root-causing the hang.
> ...
> [12]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162

Since then, those docs have moved to [7], but the lack of rollback support still stands.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback/1579338022623645696
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback/1578454440359235584
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback/1579338022623645696/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-7fbc8cc67d-zbrv4_openshift-controller-manager-operator_previous.log
[6]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback/1578454440359235584/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-7fbc8cc67d-s5pwz_openshift-controller-manager-operator_previous.log
[7]: https://github.com/openshift/openshift-docs/blob/7f87267bc69d65abd96e6b783100195c6b78549f/updating/updating-troubleshooting.adoc
@wking: all tests passed!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking
@wking: Updated the following 2 configmaps:
The job flavor was originally added in 0837634 (Add ovn-upgrade-rollback job for 4.7->4.8, 2021-02-24, openshift#16260). The jobs have subsequently been cloned forward to new minors as part of the branching process. And as older jobs started failing, I'd been dropping them gradually like 856aab2 (ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10: Drop failing rollback jobs, 2022-10-11, openshift#33005).

But rounding with Jamo, the jobs no longer serve a useful role, and as 856aab2 points out, rollbacks between minor releases are not supported. Drop the likely-to-fail and not-useful-even-when-it-passes jobs in their entirety, so they stop getting cloned forward during branching.

I'm also adjusting the release controller changes from 421c921 (Introducing Rollback informing jobs, 2023-05-19, openshift#39488). I'm dropping 4.12 and earlier rollback informers, so we can focus on 4.13 while we feel out the new process. And I'm pivoting 4.13 away from the cross-minor job that this pull request drops, and towards the rollback-oldest-supported job that will help back [1].

[1]: https://issues.redhat.com/browse/OTA-455
…39897)

* ci-operator/config/openshift/release: Drop cross-minor rollback jobs

The job flavor was originally added in 0837634 (Add ovn-upgrade-rollback job for 4.7->4.8, 2021-02-24, #16260). The jobs have subsequently been cloned forward to new minors as part of the branching process. And as older jobs started failing, I'd been dropping them gradually like 856aab2 (ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10: Drop failing rollback jobs, 2022-10-11, #33005).

But rounding with Jamo, the jobs no longer serve a useful role, and as 856aab2 points out, rollbacks between minor releases are not supported. Drop the likely-to-fail and not-useful-even-when-it-passes jobs in their entirety, so they stop getting cloned forward during branching.

I'm also adjusting the release controller changes from 421c921 (Introducing Rollback informing jobs, 2023-05-19, #39488). I'm dropping 4.12 and earlier rollback informers, so we can focus on 4.13 while we feel out the new process. And I'm pivoting 4.13 away from the cross-minor job that this pull request drops, and towards the rollback-oldest-supported job that will help back [1].

[1]: https://issues.redhat.com/browse/OTA-455

* hack/validate-release-controller-config: Supplemental Git diff

Because [1]:

  ERROR: The following differences were found:
  3a4
  > 03c544e5d55a55ae9f19d0de7d786341 .//core-services/release-controller/_releases/priv/release-ocp-4.12.json
  35d35
  < 1826a1b520574b66f152f814811c19f6 .//core-services/release-controller/_releases/priv/release-ocp-4.13.json
  42a43
  ...

tells me what files need changing, but not what changes to make to them.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39897/pull-ci-openshift-release-master-release-controller-config/1664331471080394752

---------

Co-authored-by: wking <wking@penguin>
…y-4.14-upgrade-from-stable-4.13: Restore cross-minor rollbacks

We'd dropped the last of these in 856aab2 (ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10: Drop failing rollback jobs, 2022-10-11, openshift#33005) and 5e746a7 (ci-operator/config/openshift/release: Drop cross-minor rollback jobs, 2023-06-07, openshift#39897). There's now renewed interest in how these sorts of rollbacks look, so I'm reviving them for recent releases.

I expect the issues with these rollbacks will at least include the cluster-version operator losing the ability to write to ClusterVersion, as the older CRD's enum rejects the capabilities added in the new release:

  openshift/api $ git diff origin/release-4.13..origin/release-4.14 -- config/v1/types_cluster_version.go | grep kubebuilder:validation:Enum
  -// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning
  +// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning;MachineAPI;Build;DeploymentConfig;ImageRegistry
  -// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;vCurrent
  +// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;v4.14;vCurrent

So a cluster updating from 4.13 to 4.14 will enable (possibly implicitly) MachineAPI and other newly-labeled-in-4.14 capabilities. And then when the 4.13 ClusterVersion CRD is pushed during the rollback, those values become illegal, and the Kubernetes API server will reject the cluster-version operator's attempts to write ClusterVersion status with errors complaining about the unrecognised MachineAPI and other capability strings [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/941/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1671502401497993216/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7fd84b7b99-8b2qk_cluster-version-operator.log | grep 'ClusterVersion.config.openshift.io "version" is invalid' | tail -n1
  I0621 16:45:41.154360 1 cvo.go:601] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: status.capabilities.enabledCapabilities[3]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"

[1]: openshift/cluster-version-operator#941 (review)
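
The enum mismatch described above can be sketched as a set difference. This is only an illustration, not the API server's validation code: the two sets are copied from the kubebuilder:validation:Enum diff quoted in the commit message, while the function name `rejected_after_rollback` and the assumption that every 4.14 capability is enabled are hypothetical.

```python
# Capability enum values copied from the kubebuilder:validation:Enum diff
# quoted in the commit message (4.13 side and 4.14 side).
CAPABILITIES_4_13 = {
    "openshift-samples", "baremetal", "marketplace", "Console",
    "Insights", "Storage", "CSISnapshot", "NodeTuning",
}
CAPABILITIES_4_14 = CAPABILITIES_4_13 | {
    "MachineAPI", "Build", "DeploymentConfig", "ImageRegistry",
}


def rejected_after_rollback(enabled, older_enum):
    """Return the enabled capabilities that the older CRD enum does not list.

    Hypothetical helper: it only mimics the effect of enum validation,
    where any value outside the allowed set makes the object invalid.
    """
    return [cap for cap in enabled if cap not in older_enum]


# Assumption for illustration: the 4.14 cluster has every 4.14 capability enabled.
enabled = sorted(CAPABILITIES_4_14)
rejected = rejected_after_rollback(enabled, CAPABILITIES_4_13)
print(rejected)
```

Any non-empty result means a status write listing those capabilities would fail validation against the 4.13 CRD, which matches the "Unsupported value" error in the log line above.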
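Like 2d73374 (#26629), but for the 4.10-to-4.11-to-4.10 rollbacks. This time both the OVN and SDN rollback jobs are perma-failing, and in both cases the issue is sticking on:

  INFO: cluster upgrade is Progressing: Working towards 4.10.35: 614 of 773 done (79% complete), waiting on openshift-controller-manager

with that operator crash-looping on:

  F1010 09:51:56.918590 1 cmd.go:138] open /var/run/configmaps/config/config.yaml: permission denied

I haven't dug in more deeply to try to understand that failure, but as 2d73374 points out:

> Since we don't support minor rollbacks, or really rollbacks of any
> sort, I'm dropping these jobs instead of root-causing the hang.

Since then, those docs have moved to https://github.com/openshift/openshift-docs/blob/7f87267bc69d65abd96e6b783100195c6b78549f/updating/updating-troubleshooting.adoc, but the lack of rollback support still stands.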