ci-operator/config/openshift/release: Drop 4.8 -> 4.9 -> 4.8 rollback jobs #22287
4.9 includes the backwards-incompatible etcd disk-schema change from etcd v3.5.0 [1]. That causes rollback jobs to fail like [2,3]:

  Working towards 4.8.12: 69 of 678 done (10% complete)

where the cluster-version operator is waiting for the etcd operator (but hasn't been waiting quite long enough to complain about it by name):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7cfbc65959-xctxv_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n3
  I0925 23:25:54.121866 1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 11
  I0925 23:31:36.036075 1 task_graph.go:555] Result of work: [Cluster operator etcd is degraded]
  I0925 23:34:38.987554 1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 12

Seeing what the etcd operator has to say:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-09-25T19:56:27Z RecentBackup=Unknown ControllerStarted: -
  2021-09-25T22:01:04Z Degraded=True EtcdMembers_UnhealthyMembers::StaticPods_Error: EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
  StaticPodsDegraded: pod/etcd-ip-10-0-246-48.us-east-2.compute.internal container "etcd" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-246-48.us-east-2.compute.internal_openshift-etcd(967f9e83-e6a2-437e-85e6-c33563286f7f)
  2021-09-25T21:55:07Z Progressing=True NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 4; 0 nodes have achieved new revision 5
  2021-09-25T19:57:59Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4; 0 nodes have achieved new revision 5
  EtcdMembersAvailable: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
  2021-09-25T19:56:43Z Upgradeable=True AsExpected: All is well

And from the logs of that container [4]:

  {"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...

We know these rollback jobs will always stick on that etcd disk-schema rollback, so no sense in spending CI money running them and watching them fail. This commit drops the jobs, and we won't worry about other minor-rollback issues around 4.8-to-4.9.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1999777#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/clusterversion.json
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-etcd_etcd-ip-10-0-246-48.us-east-2.compute.internal_etcd.log
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stbenjam, wking
@wking: Updated the following 2 configmaps:
In 4.11, [1], [2], and [3] all have recent passes, so I'm leaving them in.

In 4.10, [4] and [5] have recent passes, so I'm leaving them in. Checking [6], both [7] and [8] update from 4.9 to 4.10 and start heading back towards 4.9, but they hang a control-plane node on drain. Same for the OVN flavor [9,10,11]. Since we don't support minor rollbacks, or really rollbacks of any sort [12], I'm dropping these jobs instead of root-causing the hang.

In 4.9, [13] and [14] have recent passes, so I'm leaving them in. We already dropped the other 4.8 -> 4.9 -> 4.8 rollback jobs back in b3d04e5 (ci-operator/config/openshift/release: Drop 4.8 -> 4.9 -> 4.8 rollback jobs, 2021-09-27, openshift#22287).

In 4.8, [15] and [16] have recent passes, so I'm leaving them in. 4.7 -> 4.8 -> 4.7 rollback tests time out [17,18,19,20,21,22], without the pretty e2e-interval chart to make identifying the stuck thing easier. But again, not supported, so dropping instead of sinking time into root-causing.

In 4.7, [23] and [24] have recent passes, so I'm leaving them in. 4.6 -> 4.7 -> 4.6 rollback tests time out [25,26], so dropping them.

In 4.6, [27] has recent passes, so I'm leaving it in. 4.5 -> 4.6 -> 4.5 rollback tests [28,29] timed out or failed to build, but I've been dropping all the 4.y minor rollback jobs since 4.10, so keeping these around to see if subsequent runs will build and pass seems unlikely to be worth the effort. Dropping them too.

4.5 is end-of-life [30], so I'm dropping 4.4 -> 4.5 -> 4.4 rollback jobs without even looking to see if they're passing.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-e2e-aws-upgrade-rollback
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback
[3]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback
[4]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-aws-upgrade-rollback
[5]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade-rollback-oldest-supported
[6]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1497288333930270720
[8]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1498013325101895680
[9]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback
[10]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback/1497671569042837504
[11]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback/1498033961199210496
[12]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162
[13]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback
[14]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade-rollback-oldest-supported
[15]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-rollback
[16]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported
[17]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[18]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1496996649593999360
[19]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1497721649917595648
[20]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback
[21]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback/1497620733604401152
[22]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback/1497983125802717184
[23]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-e2e-aws-upgrade-rollback
[24]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-upgrade-rollback-oldest-supported
[25]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[26]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1497650430434349056
[27]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback
[28]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback
[29]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback/1494388672508727296
[30]: https://access.redhat.com/support/policy/updates/openshift#dates
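
The jobs dropped above all carry an `upgrade-from-stable-4.y` segment in their names, while the same-minor rollback jobs that were kept do not. As a rough sketch of that distinction, the following Python helper classifies a periodic job name. The naming pattern is inferred only from the job names cited in the references above, and `minor_rollback` is a hypothetical helper, not part of ci-operator or Prow.

```python
import re

# Assumed naming convention, inferred from the cited job names:
#   ...-<target>-upgrade-from-stable-<source>-...-rollback
_MINOR_ROLLBACK = re.compile(
    r"-(?P<target>\d+\.\d+)-upgrade-from-stable-(?P<source>\d+\.\d+)-.*-rollback$"
)


def minor_rollback(job_name: str):
    """Return (source, target) for a source -> target -> source rollback job.

    Returns None when the name does not match the assumed pattern, which
    is the case for the same-minor rollback jobs.
    """
    match = _MINOR_ROLLBACK.search(job_name)
    if match is None:
        return None
    return match.group("source"), match.group("target")
```

For example, `minor_rollback("periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback")` gives `("4.9", "4.10")`, whereas the kept job `periodic-ci-openshift-release-master-ci-4.10-e2e-aws-upgrade-rollback` gives `None`.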