@wking wking commented Sep 27, 2021

4.9 includes the backwards-incompatible etcd disk-schema change from etcd v3.5.0. That causes rollback jobs to fail like:

Working towards 4.8.12: 69 of 678 done (10% complete)

where the cluster-version operator is waiting for the etcd operator (but hasn't been waiting quite long enough to complain about it by name):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7cfbc65959-xctxv_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n3
I0925 23:25:54.121866       1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 11
I0925 23:31:36.036075       1 task_graph.go:555] Result of work: [Cluster operator etcd is degraded]
I0925 23:34:38.987554       1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 12
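
For anyone who prefers to do that filtering offline, here is a minimal Python sketch of the same `grep ... | tail -n3` step. It assumes the cluster-version operator log has already been downloaded into a string. The function name `last_sync_events` and the abbreviated sample lines (including the "unrelated" filler line) are hypothetical, made up for illustration.

```python
import re
from typing import List

# Same alternation as the grep pattern above, in Python regex syntax.
PATTERN = re.compile(r"Running sync.*in state|Result of work")


def last_sync_events(log_text: str, count: int = 3) -> List[str]:
    """Return the last `count` matching lines, like `grep PATTERN | tail -n COUNT`."""
    matches = [line for line in log_text.splitlines() if PATTERN.search(line)]
    # Guard count <= 0, because matches[-0:] would return every match.
    return matches[-count:] if count > 0 else []


# Abbreviated, hypothetical sample modelled on the excerpt above.
sample = "\n".join([
    "I0925 23:20:00.000000 1 sync_worker.go:541] Running sync <release> in state Updating at attempt 10",
    "I0925 23:25:54.121866 1 sync_worker.go:541] Running sync <release> in state Updating at attempt 11",
    "I0925 23:26:00.000000 1 example.go:1] unrelated filler line",
    "I0925 23:31:36.036075 1 task_graph.go:555] Result of work: [Cluster operator etcd is degraded]",
    "I0925 23:34:38.987554 1 sync_worker.go:541] Running sync <release> in state Updating at attempt 12",
])

for line in last_sync_events(sample):
    print(line)
```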

Seeing what the etcd operator has to say:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-09-25T19:56:27Z RecentBackup=Unknown ControllerStarted: -
2021-09-25T22:01:04Z Degraded=True EtcdMembers_UnhealthyMembers::StaticPods_Error: EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
StaticPodsDegraded: pod/etcd-ip-10-0-246-48.us-east-2.compute.internal container "etcd" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-246-48.us-east-2.compute.internal_openshift-etcd(967f9e83-e6a2-437e-85e6-c33563286f7f)
2021-09-25T21:55:07Z Progressing=True NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 4; 0 nodes have achieved new revision 5
2021-09-25T19:57:59Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4; 0 nodes have achieved new revision 5
EtcdMembersAvailable: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
2021-09-25T19:56:43Z Upgradeable=True AsExpected: All is well
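
The `jq` expression above can also be sketched in Python, assuming `clusteroperators.json` has been loaded into a dictionary. The helper names `alt` and `format_conditions` are hypothetical. The sample carries only two of the conditions shown above, and it assumes the `RecentBackup` condition has no `message` field, which is consistent with the `-` in the output.

```python
from typing import Any, Dict, List


def alt(value: Any, default: str) -> Any:
    """Mimic jq's `//`: fall back to `default` only for null (None) or false."""
    return default if value is None or value is False else value


def format_conditions(cluster_operators: Dict[str, Any], name: str = "etcd") -> List[str]:
    """Render one 'time type=status reason: message' line per condition."""
    lines = []
    for item in cluster_operators.get("items", []):
        if item.get("metadata", {}).get("name") != name:
            continue
        for cond in item.get("status", {}).get("conditions", []):
            lines.append(
                "{} {}={} {}: {}".format(
                    cond["lastTransitionTime"],
                    cond["type"],
                    cond["status"],
                    alt(cond.get("reason"), "-"),
                    alt(cond.get("message"), "-"),
                )
            )
    return lines


# Trimmed sample built from two of the conditions listed above.
sample = {
    "items": [
        {
            "metadata": {"name": "etcd"},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2021-09-25T19:56:27Z",
                        "type": "RecentBackup",
                        "status": "Unknown",
                        "reason": "ControllerStarted",
                    },
                    {
                        "lastTransitionTime": "2021-09-25T19:56:43Z",
                        "type": "Upgradeable",
                        "status": "True",
                        "reason": "AsExpected",
                        "message": "All is well",
                    },
                ]
            },
        }
    ]
}

for line in format_conditions(sample):
    print(line)
```

Unlike Python's `or`, jq's `//` keeps an empty string, which is why `alt` checks only for `None` and `False`.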

And from the logs of that container:

{"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...
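
To illustrate why the 4.8 etcd binary refuses to start, here is a simplified Python sketch of the kind of comparison the log message describes. It is not etcd's actual `mustDetectDowngrade` implementation. It assumes the check compares only the major and minor components, and the function names are hypothetical.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '3.4.14' or '3.5' into (major, minor), ignoring any patch level."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_invalid_downgrade(server_version: str, cluster_version: str) -> bool:
    """True when the local server is older than the version the cluster settled on.

    Assumption: only major.minor matters, as the log fields above suggest.
    """
    return major_minor(server_version) < major_minor(cluster_version)


# Values taken from the fatal log line above.
if is_invalid_downgrade("3.4.14", "3.5"):
    print("invalid downgrade; server version is lower than determined cluster version")
```

Under that assumption, a 3.4.14 member rejoining a cluster that has already determined its version to be 3.5 fails this check on every restart, which matches the CrashLoopBackOff seen above.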

We know these rollback jobs will always stick on that etcd disk-schema rollback, so there is no sense in spending CI money running them and watching them fail. This commit drops the jobs, and we won't worry about other minor-rollback issues around 4.8-to-4.9.

… jobs

4.9 includes the backwards-incompatible etcd disk-schema change from
etcd v3.5.0 [1].  That causes rollback jobs to fail like [2,3]:

  Working towards 4.8.12: 69 of 678 done (10% complete)

where the cluster-version operator is waiting for the etcd operator
(but hasn't been waiting quite long enough to complain about it by
name):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7cfbc65959-xctxv_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | tail -n3
  I0925 23:25:54.121866       1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 11
  I0925 23:31:36.036075       1 task_graph.go:555] Result of work: [Cluster operator etcd is degraded]
  I0925 23:34:38.987554       1 sync_worker.go:541] Running sync registry.build01.ci.openshift.org/ci-op-sqh94dxj/release@sha256:c3af995af7ee85e88c43c943e0a64c7066d90e77fafdabc7b22a095e4ea3c25a (force=true) on generation 3 in state Updating at attempt 12

Seeing what the etcd operator has to say:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "etcd").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-09-25T19:56:27Z RecentBackup=Unknown ControllerStarted: -
  2021-09-25T22:01:04Z Degraded=True EtcdMembers_UnhealthyMembers::StaticPods_Error: EtcdMembersDegraded: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
  StaticPodsDegraded: pod/etcd-ip-10-0-246-48.us-east-2.compute.internal container "etcd" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd pod=etcd-ip-10-0-246-48.us-east-2.compute.internal_openshift-etcd(967f9e83-e6a2-437e-85e6-c33563286f7f)
  2021-09-25T21:55:07Z Progressing=True NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 4; 0 nodes have achieved new revision 5
  2021-09-25T19:57:59Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 4; 0 nodes have achieved new revision 5
  EtcdMembersAvailable: 2 of 3 members are available, ip-10-0-246-48.us-east-2.compute.internal is unhealthy
  2021-09-25T19:56:43Z Upgradeable=True AsExpected: All is well

And from the logs of that container [4]:

  {"level":"fatal","ts":"2021-09-25T23:34:47.679Z","caller":"membership/cluster.go:790","msg":"invalid downgrade; server version is lower than determined cluster version","current-server-version":"3.4.14","determined-cluster-version":"3.5","stacktrace":"go.etcd.io/etcd/etcdserver/api/membership.mustDetectDowngrade\n\t/go/src/go.etcd.io/etcd/etcdserver/api/membership/cluster.go:790...

We know these rollback jobs will always stick on that etcd disk-schema
rollback, so no sense in spending CI money running them and watching
them fail.  This commit drops the jobs, and we won't worry about other
minor-rollback issues around 4.8-to-4.9.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1999777#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/clusterversion.json
[4]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback/1441850131237310464/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-etcd_etcd-ip-10-0-246-48.us-east-2.compute.internal_etcd.log
@openshift-ci openshift-ci bot requested review from deads2k and stbenjam September 27, 2021 23:03
@stbenjam
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2021
openshift-ci bot commented Sep 27, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stbenjam, wking

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2021
@openshift-merge-robot openshift-merge-robot merged commit 893d679 into openshift:master Sep 27, 2021
openshift-ci bot commented Sep 27, 2021

@wking: Updated the following 2 configmaps:

  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master__ci-4.9-upgrade-from-stable-4.8.yaml using file ci-operator/config/openshift/release/openshift-release-master__ci-4.9-upgrade-from-stable-4.8.yaml
  • job-config-master configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml

@wking wking deleted the drop-4.8-to-4.9-rollbacks branch September 28, 2021 00:18
wking added a commit to wking/openshift-release that referenced this pull request Feb 28, 2022
In 4.11, [1], [2], and [3] all have recent passes, so I'm leaving them in.

In 4.10, [4] and [5] have recent passes, so I'm leaving them in.

Checking [6], both [7] and [8] update from 4.9 to 4.10 and start
heading back towards 4.9, but they hang a control-plane node on drain.
Same for the OVN flavor [9,10,11].  Since we don't support minor
rollbacks, or really rollbacks of any sort [12], I'm dropping these
jobs instead of root-causing the hang.

In 4.9, [13] and [14] have recent passes, so I'm leaving them in.  We
already dropped the other 4.8 -> 4.9 -> 4.8 rollback jobs back in
b3d04e5 (ci-operator/config/openshift/release: Drop 4.8 -> 4.9 ->
4.8 rollback jobs, 2021-09-27, openshift#22287).

In 4.8, [15] and [16] have recent passes, so I'm leaving them in.

4.7 -> 4.8 -> 4.7 rollback tests time out [17,18,19,20,21,22], without
the pretty e2e-interval chart to make identifying the stuck thing
easier.  But again, not supported, so dropping instead of sinking time
into root-causing.

In 4.7, [23] and [24] have recent passes, so I'm leaving them in.

4.6 -> 4.7 -> 4.6 rollback tests time out [25,26], so dropping them.

In 4.6, [27] has recent passes, so I'm leaving it in.

4.5 -> 4.6 -> 4.5 rollback tests [28,29] failed to build, but
I've been dropping all the 4.y minor rollback jobs since 4.10, so
keeping these around to see if subsequent runs will build and pass
seems unlikely to be worth the effort.  Dropping them too.

4.5 is end-of-life [30], so I'm dropping 4.4 -> 4.5 -> 4.4 rollback
jobs without even looking to see if they're passing.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-e2e-aws-upgrade-rollback
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback
[3]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback
[4]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-aws-upgrade-rollback
[5]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade-rollback-oldest-supported
[6]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback
[7]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1497288333930270720
[8]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1498013325101895680
[9]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback
[10]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback/1497671569042837504
[11]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback/1498033961199210496
[12]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162
[13]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback
[14]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade-rollback-oldest-supported
[15]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-rollback
[16]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported
[17]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback
[18]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1496996649593999360
[19]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade-rollback/1497721649917595648
[20]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback
[21]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback/1497620733604401152
[22]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback/1497983125802717184
[23]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-e2e-aws-upgrade-rollback
[24]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-upgrade-rollback-oldest-supported
[25]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback
[26]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade-rollback/1497650430434349056
[27]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback
[28]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback
[29]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade-rollback/1494388672508727296
[30]: https://access.redhat.com/support/policy/updates/openshift#dates