@wking
Member

@wking wking commented Oct 11, 2022

Like 2d73374 (#26629), but for the 4.10-to-4.11-to-4.10 rollbacks. This time both the OVN and SDN rollback jobs are perma-failing, and in both cases the issue is sticking on:

INFO: cluster upgrade is Progressing: Working towards 4.10.35: 614 of 773 done (79% complete), waiting on openshift-controller-manager

with that operator crash-looping on:

F1010 09:51:56.918590       1 cmd.go:138] open /var/run/configmaps/config/config.yaml: permission denied

I haven't dug in more deeply to try to understand that failure, but as 2d73374 points out:

Since we don't support minor rollbacks, or really rollbacks of any sort [12], I'm dropping these jobs instead of root-causing the hang.
...
[12]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162

Since then, those docs have moved to updating/updating-troubleshooting.adoc in openshift-docs, but the lack of rollback support still stands.

@openshift-ci openshift-ci bot requested review from vrutkovs and xueqzhan October 11, 2022 03:41
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2022
…1-upgrade-from-stable-4.10: Drop failing rollback jobs

Like 2d73374 (ci-operator/config/openshift/release: Drop failing
minor rollback tests, 2022-02-28, openshift#26629), but for the
4.10-to-4.11-to-4.10
rollbacks.  This time both the OVN and SDN rollback jobs are
perma-failing [1,2], and in both cases the issue is sticking on [3,4]:

  INFO: cluster upgrade is Progressing: Working towards 4.10.35: 614 of 773 done (79% complete), waiting on openshift-controller-manager

with that operator crash-looping on [5,6]:

  F1010 09:51:56.918590       1 cmd.go:138] open /var/run/configmaps/config/config.yaml: permission denied

I haven't dug in more deeply to try to understand that failure, but as
2d73374 points out:

> Since we don't support minor rollbacks, or really rollbacks of any
> sort [12], I'm dropping these jobs instead of root-causing the hang.
> ...
> [12]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162

Since then, those docs have moved to [7], but the lack of rollback
support still stands.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback/1579338022623645696
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback/1578454440359235584
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback/1579338022623645696/artifacts/e2e-aws-ovn-upgrade-rollback/gather-extra/artifacts/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-7fbc8cc67d-zbrv4_openshift-controller-manager-operator_previous.log
[6]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade-rollback/1578454440359235584/artifacts/e2e-aws-upgrade-rollback/gather-extra/artifacts/pods/openshift-controller-manager-operator_openshift-controller-manager-operator-7fbc8cc67d-s5pwz_openshift-controller-manager-operator_previous.log
[7]: https://github.com/openshift/openshift-docs/blob/7f87267bc69d65abd96e6b783100195c6b78549f/updating/updating-troubleshooting.adoc
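For anyone triaging similar hangs, here is a minimal sketch of pulling the blocking operator out of a progress line like the one quoted above. The regular expression is inferred from that single log line, and the comma-separated handling of several operators is an assumption, not something the CI tooling is known to emit.

```python
import re

# Pattern inferred from the single progress line quoted in this pull request.
PROGRESS_RE = re.compile(
    r"Working towards (?P<version>\S+): "
    r"(?P<done>\d+) of (?P<total>\d+) done "
    r"\((?P<percent>\d+)% complete\)"
    r"(?:, waiting on (?P<waiting>.+))?"
)


def parse_progress(line):
    """Return target version, manifest counts, and blocking operators, or None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    waiting = match.group("waiting")
    return {
        "version": match.group("version"),
        "done": int(match.group("done")),
        "total": int(match.group("total")),
        "percent": int(match.group("percent")),
        # Assumption: several operators would be listed comma-separated.
        "waiting_on": [name.strip() for name in waiting.split(",")] if waiting else [],
    }


line = (
    "INFO: cluster upgrade is Progressing: Working towards 4.10.35: "
    "614 of 773 done (79% complete), waiting on openshift-controller-manager"
)
progress = parse_progress(line)
print(progress["version"], progress["waiting_on"])
```

The reported percentage matches integer division of the counts: 614 * 100 // 773 is 79.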
@wking wking force-pushed the drop-4.10-to-4.11-rollback branch from 27c5498 to 953fa11 Compare October 11, 2022 03:45
@openshift-ci
Contributor

openshift-ci bot commented Oct 11, 2022

@wking: all tests passed!


@sdodson
Member

sdodson commented Oct 11, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 11, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking


@openshift-merge-robot openshift-merge-robot merged commit 856aab2 into openshift:master Oct 11, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 11, 2022

@wking: Updated the following 2 configmaps:

  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master__ci-4.11-upgrade-from-stable-4.10.yaml using file ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10.yaml
  • job-config-master-periodics configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml

@wking wking deleted the drop-4.10-to-4.11-rollback branch October 11, 2022 19:01
wking added a commit to wking/openshift-release that referenced this pull request Jun 1, 2023
The job flavor was originally added in 0837634 (Add
ovn-upgrade-rollback job for 4.7->4.8, 2021-02-24, openshift#16260).  The jobs
have subsequently been cloned forward to new minors as part of the
branching process.  And as older jobs started failing, I'd been
dropping them gradually like 856aab2
(ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10:
Drop failing rollback jobs, 2022-10-11, openshift#33005).  But rounding with
Jamo, the jobs no longer serve a useful role, and as 856aab2 points
out, rollbacks between minor releases are not supported.  Drop the
likely-to-fail and not-useful-even-when-it-passes jobs in their
entirety, so they stop getting cloned forward during branching.

I'm also adjusting the release controller changes from 421c921
(Introducing Rollback informing jobs, 2023-05-19, openshift#39488).  I'm
dropping 4.12 and earlier rollback informers, so we can focus on 4.13
while we feel out the new process.  And I'm pivoting 4.13 away from
the cross-minor job that this pull request drops, and towards the
rollback-oldest-supported job that will help back [1].

[1]: https://issues.redhat.com/browse/OTA-455
openshift-merge-robot pushed a commit that referenced this pull request Jun 7, 2023
…39897)

* ci-operator/config/openshift/release: Drop cross-minor rollback jobs

The job flavor was originally added in 0837634 (Add
ovn-upgrade-rollback job for 4.7->4.8, 2021-02-24, #16260).  The jobs
have subsequently been cloned forward to new minors as part of the
branching process.  And as older jobs started failing, I'd been
dropping them gradually like 856aab2
(ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10:
Drop failing rollback jobs, 2022-10-11, #33005).  But rounding with
Jamo, the jobs no longer serve a useful role, and as 856aab2 points
out, rollbacks between minor releases are not supported.  Drop the
likely-to-fail and not-useful-even-when-it-passes jobs in their
entirety, so they stop getting cloned forward during branching.

I'm also adjusting the release controller changes from 421c921
(Introducing Rollback informing jobs, 2023-05-19, #39488).  I'm
dropping 4.12 and earlier rollback informers, so we can focus on 4.13
while we feel out the new process.  And I'm pivoting 4.13 away from
the cross-minor job that this pull request drops, and towards the
rollback-oldest-supported job that will help back [1].

[1]: https://issues.redhat.com/browse/OTA-455

* hack/validate-release-controller-config: Supplemental Git diff

Because [1]:

  ERROR: The following differences were found:
  3a4
  > 03c544e5d55a55ae9f19d0de7d786341  .//core-services/release-controller/_releases/priv/release-ocp-4.12.json
  35d35
  < 1826a1b520574b66f152f814811c19f6  .//core-services/release-controller/_releases/priv/release-ocp-4.13.json
  42a43
  ...

tells me what files need changing, but not what changes to make to them.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39897/pull-ci-openshift-release-master-release-controller-config/1664331471080394752

---------

Co-authored-by: wking <wking@penguin>
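The checksum-only complaint in the second commit above can be illustrated with a small sketch: a checksum comparison identifies which files drifted, and a unified diff shows what to change in them. This is not the actual hack/validate-release-controller-config script; the helper names and the file contents below are hypothetical.

```python
import difflib
import hashlib


def md5_manifest(files):
    """Map each path to the MD5 hex digest of its content, like md5sum output."""
    return {path: hashlib.md5(text.encode()).hexdigest() for path, text in files.items()}


def explain_differences(expected, actual):
    """Return unified-diff lines for every path whose checksum differs."""
    expected_sums = md5_manifest(expected)
    actual_sums = md5_manifest(actual)
    lines = []
    for path in sorted(set(expected) | set(actual)):
        if expected_sums.get(path) == actual_sums.get(path):
            continue  # Checksums agree, nothing to report for this file.
        lines.extend(
            difflib.unified_diff(
                expected.get(path, "").splitlines(),
                actual.get(path, "").splitlines(),
                fromfile=f"expected/{path}",
                tofile=f"actual/{path}",
                lineterm="",
            )
        )
    return lines


# Hypothetical file contents, for illustration only.
expected = {"release-ocp-4.13.json": '{\n  "name": "4.13",\n  "verify": {}\n}'}
actual = {"release-ocp-4.13.json": '{\n  "name": "4.13",\n  "verify": {"rollback": {}}\n}'}
diff_lines = explain_differences(expected, actual)
print("\n".join(diff_lines))
```

The checksum pass keeps the report short for unchanged files, while the diff gives the reader the concrete lines to edit.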
jtaleric pushed a commit to jtaleric/release that referenced this pull request Jun 9, 2023
wking added a commit to wking/openshift-release that referenced this pull request Oct 4, 2023
…y-4.14-upgrade-from-stable-4.13: Restore cross-minor rollbacks

We'd dropped the last of these in 856aab2
(ci-operator/config/openshift/release/openshift-release-master__ci-4.11-upgrade-from-stable-4.10:
Drop failing rollback jobs, 2022-10-11, openshift#33005) and 5e746a7
(ci-operator/config/openshift/release: Drop cross-minor rollback jobs,
2023-06-07, openshift#39897).  There's now renewed interest in how these sorts
of rollbacks look, so I'm reviving them for recent releases.  I expect
the issues with these rollbacks will at least include issues with the
cluster-version operator losing the ability to write to ClusterVersion
as the older CRD's enum rejects the capabilities added in the new
release:

  openshift/api $ git diff origin/release-4.13..origin/release-4.14 -- config/v1/types_cluster_version.go | grep kubebuilder:validation:Enum
  -// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning
  +// +kubebuilder:validation:Enum=openshift-samples;baremetal;marketplace;Console;Insights;Storage;CSISnapshot;NodeTuning;MachineAPI;Build;DeploymentConfig;ImageRegistry
  -// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;vCurrent
  +// +kubebuilder:validation:Enum=None;v4.11;v4.12;v4.13;v4.14;vCurrent

So a cluster updating from 4.13 to 4.14 will enable (possibly
implicitly) MachineAPI and other newly-labeled-in-4.14 capabilities.
And then when the 4.13 ClusterVersion CRD is pushed during the
rollback, those values become illegal, and the Kubernetes API server
will reject the cluster-version operator's attempts to write
ClusterVersion status with errors complaining about the unrecognised
MachineAPI and other capability strings [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/941/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1671502401497993216/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7fd84b7b99-8b2qk_cluster-version-operator.log | grep 'ClusterVersion.config.openshift.io "version" is invalid' | tail -n1
  I0621 16:45:41.154360       1 cvo.go:601] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: status.capabilities.enabledCapabilities[3]: Unsupported value: "MachineAPI": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning"

[1]: openshift/cluster-version-operator#941 (review)
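The enum mismatch described in the commit message above can be sketched as a simple membership check. The two capability lists are copied from the kubebuilder markers in the quoted diff; the function name is hypothetical, and the sketch assumes the cluster had every 4.14 capability enabled before the rollback. It is not the API server's actual validation code.

```python
# Enum values copied from the kubebuilder:validation:Enum markers quoted above.
CAPABILITIES_4_13 = [
    "openshift-samples",
    "baremetal",
    "marketplace",
    "Console",
    "Insights",
    "Storage",
    "CSISnapshot",
    "NodeTuning",
]
CAPABILITIES_4_14 = CAPABILITIES_4_13 + [
    "MachineAPI",
    "Build",
    "DeploymentConfig",
    "ImageRegistry",
]


def rejected_values(values, allowed):
    """Return the entries of `values` that an enum limited to `allowed` rejects."""
    allowed_set = set(allowed)
    return [value for value in values if value not in allowed_set]


# Assumption: all 4.14 capabilities were enabled (possibly implicitly) by the update.
enabled_after_update = list(CAPABILITIES_4_14)
rejected = rejected_values(enabled_after_update, CAPABILITIES_4_13)
for value in rejected:
    print(f'Unsupported value under the 4.13 CRD: "{value}"')
```

Any one rejected entry is enough for the whole ClusterVersion status write to fail, which is why the log line above names only the first offending capability it hit.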