Revert "ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master" #20875
Conversation
Revert "ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master"
@wking: This pull request references Bugzilla bug 1972948, which is invalid.
Issues in openshift/release go stale after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
Stale issues in openshift/release rot after 15d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.

/lifecycle rotten
Rotten issues in openshift/release close after 15d of inactivity. Reopen the issue by commenting /reopen.

/close
@openshift-bot: Closed this PR.
@wking: This pull request references Bugzilla bug 1972948. The bug has been updated to no longer refer to the pull request using the external bug tracker.
/reopen
@wking: Reopened this PR.
@wking: This pull request references Bugzilla bug 1972948, which is invalid.
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LalatenduMohanty, wking.
Looks like things might have moved ahead and we might not see the same issue that caused us to do #19396. If we see the same issues again, we will drop the test again.
/hold cancel

The alert issue is still open, but it doesn't seem to be coming up in rehearsals. So we're going to start trying A->B->A presubmits again, and if we see alerts we'll complain in the bug ;)
@wking: No Bugzilla bug is referenced in the title of this pull request.
/test build03-dry
@wking: all tests passed!
@wking: Updated the …
…and B->A jobs

We've been using abort-at since way back in 05da8a4 (ci-operator: CVO should perform a rollback test, 2020-01-20, openshift#6780), because it's important to cover both:

* Can the proposed CVO successfully roll out the cluster on update? (where we want the new CVO in the target release)
* Can the proposed CVO successfully validate an update request and launch the requested replacement CVO? (where we want the new CVO in the source release)

However, being different has costs, because not all cluster components are tuned to gracefully handle back-to-back updates, and we have had to make temporary changes like b0b6533 (ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master, 2021-06-17, openshift#19396), before putting our toes back in the water with ad76b3e (Revert "ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master", 2021-08-03, openshift#20875).

While discovering chained-update UX issues is useful, CVO presubmits are probably not the best place to do it; it would be better to have a periodic, so you have coverage even when the CVO is not seeing pull-request activity. A more fundamental issue is that recent origin suites use job classification to look up historical disruption levels to limit regression risk [1]. Those lookups are based on inspections of the cluster-under-test [2], not on the job configuration, and the classification structure only has Release and FromRelease strings [3]. So today, when the suite checks for disruption at the end of an A->B->A abort-at=100 run, it sets {FromRelease: B, Release: A} and complains when the disruption from the full A->B->A transition overshoots the allowed disruption for a single B->A leg.

We could hypothetically address this by plumbing knowledge of the job configuration through the origin suite, so we could compare with expected disruption from other A->B->A runs. But David's not interested in making that sort of change in origin until we can demonstrate it mattering in a periodic that some team has committed to monitor and maintain, and we don't have that today.
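To make the classification mismatch concrete, here is a minimal Go sketch. The JobClassification struct and the disruption-budget numbers are hypothetical stand-ins, loosely modeled on the origin types linked in [1] and [3], not the actual origin API:

```go
package main

import "fmt"

// JobClassification is a hypothetical stand-in for origin's
// classification structure: the historical-disruption lookup only knows
// the source and target releases observed in the cluster-under-test,
// not the job configuration.
type JobClassification struct {
	FromRelease string
	Release     string
}

// allowedDisruptionSeconds is invented data: a per-leg disruption
// budget keyed by the (FromRelease, Release) pair.
var allowedDisruptionSeconds = map[JobClassification]float64{
	{FromRelease: "4.8", Release: "4.9"}: 30, // the A->B leg
	{FromRelease: "4.9", Release: "4.8"}: 30, // the B->A leg
}

func main() {
	// An A->B->A abort-at=100 run finishes on the B->A leg, so the
	// suite classifies the whole run as {FromRelease: B, Release: A}.
	class := JobClassification{FromRelease: "4.9", Release: "4.8"}
	budget := allowedDisruptionSeconds[class]

	// But the measured disruption spans both legs, so a healthy run
	// can still overshoot the single-leg budget.
	observed := 25.0 + 20.0 // e.g. ~25s on the A->B leg plus ~20s on B->A
	if observed > budget {
		fmt.Printf("disruption %.0fs exceeds the %.0fs budget for %+v\n",
			observed, budget, class)
	}
}
```

Any fix along these lines would mean extending that lookup key with job-configuration knowledge, which is exactly the origin change described above as off the table without a maintained periodic.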
Another issue with A->B->A tests is that sometimes we make changes to the CVO that are one-way, like extending the enumerated list of ClusterVersion capabilities [4]. On the B->A leg, the CVO restores the original ClusterVersion CRD with a restricted capability enum, and subsequent attempts to update the ClusterVersion resource fail like:

    I0818 17:41:40.580147       1 cvo.go:544] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[0]: Unsupported value: "Console": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[1]: Unsupported value: "Insights": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[2]: Unsupported value: "Storage": supported values: "openshift-samples", "baremetal", "marketplace"]
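A minimal Go sketch of that failure mode, with a hypothetical allowed-values set standing in for release A's CRD enum (the capability names come from the log above; the validation function is illustrative, not the real CRD machinery):

```go
package main

import "fmt"

// oldSupported stands in for the capability enum in release A's
// ClusterVersion CRD schema; the values added in release B are absent.
var oldSupported = map[string]bool{
	"openshift-samples": true,
	"baremetal":         true,
	"marketplace":       true,
}

// validateEnabledCapabilities loosely mimics the CRD enum check: once
// the B->A leg restores release A's CRD, status written by the
// release-B CVO no longer validates.
func validateEnabledCapabilities(caps []string) error {
	for i, c := range caps {
		if !oldSupported[c] {
			return fmt.Errorf("status.capabilities.enabledCapabilities[%d]: Unsupported value: %q", i, c)
		}
	}
	return nil
}

func main() {
	// Capabilities that release B's CVO enabled before the rollback.
	enabled := []string{"Console", "Insights", "Storage"}
	if err := validateEnabledCapabilities(enabled); err != nil {
		fmt.Println(err) // every subsequent ClusterVersion update fails this way
	}
}
```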
By separating into two presubmits, we should have consistently passing A->B updates, with the rest of the organization helping to keep that job style healthy. And we'll also have B->A updates which look the same to the job classifier, and should be similarly healthy as long as we don't make breaking CVO changes. When we do make breaking CVO changes, we can inspect the results and run:

    /override e2e-agnostic-upgrade-out-of-change

when we feel that the proposed CVO cleanly launched the target, ignoring everything that happened once the target CVO started trying to reconcile the cluster.

[1]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/test/extended/util/disruption/backend_sampler_tester.go#L91-L98
[2]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L43-L117
[3]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L16-L23
[4]: openshift/cluster-version-operator#801
Reverts #19396.
/hold
We don't want to land this until the alert is adjusted to avoid false positives on these chained updates.