Conversation

@wking wking commented Aug 19, 2022

We've been using abort-at since way back in 05da8a4 (#6780), because it's important to cover both:

  • Can the proposed CVO successfully roll out the cluster on update? (where we want the new CVO in the target release).
  • Can the proposed CVO successfully validate an update request and launch the requested replacement CVO? (where we want the new CVO in the source release).

However, being different from the other update presubmits has costs, because not all cluster components are tuned to gracefully handle back-to-back updates, and we have had to make temporary changes like b0b6533 (#19396) before putting our toes back in the water with ad76b3e (#20875). While discovering chained-update UX issues is useful, CVO presubmits are probably not the best place to do it (a periodic would be better, so there is coverage even when the CVO is not seeing pull-request activity).

And a more fundamental issue is that recent origin suites use job classification to look up historical disruption levels to limit regression risk. Those lookups are based on inspections of the cluster-under-test, and not on the job configuration. And the classification structure only has Release and FromRelease strings. So today, when the suite checks for disruption at the end of an A->B->A abort-at=100 run, it sets {FromRelease: B, Release: A}, and complains when the disruption from the full A->B->A transition overshoots the allowed disruption for a single B->A leg.
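
For illustration, here is a minimal Go sketch of why that lookup under-counts: the classification key only carries the final hop, so a two-hop run is judged against a one-hop allowance. The type, field names, and numbers below are assumptions for illustration, not code lifted from origin.

  package main

  import "fmt"

  // jobType mirrors the rough shape of the classification key: only the
  // final hop is recorded, never the full update path.
  type jobType struct {
      Release     string // version the cluster ends on
      FromRelease string // version it was on immediately before
  }

  // Hypothetical per-hop historical disruption allowances, in seconds.
  var allowedSeconds = map[jobType]float64{
      {FromRelease: "A", Release: "B"}: 30,
      {FromRelease: "B", Release: "A"}: 30,
  }

  func main() {
      // An A->B->A abort-at=100 run is keyed by its final hop only, so the
      // disruption accumulated across both hops is compared against a single
      // B->A allowance, and tends to overshoot it.
      key := jobType{FromRelease: "B", Release: "A"}
      observed := 45.0 // total disruption from both hops
      if observed > allowedSeconds[key] {
          fmt.Printf("%.0fs observed vs %.0fs allowed for %+v\n",
              observed, allowedSeconds[key], key)
      }
  }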

We could hypothetically address this by plumbing knowledge of the job configuration through the origin suite, so we could compare with expected disruption from other A->B->A runs. But @deads2k's not interested in making that sort of change in origin until we can demonstrate it mattering in a periodic that some team has committed to monitor and maintain, and we don't have that today.

Another issue with A->B->A tests is that sometimes we make changes to the CVO that are one-way, like extending the enumerated list of ClusterVersion capabilities. On the B->A leg, the CVO restores the original ClusterVersion CRD with a restricted capability enum, and subsequent attempts to update the ClusterVersion resource fail like:

I0818 17:41:40.580147       1 cvo.go:544] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[0]: Unsupported value: "Console": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[1]: Unsupported value: "Insights": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[2]: Unsupported value: "Storage": supported values: "openshift-samples", "baremetal", "marketplace"]
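
As a minimal sketch of the failure mode (illustrative only; the real enforcement happens in the apiserver against the ClusterVersion CRD's OpenAPI enum, not in CVO Go code like this), the older release's schema only enumerates the capabilities it knows about, so status values introduced by the newer release are rejected:

  package main

  import "fmt"

  // Capabilities enumerated by the older (release A) ClusterVersion CRD,
  // taken from the error message above.
  var oldEnum = map[string]bool{
      "openshift-samples": true,
      "baremetal":         true,
      "marketplace":       true,
  }

  func validateEnabledCapabilities(caps []string) error {
      for i, c := range caps {
          if !oldEnum[c] {
              return fmt.Errorf("status.capabilities.enabledCapabilities[%d]: Unsupported value: %q", i, c)
          }
      }
      return nil
  }

  func main() {
      // Release B enabled capabilities that release A's CRD never enumerated,
      // so every ClusterVersion status update on the B->A leg is rejected.
      fmt.Println(validateEnabledCapabilities([]string{"Console", "Insights", "Storage"}))
  }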

By separating into two presubmits, we should have consistently passing A->B updates, with the rest of the organization helping to keep that job style healthy. And we'll also have B->A updates which look the same to the job classifier, and should be similarly healthy as long as we don't make breaking CVO changes. When we do make breaking CVO changes, we can inspect the results, and:

/override e2e-agnostic-upgrade-out-of-change

when we feel that the proposed CVO cleanly launched the target, and ignore everything that happened once the target CVO started trying to reconcile the cluster.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2022
@wking wking force-pushed the decouple-cluster-version-rollback-presubmits branch from 8c48291 to de846e6 Compare August 19, 2022 19:32
…and B->A jobs

We've been using abort-at since way back in 05da8a4 (ci-operator:
CVO should perform a rollback test, 2020-01-20, openshift#6780), because it's
important to cover both:

* Can the proposed CVO successfully roll out the cluster on update?
  (where we want the new CVO in the target release).
* Can the proposed CVO successfully validate an update request and
  launch the requested replacement CVO?  (where we want the new CVO in
  the source release).

However, being different has costs, because not all cluster components
are tuned to gracefully handle back-to-back updates, and we have had
to make temporary changes like b0b6533
(ci-operator/config/openshift/cluster-version-operator: Temporarily
drop abort-at from master, 2021-06-17, openshift#19396), before putting our
toes back in the water with ad76b3e (Revert
"ci-operator/config/openshift/cluster-version-operator: Temporarily
drop abort-at from master", 2021-08-03, openshift#20875).  While discovering
chained-update UX issues is useful, CVO presubmits are probably not
the best place to do it (it would be better to have a periodic, so you
have coverage even when the CVO is not seeing pull request activity).

And a more fundamental issue is that recent origin suites use job
classification to look up historical disruption levels to limit
regression risk [1].  Those lookups are based on inspections of the
cluster-under-test [2], and not on the job configuration.  And the
classification structure only has Release and FromRelease strings [3].
So today, when the suite checks for disruption at the end of an
A->B->A abort-at=100 run, it sets {FromRelease: B, Release: A}, and
complains when the disruption from the full A->B->A transition
overshoots the allowed disruption for a single B->A leg.

We could hypothetically address this by plumbing knowledge of the job
configuration through the origin suite, so we could compare with
expected disruption from other A->B->A runs.  But David's not
interested in making that sort of change in origin until we can
demonstrate it mattering in a periodic that some team has committed to
monitor and maintain, and we don't have that today.

Another issue with A->B->A tests is that sometimes we make changes to
the CVO that are one way, like extending the enumerated list of
ClusterVersion capabilities [4].  On the B->A leg, the CVO restores
the original ClusterVersion CRD with a restricted capability enum, and
subsequent attempts to update the ClusterVersion resource fail like:

  I0818 17:41:40.580147       1 cvo.go:544] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[0]: Unsupported value: "Console": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[1]: Unsupported value: "Insights": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[2]: Unsupported value: "Storage": supported values: "openshift-samples", "baremetal", "marketplace"]

By separating into two presubmits, we should have consistently passing
A->B updates, with the rest of the organization helping to keep that
job style healthy.  And we'll also have B->A updates which look the
same to the job classifier, and should be similarly healthy as long as
we don't make breaking CVO changes.  When we do make breaking CVO
changes, we can inspect the results, and:

  /override e2e-agnostic-upgrade-out-of-change

when we feel that the proposed CVO cleanly launched the target, and
ignore everything that happened once the target CVO started trying to
reconcile the cluster.

[1]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/test/extended/util/disruption/backend_sampler_tester.go#L91-L98
[2]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L43-L117
[3]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L16-L23
[4]: openshift/cluster-version-operator#801
@wking wking force-pushed the decouple-cluster-version-rollback-presubmits branch from de846e6 to cacdf48 Compare August 19, 2022 20:32
openshift-ci bot commented Aug 19, 2022

@wking: all tests passed!


wking commented Aug 20, 2022

into-change rehearsal:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/31518/rehearse-31518-pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-into-change/1560727729605709824/artifacts/e2e-agnostic-upgrade-into-change/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2022-08-19T21:47:49Z 2022-08-19T22:58:37Z Completed 4.12.0-0.ci.test-2022-08-19-204946-ci-op-i2fvq1vm-latest
2022-08-19T21:08:08Z 2022-08-19T21:42:58Z Completed 4.12.0-0.ci.test-2022-08-19-203953-ci-op-i2fvq1vm-initial

out-of-change rehearsal:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/31518/rehearse-31518-pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-out-of-change/1560727730058694656/artifacts/e2e-agnostic-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2022-08-19T21:32:00Z 2022-08-19T22:34:28Z Completed 4.12.0-0.ci.test-2022-08-19-204925-ci-op-yz7jisfr-latest
2022-08-19T20:59:47Z 2022-08-19T21:25:47Z Completed 4.12.0-0.ci.test-2022-08-19-203954-ci-op-yz7jisfr-initial

Kind of hard to tell without an actual change in play. But with both jobs opening with initial and moving towards latest, it sounds like we probably have A->B covered. I don't understand why the e2e-agnostic-upgrade-out-of-change definition isn't going from latest to initial, but it might actually end up being the B->A testing I want it to be once it runs on actual CVO pulls.

Also, reverts are cheap if we land this, see it in action, and decide we got it wrong ;).

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 22, 2022
openshift-ci bot commented Aug 22, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

@openshift-merge-robot openshift-merge-robot merged commit acd81b7 into openshift:master Aug 22, 2022
openshift-ci bot commented Aug 22, 2022

@wking: Updated the following 2 configmaps:

  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml
  • job-config-master-presubmits configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-version-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/cluster-version-operator/openshift-cluster-version-operator-master-presubmits.yaml

@wking wking deleted the decouple-cluster-version-rollback-presubmits branch August 22, 2022 21:41
wking added a commit to wking/openshift-release that referenced this pull request Aug 23, 2022
openshift-merge-robot pushed a commit that referenced this pull request Aug 23, 2022
…ack config (#31580)

My initial approach from acd81b7
(ci-operator/config/openshift/cluster-version-operator: Separate A->B
and B->A jobs (#31518), 2022-08-22) doesn't seem to have worked.
Follow up by pattern-matching from ebbee6d (Add AWS and Azure
upgrade jobs and AWS selfupgrade for o/k, 2020-10-09, #12618) from
openshift/kubernetes, which was renamed to e2e-{platform}-downgrade in
6841f1c (All 4.7+ yearly should be given an interval, 2021-03-24,
#17131), but is otherwise unchanged.
wking added a commit to wking/openshift-release that referenced this pull request Feb 26, 2025
wking added a commit to wking/openshift-release that referenced this pull request Feb 26, 2025
wking added a commit to wking/openshift-release that referenced this pull request Feb 27, 2025
wking added a commit to wking/openshift-release that referenced this pull request Feb 27, 2025
openshift-merge-bot bot pushed a commit that referenced this pull request Mar 4, 2025
…hange (#62110)

The CVO has had e2e-agnostic-upgrade-into-change and
e2e-agnostic-upgrade-out-of-change since acd81b7
(ci-operator/config/openshift/cluster-version-operator: Separate A->B
and B->A jobs, 2022-08-22, #31518), as part of catching changes that
would break rollbacks or roll-forwards pre-merge.  Sometimes we accept
that a change will break rollbacks, and in that case we '/override
...' to ignore the failure.  But that way we are making an explicit
decision, and not getting surprised by accidentally landing something
that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (trt-923: test out of
change upgrades, 2023-10-16, #43734), so even without this commit,
we'd hear about things that break rollback post-merge.  But reverting
post-merge can be tedious, and with so much weight going through the
API repo (feature gate promotion, CustomResourceDefinition changes,
etc.), having a pre-merge guard seems like it will help.  It will run
in parallel with the existing update-into-change job, so it shouldn't
increase overall latency.  It will cost some to run, but that seems
worth the cost if it catches issues once every quarter or year or so,
if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change,
because Prow gets confused by renaming required jobs if they're
required for merge, and the final run on an open pull request failed
before the job was renamed.  There's no way to retest the old-name
job, and Prow blocks on it until a root approver /override's the old
name.  Renaming to match the pattern used in the CVO might be
worthwhile for consistency/uniformity at some point, but that can
happen separately in follow-up work if the API approvers want.

I'm not including the CVO's "agnostic", because while I like
explicitly saying "the repo maintainers don't care what cloud this
runs on", the existing API e2e-upgrade job doesn't, and matching that
local-repo pattern seems more useful for avoiding confusion than
matching the CVO pattern.