ci-operator: CVO should perform a rollback test #6780

smarterclayton · 2020-01-20T22:16:25Z

The CVO is a critical component and upgrading *from* the new code is as important as upgrading *to* the new code. Change the job definition to always require a rollforward and rollback.

wking · 2020-01-20T23:01:09Z

/lgtm
/hold

Would be nice to spot check one of the rehearsals to make sure this is working as intended; then we can pull the hold.

openshift-ci-robot · 2020-01-20T23:02:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/cluster-version-operator/OWNERS~~ [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2020-01-21T03:25:07Z

/test pj-rehearse

openshift-ci-robot · 2020-01-21T06:18:06Z

@smarterclayton: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/cluster-version-operator/release-4.4/e2e-aws-upgrade | `05da8a4` | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-version-operator/master/e2e-aws-upgrade | `05da8a4` | link | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-version-operator/release-4.5/e2e-aws-upgrade | `05da8a4` | link | `/test pj-rehearse` |
| ci/prow/pj-rehearse | `05da8a4` | link | `/test pj-rehearse` |

vrutkovs · 2020-01-21T09:54:43Z

Another crash: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/6780/rehearse-6780-pull-ci-openshift-cluster-version-operator-release-4.4-e2e-aws-upgrade/2/artifacts/e2e-aws-upgrade/pods/openshift-cluster-version_cluster-version-operator-c65dc767f-zc429_cluster-version-operator.log

Let's land this, revert the verify change, and rework openshift/cluster-version-operator#279?

smarterclayton · 2020-01-21T15:07:17Z

openshift/cluster-version-operator#306 fixes that panic

openshift-ci-robot · 2020-01-21T15:15:16Z

@smarterclayton: Updated the following 13 configmaps:

ci-operator-4.1-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.1.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.1.yaml
ci-operator-4.3-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.3.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.3.yaml
ci-operator-4.3-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.3.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.3.yaml
ci-operator-4.4-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.4.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.4.yaml
ci-operator-4.5-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.5.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.5.yaml
ci-operator-master-configs configmap in namespace ci at cluster ci/api-build01-ci-devcluster-openshift-com:6443 using the following files:
- key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml
ci-operator-master-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml
ci-operator-4.2-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.2.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.2.yaml
ci-operator-4.2-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.2.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.2.yaml
ci-operator-4.4-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.4.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.4.yaml
ci-operator-4.5-configs configmap in namespace ci at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.5.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.5.yaml
ci-operator-master-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml
ci-operator-4.1-configs configmap in namespace ci-stg at cluster default using the following files:
- key openshift-cluster-version-operator-release-4.1.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-release-4.1.yaml

…and B->A jobs

We've been using abort-at since way back in 05da8a4 (ci-operator: CVO should perform a rollback test, 2020-01-20, openshift#6780), because it's important to cover both:

* Can the proposed CVO successfully roll out the cluster on update? (where we want the new CVO in the target release).
* Can the proposed CVO successfully validate an update request and launch the requested replacement CVO? (where we want the new CVO in the source release).

However, being different has costs, because not all cluster components are tuned to gracefully handle back-to-back updates, and we have had to make temporary changes like b0b6533 (ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master, 2021-06-17, openshift#19396), before putting our toes back in the water with ad76b3e (Revert "ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master", 2021-08-03, openshift#20875).

While discovering chained-update UX issues is useful, CVO presubmits are probably not the best place to do it (it would be better to have a periodic, so you have coverage even when the CVO is not seeing pull request activity).

And a more fundamental issue is that recent origin suites use job classification to look up historical disruption levels to limit regression risk [1]. Those lookups are based on inspections of the cluster-under-test [2], and not on the job configuration. And the classification structure only has Release and FromRelease strings [3]. So today, when the suite checks for disruption at the end of an A->B->A abort-at=100 run, it sets {FromRelease: B, Release: A}, and complains when the disruption from the full A->B->A transition overshoots the allowed disruption for a single B->A leg. We could hypothetically address this by plumbing knowledge of the job configuration through the origin suite, so we could compare with expected disruption from other A->B->A runs. But David's not interested in making that sort of change in origin until we can demonstrate it mattering in a periodic that some team has committed to monitor and maintain, and we don't have that today.

Another issue with A->B->A tests is that sometimes we make changes to the CVO that are one way, like extending the enumerated list of ClusterVersion capabilities [4]. On the B->A leg, the CVO restores the original ClusterVersion CRD with a restricted capability enum, and subsequent attempts to update the ClusterVersion resource fail like:

    I0818 17:41:40.580147 1 cvo.go:544] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[0]: Unsupported value: "Console": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[1]: Unsupported value: "Insights": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[2]: Unsupported value: "Storage": supported values: "openshift-samples", "baremetal", "marketplace"]

By separating into two presubmits, we should have consistently passing A->B updates, with the rest of the organization helping to keep that job style healthy. And we'll also have B->A updates which look the same to the job classifier, and should be similarly healthy as long as we don't make breaking CVO changes. When we do make breaking CVO changes, we can inspect the results, and:

    /override e2e-agnostic-upgrade-out-of-change

when we feel that the proposed CVO cleanly launched the target, and ignore everything that happened once the target CVO started trying to reconcile the cluster.

[1]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/test/extended/util/disruption/backend_sampler_tester.go#L91-L98
[2]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L43-L117
[3]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L16-L23
[4]: openshift/cluster-version-operator#801
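The classification mismatch described in that commit message can be illustrated with a small sketch. This is not the origin suite's code: the `JobType` class, the `classify` and `check_disruption` helpers, and the 30-second allowances are all hypothetical, and only the two-field `FromRelease`/`Release` key comes from the text above. The sketch assumes the classifier sees only the last two versions in the cluster's history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobType:
    # Only these two fields come from the commit message; the rest is illustrative.
    from_release: str
    release: str


# Hypothetical historical allowances, in seconds of disruption, per job type.
ALLOWED_SECONDS = {
    JobType("A", "B"): 30.0,
    JobType("B", "A"): 30.0,
}


def classify(version_history):
    """Derive the job type from the cluster's last two versions only."""
    return JobType(from_release=version_history[-2], release=version_history[-1])


def check_disruption(version_history, observed_seconds):
    """Return the inferred job type and whether the observed disruption is allowed."""
    job_type = classify(version_history)
    return job_type, observed_seconds <= ALLOWED_SECONDS[job_type]


# A single B->A leg with 25s of disruption fits the B->A allowance.
single_leg = check_disruption(["B", "A"], 25.0)

# An A->B->A run accumulates disruption from both legs (25s + 25s), but it is
# still classified as B->A, so it is judged against the single-leg allowance.
round_trip = check_disruption(["A", "B", "A"], 50.0)
```

Under these assumed numbers, both runs are keyed as `{FromRelease: B, Release: A}`, yet only the round trip overshoots, which is the failure mode that motivated splitting the presubmit into separate A->B and B->A jobs.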

ci-operator: CVO should perform a rollback test

05da8a4

openshift-ci-robot added the size/S and approved labels Jan 20, 2020

openshift-ci-robot requested review from abhinavdahiya and crawford January 20, 2020 22:17

openshift-ci-robot assigned wking Jan 20, 2020

openshift-ci-robot added the do-not-merge/hold and lgtm labels Jan 20, 2020

smarterclayton removed the do-not-merge/hold label Jan 21, 2020

openshift-merge-robot merged commit 6ab59c7 into openshift:master Jan 21, 2020

wking mentioned this pull request Aug 19, 2022

ci-operator/config/openshift/cluster-version-operator: Separate A->B and B->A jobs #31518

Merged
