
Conversation

@wking wking commented Jun 17, 2021

Avoid tightly-chained updates until etcdHighNumberOfLeaderChanges relaxes a bit (rhbz#1972948).

ci-operator/config/openshift/cluster-version-operator: Temporarily drop abort-at from master

Avoid tightly-chained updates until etcdHighNumberOfLeaderChanges
relaxes a bit [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1972948
@openshift-ci openshift-ci bot requested review from abhinavdahiya and jottofar June 17, 2021 17:59
@openshift-ci openshift-ci bot added the approved label Jun 17, 2021
cluster_profile: azure4
env:
TEST_TYPE: upgrade
TEST_UPGRADE_OPTIONS: abort-at=100

What does abort-at=100 mean?


https://github.com/openshift/origin/blob/4fb407c2e22faa27276287f411829fb84a22545a/cmd/openshift-tests/openshift-tests.go#L330

* abort-at=NUMBER - Set to a number between 0 and 100 to control the percent of operators
		at which to stop the current upgrade and roll back to the current version.
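
For concreteness, here is a sketch of the sort of test stanza this PR touches. Only the cluster_profile and env lines are taken from the snippet above; the test name, workflow, and surrounding structure are illustrative assumptions, not copied from the real ci-operator config:

  - as: e2e-agnostic-upgrade                # assumed presubmit name
    steps:
      cluster_profile: azure4
      env:
        TEST_TYPE: upgrade
        TEST_UPGRADE_OPTIONS: abort-at=100  # this PR drops this line, leaving a plain A->B update
      workflow: openshift-upgrade-azure     # assumed workflow name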

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 25, 2021

openshift-ci bot commented Jun 25, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 242c93b into openshift:master Jun 25, 2021

openshift-ci bot commented Jun 25, 2021

@wking: Updated the ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:

  • key openshift-cluster-version-operator-master.yaml using file ci-operator/config/openshift/cluster-version-operator/openshift-cluster-version-operator-master.yaml

In response to this:

Avoid tightly-chained updates until etcdHighNumberOfLeaderChanges relaxes a bit (rhbz#1972948).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LalatenduMohanty

We will revert this after the corresponding bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1972948#c2

@wking wking deleted the temporarily-drop-cvo-abort-at branch August 4, 2021 04:57
wking added a commit to wking/openshift-release that referenced this pull request Aug 19, 2022
…and B->A jobs

We've been using abort-at since way back in 05da8a4 (ci-operator:
CVO should perform a rollback test, 2020-01-20, openshift#6780), because it's
important to cover both:

* Can the proposed CVO successfully roll out the cluster on update?
  (where we want the new CVO in the target release).
* Can the proposed CVO successfully validate an update request and
  launch the requested replacement CVO?  (where we want the new CVO in
  the source release).

However, being different has costs, because not all cluster components
are tuned to gracefully handle back-to-back updates, and we have had
to make temporary changes like b0b6533
(ci-operator/config/openshift/cluster-version-operator: Temporarily
drop abort-at from master, 2021-06-17, openshift#19396), before putting our
toes back in the water with ad76b3e (Revert
"ci-operator/config/openshift/cluster-version-operator: Temporarily
drop abort-at from master", 2021-08-03, openshift#20875).  While discovering
chained-update UX issues is useful, CVO presubmits are probably not
the best place to do it (it would be better to have a periodic, so you
have coverage even when the CVO is not seeing pull request activity).

And a more fundamental issue is that recent origin suites use job
classification to look up historical disruption levels to limit
regression risk [1].  Those lookups are based on inspections of the
cluster-under-test [2], and not on the job configuration.  And the
classification structure only has Release and FromRelease strings [3].
So today, when the suite checks for disruption at the end of an
A->B->A abort-at=100 run, it sets {FromRelease: B, Release: A}, and
complains when the disruption from the full A->B->A transition
overshoots the allowed disruption for a single B->A leg.
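
To make the mismatch concrete, here is a minimal Go sketch (type, field, and function names are illustrative, not the actual origin code linked in [3]):

  package main

  import "fmt"

  // jobType mirrors the idea in [3]: the classifier only records which
  // release the cluster ends on and which release it came from, as strings.
  type jobType struct {
          Release     string // version the cluster reports at the end of the run
          FromRelease string // version it most recently updated from
  }

  func main() {
          // An abort-at=100 run goes A -> B -> A, but inspecting the cluster at
          // the end only reveals the final leg, so the run is classified as:
          classified := jobType{Release: "A", FromRelease: "B"}

          // The disruption check then holds the disruption accumulated across
          // the whole A->B->A transition to the allowance recorded for a single
          // B->A update, and overshoots it.
          fmt.Printf("classified as %s->%s, though disruption spans A->B->A\n",
                  classified.FromRelease, classified.Release)
  }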

We could hypothetically address this by plumbing knowledge of the job
configuration through the origin suite, so we could compare with
expected disruption from other A->B->A runs.  But David's not
interested in making that sort of change in origin until we can
demonstrate it mattering in a periodic that some team has committed to
monitor and maintain, and we don't have that today.

Another issue with A->B->A tests is that sometimes we make changes to
the CVO that are one way, like extending the enumerated list of
ClusterVersion capabilities [4].  On the B->A leg, the CVO restores
the original ClusterVersion CRD with a restricted capability enum, and
subsequent attempts to update the ClusterVersion resource fail like:

  I0818 17:41:40.580147       1 cvo.go:544] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[0]: Unsupported value: "Console": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[1]: Unsupported value: "Insights": supported values: "openshift-samples", "baremetal", "marketplace", status.capabilities.enabledCapabilities[2]: Unsupported value: "Storage": supported values: "openshift-samples", "baremetal", "marketplace"]
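
The failure mode traces back to the CRD schema itself. Below is a heavily abbreviated, illustrative sketch (not the real manifest) of the older release's validation, assuming its enum carries only the three capabilities named in the log above:

  # After the B->A leg restores this older schema, a status written by the
  # newer CVO (which also enables Console, Insights, Storage, ...) no longer
  # passes validation, producing errors like the one above.
  openAPIV3Schema:
    properties:
      status:
        properties:
          capabilities:
            properties:
              enabledCapabilities:
                type: array
                items:
                  type: string
                  enum:   # the older release only knows these values
                  - openshift-samples
                  - baremetal
                  - marketplace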

By separating into two presubmits, we should have consistently passing
A->B updates, with the rest of the organization helping to keep that
job style healthy.  And we'll also have B->A updates which look the
same to the job classifier, and should be similarly healthy as long as
we don't make breaking CVO changes.  When we do make breaking CVO
changes, we can inspect the results, and:

  /override e2e-agnostic-upgrade-out-of-change

when we feel that the proposed CVO cleanly launched the target, and
ignore everything that happened once the target CVO started trying to
reconcile the cluster.

[1]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/test/extended/util/disruption/backend_sampler_tester.go#L91-L98
[2]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L43-L117
[3]: https://github.com/openshift/origin/blob/c33bf438a00bbd66227186f01c7e6a5c36741492/pkg/synthetictests/platformidentification/types.go#L16-L23
[4]: openshift/cluster-version-operator#801
openshift-merge-robot pushed a commit that referenced this pull request Aug 22, 2022
…and B->A jobs (#31518)
