NO-ISSUE: ci-operator/config/openshift/api: Add dev-branch e2e-upgrade-out-of-change #62110

wking · 2025-02-26T19:51:29Z

The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we /override ... to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's agnostic, because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.

openshift-ci-robot · 2025-02-26T19:51:33Z

@wking: This pull request explicitly references no jira issue.

Details

In response to this:

The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we /override ... to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's agnostic, because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

wking · 2025-02-27T00:13:20Z

/pj-rehearse

openshift-ci-robot · 2025-02-27T00:13:22Z

@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

JoelSpeed

We have two types of upgrade that we run for API PRs already, we do both z-stream and y-stream upgrades.

This is currently adding a z-stream upgrade (or downgrade?) test, do you think there is value in adding/swapping this for a y-stream version?

I assume the nightlies and CVO are only doing this on z-stream presently?

JoelSpeed · 2025-02-27T10:27:14Z

ci-operator/config/openshift/api/openshift-api-master.yaml

+    dependencies:
+      OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE: release:latest
+      OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE: release:initial
+    workflow: openshift-upgrade-azure


The e2e-upgrade above uses AWS, can we be consistent with that please?

it seemed like it might be worth spreading exposure around, just in case there was a change that broke one cloud. But the benefit of the platform-agnostic job name is that its really easy to move from cloud to cloud for any reason, so sure, pushed a pivot to AWS with bf1ffcb -> be62d94.

JoelSpeed · 2025-02-27T10:30:52Z

Looking at a recent nightly, I can see jobs that are upgrade-out-of-change and a bunch of plain upgrade jobs, but I don't see any upgrade-into-change, so I think this PR as is would put us inline with what's going on there, so for now I'm ok with leaving the existing upgrade job named as it is

wking · 2025-02-27T22:52:22Z

This is currently adding a z-stream upgrade (or downgrade?) test, do you think there is value in adding/swapping this for a y-stream version?

OCPSTRAT-975 covers what we do for rollbacks, and we are only trying within a z-stream, because so many things change during minor-version updates (new controllers getting added, etc.), and we don't want to spend the time/money on teaching 4.y how to clean up all the new-in-4.(y+1) stuff. And we have logic in the CVO to reject attempts to roll back from 4.(y+1) to 4.y, so I don't think we need to test those.

…hange The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (ci-operator/config/openshift/cluster-version-operator: Separate A->B and B->A jobs, 2022-08-22, openshift#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we '/override ...' to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks. Nightlies grew similar gates in e4b3a30 (trt-923: test out of change upgrades, 2023-10-16, openshift#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging. I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want. I'm not including the CVO's "agnostic", because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.

openshift-ci-robot · 2025-02-27T23:07:03Z

[REHEARSALNOTIFIER]
@wking: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-openshift-api-master-e2e-upgrade-out-of-change	openshift/api	presubmit	Presubmit changed

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

JoelSpeed · 2025-03-03T10:12:04Z

/lgtm
/approve
/pj-rehearse

openshift-ci-robot · 2025-03-03T10:12:06Z

@JoelSpeed: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

JoelSpeed · 2025-03-03T10:12:44Z

It would probably be useful to also have a techpreview version of this, to match out techpreview upgrade job, happy to follow up with a separate PR for that if we can get this one merged sooner though

openshift-ci · 2025-03-03T10:13:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/config/openshift/api/OWNERS~~ [JoelSpeed]
~~ci-operator/jobs/openshift/api/OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2025-03-03T20:30:19Z

rehearsal:

: [sig-arch] events should not repeat pathologically for ns/openshift-monitoring expand_less	0s
{  1 events happened too frequently

event happened 21 times, something is wrong: namespace/openshift-monitoring node/ip-10-0-64-133.us-west-2.compute.internal pod/prometheus-k8s-1 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:46:03Z) result=reject }

But it was a B->A update, as we're aiming for:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/62110/rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change/1896504896492933120/artifacts/e2e-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2025-03-03T11:57:47Z 2025-03-03T13:01:01Z Completed 4.19.0-0.ci.test-2025-03-03-102330-ci-op-f0yf5dxv-initial
2025-03-03T11:17:22Z 2025-03-03T11:48:27Z Completed 4.19.0-0.ci.test-2025-03-03-110154-ci-op-f0yf5dxv-latest

And that failure mode turns up in other jobs, so even more evidence that it's not getting introduced by this change. And the rate is low, so we're unlikely to block API merges by landing a perma-failing required job.

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=cannot+register+an+exec+PID' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade (all) - 63 runs, 16% failed, 10% of failures match = 2% impact
rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-2225-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade (all) - 10 runs, 10% failed, 100% of failures match = 10% impact

/pj-rehearse ack

It would probably be useful to also have a techpreview version of this, to match out techpreview upgrade job, happy to follow up with a separate PR for that if we can get this one merged sooner though

My GitHub feedback loop is slow, so yeah, let's go with a follow-up pull for a TechPreview B->A job, if folks think it's worth covering that pre-merge too.

openshift-ci-robot · 2025-03-03T20:30:21Z

@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

JoelSpeed · 2025-03-04T10:29:26Z

/test config

openshift-ci · 2025-03-04T10:38:36Z

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/rehearse/openshift/api/master/e2e-upgrade-out-of-change	`be62d94`	link	unknown	`/pj-rehearse pull-ci-openshift-api-master-e2e-upgrade-out-of-change`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2025-03-04T11:35:07Z

/retest-required

Remaining retests: 0 against base HEAD 40cd648 and 2 for PR HEAD be62d94 in total

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025

openshift-ci bot requested review from knobunc and sjenning February 26, 2025 19:54

openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Feb 26, 2025

wking force-pushed the api-update-out-of-change branch from ae09bc5 to e490d5c Compare February 26, 2025 20:18

openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Feb 26, 2025

wking force-pushed the api-update-out-of-change branch from e490d5c to bf1ffcb Compare February 26, 2025 20:38

JoelSpeed reviewed Feb 27, 2025

View reviewed changes

wking force-pushed the api-update-out-of-change branch from bf1ffcb to 780babb Compare February 27, 2025 22:56

wking force-pushed the api-update-out-of-change branch from 780babb to be62d94 Compare February 27, 2025 23:04

openshift-ci bot assigned JoelSpeed Mar 3, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2025

openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Mar 3, 2025

openshift-merge-bot bot merged commit 2c7badd into openshift:master Mar 4, 2025
15 of 16 checks passed

NO-ISSUE: ci-operator/config/openshift/api: Add dev-branch e2e-upgrade-out-of-change #62110

NO-ISSUE: ci-operator/config/openshift/api: Add dev-branch e2e-upgrade-out-of-change #62110

Uh oh!

Conversation

wking commented Feb 26, 2025

Uh oh!

openshift-ci-robot commented Feb 26, 2025

Uh oh!

wking commented Feb 27, 2025

Uh oh!

openshift-ci-robot commented Feb 27, 2025

Uh oh!

JoelSpeed left a comment

Choose a reason for hiding this comment

Uh oh!

JoelSpeed Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

wking Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JoelSpeed commented Feb 27, 2025

Uh oh!

wking commented Feb 27, 2025

Uh oh!

openshift-ci-robot commented Feb 27, 2025

Uh oh!

JoelSpeed commented Mar 3, 2025

Uh oh!

openshift-ci-robot commented Mar 3, 2025

Uh oh!

JoelSpeed commented Mar 3, 2025

Uh oh!

openshift-ci bot commented Mar 3, 2025

Uh oh!

wking commented Mar 3, 2025

Uh oh!

openshift-ci-robot commented Mar 3, 2025

Uh oh!

JoelSpeed commented Mar 4, 2025

Uh oh!

openshift-ci bot commented Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Mar 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wking Feb 27, 2025 •

edited

Loading

openshift-ci bot commented Mar 4, 2025 •

edited

Loading