NO-ISSUE: ci-operator/config/openshift/api: Add dev-branch e2e-upgrade-out-of-change #62110
@wking: This pull request explicitly references no jira issue.
Force-pushed from ae09bc5 to e490d5c
Force-pushed from e490d5c to bf1ffcb
/pj-rehearse
@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
JoelSpeed
left a comment
We already run two types of upgrade for API PRs: both z-stream and y-stream upgrades.
This currently adds a z-stream upgrade (or downgrade?) test. Do you think there is value in adding a y-stream version, or swapping this for one?
I assume the nightlies and CVO are only doing this on z-stream presently?
    dependencies:
      OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE: release:latest
      OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE: release:initial
    workflow: openshift-upgrade-azure
The e2e-upgrade above uses AWS, can we be consistent with that please?
|
Looking at a recent nightly, I can see jobs that are |
OCPSTRAT-975 covers what we do for rollbacks, and we are only trying within a z-stream, because so many things change during minor-version updates (new controllers getting added, etc.), and we don't want to spend the time/money on teaching 4.y how to clean up all the new-in-4.(y+1) stuff. And we have logic in the CVO to reject attempts to roll back from 4.(y+1) to 4.y, so I don't think we need to test those. |
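To make the z-stream restriction concrete, here is a minimal Python sketch of the kind of check described above: a rollback is only considered when the current and target versions share the same 4.y minor. This is an illustration only, not the CVO's actual implementation; the function names `parse_minor` and `is_within_z_stream` are hypothetical.

```python
import re


def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '4.19.0-0.ci...'."""
    match = re.match(r"(\d+)\.(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_within_z_stream(current: str, target: str) -> bool:
    """True when both versions share the same major.minor (hypothetical helper)."""
    return parse_minor(current) == parse_minor(target)


# A rollback between two 4.19 builds stays within the z-stream.
print(is_within_z_stream("4.19.3", "4.19.1"))
# A rollback from 4.(y+1) to 4.y crosses a minor version and would be rejected.
print(is_within_z_stream("4.19.3", "4.18.7"))
```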
Force-pushed from bf1ffcb to 780babb
…hange

The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (ci-operator/config/openshift/cluster-version-operator: Separate A->B and B->A jobs, 2022-08-22, openshift#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we '/override ...' to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (trt-923: test out of change upgrades, 2023-10-16, openshift#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's "agnostic", because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.
Force-pushed from 780babb to be62d94
|
/lgtm |
|
@JoelSpeed: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
It would probably be useful to also have a techpreview version of this, to match our techpreview upgrade job. Happy to follow up with a separate PR for that if we can get this one merged sooner, though.
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: JoelSpeed, wking
But it was a B->A update, as we're aiming for:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/62110/rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change/1896504896492933120/artifacts/e2e-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2025-03-03T11:57:47Z 2025-03-03T13:01:01Z Completed 4.19.0-0.ci.test-2025-03-03-102330-ci-op-f0yf5dxv-initial
2025-03-03T11:17:22Z 2025-03-03T11:48:27Z Completed 4.19.0-0.ci.test-2025-03-03-110154-ci-op-f0yf5dxv-latest

And that failure mode turns up in other jobs, so even more evidence that it's not getting introduced by this change. And the rate is low, so we're unlikely to block API merges by landing a perma-failing required job.

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=cannot+register+an+exec+PID' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade (all) - 63 runs, 16% failed, 10% of failures match = 2% impact
rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-2225-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade (all) - 10 runs, 10% failed, 100% of failures match = 10% impact

/pj-rehearse ack
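The "impact" figures in the search output above appear to be the failure rate multiplied by the share of failures that match. A minimal Python sketch of that arithmetic follows. It assumes impact is the product of the two displayed percentages; the search tool may well compute it from raw run counts instead, and the helper names here are hypothetical.

```python
import re

# Matches the summary portion of a search-result line as shown above.
SUMMARY = re.compile(
    r"(?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact"
)


def impact_percent(failed_pct: float, match_pct: float) -> float:
    """Share of all runs hit by the matching failure, as a percentage."""
    return failed_pct * match_pct / 100


def parse_summary(line: str) -> dict:
    """Extract the integer fields from one result line (hypothetical helper)."""
    found = SUMMARY.search(line)
    if found is None:
        raise ValueError(f"no summary found in: {line!r}")
    return {key: int(value) for key, value in found.groupdict().items()}


line = (
    "periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-"
    "e2e-aws-ovn-upgrade (all) - 63 runs, 16% failed, 10% of failures match = 2% impact"
)
fields = parse_summary(line)
# 16% of runs failed and 10% of those match, so about 1.6% of runs are affected.
print(impact_percent(fields["failed"], fields["match"]))
```

Under that assumption, the periodic job's 1.6% rounds to the reported 2%, which is the basis for the low-rate claim. The 100% figure for the rehearsal comes from a single run.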
My GitHub feedback loop is slow, so yeah, let's go with a follow-up pull for a TechPreview B->A job, if folks think it's worth covering that pre-merge too. |
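The B->A check done by eye on the clusterversion history above can also be sketched in Python. This is a minimal illustration using the two history entries quoted in that output. It assumes the "-latest" build carries the change and "-initial" is the base, as the job's release overrides suggest; `update_path` and `is_out_of_change` are hypothetical names.

```python
from datetime import datetime

# The two history entries quoted from clusterversion.json above.
HISTORY = [
    {
        "startedTime": "2025-03-03T11:57:47Z",
        "completionTime": "2025-03-03T13:01:01Z",
        "state": "Completed",
        "version": "4.19.0-0.ci.test-2025-03-03-102330-ci-op-f0yf5dxv-initial",
    },
    {
        "startedTime": "2025-03-03T11:17:22Z",
        "completionTime": "2025-03-03T11:48:27Z",
        "state": "Completed",
        "version": "4.19.0-0.ci.test-2025-03-03-110154-ci-op-f0yf5dxv-latest",
    },
]


def update_path(history: list) -> list:
    """Return versions ordered from first-started to last-started."""
    ordered = sorted(
        history,
        key=lambda entry: datetime.strptime(entry["startedTime"], "%Y-%m-%dT%H:%M:%SZ"),
    )
    return [entry["version"] for entry in ordered]


def is_out_of_change(history: list) -> bool:
    """True when the cluster installed the change build and then moved to the base."""
    path = update_path(history)
    return (
        len(path) >= 2
        and path[0].endswith("-latest")
        and path[-1].endswith("-initial")
    )


print(update_path(HISTORY))
print(is_out_of_change(HISTORY))
```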
|
@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/test config |
|
@wking: The following test failed, say
Merged 2c7badd into openshift:master
The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we /override ... to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's agnostic, because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.