Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Feb 26, 2025

The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we /override ... to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's agnostic, because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 26, 2025
@openshift-ci-robot
Copy link
Contributor

@wking: This pull request explicitly references no jira issue.

Details

In response to this:

The CVO has had e2e-agnostic-upgrade-into-change and e2e-agnostic-upgrade-out-of-change since acd81b7 (#31518), as part of catching changes that would break rollbacks or roll-forwards pre-merge. Sometimes we accept that a change will break rollbacks, and in that case we /override ... to ignore the failure. But that way we are making an explicit decision, and not getting surprised by accidentally landing something that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (#43734), so even without this commit, we'd hear about things that break rollback post-merge. But reverting post-merge can be tedious, and with so much weight going through the API repo (feature gate promotion, CustomResourceDefinition changes, etc.), having a pre-merge guard seems like it will help. It will run in parallel with the existing update-into-change job, so it shouldn't increase overall latency. It will cost some to run, but that seems worth the cost if it catches issues once every quarter or year or so, if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change, because Prow gets confused by renaming required jobs if they're required for merge, and the final run on an open pull request failed before the job was renamed. There's no way to retest the old-name job, and Prow blocks on it until a root approver /override's the old name. Renaming to match the pattern used in the CVO might be worthwhile for consistency/uniformity at some point, but that can happen separately in follow-up work if the API approvers want.

I'm not including the CVO's agnostic, because while I like explicitly saying "the repo maintainers don't care what cloud this runs on", the existing API e2e-upgrade job doesn't, and matching that local-repo pattern seems more useful for avoiding confusion than matching the CVO pattern.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from knobunc and sjenning February 26, 2025 19:54
@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Feb 26, 2025
@wking wking force-pushed the api-update-out-of-change branch from ae09bc5 to e490d5c Compare February 26, 2025 20:18
@openshift-ci-robot openshift-ci-robot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Feb 26, 2025
@wking wking force-pushed the api-update-out-of-change branch from e490d5c to bf1ffcb Compare February 26, 2025 20:38
@wking
Copy link
Member Author

wking commented Feb 27, 2025

/pj-rehearse

@openshift-ci-robot
Copy link
Contributor

@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have two types of upgrade that we run for API PRs already, we do both z-stream and y-stream upgrades.

This is currently adding a z-stream upgrade (or downgrade?) test, do you think there is value in adding/swapping this for a y-stream version?

I assume the nightlies and CVO are only doing this on z-stream presently?

dependencies:
OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE: release:latest
OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE: release:initial
workflow: openshift-upgrade-azure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The e2e-upgrade above uses AWS, can we be consistent with that please?

Copy link
Member Author

@wking wking Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seemed like it might be worth spreading exposure around, just in case there was a change that broke one cloud. But the benefit of the platform-agnostic job name is that its really easy to move from cloud to cloud for any reason, so sure, pushed a pivot to AWS with bf1ffcb -> be62d94.

@JoelSpeed
Copy link
Contributor

Looking at a recent nightly, I can see jobs that are upgrade-out-of-change and a bunch of plain upgrade jobs, but I don't see any upgrade-into-change, so I think this PR as is would put us inline with what's going on there, so for now I'm ok with leaving the existing upgrade job named as it is

@wking
Copy link
Member Author

wking commented Feb 27, 2025

This is currently adding a z-stream upgrade (or downgrade?) test, do you think there is value in adding/swapping this for a y-stream version?

OCPSTRAT-975 covers what we do for rollbacks, and we are only trying within a z-stream, because so many things change during minor-version updates (new controllers getting added, etc.), and we don't want to spend the time/money on teaching 4.y how to clean up all the new-in-4.(y+1) stuff. And we have logic in the CVO to reject attempts to roll back from 4.(y+1) to 4.y, so I don't think we need to test those.

@wking wking force-pushed the api-update-out-of-change branch from bf1ffcb to 780babb Compare February 27, 2025 22:56
…hange

The CVO has had e2e-agnostic-upgrade-into-change and
e2e-agnostic-upgrade-out-of-change since acd81b7
(ci-operator/config/openshift/cluster-version-operator: Separate A->B
and B->A jobs, 2022-08-22, openshift#31518), as part of catching changes that
would break rollbacks or roll-forwards pre-merge.  Sometimes we accept
that a change will break rollbacks, and in that case we '/override
...' to ignore the failure.  But that way we are making an explicit
decision, and not getting surprised by accidentally landing something
that breaks rollbacks.

Nightlies grew similar gates in e4b3a30 (trt-923: test out of
change upgrades, 2023-10-16, openshift#43734), so even without this commit,
we'd hear about things that break rollback post-merge.  But reverting
post-merge can be tedious, and with so much weight going through the
API repo (feature gate promotion, CustomResourceDefinition changes,
etc.), having a pre-merge guard seems like it will help.  It will run
in parallel with the existing update-into-change job, so it shouldn't
increase overall latency.  It will cost some to run, but that seems
worth the cost if it catches issues once every quarter or year or so,
if it saves the nightly-monitoring folks some manual debugging.

I am not renaming the e2e-upgrade job to e2e-upgrade-into-change,
because Prow gets confused by renaming required jobs if they're
required for merge, and the final run on an open pull request failed
before the job was renamed.  There's no way to retest the old-name
job, and Prow blocks on it until a root approver /override's the old
name.  Renaming to match the pattern used in the CVO might be
worthwhile for consistency/uniformity at some point, but that can
happen separately in follow-up work if the API approvers want.

I'm not including the CVO's "agnostic", because while I like
explicitly saying "the repo maintainers don't care what cloud this
runs on", the existing API e2e-upgrade job doesn't, and matching that
local-repo pattern seems more useful for avoiding confusion than
matching the CVO pattern.
@wking wking force-pushed the api-update-out-of-change branch from 780babb to be62d94 Compare February 27, 2025 23:04
@openshift-ci-robot
Copy link
Contributor

[REHEARSALNOTIFIER]
@wking: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-api-master-e2e-upgrade-out-of-change openshift/api presubmit Presubmit changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@JoelSpeed
Copy link
Contributor

/lgtm
/approve
/pj-rehearse

@openshift-ci-robot
Copy link
Contributor

@JoelSpeed: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@JoelSpeed
Copy link
Contributor

It would probably be useful to also have a techpreview version of this, to match out techpreview upgrade job, happy to follow up with a separate PR for that if we can get this one merged sooner though

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2025
@wking
Copy link
Member Author

wking commented Mar 3, 2025

rehearsal:

: [sig-arch] events should not repeat pathologically for ns/openshift-monitoring expand_less	0s
{  1 events happened too frequently

event happened 21 times, something is wrong: namespace/openshift-monitoring node/ip-10-0-64-133.us-west-2.compute.internal pod/prometheus-k8s-1 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:46:03Z) result=reject }

But it was a B->A update, as we're aiming for:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/62110/rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change/1896504896492933120/artifacts/e2e-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2025-03-03T11:57:47Z 2025-03-03T13:01:01Z Completed 4.19.0-0.ci.test-2025-03-03-102330-ci-op-f0yf5dxv-initial
2025-03-03T11:17:22Z 2025-03-03T11:48:27Z Completed 4.19.0-0.ci.test-2025-03-03-110154-ci-op-f0yf5dxv-latest

And that failure mode turns up in other jobs, so even more evidence that it's not getting introduced by this change. And the rate is low, so we're unlikely to block API merges by landing a perma-failing required job.

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=cannot+register+an+exec+PID' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade (all) - 63 runs, 16% failed, 10% of failures match = 2% impact
rehearse-62110-pull-ci-openshift-api-master-e2e-upgrade-out-of-change (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
openshift-kubernetes-2225-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade (all) - 10 runs, 10% failed, 100% of failures match = 10% impact

/pj-rehearse ack

It would probably be useful to also have a techpreview version of this, to match out techpreview upgrade job, happy to follow up with a separate PR for that if we can get this one merged sooner though

My GitHub feedback loop is slow, so yeah, let's go with a follow-up pull for a TechPreview B->A job, if folks think it's worth covering that pre-merge too.

@openshift-ci-robot
Copy link
Contributor

@wking: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Mar 3, 2025
@JoelSpeed
Copy link
Contributor

/test config

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 4, 2025

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/openshift/api/master/e2e-upgrade-out-of-change be62d94 link unknown /pj-rehearse pull-ci-openshift-api-master-e2e-upgrade-out-of-change

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 40cd648 and 2 for PR HEAD be62d94 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 2c7badd into openshift:master Mar 4, 2025
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants