OCPNODE-3877: add normal grace period allow non-drain updates to complete #30480

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from QiWang19:grace-wait-pool
Nov 15, 2025

@QiWang19 (Member) commented Nov 11, 2025

The upgrade test now waits up to 2 minutes (the normal rollout time for non-drain updates) before reporting an error when a pool requires an update but its nodes are not ready.
This gives non-drain updates time to complete, for example when a default ClusterImagePolicy ships during an upgrade (openshift/cluster-update-keys#85).
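
As a rough illustration of the behavior described above, here is a minimal Python sketch; it is not the Go code in openshift/origin. The names `wait_for_pool`, `pool_is_updated`, `grace_period`, `poll_interval`, `clock`, and `sleep` are hypothetical. Only the 2-minute grace period comes from this description; the 10-second polling interval is an assumption.

```python
import time
from typing import Callable


def wait_for_pool(
    pool_is_updated: Callable[[], bool],
    grace_period: float = 120.0,  # 2 minutes, per the PR description
    poll_interval: float = 10.0,  # assumed value, for illustration only
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the pool finishes updating within the grace period.

    Rather than failing as soon as the pool is seen as "requires update
    but nodes not ready", keep polling until the grace period runs out.
    """
    deadline = clock() + grace_period
    while True:
        if pool_is_updated():
            return True
        if clock() >= deadline:
            # Grace period exhausted: the caller reports the violation.
            return False
        sleep(poll_interval)
```

The `clock` and `sleep` parameters exist only so the sketch can be exercised with a fake clock instead of waiting in real time.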

@openshift-ci openshift-ci bot requested review from deads2k and p0lyn0mial November 11, 2025 22:00
@QiWang19 (Member, Author):

/testwith openshift/cluster-update-keys/main/e2e-aws-upgrade openshift/cluster-update-keys#85 #30480

@QiWang19 QiWang19 changed the title Upgrade test add normal grace period allow non-drain updates to complete OCPNODE-3877: Upgrade test add normal grace period allow non-drain updates to complete Nov 12, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 12, 2025
@openshift-ci-robot commented Nov 12, 2025

@QiWang19: This pull request references OCPNODE-3877 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@QiWang19 (Member, Author):

/verified by @QiWang19

/testwith openshift/cluster-update-keys/main/e2e-aws-upgrade openshift/cluster-update-keys#85 #30480 passed

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/multi-pr-openshift-origin-30480-openshift-cluster-update-keys-85-openshift-origin-30480-e2e-aws-upgrade/1988367407705493504

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 12, 2025
@openshift-ci-robot:

@QiWang19: This PR has been marked as verified by @QiWang19.

@QiWang19 QiWang19 changed the title OCPNODE-3877: Upgrade test add normal grace period allow non-drain updates to complete OCPNODE-3877: add normal grace period allow non-drain updates to complete Nov 12, 2025
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Nov 12, 2025
@QiWang19 (Member, Author):

/verified by @QiWang19

/testwith openshift/cluster-update-keys/main/e2e-aws-upgrade openshift/cluster-update-keys#85 #30480 passed

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/multi-pr-openshift-origin-30480-openshift-cluster-update-keys-85-openshift-origin-30480-e2e-aws-upgrade/1988367407705493504

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 12, 2025
@openshift-ci-robot:

@QiWang19: This PR has been marked as verified by @QiWang19.

The logic now waits for 2min (normal rollout time for non-drain updates) before reporting error if the pool requires an update but nodes are not ready.

Signed-off-by: Qi Wang <qiwan@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Nov 12, 2025
@QiWang19 (Member, Author):

/testwith openshift/cluster-update-keys/main/e2e-aws-upgrade openshift/cluster-update-keys#85 #30480

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 13, 2025
@openshift-ci-robot:

@QiWang19: This PR has been marked as verified by @QiWang19.

@neisw (Contributor) commented Nov 13, 2025

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2025
@wking (Member) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2025
@openshift-ci bot (Contributor) commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neisw, QiWang19, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 782ff8b and 2 for PR HEAD 2fd0d8e in total

@QiWang19 (Member, Author):

/retest-required

@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD b7d2a64 and 1 for PR HEAD 2fd0d8e in total

@openshift-ci bot (Contributor) commented Nov 15, 2025

@QiWang19: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-upgrade-rollback | 2fd0d8e | link | false | `/test e2e-aws-ovn-upgrade-rollback` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot:

/retest-required

Remaining retests: 0 against base HEAD 8aaf884 and 0 for PR HEAD 2fd0d8e in total

@openshift-merge-bot openshift-merge-bot bot merged commit 111e203 into openshift:main Nov 15, 2025
21 of 22 checks passed
wking added a commit to wking/cluster-update-keys that referenced this pull request Nov 19, 2025
…-openshift-cip""

This reverts commit 7a5dcee.

This one has taken us some time:

* 2025-08-27, 94f7582, openshift#82 was our first attempt at enabling the
  ClusterImagePolicy.
* ...but it tripped up the origin test suite, so it was reverted in
  2025-08-28, c40e7b9, openshift#83.
* Qi then hardened the test suite with openshift/origin@d3af51e4acb
  (not fail upgrade checks if all nodes are ready, 2025-09-29,
  openshift/origin#30318) and openshift/origin@2fd0d8e242 (Upgrade
  test add 2min grace period allow non-drain updates to complete,
  2025-11-12, openshift/origin#30480).
* With the tougher CI in place, we tried a second time with
  2025-11-17, 1f89a67, openshift#85.
* ...but still tripped up origin, with runs like [1] taking 2.25m
  (more than the 2m grace period):

    I1119 17:26:21.890667 1511 upgrade.go:629] Waiting on pools to be upgraded
    I1119 17:26:21.939178 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    I1119 17:26:21.939259 1511 upgrade.go:666] Invariant violation detected: master pool requires update but nodes not ready. Waiting up to 2m0s for non-draining updates to complete
    I1119 17:26:31.984116 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    ...
    I1119 17:28:21.981438 1511 upgrade.go:792] Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    I1119 17:28:21.981514 1511 upgrade.go:673] Invariant violation detected: the "master" pool should be updated before the CVO reports available at the new version

  and:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade/1991158541779472384/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/inspect/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/master.yaml | yaml2json | jq -r '.status.conditions[] | select(.type == "Updating") | .lastTransitionTime + " " + .status'
    2025-11-19T17:28:36Z False

  28:36 - 26:21 = 135s = 2.25m, which overshot the 2m grace period.
  The second attempt was reverted in 7a5dcee, openshift#87.

* Qi then hardened the test suite further with
  openshift/origin@c17e560263 (Update grace period for cluster upgrade
  to 10 minutes, 2025-11-19, openshift/origin#30506).
* This commit is taking a third attempt at enabling the
  ClusterImagePolicy.

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade/1991158541779472384
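
The overshoot arithmetic in the commit message above can be checked with a short Python sketch. The two timestamps are copied from the quoted log line and the MachineConfigPool `Updating` condition. Treating the log line as 2025-11-19 in UTC is an assumption, since the log prefix carries only month, day, and time. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Start of the grace-period wait, from the upgrade.go:666 log line.
# Year and UTC offset are assumed to match the condition timestamp.
wait_started = datetime.fromisoformat("2025-11-19T17:26:21+00:00")

# lastTransitionTime of the master pool's Updating=False condition.
updating_false = datetime.fromisoformat("2025-11-19T17:28:36+00:00")

grace_period = timedelta(minutes=2)
elapsed = updating_false - wait_started
overshoot = elapsed - grace_period

print(elapsed.total_seconds())    # 135.0
print(elapsed > grace_period)     # True
print(overshoot.total_seconds())  # 15.0
```

Under those assumptions the pool finished 15 seconds after the 2-minute grace period expired, which matches the 2.25m figure in the commit message.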