-
Notifications
You must be signed in to change notification settings - Fork 4.8k
OCPBUGS-6850: [release-4.12] upgrade/adminack: guarantee one admin ack check post-upgrade #27684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-6850: [release-4.12] upgrade/adminack: guarantee one admin ack check post-upgrade #27684
Conversation
While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a `ack-4.11-kube-1.25-api-removals-in-4.12` gate, so these upgrade jobs sometimes test that `Upgradeable` goes `false` after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version. Add a guaranteed single check execution after the upgrade, so that admin ack is always checked at least once with the upgrade target version. Doing checks after `done` is signalled has prior art in the alert test.
The `done` signal is either a timeout or "upgrade finished, stop testing". We do not need to perform the last check in the former case. Track versions that we check and when we get the signal, check whether the current version was checked at least once, and if not, check it before terminating.
…ade check openshift#27645 intended to add a guaranteed post-upgrade check but I have overlooked how exactly the polling is implemented and terminated, leading to the post-upgrade check never actually execute. Previously the test used `PollImmediateWithContext` for the each-10-minutes check. The `ConditionFunc` never actually returned `true` or non-nil `err`, so the `PollImmediateWithContext` never terminated by the means of `ConditionFunc`: it was always terminated by the `ctx.Done()` that the framework does on finished upgrade (or a test timeout). This means that `PollImmediateWithContext` always terminated with `err=wait.ErrWaitTimeout` and the `Test` method immediately returned, so the "guaranteed" check code is never reached. Given our `ConditionFunc` never terminates the polling, we can simplify and use the `wait.UntilWithContext` instead, which is a simpler version that precisely implements the desired loop (poll until context is done).
During testing of OCPBUGS-5505, it was discovered that even with shortening the CVO cache TTL, CVO may still only update `Upgradeable` in its sync interval, which may be as high as 4 minutes. Hence the tests needs to wait for that time (I added 5 second buffer on top of that).
|
@petr-muller: This pull request references Jira Issue OCPBUGS-6503, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/retest |
|
@petr-muller: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/jira cherrypick OCPBUGS-6503 |
|
@petr-muller: Jira Issue OCPBUGS-6503 has been cloned as Jira Issue OCPBUGS-6850. Retitling PR to link against new bug. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, petr-muller The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/label backport-risk-assessed |
|
/uncc @jiajliu |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/unassign @prietyc123 |
|
GitHub is acting funny: the commits from this PR are merged including the merge commit but for some reason the PR stays open. |
|
@petr-muller: This pull request references Jira Issue OCPBUGS-6850. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
35f8c05 |
…-to-release-4.12 OCPBUGS-6850: [release-4.12] upgrade/adminack: guarantee one admin ack check post-upgrade
This is a cherry-pick of #27645 and #27678
/hold