-
Notifications
You must be signed in to change notification settings - Fork 214
[release-4.12] OCPBUGS-10565: RetrievePayload: Improve timeouts and cover behavior with tests #914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When we separated payload load from payload apply (openshift#683) the context used for the retrieval changed as well. It went from one that was constrained by syncTimeout (2 -4 minutes) [1] to being the unconstrained shutdownContext [2]. However if "force" is specified we explicitly set a 2 minute timeout in RetrievePayload. This commit creates a new context with a reasonable timeout for RetrievePayload regardless of "force". [1] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605 [2] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
The `RetrievePayload` performs two operations: verification and download. Both can take a non-trivial amount of time to terminate, up to "hanging" where CVO needs to abort the operation. The verification result can be ignored when upgrade is forced. The CVO calls `RetrievePayload` with a context that does not set a deadline, so `RetrievePayload` previously set its own internal deadline, common for both operations. This led to a suboptimal behavior on forced upgrades, where "hanging" verification could eat the whole timeout budget, got cancelled but its result was ignored (because of force). The code tried to proceed with download but that immediately aborts because of the expired context. Improve timeouts in `RetrievePayload` for both input context states: with and without deadline. If the input context sets a deadline, it is respected. If it does not, the default, separate deadlines are applied for both operations. In both cases, the code makes sure the hanging verification never spends the whole budget. When verification terminates fast, the rest of its alloted time is provided to the download operation.
|
@openshift-cherrypick-robot: Bugzilla Bug 2090680 has been cloned as Jira Issue OCPBUGS-10565. Retitling PR to link against new bug. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@openshift-cherrypick-robot: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-10565, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test unit /override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change ^^^ is the only failure and it is unrelated to CVO |
|
@petr-muller: petr-muller unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Haha OWNERS files in release branches 🤣 |
|
/test e2e-agnostic-operator |
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
/label backport-risk-assessed |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, openshift-cherrypick-robot The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
pre-merge verified in https://issues.redhat.com/browse/OCPBUGS-10565#comment-22011696 |
|
@jianlinliu @jiajliu can we get a |
|
/label cherry-pick-approved |
|
/retest-required |
|
@openshift-cherrypick-robot: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@openshift-cherrypick-robot: Jira Issue OCPBUGS-10565: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-10565 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.12.0-202305022015.p0.ga5c4031.assembly.stream for distgit cluster-version-operator. |
This is an automated cherry-pick of #896
/assign petr-muller