-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests #896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests #896
Conversation
|
/test unit |
|
@petr-muller: This pull request references OCPBUGSM-44759 which is a valid jira issue. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Skipping CI for Draft Pull Request. |
|
@petr-muller: This pull request references OCPBUGSM-44759 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
When we separated payload load from payload apply (openshift#683) the context used for the retrieval changed as well. It went from one that was constrained by syncTimeout (2 -4 minutes) [1] to being the unconstrained shutdownContext [2]. However if "force" is specified we explicitly set a 2 minute timeout in RetrievePayload. This commit creates a new context with a reasonable timeout for RetrievePayload regardless of "force". [1] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605 [2] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
|
/test unit |
|
@petr-muller: This pull request references OCPBUGSM-44759 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
b7826c6 to
23109b4
Compare
The `RetrievePayload` performs two operations: verification and download. Both can take a non-trivial amount of time to terminate, up to "hanging" where CVO needs to abort the operation. The verification result can be ignored when upgrade is forced. The CVO calls `RetrievePayload` with a context that does not set a deadline, so `RetrievePayload` previously set its own internal deadline, common for both operations. This led to a suboptimal behavior on forced upgrades, where "hanging" verification could eat the whole timeout budget, got cancelled but its result was ignored (because of force). The code tried to proceed with download but that immediately aborts because of the expired context. Improve timeouts in `RetrievePayload` for both input context states: with and without deadline. If the input context sets a deadline, it is respected. If it does not, the default, separate deadlines are applied for both operations. In both cases, the code makes sure the hanging verification never spends the whole budget. When verification terminates fast, the rest of its alloted time is provided to the download operation.
23109b4 to
d421ded
Compare
Only disruption tests, unlikely to be caused by CVO |
|
@petr-muller: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: This pull request references Bugzilla bug 1822752, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: This pull request references Bugzilla bug 2090680, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1 similar comment
|
@petr-muller: This pull request references Bugzilla bug 2090680, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Nothing pointing at CVO, one more try /test e2e-agnostic-upgrade-into-change |
|
/hold We want to investigate findings from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22 |
|
oh nice, all-pass on the first try ❤️ |
wking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: petr-muller, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I have investigated the concerns from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22 and filed https://issues.redhat.com/browse/OCPBUGS-7714 for the problem. That behavior is just "uncovered" by this bugfix, not caused by it. The fix actually improves the state for the clusters that hit it - I believe that previously the manifest reconciliation was completely broken while the unbounded retrieval was timing out for hours, we just did not notice. |
|
/hold cancel |
|
/override ci/prow/e2e-agnostic-upgrade-into-change DNS operator issue, unrelated to this PR |
|
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@petr-muller: All pull requests linked via external trackers have merged:
Bugzilla bug 2090680 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.
Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.
|
/cherry-pick release-4.12 |
|
@petr-muller: new pull request created: #914 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The
RetrievePayloadperforms two operations: verification and download. Both can take a non-trivial amount of time to terminate, up to "hanging" where CVO needs to abort the operation. The verification result can be ignored when upgrade is forced. The CVO callsRetrievePayloadwith a context that does not set a deadline, soRetrievePayloadpreviously set its own internal deadline, common for both operations. This led to a suboptimal behavior on forced upgrades, where "hanging" verification could eat the whole timeout budget, got cancelled but its result was ignored (because of force). The code tried to proceed with download but that immediately aborts because of the expired context.Improve timeouts in
RetrievePayloadfor both input context states: with and without deadline. If the input context sets a deadline, it is respected. If it does not, the default, separate deadlines are applied for both operations. In both cases, the code makes sure the hanging verification never spends the whole budget. When verification terminates fast, the rest of its allotted time is given to the download operation.To keep context about the change, I have kept separate commits where the intermediate ones show the newly-added tests failing for both the original bug and the regression that caused us to revert in #881.
The tests are a little clunky, being time-based and all. The happy case takes ~4s to execute (the parallelism helps), failures should be capped to ~5s. In theory the tests could be flakey (there's no strong guarantee that a
selectwill always see a after-4s channel operation before the after-5s one) but I'm not seeing any problems with several tries of high-count--count 30runs :Original, pre-#846 CVO fails the newly added test for RHBZ#2090680:
Reinstated #846 code fails two new tests (simulating the scenario for which we needed to revert it in #881):
Last commit implements the separate timeouts and makes all tests pass.