Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests #896

petr-muller · 2023-02-03T01:10:25Z

RetrievePayload performs two operations: verification and download. Both can take a non-trivial amount of time to terminate, up to "hanging", where the CVO needs to abort the operation. The verification result can be ignored when the upgrade is forced. The CVO calls RetrievePayload with a context that does not set a deadline, so RetrievePayload previously set its own internal deadline, shared by both operations. This led to suboptimal behavior on forced upgrades: a "hanging" verification could eat the whole timeout budget and get cancelled, and its result was then ignored (because of force). The code then tried to proceed with the download, which aborted immediately because the context had already expired.

Improve timeouts in RetrievePayload for both input context states: with and without a deadline. If the input context sets a deadline, it is respected. If it does not, separate default deadlines are applied to the two operations. In both cases, the code makes sure that a hanging verification never spends the whole budget. When verification terminates quickly, the rest of its allotted time is given to the download operation.
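
For readers who want the idea in code, here is a minimal sketch of the deadline handling described above. It is not the code from this PR: the names (step, deadlines, retrieve, simulateHangingVerification), the durations, and the choice to cap verification at half of a caller-provided budget are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// step is a hypothetical stand-in for the verification or download operation.
type step func(ctx context.Context) error

// deadlines picks absolute deadlines for verification and for the whole
// retrieval. The default durations and the "half of the budget" split are
// illustrative assumptions, not the values used by the CVO.
func deadlines(ctx context.Context, now time.Time, verifyDefault, downloadDefault time.Duration) (verifyBy, doneBy time.Time) {
	if deadline, ok := ctx.Deadline(); ok {
		// Respect the caller's deadline; cap verification at half of what is left.
		return now.Add(deadline.Sub(now) / 2), deadline
	}
	verifyBy = now.Add(verifyDefault)
	return verifyBy, verifyBy.Add(downloadDefault)
}

// retrieve runs verification and then download, each under its own deadline.
func retrieve(ctx context.Context, force bool, verifyDefault, downloadDefault time.Duration, verify, download step) error {
	verifyBy, doneBy := deadlines(ctx, time.Now(), verifyDefault, downloadDefault)

	verifyCtx, cancelVerify := context.WithDeadline(ctx, verifyBy)
	err := verify(verifyCtx)
	cancelVerify()
	if err != nil && !force {
		return fmt.Errorf("verification failed: %w", err)
	}

	// Derive from ctx, not verifyCtx, so an expired verification deadline does
	// not cancel the download. The deadline is absolute, so a verification
	// that finishes early leaves its unused time to the download.
	downloadCtx, cancelDownload := context.WithDeadline(ctx, doneBy)
	defer cancelDownload()
	if err := download(downloadCtx); err != nil {
		return fmt.Errorf("download failed: %w", err)
	}
	return nil
}

// simulateHangingVerification runs a retrieval with an unbounded input context
// where verification hangs until its deadline and the download is instant.
func simulateHangingVerification(force bool) error {
	hangingVerify := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	quickDownload := func(ctx context.Context) error {
		if ctx.Err() != nil {
			return errors.New("download context already expired")
		}
		return nil
	}
	return retrieve(context.Background(), force, 50*time.Millisecond, 2*time.Second, hangingVerify, quickDownload)
}
```

With force set, simulateHangingVerification is expected to return nil because the download still receives a live context; without force, it returns the wrapped verification timeout.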

To keep context about the change, I have kept separate commits where the intermediate ones show the newly-added tests failing for both the original bug and the regression that caused us to revert in #881.

The tests are a little clunky, being time-based and all. The happy case takes ~4s to execute (the parallelism helps), and failures should be capped at ~5s. In theory the tests could be flaky (there's no strong guarantee that a select will always see an after-4s channel operation before the after-5s one), but I'm not seeing any problems across several high-count (--count 30) runs:

$ go test --count 30 ./pkg/cvo/... -run TestPayloadRetrieverRetrievePayload
ok  	github.com/openshift/cluster-version-operator/pkg/cvo	120.096s
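
The time-based pattern behind these tests can be sketched roughly as follows. This is an illustrative test double, not the one in updatepayload_test.go: hangingDownload and runWithTimeout are hypothetical names, and only the two error strings are taken from the failure output quoted in this description.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// hangingDownload is a hypothetical fake downloader. It blocks until ctx is
// cancelled; if the backstop timer fires first, the code under test failed
// to cancel the download in time.
func hangingDownload(ctx context.Context, backstop time.Duration) error {
	select {
	case <-ctx.Done():
		return errors.New("download was canceled")
	case <-time.After(backstop):
		return errors.New("downloader: test backstop hit (expected cancel via ctx)")
	}
}

// runWithTimeout stands in for the code under test: it bounds the download
// with a timeout and reports what the fake downloader observed.
func runWithTimeout(timeout, backstop time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return hangingDownload(ctx, backstop)
}
```

The select is where the theoretical flakiness comes from: if both channels become ready at nearly the same moment, Go picks one at random, so the sketch (like the real tests) relies on the gap between the timeout and the backstop being comfortably large.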

Original, pre-#846 CVO fails the newly added test for RHBZ#2090680:

 --- FAIL: TestPayloadRetrieverRetrievePayload (0.00s)
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_passes_and_download_hangs_then_it_is_terminated_and_returns_error_(RHBZ#2090680) (5.00s)
        updatepayload_test.go:105: downloader: test backstop hit (expected cancel via ctx)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            - 	e"Unable to download and prepare the update: download was canceled",
            + 	e"Unable to download and prepare the update: downloader: test backstop hit (expected cancel via ctx)",
              )

Reinstated #846 code fails two new tests (simulating the scenario for which we needed to revert it in #881):

 --- FAIL: TestPayloadRetrieverRetrievePayload (0.00s)
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_is_timing_out_verification_with_unlimited_context_and_update_is_forced_then_verification_times_out_promptly_and_retrieval_proceeds_but_download_fails_then_return_error (2.00s)
        updatepayload_test.go:118: downloader: unexpected ctx cancel (expected download to finish)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            - 	e"Unable to download and prepare the update: fails to download",
            + 	e"Unable to download and prepare the update: downloader: unexpected ctx cancel (expected download to finish)",
              )
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_fails_to_verify_until_timeout_but_is_forced_then_it_allows_enough_time_for_download_and_it_returns_successfully (2.00s)
        updatepayload_test.go:118: downloader: unexpected ctx cancel (expected download to finish)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            + 	e"Unable to download and prepare the update: downloader: unexpected ctx cancel (expected download to finish)",
              )

Last commit implements the separate timeouts and makes all tests pass.

petr-muller · 2023-02-03T01:10:41Z

/test unit

openshift-ci-robot · 2023-02-03T01:11:36Z

@petr-muller: This pull request references OCPBUGSM-44759 which is a valid jira issue.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-02-03T01:11:38Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

When we separated payload load from payload apply (openshift#683), the context used for the retrieval changed as well. It went from one that was constrained by syncTimeout (2-4 minutes) [1] to the unconstrained shutdownContext [2]. However, if "force" is specified, we explicitly set a 2 minute timeout in RetrievePayload. This commit creates a new context with a reasonable timeout for RetrievePayload regardless of "force".

[1] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605
[2] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413

petr-muller · 2023-02-03T01:20:09Z

/test unit

petr-muller · 2023-02-03T13:30:36Z

[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test 
[sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test

Only disruption tests, unlikely to be caused by CVO
/test e2e-agnostic-upgrade-into-change

openshift-ci-robot · 2023-02-03T13:31:38Z

@petr-muller: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

openshift-ci · 2023-02-03T13:32:04Z

@petr-muller: This pull request references Bugzilla bug 1822752, which is invalid:

expected the bug to be open, but it isn't
expected the bug to target the "4.13.0" release, but it targets "4.11.0" instead
expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is CLOSED (ERRATA) instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

openshift-ci · 2023-02-03T13:43:50Z

@petr-muller: This pull request references Bugzilla bug 2090680, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.13.0) matches configured target release for branch (4.13.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @evakhoni

petr-muller · 2023-02-05T08:44:10Z

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
: [sig-network] pods should successfully create sandboxes by other
: [sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
: [sig-api-machinery] disruption/oauth-api connection/reused should be available throughout the test
: [sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
: [sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test
upgrade: [sig-cluster-lifecycle] ClusterOperators are available and not degraded after upgrade

Nothing pointing at CVO, one more try

/test e2e-agnostic-upgrade-into-change

petr-muller · 2023-02-07T10:35:50Z

/hold

We want to investigate findings from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22

petr-muller · 2023-02-14T22:51:30Z

oh nice, all-pass on the first try ❤️

wking

/lgtm

openshift-ci · 2023-02-15T23:31:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [petr-muller,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

petr-muller · 2023-02-17T16:40:06Z

I have investigated the concerns from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22 and filed https://issues.redhat.com/browse/OCPBUGS-7714 for the problem. That behavior is just "uncovered" by this bugfix, not caused by it. The fix actually improves the state for the clusters that hit it - I believe that previously the manifest reconciliation was completely broken while the unbounded retrieval was timing out for hours, we just did not notice.

petr-muller · 2023-02-17T16:41:06Z

/hold cancel

openshift-ci-robot · 2023-02-17T20:11:31Z

/retest-required

Remaining retests: 0 against base HEAD 9bfc8a5 and 2 for PR HEAD ac127d3 in total

petr-muller · 2023-02-18T13:52:51Z

/override ci/prow/e2e-agnostic-upgrade-into-change

DNS operator issue, unrelated to this PR

openshift-ci · 2023-02-18T13:53:07Z

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change

openshift-ci · 2023-02-18T13:53:08Z

@petr-muller: all tests passed!

openshift-ci · 2023-02-18T13:56:10Z

@petr-muller: All pull requests linked via external trackers have merged:

Bugzilla bug 2090680 has been moved to the MODIFIED state.

Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.

petr-muller · 2023-03-20T13:20:08Z

/cherry-pick release-4.12

openshift-cherrypick-robot · 2023-03-20T13:20:54Z

@petr-muller: new pull request created: #914

RetrievePayload: improve testability and add tests

0a121cc

openshift-ci-robot added the jira/valid-reference label Feb 3, 2023

openshift-ci bot added the do-not-merge/work-in-progress label Feb 3, 2023

openshift-ci bot added the approved label Feb 3, 2023

petr-muller changed the title ~~OCPBUGSM-44759: Improve timeouts on payload retrieval~~ OCPBUGSM-44759: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023

petr-muller marked this pull request as ready for review February 3, 2023 01:45

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 3, 2023

openshift-ci bot requested review from DavidHurta and LalatenduMohanty February 3, 2023 01:46

petr-muller force-pushed the ocpbugsm-44759-unrevert-and-fix-regression branch 3 times, most recently from b7826c6 to 23109b4 Compare February 3, 2023 02:07

petr-muller force-pushed the ocpbugsm-44759-unrevert-and-fix-regression branch from 23109b4 to d421ded Compare February 3, 2023 02:13

petr-muller changed the title ~~OCPBUGSM-44759: RetrievePayload: Improve timeouts and cover behavior with tests~~ Bug 1822752: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023

openshift-ci-robot removed the jira/valid-reference label Feb 3, 2023

openshift-ci bot added the bugzilla/severity-low and bugzilla/invalid-bug labels Feb 3, 2023

petr-muller changed the title ~~Bug 1822752: RetrievePayload: Improve timeouts and cover behavior with tests~~ Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023

openshift-ci bot removed the bugzilla/severity-low label Feb 3, 2023

openshift-ci bot requested a review from evakhoni February 3, 2023 13:33

openshift-ci bot added the do-not-merge/hold label Feb 7, 2023

wking reviewed Feb 13, 2023: pkg/cvo/updatepayload.go

wking reviewed Feb 13, 2023: pkg/cvo/updatepayload.go (outdated)

wking reviewed Feb 13, 2023: pkg/cvo/updatepayload_test.go

RetrievePayload: improve godocs and refactor (address review)

ac127d3

wking approved these changes Feb 15, 2023

openshift-ci bot assigned wking Feb 15, 2023

openshift-ci bot added the lgtm label Feb 15, 2023

openshift-ci bot removed the do-not-merge/hold label Feb 17, 2023

openshift-merge-robot merged commit 0e0d368 into openshift:master Feb 18, 2023

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 18, 2023

pkg/cvo: code cleanups

d581a97

Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.

petr-muller mentioned this pull request Feb 18, 2023

pkg/cvo: code cleanups #902

Merged

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 25, 2023

pkg/cvo: code cleanups

854e24e

Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.

openshift-cherrypick-robot mentioned this pull request Mar 20, 2023

[release-4.12] OCPBUGS-10565: RetrievePayload: Improve timeouts and cover behavior with tests #914

Merged
