@petr-muller
Member

@petr-muller petr-muller commented Feb 3, 2023

RetrievePayload performs two operations: verification and download. Both can take a non-trivial amount of time to terminate, up to "hanging", where the CVO needs to abort the operation. The verification result can be ignored when the upgrade is forced. The CVO calls RetrievePayload with a context that does not set a deadline, so RetrievePayload previously set its own internal deadline, shared by both operations. This led to suboptimal behavior on forced upgrades: a "hanging" verification could eat the whole timeout budget and get cancelled, with its result then ignored (because of force). The code then tried to proceed with the download, which aborted immediately because the context had already expired.

Improve timeouts in RetrievePayload for both input context states: with and without a deadline. If the input context sets a deadline, it is respected. If it does not, separate default deadlines are applied to the two operations. In both cases, the code makes sure a hanging verification never spends the whole budget. When verification terminates fast, the rest of its allotted time is given to the download operation.

To keep context about the change, I have kept separate commits where the intermediate ones show the newly-added tests failing for both the original bug and the regression that caused us to revert in #881.

The tests are a little clunky, being time-based and all. The happy case takes ~4s to execute (the parallelism helps), and failures should be capped at ~5s. In theory the tests could be flaky (there's no strong guarantee that a select will always see an after-4s channel operation before the after-5s one), but I'm not seeing any problems across several high-count (--count 30) runs:

$ go test --count 30 ./pkg/cvo/... -run TestPayloadRetrieverRetrievePayload
ok  	github.com/openshift/cluster-version-operator/pkg/cvo	120.096s

Original, pre-#846 CVO fails the newly added test for RHBZ#2090680:

 --- FAIL: TestPayloadRetrieverRetrievePayload (0.00s)
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_passes_and_download_hangs_then_it_is_terminated_and_returns_error_(RHBZ#2090680) (5.00s)
        updatepayload_test.go:105: downloader: test backstop hit (expected cancel via ctx)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            - 	e"Unable to download and prepare the update: download was canceled",
            + 	e"Unable to download and prepare the update: downloader: test backstop hit (expected cancel via ctx)",
              ) 

Reinstated #846 code fails two new tests (simulating the scenario for which we needed to revert it in #881):

 --- FAIL: TestPayloadRetrieverRetrievePayload (0.00s)
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_is_timing_out_verification_with_unlimited_context_and_update_is_forced_then_verification_times_out_promptly_and_retrieval_proceeds_but_download_fails_then_return_error (2.00s)
        updatepayload_test.go:118: downloader: unexpected ctx cancel (expected download to finish)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            - 	e"Unable to download and prepare the update: fails to download",
            + 	e"Unable to download and prepare the update: downloader: unexpected ctx cancel (expected download to finish)",
              )
    --- FAIL: TestPayloadRetrieverRetrievePayload/when_sha_digest_pullspec_image_fails_to_verify_until_timeout_but_is_forced_then_it_allows_enough_time_for_download_and_it_returns_successfully (2.00s)
        updatepayload_test.go:118: downloader: unexpected ctx cancel (expected download to finish)
        updatepayload_test.go:339: Returned error differs from expected:
              interface{}(
            + 	e"Unable to download and prepare the update: downloader: unexpected ctx cancel (expected download to finish)",
              ) 

Last commit implements the separate timeouts and makes all tests pass.

@petr-muller
Member Author

/test unit

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 3, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2023
@openshift-ci-robot
Contributor

@petr-muller: This pull request references OCPBUGSM-44759 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2023
When we separated payload load from payload apply (openshift#683), the
context used for the retrieval changed as well. It went from one that
was constrained by syncTimeout (2-4 minutes) [1] to being the
unconstrained shutdownContext [2]. However, if "force" is specified, we
explicitly set a 2-minute timeout in RetrievePayload. This commit
creates a new context with a reasonable timeout for RetrievePayload
regardless of "force".

[1]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605

[2]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
@petr-muller
Member Author

/test unit

@petr-muller petr-muller changed the title OCPBUGSM-44759: Improve timeouts on payload retrieval OCPBUGSM-44759: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023
@petr-muller petr-muller marked this pull request as ready for review February 3, 2023 01:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2023
@petr-muller petr-muller force-pushed the ocpbugsm-44759-unrevert-and-fix-regression branch 3 times, most recently from b7826c6 to 23109b4 Compare February 3, 2023 02:07
`RetrievePayload` performs two operations: verification and download.
Both can take a non-trivial amount of time to terminate, up to
"hanging", where the CVO needs to abort the operation. The verification
result can be ignored when the upgrade is forced. The CVO calls
`RetrievePayload` with a context that does not set a deadline, so
`RetrievePayload` previously set its own internal deadline, shared by
both operations. This led to suboptimal behavior on forced upgrades: a
"hanging" verification could eat the whole timeout budget and get
cancelled, with its result then ignored (because of force). The code
then tried to proceed with the download, which aborted immediately
because the context had already expired.

Improve timeouts in `RetrievePayload` for both input context states:
with and without a deadline. If the input context sets a deadline, it is
respected. If it does not, separate default deadlines are applied to the
two operations. In both cases, the code makes sure a hanging
verification never spends the whole budget. When verification terminates
fast, the rest of its allotted time is provided to the download
operation.
@petr-muller petr-muller force-pushed the ocpbugsm-44759-unrevert-and-fix-regression branch from 23109b4 to d421ded Compare February 3, 2023 02:13
@petr-muller
Member Author

[sig-api-machinery] disruption/kube-api connection/new should be available throughout the test 
[sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
[sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test

Only disruption tests, unlikely to be caused by CVO
/test e2e-agnostic-upgrade-into-change

@petr-muller petr-muller changed the title OCPBUGSM-44759: RetrievePayload: Improve timeouts and cover behavior with tests Bug 1822752: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023
@openshift-ci-robot openshift-ci-robot removed the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 3, 2023
@openshift-ci-robot
Contributor

@petr-muller: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

@openshift-ci openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 3, 2023
@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

@petr-muller: This pull request references Bugzilla bug 1822752, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.13.0" release, but it targets "4.11.0" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is CLOSED (ERRATA) instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

@petr-muller petr-muller changed the title Bug 1822752: RetrievePayload: Improve timeouts and cover behavior with tests Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests Feb 3, 2023
@openshift-ci openshift-ci bot removed the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Feb 3, 2023
@openshift-ci openshift-ci bot requested a review from evakhoni February 3, 2023 13:33
@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

@petr-muller: This pull request references Bugzilla bug 2090680, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.13.0) matches configured target release for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @evakhoni

@petr-muller
Member Author

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
: [sig-network] pods should successfully create sandboxes by other
: [sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
: [sig-api-machinery] disruption/oauth-api connection/reused should be available throughout the test
: [sig-api-machinery] disruption/openshift-api connection/new should be available throughout the test
: [sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test
upgrade: [sig-cluster-lifecycle] ClusterOperators are available and not degraded after upgrade

Nothing pointing at CVO, one more try

/test e2e-agnostic-upgrade-into-change

@petr-muller
Member Author

/hold

We want to investigate findings from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2023
@petr-muller
Member Author

oh nice, all-pass on the first try ❤️

Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2023
@openshift-ci
Contributor

openshift-ci bot commented Feb 15, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller
Member Author

I have investigated the concerns from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22 and filed https://issues.redhat.com/browse/OCPBUGS-7714 for the problem. That behavior is just "uncovered" by this bugfix, not caused by it. The fix actually improves the state for the clusters that hit it: I believe that previously the manifest reconciliation was completely broken while the unbounded retrieval was timing out for hours; we just did not notice.

@petr-muller
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9bfc8a5 and 2 for PR HEAD ac127d3 in total

@petr-muller
Member Author

/override ci/prow/e2e-agnostic-upgrade-into-change

DNS operator issue, unrelated to this PR

@openshift-ci
Contributor

openshift-ci bot commented Feb 18, 2023

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change

@openshift-ci
Contributor

openshift-ci bot commented Feb 18, 2023

@petr-muller: all tests passed!

@openshift-merge-robot openshift-merge-robot merged commit 0e0d368 into openshift:master Feb 18, 2023
@openshift-ci
Contributor

openshift-ci bot commented Feb 18, 2023

@petr-muller: All pull requests linked via external trackers have merged:

Bugzilla bug 2090680 has been moved to the MODIFIED state.

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 18, 2023
Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.
petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 25, 2023
Fix a bunch of smells I encountered while working on openshift#896. Mostly simplify method signatures, but also remove two unused methods.
@petr-muller
Member Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@petr-muller: new pull request created: #914

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
