Skip to content

Conversation

@jottofar
Copy link
Contributor

@jottofar jottofar commented Jul 17, 2020

Change CVO wait.Until instances to wait.UntilWithContext making top-level context available to the lower levels of the code. Use the added context to add a timeout to the remote cincinnati HTTP call which has the potential to hang the caller at the mercy of the target server.

The timeout has been set to an hour which was determined to be long enough to allow all expected well-behaved requests to complete but short enough to not be catastrophic if the calling thread were to hang.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2020
@jottofar jottofar force-pushed the ota-244-add-context branch 3 times, most recently from 3970ccf to 17b5248 Compare July 17, 2020 22:11
@wking
Copy link
Member

wking commented Jul 18, 2020

unit includes:

# github.com/openshift/cluster-version-operator/pkg/cvo [github.com/openshift/cluster-version-operator/pkg/cvo.test]
pkg/cvo/cvo_scenarios_test.go:114:15: not enough arguments in call to o.sync
	have (string)
	want (context.Context, string)

@jottofar jottofar force-pushed the ota-244-add-context branch 5 times, most recently from 4ffb363 to 9ebfffc Compare July 20, 2020 23:59
@jottofar
Copy link
Contributor Author

/retest

@jottofar jottofar force-pushed the ota-244-add-context branch from 9ebfffc to 4923ce5 Compare July 21, 2020 12:49
@jottofar
Copy link
Contributor Author

/test unit

@jottofar
Copy link
Contributor Author

/retest

3 similar comments
@jottofar
Copy link
Contributor Author

/retest

@jottofar
Copy link
Contributor Author

/retest

@jottofar
Copy link
Contributor Author

/retest

@jottofar jottofar force-pushed the ota-244-add-context branch from 4923ce5 to f5654a9 Compare July 24, 2020 21:30
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2020
@jottofar
Copy link
Contributor Author

/retitle Use context to add timeout to cincinnati HTTP request

@openshift-ci-robot openshift-ci-robot changed the title Add a context to remote cincinnati calls from CVO Use context to add timeout to cincinnati HTTP request Jul 24, 2020
@jottofar jottofar force-pushed the ota-244-add-context branch from f5654a9 to 292f41e Compare July 24, 2020 21:40
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2020
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2020
@jottofar jottofar force-pushed the ota-244-add-context branch from d5b7ee7 to d1af4f9 Compare July 28, 2020 18:24
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2020
@jottofar jottofar force-pushed the ota-244-add-context branch 3 times, most recently from c67db6e to 8ee0d52 Compare July 30, 2020 13:20
@wking
Copy link
Member

wking commented Jul 30, 2020

pkg/cincinnati/cincinnati.go:89:36: undefined: transport

@jottofar jottofar force-pushed the ota-244-add-context branch from 8ee0d52 to 6cb435f Compare July 30, 2020 13:45
Change CVO wait.Until instances to wait.UntilWithContext making top-level
context available to the lower levels of the code. Use the added context
to add a timeout to the remote cincinnati HTTP call which has the
potential to hang the caller at the mercy of the target server.

The timeout has been set to an hour which was determined to be long
enough to allow all expected well-behaved requests to complete but short
enough to not be catastrophic if the calling thread were to hang.
@jottofar jottofar force-pushed the ota-244-add-context branch from 6cb435f to 1d1de3b Compare July 30, 2020 14:10
@jottofar
Copy link
Contributor Author

/test integration

@jottofar
Copy link
Contributor Author

/test e2e

@jottofar
Copy link
Contributor Author

/test e2e-upgrade

@wking
Copy link
Member

wking commented Jul 31, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Member

wking commented Jul 31, 2020

e2e passed, except for an end user can use OLM can subscribe to the operator, which is rhbz#1862322.

/override ci/prow/e2e

@openshift-ci-robot
Copy link
Contributor

@wking: Overrode contexts on behalf of wking: ci/prow/e2e

Details

In response to this:

e2e passed, except for an end user can use OLM can subscribe to the operator, which is rhbz#1862322.

/override ci/prow/e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 4c3024d into openshift:master Jul 31, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

Removing Start from pkg/start (again) fixes a buggy re-introduction in
the manually-backported 20421b6 (*: Add lots of Context and options
arguments, 2020-07-24, openshift#470).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants