Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #372

wking · 2020-05-28T07:19:47Z

With this commit, I drop contextIsCancelled in favor of Context.Err(). From the docs:

If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled dates back to 961873d (#82).

I've also generalized a number of *Cancel* helpers to be *Context* to remind folks that Context.Err() can be DeadlineExceeded as well as Canceled, and the CVO uses both WithCancel and WithTimeout. The new error messages will be either:

update context deadline exceeded at 1 of 2

or:

update context canceled at 1 of 2

Instead of always claiming:

update was cancelled at 1 of 2

With this commit, I drop contextIsCancelled in favor of Context.Err(). From the docs [1]:

  If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled dates back to 961873d (sync: Do config syncing in the background, 2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to remind folks that Context.Err() can be DeadlineExceeded as well as Canceled, and the CVO uses both WithCancel and WithTimeout. The new error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

[1]: https://golang.org/pkg/context/#Context

… as success

For [1]. Before this commit, you could have a flow like:

1. SyncWorker.Start()
2. External code calls Update(), e.g. after noticing a ClusterVersion spec.desiredUpdate change.
3. Update sets w.work to point at the desired payload.
4. Start's Until loop is triggered via w.notify.
5. Start calls calculateNext, which notices the change and sets the state to UpdatingPayload.
6. Start calculates a syncTimeout and calls syncOnce.
7. syncOnce notices the new payload and loads it.
8. For whatever reason, payload retrieval takes a while. Blackholed signature-fetch attempts in a restricted network [2] are one example of something that could be slow here. Eventually the syncTimeout kicks in and signature fetching or other verification is canceled (which counts as failed verification).
9. Force is true, so syncOnce logs "Forcing past precondition failures..." but carries on to call apply.
10. apply computes the manifest graph, runs the ClusterOperator precreation (whose handlers return ContextError() right after spawning, because the context is already expired), and runs the main RunGraph (whose handlers do the same).
11. The main RunGraph returns a slice with a handful of context errors and nothing else. apply passes this on to consistentReporter.Errors.
12. consistentReporter.Errors calls summarizeTaskGraphErrors, which logs "All errors were context errors..." and returns nil to avoid alarming consistentReporter.Errors (we don't want to put this in our ClusterVersion status and alarm users).
13. apply returns the summarized nil to syncOnce.
14. syncOnce returns the summarized nil to Start.
15. Start logs "Sync succeeded..." and flops into ReconcilingPayload for the next round.
16. Start comes into the next round in reconciling mode despite never having attempted to apply any manifests in its Updating mode. The manifest graph gets flattened and shuffled and all kinds of terrible things could happen, like the machine-config trying to roll out the newer machine-os-content and its 4.4 hyperkube binary before rolling out prerequisites like the 4.4 kube-apiserver operator.

With this commit, the process is the same through 12, but ends with:

13. apply returns the first context error to syncOnce.
14. syncOnce returns that error to Start.
15. Start backs off and comes in again with a second attempt at UpdatingPayload.
16. Manifests get pushed in the intended order, and nothing explodes.

The race fixed in this commit could also have come up without timing out the payload pull/verification, e.g. by having perfectly slow ClusterOperator preconditions.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343

openshift-ci-robot · 2020-05-28T12:31:07Z

@wking: This pull request references Bugzilla bug 1838497, which is invalid:

expected the bug to target the "4.5.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2020-05-28T12:32:05Z

Added on the fix for rhbz#1838497, since it is also in the "are these context errors?" space.

wking · 2020-05-28T12:32:46Z

/bugzilla refresh

openshift-ci-robot · 2020-05-28T12:32:54Z

@wking: This pull request references Bugzilla bug 1838497, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)


LalatenduMohanty · 2020-05-28T14:04:04Z

/retest

pkg/cvo/sync_worker.go

wking · 2020-05-28T17:31:07Z

e2e:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/372/pull-ci-openshift-cluster-version-operator-master-e2e-aws/1395/build-log.txt | grep -A6 'Failing tests:' | head -n7
Failing tests:

[sig-builds][Feature:Builds] custom build with buildah  being created from new-build should complete build with custom builder image [Suite:openshift/conformance/parallel]
[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
[sig-operator] an end user use OLM can subscribe to the cockroachdb operator [Suite:openshift/conformance/parallel]

One of the alerts is CertifiedOperatorConnectionErrors, so possibly related to the Quay outage.

/retest

sdodson · 2020-05-28T19:34:47Z

/test e2e-aws

jottofar · 2020-05-28T21:21:20Z

/test e2e-aws-upgrade

wking · 2020-05-29T04:14:23Z

openshift/ci-tools#860 should help update CI.

/retest

abhinavdahiya · 2020-05-29T08:17:59Z

pkg/cvo/sync_worker.go

 	copied.Step = "ApplyResources"
 	copied.Fraction = float32(r.done) / float32(r.total)
-	if !isCancelledError(err) {
+	if !isContextError(err) {


Replace this with errors.Is ?

I dunno if we have support for Unwrap, which is what Is uses. This commit is mostly about adjusting the naming to reflect the current implementation more precisely. Can we punt implementation improvements to future work?

Which errors.Is we are talking about here. Is it from https://github.com/kubernetes/apimachinery/blob/master/pkg/api/errors/errors.go ?

Which errors.Is we are talking about here.

https://blog.golang.org/go1.13-errors

LalatenduMohanty

/hold for @abhinavdahiya to make sure he is fine with the answer to his suggestion

/lgtm

openshift-ci-robot · 2020-06-01T18:41:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhinavdahiya · 2020-06-01T18:55:30Z

/hold cancel

wking · 2020-06-01T19:03:01Z

/bugzilla refresh

openshift-ci-robot · 2020-06-01T19:03:04Z

@wking: This pull request references Bugzilla bug 1838497, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.0) matches configured target release for branch (4.6.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)


openshift-bot · 2020-06-01T21:24:41Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-01T21:37:39Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-01T23:34:44Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T01:05:39Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T01:44:39Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T03:15:39Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T04:59:41Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T05:12:39Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T08:27:42Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T08:40:40Z

/retest

Please review the full test history for this PR and help us cut down flakes.

LalatenduMohanty · 2020-06-02T09:42:53Z

/test e2e-aws

LalatenduMohanty · 2020-06-02T11:38:35Z

/test e2e-aws

openshift-bot · 2020-06-02T17:46:25Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T17:59:25Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-06-02T19:43:21Z

/retest

Please review the full test history for this PR and help us cut down flakes.

wking · 2020-06-03T02:23:20Z

upgrade job seems to have gotten lost. Kick it.

/test e2e-aws-upgrade

wking · 2020-06-03T04:55:03Z

CI registry/cluster flake.

/retest

openshift-ci-robot · 2020-06-03T06:16:30Z

@wking: All pull requests linked via external trackers have merged: openshift/cluster-version-operator#372. Bugzilla bug 1838497 has been moved to the MODIFIED state.


wking · 2020-06-03T13:15:06Z

/cherrypick release-4.5

openshift-cherrypick-robot · 2020-06-03T13:15:13Z

@wking: new pull request created: #378


openshift-ci-robot requested review from crawford and jottofar May 28, 2020 07:20

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2020

wking force-pushed the precise-context-error branch from 918fdbd to 1033fa4 Compare May 28, 2020 12:30

wking changed the title ~~pkg/cvo/sync_worker: Generalize CancelError to ContextError~~ Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success May 28, 2020

openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 28, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 28, 2020

openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label May 28, 2020

jottofar reviewed May 28, 2020

pkg/cvo/sync_worker.go

LalatenduMohanty reviewed May 28, 2020

pkg/cvo/sync_worker.go

abhinavdahiya reviewed May 29, 2020

LalatenduMohanty approved these changes Jun 1, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 1, 2020

openshift-ci-robot assigned LalatenduMohanty Jun 1, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 1, 2020

openshift-merge-robot merged commit 26de2a1 into openshift:master Jun 3, 2020

wking deleted the precise-context-error branch June 3, 2020 13:08

openshift-cherrypick-robot mentioned this pull request Jun 3, 2020

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #378

Merged
