-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 1843732: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1843732: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #379
Conversation
With this commit, I drop contextIsCancelled in favor of Context.Err(). From the docs [1]: If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error. I dunno why we'd been checking Done() instead, but contextIsCancelled dates back to 961873d (sync: Do config syncing in the background, 2019-01-11, openshift#82). I've also generalized a number of *Cancel* helpers to be *Context* to remind folks that Context.Err() can be DeadlineExceeded as well as Canceled, and the CVO uses both WithCancel and WithTimeout. The new error messages will be either: update context deadline exceeded at 1 of 2 or: update context canceled at 1 of 2 Instead of always claiming: update was cancelled at 1 of 2 Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve context conflicts because release-4.4 lacks 2a469e3 (cvo: When installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318). [1]: https://golang.org/pkg/context/#Context
… as success
For [1]. Before this commit, you could have a flow like:
1. SyncWorker.Start()
2. External code calls Update(), e.g. after noticing a ClusterVersion
spec.desiredUpdate change.
3. Update sets w.work to point at the desired payload.
4. Start's Until loop is triggered via w.notify.
5. Start calls calculateNext, which notices the change and sets the
state to UpdatingPayload.
6. Start calculates a syncTimeout and calls syncOnce.
7. syncOnce notices the new payload and loads it.
8. For whatever reason, payload retrieval takes a while. Blackholed
signature-fetch attempts in a restricted network [2] are one
example of something that could be slow here. Eventually the
syncTimeout kicks in and signature fetching or other verification
is canceled (which counts as failed verification).
9. Force is true, so syncOnce logs "Forcing past precondition
failures..." but carries on to call apply.
10. apply computes the manifest graph, runs the ClusterOperator
precreation (whose handlers return ContextError() right after
spawning, because the context is already expired), and runs the
main RunGraph (whose handlers do the same).
11. The main RunGraph returns at a slice with a handful of context
errors and nothing else. apply passes this on to
consistentReporter.Errors.
12. consistentReporter.Errors calls summarizeTaskGraphErrors, which
logs "All errors were context errors..." and returns nil to avoid
alarming consistentReporter.Errors (we don't want to put this in
our ClusterVersion status and alarm users).
13. apply returns the summarized nil to syncOnce.
14. syncOnce returns the summarized nil to Start.
15. Start logs "Sync succeeded..." and flops into ReconcilingPayload
for the next round.
16. Start comes into the next round in reconciling mode despite never
having attempted to apply any manifests in its Updating mode. The
manifest graph gets flattened and shuffled and all kinds of
terrible things could happen like the machine-config trying to
roll out the newer machine-os-content and its 4.4 hyperkube binary
before rolling out prerequisites like the 4.4 kube-apiserver
operator.
With this commit, the process is the same through 12, but ends with:
13. apply returns the first context error to syncOnce.
14. syncOnce returns that error to Start.
15. Start backs off and comes in again with a second attempt at
UpdatingPayload.
16. Manifests get pushed in the intended order, and nothing explodes.
The race fixed in this commit could also have come up without timing
out the payload pull/verification, e.g. by having a perfectly slow
ClusterOperator preconditions.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343
|
@wking: No Bugzilla bug is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: This pull request references Bugzilla bug 1843732, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/bugzilla refresh |
|
@wking: This pull request references Bugzilla bug 1843732, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cherrypick release-4.3 |
|
@wking: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/bugzilla refresh |
|
@wking: This pull request references Bugzilla bug 1843732, which is valid. 6 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@wking: All pull requests linked via external trackers have merged: openshift/cluster-version-operator#379. Bugzilla bug 1843732 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: new pull request created: #380 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Cherry-picked from #378 and edited to resolve context conflicts because release-4.4 lacks #318.