-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1838497: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success #372
Conversation
With this commit, I drop contextIsCancelled in favor of Context.Err(). From the docs [1]: If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error. I dunno why we'd been checking Done() instead, but contextIsCancelled dates back to 961873d (sync: Do config syncing in the background, 2019-01-11, openshift#82). I've also generalized a number of *Cancel* helpers to be *Context* to remind folks that Context.Err() can be DeadlineExceeded as well as Canceled, and the CVO uses both WithCancel and WithTimeout. The new error messages will be either: update context deadline exceeded at 1 of 2 or: update context canceled at 1 of 2 Instead of always claiming: update was cancelled at 1 of 2 [1]: https://golang.org/pkg/context/#Context
… as success
For [1]. Before this commit, you could have a flow like:
1. SyncWorker.Start()
2. External code calls Update(), e.g. after noticing a ClusterVersion
spec.desiredUpdate change.
3. Update sets w.work to point at the desired payload.
4. Start's Until loop is triggered via w.notify.
5. Start calls calculateNext, which notices the change and sets the
state to UpdatingPayload.
6. Start calculates a syncTimeout and calls syncOnce.
7. syncOnce notices the new payload and loads it.
8. For whatever reason, payload retrieval takes a while. Blackholed
signature-fetch attempts in a restricted network [2] are one
example of something that could be slow here. Eventually the
syncTimeout kicks in and signature fetching or other verification
is canceled (which counts as failed verification).
9. Force is true, so syncOnce logs "Forcing past precondition
failures..." but carries on to call apply.
10. apply computes the manifest graph, runs the ClusterOperator
precreation (whose handlers return ContextError() right after
spawning, because the context is already expired), and runs the
main RunGraph (whose handlers do the same).
11. The main RunGraph returns at a slice with a handful of context
errors and nothing else. apply passes this on to
consistentReporter.Errors.
12. consistentReporter.Errors calls summarizeTaskGraphErrors, which
logs "All errors were context errors..." and returns nil to avoid
alarming consistentReporter.Errors (we don't want to put this in
our ClusterVersion status and alarm users).
13. apply returns the summarized nil to syncOnce.
14. syncOnce returns the summarized nil to Start.
15. Start logs "Sync succeeded..." and flops into ReconcilingPayload
for the next round.
16. Start comes into the next round in reconciling mode despite never
having attempted to apply any manifests in its Updating mode. The
manifest graph gets flattened and shuffled and all kinds of
terrible things could happen like the machine-config trying to
roll out the newer machine-os-content and its 4.4 hyperkube binary
before rolling out prerequisites like the 4.4 kube-apiserver
operator.
With this commit, the process is the same through 12, but ends with:
13. apply returns the first context error to syncOnce.
14. syncOnce returns that error to Start.
15. Start backs off and comes in again with a second attempt at
UpdatingPayload.
16. Manifests get pushed in the intended order, and nothing explodes.
The race fixed in this commit could also have come up without timing
out the payload pull/verification, e.g. by having a perfectly slow
ClusterOperator preconditions.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343
918fdbd to
1033fa4
Compare
|
@wking: This pull request references Bugzilla bug 1838497, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Added on the fix for rhbz#1838497, since it is also in the "are these context errors?" space. |
|
/bugzilla refresh |
|
@wking: This pull request references Bugzilla bug 1838497, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/retest |
|
e2e: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/372/pull-ci-openshift-cluster-version-operator-master-e2e-aws/1395/build-log.txt | grep -A6 'Failing tests:' | head -n7
Failing tests:
[sig-builds][Feature:Builds] custom build with buildah being created from new-build should complete build with custom builder image [Suite:openshift/conformance/parallel]
[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
[sig-operator] an end user use OLM can subscribe to the cockroachdb operator [Suite:openshift/conformance/parallel]
One of the alerts is /retest |
|
/test e2e-aws |
|
/test e2e-aws-upgrade |
|
openshift/ci-tools#860 should help update CI. /retest |
| copied.Step = "ApplyResources" | ||
| copied.Fraction = float32(r.done) / float32(r.total) | ||
| if !isCancelledError(err) { | ||
| if !isContextError(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace this with errors.Is ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno if we have support for Unwrap, which is what Is uses. This commit is mostly about adjusting the naming to reflect the current implementation more precisely. Can we punt implementation improvements to future work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which errors.Is we are talking about here. Is it from https://github.com/kubernetes/apimachinery/blob/master/pkg/api/errors/errors.go ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which errors.Is we are talking about here.
LalatenduMohanty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold for @abhinavdahiya to make sure he is fine with the answer to his suggestion
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LalatenduMohanty, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
|
/bugzilla refresh |
|
@wking: This pull request references Bugzilla bug 1838497, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
9 similar comments
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/test e2e-aws |
1 similar comment
|
/test e2e-aws |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
2 similar comments
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
upgrade job seems to have gotten lost. Kick it. /test e2e-aws-upgrade |
|
CI registry/cluster flake. /retest |
|
@wking: All pull requests linked via external trackers have merged: openshift/cluster-version-operator#372. Bugzilla bug 1838497 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cherrypick release-4.5 |
|
@wking: new pull request created: #378 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
With this commit, I drop
contextIsCancelledin favor ofContext.Err(). From the docs:I dunno why we'd been checking
Done()instead, butcontextIsCancelleddates back to 961873d (#82).I've also generalized a number of
*Cancel*helpers to be*Context*to remind folks thatContext.Err()can beDeadlineExceededas well asCanceled, and the CVO uses bothWithCancelandWithTimeout. The new error messages will be either:or:
Instead of always claiming: