pkg/featurechangestopper: Refactor to allow declaring work completion #704
Conversation
Exercise more than just the syncHandler helper, so we know that cancels and the vendored library functions work as expected.
Pivoting to the ConditionWithContextFunc pattern [1], so the various
handlers can say "don't bother launching me again, I'm satisfied" or
"yikes, this error is terrible, I'm giving up". As an unlikely
example, perhaps something out of band shuts down the queue and
c.queue.Get() returns true for quit/shutdown, in which case the
controller will now respect [2]:
If shutdown = true, the caller should end their goroutine.
A more likely example for this controller is that syncHandler sees the
tech-preview-ness change, and calls the shutdown function. Before
this commit, the expected production pathway was:
1. syncHandler calls shutdownFn. If shutdownFn happened to be nil,
   the controller would continue logging about the change, but would
   never actually stop or do anything more than spew logs.
2. shutdownFn (set in Options.run way over in pkg/start) happens to be
the cancel function for Run's Context.
3. The Context Done() block in Run clears, and we log the controller
shutdown.
4. As Run exits, the deferred queue Shutdown() cleans up the queue.
5. The queue closing unblocks the queue Get() in processNextWorkItem,
allowing runWorker to complete.
6. UntilWithContext has its callback complete, and notices that its
Context is done and returns.
7. The worker goroutine completes.
That's pretty involved, and relies on the brittle shutdownFn == cancel
semantics. In the test suite, where shutdownFn is unrelated to the
cancel, it resulted in a two second delay while the controller
continued to watch for further changes until the test function
canceled the context:
old-code$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
=== RUN TestTechPreviewChangeStopper
=== RUN TestTechPreviewChangeStopper/default-with-change-to-tech-preview
I1204 00:12:48.169195 13301 techpreviewchangestopper.go:77] Starting stop-on-techpreview-change controller
I1204 00:12:48.270323 13301 techpreviewchangestopper.go:63] TechPreviewNoUpgrade status changed from false to true, shutting down.
I1204 00:12:50.170559 13301 techpreviewchangestopper.go:92] Shutting down stop-on-techpreview-change controller
--- PASS: TestTechPreviewChangeStopper (2.00s)
--- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (2.00s)
PASS
ok github.com/openshift/cluster-version-operator/pkg/featurechangestopper 2.017s
With the new code, the ConditionWithContextFunc pattern allows for:
1. syncHandler calls shutdownFn and returns 'true, nil' to signal
no-more-work.
2. processNextWorkItem passes that done value up the call stack.
3. runWorker passes that done value up the call stack.
4. PollImmediateUntilWithContext returns without error.
5. Run logs its shutdown and returns.
6. As Run exits, the deferred cancel closes childCtx.
7. childCtx closing unblocks the queue-shutdown goroutine (more on
this goroutine below), which completes.
This avoids blocking Run on someone canceling the context:
new-code$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
=== RUN TestTechPreviewChangeStopper
=== RUN TestTechPreviewChangeStopper/default-with-change-to-tech-preview
I1204 00:12:57.400950 13405 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller
I1204 00:12:57.501913 13405 techpreviewchangestopper.go:74] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I1204 00:12:57.502625 13405 techpreviewchangestopper.go:105] Shutting down stop-on-techpreview-change controller
--- PASS: TestTechPreviewChangeStopper (0.10s)
--- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (0.10s)
PASS
ok github.com/openshift/cluster-version-operator/pkg/featurechangestopper 0.123s
In the "no tech-preview change, context canceled for other reasons"
case, the old and new flows are somewhat different. With the old
code, the flow started at step 4 of the old production pathway. With
the new code, we're driving this up from the handler, which is
probably blocked on queue Get(). Because Get() doesn't (yet) take a
Context argument, the only way to break out is to shut down the queue.
So there's now a sentinel goroutine and a childCtx context:
1. External actor cancels Run's context.
2. Cancel is inherited by childCtx.
3. childCtx closing unblocks the queue-shutdown goroutine, which shuts
down the queue and completes.
4. Queue shutdown unblocks Get(), which returns done/shutdown.
5. processNextWorkItem returns 'true, nil' to signal no-more-work.
6. Continues at runWorker and step 6 in the tech-preview-changed flow
above.
[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#ConditionWithContextFunc
[2]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#Type.Get
vrutkovs left a comment:
Looks good to me, holding for @jottofar review
/lgtm
/hold
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking
LGTM
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade
Pivoting to the ConditionWithContextFunc pattern, so the various handlers can say "don't bother launching me again, I'm satisfied" or "yikes, this error is terrible, I'm giving up". As an unlikely example, perhaps something out of band shuts down the queue and c.queue.Get() returns true for quit/shutdown, in which case the controller will now respect: "If shutdown = true, the caller should end their goroutine."

A more likely scenario is that syncHandler sees the tech-preview-ness change and calls the shutdown function. Before this pull-request, the expected production pathway assumed that shutdownFunction was the context cancel function, which it is in Options.run way over in pkg/start. But that's a somewhat brittle assumption to make in this package, and you can see it break down in the test suite (now that I've pivoted to exercising the whole Run method) with the old code (my test pivot, but not my implementation pivot): the first test run above has a 2s delay after noticing the change, where the controller is still running because the test suite hasn't canceled its context. The second run, with the new code, shows the controller exiting very quickly after noticing the change.

The new way has some fiddly bits to work around queue.Get not taking a Context, and those are described in more detail in the commit messages.

Coming back to the Run-covering tests: @deads2k had recommended against them, based on slowness, raciness, and plumbing. But on the plumbing front, the new code is +23 and -28, so I'm dropping a few lines, and nothing in the new test code feels particularly fiddly to me. On the slowness front, a few seconds added to CI jobs seems unlikely to be noticeable, and the CI bindings mean that even if folks are too busy to run those few seconds locally, they'll run before we merge any code. And on the raciness front, my 2s in the test's WithTimeout is already well under the controller's 30s poller. Based on my local test results, you'd need to crank that down to a tenth of a second or so before race issues started, so I have over an order of magnitude of protection.