Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Dec 4, 2021

Pivoting to the ConditionWithContextFunc pattern, so the various handlers can say "don't bother launching me again, I'm satisfied" or "yikes, this error is terrible, I'm giving up". As an unlikely example, perhaps something out of band shuts down the queue and c.queue.Get() returns true for quit/shutdown, in which case the controller will now respect:

If shutdown = true, the caller should end their goroutine.

A more likely scenario is that syncHandler, sees the tech-preview-ness change, and calls the shutdown function. Before this pull-request, the expected production pathway assumed that shutdownFunction was the context cancel function, which it is in Options.run way over in pkg/start. But that's a somewhat brittle assumption to make in this package, and you can see it break down in the test suite (now that I've pivoted to exercising the whole Run method) with the old code (my test pivot, but not my implementation pivot):

$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
=== RUN   TestTechPreviewChangeStopper
=== RUN   TestTechPreviewChangeStopper/default-with-change-to-tech-preview
I1204 00:12:48.169195   13301 techpreviewchangestopper.go:77] Starting stop-on-techpreview-change controller
I1204 00:12:48.270323   13301 techpreviewchangestopper.go:63] TechPreviewNoUpgrade status changed from false to true, shutting down.
I1204 00:12:50.170559   13301 techpreviewchangestopper.go:92] Shutting down stop-on-techpreview-change controller
--- PASS: TestTechPreviewChangeStopper (2.00s)
    --- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (2.00s)
PASS
ok      github.com/openshift/cluster-version-operator/pkg/featurechangestopper  2.017s

That has a 2s delay after noticing the change where the controller is still running because the test suite hasn't canceled its context. Comparing with the new code:

$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
=== RUN   TestTechPreviewChangeStopper
=== RUN   TestTechPreviewChangeStopper/default-with-change-to-tech-preview
I1204 00:12:57.400950   13405 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller
I1204 00:12:57.501913   13405 techpreviewchangestopper.go:74] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
I1204 00:12:57.502625   13405 techpreviewchangestopper.go:105] Shutting down stop-on-techpreview-change controller
--- PASS: TestTechPreviewChangeStopper (0.10s)
    --- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (0.10s)
PASS
ok      github.com/openshift/cluster-version-operator/pkg/featurechangestopper  0.123s

showing the controller exiting very quickly after noticing the change.

The new way has some fiddly bits to work around queue.Get not taking a Context, and those are described in more detail in the commit messages.

Coming back to the Run-covering tests. @deads2k had recommended against, based on slowness, raciness, and plumbing. But on the plumbing front, the new code is +23 and -28, so I'm dropping a few lines, and nothing I'm doing in the new test code feels particularly fiddly to me. On the slowness front, a few seconds added on to CI jobs seems unlikely to be noticeable, and the CI bindings mean that even if folks are too busy to run those few seconds locally, they'll be run before we merge any code. And on the raciness front, my 2s in the test's WithTimeout is already well under the controller's 30s poller. And based on my local test results, you'd need to crank that down to a tenth of a second or so before you started having race issues, so I have over an order of magnitude of protection.

Exercise more than just the syncHandler helper, so we know that
cancels and the vendored library functions work as expected.
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 4, 2021
Pivoting to the ConditionWithContextFunc pattern [1], so the various
handlers can say "don't bother launching me again, I'm satisfied" or
"yikes, this error is terrible, I'm giving up".  As an unlikely
example, perhaps something out of band shuts down the queue and
c.queue.Get() returns true for quit/shutdown, in which case the
controller will now respect [2]:

  If shutdown = true, the caller should end their goroutine.

A more likely example for this controller is that syncHandler sees the
tech-preview-ness change, and calls the shutdown function.  Before
this commit, the expected production pathway was:

1. syncHandler calls shutdownFn.  If shutdownFn happened to be nil,
   then the controller would continue logging about the change, but
   never actually stopping or doing anything more than spewing logs.
2. shutdownFn (set in Options.run way over in pkg/start) happens to be
   the cancel function for Run's Context.
3. The Context Done() block in Run clears, and we log the controller
   shutdown.
4. As Run exits, the deferred queue Shutdown() cleans up the queue.
5. The queue closing unblocks the queue Get() in processNextWorkItem,
   allowing runWorker to complete.
6. UntilWithContext has its callback complete, and notices that its
   Context is done and returns.
7. The worker goroutine completes.

That's pretty involved, and relies on the brittle shutdownFn == cancel
semantics.  In the test suite, where shutdownFn is unrelated to the
cancel, it resulted in a two second delay while the controller
continued to watch for further changes until the test function
canceled the context:

  old-code$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
  === RUN   TestTechPreviewChangeStopper
  === RUN   TestTechPreviewChangeStopper/default-with-change-to-tech-preview
  I1204 00:12:48.169195   13301 techpreviewchangestopper.go:77] Starting stop-on-techpreview-change controller
  I1204 00:12:48.270323   13301 techpreviewchangestopper.go:63] TechPreviewNoUpgrade status changed from false to true, shutting down.
  I1204 00:12:50.170559   13301 techpreviewchangestopper.go:92] Shutting down stop-on-techpreview-change controller
  --- PASS: TestTechPreviewChangeStopper (2.00s)
      --- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (2.00s)
  PASS
  ok      github.com/openshift/cluster-version-operator/pkg/featurechangestopper  2.017s

With the new code, the ConditionWithContextFunc pattern allows for:

1. syncHandler calls shutdownFn and returns 'true, nil' to signal
   no-more-work.
2. processNextWorkItem passes that done value up the call stack.
3. runWorker passes that done value up the call stack.
4. PollImmediateUntilWithContext returns without error.
5. Run logs its shutdown and returns.
6. As Run exits, the deferred cancel closes childCtx.
7. childCtx closing unblocks the queue-shutdown goroutine (more on
   this goroutine below), which completes.

This avoids blocking Run on someone canceling the context:

  new-code$ go test -v -count=1 -run TestTechPreviewChangeStopper/default-with-change-to-tech-preview ./pkg/featurechangestopper
  === RUN   TestTechPreviewChangeStopper
  === RUN   TestTechPreviewChangeStopper/default-with-change-to-tech-preview
  I1204 00:12:57.400950   13405 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller
  I1204 00:12:57.501913   13405 techpreviewchangestopper.go:74] TechPreviewNoUpgrade was false, but the current feature set is "TechPreviewNoUpgrade"; requesting shutdown.
  I1204 00:12:57.502625   13405 techpreviewchangestopper.go:105] Shutting down stop-on-techpreview-change controller
  --- PASS: TestTechPreviewChangeStopper (0.10s)
      --- PASS: TestTechPreviewChangeStopper/default-with-change-to-tech-preview (0.10s)
  PASS
  ok      github.com/openshift/cluster-version-operator/pkg/featurechangestopper  0.123s

In the "no tech-preview change, context canceled for other reasons"
case, the old and new flows are somewhat different.  With the old
code, the flow started at step 4 of the old production pathway.  With
the new code, we're driving this up from the handler, which is
probably blocked on queue Get().  Because Get() doesn't (yet) take a
Context argument, the only way to break out is to cancel the queue.
So there's a new sentinel goroutine and a childCtx context, so:

1. External actor cancels Run's context.
2. Cancel is inherited by childCtx.
3. childCtx closing unblocks the queue-shutdown goroutine, which shuts
   down the queue and completes.
4. Queue shutdown unblocks Get(), which returns done/shutdown.
5. processNextWorkItem returns 'true, nil' to signal no-more-work.
6. Continues at runWorker and step 6 in the tech-preview-changed flow
   above.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#ConditionWithContextFunc
[2]: https://pkg.go.dev/k8s.io/client-go/util/workqueue#Type.Get
@wking wking force-pushed the tech-preview-tweaks branch from 2956fa7 to ffb9808 Compare December 5, 2021 05:13
Copy link

@vrutkovs vrutkovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, holding for @jottofar review
/lgtm
/hold
/retest

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 6, 2021
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrutkovs, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jottofar
Copy link
Contributor

jottofar commented Dec 6, 2021

LGTM
/unhold
/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 6, 2021
@wking
Copy link
Member Author

wking commented Dec 6, 2021

last three runs all failed in different ways, all of which are unrelated to this PR.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 6, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

last three runs all failed in different ways, all of which are unrelated to this PR.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit a7dedc7 into openshift:master Dec 6, 2021
@wking wking deleted the tech-preview-tweaks branch December 6, 2021 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants