feat: replace StatusPoller w/ StatusWatcher #572

karlkfi · 2022-04-09T14:08:16Z

feat: replace StatusPoller w/ StatusWatcher

Add DefaultStatusWatcher that wraps DynamicClient and manages
informers for a set of resource objects.
- Supports two modes: root-scoped & namespace-scoped.
- Root-scoped mode uses root-scoped informers to efficiency and
  performance.
- Namespace-scoped mode uses namespace-scoped informers to
  minimize the permissions needed to run and the size of the
  in-memory object cache.
- Automatic mode selects which mode to use based on whether the
  objects being watched are in one or multiple namespaces.
  This is the default mode, optimizing for performance.
- If CRDs are being watched, the creation/deletion of CRDs can
  cause informers for those custom resources to be created/deleted.
- In namespace-scope mode, if namespaces are being watched, the
  creation/deletion of namespaces can also trigger informers to
  be created/deleted.
- All creates/updates/deletes to CRDs also cause RESTMapper reset.
- Allow pods to be unschedulable for 15s before reporting the
  status as Failed. Any update resets the timer.
Add BlindStatusWatcher for testing and disabling for dry-run.
Add DynamicClusterReader that wraps DynamicClient.
This is now used to look up generated resources
(ex: Deployment > ReplicaSets > Pods).
Add DefaultStatusReader which uses a DelegatingStatusReader to
wrap a list of conventional and specific StatusReaders.
This should make it easier to extend the list of StatusReaders.
Move some pending WaitEvents to be optional in tests, now that
StatusWatcher can resolve their status before the WaitTask starts.
Add a new Thousand Deployments stress test (10x kind nodes)
Add some new logs for easier debugging
Add internal SyncEvent so that apply/delete tasks don't start
until the StatusWatcher has finished initial synchronization.
This helps avoid missing events from actions that happen while
synchronization is incomplete.
Filter optional pending WaitEvents when testing.

BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher
BREAKING CHANGE: Remove PollInterval (obsolete with watcher)

karlkfi · 2022-04-09T14:11:11Z

I still have some tests to write and refactoring to do, but I got all the tests passing locally.

karlkfi · 2022-04-13T02:39:41Z

/test cli-utils-presubmit-master-stress

karlkfi · 2022-04-15T22:57:26Z

Extracted #574

karlkfi · 2022-04-15T23:12:20Z

Extracted #575

karlkfi · 2022-05-06T19:20:15Z

Fixed the flaky test with a new FilterOptionalEvents added throughout the e2e tests.

karlkfi · 2022-05-06T19:23:45Z

Comments addressed. Tests passed. Please take another look.

mortent

/lgtm

Liujingfang1 · 2022-05-06T22:32:37Z

/lgtm

karlkfi · 2022-05-07T00:04:40Z

/hold

Block until we get a release with k8s v1.24 out.

ash2k

Solid work! I've left a few comments. This PR adds quite a few TODOs. Are you planning to address those later?

pkg/kstatus/watcher/default_status_watcher.go

ash2k · 2022-05-09T00:20:56Z

pkg/kstatus/watcher/object_status_reporter.go

+// TODO: Retry with backoff if in namespace-scoped mode, to allow CRDs & namespaces to be created asynchronously
+type ObjectStatusReporter struct {
+	// InformerFactory is used to build informers
+	InformerFactory *DynamicInformerFactory


I suggest we make this an interface for both (future) testing and composability reasons.

Seems like a reasonable enhancement, but we can probably wait to change it to use an interface until we add a 2nd impl. I try not to prematurely add interfaces, if I can avoid it. Sometime having the struct itself is more flexible, and in this case the coupling is pretty tight to the SharedIndexInformer impl details. For example, we know that passing an unstructured example means the callbacks will almost always be unstructured and can be safely cast.

pkg/kstatus/watcher/event_funnel.go

ash2k · 2022-05-09T04:17:09Z

pkg/kstatus/watcher/object_status_reporter.go

+
+		if tombstone, ok := iobj.(cache.DeletedFinalStateUnknown); ok {
+			// Last state unknown. Possibly stale.
+			// TODO: Should we propegate this uncertainty to the caller?


We probably should?

How tho? The callback interface informer uses is pretty terrible, because it uses interface{} objects as arguments, making it impossible to know what type you're going to get as a caller without looking at the implimentation. This particular pattern of wrapping the object with a struct with a getter is especially bad because it's not a standard object type. I didn't even know it was a possible edge case until an e2e test failed randomly.

The only idea I've had is to maybe add an annotation, but that would be equally hard to discover and would require removal by the receiver.

ash2k · 2022-05-09T04:17:46Z

pkg/kstatus/watcher/object_status_reporter.go

+		id := object.UnstructuredToObjMetadata(obj)
+		klog.Warningf("Invalid CRD added: missing group and/or kind: %v", id)
+		// Don't return an error, because this should not inturrupt the task queue.
+		// TODO: Allow non-fatal errors to be reported using a specific error type.


pkg/kstatus/watcher/object_status_reporter.go

ash2k · 2022-05-09T04:23:56Z

pkg/kstatus/watcher/watcher.go

+}
+
+//go:generate stringer -type=RESTScopeStrategy -linecomment
+type RESTScopeStrategy int


nit: This can probably be byte =)

I've never seen a go enum use byte. Int is more extensible. But perhaps you're suggesting that this particular enum will never be extended?

karlkfi · 2022-05-09T20:19:39Z

This PR adds quite a few TODOs. Are you planning to address those later?

Honestly, if the watcher works well, I don't think I'll get back around to the TODOs for a while. The main goal here is performance/responsiveness and reduced memory use, and I don't want perfect to be the enemy of good. This codebase already has a pile of tech debt and the TODO issues I've made keep getting auto-closed, so I figured TODOs in the code were at least a little more persistent and might be discovered again, the next time someone is in this code.

I will address some of your comments tho. Thanks for reviewing!

pkg/kstatus/watcher/event_funnel.go

- Add DefaultStatusWatcher that wraps DynamicClient and manages informers for a set of resource objects. - Supports two modes: root-scoped & namespace-scoped. - Root-scoped mode uses root-scoped informers to efficiency and performance. - Namespace-scoped mode uses namespace-scoped informers to minimize the permissions needed to run and the size of the in-memory object cache. - Automatic mode selects which mode to use based on whether the objects being watched are in one or multiple namespaces. This is the default mode, optimizing for performance. - If CRDs are being watched, the creation/deletion of CRDs can cause informers for those custom resources to be created/deleted. - In namespace-scope mode, if namespaces are being watched, the creation/deletion of namespaces can also trigger informers to be created/deleted. - All creates/updates/deletes to CRDs also cause RESTMapper reset. - Allow pods to be unschedulable for 15s before reporting the status as Failed. Any update resets the timer. - Add BlindStatusWatcher for testing and disabling for dry-run. - Add DynamicClusterReader that wraps DynamicClient. This is now used to look up generated resources (ex: Deployment > ReplicaSets > Pods). - Add DefaultStatusReader which uses a DelegatingStatusReader to wrap a list of conventional and specific StatusReaders. This should make it easier to extend the list of StatusReaders. - Move some pending WaitEvents to be optional in tests, now that StatusWatcher can resolve their status before the WaitTask starts. - Add a new Thousand Deployments stress test (10x kind nodes) - Add some new logs for easier debugging - Add internal SyncEvent so that apply/delete tasks don't start until the StatusWatcher has finished initial synchronization. This helps avoid missing events from actions that happen while synchronization is incomplete. - Filter optional pending WaitEvents when testing. BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher BREAKING CHANGE: Remove PollInterval (obsolete with watcher)

ash2k

/lgtm

Thanks!

mortent

/approve

k8s-ci-robot · 2022-05-11T20:05:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, karlkfi, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mortent]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

karlkfi · 2022-05-12T22:45:43Z

/unhold

v0.30.0 passed all tests in Config Sync, so this this unblocked for merge.

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 9, 2022

k8s-ci-robot requested review from ash2k and seans3 April 9, 2022 14:08

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 9, 2022

karlkfi requested a review from mortent April 9, 2022 14:08

karlkfi force-pushed the karl-status-watcher branch 4 times, most recently from eba7e3d to d760ea7 Compare April 13, 2022 02:24

karlkfi force-pushed the karl-status-watcher branch 7 times, most recently from 192d01d to 0d50a15 Compare April 14, 2022 18:28

karlkfi changed the title ~~[WIP] feat: replace StatusPoller w/ StatusWatcher~~ feat: replace StatusPoller w/ StatusWatcher Apr 14, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2022

karlkfi force-pushed the karl-status-watcher branch 4 times, most recently from e408395 to 5b850da Compare April 15, 2022 19:49

karlkfi changed the title ~~feat: replace StatusPoller w/ StatusWatcher~~ [WIP] feat: replace StatusPoller w/ StatusWatcher Apr 15, 2022

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2022

karlkfi force-pushed the karl-status-watcher branch from 5b850da to 50ec259 Compare April 15, 2022 23:49

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022

karlkfi requested a review from mortent May 6, 2022 19:24

mortent reviewed May 6, 2022

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2022

ash2k reviewed May 9, 2022

View reviewed changes

karlkfi force-pushed the karl-status-watcher branch from c1e5a9f to 7af34af Compare May 9, 2022 21:52

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2022

ash2k reviewed May 10, 2022

View reviewed changes

pkg/kstatus/watcher/event_funnel.go Outdated Show resolved Hide resolved

karlkfi force-pushed the karl-status-watcher branch 2 times, most recently from 49705fa to 2e4c358 Compare May 10, 2022 03:54

karlkfi force-pushed the karl-status-watcher branch from 2e4c358 to c469493 Compare May 10, 2022 17:40

ash2k approved these changes May 11, 2022

View reviewed changes

k8s-ci-robot assigned ash2k May 11, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2022

mortent approved these changes May 11, 2022

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2022

karlkfi mentioned this pull request May 12, 2022

chore: update k8s libs to 1.24 #577

Merged

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2022

karlkfi merged commit 4831b66 into kubernetes-sigs:master May 12, 2022

karlkfi deleted the karl-status-watcher branch May 12, 2022 22:47

ash2k mentioned this pull request Mar 16, 2023

Allow to pass watcher strategy #617

Merged

feat: replace StatusPoller w/ StatusWatcher #572

feat: replace StatusPoller w/ StatusWatcher #572

Uh oh!

Conversation

karlkfi commented Apr 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

karlkfi commented Apr 9, 2022

Uh oh!

karlkfi commented Apr 13, 2022

Uh oh!

karlkfi commented Apr 15, 2022

Uh oh!

karlkfi commented Apr 15, 2022

Uh oh!

karlkfi commented May 6, 2022

Uh oh!

karlkfi commented May 6, 2022

Uh oh!

mortent left a comment

Choose a reason for hiding this comment

Uh oh!

Liujingfang1 commented May 6, 2022

Uh oh!

karlkfi commented May 7, 2022

Uh oh!

ash2k left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ash2k May 9, 2022

Choose a reason for hiding this comment

Uh oh!

karlkfi May 9, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ash2k May 9, 2022

Choose a reason for hiding this comment

Uh oh!

karlkfi May 9, 2022

Choose a reason for hiding this comment

Uh oh!

ash2k May 9, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ash2k May 9, 2022

Choose a reason for hiding this comment

Uh oh!

karlkfi May 9, 2022

Choose a reason for hiding this comment

Uh oh!

karlkfi commented May 9, 2022

Uh oh!

Uh oh!

ash2k left a comment

Choose a reason for hiding this comment

Uh oh!

mortent left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented May 11, 2022

Uh oh!

karlkfi commented May 12, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

karlkfi commented Apr 9, 2022 •

edited

Loading