Skip to content

Conversation

@karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Apr 9, 2022

feat: replace StatusPoller w/ StatusWatcher

  • Add DefaultStatusWatcher that wraps DynamicClient and manages
    informers for a set of resource objects.
    • Supports two modes: root-scoped & namespace-scoped.
    • Root-scoped mode uses root-scoped informers to efficiency and
      performance.
    • Namespace-scoped mode uses namespace-scoped informers to
      minimize the permissions needed to run and the size of the
      in-memory object cache.
    • Automatic mode selects which mode to use based on whether the
      objects being watched are in one or multiple namespaces.
      This is the default mode, optimizing for performance.
    • If CRDs are being watched, the creation/deletion of CRDs can
      cause informers for those custom resources to be created/deleted.
    • In namespace-scope mode, if namespaces are being watched, the
      creation/deletion of namespaces can also trigger informers to
      be created/deleted.
    • All creates/updates/deletes to CRDs also cause RESTMapper reset.
    • Allow pods to be unschedulable for 15s before reporting the
      status as Failed. Any update resets the timer.
  • Add BlindStatusWatcher for testing and disabling for dry-run.
  • Add DynamicClusterReader that wraps DynamicClient.
    This is now used to look up generated resources
    (ex: Deployment > ReplicaSets > Pods).
  • Add DefaultStatusReader which uses a DelegatingStatusReader to
    wrap a list of conventional and specific StatusReaders.
    This should make it easier to extend the list of StatusReaders.
  • Move some pending WaitEvents to be optional in tests, now that
    StatusWatcher can resolve their status before the WaitTask starts.
  • Add a new Thousand Deployments stress test (10x kind nodes)
  • Add some new logs for easier debugging
  • Add internal SyncEvent so that apply/delete tasks don't start
    until the StatusWatcher has finished initial synchronization.
    This helps avoid missing events from actions that happen while
    synchronization is incomplete.
  • Filter optional pending WaitEvents when testing.

BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher
BREAKING CHANGE: Remove PollInterval (obsolete with watcher)

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 9, 2022
@k8s-ci-robot k8s-ci-robot requested review from ash2k and seans3 April 9, 2022 14:08
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 9, 2022
@karlkfi karlkfi requested a review from mortent April 9, 2022 14:08
@karlkfi
Copy link
Contributor Author

karlkfi commented Apr 9, 2022

I still have some tests to write and refactoring to do, but I got all the tests passing locally.

@karlkfi karlkfi force-pushed the karl-status-watcher branch 4 times, most recently from eba7e3d to d760ea7 Compare April 13, 2022 02:24
@karlkfi
Copy link
Contributor Author

karlkfi commented Apr 13, 2022

/test cli-utils-presubmit-master-stress

@karlkfi karlkfi force-pushed the karl-status-watcher branch 7 times, most recently from 192d01d to 0d50a15 Compare April 14, 2022 18:28
@karlkfi karlkfi changed the title [WIP] feat: replace StatusPoller w/ StatusWatcher feat: replace StatusPoller w/ StatusWatcher Apr 14, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 14, 2022
@karlkfi karlkfi force-pushed the karl-status-watcher branch 4 times, most recently from e408395 to 5b850da Compare April 15, 2022 19:49
@karlkfi
Copy link
Contributor Author

karlkfi commented Apr 15, 2022

Extracted #574

@karlkfi
Copy link
Contributor Author

karlkfi commented Apr 15, 2022

Extracted #575

@karlkfi karlkfi changed the title feat: replace StatusPoller w/ StatusWatcher [WIP] feat: replace StatusPoller w/ StatusWatcher Apr 15, 2022
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2022
@karlkfi karlkfi force-pushed the karl-status-watcher branch from 5b850da to 50ec259 Compare April 15, 2022 23:49
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented May 6, 2022

Fixed the flaky test with a new FilterOptionalEvents added throughout the e2e tests.

@karlkfi
Copy link
Contributor Author

karlkfi commented May 6, 2022

Comments addressed. Tests passed. Please take another look.

@karlkfi karlkfi requested a review from mortent May 6, 2022 19:24
Copy link
Member

@mortent mortent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022
@Liujingfang1
Copy link
Contributor

/lgtm

@karlkfi
Copy link
Contributor Author

karlkfi commented May 7, 2022

/hold

Block until we get a release with k8s v1.24 out.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2022
Copy link
Member

@ash2k ash2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work! I've left a few comments. This PR adds quite a few TODOs. Are you planning to address those later?

// TODO: Retry with backoff if in namespace-scoped mode, to allow CRDs & namespaces to be created asynchronously
type ObjectStatusReporter struct {
// InformerFactory is used to build informers
InformerFactory *DynamicInformerFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we make this an interface for both (future) testing and composability reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable enhancement, but we can probably wait to change it to use an interface until we add a 2nd impl. I try not to prematurely add interfaces, if I can avoid it. Sometime having the struct itself is more flexible, and in this case the coupling is pretty tight to the SharedIndexInformer impl details. For example, we know that passing an unstructured example means the callbacks will almost always be unstructured and can be safely cast.


if tombstone, ok := iobj.(cache.DeletedFinalStateUnknown); ok {
// Last state unknown. Possibly stale.
// TODO: Should we propegate this uncertainty to the caller?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How tho? The callback interface informer uses is pretty terrible, because it uses interface{} objects as arguments, making it impossible to know what type you're going to get as a caller without looking at the implimentation. This particular pattern of wrapping the object with a struct with a getter is especially bad because it's not a standard object type. I didn't even know it was a possible edge case until an e2e test failed randomly.

The only idea I've had is to maybe add an annotation, but that would be equally hard to discover and would require removal by the receiver.

id := object.UnstructuredToObjMetadata(obj)
klog.Warningf("Invalid CRD added: missing group and/or kind: %v", id)
// Don't return an error, because this should not inturrupt the task queue.
// TODO: Allow non-fatal errors to be reported using a specific error type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

//go:generate stringer -type=RESTScopeStrategy -linecomment
type RESTScopeStrategy int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This can probably be byte =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen a go enum use byte. Int is more extensible. But perhaps you're suggesting that this particular enum will never be extended?

@karlkfi
Copy link
Contributor Author

karlkfi commented May 9, 2022

This PR adds quite a few TODOs. Are you planning to address those later?

Honestly, if the watcher works well, I don't think I'll get back around to the TODOs for a while. The main goal here is performance/responsiveness and reduced memory use, and I don't want perfect to be the enemy of good. This codebase already has a pile of tech debt and the TODO issues I've made keep getting auto-closed, so I figured TODOs in the code were at least a little more persistent and might be discovered again, the next time someone is in this code.

I will address some of your comments tho. Thanks for reviewing!

@karlkfi karlkfi force-pushed the karl-status-watcher branch from c1e5a9f to 7af34af Compare May 9, 2022 21:52
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2022
@karlkfi karlkfi force-pushed the karl-status-watcher branch 2 times, most recently from 49705fa to 2e4c358 Compare May 10, 2022 03:54
- Add DefaultStatusWatcher that wraps DynamicClient and manages
  informers for a set of resource objects.
  - Supports two modes: root-scoped & namespace-scoped.
  - Root-scoped mode uses root-scoped informers to efficiency and
    performance.
  - Namespace-scoped mode uses namespace-scoped informers to
    minimize the permissions needed to run and the size of the
    in-memory object cache.
  - Automatic mode selects which mode to use based on whether the
    objects being watched are in one or multiple namespaces.
    This is the default mode, optimizing for performance.
  - If CRDs are being watched, the creation/deletion of CRDs can
    cause informers for those custom resources to be created/deleted.
  - In namespace-scope mode, if namespaces are being watched, the
    creation/deletion of namespaces can also trigger informers to
    be created/deleted.
  - All creates/updates/deletes to CRDs also cause RESTMapper reset.
  - Allow pods to be unschedulable for 15s before reporting the
    status as Failed. Any update resets the timer.
- Add BlindStatusWatcher for testing and disabling for dry-run.
- Add DynamicClusterReader that wraps DynamicClient.
  This is now used to look up generated resources
  (ex: Deployment > ReplicaSets > Pods).
- Add DefaultStatusReader which uses a DelegatingStatusReader to
  wrap a list of conventional and specific StatusReaders.
  This should make it easier to extend the list of StatusReaders.
- Move some pending WaitEvents to be optional in tests, now that
  StatusWatcher can resolve their status before the WaitTask starts.
- Add a new Thousand Deployments stress test (10x kind nodes)
- Add some new logs for easier debugging
- Add internal SyncEvent so that apply/delete tasks don't start
  until the StatusWatcher has finished initial synchronization.
  This helps avoid missing events from actions that happen while
  synchronization is incomplete.
- Filter optional pending WaitEvents when testing.

BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher
BREAKING CHANGE: Remove PollInterval (obsolete with watcher)
@karlkfi karlkfi force-pushed the karl-status-watcher branch from 2e4c358 to c469493 Compare May 10, 2022 17:40
Copy link
Member

@ash2k ash2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2022
Copy link
Member

@mortent mortent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, karlkfi, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented May 12, 2022

/unhold

v0.30.0 passed all tests in Config Sync, so this this unblocked for merge.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2022
@karlkfi karlkfi merged commit 4831b66 into kubernetes-sigs:master May 12, 2022
@karlkfi karlkfi deleted the karl-status-watcher branch May 12, 2022 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants