
Conversation

@yuqi-zhang
Contributor

Change existing watch workflow to an informer instead, as watches
are prone to breaking and restarting the MCD. The Run() function
within the MCD is changed from an infinite loop to a blocking call,
and updates are handled by callbacks via the informer.
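
In sketch form (illustrative names, not the exact MCD code), the new shape of Run() is roughly:

  package daemon

  import (
      "fmt"
      "time"

      "k8s.io/client-go/informers"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/cache"
  )

  // Run registers callbacks on a node informer, syncs the cache, then blocks
  // on the stop channel instead of looping over a raw watch.
  func Run(client kubernetes.Interface, stopCh <-chan struct{}) error {
      factory := informers.NewSharedInformerFactory(client, 30*time.Second)
      nodeInformer := factory.Core().V1().Nodes()

      nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
          UpdateFunc: func(old, new interface{}) {
              // handleUpdate(old, new) // hypothetical update handler
          },
      })

      factory.Start(stopCh)
      if !cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced) {
          return fmt.Errorf("failed to sync node informer cache")
      }

      // The informer machinery transparently survives broken watches by
      // re-listing and re-watching, so Run just blocks until told to stop.
      <-stopCh
      return nil
  }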

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2018
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 15, 2018
@yuqi-zhang
Contributor Author

Marking WIP as the newest version isn't fully tested pending other issues.

cc/ @jlebon @sdemos @ashcrow

@ashcrow
Member

ashcrow commented Oct 16, 2018

Makes sense so far. I recommend rebasing though as I think there has been a commit or two in the areas you're updating.

@abhinavdahiya
Contributor

You are using an informer that reacts to changes from all the nodes. Please only watch changes to the node the daemon was scheduled on.

@sdemos
Contributor

sdemos commented Oct 16, 2018

@abhinavdahiya do you have a good example of an informer that only watches one instance of a resource? I wasn't sure if that was possible. Right now the update callback checks whether it's the node it cares about and returns immediately if it's not, which I thought would be enough.

@sdemos sdemos mentioned this pull request Oct 18, 2018
@yuqi-zhang
Contributor Author

yuqi-zhang commented Oct 18, 2018

Rebased.
@abhinavdahiya that's a good point, but I'm not sure what the proper method to do so would be. Am I able to just set up an informer on a single node instead of KubeInformerFactory.Core().V1().Nodes()? Right now the handler immediately checks for the node, so there shouldn't be a significant performance penalty.
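
One possible shape for such a single-node informer, sketched with client-go's filtered factory (assumed names throughout; NewFilteredSharedInformerFactory may not be present in the vendored client-go 8.0.0):

  package daemon

  import (
      "time"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/fields"
      "k8s.io/client-go/informers"
      "k8s.io/client-go/kubernetes"
  )

  // newNodeScopedFactory returns an informer factory whose list/watch calls
  // are restricted to the single node this daemon is scheduled on.
  func newNodeScopedFactory(client kubernetes.Interface, nodeName string) informers.SharedInformerFactory {
      return informers.NewFilteredSharedInformerFactory(
          client,
          30*time.Second,
          metav1.NamespaceAll,
          func(opts *metav1.ListOptions) {
              // Only list/watch the node with this name.
              opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
          },
      )
  }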

@ashcrow
Member

ashcrow commented Oct 19, 2018

/retest

@yuqi-zhang yuqi-zhang changed the title from "WIP: daemon: use informers to check for updates" to "daemon: use informers to check for updates" Oct 22, 2018
@yuqi-zhang
Contributor Author

Added some fixes and removed WIP:

  • instead of panicking, reboot the node after setting degraded. This makes it consistent with other failures (the node will sleep post-reboot)
  • add a check to continue sleeping upon updates while we're degraded, to preserve the previous run's logs (with the actual failures)
  • stop checking for node updates and deletions in the informer (these cause race conditions when an update has not fully applied, and we handle those cases elsewhere regardless)

It is possible to filter which nodes to watch, albeit in a roundabout way that adds complexity. I feel that the check in handleUpdate is working and doesn't cause performance issues that would warrant a custom informer. Note that there is the option to use UntilWithSync (https://github.com/kubernetes/client-go/blob/master/tools/watch/until.go#L98) sometime in the future (basically a custom informer that mimics our old workflow without causing issues), but that is in v9.0.0 (we use 8.0.0: https://github.com/openshift/machine-config-operator/blob/master/Gopkg.toml#L60), so I personally think the fix here should work for now.

I've tested with updates to configs, as well as leaving it running for 6+ hours; it hasn't broken or hot-looped so far. I have the latest image built at docker.io/jerzhang/origin-machine-config-daemon:v3.11.0-178-g936c032f if you want to try it yourself.
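
For reference, a rough sketch of the UntilWithSync option mentioned above (client-go v9+; function names are assumptions, and the annotation keys are taken from the logs later in this thread):

  package daemon

  import (
      "context"

      corev1 "k8s.io/api/core/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/fields"
      "k8s.io/apimachinery/pkg/watch"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/cache"
      watchtools "k8s.io/client-go/tools/watch"
  )

  // waitForConfigMismatch blocks until the node's currentConfig and
  // desiredConfig annotations diverge, mimicking the old watch workflow
  // on top of an informer-backed watcher.
  func waitForConfigMismatch(ctx context.Context, client kubernetes.Interface, nodeName string) error {
      lw := cache.NewListWatchFromClient(
          client.CoreV1().RESTClient(), "nodes", metav1.NamespaceAll,
          fields.OneTermEqualSelector("metadata.name", nodeName),
      )
      _, err := watchtools.UntilWithSync(ctx, lw, &corev1.Node{}, nil,
          func(ev watch.Event) (bool, error) {
              node, ok := ev.Object.(*corev1.Node)
              if !ok {
                  return false, nil
              }
              a := node.Annotations
              // Stop waiting once the annotations no longer match.
              return a["machineconfiguration.openshift.io/currentConfig"] !=
                  a["machineconfiguration.openshift.io/desiredConfig"], nil
          })
      return err
  }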

@yuqi-zhang
Contributor Author

@ashcrow @sdemos PTAL

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2018
Member

No update needed, but I'm curious: was there a specific reason to do the informer setup after the Daemon creation? Or is it more that it reads better to you this way, rather than:

return &Daemon{
        ....
        nodeLister:       nodeInformer.Lister(),
        nodeListerSynced: nodeInformer.Informer().HasSynced,
}, nil

Contributor Author

I noticed that elsewhere we split off the Lister, etc. (e.g. https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L105). I can change it to your suggested method if you'd like.

Member

No, this is fine. I was just curious 😄

Member

I think a comment here would be helpful. Seeing the channel being closed seems odd unless you've gone through the informer docs.

@ashcrow ashcrow left a comment
Member

One question, and one nit. Neither are blockers.

@ashcrow
Member

ashcrow commented Oct 23, 2018

Will let @sdemos and/or @jlebon and/or @kikisdeliveryservice review and pull the merge lever if they also 👍.

Member

Minor: indentation looks off here.

Contributor Author

will fix

Member

Why change the run() and process() split here? For one thing, the split makes the error handling a bit less painful in all the calls (i.e. we're just marking degraded on error in one spot instead of on all the exit paths).

Contributor Author

My initial thought was that process() was more of a loop processing commands, whereas now it just blocks at <-stop and callbacks happen instead. This is also somewhat consistent with how we use informers elsewhere in this repo. I have no strong feelings either way, so I can change it if you feel that makes more sense.

Member

IMHO I think it's ok that Run and process are not split in this PR since the work in #126 was going to have to change process and Run significantly anyway.

Member

Do I understand correctly that there's nothing preventing this from running at the same time as the initial triggerUpdate()? Or do callbacks not actually occur until after cache.WaitForCacheSync()?

Contributor Author

Good point. I haven't seen this cause an issue/race condition (even with repeated failed updates, when it should have triggered), but let me double-check.

Member

I think I'd rather we stop watching and exit on error conditions in this callback function and instead only have the "sleep on Degraded" logic at the beginning of the daemon.

@yuqi-zhang yuqi-zhang Oct 23, 2018
Contributor Author

Sure, should I maybe instead not start the informer if we are degraded? So we never even get to this place?

@jlebon jlebon Oct 23, 2018
Member

Definitely! (I thought that was already the case -- when are callbacks actually active? Is it at nodeInformer.Informer().AddEventHandler() time? In that case, yeah we should move that to after checking if the node is even degraded.)

But also in this function if the node is degraded, let's just stop watching for updates and exit?

@yuqi-zhang
Contributor Author

Added a fixup to move the state checking before the informer starts, which degrades early (and no longer starts the informer) and prevents race conditions between multiple updates. Will squash if it makes sense. Tested.

@jlebon PTAL
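
In sketch form, the reordering described above amounts to something like this (state names, logging, and the surrounding wiring are assumed, not the actual MCD code):

  // Before wiring up the informer, check the persisted node state:
  if state == stateDegraded { // hypothetical state value from CheckStateOnBoot
      // Degrade early: never start the informer, just sleep forever so the
      // logs from the previous failed run stay available for inspection.
      glog.Info("node is degraded; not starting the informer")
      select {}
  }
  // Only now start the informer and begin receiving update callbacks.
  factory.Start(stopCh)
  cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced)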

@sdemos
Contributor

sdemos commented Oct 24, 2018

Changes all make sense to me, and the logic looks good. Leaving it to @jlebon to confirm the changes he requested.

Member

Minor: at this point, we might as well move these to runStartCmd before we call CheckStateOnBoot?

Contributor Author

Good point, will fix

Member

IIUC, there's a race condition here if the node annotation is modified after CheckStateOnBoot(), but before cache is synced and we can start receiving callbacks, right?

Here's an idea: we spawn another goroutine that just listens for notifications on a channel before doing triggerUpdate(). Then in handleUpdate we send a notification. But crucially, we also send a notification after WaitForCacheSync() is done if annotations don't match. That way updates are still always serialized, and we close the race condition.
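
Sketched out, the idea could look like this (dn and triggerUpdate are stand-ins for the real daemon code, not actual identifiers):

  // Inside the daemon's Run(), before starting the informer:
  updateCh := make(chan struct{}, 1)

  // The worker goroutine is the only caller of triggerUpdate, so updates
  // stay serialized.
  go func() {
      for range updateCh {
          dn.triggerUpdate() // hypothetical; stands in for the real update path
      }
  }()

  // notify queues at most one pending notification and never blocks.
  notify := func() {
      select {
      case updateCh <- struct{}{}:
      default:
      }
  }

  // In handleUpdate: call notify() instead of triggerUpdate() directly.
  // After cache.WaitForCacheSync(...): if currentConfig != desiredConfig,
  // call notify() once more to cover changes that landed before callbacks
  // were active, closing the startup race.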

Contributor Author

Yes, I thought about this as well, but while testing it seems that the notifications are still delivered when the informer actually starts listening (maybe I'm misunderstanding; I used to have an addNode callback as well, which triggered even though the add happened before the informer started listening).

Also, during testing, I found that the callback fires quite frequently for the node (once every couple of seconds), so updates are generally caught eventually (since we only care about differences in CC vs DC anyway).

@yuqi-zhang
Contributor Author

yuqi-zhang commented Oct 25, 2018

@jlebon pushed a minor fixup

As discussed, it is probably better to handle this via a new goroutine to ensure no updates are lost, but for the time being this is unlikely to cause any race conditions. Instead, I've PR'ed a fix for the other race condition at #145.

We will likely improve this behaviour as follow-up work. If you think this makes sense, I will squash the commits.

Thanks for the reviews!

Edit: will rebase #145 as well when squashing.

@jlebon
Member

jlebon commented Oct 25, 2018

Yup sure, SGTM. To be fair, a race existed even before this PR (i.e. in waitUntilUpdate between comparing dc to cc and actually starting to watch for annotation updates).

@ashcrow
Member

ashcrow commented Oct 25, 2018

/test images

@ashcrow
Member

ashcrow commented Oct 26, 2018

/test e2e-aws

@ashcrow
Member

ashcrow commented Oct 26, 2018

@yuqi-zhang needs a rebase.

Change existing watch workflow to an informer instead, as watches
are prone to breaking and restarting the MCD. The Run() function
within the MCD is changed from an infinite loop to a blocking call,
and updates are handled by callbacks via the informer.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
@yuqi-zhang
Contributor Author

Rebased, squashed, and re-tested locally

@jlebon
Member

jlebon commented Oct 26, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlebon, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2018
@openshift-merge-robot openshift-merge-robot merged commit 2d42cab into openshift:master Oct 26, 2018
@yuqi-zhang yuqi-zhang deleted the fix-mcd-node-watch branch August 26, 2019 15:08
wking added a commit to wking/machine-config-operator that referenced this pull request May 14, 2020
AddFunc docs in [1,2].  We've been setting an UpdateFunc on this
informer since we pivoted to node informers in d67e633 (daemon: use
informers to check for updates, 2018-10-26, openshift#130).  But we need to
listen for add events to catch situations where the node's
desiredConfig is altered before the informer's initial List call.
This avoids situations like [3,4]:

  I0514 03:27:54.773485    4242 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:42:37.392240    4242 request.go:1068] Response Body: {"kind":"Node","apiVersion":"v1","metadata":{"name":"ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/desiredConfig":"rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44"...
  ...
  I0514 03:43:18.111048    1897 update.go:1346] Starting to manage node: ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:43:21.844234    1897 request.go:1068] Response Body: {"kind":"NodeList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/nodes","resourceVersion":"28107"},"items":[{"metadata":{"name":"ci-op-9gpgc-m-0.c.openshift-gce-devel-ci.internal"..."machineconfiguration.openshift.io/currentConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9","machineconfiguration.openshift.io/desiredConfig":"rendered-master-9f55fba0ad22a8069d04b6ddf87b8ed9"...
  ...
  I0514 03:48:15.568762    1897 daemon.go:1119] Updating Node ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal
  ...
  I0514 03:48:16.569041    1897 daemon.go:368] Started syncing node "ci-op-9gpgc-w-d-hdjg2.c.openshift-gce-devel-ci.internal" (2020-05-14 03:48:16.569022688 +0>
  I0514 03:48:16.590279    1897 daemon.go:767] Current config: rendered-worker-a048da8ad1c08dce6dc6aa3c1101fb44
  I0514 03:48:16.590302    1897 daemon.go:768] Desired config: rendered-node-log-level-14137156-e501-4d3b-9a93-7fff577b45e5-a048da8ad1c08dce6dc6aa3c1101fb44

That was:

1. Somebody sets desiredConfig to rendered-node-log-level-... while
   the old MCD (PID 4242) is still running.
2. New MCD (PID 1897) comes up, and fires up its Node informer.
3. Informer does a List while starting up, and sees the node with
   desiredConfig set to rendered-node-log-level-...  Goes to call
   AddFunc, but until this commit we weren't setting it, so MCD does
   nothing.
4. Informer sets up its Watch.
5. Time passes...
6. Something else changes the managed Node (kubelet heartbeat?  Who cares?), Watch trips, UpdateFunc called.
7. MCD notices the desiredConfig and begins applying rendered-node-log-level-...

With this commit, we'll react at step 3 instead of 7, hopefully fixing
[5].

[1]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs
[2]: https://pkg.go.dev/k8s.io/client-go/tools/cache?tab=doc#ResourceEventHandlerFuncs.OnAdd
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1730/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2250/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-pq5wv_machine-config-daemon.log
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1834895
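
In sketch form, the fix registers an AddFunc alongside the existing UpdateFunc (handler names here are illustrative, not the actual MCD identifiers):

  nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      // AddFunc fires once per object in the informer's initial List, so a
      // desiredConfig set before this daemon started is seen at step 3
      // above instead of waiting for a later Watch event.
      AddFunc: func(obj interface{}) { dn.handleNodeEvent(obj) }, // hypothetical handler
      UpdateFunc: func(old, new interface{}) { dn.handleNodeEvent(new) },
  })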
osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021
BUG 1697644: operator: use openshift-config/pull-secret to fetch the pull secret