
Conversation

@cgwalters (Member)

Avoid e.g. rendering a MC without the kubelet config.

Closes: #338

cgwalters force-pushed the controller-phased-startup branch from b4b7de1 to 37fac58 on February 5, 2019 at 19:02
@jlebon (Member)

jlebon commented Feb 5, 2019

This looks plausible to me. Will let others have a look and test it out.
/approve

openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 5, 2019
cgwalters force-pushed the controller-phased-startup branch from 37fac58 to 4c6732e on February 5, 2019 at 19:22
openshift-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Feb 5, 2019
cgwalters force-pushed the controller-phased-startup branch from 4c6732e to 0be8861 on February 5, 2019 at 21:10
@runcom (Member)

runcom commented Feb 5, 2019

Approving code-wise, but I won't be able to test it on a cluster until tomorrow, and tests don't seem to be running anyway :(

/approve

@openshift-merge-robot (Contributor)

/retest

@openshift-merge-robot (Contributor)

/retest

@cgwalters (Member Author)

/test e2e-aws-op

cgwalters force-pushed the controller-phased-startup branch from 0be8861 to 646fa63 on February 6, 2019 at 23:09
@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

/lgtm

openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 7, 2019
@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

tests are failing because of:

time="2019-02-07T11:12:12Z" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"monitoring\" (195 of 264)"
time="2019-02-07T11:18:27Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:22:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:36:04Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

monitoring flake again

/retest

@jlebon (Member)

jlebon commented Feb 7, 2019

/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom (Member)

runcom commented Feb 7, 2019

wtf, still the very same issue with the monitoring operator; reported it to them:

Sergiusz Urbaniak [3 hours ago]
this smells like a communication issue with the API server :thinking_face:

cgwalters force-pushed the controller-phased-startup branch from 646fa63 to 26085f3 on February 7, 2019 at 20:12
@openshift-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@ashcrow (Member)

ashcrow commented Feb 8, 2019

/retest

@runcom (Member)

runcom commented Feb 8, 2019

/retest

@runcom (Member)

runcom commented Feb 8, 2019

not sure why the installer keeps failing on the Prometheus operator here (from the logs)

/retest

@cgwalters (Member Author)

Looking at the logs here, we're still stalling out in the MCC. So clearly this code isn't working as I thought.

I need to try again to build a local release payload and test locally to more easily debug.

@cgwalters (Member Author)

/hold

openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 8, 2019
@runcom (Member)

runcom commented Feb 8, 2019

OK, looking at this PR in my editor again, I see there are many code paths in syncKubeletConfig, and upward in the call stack, that may return early and leave the WaitGroup undone, thus deadlocking...
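
To make that failure mode concrete, here is a minimal, self-contained sketch of the deadlock, reusing the firstSync/readyFlag names from the diff below; the controller type and the driver around it are simplified assumptions, not the actual MCO code:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type controller struct {
	firstSync sync.Once
	readyFlag sync.WaitGroup
}

func (c *controller) syncKubeletConfig(fail bool) error {
	if fail {
		// Early return: firstSync.Do is never reached, so readyFlag.Done()
		// never runs and anything blocked in readyFlag.Wait() hangs.
		return errors.New("early exit before signalling readiness")
	}
	c.firstSync.Do(c.readyFlag.Done)
	return nil
}

func main() {
	c := &controller{}
	c.readyFlag.Add(1)

	// Every sync takes an early-return path in this demo.
	go func() { _ = c.syncKubeletConfig(true) }()

	done := make(chan struct{})
	go func() {
		c.readyFlag.Wait() // dependent controllers would block here
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("ready")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: readyFlag.Done() was never called")
	}
}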

@runcom (Member)

runcom commented Feb 8, 2019

So, just from my understanding of the flow, we should mark the WaitGroup done anyway, even if we go through another code path, right? (Even an error one.)
I believe this would do the trick:

diff --git a/pkg/controller/kubelet-config/kubelet_config_controller.go b/pkg/controller/kubelet-config/kubelet_config_controller.go
index 9acc10c..9d368f6 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller.go
@@ -304,6 +304,9 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 	glog.V(4).Infof("Started syncing kubeletconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -420,10 +423,6 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncStatusOnly(cfg, nil)
 }
 
diff --git a/pkg/controller/template/template_controller.go b/pkg/controller/template/template_controller.go
index ec5681b..7bafbdf 100644
--- a/pkg/controller/template/template_controller.go
+++ b/pkg/controller/template/template_controller.go
@@ -324,6 +324,9 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 	glog.V(4).Infof("Started syncing controllerconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing controllerconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -376,10 +379,6 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 		}
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncCompletedStatus(cfg)
 }

@runcom (Member)

runcom commented Feb 8, 2019

we can enhance the patch above to also return an error through a channel and completely fail initialization of the other dependent controllers if we really want to (not sure that's desirable, though)
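
A rough sketch of what that enhancement could look like; the syncErrCh field and waitForFirstSync helper are hypothetical names used for illustration, not code from this PR:

package controller

import (
	"context"
	"sync"
)

// templateController is a simplified stand-in for the real controller.
type templateController struct {
	firstSync sync.Once
	syncErrCh chan error // buffered with capacity 1
}

func newTemplateController() *templateController {
	return &templateController{syncErrCh: make(chan error, 1)}
}

// signalFirstSync reports the outcome of the first sync exactly once,
// whether it succeeded (err == nil) or failed.
func (c *templateController) signalFirstSync(err error) {
	c.firstSync.Do(func() { c.syncErrCh <- err })
}

// waitForFirstSync is what a dependent controller could call before starting
// its own workers: a non-nil error lets it fail initialization instead of
// waiting forever.
func waitForFirstSync(ctx context.Context, ch <-chan error) error {
	select {
	case err := <-ch:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}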

cgwalters force-pushed the controller-phased-startup branch from d20b04a to 494c5c2 on February 11, 2019 at 17:54
@cgwalters (Member Author)

Looking at this again, the problem seems obvious: we can't block at that point or we won't start the kube informers, etc. Changed it to start a new goroutine that does an async wait.
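
For illustration, a tiny standalone sketch of that shape; the start functions and the WaitGroup are stand-ins, not the actual controller wiring:

package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the real sub-controller start functions.
func startTemplateController(synced *sync.WaitGroup) {
	time.Sleep(500 * time.Millisecond) // simulate the initial template sync
	fmt.Println("template controller: initial sync done")
	synced.Done()
}

func startRenderController() {
	fmt.Println("render controller: starting")
}

func main() {
	var templateSynced sync.WaitGroup
	templateSynced.Add(1)

	go startTemplateController(&templateSynced)

	// Instead of blocking here (which would also stall informer startup),
	// wait asynchronously and start the dependent controller once ready.
	go func() {
		templateSynced.Wait()
		startRenderController()
	}()

	time.Sleep(time.Second) // keep the demo alive long enough to see both lines
}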

Review comment (Member)

shouldn't this still be moved to the defer above to avoid it not being called on errors (leaking the other goroutine at this point) or on other successful paths?

Review comment (Member Author)

I don't think so, because we only want to say we're ready on success, right?

Review comment (Member)

how do we spot that we're blocked on the wait(), though? Logs? Is that a condition where we wait forever, or do we need to error out?

Review comment (Member)

if that's the case, cool
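
On the question above about spotting a blocked wait(): one option (a sketch only, not something this PR implements) is to wait in a goroutine and log periodically until the wait completes:

package controller

import (
	"sync"
	"time"

	"github.com/golang/glog"
)

// waitWithProgress waits for wg and logs at every interval while still
// blocked, so a stuck readiness wait shows up in the controller logs.
// Hypothetical helper, not part of this PR.
func waitWithProgress(wg *sync.WaitGroup, what string, interval time.Duration) {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			glog.Infof("%s: ready", what)
			return
		case <-ticker.C:
			glog.Infof("%s: still waiting", what)
		}
	}
}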

@cgwalters (Member Author)

OK, I finally did a bit more of a dive into the codebase here to understand things. First, I was totally confused: the 01-{master,worker}-kubelet MCs are generated by the template controller, not the kubelet sub-controller.

The reason this PR is hanging is that, by default,

$ oc get --all-namespaces kubeletconfig
No resources found.

And we were waiting for one.

For now...all we need to do is block on the template controller having done a full sync. Going to rework this PR.

cgwalters force-pushed the controller-phased-startup branch from 494c5c2 to d517532 on February 12, 2019 at 16:53
@cgwalters (Member Author)

Looks like this worked; there's a clear delay now between when we finish the templates and when we start the renderer:

I0212 18:17:02.595914       1 leaderelection.go:196] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0212 18:17:02.796845       1 template_controller.go:122] Starting MachineConfigController-TemplateController
I0212 18:17:02.996081       1 kubelet_config_controller.go:138] Starting MachineConfigController-KubeletConfigController
I0212 18:17:07.218526       1 template_controller.go:380] Initial template sync done
I0212 18:17:07.395747       1 render_controller.go:112] Starting MachineConfigController-RenderController
I0212 18:17:07.396296       1 node_controller.go:121] Starting MachineConfigController-NodeController
I0212 18:17:12.996365       1 render_controller.go:456] Generated machineconfig master-761cbffcf67ae93c8fde4cb77438e3da from 3 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-master-kubelet  machineconfiguration.openshift.io/v1  }]
I0212 18:17:13.296207       1 render_controller.go:456] Generated machineconfig worker-fb4c0a0f9bd8c642a941ea3d55db1e7b from 3 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  00-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  }]

@cgwalters (Member Author)

cgwalters commented Feb 12, 2019

Something went wrong on the operator side, though; we never exited init mode in that run:

E0212 18:16:38.428010       1 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfigPool: the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io)
I0212 18:16:40.640806       1 sync.go:48] [init mode] synced pools in 21.804511ms
I0212 18:17:07.726846       1 sync.go:48] [init mode] synced mcc in 27.077026444s
I0212 18:17:13.840892       1 sync.go:48] [init mode] synced mcs in 6.103865369s
I0212 18:17:19.936551       1 sync.go:48] [init mode] synced mcd in 6.081135133s
I0212 18:27:19.953801       1 sync.go:48] [init mode] synced required-pools in 10m0.000482382s
I0212 18:27:20.053636       1 sync.go:48] [init mode] synced pools in 9.189432ms
I0212 18:27:22.105326       1 sync.go:48] [init mode] synced mcc in 2.042414571s
I0212 18:27:23.174254       1 sync.go:48] [init mode] synced mcs in 1.058831684s
I0212 18:27:24.535234       1 sync.go:48] [init mode] synced mcd in 1.351046342s
I0212 18:37:24.545491       1 sync.go:48] [init mode] synced required-pools in 10m0.000319228s
I0212 18:37:24.646409       1 sync.go:48] [init mode] synced pools in 8.973161ms
I0212 18:37:26.693497       1 sync.go:48] [init mode] synced mcc in 2.038344699s
I0212 18:37:27.759663       1 sync.go:48] [init mode] synced mcs in 1.055972849s
I0212 18:37:29.125584       1 sync.go:48] [init mode] synced mcd in 1.354875482s
I0212 18:47:29.136629       1 sync.go:48] [init mode] synced required-pools in 10m0.000314036s
I0212 18:47:29.236215       1 sync.go:48] [init mode] synced pools in 9.119727ms
I0212 18:47:31.283226       1 sync.go:48] [init mode] synced mcc in 2.037753306s
I0212 18:47:32.354976       1 sync.go:48] [init mode] synced mcs in 1.061297401s
I0212 18:47:33.715687       1 sync.go:48] [init mode] synced mcd in 1.349676074s

Both the master and worker pools show:

                "machineCount": 0,
                "readyMachineCount": 0,
                "unavailableMachineCount": 0,
                "updatedMachineCount": 0

Hmm. Something interesting about the MCC logs is that I don't see any output from the node controller.

cgwalters force-pushed the controller-phased-startup branch from d517532 to a2d9d63 on February 12, 2019 at 22:38
@cgwalters (Member Author)

Not sure what happened in that run. Added two commits that tweak logging; they should be helpful in general.

@cgwalters (Member Author)

OK I totally don't understand why the node sub-controller apparently isn't reacting with this PR. We're starting it...

Testing this out on a local libvirt run...

Avoid e.g. rendering a MC without the `01-master-kubelet` config template.

Closes: openshift#338

Co-authored-by: Antonio Murdaca <[email protected]>
cgwalters force-pushed the controller-phased-startup branch from 6caf6d0 to 9c4a790 on February 28, 2019 at 19:41
@openshift-ci-robot (Contributor)

openshift-ci-robot commented Feb 28, 2019

@cgwalters: The following tests failed for commit 9c4a790, say /retest to rerun them:

Failed tests and rerun commands:
ci/prow/e2e-aws: /test e2e-aws
ci/prow/e2e-aws-op: /test e2e-aws-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom (Member)

runcom commented Mar 17, 2019

another approach to fix this is 9e3ad65 in #482: basically, do what the operator does to ensure a sync between the template and render controllers (look at the controller config)
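
A simplified sketch of that idea: gate rendering on the template controller having fully processed the current controller config, mirroring what the operator does. The types and field names below are illustrative assumptions, not the real MCO API:

package controller

type controllerConfigStatus struct {
	ObservedGeneration int64
	Completed          bool
}

type controllerConfig struct {
	Generation int64
	Status     controllerConfigStatus
}

// canRender reports whether the render controller may generate MachineConfigs:
// the template controller must have observed and completed the latest
// generation of the controller config.
func canRender(cc controllerConfig) bool {
	return cc.Status.Completed && cc.Status.ObservedGeneration == cc.Generation
}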
