
Conversation

@cgwalters (Member)

Avoid e.g. rendering a MC without the kubelet config.

Closes: #338

cgwalters force-pushed the controller-phased-startup branch from b4b7de1 to 37fac58 on February 5, 2019 at 19:02
@jlebon (Member)

jlebon commented Feb 5, 2019

This looks plausible to me. Will let others have a look and test it out.
/approve

openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 5, 2019
cgwalters force-pushed the controller-phased-startup branch from 37fac58 to 4c6732e on February 5, 2019 at 19:22
openshift-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Feb 5, 2019
cgwalters force-pushed the controller-phased-startup branch from 4c6732e to 0be8861 on February 5, 2019 at 21:10
@runcom (Member)

runcom commented Feb 5, 2019

Approving code-wise, but I won't be able to test it on a cluster until tomorrow, and tests don't seem to be running anyway :(

/approve

@openshift-merge-robot (Contributor)

/retest

@openshift-merge-robot (Contributor)

/retest

@cgwalters (Member Author)

/test e2e-aws-op

cgwalters force-pushed the controller-phased-startup branch from 0be8861 to 646fa63 on February 6, 2019 at 23:09
@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

/lgtm

openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 7, 2019
@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

tests are failing because of:

time="2019-02-07T11:12:12Z" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"monitoring\" (195 of 264)"
time="2019-02-07T11:18:27Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:22:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:36:04Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

@runcom (Member)

runcom commented Feb 7, 2019

/retest

@runcom (Member)

runcom commented Feb 7, 2019

monitoring flake again

/retest

@jlebon (Member)

jlebon commented Feb 7, 2019

/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom (Member)

runcom commented Feb 7, 2019

wtf, still the very same issue with the monitoring operator; reported it to them:

Sergiusz Urbaniak [3 hours ago]
this smells like a communication issue with the API server :thinking_face:

cgwalters force-pushed the controller-phased-startup branch from 646fa63 to 26085f3 on February 7, 2019 at 20:12
@openshift-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@ashcrow (Member)

ashcrow commented Feb 8, 2019

/retest

@runcom (Member)

runcom commented Feb 8, 2019

/retest

@runcom (Member)

runcom commented Feb 8, 2019

not sure why the installer keeps failing on the Prometheus operator here (from the logs)

/retest

@cgwalters (Member Author)

Looking at the logs here, we're still stalling out in the MCC. So clearly this code isn't working as I thought.

I need to try again to build a local release payload and test locally to more easily debug.

@cgwalters (Member Author)

/hold

openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 8, 2019
@runcom (Member)

runcom commented Feb 8, 2019

OK, looking at this PR in my editor again, I see there are many code paths in syncKubeletConfig, and upward in the call stack, that may return early and leave the WaitGroup undone, thus deadlocking...
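
To make that failure mode concrete, here is a minimal, self-contained sketch of the deadlock, reusing the firstSync/readyFlag names from the diff below; the controller type and the driver around it are simplified assumptions, not the actual MCO code:

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type controller struct {
	firstSync sync.Once
	readyFlag sync.WaitGroup
}

func (c *controller) syncKubeletConfig(fail bool) error {
	if fail {
		// Early return: firstSync.Do is never reached, so readyFlag.Done()
		// never runs and anything blocked in readyFlag.Wait() hangs.
		return errors.New("early exit before signalling readiness")
	}
	c.firstSync.Do(c.readyFlag.Done)
	return nil
}

func main() {
	c := &controller{}
	c.readyFlag.Add(1)

	// Every sync takes an early-return path in this demo.
	go func() { _ = c.syncKubeletConfig(true) }()

	done := make(chan struct{})
	go func() {
		c.readyFlag.Wait() // dependent controllers would block here
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("ready")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: readyFlag.Done() was never called")
	}
}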

@runcom (Member)

runcom commented Feb 8, 2019

So, just from my understanding of the flow, we should mark the WaitGroup done anyway, even if we go through another code path, right? (Even an error one.)
I believe this would do the trick:

diff --git a/pkg/controller/kubelet-config/kubelet_config_controller.go b/pkg/controller/kubelet-config/kubelet_config_controller.go
index 9acc10c..9d368f6 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller.go
@@ -304,6 +304,9 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 	glog.V(4).Infof("Started syncing kubeletconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -420,10 +423,6 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncStatusOnly(cfg, nil)
 }
 
diff --git a/pkg/controller/template/template_controller.go b/pkg/controller/template/template_controller.go
index ec5681b..7bafbdf 100644
--- a/pkg/controller/template/template_controller.go
+++ b/pkg/controller/template/template_controller.go
@@ -324,6 +324,9 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 	glog.V(4).Infof("Started syncing controllerconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing controllerconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -376,10 +379,6 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 		}
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncCompletedStatus(cfg)
 }

@runcom (Member)

runcom commented Feb 8, 2019

we can enhance the patch above to also return an error through a channel and completely fail initialization of the other dependent controllers if we really want to (not sure that's desirable, though)
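
A rough sketch of what that enhancement could look like; the syncErrCh field and waitForFirstSync helper are hypothetical names used for illustration, not code from this PR:

package controller

import (
	"context"
	"sync"
)

// templateController is a simplified stand-in for the real controller.
type templateController struct {
	firstSync sync.Once
	syncErrCh chan error // buffered with capacity 1
}

func newTemplateController() *templateController {
	return &templateController{syncErrCh: make(chan error, 1)}
}

// signalFirstSync reports the outcome of the first sync exactly once,
// whether it succeeded (err == nil) or failed.
func (c *templateController) signalFirstSync(err error) {
	c.firstSync.Do(func() { c.syncErrCh <- err })
}

// waitForFirstSync is what a dependent controller could call before starting
// its own workers: a non-nil error lets it fail initialization instead of
// waiting forever.
func waitForFirstSync(ctx context.Context, ch <-chan error) error {
	select {
	case err := <-ch:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}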

cgwalters force-pushed the controller-phased-startup branch from d20b04a to 494c5c2 on February 11, 2019 at 17:54
@cgwalters (Member Author)

Looking at this again, the problem seems obvious: we can't block at that point or we won't start the kube informers, etc. Changed it to start a new goroutine that does an async wait.
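
For illustration, a tiny standalone sketch of that shape; the start functions and the WaitGroup are stand-ins, not the actual controller wiring:

package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the real sub-controller start functions.
func startTemplateController(synced *sync.WaitGroup) {
	time.Sleep(500 * time.Millisecond) // simulate the initial template sync
	fmt.Println("template controller: initial sync done")
	synced.Done()
}

func startRenderController() {
	fmt.Println("render controller: starting")
}

func main() {
	var templateSynced sync.WaitGroup
	templateSynced.Add(1)

	go startTemplateController(&templateSynced)

	// Instead of blocking here (which would also stall informer startup),
	// wait asynchronously and start the dependent controller once ready.
	go func() {
		templateSynced.Wait()
		startRenderController()
	}()

	time.Sleep(time.Second) // keep the demo alive long enough to see both lines
}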

Review comment (Member)

shouldn't this still be moved to the defer above to avoid it not being called on errors (leaking the other goroutine at this point) or on other successful paths?

Review comment (Member Author)

I don't think so, because we only want to say we're ready on success, right?

Review comment (Member)

how do we spot that we're blocked on the wait(), though? Logs? Is that a condition where we wait forever, or do we need to error out?

Review comment (Member)

if that's the case, cool
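
On the question above about spotting a blocked wait(): one option (a sketch only, not something this PR implements) is to wait in a goroutine and log periodically until the wait completes:

package controller

import (
	"sync"
	"time"

	"github.com/golang/glog"
)

// waitWithProgress waits for wg and logs at every interval while still
// blocked, so a stuck readiness wait shows up in the controller logs.
// Hypothetical helper, not part of this PR.
func waitWithProgress(wg *sync.WaitGroup, what string, interval time.Duration) {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			glog.Infof("%s: ready", what)
			return
		case <-ticker.C:
			glog.Infof("%s: still waiting", what)
		}
	}
}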

@cgwalters (Member Author)

OK, I finally did a bit more of a dive into the codebase here to understand things. First, I was totally confused: the 01-{master,worker}-kubelet MCs are generated by the template controller, not the kubelet sub-controller.

The reason this PR is hanging is that, by default,

$ oc get --all-namespaces kubeletconfig
No resources found.

And we were waiting for one.

For now...all we need to do is block on the template controller having done a full sync. Going to rework this PR.

cgwalters force-pushed the controller-phased-startup branch from 494c5c2 to d517532 on February 12, 2019 at 16:53
@cgwalters (Member Author)

Looks like this worked; there's a clear delay now between when we finish the templates and when we start the renderer:

I0212 18:17:02.595914       1 leaderelection.go:196] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0212 18:17:02.796845       1 template_controller.go:122] Starting MachineConfigController-TemplateController
I0212 18:17:02.996081       1 kubelet_config_controller.go:138] Starting MachineConfigController-KubeletConfigController
I0212 18:17:07.218526       1 template_controller.go:380] Initial template sync done
I0212 18:17:07.395747       1 render_controller.go:112] Starting MachineConfigController-RenderController
I0212 18:17:07.396296       1 node_controller.go:121] Starting MachineConfigController-NodeController
I0212 18:17:12.996365       1 render_controller.go:456] Generated machineconfig master-761cbffcf67ae93c8fde4cb77438e3da from 3 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-master-kubelet  machineconfiguration.openshift.io/v1  }]
I0212 18:17:13.296207       1 render_controller.go:456] Generated machineconfig worker-fb4c0a0f9bd8c642a941ea3d55db1e7b from 3 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  00-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  }]

@cgwalters (Member Author)

cgwalters commented Feb 12, 2019

Something went wrong on the operator side, though; we never exited init mode in that run:

E0212 18:16:38.428010       1 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfigPool: the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io)
I0212 18:16:40.640806       1 sync.go:48] [init mode] synced pools in 21.804511ms
I0212 18:17:07.726846       1 sync.go:48] [init mode] synced mcc in 27.077026444s
I0212 18:17:13.840892       1 sync.go:48] [init mode] synced mcs in 6.103865369s
I0212 18:17:19.936551       1 sync.go:48] [init mode] synced mcd in 6.081135133s
I0212 18:27:19.953801       1 sync.go:48] [init mode] synced required-pools in 10m0.000482382s
I0212 18:27:20.053636       1 sync.go:48] [init mode] synced pools in 9.189432ms
I0212 18:27:22.105326       1 sync.go:48] [init mode] synced mcc in 2.042414571s
I0212 18:27:23.174254       1 sync.go:48] [init mode] synced mcs in 1.058831684s
I0212 18:27:24.535234       1 sync.go:48] [init mode] synced mcd in 1.351046342s
I0212 18:37:24.545491       1 sync.go:48] [init mode] synced required-pools in 10m0.000319228s
I0212 18:37:24.646409       1 sync.go:48] [init mode] synced pools in 8.973161ms
I0212 18:37:26.693497       1 sync.go:48] [init mode] synced mcc in 2.038344699s
I0212 18:37:27.759663       1 sync.go:48] [init mode] synced mcs in 1.055972849s
I0212 18:37:29.125584       1 sync.go:48] [init mode] synced mcd in 1.354875482s
I0212 18:47:29.136629       1 sync.go:48] [init mode] synced required-pools in 10m0.000314036s
I0212 18:47:29.236215       1 sync.go:48] [init mode] synced pools in 9.119727ms
I0212 18:47:31.283226       1 sync.go:48] [init mode] synced mcc in 2.037753306s
I0212 18:47:32.354976       1 sync.go:48] [init mode] synced mcs in 1.061297401s
I0212 18:47:33.715687       1 sync.go:48] [init mode] synced mcd in 1.349676074s

Both the master and worker pools show:

                "machineCount": 0,
                "readyMachineCount": 0,
                "unavailableMachineCount": 0,
                "updatedMachineCount": 0

Hmm. Something interesting about the MCC logs is that I don't see any output from the node controller.

cgwalters force-pushed the controller-phased-startup branch from d517532 to a2d9d63 on February 12, 2019 at 22:38
@cgwalters (Member Author)

Not sure what happened in that run. Added two commits that tweak logging; they should be helpful in general.

@cgwalters (Member Author)

OK I totally don't understand why the node sub-controller apparently isn't reacting with this PR. We're starting it...

Testing this out on a local libvirt run...

Avoid e.g. rendering a MC without the `01-master-kubelet` config template.

Closes: openshift#338

Co-authored-by: Antonio Murdaca <[email protected]>
cgwalters force-pushed the controller-phased-startup branch from 6caf6d0 to 9c4a790 on February 28, 2019 at 19:41
@openshift-ci-robot (Contributor)

openshift-ci-robot commented Feb 28, 2019

@cgwalters: The following tests failed for commit 9c4a790, say /retest to rerun them:

Failed tests and rerun commands:
ci/prow/e2e-aws: /test e2e-aws
ci/prow/e2e-aws-op: /test e2e-aws-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom (Member)

runcom commented Mar 17, 2019

another approach to fix this is 9e3ad65 in #482: basically, do what the operator does to ensure a sync between the template and render controllers (look at the controller config)
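
A simplified sketch of that idea: gate rendering on the template controller having fully processed the current controller config, mirroring what the operator does. The types and field names below are illustrative assumptions, not the real MCO API:

package controller

type controllerConfigStatus struct {
	ObservedGeneration int64
	Completed          bool
}

type controllerConfig struct {
	Generation int64
	Status     controllerConfigStatus
}

// canRender reports whether the render controller may generate MachineConfigs:
// the template controller must have observed and completed the latest
// generation of the controller config.
func canRender(cc controllerConfig) bool {
	return cc.Status.Completed && cc.Status.ObservedGeneration == cc.Generation
}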
