controller: Wait for template and kubelet before starting renderer #385
Conversation
Force-pushed from b4b7de1 to 37fac58.
This looks plausible to me. Will let others have a look and test it out.
Force-pushed from 37fac58 to 4c6732e.
Force-pushed from 4c6732e to 0be8861.
Approving code-wise, but I won't be able to test it on a cluster until tomorrow, and tests don't seem to be running anyway :(
/approve
/retest
1 similar comment
/retest
/test e2e-aws-op
Force-pushed from 0be8861 to 646fa63.
/retest
1 similar comment
/retest
/lgtm
/retest
tests are failing because of:
/retest
Monitoring flake again.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, runcom

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
wtf, still the very same issue with the monitoring operator; reported to them:
Force-pushed from 646fa63 to 26085f3.
New changes are detected. LGTM label has been removed.
/retest
1 similar comment
/retest
Not sure why the installer keeps failing on the prometheus operator here (from the logs).
/retest
Looking at the logs here, we're still stalling out in the MCC. So clearly this code isn't working as I thought. I need to try again to build a local release payload and test locally to more easily debug.
/hold
ok, looking at this PR from my editor again now, I see there are many code paths in the sync functions.
So, just from my understanding of the flow, we should close the WaitGroup anyway even if we're going through another code path, right? (even an error one).

```diff
diff --git a/pkg/controller/kubelet-config/kubelet_config_controller.go b/pkg/controller/kubelet-config/kubelet_config_controller.go
index 9acc10c..9d368f6 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller.go
@@ -304,6 +304,9 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 	glog.V(4).Infof("Started syncing kubeletconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()

 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -420,10 +423,6 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}

-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncStatusOnly(cfg, nil)
 }
diff --git a/pkg/controller/template/template_controller.go b/pkg/controller/template/template_controller.go
index ec5681b..7bafbdf 100644
--- a/pkg/controller/template/template_controller.go
+++ b/pkg/controller/template/template_controller.go
@@ -324,6 +324,9 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 	glog.V(4).Infof("Started syncing controllerconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing controllerconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()

 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -376,10 +379,6 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 		}
 	}

-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncCompletedStatus(cfg)
 }
```
We can enhance that patch above to also return an error through a channel and completely fail the initialization of the other dependent controllers if we really want to (not sure that's desirable though).
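A rough sketch of that idea, with hypothetical names (`readyErr` and `firstSyncResult` are not in the codebase): the first sync reports its outcome once on a buffered channel, and the dependent controllers are started only if it succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

// firstSyncResult stands in for a controller's first sync; it reports its
// outcome exactly once on readyErr (nil on success).
func firstSyncResult(readyErr chan<- error, fail bool) {
	if fail {
		readyErr <- errors.New("initial controllerconfig sync failed")
		return
	}
	readyErr <- nil
}

func main() {
	readyErr := make(chan error, 1)
	go firstSyncResult(readyErr, false)

	// The operator would only start the dependent controllers (renderer,
	// node controller, ...) if the first sync reported success.
	if err := <-readyErr; err != nil {
		fmt.Println("not starting dependent controllers:", err)
		return
	}
	fmt.Println("starting dependent controllers")
}
```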
Force-pushed from d20b04a to 494c5c2.
Looking at this again, the problem seems obvious: we can't block at that point or we won't start the kube informers, etc. Changed it to start a new goroutine that does an async wait.
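Roughly the shape of that change (a sketch under assumptions, not the exact MCO code; `startRendererWorkers` is a made-up stand-in): the blocking `Wait()` moves into its own goroutine so the informers and other controllers keep starting immediately, and only the renderer's work loop is deferred.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startRendererWorkers is a stand-in for kicking off the renderer's queue
// workers once its dependencies have done their first sync.
func startRendererWorkers() {
	fmt.Println("template controller synced; starting renderer workers")
}

func main() {
	var templateReady sync.WaitGroup
	templateReady.Add(1)

	// Pretend the template controller finishes its first sync a bit later.
	go func() {
		time.Sleep(100 * time.Millisecond)
		templateReady.Done()
	}()

	// Instead of blocking Run() itself (which would keep the informers and
	// the other controllers from starting), wait asynchronously and only
	// then start the renderer's workers.
	go func() {
		templateReady.Wait()
		startRendererWorkers()
	}()

	fmt.Println("informers and other controllers start immediately")
	time.Sleep(200 * time.Millisecond)
}
```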
shouldn't this still be moved to the defer above, to avoid it not being called on errors (leaking the other goroutine at this point) or on other successful paths?
I don't think so, because we want to only say we're ready on success, right?
how do we spot that we're blocked on the wait() though? logs? is that a condition where we must wait forever, or do we need to error out?
if that's the case, cool
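On the "how do we spot that we're blocked" question: one option (just a sketch with made-up names like `waitWithLogging`, not something this PR adds) is to wrap the wait in a select loop that logs periodically, so a dependency that never becomes ready shows up in the MCC logs instead of looking like a silent hang.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithLogging blocks until wg is released, printing a progress line every
// interval so a stuck dependency is visible in the logs rather than looking
// like a deadlock.
func waitWithLogging(name string, wg *sync.WaitGroup, interval time.Duration) {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			fmt.Printf("%s is ready\n", name)
			return
		case <-ticker.C:
			fmt.Printf("still waiting for %s to finish its first sync\n", name)
		}
	}
}

func main() {
	var ready sync.WaitGroup
	ready.Add(1)
	go func() {
		time.Sleep(3 * time.Second)
		ready.Done()
	}()
	waitWithLogging("template controller", &ready, time.Second)
}
```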
OK, finally did a bit more of a dive into the codebase here to understand things. First, I was totally confused. The reason this PR is hanging is because by default there's no KubeletConfig object, and we were waiting for one. For now, all we need to do is block on the template controller having done a full sync. Going to rework this PR.
Force-pushed from 494c5c2 to d517532.
Looks like this worked; there's a clear delay now between when we finish the templates and when we start the renderer:
Something went wrong operator-side though, we never exited init mode in that run:
Both the master/worker pools are:
Hmm. Something interesting about the MCC logs is that I don't see any output from the node controller.
Force-pushed from d517532 to a2d9d63.
Not sure what happened in that run. Added two commits which tweak logging; those should be helpful in general.
Force-pushed from a2d9d63 to 4e4fa04.
OK I totally don't understand why the node sub-controller apparently isn't reacting with this PR. We're starting it... Testing this out on a local libvirt run...
Force-pushed from 4e4fa04 to 6caf6d0.

Avoid e.g. rendering a MC without the `01-master-kubelet` config template.

Closes: openshift#338
Co-authored-by: Antonio Murdaca <[email protected]>
Force-pushed from 6caf6d0 to 9c4a790.
@cgwalters: The following tests failed for commit 9c4a790, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Avoid e.g. rendering a MC without the kubelet config.
Closes: #338