-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubeadm: Detect kubelet readiness and error out if the kubelet is unhealthy #51369
Conversation
/retest |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really happy to have better diagnostic output here. It should be a big help to admins when things go wrong.
I'm slightly concerned that 155 seconds will not be enough. It seems hard to pick a safe timeout for this because it needs to cover a relatively large download over the user's internet connection which might be almost arbitrarily slow. Maybe we could keep trying forever, but start printing helpful warning/error messages after a certain amount of time? We might also just print out a message saying that kubeadm init
has given up, but the kubelet might still get things up and running if you are able to restart or otherwise fix it.
My other style concern is that the number of timeout/retry constants here is kind of complex because of the way TryRunCommand
and waitForAPIAndKubelet
are nested. Perhaps there's a way we could refactor so there's a single select
that blocks until the first failure condition is met? This is not a blocker.
The failing test looks like #51429.
Unfortunately, an error has occurred: | ||
{{ .Error }} | ||
|
||
This error is likely caused by that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe This error is likely caused by one of these problems:
?
|
||
go func(errC chan error, waiter apiclient.Waiter) { | ||
// This goroutine can only make kubeadm init fail. If this check succeeds, it won't do anything special | ||
if err := waiter.WaitForHealthyKubelet(40*time.Second, "http://localhost:10255/healthz"); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to assume the port number here? Isn't it configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to assume that; it's the standard for the kubelet readonly port. If you change that manually you will way larger problems :)
That timeout is currently 30 minutes; can increase if you think that'd be useful.
They aren't nested.
That is what we explicitely want to avoid in prod scenarios. Note that there are three different goroutines:
And there is a single wait block; the first goroutine to return something will be the return value used by the func. This means that if the |
/retest |
1 similar comment
/retest |
/lgtm We can always make the timeout configurable later if it is an issue in real environments. This change should be a strict improvement over the current behavior, so I don't want to block merging over that. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: luxas, mattmoyer Associated issue: 377 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Note that this 30min timeout already exists at HEAD; it isn't added or changed in this PR /retest |
/retest Review the full test history for this PR. |
2 similar comments
/retest Review the full test history for this PR. |
/retest Review the full test history for this PR. |
/retest |
/retest Review the full test history for this PR. |
1 similar comment
/retest Review the full test history for this PR. |
757e2e1
to
6853df7
Compare
6853df7
to
92c5997
Compare
Rebased, re-applying LGTM |
/retest |
/retest Review the full test history for this PR. |
@luxas: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 51682, 51546, 51369, 50924, 51827) |
fmt.Printf("[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory %q\n", kubeadmconstants.GetStaticPodDirectory()) | ||
fmt.Println("[init] This often takes around a minute; or longer if the control plane images have to be pulled.") | ||
|
||
go func(errC chan error, waiter apiclient.Waiter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be based on --skip-preflight-checks
, if this flag is set, we shouldn't check the status of kubelet, if not it makes sense to check the status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a hard line to balance.. I'd prefer to have this as-is although there is a small change it will timeout before all the images have pulled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to discuss this, please open a new issue in kubernetes/kubeadm though
What this PR does / why we need it:
In order to improve the UX when the kubelet is unhealthy or stopped, or whatever, kubeadm now polls the kubelet's API after 40 and 60 seconds, and then performs an exponential backoff for a total of 155 seconds.
If the kubelet endpoint is not returning
ok
by then, kubeadm gives up and exits.This will miligate at least 60% of our "[apiclient] Created API client, waiting for control plane to come up" issues in the kubeadm issue tracker 🎉, as kubeadm now informs the user what's wrong and also doesn't deadlock like before.
Demo:
In this demo, I'm first starting kubeadm normally and everything works as usual.
In the second case, I'm explicitely stopping the kubelet so it doesn't run, and skipping preflight checks, so that kubeadm doesn't even try to exec
systemctl start kubelet
like it does usually.That obviously results in a non-working system, but now kubeadm tells the user what's the problem instead of waiting forever.
Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes #Fixes: kubernetes/kubeadm#377
Special notes for your reviewer:
Release note:
@kubernetes/sig-cluster-lifecycle-pr-reviews @pipejakob
cc @justinsb @kris-nova @lukemarsden as well as you wanted this feature :)