kubeadm 1.30.1 init fails in kubelet-check with timeout #3069
Comments
Thanks for the report. Yes, there were some refactors in the area of the kubelet health check in 1.30, but they are straightforward and only you have reported the problem. I cannot reproduce it; we also have an extensive e2e test suite, and there are many projects built on top of kubeadm that have not reported such an issue. My point is, we need to understand why it is failing in your setup. The location of the failure is here: that is a standard Go HTTP client, but it may have some differences with how curl does transport, with respect to timeouts and proxy for example. You can install Wireshark and see what the differences are between the two requests.
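For comparison with curl, a minimal standalone Go program along these lines can issue the same kind of request; this is only an illustrative sketch (the URL and the 5-second timeout are assumptions, not kubeadm's actual code):

```go
// healthzprobe.go — an illustrative sketch, not kubeadm's implementation.
// It performs a plain Go net/http GET against the kubelet healthz endpoint
// so its behavior (name resolution, proxy handling, timeouts) can be compared
// with `curl http://localhost:10248/healthz`, e.g. under Wireshark.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	url := "http://localhost:10248/healthz" // assumed endpoint from this report

	client := &http.Client{Timeout: 5 * time.Second}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "building request:", err)
		os.Exit(1)
	}

	resp, err := client.Do(req)
	if err != nil {
		// A timeout or "connection refused" here, while curl succeeds,
		// points at a difference in how the two resolve or connect.
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d body=%q\n", resp.StatusCode, string(body))
}
```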
If you are able to build kubeadm from source, you can try the following:
Thanks for the direction on where to look in the source; I was able to determine what was causing the issue. Unfortunately I don't understand why it was working in the previous release. This is failing in my setup because of how http.Client resolves localhost: here it resolves to ::1, while kubelet is only listening on 127.0.0.1:10248. Curl works because, as I just discovered, it tries both the IPv6 and IPv4 loopback addresses for localhost. The KubeletConfiguration.HealthzBindAddress default value is the literal string 127.0.0.1. The only relevant difference I see in this logic between 1.29.5 and 1.30.1 is using http.Client.Do() instead of http.Client.Get(). But I wrote a trivial test that used both, and the two functions had the same behavior: they resolved localhost as ::1. Three different fixes worked in my setup:
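To illustrate the diagnosis above, a small hypothetical Go sketch (not taken from the issue) can show what localhost resolves to on a host and whether port 10248 is reachable on each returned address; on the affected setup the resolver would be expected to return only ::1, which the kubelet healthz server is not listening on:

```go
// lookuplocalhost.go — a diagnostic sketch (hypothetical, not from the report).
// It prints the addresses "localhost" resolves to and tries TCP port 10248 on
// each one, mirroring the mismatch described above: localhost resolving to ::1
// while the kubelet healthz server listens only on 127.0.0.1:10248.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	addrs, err := net.DefaultResolver.LookupHost(context.Background(), "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		target := net.JoinHostPort(a, "10248")
		conn, err := net.DialTimeout("tcp", target, 2*time.Second)
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", target, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: reachable\n", target)
	}
}
```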
Probably, if kubeadm is not going to build the healthz URL using the KubeletConfiguration that it is deploying, it should use the same values that the KubeletConfiguration uses for defaults. I updated my setup to resolve localhost to 127.0.0.1 and everything is working well like that. Thanks again for the pointers!
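A hypothetical sketch of the direction discussed here — building the healthz URL from the KubeletConfiguration values and falling back to the upstream defaults (127.0.0.1 and port 10248) instead of a hardcoded localhost — could look as follows; the struct and helper names are invented for illustration and are not kubeadm's actual code:

```go
// buildhealthzurl.go — an invented sketch, not kubeadm's implementation.
// It derives the kubelet healthz URL from KubeletConfiguration-style values,
// using the upstream defaults when the fields are unset.
package main

import (
	"fmt"
	"net"
)

// kubeletHealthz mirrors the two relevant KubeletConfiguration fields.
type kubeletHealthz struct {
	HealthzBindAddress string
	HealthzPort        int32
}

func healthzURL(cfg kubeletHealthz) string {
	addr := cfg.HealthzBindAddress
	if addr == "" {
		addr = "127.0.0.1" // upstream KubeletConfiguration default
	}
	port := cfg.HealthzPort
	if port == 0 {
		port = 10248 // upstream KubeletConfiguration default
	}
	return fmt.Sprintf("http://%s/healthz", net.JoinHostPort(addr, fmt.Sprint(port)))
}

func main() {
	fmt.Println(healthzURL(kubeletHealthz{}))                                              // defaults
	fmt.Println(healthzURL(kubeletHealthz{HealthzBindAddress: "::1", HealthzPort: 10248})) // explicit values
}
```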
Thanks for testing and confirming the source of the problem. If that passes on CI, the fix can be backported to 1.30.latest.
I will mark it as a bug instead of a regression, because we are not sure about this part; the old code in 1.29 also hardcoded localhost.
The fix will be available in 1.30.2.
Hello, I also encountered a similar issue; its address is kubernetes/kubernetes#125275. I checked my DNS and this is its result:
At the same time, I also added the corresponding resolution in /etc/hosts:
root@kmaster1:~# cat /etc/hosts
10.0.0.5 debian
# The following lines are desirable for IPv6 capable hosts
#::1 localhost ip6-localhost ip6-loopback
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
10.0.0.21 kmaster1
10.0.0.22 knode1
10.0.0.23 knode2
127.0.0.1 localhost
But now, executing kubeadm init still fails with the following output:
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:111
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1598
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:111
github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1598
I also checked the kubelet.
Hope to get your advice, thank you!
That seems like a separate problem. https://github.com/kubernetes/kubeadm?tab=readme-ov-file#support
I've placed some relevant logs in the issues I raised; the general content is as follows.
Jun 02 13:46:00 kmaster1 kubelet[658]: I0602 13:46:00.418850 658 server.go:469] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 02 13:46:00 kmaster1 kubelet[658]: I0602 13:46:00.439107 658 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Jun 02 13:46:00 kmaster1 kubelet[658]: E0602 13:46:00.453782 658 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 02 13:46:00 kmaster1 kubelet[658]: E0602 13:46:00.453866 658 kubelet.go:1431] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jun 02 13:46:00 kmaster1 kubelet[658]: I0602 13:46:00.506290 658 manager.go:471] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Jun 02 13:46:00 kmaster1 kubelet[658]: E0602 13:46:00.513551 658 kubelet.go:2327] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jun 02 14:03:39 kmaster1 kubelet[1120]: I0602 14:03:39.312424 1120 server.go:469] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 02 14:03:39 kmaster1 kubelet[1120]: I0602 14:03:39.316470 1120 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Jun 02 14:03:39 kmaster1 kubelet[1120]: E0602 14:03:39.319191 1120 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 02 14:03:39 kmaster1 kubelet[1120]: E0602 14:03:39.319211 1120 kubelet.go:1431] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jun 02 14:03:39 kmaster1 kubelet[1120]: E0602 14:03:39.335951 1120 kubelet.go:2327] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jun 02 14:03:39 kmaster1 kubelet[1120]: I0602 14:03:39.346537 1120 manager.go:471] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Jun 02 14:26:58 kmaster1 kubelet[3511]: I0602 14:26:58.865189 3511 server.go:469] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 02 14:26:58 kmaster1 kubelet[3511]: I0602 14:26:58.868954 3511 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Jun 02 14:26:58 kmaster1 kubelet[3511]: E0602 14:26:58.871007 3511 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 02 14:26:58 kmaster1 kubelet[3511]: E0602 14:26:58.871131 3511 kubelet.go:1431] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jun 02 14:26:58 kmaster1 kubelet[3511]: I0602 14:26:58.883078 3511 manager.go:471] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Jun 02 14:26:58 kmaster1 kubelet[3511]: E0602 14:26:58.887446 3511 kubelet.go:2327] "Skipping pod synchronization" err="PLEG is not healthy: pleg has yet to be successful"
Jun 02 14:26:59 kmaster1 kubelet[3555]: I0602 14:26:59.084609 3555 server.go:469] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 02 14:26:59 kmaster1 kubelet[3555]: I0602 14:26:59.090511 3555 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Jun 02 14:26:59 kmaster1 kubelet[3555]: E0602 14:26:59.091925 3555 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 02 14:26:59 kmaster1 kubelet[3555]: E0602 14:26:59.091981 3555 kubelet.go:1431] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jun 02 14:26:59 kmaster1 kubelet[3555]: E0602 14:26:59.102172 3555 kubelet.go:2327] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jun 02 14:26:59 kmaster1 kubelet[3555]: I0602 14:26:59.103552 3555 manager.go:471] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Jun 02 14:28:03 kmaster1 kubelet[7909]: I0602 14:28:03.967366 7909 server.go:469] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 02 14:28:03 kmaster1 kubelet[7909]: I0602 14:28:03.973456 7909 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"cgroupfs","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":false,"KubeReservedCgroupName":"","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":null,"SystemReserved":null,"HardEvictionThresholds":[],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":-1,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Jun 02 14:28:03 kmaster1 kubelet[7909]: E0602 14:28:03.974820 7909 cri_stats_provider.go:448] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 02 14:28:03 kmaster1 kubelet[7909]: E0602 14:28:03.974835 7909 kubelet.go:1431] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jun 02 14:28:03 kmaster1 kubelet[7909]: E0602 14:28:03.979881 7909 kubelet.go:2327] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jun 02 14:28:03 kmaster1 kubelet[7909]: I0602 14:28:03.987221 7909 manager.go:471] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Should I reopen my issues and continue asking questions under them?
We don't provide support on GitHub, @gy528909, so please don't open more issues in the kubernetes GitHub org. Also, this is not a kubeadm bug. Post in Slack or on discuss.k8s.io; ask in #sig-node or #containerd too.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version):
Environment:
Kubernetes version (use kubectl version): 1.30.1
Kernel (uname -a): Linux k8ctl1 6.1.0-20-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
What happened?
I attempted to initialize the cluster, but the init command failed indicating a timeout connecting to http://localhost:10248/healthz. Notably, I can immediately run the provided curl command, which results in an instant response of "ok".
What you expected to happen?
Successful cluster initialization ending in a join command for the worker nodes.
How to reproduce it (as minimally and precisely as possible)?
The VM for the first control node of this cluster was created from a Debian 12 template that is mostly a minimal Debian install plus cloud-init and some common utilities. The VM was prepared for K8s using Ansible plays developed from the initial docs (install containerd and configure it for the systemd cgroup driver, set kernel params, add the overlay and br_netfilter modules, etc.).
Containerd is installed from docker apt repo (containerd.io package)
Kubernetes components are installed from pkgs.k8s.io apt repo
VM is IPv6 single-stack, IPv4 Internet access via NAT64.
Rebooted VM and validated modules were loaded and kernel params were set.
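As a side note, a small Go helper like the following (purely illustrative, not part of the original report or of kubeadm) can be used to confirm that the kernel parameters and modules mentioned above are in place before running kubeadm init:

```go
// preflightcheck.go — an illustrative helper, not part of kubeadm.
// It reads standard /proc/sys entries for the commonly required sysctls and
// checks /sys/module for the overlay and br_netfilter kernel modules.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// sysctls typically required for a containerd-based kubeadm node.
	sysctls := map[string]string{
		"/proc/sys/net/ipv4/ip_forward":                 "1",
		"/proc/sys/net/bridge/bridge-nf-call-iptables":  "1",
		"/proc/sys/net/bridge/bridge-nf-call-ip6tables": "1",
	}
	for path, want := range sysctls {
		data, err := os.ReadFile(path)
		if err != nil {
			// The bridge entries only exist once br_netfilter is loaded.
			fmt.Printf("MISSING %s: %v\n", path, err)
			continue
		}
		fmt.Printf("%s: got=%s want=%s\n", path, strings.TrimSpace(string(data)), want)
	}

	// A loaded kernel module shows up as a directory under /sys/module.
	for _, mod := range []string{"overlay", "br_netfilter"} {
		if _, err := os.Stat("/sys/module/" + mod); err != nil {
			fmt.Printf("module %s: not loaded\n", mod)
		} else {
			fmt.Printf("module %s: loaded\n", mod)
		}
	}
}
```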
Using config file:
Ran command:
kubeadm init --config /root/kubeadm-config.yaml
Anything else we need to know?
Possible Regression from 1.29.x
The issue is repeatable, and does not occur with kubeadm 1.29.5.
Other Details
Expected containers are running after the failure, and none are flapping:
Cluster seems to be working after failed init
I was able to push kubeadm through the remainder of the phases and ended up with what I think is a normally working control node.
Then, setting up and testing kubectl:
Get a token to join other nodes:
kubeadm token create --print-join-command
Attempted to join a worker node, where kubeadm also failed with the same error as it did on the control node. But the worker appeared to be joined successfully despite the error.
System log from boot through attempted cluster init on control node:
kubeadm-init-fail4-system-journal.log