
etcdserver request timed out, when marking a node as master by adding a label and a taint #937

Closed
xlgao-zju opened this issue Jun 19, 2018 · 18 comments
Assignees
Labels
area/cloudprovider priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@xlgao-zju

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.9.7

Environment:

  • Kubernetes version (use kubectl version): 1.9.7
  • Cloud provider or hardware configuration: Alibaba Cloud
  • OS (e.g. from /etc/os-release): CentOS
  • Kernel (e.g. uname -a): 3.10.0-693.2.2.el7.x86_64
  • Others:

What happened?

An etcdserver request timed out when marking a node as master by adding a label and a taint.

The logs are:

Jun 14 17:37:35 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
Jun 14 17:37:36 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
Jun 14 17:37:36 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-apiserver"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-controller-manager"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-scheduler"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
Jun 14 17:37:36 localhost cloud-init: [init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
Jun 14 17:37:36 localhost cloud-init: [init] This might take a minute or longer if the control plane images have to be pulled.
Jun 14 17:38:17 localhost cloud-init: [apiclient] All control plane components are healthy after 40.516281 seconds
Jun 14 17:38:17 localhost cloud-init: [uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
Jun 14 17:38:21 localhost cloud-init: [markmaster] Will mark node cn-beijing.i-2ze8wq45pybh93gpzuda as master by adding a label and a taint
Jun 14 17:38:29 localhost cloud-init: error marking master: etcdserver: request timed out

What you expected to happen?

kubeadm init should have succeeded.

How to reproduce it (as minimally and precisely as possible)?

Issue kubeadm init; it happens sometimes...

Anything else we need to know?

etcd itself works well...

@xlgao-zju
Author

I think we could add some retries; that might fix this issue...

@xlgao-zju
Author

@neolit123 PTAL

@neolit123
Member

@xlgao-zju thanks for the report.

Issue kubeadm init, and it happens sometimes...

If it only happens sometimes, I wonder if it's a networking issue.
Do you have good connectivity to the etcd server in question?

/cc @detiber @stealthybox

@neolit123 neolit123 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/cloudprovider labels Jun 19, 2018
@xlgao-zju
Author

@neolit123

If it only happens sometimes, I wonder if it's a networking issue.
Do you have good connectivity to the etcd server in question?

Yes, I think it may be a networking issue too. But I checked etcd, and it worked well, so I do not know how to avoid this. I think adding some retries is one possible solution.

@detiber
Member

detiber commented Jun 19, 2018

@xlgao-zju I'm a bit hesitant to add retries without an explicit reason to do so, since retries could mask other potential issues. If the control plane node has some type of intermittent connectivity issue with etcd, there will be other cascading issues with the cluster.

When invoking kubeadm init, are you using a config file or passing in any additional arguments? If you are using the default config, or a config that doesn't specify an external etcd cluster, I would be even more concerned, since the etcd static pod is deployed locally on the host and should not suffer connectivity issues.

@timothysc timothysc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jun 19, 2018
@xlgao-zju
Author

@detiber I used the config file.

I created a three-node etcd cluster before issuing kubeadm init, and I specified its IPs in the config file.
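The relevant part of the config looked roughly like this (a sketch for the v1alpha1 API used by kubeadm 1.9; the endpoints and cert paths are placeholders, not my real values):

```yaml
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
etcd:
  # external three-node etcd cluster; placeholder IPs
  endpoints:
  - https://172.16.0.1:2379
  - https://172.16.0.2:2379
  - https://172.16.0.3:2379
  # placeholder client TLS material
  caFile: /etc/etcd/ca.pem
  certFile: /etc/etcd/client.pem
  keyFile: /etc/etcd/client-key.pem
```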

Feel free to ping me, if you need more evidence.

@xlgao-zju
Author

I came across this issue again today. Our team creates k8s clusters hundreds of times every day, so we run into some of the rare issues. :)

This is the log of etcd during that period.

-- Logs begin at Tue 2018-06-26 20:09:29 CST, end at Tue 2018-06-26 16:23:45 CST. --
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [term: 2] received a MsgVote message with higher term from f7babc917426a5a8 [term: 3]
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 became follower at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [logterm: 2, index: 1290, vote: 0] cast MsgVote for f7babc917426a5a8 [logterm: 2, index: 1290] at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 lost leader a680a06346fef120 at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 elected leader f7babc917426a5a8 at term 3
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/clusterrolebindings\" range_end:\"/registry/clusterrolebindingt\" count_only:true " with result "range_response_count:0 size:7" took too long (803.326661ms) to execute
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:178" took too long (187.157103ms) to execute

@detiber
Member

detiber commented Jun 26, 2018

@xlgao-zju what type of storage is backing these clusters? If etcd isn't backed by fast enough storage it could potentially account for the issue you are seeing.

@xlgao-zju
Author

@detiber I use a cloud disk on Alibaba Cloud, and the cloud disk seems kind of slow... Any suggestions for avoiding this issue?

@detiber
Member

detiber commented Jun 26, 2018

@xlgao-zju in the past, for CI/testing environments, I have used a memory-backed tmpfs to work around slow disk access. It doesn't solve the problem for longer-lived environments, but it works in a pinch for short-lived ones.

@xlgao-zju
Author

@detiber Yes, tmpfs works fine for short-lived envs, but is there any solution that can actually fix this issue (except changing the type of cloud disk :P)?

@timothysc
Member

/assign @detiber @chuckha

@timothysc timothysc removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 3, 2018
@timothysc timothysc added this to the v1.12 milestone Jul 3, 2018
@timothysc timothysc modified the milestones: v1.12, v1.13 Aug 21, 2018
@verwilst

verwilst commented Aug 28, 2018

I'm also running into this, while running kubespray with kubeadm enabled. I have 3 separate etcd nodes.

/opt/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.v1alpha2.yaml --ignore-preflight-errors=all -v256

-- SNIP --

[markmaster] Marking the node k8s-acc-api01 as master by adding the label "node-role.kubernetes.io/master=''"
[markmaster] Marking the node k8s-acc-api01 as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
I0828 11:43:14.666180 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:43:14.673901 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 7 milliseconds
I0828 11:43:14.673961 26618 round_trippers.go:411] Response Headers:
I0828 11:43:14.673997 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:43:14 GMT
I0828 11:43:14.674025 26618 round_trippers.go:414] Audit-Id: 4f3d3a7b-f982-4607-8f7d-e50484d8f4ee
I0828 11:43:14.674053 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:43:14.674075 26618 round_trippers.go:414] Content-Length: 196
I0828 11:43:14.674143 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}
I0828 11:43:15.166150 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:43:15.175064 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 8 milliseconds
I0828 11:43:15.175100 26618 round_trippers.go:411] Response Headers:
I0828 11:43:15.175120 26618 round_trippers.go:414] Content-Length: 196
I0828 11:43:15.175137 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:43:15 GMT
I0828 11:43:15.175149 26618 round_trippers.go:414] Audit-Id: 7bd55f71-5b37-42ac-b709-e16cfb9d777c
I0828 11:43:15.175162 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:43:15.175216 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}

-- REPEATS OVER AND OVER --

I0828 11:45:14.174411 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:45:14.181212 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 6 milliseconds
I0828 11:45:14.181254 26618 round_trippers.go:411] Response Headers:
I0828 11:45:14.181270 26618 round_trippers.go:414] Audit-Id: 7e715a9f-d52b-4624-a1d4-f3a79991c837
I0828 11:45:14.181292 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:45:14.181311 26618 round_trippers.go:414] Content-Length: 196
I0828 11:45:14.181325 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:45:14 GMT
I0828 11:45:14.181392 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}
error marking master: timed out waiting for the condition

@verwilst

Using Kubernetes/kubeadm 1.11.2 on CoreOS.

@timothysc timothysc added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Oct 30, 2018
@ereslibre
Contributor

I0828 11:45:14.181212 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 6 milliseconds

I0828 11:45:14.181392 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}

I think this issue is not related to the originally reported error (Jun 14 17:38:29 localhost cloud-init: error marking master: etcdserver: request timed out). In #937 (comment) I think the problem is that the kubelet never registered, so the node k8s-acc-api01 is not found. Since all preflight errors are being ignored with --ignore-preflight-errors=all, we lose context about whether some preflight checks would have failed before reaching this point.

@chuckha

chuckha commented Nov 8, 2018

Regarding retries: there are (and were in 1.9.7) retries built into the code. However, there is what I believe to be a bug: if the connection to the API server returns an error, including a timeout error, there is no retry. I think the client code needs to be smarter about the errors it checks for, so that it can return false, nil on a retryable error and keep polling.

I believe this line is the culprit for why we don't get retries where we would expect them.

@timothysc timothysc added this to the v1.14 milestone Nov 13, 2018
@timothysc timothysc removed this from the v1.14 milestone Jan 4, 2019
@timothysc
Member

Closing as stale.

@anfernee
Member

anfernee commented Sep 4, 2019

Is this issue fixed? Why is it closed? I still hit the issue in 1.13.
