
etcdserver request timed out, when marking a node as master by adding a label and a taint #937

Closed
xlgao-zju opened this issue Jun 19, 2018 · 18 comments
Assignees
Labels
area/cloudprovider priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@xlgao-zju

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.9.7

Environment:

  • Kubernetes version (use kubectl version): 1.9.7
  • Cloud provider or hardware configuration: Alibaba Cloud
  • OS (e.g. from /etc/os-release): CentOS
  • Kernel (e.g. uname -a): 3.10.0-693.2.2.el7.x86_64
  • Others:

What happened?

An etcdserver request timed out when marking a node as master by adding a label and a taint.

The logs are:

Jun 14 17:37:35 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
Jun 14 17:37:36 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
Jun 14 17:37:36 localhost cloud-init: [kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-apiserver"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-controller-manager"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Adding extra host path mount "localtime" to "kube-scheduler"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
Jun 14 17:37:36 localhost cloud-init: [controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
Jun 14 17:37:36 localhost cloud-init: [init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
Jun 14 17:37:36 localhost cloud-init: [init] This might take a minute or longer if the control plane images have to be pulled.
Jun 14 17:38:17 localhost cloud-init: [apiclient] All control plane components are healthy after 40.516281 seconds
Jun 14 17:38:17 localhost cloud-init: [uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
Jun 14 17:38:21 localhost cloud-init: [markmaster] Will mark node cn-beijing.i-2ze8wq45pybh93gpzuda as master by adding a label and a taint
Jun 14 17:38:29 localhost cloud-init: error marking master: etcdserver: request timed out

What you expected to happen?

kubeadm init should have succeeded.

How to reproduce it (as minimally and precisely as possible)?

Issue kubeadm init; it happens sometimes...

Anything else we need to know?

etcd itself works well...

@xlgao-zju
Author

I think we could add some retries; that might fix this issue...

@xlgao-zju
Author

@neolit123 PTAL

@neolit123
Member

@xlgao-zju thanks for the report.

Issue kubeadm init, and it happens sometimes...

If it only happens sometimes, I wonder if it's a networking issue.
Do you have good connectivity to the etcd server in question?

/cc @detiber @stealthybox

@neolit123 neolit123 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/cloudprovider labels Jun 19, 2018
@xlgao-zju
Author

@neolit123

If it only happens sometimes, I wonder if it's a networking issue.
Do you have good connectivity to the etcd server in question?

Yes, I think it may be a networking issue too. But I checked etcd, and it worked well, so I do not know how to avoid this. I think adding some retries is one possible solution.

@detiber
Member

detiber commented Jun 19, 2018

@xlgao-zju I'm a bit hesitant to add retries without an explicit reason to do so, since retries could mask other potential issues. If the control plane node has some type of intermittent connectivity issue with etcd, there will be other cascading issues with the cluster.

When invoking kubeadm init, are you using a config file or passing in any additional arguments? If you are using the default config, or a config that doesn't specify an external etcd cluster, I would be even more concerned, since the etcd static pod is deployed locally on the host and should not suffer connectivity issues.

@timothysc timothysc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jun 19, 2018
@xlgao-zju
Author

@detiber I used the config file.

I created a three-node etcd cluster before issuing kubeadm init, and I specified its IPs in the config file.
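The relevant part of the config looked roughly like this (a sketch for the v1alpha1 API used by kubeadm 1.9; the endpoints and cert paths are placeholders, not my real values):

```yaml
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
etcd:
  # external three-node etcd cluster; placeholder IPs
  endpoints:
  - https://172.16.0.1:2379
  - https://172.16.0.2:2379
  - https://172.16.0.3:2379
  # placeholder client TLS material
  caFile: /etc/etcd/ca.pem
  certFile: /etc/etcd/client.pem
  keyFile: /etc/etcd/client-key.pem
```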

Feel free to ping me, if you need more evidence.

@xlgao-zju
Author

I came across this issue again today. Our team creates k8s clusters hundreds of times every day, so we run into some of the rare issues. :)

This is the log of etcd during that period.

-- Logs begin at Tue 2018-06-26 20:09:29 CST, end at Tue 2018-06-26 16:23:45 CST. --
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [term: 2] received a MsgVote message with higher term from f7babc917426a5a8 [term: 3]
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 became follower at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: adff8e2c78fa38a1 [logterm: 2, index: 1290, vote: 0] cast MsgVote for f7babc917426a5a8 [logterm: 2, index: 1290] at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 lost leader a680a06346fef120 at term 3
Jun 26 12:14:07 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: raft.node: adff8e2c78fa38a1 elected leader f7babc917426a5a8 at term 3
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/clusterrolebindings\" range_end:\"/registry/clusterrolebindingt\" count_only:true " with result "range_response_count:0 size:7" took too long (803.326661ms) to execute
Jun 26 12:14:25 iZwz9b0r3wwhsq4dee7ut5Z etcd[9543]: read-only range request "key:\"/registry/namespaces/kube-system\" " with result "range_response_count:1 size:178" took too long (187.157103ms) to execute

@detiber
Member

detiber commented Jun 26, 2018

@xlgao-zju what type of storage is backing these clusters? If etcd isn't backed by fast enough storage it could potentially account for the issue you are seeing.

@xlgao-zju
Author

@detiber I use a cloud disk on Alibaba Cloud, and the cloud disk seems kind of slow... Any suggestions for avoiding this issue?

@detiber
Member

detiber commented Jun 26, 2018

@xlgao-zju in the past, for CI/testing environments, I have used a memory-backed tmpfs to work around slow disk access. It doesn't solve the problem for longer-lived environments, but it works in a pinch for short-lived ones.

@xlgao-zju
Author

@detiber Yes, tmpfs works fine for short-lived envs, but is there any solution that can actually fix this issue (except changing the type of cloud disk :P)?

@timothysc
Member

/assign @detiber @chuckha

@timothysc timothysc removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 3, 2018
@timothysc timothysc added this to the v1.12 milestone Jul 3, 2018
@timothysc timothysc modified the milestones: v1.12, v1.13 Aug 21, 2018
@verwilst

verwilst commented Aug 28, 2018

I'm also running into this, while running kubespray with kubeadm enabled. I have 3 separate etcd nodes.

/opt/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.v1alpha2.yaml --ignore-preflight-errors=all -v256

-- SNIP --

[markmaster] Marking the node k8s-acc-api01 as master by adding the label "node-role.kubernetes.io/master=''"
[markmaster] Marking the node k8s-acc-api01 as master by adding the taints [node-role.kubernetes.io/master:NoSchedule]
I0828 11:43:14.666180 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:43:14.673901 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 7 milliseconds
I0828 11:43:14.673961 26618 round_trippers.go:411] Response Headers:
I0828 11:43:14.673997 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:43:14 GMT
I0828 11:43:14.674025 26618 round_trippers.go:414] Audit-Id: 4f3d3a7b-f982-4607-8f7d-e50484d8f4ee
I0828 11:43:14.674053 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:43:14.674075 26618 round_trippers.go:414] Content-Length: 196
I0828 11:43:14.674143 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}
I0828 11:43:15.166150 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:43:15.175064 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 8 milliseconds
I0828 11:43:15.175100 26618 round_trippers.go:411] Response Headers:
I0828 11:43:15.175120 26618 round_trippers.go:414] Content-Length: 196
I0828 11:43:15.175137 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:43:15 GMT
I0828 11:43:15.175149 26618 round_trippers.go:414] Audit-Id: 7bd55f71-5b37-42ac-b709-e16cfb9d777c
I0828 11:43:15.175162 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:43:15.175216 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}

-- REPEATS OVER AND OVER --

I0828 11:45:14.174411 26618 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, /" -H "User-Agent: kubeadm/v1.11.2 (linux/amd64) kubernetes/bb9ffb1" 'https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01'
I0828 11:45:14.181212 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 6 milliseconds
I0828 11:45:14.181254 26618 round_trippers.go:411] Response Headers:
I0828 11:45:14.181270 26618 round_trippers.go:414] Audit-Id: 7e715a9f-d52b-4624-a1d4-f3a79991c837
I0828 11:45:14.181292 26618 round_trippers.go:414] Content-Type: application/json
I0828 11:45:14.181311 26618 round_trippers.go:414] Content-Length: 196
I0828 11:45:14.181325 26618 round_trippers.go:414] Date: Tue, 28 Aug 2018 11:45:14 GMT
I0828 11:45:14.181392 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}
error marking master: timed out waiting for the condition

@verwilst

Using Kubernetes/kubeadm 1.11.2 on CoreOS.

@timothysc timothysc added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Oct 30, 2018
@ereslibre
Contributor

I0828 11:45:14.181212 26618 round_trippers.go:405] GET https://k8s-api.acc.verwilst:6443/api/v1/nodes/k8s-acc-api01 404 Not Found in 6 milliseconds

I0828 11:45:14.181392 26618 request.go:897] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes "k8s-acc-api01" not found","reason":"NotFound","details":{"name":"k8s-acc-api01","kind":"nodes"},"code":404}

I think this issue is not related to the originally reported error (Jun 14 17:38:29 localhost cloud-init: error marking master: etcdserver: request timed out). In #937 (comment) I think the problem is that the kubelet never registered, so the node k8s-acc-api01 is not found. Since all preflight errors are being ignored with --ignore-preflight-errors=all, we lose context about whether some preflight checks would have failed before reaching this point.

@chuckha

chuckha commented Nov 8, 2018

Regarding retries: there are (and were in 1.9.7) retries built into the code. However, there is what I believe to be a bug: if the connection to the API server returns an error, including a timeout error, there is no retry. I think the client code needs to be smarter about the errors it checks for, so that it can return false, nil on a retryable error and keep polling.

I believe this line is the culprit for why we don't get retries where we would expect them.

@timothysc timothysc added this to the v1.14 milestone Nov 13, 2018
@timothysc timothysc removed this from the v1.14 milestone Jan 4, 2019
@timothysc
Member

Closing as stale.

@anfernee
Member

anfernee commented Sep 4, 2019

Is this issue fixed? Why is it closed? I still hit the issue in 1.13.
