
Fix CNI api timeout for a long time #87

Merged

merged 1 commit on Feb 4, 2020

Conversation

@BSWANG (Member) commented Feb 4, 2020

What happened:

CNI timeout on pod initializing:

add cmd: error get ip info from grpc call, pod: default-busybox-594898f99b-6f2zg: rpc error: code = Unknown desc = error get pod info for: K8sPodName:"busybox-594898f99b-6f2zg" K8sPodNamespace:"default" K8sPodInfraContainerId:"7137c2e914b03cad9fe81567aca0d139d7096691156a5cb2e32d55bf96c45667" : Get https://172.21.0.1:443/api/v1/namespaces/default/pods/busybox-594898f99b-6f2zg?resourceVersion=0&timeout=1m10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]

The timeout error is encountered continuously for 10-30 minutes, even across CNI retries.

After further investigation: the k8s client uses the HTTP/2 protocol to connect to the apiserver, and HTTP/2 reuses the same TCP connection across HTTP requests. When the CNI timeout occurred, I found that terwayd's connection to the apiserver had become half-closed. The TCP state was still ESTABLISHED, but request packets sent on the connection got no response from the remote side. Only after TCP retransmission gives up, about 10-30 minutes later, is the connection re-established, and then CNI returns to normal.

tcp connection:
tcp        0   2395 10.0.8.228:38276        172.21.0.1:443          ESTABLISHED 5252/terwayd         on (43.29/10/0) # (countdown/tcp retry count/keepalive heartbeat count)

nf_conntrack:
ipv4     2 tcp      6 275 ESTABLISHED src=10.0.8.228 dst=172.21.0.1 sport=38276 dport=443 [UNREPLIED] src=10.0.9.130 dst=10.0.8.228 sport=6443 dport=25163 mark=0 zone=0 use=2

How to resolve

Reconstruct the connection to the apiserver immediately when a half-closed connection is detected.
There is community discussion of this in kubernetes/client-go#374,
and kubelet already does this in kubernetes/kubernetes#78016.
