
Fix CNI api timeout for a long time #87

Merged

merged 1 commit on Feb 4, 2020

Conversation

@BSWANG (Member) commented Feb 4, 2020

What happened:

CNI timeout on pod initializing:

add cmd: error get ip info from grpc call, pod: default-busybox-594898f99b-6f2zg: rpc error: code = Unknown desc = error get pod info for: K8sPodName:"busybox-594898f99b-6f2zg" K8sPodNamespace:"default" K8sPodInfraContainerId:"7137c2e914b03cad9fe81567aca0d139d7096691156a5cb2e32d55bf96c45667" : Get https://172.21.0.1:443/api/v1/namespaces/default/pods/busybox-594898f99b-6f2zg?resourceVersion=0&timeout=1m10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)]

The timeout error is encountered continuously for 10-30 minutes, even across CNI retries.

After further investigation: the k8s client uses the HTTP/2 protocol to connect to the apiserver, and HTTP/2 reuses the same TCP connection across HTTP requests. When the CNI timeout occurred, I found that terwayd's connection to the apiserver had become half-closed. The TCP state was still ESTABLISHED, but request packets sent on the connection got no response from the remote side. Only after TCP retransmission gives up, about 10-30 minutes later, is the connection re-established, and then CNI returns to normal.

tcp connection:
tcp        0   2395 10.0.8.228:38276        172.21.0.1:443          ESTABLISHED 5252/terwayd         on (43.29/10/0) # (countdown/tcp retry count/keepalive heartbeat count)

nf_conntrack:
ipv4     2 tcp      6 275 ESTABLISHED src=10.0.8.228 dst=172.21.0.1 sport=38276 dport=443 [UNREPLIED] src=10.0.9.130 dst=10.0.8.228 sport=6443 dport=25163 mark=0 zone=0 use=2

How to resolve

Reconstruct the connection to the apiserver immediately when a half-closed connection is detected.
There is community discussion of this in kubernetes/client-go#374,
and kubelet already does this in kubernetes/kubernetes#78016.
