
API calls fail with timeout for logs / port-forward and others #232

Closed
ekarlso opened this issue Mar 8, 2018 · 53 comments

@ekarlso

ekarlso commented Mar 8, 2018

See kubectl debug output here:

I0308 10:32:21.539580   26486 round_trippers.go:417] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.8.1 (linux/amd64) kubernetes/f38e43b" -H "Authorization: Bearer x" https://x:443/api/v1/namespaces/default/pods/x/log?container=x
I0308 10:34:32.790295   26486 round_trippers.go:436] GET https://X:443/api/v1/namespaces/default/pods/x/log?container=x 500 Internal Server Error in 131250 milliseconds
I0308 10:34:32.790356   26486 round_trippers.go:442] Response Headers:
I0308 10:34:32.790376   26486 round_trippers.go:445]     Content-Type: application/json
I0308 10:34:32.790390   26486 round_trippers.go:445]     Content-Length: 275
I0308 10:34:32.790414   26486 round_trippers.go:445]     Date: Thu, 08 Mar 2018 09:34:32 GMT
I0308 10:34:32.790504   26486 request.go:836] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get https://aks-nodepool1-53392281-1:10250/containerLogs/default/x: dial tcp 10.240.0.6:10250: getsockopt: connection timed out","code":500}
I0308 10:34:32.790999   26486 helpers.go:207] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Get https://aks-nodepool1-53392281-1:10250/containerLogs/default/x/x: dial tcp 10.240.0.6:10250: getsockopt: connection timed out",
  "code": 500
}]
F0308 10:34:32.791043   26486 helpers.go:120] Error from server: Get https://aks-nodepool1-53392281-1:10250/containerLogs/default/x/x: dial tcp 10.240.0.6:10250: getsockopt: connection timed out

It also times out in Azure Cloud Shell, so it seems to be an internal error in AKS?

@mfaizanse

I am also facing this issue (#224). @slack @sauryadas, can you say when it will be fixed? AKS is in an unusable state right now.

@JunSun17

@ekarlso @mfaizanse So pulling logs does not work? Does kubectl get po work?

@mfaizanse

@JunSun17 Yes, kubectl get pods works. I deployed the cluster in West Europe.

@ekarlso
Author

ekarlso commented Mar 10, 2018

correct @jungchenhung

@JunSun17

@ekarlso @mfaizanse Yes, this is a known issue and we are investigating the root cause. We will update when we have more data on it. Thanks!

@mfaizanse

@JunSun17 I am facing this issue in a Central US cluster as well. Do you have an estimated timeline for when this issue will be resolved?

@slack
Contributor

slack commented Mar 15, 2018

Logs, exec, and attach all require the master <-> node tunnels to be established. Check that the tunnelfront and kube-svc-redirect pods are up and running:

$ kubectl -n kube-system get po -l component=tunnel
NAME                           READY     STATUS    RESTARTS   AGE
tunnelfront-7644cd56b7-l5jmc   1/1       Running   0          2d
$ kubectl -n kube-system get po -l component=kube-svc-redirect
NAME                      READY     STATUS    RESTARTS   AGE
kube-svc-redirect-pq6kf   1/1       Running   0          2d
kube-svc-redirect-x6sq5   1/1       Running   0          2d
kube-svc-redirect-zjl7x   1/1       Running   1          2d

If the pods are running, delete the tunnelfront pod; it will be recreated after a few minutes:

$ kubectl -n kube-system delete po -l component=tunnel
pod "tunnelfront-7644cd56b7-l5jmc" deleted

@mfaizanse

$ kubectl logs tunnelfront-67c6c6f4bd-4hjjb -n kube-system
Error from server: Get https://aks-agentpool-xxxxxxx-1:10250/containerLogs/kube-system/tunnelfront-67c6c6f4bd-4hjjb/tunnel-front: dial tcp 10.240.0.5:10250: getsockopt: connection timed out
$ kubectl logs kube-svc-redirect-8k8lh -n kube-system
Error from server: Get https://aks-agentpool-xxxxxxx-2:10250/containerLogs/kube-system/kube-svc-redirect-8k8lh/redirector: dial tcp 10.240.0.6:10250: getsockopt: connection timed out

@mfaizanse

Creating the cluster using the Azure CLI resolved the issue.

@JunSun17

@mfaizanse Sorry for the late reply. I have asked another customer facing the same issue for logs to investigate this further. If you still have the problematic cluster, can you provide the logs so we can do further checking:

  1. Find out which node has tunnelfront running:
    kubectl -n kube-system get po -o wide
    and you should see something like:
    tunnelfront-687c6c5f68-t6bbl 1/1 Running 0 20m 10.244.5.42 aks-nodepool1-18710123-0

  2. Log in to that agent node aks-nodepool1-xxxxx and use docker to pull the tunnelfront log (docker container logs --details tunnelfront-xxxxx).

and please post the last 100 lines from the log here.
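A condensed sketch of those two steps (the pod and container names are placeholders; --tail 100 is just a convenient way to grab the last 100 lines):

$ kubectl -n kube-system get po -o wide | grep tunnelfront      # note the NODE column
# then, after logging in to that agent node:
$ docker ps | grep tunnelfront                                  # find the tunnelfront container ID
$ docker container logs --details --tail 100 <container-id>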

@pkelleratwork

pkelleratwork commented Mar 20, 2018

I am having the same problem - trying to use Helm, I receive the following error:

Kubernetes cluster on Azure.
kubectl is working.
helm fails to connect.
Verified tiller is running.

$ helm ls
Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.5:10250: getsockopt: connection timed out
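Since Helm v2 reaches tiller through a port-forward, the failure can be reproduced without Helm at all. A hedged sketch follows: the pod name is a placeholder, the label selector assumes the default tiller-deploy labels, and 44134 is tiller's default gRPC port. If the port-forward also times out, the problem is the master <-> node tunnel rather than Helm itself:

$ kubectl -n kube-system get po -l name=tiller
$ kubectl -n kube-system port-forward <tiller-pod-name> 44134:44134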

@JunSun17

@pkelleratwork Do you experience the log issue too? If so, can you kindly provide the logs as described in my comment above?

@01100010011001010110010101110000

I'm experiencing this issue as well; my tunnelfront logs are in this gist

@Nirmalyasen

Running into the same issue trying to run helm against a 1.9.2 AKS cluster in Azure

helm ls
Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out

@jpatters

jpatters commented Mar 26, 2018

I am having this same issue using helm as well as kubectl logs.
AKS cluster created via the CLI about a month ago.
kubectl works for other commands.
tiller is running.
Exact same error as above.

@jpatters

I've upgraded both Kubernetes and Helm and things are working again. It would be nice to know what the cause of this was, though.

@vogelinho10

We are also facing this issue. The cluster is created in West Europe.
kubectl proxy, logs, and helm have all stopped working since today.
Yesterday, after the cluster was set up, everything was working fine.
I tried deleting tunnelfront, svc-redirect and proxy but I still can't see an improvement.

How can we fix that?

@lkt82

lkt82 commented Apr 19, 2018

We are also seeing this. Please fix.

@vogelinho10

Is there any update on that issue?
We redeployed the cluster 2 weeks ago and are now facing the same issue again!

@jpatters

I abandoned AKS because of this issue.

@jluenne

jluenne commented May 15, 2018

How did you guys create those AKS clusters initially? It may just be a lucky coincidence, but I haven't seen this issue on clusters I created via the CLI yet, whereas we have already had it twice on clusters created using ARM templates. Same region, same version.

@kautsig

kautsig commented May 15, 2018

Let me briefly describe what we found out on the issue:

  • We provisioned a 2-node cluster with a custom subnet
  • There is a tunnelfront pod for master <-> node communication
    deployed in the cluster, which is needed for these commands
    (log/exec/attach). Mentioned above.
  • tunnelfront fails because it cannot resolve the API hostname to
    retrieve the secret it needs.
  • We found that DNS works for pods on the first node
    (aks-default-x-0), but is broken on the second (aks-default-x-1)

We could not find the root cause, which is probably related to
connectivity (e.g. routing), as we also can't ping pods on node 1 from
pods on node 2.

Scaling down to 1 node worked for now.
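For anyone who wants to reproduce the per-node DNS check, a minimal sketch (the busybox image and pod name are arbitrary choices; repeat until the test pod lands on each node you care about, or pin it with a nodeSelector):

$ kubectl run dns-test --image=busybox:1.28 --restart=Never -- sleep 3600
$ kubectl get pod dns-test -o wide                      # note which node it was scheduled to
$ kubectl exec dns-test -- nslookup kubernetes.default  # hangs/times out on a broken node
$ kubectl delete pod dns-test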

@jpatters

@jluenne I've created clusters with both the web ui and the cli and had it on both.

@ResDiaryLewis

We had the issue described by Adam here on a single-node cluster.

  • We used az aks scale to scale our cluster up by a single node, then drained our previous node with kubectl drain, causing all of our pods to be scheduled to the new node.
  • Then we ran kubectl logs again and discovered that it worked, so we knew that the issue was with the previous node.
  • We then restarted that VM in the Azure portal and uncordoned it with kubectl uncordon. We also drained the new node again.
  • Everything worked after that, so we concluded that restarting the VM fixed it. This may not be feasible for those storing data in the node's emptyDir, however.

TL;DR - Restart the VM in the Azure portal if you can (a rough sketch of the commands follows below).
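A rough sketch of the sequence above (resource group, cluster and node names are placeholders; the drain flags are the usual ones for nodes running DaemonSets and emptyDir volumes):

$ az aks scale --resource-group <rg> --name <cluster> --node-count 2               # add a node
$ kubectl drain aks-nodepool1-xxxxxxxx-0 --ignore-daemonsets --delete-local-data   # move pods off the old node
$ kubectl logs <some-pod>                                                          # should work again
# restart the old node's VM in the Azure portal, then:
$ kubectl uncordon aks-nodepool1-xxxxxxxx-0
$ kubectl drain aks-nodepool1-xxxxxxxx-1 --ignore-daemonsets --delete-local-data   # move workloads back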

@kattelus

We had this too on a 3-node cluster on a custom subnet. Scaling down to 1 node helped for now, but 1 node is not enough in the long run.

@kautsig

kautsig commented May 16, 2018

@ResDiaryLewis I'll try to combine our findings:

  • drain-ing the node causes tunnelfront to be re-scheduled (probably to a "good" node)
  • logs/exec/attach work then
  • uncordon-ing the old node and drain-ing the new node moves the tunnelfront pod back to node 1

So what kind of state is the node in that a restart fixes it? 🤔

@pushrbx

pushrbx commented May 22, 2018

Just to twist things even more, I'd like to add that this problem started happening for me after I installed OMS integration following the docs: https://docs.microsoft.com/en-us/azure/monitoring/monitoring-container-health
My cluster version was 1.8.7.
I tried Lewis's solution but it didn't cure the problem. After upgrading the cluster to 1.9.6, however, the problem disappeared.

@wojciech12

I solved the issue with tunnelfront. I deleted the following pods from the kube-system namespace:

kube-svc-redirect
kube-proxy
kube-dns
tunnelfront

Once the new pods came up, everything worked.
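A hedged one-liner for deleting those pods without knowing their labels; it simply matches on pod name, so double-check the grep pattern against your own cluster before running it:

$ kubectl -n kube-system get po --no-headers | egrep 'kube-svc-redirect|kube-proxy|kube-dns|tunnelfront' | awk '{print $1}' | xargs kubectl -n kube-system delete po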

@wojciech12

wojciech12 commented Jun 8, 2018

^ All died today; service forwarding stopped working.

@markmcnaughton

markmcnaughton commented Jun 12, 2018

Also experiencing the issue in west europe.

What doesn't work (for me, kube 1.9.6):

  • Restarting the node
  • deleting kube-svc-redirect, kube-proxy, kube-dns, tunnelfront
  • running the script from sauryadas and the subsequent command from hleoxalide

What helped

  • az aks scale, kubectl drain, kubectl uncordon

Edit: this only helps temporarily. This is really a nightmare - two days of constant AKS issues, and our time is running out fast.

@wojciech12

wojciech12 commented Jun 13, 2018

^ I can confirm that scaling down the AKS cluster to 1 node and scaling back to the original size of the cluster helps.

This bug is a nightmare.
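For reference, that workaround looks roughly like this (resource group and cluster names are placeholders, and 3 is assumed to be the original node count):

$ az aks scale --resource-group <rg> --name <cluster> --node-count 1
$ az aks scale --resource-group <rg> --name <cluster> --node-count 3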

@kautsig

kautsig commented Jun 14, 2018

Not being able to do kubectl logs/attach is probably just an effect of DNS not working on any cluster node but node 0 (the node where the DNS service endpoint is, to be more precise).

This is also why downscaling fixes the issue, because all broken pods (tunnelfront for this specific issue) are rescheduled to node 0.

When I scale up again, logs/attach is fixed, but DNS is still broken for nodes other than 0. On a broken node, I can resolve names through an external DNS server but not through the cluster DNS (10.0.0.10) from within pods.

root@netdebug2-67864574f7-dqgx5:/# dig google.com
; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached
root@netdebug2-67864574f7-dqgx5:/# dig @8.8.8.8 google.com

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> @8.8.8.8 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:

root@netdebug2-67864574f7-dqgx5:/#

The reason for this can only be node-to-node communication, because the pods can't reach the DNS service. I investigated the route table created in the AKS resource group, which looks fine, but then I noticed it is not associated with the subnet, which seems to be an already-reported issue: #400.

As in #400, I also use Terraform for provisioning, as probably a lot of people here do.

I can't rule out other issues causing similar effects, but associating the route table with the AKS subnet definitely makes a difference for me.
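To check (and, if necessary, restore) that association with the Azure CLI, something along these lines should work; the resource names are placeholders, and the route table lives in the MC_* resource group that AKS creates:

$ az network route-table list --resource-group MC_<rg>_<cluster>_<region> -o table
$ az network vnet subnet show --resource-group <vnet-rg> --vnet-name <vnet> --name <subnet> --query routeTable
# if routeTable comes back null, associate the AKS route table with the subnet:
$ az network vnet subnet update --resource-group <vnet-rg> --vnet-name <vnet> --name <subnet> --route-table <route-table-id>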

@aderderian

Any updates on this? I am seeing the issue most recently after deploying a new service via the Azure Portal and then scaling down from 3 nodes to 1. I can't deploy via Helm anymore; I get the error originally described.

@YvonneArnoldus

Hi,

I have the same issue (kube 1.10.3). Neither of these helped:

  • Restarting the node
  • Deleting kube-svc-redirect, kube-proxy, kube-dns, tunnelfront

I have created a support issue, but all that I got so far is that the pod I'm running should be on a bigger VM; even when I cap my resources or delete the pod from the cluster, I have the same problem.

@kylel

kylel commented Jul 9, 2018

I saw the same issue today after scaling from 1 to 3 nodes.
"kubectl -n kube-system delete po -l component=tunnel" has fixed the issue, but I'm not sure if this is permanent.

@SurushS

SurushS commented Aug 13, 2018

We had the same issue yesterday. All pods that died wouldn't come back up again (CrashLoopBackOff). az aks browse and kubectl proxy didn't work; the kubectl get commands worked fine. Several services/APIs (pods) still worked, others did not (not reachable, and after a restart they don't come up). After following the steps from @wojciech12 and restarting the four system pods, it seems it's running again this morning. I don't know for how long.

I'm really worried about the stability of AKS, to be honest. When it was in preview we had to recreate our cluster 5 times in a few months. Since it went GA (June) we were hopeful this wouldn't keep happening. However, this resulted in a big downtime over the weekend for us, which isn't good. Seeing that I'm not the only one having these symptoms, hopefully the AKS team can fix this issue quickly, because to me it still feels like it's in preview.

@agolomoodysaada
Contributor

The fact that any worker node going unstable freezes the entire cluster, and that all tooling and deployment scripts fail, is a big blocker. It defeats the premise of Kubernetes itself that your applications stay highly available even when a worker node fails. Putting the tunnelfront and svc-redirect pods inside the cluster is not designing with failure in mind. Are there any plans to make this design actually handle failure smoothly?

@mleneveut

Same issue here in North Europe:

  • Created a 3-node cluster with the Azure CLI in a custom VNet with this command:
az aks create --resource-group $aksResourceGroup --name $aksClusterName --node-count $nodeCount --node-vm-size $nodeVMSize --max-pods 100 --generate-ssh-keys --network-plugin azure --vnet-subnet-id $vNetId --docker-bridge-address $dockerBridgeAddress --dns-service-ip $dnsServiceIp --service-cidr $serviceCidr --enable-rbac --aad-server-app-id $serverAppId --aad-server-app-secret $serverAppSecret --aad-client-app-id $clientAppId --aad-tenant-id $tenantId
  • Upgraded to version 1.11.2

Things were working properly at first. We:

  • added tags to start / stop the VMs with an Automation PowerShell script
    • Start-AzureRmVM -Name $AzureResourceItem.Name -ResourceGroupName $AzureResourceItem.ResourceGroupName
    • Stop-AzureRmVM -Name $AzureResourceItem.Name -ResourceGroupName $AzureResourceItem.ResourceGroupName -Force
  • tried to replace kube-dns with CoreDNS, but failed, so rolled back to kube-dns
  • installed Istio and a tiller service account
  • installed coreos/Prometheus

Then errors showed up the day after when doing:

Tried:

  • stop / start the VMs: not better
  • killed kube-system pods one by one: not better
  • removed every pod installed (Istio, Prometheus): not better
  • removed properly all Istio custom resources, the tiller service account and Istio resources: not better
  • upgrading AKS to the same version: did not work, still doing nothing after 45 minutes => killed
  • scale to 1 node: not better
  • scale back to 3 nodes: not better

N.B.: we have another AKS cluster in North Europe with 3 nodes under a custom VNet, RBAC disabled, default max pods (30), Prometheus, ELK, lots of pods deployed, and no problems.

@jskulavik

Please see #676

@Kuvaldis

In our setup we had a similar issue. We had created a custom route table for the AKS subnet before the actual AKS instance. It turned out that AKS created its own route table but could not attach it to the subnet because one was already attached.

Maybe not the actual cause in your case, but it might be helpful.

@wagnst

wagnst commented Jan 24, 2019

Facing the same issue with a brand-new AKS cluster in Europe. See the docker logs from tunnelfront below. To be honest, the solution looks not fully thought through.

The error is as the following:

$ kubectl -n kube-system exec addon-http-application-routing-nginx-ingress-controller-ffxszzl -it -- bash
Error from server: error dialing backend: dial tcp 10.xxx.xxx.xxx:10250: i/o timeout

Please take a look:

$ docker container logs --details 4fa68112c512
 [ 13:56:25 ] INF: Starting: Every:[30] kubeconfig:[/etc/kubernetes/kubeconfig/kubeconfig] SSH-Server:[t_testcafuaks3-xxxeff.hcp.westeurope.azmk8s.io]
 [ 13:56:26 ] INF: Creating tunnel interface
 [ 13:56:26 ] INF: Tunnel device created user:[apiserver] name:[tun0] ip:[192.0.2.200/32] peer:[192.0.2.100]
 [ 13:56:26 ] INF: ip Configuration:
 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
 3: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
     link/ether 02:6c:2d:35:42:a3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
     inet 10.244.0.22/24 scope global eth0
        valid_lft forever preferred_lft forever
 4: tun0: <POINTOPOINT,MULTICAST,NOARP,UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 500
     link/none
     inet 192.0.2.200 peer 192.0.2.100/32 scope global tun0
        valid_lft forever preferred_lft forever
 [ 13:56:26 ] INF: ******************
 [ 13:56:26 ] INF: iptables are not configured, configuring..
 [ 13:56:26 ] INF: iptables: Masqueraded both tun0  and eth0
 [ 13:56:26 ] INF: iptables: accept on incoming traffic on tun0
 [ 13:56:26 ] INF: current ip tables
 Chain PREROUTING (policy ACCEPT)
 num  target     prot opt source               destination

 Chain INPUT (policy ACCEPT)
 num  target     prot opt source               destination

 Chain OUTPUT (policy ACCEPT)
 num  target     prot opt source               destination

 Chain POSTROUTING (policy ACCEPT)
 num  target     prot opt source               destination
 1    MASQUERADE  all  --  anywhere             anywhere
 2    MASQUERADE  all  --  anywhere             anywhere
 Chain INPUT (policy ACCEPT)
 num  target     prot opt source               destination

 Chain FORWARD (policy ACCEPT)
 num  target     prot opt source               destination
 1    ACCEPT     all  --  anywhere             anywhere

 Chain OUTPUT (policy ACCEPT)
 num  target     prot opt source               destination
 [ 13:56:26 ] INF: ****************
 [ 13:56:26 ] INF: Attempting to download tunnel-front secret
 [ 13:56:27 ] INF: Success:Get secret:[tunnelfront] from kubernetes
 [ 13:56:27 ] INF: Secret tunnelfront was downloaded successfully
 [ 13:56:28 ] INF: Keys extracted from secret:[tunnelfront] successfully
 [ 13:56:28 ] INF: Attempting to download tunnel-front secret
 [ 13:56:28 ] INF: Success:Get secret:[tunnelend] from kubernetes
 [ 13:56:28 ] INF: Secret tunnelend was downloaded successfully
 [ 13:56:28 ] INF: Keys extracted from secret:[tunnelend] successfully
 [ 13:56:28 ] INF: Attempting to download tunnelgateway [publicKey] from [testcafuaks3-xxx.tun.westeurope.azmk8s.io:443] to [/tmp/tunnel_gateway]
 --2019-01-24 13:56:28--  https://testcafuaks3-xxx.tun.westeurope.azmk8s.io/secrets
 Resolving testcafuaks3-xxx.tun.westeurope.azmk8s.io (testcafuaks3-xxx.tun.westeurope.azmk8s.io)... 51.145.xxx.xxx
 Connecting to testcafuaks3-xxx.tun.westeurope.azmk8s.io (testcafuaks3-xxx.tun.westeurope.azmk8s.io)|51.145.xxx.xxx|:443... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 380 [text/plain]
 Saving to: '/tmp/tunnel_gateway/publicKey'

      0K                                                       100%  604M=0s

 2019-01-24 13:56:28 (604 MB/s) - '/tmp/tunnel_gateway/publicKey' saved [380/380]

 [ 13:56:28 ] INF: Success:Get tunnelgateway SSH [publicKey] from TunnelGateway TLS
 [ 13:56:28 ] INF: Secret /tmp/tunnel_gateway publicKey was downloaded successfully
 [ 13:56:28 ] INF: This host identity configured with /tmp/tunnel_front/publicKey & /tmp/tunnel_front/privateKey
 [ 13:56:28 ] INF: sshd server start with PID:[171]
 Could not load host key: /etc/ssh/ssh_host_ecdsa_key
 Could not load host key: /etc/ssh/ssh_host_ed25519_key
 [ 13:56:28 ] INF: host[t_testcafuaks3-xxx.hcp.westeurope.azmk8s.io:9000] was added for root
 [ 13:56:28 ] INF: host[testcafuaks3-xxx.tun.westeurope.azmk8s.io:22] was added for root
 [ 13:56:29 ] INF: Tunnel front and end both server.version == local.version, no rotation needed
 error: list of process IDs must follow -p

 Usage:
  ps [options]

  Try 'ps --help <simple|list|output|threads|misc|all>'
   or 'ps --help <s|l|o|t|m|a>'
  for additional help text.

 For more details see ps(1).
 Agent pid 295
 [ 13:56:29 ] INF: ssh-agent has been started with pid: 295
 Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
 [ 13:56:29 ] INF: Ssh to tunnelend is not connected.. attempting to connect

@avik-so

avik-so commented Feb 12, 2019

Just had this issue with 1.12.4; is there a known workaround / fix?
None of the above solutions worked.

@Tebro

Tebro commented Mar 7, 2019

We are experiencing this as well. Kubernetes version is 1.12.6. We tried scaling to 1 node and removing the pods mentioned above, but nothing helps. The service is something we would really like to use, but this is a fundamental part of managed Kubernetes that needs to be working.

@avik-so

avik-so commented Mar 11, 2019

We are experiencing this as well. Kubernetes version is 1.12.6. Tried scaling to 1 node, and removing the pods mentioned above but nothing helps. The service is something we would really like to use but this is a fundamental part of managed Kubernetes that needs to be working.

We ended up creating new clusters. In one case we needed a different region, in another case we needed different VM sizes.
It was easier to just create a new cluster and migrate than to debug the issue.
In the new clusters, everything is working smoothly.

@jnoller
Contributor

jnoller commented Jun 12, 2019

I am locking this thread. I am sorry that this issue is not being resolved at the speed you’d like (or that the AKS team would like). To be clear, SLB support is entering final testing before going into preview. The current timeout issues, as communicated, are directly tied to the Basic LB timeouts and behaviors.

Issues on this repo are not places to vent at the AKS team, or to state that we’re not taking the product seriously or not doing our jobs. I understand bugs impact all of us, and low-level behavioral issues like this are very painful.

This repository is meant for feature requests and bug reports, and the repository is governed by the code of conduct linked in the readme.

This issue is still open, but this comment thread is locked. The issue will be updated when we roll out the required changes.

Azure locked this as too heated and limited conversation to collaborators on Jun 12, 2019
jluk closed this as completed on Apr 8, 2020