
kube-svc-redirect CrashLoopBackOff #56

Closed
Guillaume-Mayer opened this issue Nov 22, 2017 · 10 comments

@Guillaume-Mayer

Today I created a new cluster in eastus.
When I look at the system pods with kubectl get pods -n kube-system, I see the following:

heapster-58f795c4cf-bmsjn               2/2       Running            0          46m
kube-dns-v20-6c8f7f988b-75xlk           3/3       Running            0          46m
kube-dns-v20-6c8f7f988b-rvv2x           3/3       Running            0          46m
kube-proxy-2plc9                        1/1       Running            0          46m
kube-proxy-pb6tr                        1/1       Running            0          46m
kube-proxy-xhvm8                        1/1       Running            0          46m
kube-svc-redirect-j8v6c                 0/1       CrashLoopBackOff   13         46m
kube-svc-redirect-jz5bl                 0/1       CrashLoopBackOff   13         46m
kube-svc-redirect-p5hbb                 0/1       Error              14         46m
kubernetes-dashboard-6fc8cf9586-fhscp   0/1       CrashLoopBackOff   13         46m
tunnelfront-7446f49869-mj2n5            1/1       Running            0          46m

@smithc

smithc commented Nov 27, 2017

I'm experiencing the same issue. Notably, the kubernetes-dashboard pod is also failing, which makes it impossible for me to use the dashboard through either 'kubectl proxy' or 'az aks browse'.

@amanohar

@smithc @Guillaume-Mayer can you provide more details, such as the name of your resource group and resource? This will enable us to look up logs and aid the investigation.

@smithc

smithc commented Nov 30, 2017

Hi @amanohar, my resource group is 'cs-kube' and my AKS container service name is 'cs-cluster'. Let me know if you need any further information; I'm happy to help.

Actually, I just checked my kube-system pods again (having left the cluster up and running for a few days), and it appears that the kubernetes-dashboard and kube-svc-redirect containers have finally started working. Here's the output of kubectl get pods -n kube-system:

heapster-75667786bb-djsh8               2/2       Running   0          4d
kube-dns-v20-6c8f7f988b-78v29           3/3       Running   0          6d
kube-dns-v20-6c8f7f988b-vkclw           3/3       Running   0          6d
kube-proxy-8gsqp                        1/1       Running   0          6d
kube-svc-redirect-xmrmk                 1/1       Running   535        6d
kubernetes-dashboard-6fc8cf9586-nvsbp   1/1       Running   495        6d
tunnelfront-644f654dbb-r55dx            1/1       Running   0          6d

Notice that the number of restarts is quite high on both the kube-svc-redirect and kubernetes-dashboard pods.

I'd be interested to know if anything was done on Microsoft's side to help stabilize those services, or if we should be on the lookout for those failing again in the future.

In any case, I just want to say thanks for taking the time to help out with troubleshooting.

@garystafford

garystafford commented Dec 12, 2017

Likewise, I am experiencing a similar issue. Is there any progress on fixing this? I have destroyed and created (3) new clusters in the last two days, with the same results. Prior to this, I was able to create clusters; although upgrading the Kubernetes version or adding nodes to a cluster didn't work, at least I could create a new cluster. Now, even that is not working.

> kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   heapster-58f795c4cf-snlz7               2/2       Running            0          1h
kube-system   kube-dns-v20-6c8f7f988b-kqkqz           3/3       Running            0          1h
kube-system   kube-dns-v20-6c8f7f988b-vklvp           3/3       Running            0          1h
kube-system   kube-proxy-kvs9g                        1/1       Running            0          1h
kube-system   kube-svc-redirect-4g2bw                 0/1       CrashLoopBackOff   16         1h
kube-system   kubernetes-dashboard-6fc8cf9586-wrcg8   0/1       CrashLoopBackOff   15         1h
kube-system   tunnelfront-684dbb4bfd-bh5h8            1/1       Running            0          1h

@debben

debben commented Dec 12, 2017

I'm having the same issue. I tried AKS for the first time last night and stood up a cluster in the East US region, where I encountered the same thing. I tore down the cluster and stood up another one this morning, only to get the same result.

When I ran the create command this morning, I used the flag --dns-name-prefix my-prefix. I ran into the same error and started to dig. kubectl logs --previous would time out, so I could never see why the pod was crashing.
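If anyone else hits that timeout, a couple of other ways to pull crash details (the pod name here is just a placeholder for whichever kube-svc-redirect pod is failing):
$ kubectl describe pod kube-svc-redirect-xxxxx -n kube-system
$ kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp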

Ultimately I pulled the image and tried playing with it locally to understand what it's doing:
$ docker pull dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3
$ docker run --rm -it dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3 sh

In the container I could see the script. I set the two environment variables APISERVER_FQDN and KUBERNETES_SVC_IP to match the values listed by running kubectl get ds kube-svc-redirect -o yaml -n kube-system. Once the variables were set, I tried running the run-kube-svc-redirect.sh script in the local container. This resulted in:

[ 18:12:43 ] INF: Validating if we can get an ip for the supplied FQDN: t_my-prefix-285018fc.hcp.eastus.azmk8s.io
Host t_my-prefix-285018fc.hcp.eastus.azmk8s.io not found: 3(NXDOMAIN)

That's when I realized the API server FQDN doesn't match what I see in the Azure portal, as it has a t_ prefix. Changing the variable in the local container and running the script again got me further along in the script before it failed.

I tried running kubectl edit ds kube-svc-redirect -n kube-system to remove what looks like an erroneous prefix from the FQDN variable. When I apply the change, though, it only lasts a few seconds before the daemon set definition is overwritten with the original configuration. I'm not sure what writes this daemon set or keeps updating it. This is as far as I got debugging.
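For anyone who wants to repeat the local experiment, it looked roughly like this; the FQDN, service IP, and script location are illustrative, with the real values coming from kubectl get ds kube-svc-redirect -o yaml -n kube-system:
$ docker run --rm -it \
    -e APISERVER_FQDN=my-prefix-285018fc.hcp.eastus.azmk8s.io \
    -e KUBERNETES_SVC_IP=10.0.0.1 \
    dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3 sh
# inside the container (script path may differ):
$ sh run-kube-svc-redirect.sh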

@debben

debben commented Dec 12, 2017

So, to further test my theory above, I ran kubectl get ds kube-svc-redirect -o yaml -n kube-system > kube-svc-redirect.yaml. I then edited the file, changing the APISERVER_FQDN variable to remove what I believed to be an erroneous prefix, and renamed the daemon set to 'kube-svc-redirect-fix'. I then ran kubectl apply -f kube-svc-redirect.yaml -n kube-system.
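In command form, the steps were roughly the following (the edit itself was done by hand in the exported yaml):
$ kubectl get ds kube-svc-redirect -o yaml -n kube-system > kube-svc-redirect.yaml
# edit kube-svc-redirect.yaml: rename metadata.name to kube-svc-redirect-fix
# and strip the leading t_ from the APISERVER_FQDN env value
$ kubectl apply -f kube-svc-redirect.yaml -n kube-system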

I could see my new daemon set and the pod it created. The pod ran without crashing. Unfortunately, kubernetes-dashboard was still in a crash loop. I left for a meeting, and when I came back, my daemon set 'kube-svc-redirect-fix' was gone. I'm guessing whatever controller was replacing my changes to the actual 'kube-svc-redirect' was also watching kube-system in general and deleting any additional resources created. The pods in question, however, were no longer stuck in a crash loop:

C:\Users\debben> kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS    RESTARTS   AGE
kube-system   heapster-75667786bb-ngqsv               2/2       Running   0          1h
kube-system   kube-dns-v20-6c8f7f988b-dfpv9           3/3       Running   0          6h
kube-system   kube-dns-v20-6c8f7f988b-x9ppt           3/3       Running   0          6h
kube-system   kube-proxy-qgthv                        1/1       Running   0          6h
kube-system   kube-svc-redirect-xjktm                 1/1       Running   29         3h
kube-system   kubernetes-dashboard-6fc8cf9586-w6w7r   1/1       Running   6          1h
kube-system   tunnelfront-8f8db54b7-8s5dp             1/1       Running   0          6h

I could now access the dashboard with az aks browse. When I ran kubectl get ds kube-svc-redirect -o yaml -n kube-system, I saw that APISERVER_FQDN still had the leading t_ prefix, so maybe that is the correct configuration after all. I'd still like to know how the cluster got into this state, how it was resolved (and whether any of the commands I ran had anything to do with it), and to get more of an explanation of what AKS puts in kube-system by default and how that namespace is kept pristine.

@slack
Contributor

slack commented Dec 19, 2017

I want to shed a little light on the underlying issue. As part of your AKS cluster, we provision a dedicated IP address that is used by the infrastructure that lets logs, exec, attach, and proxy work. That's the oddly named t_* hostname you see as part of kube-svc-redirect.

When we provision an AKS cluster, that IP address allocation is async. We've had a few service bugs and regional rate limits that have extended the allocation beyond 15 minutes. This shows up as logs not working, or kube-svc-redirect sitting in CrashLoopBackOff for some period of time, eventually recovering once the address allocation completes.
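While the allocation is still pending, a simple (unofficial) way to keep an eye on recovery is to watch the affected pods:
$ kubectl get pods -n kube-system -w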

Once these pods do connect up successfully, they will remain connected and shouldn't go back into CrashLoopBackOff again.

There have been a few cases where that allocation permanently fails. Longer-term, we are working on making this part of the service a lot more robust.

I'm going to close out this issue, since we don't have any active incidents at the moment!

@gonarys

gonarys commented Aug 23, 2018

I had this problem, and it turned out that I had a resource using the subnet dedicated to AKS. You have to check for this and, if so, remove that resource.
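One way to check what is attached to the AKS subnet; the resource group, vnet, and subnet names below are placeholders for your own:
$ az network vnet subnet show -g my-rg --vnet-name my-vnet -n my-subnet --query ipConfigurations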

@qiangli

qiangli commented Sep 4, 2018

It happened to me after adding more nodes to the cluster. Luckily, kubectl still worked; the cluster returned to normal after (repeatedly) deleting the failing pods.
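For reference, deleting a failing pod so that its daemon set or deployment recreates it looks like this (the pod name is a placeholder):
$ kubectl delete pod kube-svc-redirect-xxxxx -n kube-system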

@mrdfuse

mrdfuse commented Sep 11, 2018

We encountered this issue today as well, after adding more nodes to a pretty vanilla 1.10.6 cluster (no advanced networking).
@slack can you reopen the issue?

@ghost locked as resolved and limited conversation to collaborators on Aug 12, 2020