
kube-svc-redirect CrashLoopBackOff #56

Closed
Guillaume-Mayer opened this issue Nov 22, 2017 · 10 comments

@Guillaume-Mayer

Today I created a new cluster in eastus.
When I look at the system pods with kubectl get pods -n kube-system, I see the following:

heapster-58f795c4cf-bmsjn               2/2       Running            0          46m
kube-dns-v20-6c8f7f988b-75xlk           3/3       Running            0          46m
kube-dns-v20-6c8f7f988b-rvv2x           3/3       Running            0          46m
kube-proxy-2plc9                        1/1       Running            0          46m
kube-proxy-pb6tr                        1/1       Running            0          46m
kube-proxy-xhvm8                        1/1       Running            0          46m
kube-svc-redirect-j8v6c                 0/1       CrashLoopBackOff   13         46m
kube-svc-redirect-jz5bl                 0/1       CrashLoopBackOff   13         46m
kube-svc-redirect-p5hbb                 0/1       Error              14         46m
kubernetes-dashboard-6fc8cf9586-fhscp   0/1       CrashLoopBackOff   13         46m
tunnelfront-7446f49869-mj2n5            1/1       Running            0          46m

@smithc

smithc commented Nov 27, 2017

I'm experiencing the same issue. Notably, the kubernetes-dashboard pod is also failing, which makes it impossible for me to use the dashboard through either 'kubectl proxy' or 'az aks browse'.

@amanohar

@smithc @Guillaume-Mayer can you provide more details, such as the name of your resource group and resource? This will enable us to look up logs and aid the investigation.

@smithc

smithc commented Nov 30, 2017

Hi @amanohar, my resource group is 'cs-kube' and my AKS container service name is 'cs-cluster'. Let me know if you need any further information; I'm happy to help.

Actually, I just checked my kube-system pods again (having left the cluster up and running for a few days), and it appears that the kubernetes-dashboard and kube-svc-redirect containers have finally started working. Here's the output of kubectl get pods -n kube-system:

heapster-75667786bb-djsh8               2/2       Running   0          4d
kube-dns-v20-6c8f7f988b-78v29           3/3       Running   0          6d
kube-dns-v20-6c8f7f988b-vkclw           3/3       Running   0          6d
kube-proxy-8gsqp                        1/1       Running   0          6d
kube-svc-redirect-xmrmk                 1/1       Running   535        6d
kubernetes-dashboard-6fc8cf9586-nvsbp   1/1       Running   495        6d
tunnelfront-644f654dbb-r55dx            1/1       Running   0          6d

Notice that the number of restarts is quite high on both the kube-svc-redirect and kubernetes-dashboard pods.

I'd be interested to know if anything was done on Microsoft's side to help stabilize those services, or if we should be on the lookout for those failing again in the future.

In any case, I just want to say thanks for taking the time to help out with troubleshooting.

@garystafford

garystafford commented Dec 12, 2017

Likewise, I am experiencing a similar issue. Is there any progress on fixing this? I have destroyed and created (3) new clusters in the last two days, with the same results. Prior to this, I was able to create clusters; although upgrading the Kubernetes version or adding nodes to a cluster didn't work, at least I could create a new cluster. Now, even that is not working.

> kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   heapster-58f795c4cf-snlz7               2/2       Running            0          1h
kube-system   kube-dns-v20-6c8f7f988b-kqkqz           3/3       Running            0          1h
kube-system   kube-dns-v20-6c8f7f988b-vklvp           3/3       Running            0          1h
kube-system   kube-proxy-kvs9g                        1/1       Running            0          1h
kube-system   kube-svc-redirect-4g2bw                 0/1       CrashLoopBackOff   16         1h
kube-system   kubernetes-dashboard-6fc8cf9586-wrcg8   0/1       CrashLoopBackOff   15         1h
kube-system   tunnelfront-684dbb4bfd-bh5h8            1/1       Running            0          1h

@debben

debben commented Dec 12, 2017

I'm having the same issue. I tried AKS for the first time last night and stood up a cluster in the East US region, where I encountered the same thing. I tore down the cluster and stood up another one this morning, only to get the same result.

When I ran the create command this morning, I used the flag --dns-name-prefix my-prefix. I ran into the same error and started to dig. kubectl logs --previous would time out, so I could never see why the pod was crashing.
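If anyone else hits that timeout, a couple of other ways to pull crash details (the pod name here is just a placeholder for whichever kube-svc-redirect pod is failing):
$ kubectl describe pod kube-svc-redirect-xxxxx -n kube-system
$ kubectl get events -n kube-system --sort-by=.metadata.creationTimestamp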

Ultimately I pulled the image and tried playing with it locally to understand what it's doing:
$ docker pull dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3
$ docker run --rm -it dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3 sh

In the container I could see the script. I set the two environment variables APISERVER_FQDN and KUBERNETES_SVC_IP to match the values listed by running kubectl get ds kube-svc-redirect -o yaml -n kube-system. Once the variables were set, I tried running the run-kube-svc-redirect.sh script in the local container. This resulted in:

[ 18:12:43 ] INF: Validating if we can get an ip for the supplied FQDN: t_my-prefix-285018fc.hcp.eastus.azmk8s.io
Host t_my-prefix-285018fc.hcp.eastus.azmk8s.io not found: 3(NXDOMAIN)

That's when I realized the API server FQDN doesn't match what I see in the Azure portal, as it has a t_ prefix. Changing the variable in the local container and running the script again got me further along in the script before it failed.

I tried running kubectl edit ds kube-svc-redirect -n kube-system to remove what looks like an erroneous prefix from the FQDN variable. When I apply the change, though, it only lasts a few seconds before the daemon set definition is overwritten with the original configuration. I'm not sure what writes this daemon set or keeps updating it. This is as far as I got debugging.
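For anyone who wants to repeat the local experiment, it looked roughly like this; the FQDN, service IP, and script location are illustrative, with the real values coming from kubectl get ds kube-svc-redirect -o yaml -n kube-system:
$ docker run --rm -it \
    -e APISERVER_FQDN=my-prefix-285018fc.hcp.eastus.azmk8s.io \
    -e KUBERNETES_SVC_IP=10.0.0.1 \
    dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3 sh
# inside the container (script path may differ):
$ sh run-kube-svc-redirect.sh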

@debben

debben commented Dec 12, 2017

So, to further test my theory above, I ran kubectl get ds kube-svc-redirect -o yaml -n kube-system > kube-svc-redirect.yaml. I then edited the file, changing the APISERVER_FQDN variable to remove what I believed to be an erroneous prefix, and renamed the daemon set to 'kube-svc-redirect-fix'. I then ran kubectl apply -f kube-svc-redirect.yaml -n kube-system.
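In command form, the steps were roughly the following (the edit itself was done by hand in the exported yaml):
$ kubectl get ds kube-svc-redirect -o yaml -n kube-system > kube-svc-redirect.yaml
# edit kube-svc-redirect.yaml: rename metadata.name to kube-svc-redirect-fix
# and strip the leading t_ from the APISERVER_FQDN env value
$ kubectl apply -f kube-svc-redirect.yaml -n kube-system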

I could see my new daemon set and the pod it created. The pod ran without crashing. Unfortunately, kubernetes-dashboard was still in a crash loop. I left for a meeting, and when I came back, my daemon set 'kube-svc-redirect-fix' was gone. I'm guessing whatever controller was replacing my changes to the actual 'kube-svc-redirect' was also watching kube-system in general and deleting any additional resources created. The pods in question, however, were no longer stuck in a crash loop:

C:\Users\debben> kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS    RESTARTS   AGE
kube-system   heapster-75667786bb-ngqsv               2/2       Running   0          1h
kube-system   kube-dns-v20-6c8f7f988b-dfpv9           3/3       Running   0          6h
kube-system   kube-dns-v20-6c8f7f988b-x9ppt           3/3       Running   0          6h
kube-system   kube-proxy-qgthv                        1/1       Running   0          6h
kube-system   kube-svc-redirect-xjktm                 1/1       Running   29         3h
kube-system   kubernetes-dashboard-6fc8cf9586-w6w7r   1/1       Running   6          1h
kube-system   tunnelfront-8f8db54b7-8s5dp             1/1       Running   0          6h

I could now access the dashboard with az aks browse. When I ran kubectl get ds kube-svc-redirect -o yaml -n kube-system, I saw that APISERVER_FQDN still had the leading t_ prefix, so maybe that is the correct configuration after all. I'd still like to know how the cluster got into this state, how it was resolved (and whether any of the commands I ran had anything to do with it), and to get more of an explanation of what AKS puts in kube-system by default and how that namespace is kept pristine.

@slack
Contributor

slack commented Dec 19, 2017

I want to shed a little light on the underlying issue. As part of your AKS cluster, we provision a dedicated IP address that is used by the infrastructure that lets logs, exec, attach, and proxy work. That's the oddly named t_* hostname you see as part of kube-svc-redirect.

When we provision an AKS cluster, that IP address allocation is async. We've had a few service bugs and regional rate limits that have extended the allocation beyond 15 minutes. This shows up as logs not working, or kube-svc-redirect sitting in CrashLoopBackOff for some period of time, eventually recovering once the address allocation completes.
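While the allocation is still pending, a simple (unofficial) way to keep an eye on recovery is to watch the affected pods:
$ kubectl get pods -n kube-system -w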

Once these pods do connect up successfully, they will remain connected and shouldn't go back into CrashLoopBackOff again.

There have been a few cases where that allocation permanently fails. Longer-term, we are working on making this part of the service a lot more robust.

I'm going to close out this issue, since we don't have any active incidents at the moment!

@gonarys

gonarys commented Aug 23, 2018

I had this problem, and it turned out that I had a resource using the subnet dedicated to AKS. You have to check for this and, if so, remove that resource.
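One way to check what is attached to the AKS subnet; the resource group, vnet, and subnet names below are placeholders for your own:
$ az network vnet subnet show -g my-rg --vnet-name my-vnet -n my-subnet --query ipConfigurations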

@qiangli

qiangli commented Sep 4, 2018

It happened to me after adding more nodes to the cluster. Luckily, kubectl still worked; the cluster returned to normal after (repeatedly) deleting the failing pods.
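For reference, deleting a failing pod so that its daemon set or deployment recreates it looks like this (the pod name is a placeholder):
$ kubectl delete pod kube-svc-redirect-xxxxx -n kube-system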

@mrdfuse

mrdfuse commented Sep 11, 2018

We encountered this issue today as well, after adding more nodes to a pretty vanilla 1.10.6 cluster (no advanced networking).
@slack can you reopen the issue?

@ghost locked as resolved and limited conversation to collaborators on Aug 12, 2020