API calls fail with timeout for logs / port-forward and others #232
Comments
I am also facing this issue (#224). @slack @sauryadas, can you say when it will be fixed? The AKS cluster is in an unusable state right now. |
@ekarlso @mfaizanse So pulling logs does not work? Does kubectl get po work? |
@JunSun17 Yes, kubectl get pods works. I deployed the cluster in West Europe. |
correct @jungchenhung |
@ekarlso @mfaizanse Yes, this is a known issue and we are root causing it. Will update when we have more data on it. Thanks! |
@JunSun17 I am also facing this issue in a Central US cluster. Do you have an estimated timeline for when this issue will be resolved? |
Logs, exec, and attach all require the master <-> node tunnels to be established. Check that the tunnelfront pod in the kube-system namespace is up and running. If the pods are running, deleting them will force the tunnels to be re-established. |
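Not part of the original comment, but a minimal sketch of how one might check those tunnel pods, assuming the tunnelfront component that later comments in this thread mention:

```bash
# list tunnel-related system pods (the exact names vary by AKS version)
kubectl get pods -n kube-system -o wide | grep -i tunnel

# inspect events for a pod that is not Running (the pod name is a placeholder)
kubectl describe pod <tunnelfront-pod-name> -n kube-system
```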
Creating the cluster using Azure CLI resolved the issue. |
@mfaizanse Sorry for the late reply. I have asked another customer facing the same issue for logs so we can investigate further. If you still have the problematic cluster, can you provide the log so we can do further checking, and please post the last 100 lines from the log here. |
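The exact log-collection command was not captured in this thread. As an illustration only, two ways one might grab the last 100 lines of the tunnelfront log (pod and container names are placeholders; the Docker fallback assumes SSH access to the node, since kubectl logs itself may time out while the tunnel is broken):

```bash
# if kubectl logs still works, tail the tunnelfront pod directly
kubectl logs <tunnelfront-pod-name> -n kube-system --tail=100

# otherwise, SSH to the node and read the container log with Docker
docker ps | grep tunnelfront
docker logs --tail 100 <tunnelfront-container-id>
```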
I am having the same problem - trying to use helm, I receive the following error (Kubernetes cluster on Azure). |
@pkelleratwork do you experience the log issue too? If so, can you kindly provide the logs as described in my comment above? |
I'm experiencing this issue as well. |
Running into the same issue trying to run helm ls against a 1.9.2 AKS cluster in Azure. |
I am having this same issue using helm as well as kubectl logs. |
I've upgraded both Kubernetes and Helm and things are working again. It would be nice to know what the cause of this was, though. |
We are also facing this issue. The cluster was created in West Europe. How can we fix this? |
We are also seeing this. Please fix. |
Is there any update on this issue? |
I abandoned AKS because of this issue. |
How did you guys create those AKS clusters initially? It may just be a lucky coincidence, but I haven't seen this issue on clusters I created via CLI yet, whereas we already had it two times on clusters created using ARM templates. Same region, same version. |
Let me briefly describe what we found out on the issue:
We could not find the root cause. Scaling down to 1 node worked for now. |
@jluenne I've created clusters with both the web UI and the CLI and had it on both. |
We had the issue described by Adam here on a single node cluster.
TL;DR - Restart the VM in the Azure portal if you can |
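For anyone who prefers the CLI over the portal, a sketch of the same restart, assuming availability-set node VMs living in the auto-created MC_* resource group (all names are placeholders; VMSS-based node pools would need az vmss commands instead):

```bash
# find the node VMs in the cluster's infrastructure resource group
az vm list -g MC_<resourceGroup>_<clusterName>_<region> -o table

# restart the affected node VM
az vm restart -g MC_<resourceGroup>_<clusterName>_<region> -n <node-vm-name>
```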
We had this too on a 3-node cluster on a custom subnet. Scaling down to 1 node helped for now, but 1 node is not enough in the long run. |
@ResDiaryLewis I'll try to combine our findings:
So what kind of state is the node in that a restart fixes? 🤔 |
Just to twist things even more, I'd like to add that this problem started happening for me after I installed the OMS integration following the docs: https://docs.microsoft.com/en-us/azure/monitoring/monitoring-container-health |
I solved the issue with tunnelfront. I deleted the following pods from the kube-system namespace:
When I got new pods, everything worked. |
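The pod list itself was not captured above. As an illustration (the component names are assumptions, based on tunnelfront and kube-svc-redirect being mentioned elsewhere in this thread), deleting them so their controllers recreate them could look like:

```bash
# find the suspect system pods and delete them; their deployments/daemonsets recreate them
kubectl get pods -n kube-system | grep -E 'tunnelfront|kube-svc-redirect' \
  | awk '{print $1}' \
  | xargs kubectl delete pod -n kube-system
```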
^ all died today, service forwarding stopped working. |
Also experiencing the issue in West Europe. What doesn't work (for me, kube 1.9.6):
What helped:
Edit: this only helps temporarily. This is really a nightmare; two days of constant AKS issues and our time is running out fast. |
^ I can confirm that scaling down the AKS cluster to 1 node and scaling back to the original size of the cluster helps. This bug is a nightmare. |
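A sketch of that workaround with the Azure CLI (resource group, cluster name, and the original node count are placeholders):

```bash
# scale down to a single node so the broken system pods are rescheduled onto node 0 ...
az aks scale --resource-group <resourceGroup> --name <clusterName> --node-count 1

# ... then scale back up to the original size
az aks scale --resource-group <resourceGroup> --name <clusterName> --node-count 3
```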
Not being able to do kubectl logs/attach is probably just an effect of DNS not working on any cluster node but node 0 (the node where the DNS service endpoint is, to be more precise). This is also why downscaling fixes the issue, because all broken pods (tunnelfront for this specific issue) are rescheduled to node 0. When I scale up again, logs/attach is fixed, but DNS is not for nodes other than node 0. On a broken node, I can resolve names through an external DNS server but not through the cluster DNS (10.0.0.10) from within pods.
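A quick way to reproduce that observation from inside a pod (10.0.0.10 is the cluster DNS address mentioned above; 8.8.8.8 stands in for "an external DNS server"; pinning the pod to a specific suspect node would additionally need a nodeName override):

```bash
# run a throwaway busybox pod and compare cluster DNS against an external resolver
kubectl run dns-test --rm -it --restart=Never --image=busybox -- sh -c \
  'nslookup kubernetes.default.svc.cluster.local 10.0.0.10; nslookup kubernetes.io 8.8.8.8'
```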
The reason for this can only be node-to-node communication, because the pods can't reach the DNS service. I investigated the route table created in the AKS resource group, which looks fine, but then I noticed it is not associated with the subnet, which seems to be an already reported issue in #400. As in #400, I also use Terraform for provisioning, as probably a lot of people here do. I cannot rule out that there are other issues causing similar effects, but associating the route table with the AKS subnet definitely makes a difference for me. |
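For reference, a sketch of that association with the Azure CLI (all names are placeholders; the route table AKS creates sits in the MC_* infrastructure resource group):

```bash
# look up the route table AKS created in its infrastructure resource group
az network route-table list -g MC_<resourceGroup>_<clusterName>_<region> -o table

# associate it with the subnet the nodes use
az network vnet subnet update \
  --resource-group <vnetResourceGroup> \
  --vnet-name <vnetName> \
  --name <subnetName> \
  --route-table <routeTableIdOrName>
```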
Any updates on this? I am seeing the issue most recently after deploying a new service via Azure Portal and then scaling down from 3 to 1 nodes. I can't deploy via Helm anymore with the error originally described. |
Hi, I have the same issue (kube 1.10.3):
I have created a support ticket, but all I have got so far is that the pod I'm running should be on a bigger VM; even when I cap my resources or delete the pod from the cluster, I have the same problem. |
I saw the same issue today after scaling from 1 to 3 nodes. |
We had the same issue yesterday. All pods that died wouldn't come back up again (CrashLoopBackOff). az aks browse and kubectl proxy didn't work; the kubectl get commands did work fine. Several services/APIs (pods) still worked, others did not (not reachable, and they didn't come back up after a restart). After following the steps from @wojciech12 and restarting the four system pods, it seems it's running again this morning. I don't know for how long. I'm really worried about the stability of AKS, to be honest. When it was in preview we had to recreate our cluster 5 times in a few months. Since it went GA (June) we were hopeful this wouldn't happen anymore. However, this resulted in a big downtime over the weekend for us, which isn't good. Seeing that I'm not the only one having these symptoms, hopefully the AKS team can fix this issue quickly, because for me it still feels like it's in preview. |
The fact that a single unstable worker node freezes the entire cluster, and that all tooling and deployment scripts fail, is a big blocker. It defeats the premise of Kubernetes itself that your applications stay highly available even when a worker node fails. Putting the tunnelfront and svc-redirect pods inside the cluster is not designed with failure in mind. Are there any plans to make this design actually handle failure smoothly? |
Same issue here in North Europe:
Things were working properly at first, we:
Then errors showed up the day after when doing:
Tried:
N.B.: we have another AKS cluster in North Europe with 3 nodes under a custom VNet, RBAC disabled, default max pods (30), Prometheus, ELK, lots of pods deployed, and no problems. |
Please see #676 |
In our setup we had a similar issue. We had created a custom route table for the AKS subnet before the actual AKS instance. It turned out that AKS created its own route table but could not attach it to the subnet because one was already attached. Maybe not the actual cause here, but it might be helpful. |
Facing the same issue with a brand new AKS cluster in Europe. The Docker logs from the tunnelfront pod are below. To be honest, the solution does not look fully thought through. The error is the following:
Please take a look: |
Just had this issue with 1.12.4; is there a known workaround / fix? |
We are experiencing this as well. The Kubernetes version is 1.12.6. We tried scaling to 1 node and removing the pods mentioned above, but nothing helps. The service is something we would really like to use, but this is a fundamental part of managed Kubernetes that needs to be working. |
We ended up creating new clusters. In one case we needed a different region, in another case we needed different VM sizes. |
I am locking this thread. I am sorry that this issue is not being resolved at the speed you'd like (or that the AKS team would like). To be clear, SLB support is entering final testing to go into preview. The current timeout issues, as communicated, are directly tied to the Basic LB timeouts and behaviors. Issues on this repo are not places to vent at the AKS team, state that we're not taking the product seriously, or claim that we are not doing our jobs. I understand bugs impact all of us, and low-level behavioral issues like this are very painful. This repository is meant for feature requests and bug reports, and the repository is governed by the code of conduct linked in the readme. This issue is still open, but this comment thread is locked. The issue will be updated when we roll out the required changes. |
See kubectl debug output here:
Also, in Azure Cloud Shell it times out, so it seems to be an internal error in AKS?
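For anyone collecting debug output for a report like this, kubectl's verbosity flag prints the underlying API requests and shows which call is hanging (the pod name is a placeholder):

```bash
# -v=8 logs the HTTP requests/responses, making the timing-out call visible
kubectl logs <pod-name> -v=8
kubectl port-forward <pod-name> 8080:80 -v=8
```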