
AKS capacity issues in West US 2 #2

Closed

seanknox opened this issue Oct 25, 2017 · 50 comments

Comments

@seanknox
Contributor

seanknox commented Oct 25, 2017

Update Nov 6, 15:50 PST
Capacity in westus2 has been increased; if you continue having difficulties with existing clusters, please try deleting your cluster(s) and re-creating them.
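
If you want the CLI steps for the delete-and-recreate cycle, roughly (resource group and cluster names below are placeholders; adjust the agent count to your own setup):

$ az aks delete --resource-group <resource-group> --name <cluster-name>
$ az aks create --resource-group <resource-group> --name <cluster-name> --agent-count 3 --generate-ssh-keys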

Update Nov 5, 12:05PM PST

Users should be able to create new AKS clusters in westus2. Please report any issues on this thread, thanks!

Update Nov 3, 2017 21:01 PDT

While base compute/network capacity has been addressed, persistent HTTP errors from ARM in westus2 are preventing Azure Load Balancers created via Kubernetes from obtaining public IPs. We're working with the ARM team to resolve this.

Update Nov 3, 2017 17:10 PDT

We're still in the process of rolling out additional compute and networking capacity in West US 2. We recommend deleting existing clusters and monitoring this issue for updates on when to try again.

Update October 25, 2017 19:07 PDT

We received some good news from our capacity team and plan to both expand capacity in West US 2 and deploy AKS in additional US regions by the end of the week. Thanks for your patience with our literal growing pains!

October 25, 2017 11:00 am PDT

The AKS team is currently adding AKS capacity in West US 2 to keep up with demand. Until new capacity is in place, users of new AKS clusters won't be able to run kubectl logs, kubectl exec, or kubectl proxy.

$ kubectl logs kube-svc-redirect-hv3b0  -n kube-system
Error from server: Get https://aks-agentpool1-30179320-2:10250/containerLogs/kube-system/kube-svc-redirect-hv3b0/redirector: dial tcp 10.240.0.4:10250: getsockopt: connection refused
@bgeesaman

bgeesaman commented Oct 26, 2017

Should I kill my cluster exhibiting this issue and recreate? Or will the added capacity resolve things automatically when it comes?
Edit: ukwest works just fine for now. Nice!

@srakesh28

I am experiencing the same issue and am waiting for a fix.

@ghost

ghost commented Oct 31, 2017

I'm still having issues

➜  ~ kubectl logs -l k8s-app=kubernetes-dashboard --context=aks -n kube-system
Error from server: Get https://aks-agentpool1-28009576-1:10250/containerLogs/kube-system/kubernetes-dashboard-3427906134-v9fv7/main?tailLines=10: dial tcp 10.240.0.5:10250: getsockopt: connection refused

@seanknox any update from the capacity team?

@EamonKeane

Is there any update on this? If this is not resolved soon, we'll be forced to use GKE which I've tested and works smoothly.

@bramvdklinkenberg

Since yesterday I have had the issue on ukwest and westus2. If I deploy clusters (portal or az cli), the pods for tunnelfront, kube-svc-redirect and kubernetes-dashboard keep crashing.
Is this because of IP address capacity issues?
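
For anyone else hitting this, a generic way to see which system pods are crash looping (pod names will differ per cluster):

$ kubectl get pods -n kube-system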

@anoff

anoff commented Nov 3, 2017

An update would be appreciated :)

@blackbaud-brandonstirnaman

I was hoping to spend the weekend evaluating AKS vs our current ACS implementation... Guessing the goal of adding new capacity last week was missed, but is there an ETA for this fix?

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Hi all, thanks for your patience, just updated the status above.

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Since yesterday I have had the issue on ukwest and westus2. If I deploy clusters (portal or az cli), the pods for tunnelfront, kube-svc-redirect and kubernetes-dashboard keep crashing.
Is this because of IP address capacity issues?

It's a combination of various capacity issues:

  • IP address capacity
  • compute capacity (available VMs)
  • Low-level Azure networking limits involving load balancer frontend IPs and NSG rules

We've been working closely with Azure Networking and Capacity teams to address all of these issues.

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Should I kill my cluster exhibiting this issue and recreate? Or will the added capacity resolve things automatically when it comes?

@bgeesaman yes, we recommend deleting your cluster until all capacity and ARM issues are resolved; we're hopeful we'll see resolution soon.

@EIrwin

EIrwin commented Nov 4, 2017

Though I can see from the updated status as of Nov 3rd that there should be enough capacity, on two separate attempts today (Nov 4th) cluster creation resulted in no nodes being provisioned and pods stuck in Pending because no nodes were present.

@blackbaud-brandonstirnaman

We're still in the process of rolling out additional compute and networking capacity in West US 2. If your kube-system/hcp-customer-nginx-ingress-controller service doesn't have a public IP (kubectl -n kube-system get svc hcp-customer-nginx-ingress-controller), we recommend deleting the cluster and monitoring this issue for updates on when to try again.

Created a new cluster this morning in West US 2; I can get the dashboard up, view logs, etc. So it's working, but I do not have an ingress controller pod deployed in the cluster. Is it expected that one should be automatically deployed in a new 1.7 cluster? The mc_* resource group also doesn't have a load balancer.
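
Two quick checks I'd run, assuming the node resource group follows the usual MC_<resource-group>_<cluster-name>_<region> naming (the names below are placeholders):

$ kubectl get svc --all-namespaces
$ az network lb list --resource-group MC_<resource-group>_<cluster-name>_westus2 -o table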

@ekarlso

ekarlso commented Nov 5, 2017

Hi, I am getting similar issues in #24

@seanknox
Contributor Author

seanknox commented Nov 5, 2017

@blackbaud-brandonstirnaman I pasted the wrong info there, sorry. If you can view logs your cluster should be good to go.

@berndverst
Member

I can confirm that at this time cluster creation in WestUS2 works.

@bramvdklinkenberg

bramvdklinkenberg commented Nov 6, 2017

In ukwest I still have the same issue... The tunnelfront and kube-svc-redirect pods still crash after deployment of the cluster.

Also tried in westus2 and indeed that works.

@dendle

dendle commented Nov 6, 2017

ukwest has been failing since Thursday last week. I opened a support case and they cited this issue; however, this issue only appears to address westus2. Can someone check whether this is the case for ukwest, too?

az aks create --resource-group prelive-kubernetesv2 --name prelive-k8scluster --agent-count 3 --agent-vm-size Standard_DS5_v2 --generate-ssh-keys
Deployment failed. Correlation ID: 654fd43e-45ab-4328-8978-907c6aaf8b1d. Operation failed with status: 200. Details: Resource state Failed

@gabrtv

gabrtv commented Nov 6, 2017

Hey Matt,

We are still working on adding capacity to ukwest, while we also bring other AKS regions online.

Thanks for your patience while we sort this out. As you can guess, demand for the AKS preview caught us a little off guard. 😉

Gabe

@amazaheri

The issue is resolved for me; I just created a new cluster today. Be patient and in a minute you'll have it up and running. "I SHALL NOT DELETE THIS ONE ANYMORE" 👍

@morellonet

How can I delete a cluster? I had a number of failed deployments due to these capacity issues and apparently now have 5 clusters that are 'stuck' and that fill my quota. When I try to do a deployment now, I get this error:

{
  "code": "QuotaExceeded",
  "message": "Public preview limit of 5 for managed cluster(AKS) has been reached for subscription XXX in location westus2. Please try deleting one or more managed cluster resources in this location before trying to create a new cluster or try a different Azure location."
}

I've been playing around with the Azure CLI and UI but I don't see a way to list all the clusters in the sub, much less delete them. Note that I don't have any RGs in the sub, so I don't understand where these clusters are hiding.
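
For reference, a rough sketch of listing and removing them with the Azure CLI (values below are placeholders; az aks list should show the resource group each cluster landed in):

$ az aks list -o table
$ az aks delete --resource-group <resource-group> --name <cluster-name>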

@amanohar

amanohar commented Nov 9, 2017

@morellonet would it be possible to share your resource group name and resource name here (I can look up the sub id)? I will look into the issue. Also, I would recommend opening a separate issue for the delete failures.

@jrthib

jrthib commented Nov 9, 2017

I'm experiencing similar issues as noted in this thread. I also can't adjust the number of nodes using the scale command. It just hangs and eventually times out.

@amanohar

amanohar commented Nov 9, 2017

@jrthib are you seeing similar error as described in: #26 ?
Can you add your resource group and resource name to Issue #26 ?

@jrthib

jrthib commented Nov 9, 2017

@amanohar I'm receiving that one too. I'm having issues with delete, scale, and browse commands.

@amanohar

amanohar commented Nov 9, 2017

@jrthib:

  • The expected ETA for the scale issue fix to reach PROD in WestUS2 is Monday
  • For the browse command: can you describe the error in a new issue?
  • For delete: can you share the resource group and resource name so I can investigate? Please add it to az aks scale operation failing #26

This issue is specifically to track capacity.

@bramvdklinkenberg

bramvdklinkenberg commented Nov 9, 2017

Works for me again in westus2, still issues in ukwest though.
Do not delete your cluster if you have a working one... and also don't stop/start the agents :)

@amazaheri

amazaheri commented Nov 9, 2017

Browse is broken again; this was fine yesterday.

Unable to connect to the server: net/http: TLS handshake timeout

Also looks like the whole cluster is down now :(

@Guillaume-Mayer

Same here, az aks browse doesn't work anymore (westus2).

@relferreira

I'm having the same problem described by @amazaheri.

kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout
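
A couple of generic checks to tell whether the API server itself is unreachable (nothing here is specific to this cluster):

$ kubectl cluster-info
$ kubectl get nodes --v=6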

@artisticcheese

C'mon Microsoft. Why was the status page never updated to say there are issues with AKS in West US 2? It's been a happy green checkbox this whole time.


@jespernohr

My AKS environment in westus2 stopped working yesterday and I am unable to deploy in westus2.

I have successfully deployed in ukwest, but am unable to "az aks browse" - connection refused.

@seanmck
Collaborator

seanmck commented Nov 12, 2017

Unfortunately, we had an unrecoverable service failure in westus2, so we recommend deleting any clusters that you had deployed there. We have resolved the problem and are working on rolling out new capacity in westus2, along with other regions. Please monitor the announcements in this repo for an update on when/where you can try creating new clusters.

We sincerely appreciate your patience as we work through the issues with the preview.

@qmfrederik

@seanmck Any word on the status of UK West? I can create a new cluster but some of the pods are unstable and the cluster is inaccessible at times:

fcarlier@ubuntu:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   heapster-553147743-1n0d7                2/2       Running            0          20h
kube-system   kube-dns-v20-1654923623-tnlvg           3/3       Running            0          20h
kube-system   kube-dns-v20-1654923623-wfl2m           3/3       Running            0          20h
kube-system   kube-proxy-8brrl                        1/1       Running            0          20h
kube-system   kube-svc-redirect-22b2h                 0/1       CrashLoopBackOff   248        20h
kube-system   kubernetes-dashboard-3427906134-gh6b2   0/1       CrashLoopBackOff   264        20h
kube-system   tunnelfront-nn13x                       0/1       CrashLoopBackOff   246        20h
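
When pods loop like this, a generic way to capture why they exit (pod names below are placeholders, and note that kubectl logs may itself fail on clusters hit by this issue):

$ kubectl describe pod <pod-name> -n kube-system
$ kubectl logs <pod-name> -n kube-system --previous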

@benc-uk

benc-uk commented Nov 12, 2017

I fully understand this is a preview service, but with uswest2 down plus the capacity issues, ukwest deploying unstable and unusable clusters, and all the CLI problems on top of that, this has been a really bad start for AKS 😞

@indrayam

I am a newbie to MS Azure Cloud. Heard a lot about their Managed K8S (AKS) offering on Twitter so thought I would try it out. Played with the Google Container Engine Quickstart and was up and running in minutes. Tried to work with this Quickstart:
https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough

I am getting this error:
az aks create --resource-group sezResourceGroup --name sez-azcloud-cluster --agent-count 1 --generate-ssh-keys
AAD role propagation done[############################################] 100.0000%Service principal clientID: not found in Active Directory tenant , Please see https://aka.ms/acs-sp-help for more details.

Am I missing a step here, or is this all related to the capacity issues in US West 2?
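
That particular error is usually about the auto-created service principal not yet being visible in the AAD tenant rather than capacity. One workaround is to create the service principal up front and pass it in explicitly; a rough sketch (the name is a placeholder, and the appId/password come from the first command's output):

$ az ad sp create-for-rbac --name <sp-name>
$ az aks create --resource-group sezResourceGroup --name sez-azcloud-cluster --agent-count 1 --generate-ssh-keys --service-principal <appId> --client-secret <password>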

@kamoljan

+1

@sauryadas
Contributor

We have opened up East US for AKS deployments. Please deploy in the East US region.

Thanks for your patience.

@arindam00

Is the capacity issue with AKS resolved for ukwest and westus2?

@benc-uk

benc-uk commented Nov 17, 2017

Not really, they are no longer accepting AKS workloads in those regions. Your choices now are East US, West Europe or Central US.

See the regions doc here
https://github.com/Azure/AKS/blob/master/preview_regions.md

@msdotnetclr

Looks like we are having the same issue in eastus. This is what I am getting now:

$ kubectl describe pod kube-svc-redirect-jrfjd -n kube-system
Name:           kube-svc-redirect-jrfjd
Namespace:      kube-system
Node:           aks-nodepool1-19361140-0/10.240.0.4
Start Time:     Mon, 11 Dec 2017 08:48:45 -0500
Labels:         component=kube-svc-redirect
                controller-revision-hash=3376999726
                pod-template-generation=1
                tier=node
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"DaemonSet","namespace":"kube-system","name":"kube-svc-redirect","uid":"d8e715e1-de79-11e7-9d8d-0a58ac1f102...
Status:         Running
IP:             10.240.0.4
Created By:     DaemonSet/kube-svc-redirect
Controlled By:  DaemonSet/kube-svc-redirect
Containers:
  redirector:
    Container ID:   docker://7815c8c9f92181645cb2659eef6793123c4bf54624563d429755064795060c35
    Image:          dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3
    Image ID:       docker-pullable://dockerio.azureedge.net/deis/kube-svc-redirect@sha256:ccc6b31039754db718dac8c5d723b9db6a4070a252deaf4ea2c14b018343627e
    Port:           <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Dec 2017 15:03:56 -0500
      Finished:     Mon, 11 Dec 2017 15:03:56 -0500
    Ready:          False
    Restart Count:  78
    Environment:
      APISERVER_FQDN:     t_presto-rgakspresto-1b9b4d-9a5bbdbb.hcp.eastus.azmk8s.io
      KUBERNETES_SVC_IP:  10.0.0.1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-3t4rg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-3t4rg
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node-role.kubernetes.io/master=true:NoSchedule
                 node.alpha.kubernetes.io/notReady:NoExecute
                 node.alpha.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason      Age                  From                               Message
  ----     ------      ----                 ----                               -------
  Normal   Pulling     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  pulling image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
  Normal   Pulled      1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Successfully pulled image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
  Normal   Created     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Created container
  Normal   Started     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Started container
  Warning  BackOff     14s (x1686 over 6h)  kubelet, aks-nodepool1-19361140-0  Back-off restarting failed container
  Warning  FailedSync  14s (x1686 over 6h)  kubelet, aks-nodepool1-19361140-0  Error syncing pod

$ kubectl describe pod kubernetes-dashboard-1672970692-bfn8z -n kube-system
Name:           kubernetes-dashboard-1672970692-bfn8z
Namespace:      kube-system
Node:           aks-nodepool1-19361140-0/10.240.0.4
Start Time:     Mon, 11 Dec 2017 08:49:40 -0500
Labels:         k8s-app=kubernetes-dashboard
                kubernetes.io/cluster-service=true
                pod-template-hash=1672970692
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kubernetes-dashboard-1672970692","uid":"d8ed8d20-de79-11e7-9...
Status:         Running
IP:             10.244.0.2
Created By:     ReplicaSet/kubernetes-dashboard-1672970692
Controlled By:  ReplicaSet/kubernetes-dashboard-1672970692
Containers:
  main:
    Container ID:   docker://5c3600ddff4eee7ca8913577af09fa63c9a23176b064207d58ce2f6cca0fba59
    Image:          gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3
    Image ID:       docker-pullable://gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64@sha256:2c4421ed80358a0ee97b44357b6cd6dc09be6ccc27dfe9d50c9bfc39a760e5fe
    Port:           9090/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Dec 2017 15:03:41 -0500
      Finished:     Mon, 11 Dec 2017 15:04:12 -0500
    Ready:          False
    Restart Count:  76
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        100m
      memory:     50Mi
    Liveness:     http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-3t4rg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-3t4rg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason      Age                 From                               Message
  ----     ------      ----                ----                               -------
  Normal   Killing     20m (x7 over 6h)    kubelet, aks-nodepool1-19361140-0  Killing container with id docker://main:pod "kubernetes-dashboard-1672970692-bfn8z_kube-system(d8ef4236-de79-11e7-9d8d-0a58ac1f102b)" container "main" is unhealthy, it will be killed and re-created.
  Warning  Unhealthy   14m (x16 over 6h)   kubelet, aks-nodepool1-19361140-0  Liveness probe failed: Get http://10.244.0.2:9090/: dial tcp 10.244.0.2:9090: getsockopt: connection refused
  Normal   Pulled      4m (x76 over 6h)    kubelet, aks-nodepool1-19361140-0  Container image "gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3" already present on machine
  Normal   Created     4m (x77 over 6h)    kubelet, aks-nodepool1-19361140-0  Created container
  Normal   Started     4m (x77 over 6h)    kubelet, aks-nodepool1-19361140-0  Started container
  Warning  BackOff     6s (x1565 over 6h)  kubelet, aks-nodepool1-19361140-0  Back-off restarting failed container
  Warning  FailedSync  6s (x1565 over 6h)  kubelet, aks-nodepool1-19361140-0  Error syncing pod
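
If kubectl logs is reachable on this cluster, the previous run of the crashing container usually shows why it exits with code 1; a generic check, using the pod and container names from the first describe output above:

$ kubectl logs kube-svc-redirect-jrfjd -n kube-system -c redirector --previous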

@luayalem

luayalem commented Dec 12, 2017

East US has the same issue:

NAME                                    READY     STATUS             RESTARTS   AGE
heapster-2888171832-jm0g2               2/2       Running            0          2h
kube-dns-v20-1654923623-8jxp3           3/3       Running            0          2h
kube-dns-v20-1654923623-v2ddc           3/3       Running            0          2h
kube-proxy-46gz6                        1/1       Running            0          2h
kube-proxy-snwjw                        1/1       Running            0          2h
kube-svc-redirect-l8xjx                 0/1       CrashLoopBackOff   32         2h
kube-svc-redirect-q2nmp                 0/1       CrashLoopBackOff   32         2h
kubernetes-dashboard-1672970692-s71l4   0/1       CrashLoopBackOff   30         2h
tunnelfront-3490006108-g4d28            1/1       Running            0          2h

@slack
Contributor

slack commented Dec 20, 2017

Thanks for your patience during the preview.

We are up in East US, Central US and West Europe. Additional details here: #56 (comment)

I'm going to close out this umbrella ticket. Feel free to open new issues as you experience problems.

slack closed this as completed Dec 20, 2017
ghost locked as resolved and limited conversation to collaborators Aug 13, 2020