
AKS capacity issues in West US 2 #2

Closed

seanknox opened this issue Oct 25, 2017 · 50 comments

Comments

@seanknox
Contributor

seanknox commented Oct 25, 2017

Update Nov 6, 15:50 PST
Capacity in westus2 has been increased; if you continue having difficulties with existing clusters, please try deleting your cluster(s) and re-creating them.
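
If you want the CLI steps for the delete-and-recreate cycle, roughly (resource group and cluster names below are placeholders; adjust the agent count to your own setup):

$ az aks delete --resource-group <resource-group> --name <cluster-name>
$ az aks create --resource-group <resource-group> --name <cluster-name> --agent-count 3 --generate-ssh-keys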

Update Nov 5, 12:05PM PST

Users should be able to create new AKS clusters in westus2. Please report any issues on this thread, thanks!

Update Nov 3, 2017 21:01 PDT

While base compute/network capacity has been addressed, persistent HTTP errors from ARM in westus2 are preventing Azure Load Balancers created via Kubernetes from obtaining public IPs. We're working with the ARM team to resolve this.

Update Nov 3, 2017 17:10 PDT

We're still in the process of rolling out additional compute and networking capacity in West US 2. We recommend deleting existing clusters and monitoring this issue for updates on when to try again.

Update October 25, 2017 19:07 PDT

We received some good news from our capacity team and plan to both expand capacity in West US 2 and deploy AKS in additional US regions by the end of the week. Thanks for your patience with our literal growing pains!

October 25, 2017 11:00 am PDT

The AKS team is currently adding AKS capacity in West US 2 to keep up with demand. Until new capacity is in place, users of new AKS clusters won't be able to run kubectl logs, kubectl exec, or kubectl proxy.

$ kubectl logs kube-svc-redirect-hv3b0  -n kube-system
Error from server: Get https://aks-agentpool1-30179320-2:10250/containerLogs/kube-system/kube-svc-redirect-hv3b0/redirector: dial tcp 10.240.0.4:10250: getsockopt: connection refused
@bgeesaman

bgeesaman commented Oct 26, 2017

Should I kill my cluster exhibiting this issue and recreate? Or will the added capacity resolve things automatically when it comes?
Edit: ukwest works just fine for now. Nice!

@srakesh28

I am experiencing the same issue and am waiting for a fix.

@ghost

ghost commented Oct 31, 2017

I'm still having issues

➜  ~ kubectl logs -l k8s-app=kubernetes-dashboard --context=aks -n kube-system
Error from server: Get https://aks-agentpool1-28009576-1:10250/containerLogs/kube-system/kubernetes-dashboard-3427906134-v9fv7/main?tailLines=10: dial tcp 10.240.0.5:10250: getsockopt: connection refused

@seanknox any update from the capacity team?

@EamonKeane

Is there any update on this? If this is not resolved soon, we'll be forced to use GKE which I've tested and works smoothly.

@bramvdklinkenberg

Since yesterday I have had the issue on ukwest and westus2. If I deploy clusters (portal or az cli), the pods for tunnelfront, kube-svc-redirect and kubernetes-dashboard keep crashing.
Is this because of IP address capacity issues?
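
For anyone else hitting this, a generic way to see which system pods are crash looping (pod names will differ per cluster):

$ kubectl get pods -n kube-system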

@anoff

anoff commented Nov 3, 2017

An update would be appreciated :)

@blackbaud-brandonstirnaman

I was hoping to spend the weekend evaluating AKS vs our current ACS implementation... Guessing the goal of adding new capacity last week was missed, but is there an ETA for this fix?

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Hi all, thanks for your patience, just updated the status above.

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Since yesterday I have had the issue on ukwest and westus2. If I deploy clusters (portal or az cli), the pods for tunnelfront, kube-svc-redirect and kubernetes-dashboard keep crashing.
Is this because of IP address capacity issues?

It's a combination of various capacity issues:

  • IP address capacity
  • compute capacity (available VMs)
  • Low-level Azure networking limits involving load balancer frontend IPs and NSG rules

We've been working closely with Azure Networking and Capacity teams to address all of these issues.

@seanknox
Contributor Author

seanknox commented Nov 4, 2017

Should I kill my cluster exhibiting this issue and recreate? Or will the added capacity resolve things automatically when it comes?

@bgeesaman yes, we recommend deleting your cluster until all capacity and ARM issues are resolved; we're hopeful we'll see resolution soon.

@EIrwin

EIrwin commented Nov 4, 2017

Though I can see from the updated status as of Nov 3rd that there should be enough capacity, on two separate attempts today (Nov 4th) cluster creation resulted in no nodes being provisioned and pods stuck in Pending because no nodes were present.

@blackbaud-brandonstirnaman

We're still in the process of rolling out additional compute and networking capacity in West US 2. If your kube-system/hcp-customer-nginx-ingress-controller service doesn't have a public IP (kubectl -n kube-system get svc hcp-customer-nginx-ingress-controller), we recommend deleting the cluster and monitoring this issue for updates on when to try again.

Created a new cluster this morning in West US 2; I can get the dashboard up, view logs, etc. So it's working, but I do not have an ingress controller pod deployed in the cluster. Is it expected that one should be automatically deployed in a new 1.7 cluster? The mc_* resource group also doesn't have a load balancer.
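
Two quick checks I'd run, assuming the node resource group follows the usual MC_<resource-group>_<cluster-name>_<region> naming (the names below are placeholders):

$ kubectl get svc --all-namespaces
$ az network lb list --resource-group MC_<resource-group>_<cluster-name>_westus2 -o table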

@ekarlso

ekarlso commented Nov 5, 2017

Hi, I am getting similar issues in #24

@seanknox
Contributor Author

seanknox commented Nov 5, 2017

@blackbaud-brandonstirnaman I pasted the wrong info there, sorry. If you can view logs your cluster should be good to go.

@berndverst
Member

I can confirm that at this time cluster creation in WestUS2 works.

@bramvdklinkenberg

bramvdklinkenberg commented Nov 6, 2017

In ukwest I still have the same issue... The tunnelfront and kube-svc-redirect pods still crash after deployment of the cluster.

Also tried in westus2 and indeed that works.

@dendle

dendle commented Nov 6, 2017

ukwest has been failing since Thursday last week. I opened a support case and they cited this issue; however, this issue only appears to address westus2. Can someone check whether this is the case for ukwest, too?

az aks create --resource-group prelive-kubernetesv2 --name prelive-k8scluster --agent-count 3 --agent-vm-size Standard_DS5_v2 --generate-ssh-keys
Deployment failed. Correlation ID: 654fd43e-45ab-4328-8978-907c6aaf8b1d. Operation failed with status: 200. Details: Resource state Failed

@gabrtv

gabrtv commented Nov 6, 2017

Hey Matt,

We are still working on adding capacity to ukwest, while we also bring other AKS regions online.

Thanks for your patience while we sort this out. As you can guess, demand for the AKS preview caught us a little off guard. 😉

Gabe

@amazaheri

The issue is resolved for me; I just created a new cluster today. Be patient and in a minute you'll have it up and running. "I SHALL NOT DELETE THIS ONE ANYMORE" 👍

@morellonet

How can I delete a cluster? I had a number of failed deployments due to these capacity issues and apparently now have 5 clusters that are 'stuck' and that fill my quota. When I try to do a deployment now, I get this error:

{
  "code": "QuotaExceeded",
  "message": "Public preview limit of 5 for managed cluster(AKS) has been reached for subscription XXX in location westus2. Please try deleting one or more managed cluster resources in this location before trying to create a new cluster or try a different Azure location."
}

I've been playing around with the Azure CLI and UI but I don't see a way to list all the clusters in the sub, much less delete them. Note that I don't have any RGs in the sub, so I don't understand where these clusters are hiding.
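
For reference, a rough sketch of listing and removing them with the Azure CLI (values below are placeholders; az aks list should show the resource group each cluster landed in):

$ az aks list -o table
$ az aks delete --resource-group <resource-group> --name <cluster-name>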

@amanohar

amanohar commented Nov 9, 2017

@morellonet would it be possible to share your resource group name and resource name here (I can look up the sub id)? I will look into the issue. Also, I would recommend opening a separate issue for the delete failures.

@jrthib

jrthib commented Nov 9, 2017

I'm experiencing similar issues as noted in this thread. I also can't adjust the number of nodes using the scale command. It just hangs and eventually times out.

@amanohar

amanohar commented Nov 9, 2017

@jrthib are you seeing similar error as described in: #26 ?
Can you add your resource group and resource name to Issue #26 ?

@jrthib

jrthib commented Nov 9, 2017

@amanohar I'm receiving that one too. I'm having issues with delete, scale, and browse commands.

@amanohar

amanohar commented Nov 9, 2017

@jrthib:

  • The expected ETA for the scale issue fix to reach PROD in WestUS2 is Monday
  • For the browse command: can you describe the error in a new issue?
  • For delete: can you share the resource group and resource name so I can investigate? Please add it to az aks scale operation failing #26

This issue is specifically to track capacity.

@bramvdklinkenberg

bramvdklinkenberg commented Nov 9, 2017

Works for me again in westus2, still issues in ukwest though.
Do not delete your cluster if you have a working one... and also don't stop/start the agents :)

@amazaheri

amazaheri commented Nov 9, 2017

Browse is broken again; this was fine yesterday.

Unable to connect to the server: net/http: TLS handshake timeout

Also looks like the whole cluster is down now :(

@Guillaume-Mayer

Same here, az aks browse doesn't work anymore (westus2).

@relferreira

I'm having the same problem described by @amazaheri.

kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout
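
A couple of generic checks to tell whether the API server itself is unreachable (nothing here is specific to this cluster):

$ kubectl cluster-info
$ kubectl get nodes --v=6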

@artisticcheese

C'mon Microsoft. Why was the status page never updated to say there are issues with AKS in West US 2? It's been a happy green checkbox this whole time.


@jespernohr

My AKS environment in westus2 stopped working yesterday and I am unable to deploy in westus2.

I have successfully deployed in ukwest, but am unable to "az aks browse" - connection refused.

@seanmck
Collaborator

seanmck commented Nov 12, 2017

Unfortunately, we had an unrecoverable service failure in westus2, so we recommend deleting any clusters that you had deployed there. We have resolved the problem and are working on rolling out new capacity in westus2, along with other regions. Please monitor the announcements in this repo for an update on when/where you can try creating new clusters.

We sincerely appreciate your patience as we work through the issues with the preview.

@qmfrederik

@seanmck Any word on the status of UK West? I can create a new cluster but some of the pods are unstable and the cluster is inaccessible at times:

fcarlier@ubuntu:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                    READY     STATUS             RESTARTS   AGE
kube-system   heapster-553147743-1n0d7                2/2       Running            0          20h
kube-system   kube-dns-v20-1654923623-tnlvg           3/3       Running            0          20h
kube-system   kube-dns-v20-1654923623-wfl2m           3/3       Running            0          20h
kube-system   kube-proxy-8brrl                        1/1       Running            0          20h
kube-system   kube-svc-redirect-22b2h                 0/1       CrashLoopBackOff   248        20h
kube-system   kubernetes-dashboard-3427906134-gh6b2   0/1       CrashLoopBackOff   264        20h
kube-system   tunnelfront-nn13x                       0/1       CrashLoopBackOff   246        20h
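
When pods loop like this, a generic way to capture why they exit (pod names below are placeholders, and note that kubectl logs may itself fail on clusters hit by this issue):

$ kubectl describe pod <pod-name> -n kube-system
$ kubectl logs <pod-name> -n kube-system --previous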

@benc-uk

benc-uk commented Nov 12, 2017

I fully understand this is a preview service, but with uswest2 down plus the capacity issues, ukwest deploying unstable and unusable clusters, and all the CLI problems on top of that, this has been a really bad start for AKS 😞

@indrayam

I am a newbie to MS Azure Cloud. Heard a lot about their Managed K8S (AKS) offering on Twitter so thought I would try it out. Played with the Google Container Engine Quickstart and was up and running in minutes. Tried to work with this Quickstart:
https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough

I am getting this error:
az aks create --resource-group sezResourceGroup --name sez-azcloud-cluster --agent-count 1 --generate-ssh-keys
AAD role propagation done[############################################] 100.0000%Service principal clientID: not found in Active Directory tenant , Please see https://aka.ms/acs-sp-help for more details.

Am I missing a step here, or is this all related to the capacity issues in US West 2?
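
That particular error is usually about the auto-created service principal not yet being visible in the AAD tenant rather than capacity. One workaround is to create the service principal up front and pass it in explicitly; a rough sketch (the name is a placeholder, and the appId/password come from the first command's output):

$ az ad sp create-for-rbac --name <sp-name>
$ az aks create --resource-group sezResourceGroup --name sez-azcloud-cluster --agent-count 1 --generate-ssh-keys --service-principal <appId> --client-secret <password>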

@kamoljan

+1

@sauryadas
Contributor

We have opened up East US for AKS deployments. Please deploy in the East US region.

Thanks for your patience.

@arindam00

Is the capacity issue with AKS resolved for ukwest and westus2?

@benc-uk

benc-uk commented Nov 17, 2017

Not really, they are no longer accepting AKS workloads in those regions. Your choices now are East US, West Europe or Central US.

See the regions doc here
https://github.com/Azure/AKS/blob/master/preview_regions.md

@msdotnetclr

Looks like we are having the same issue in eastus. This is what I am getting now:

$ kubectl describe pod kube-svc-redirect-jrfjd -n kube-system
Name:           kube-svc-redirect-jrfjd
Namespace:      kube-system
Node:           aks-nodepool1-19361140-0/10.240.0.4
Start Time:     Mon, 11 Dec 2017 08:48:45 -0500
Labels:         component=kube-svc-redirect
                controller-revision-hash=3376999726
                pod-template-generation=1
                tier=node
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"DaemonSet","namespace":"kube-system","name":"kube-svc-redirect","uid":"d8e715e1-de79-11e7-9d8d-0a58ac1f102...
Status:         Running
IP:             10.240.0.4
Created By:     DaemonSet/kube-svc-redirect
Controlled By:  DaemonSet/kube-svc-redirect
Containers:
  redirector:
    Container ID:   docker://7815c8c9f92181645cb2659eef6793123c4bf54624563d429755064795060c35
    Image:          dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3
    Image ID:       docker-pullable://dockerio.azureedge.net/deis/kube-svc-redirect@sha256:ccc6b31039754db718dac8c5d723b9db6a4070a252deaf4ea2c14b018343627e
    Port:           <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Dec 2017 15:03:56 -0500
      Finished:     Mon, 11 Dec 2017 15:03:56 -0500
    Ready:          False
    Restart Count:  78
    Environment:
      APISERVER_FQDN:     t_presto-rgakspresto-1b9b4d-9a5bbdbb.hcp.eastus.azmk8s.io
      KUBERNETES_SVC_IP:  10.0.0.1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-3t4rg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-3t4rg
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node-role.kubernetes.io/master=true:NoSchedule
                 node.alpha.kubernetes.io/notReady:NoExecute
                 node.alpha.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason      Age                  From                               Message
  ----     ------      ----                 ----                               -------
  Normal   Pulling     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  pulling image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
  Normal   Pulled      1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Successfully pulled image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
  Normal   Created     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Created container
  Normal   Started     1m (x79 over 6h)     kubelet, aks-nodepool1-19361140-0  Started container
  Warning  BackOff     14s (x1686 over 6h)  kubelet, aks-nodepool1-19361140-0  Back-off restarting failed container
  Warning  FailedSync  14s (x1686 over 6h)  kubelet, aks-nodepool1-19361140-0  Error syncing pod

$ kubectl describe pod kubernetes-dashboard-1672970692-bfn8z -n kube-system
Name:           kubernetes-dashboard-1672970692-bfn8z
Namespace:      kube-system
Node:           aks-nodepool1-19361140-0/10.240.0.4
Start Time:     Mon, 11 Dec 2017 08:49:40 -0500
Labels:         k8s-app=kubernetes-dashboard
                kubernetes.io/cluster-service=true
                pod-template-hash=1672970692
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kubernetes-dashboard-1672970692","uid":"d8ed8d20-de79-11e7-9...
Status:         Running
IP:             10.244.0.2
Created By:     ReplicaSet/kubernetes-dashboard-1672970692
Controlled By:  ReplicaSet/kubernetes-dashboard-1672970692
Containers:
  main:
    Container ID:   docker://5c3600ddff4eee7ca8913577af09fa63c9a23176b064207d58ce2f6cca0fba59
    Image:          gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3
    Image ID:       docker-pullable://gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64@sha256:2c4421ed80358a0ee97b44357b6cd6dc09be6ccc27dfe9d50c9bfc39a760e5fe
    Port:           9090/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Dec 2017 15:03:41 -0500
      Finished:     Mon, 11 Dec 2017 15:04:12 -0500
    Ready:          False
    Restart Count:  76
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        100m
      memory:     50Mi
    Liveness:     http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  default-token-3t4rg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-3t4rg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason      Age                 From                               Message
  ----     ------      ----                ----                               -------
  Normal   Killing     20m (x7 over 6h)    kubelet, aks-nodepool1-19361140-0  Killing container with id docker://main:pod "kubernetes-dashboard-1672970692-bfn8z_kube-system(d8ef4236-de79-11e7-9d8d-0a58ac1f102b)" container "main" is unhealthy, it will be killed and re-created.
  Warning  Unhealthy   14m (x16 over 6h)   kubelet, aks-nodepool1-19361140-0  Liveness probe failed: Get http://10.244.0.2:9090/: dial tcp 10.244.0.2:9090: getsockopt: connection refused
  Normal   Pulled      4m (x76 over 6h)    kubelet, aks-nodepool1-19361140-0  Container image "gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3" already present on machine
  Normal   Created     4m (x77 over 6h)    kubelet, aks-nodepool1-19361140-0  Created container
  Normal   Started     4m (x77 over 6h)    kubelet, aks-nodepool1-19361140-0  Started container
  Warning  BackOff     6s (x1565 over 6h)  kubelet, aks-nodepool1-19361140-0  Back-off restarting failed container
  Warning  FailedSync  6s (x1565 over 6h)  kubelet, aks-nodepool1-19361140-0  Error syncing pod
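
If kubectl logs is reachable on this cluster, the previous run of the crashing container usually shows why it exits with code 1; a generic check, using the pod and container names from the first describe output above:

$ kubectl logs kube-svc-redirect-jrfjd -n kube-system -c redirector --previous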

@luayalem

luayalem commented Dec 12, 2017

East US has the same issue:

NAME                                    READY     STATUS             RESTARTS   AGE
heapster-2888171832-jm0g2               2/2       Running            0          2h
kube-dns-v20-1654923623-8jxp3           3/3       Running            0          2h
kube-dns-v20-1654923623-v2ddc           3/3       Running            0          2h
kube-proxy-46gz6                        1/1       Running            0          2h
kube-proxy-snwjw                        1/1       Running            0          2h
kube-svc-redirect-l8xjx                 0/1       CrashLoopBackOff   32         2h
kube-svc-redirect-q2nmp                 0/1       CrashLoopBackOff   32         2h
kubernetes-dashboard-1672970692-s71l4   0/1       CrashLoopBackOff   30         2h
tunnelfront-3490006108-g4d28            1/1       Running            0          2h

@slack
Contributor

slack commented Dec 20, 2017

Thanks for your patience during the preview.

We are up in East US, Central US and West Europe. Additional details here: #56 (comment)

I'm going to close out this umbrella ticket. Feel free to open new issues as you experience problems.

slack closed this as completed Dec 20, 2017
ghost locked as resolved and limited conversation to collaborators Aug 13, 2020