AKS capacity issues in West US 2 #2
Should I kill my cluster exhibiting this issue and recreate? Or will the added capacity resolve things automatically when it comes?
I am experiencing the same issue and waiting for a fix.
I'm still having issues
@seanknox any update from the capacity team?
Is there any update on this? If this is not resolved soon, we'll be forced to use GKE, which I've tested and which works smoothly.
Since yesterday I have had the issue on ukwest and westus2. If I deploy clusters (portal or az cli), the pods for tunnelfront, kube-svc-redirect, and the kubernetes dashboard keep crashing.
An update would be appreciated :)
I was hoping to spend the weekend evaluating AKS vs our current ACS implementation... I'm guessing we missed the goal of new capacity being added in the last week, but is there an ETA for this fix?
Hi all, thanks for your patience; I just updated the status above.
It's a combination of various capacity issues. We've been working closely with the Azure Networking and Capacity teams to address all of these issues.
@bgeesaman yes, we recommend deleting your cluster until all capacity and ARM issues are resolved; we're hopeful we'll see resolution soon.
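For anyone in the same spot, the delete-and-recreate cycle looks roughly like this (a sketch only; myResourceGroup and myAKSCluster are placeholder names, and the flags mirror the az aks create invocation shared later in this thread):
$ # delete the stuck cluster, then recreate it once capacity lands
$ az aks delete --resource-group myResourceGroup --name myAKSCluster
$ az aks create --resource-group myResourceGroup --name myAKSCluster --agent-count 3 --generate-ssh-keys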
Though I can see the updated status as of Nov 3rd as having enough capacity, on two separate attempts today (Nov 4th), cluster creation resulted in no nodes being provisioned, and pods stuck in
Created a new cluster this morning in West US 2; I can get the dashboard up, view logs, etc. So it's working, but I do not have an Ingress controller pod deployed in the cluster. Is it expected that one should be automatically deployed in a new 1.7 cluster? The mc_* resource group also doesn't have a load balancer.
Hi, I am getting similar issues in #24
@blackbaud-brandonstirnaman I pasted the wrong info there, sorry. If you can view logs, your cluster should be good to go.
I can confirm that at this time cluster creation in WestUS2 works. |
In ukwest I still have the same issue... The tunnelfront and kube-svc-redirect pods still crash after deployment of the cluster. I also tried in westus2, and indeed that works.
ukwest has been failing since Thursday last week. I opened a support case, and they cited this issue; however, this issue only appears to address westus. Can someone check whether this is the case for ukwest, too?
$ az aks create --resource-group prelive-kubernetesv2 --name prelive-k8scluster --agent-count 3 --agent-vm-size Standard_DS5_v2 --generate-ssh-keys
Hey Matt, we are still working on adding capacity to ukwest while we also bring other AKS regions online. Thanks for your patience while we sort this out. As you can guess, demand for the AKS preview caught us a little off guard. 😉 Gabe
The issue is resolved for me; I just created a new cluster today. Be patient, and in a minute you'll have it up and running. "I SHALL NOT DELETE THIS ONE ANYMORE" 👍
How can I delete a cluster? I had a number of failed deployments due to these capacity issues and apparently now have 5 clusters that are 'stuck' and that fill my quota. When I try to do a deployment now, I get this error:
I've been playing around with the Azure CLI and UI but I don't see a way to list all the clusters in the sub, much less delete them. Note that I don't have any RGs in the sub, so I don't understand where these clusters are hiding.
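For anyone else hunting for "hidden" clusters, a CLI sketch (assuming the preview az aks commands behave like their documented counterparts; <group> and <cluster> are placeholders):
$ az aks list --output table                                # lists every AKS cluster in the current subscription
$ az aks delete --resource-group <group> --name <cluster>   # delete a stuck cluster by name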
@morellonet would it be possible to share your resource group name and resource name here (I can look up the sub id)? I will look into the issue. Also, I would recommend opening a separate issue for the delete failures.
I'm experiencing similar issues as noted in this thread. I also can't adjust the number of nodes using the scale command; it just hangs and eventually times out.
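For reference, the scale invocation that hangs is something like this (a sketch using the preview-era flag names, which may have since changed; placeholders in angle brackets):
$ az aks scale --resource-group <group> --name <cluster> --agent-count 5   # resize the node pool to 5 agents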
@amanohar I'm receiving that one too. I'm having issues with the delete, scale, and browse commands.
This issue is specifically to track capacity. |
Works for me again in westus2; still issues in ukwest though.
Browse is broken again; this was fine yesterday. Unable to connect to the server: net/http: TLS handshake timeout. Also looks like the whole cluster is down now :(
Same here, az aks browse doesn't work anymore (westus2).
I'm having the same problem described by @amazaheri.
My AKS environment in westus2 stopped working yesterday and I am unable to deploy in westus2. I have successfully deployed in ukwest, but am unable to "az aks browse": connection refused.
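When az aks browse fails, a rough manual equivalent is to pull credentials and proxy to the dashboard yourself; it will still fail if the tunnel or API server is unreachable (a sketch, placeholders in angle brackets):
$ az aks get-credentials --resource-group <group> --name <cluster>   # merge kubeconfig for the cluster
$ kubectl proxy                                                      # then open http://127.0.0.1:8001/ui on this 1.7-era cluster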
Unfortunately, we had an unrecoverable service failure in westus2, so we recommend deleting any clusters that you had deployed there. We have resolved the problem and are working on rolling out new capacity in westus2, along with other regions. Please monitor the announcements in this repo for an update on when/where you can try creating new clusters. We sincerely appreciate your patience as we work through the issues with the preview.
@seanmck Any word on the status of UK West? I can create a new cluster, but some of the pods are unstable and the cluster is inaccessible at times:
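A quick way to watch the flapping system pods mentioned in this thread (plain kubectl, nothing AKS-specific; the pod name is a placeholder):
$ kubectl get pods -n kube-system                  # look for CrashLoopBackOff on tunnelfront / kube-svc-redirect
$ kubectl describe pod <pod-name> -n kube-system   # the Events section at the bottom shows why it keeps restarting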
I fully understand this is a preview service, but with westus2 down plus the capacity issues, ukwest deploying unstable and unusable clusters, and on top of this all the CLI problems, this has been a really bad start for AKS 😞
I am a newbie to MS Azure Cloud. I heard a lot about the managed Kubernetes (AKS) offering on Twitter, so I thought I would try it out. I played with the Google Container Engine quickstart and was up and running in minutes. I tried to work with this quickstart: I am getting this error: Am I missing a step here, or is this all related to the capacity issues in West US 2?
+1 |
We have opened up East US for AKS deployments. Please deploy in the East US region. Thanks for your patience.
Is the capacity issue with AKS resolved for ukwest and westus2?
Not really; they are no longer accepting AKS workloads in those regions. Your choice now is East US, West Europe, or Central US. See the regions doc here.
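Re-targeting one of the open regions only requires changing the location; a sketch with eastus as an example and placeholder names:
$ az group create --name <group> --location eastus   # create a resource group in a supported region
$ az aks create --resource-group <group> --name <cluster> --agent-count 3 --generate-ssh-keys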
Looks like we are having the same issue in eastus. This is what I am getting now:
$ kubectl describe pod kube-svc-redirect-jrfjd -n kube-system
Name: kube-svc-redirect-jrfjd
Namespace: kube-system
Node: aks-nodepool1-19361140-0/10.240.0.4
Start Time: Mon, 11 Dec 2017 08:48:45 -0500
Labels: component=kube-svc-redirect
controller-revision-hash=3376999726
pod-template-generation=1
tier=node
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"DaemonSet","namespace":"kube-system","name":"kube-svc-redirect","uid":"d8e715e1-de79-11e7-9d8d-0a58ac1f102...
Status: Running
IP: 10.240.0.4
Created By: DaemonSet/kube-svc-redirect
Controlled By: DaemonSet/kube-svc-redirect
Containers:
redirector:
Container ID: docker://7815c8c9f92181645cb2659eef6793123c4bf54624563d429755064795060c35
Image: dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3
Image ID: docker-pullable://dockerio.azureedge.net/deis/kube-svc-redirect@sha256:ccc6b31039754db718dac8c5d723b9db6a4070a252deaf4ea2c14b018343627e
Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 11 Dec 2017 15:03:56 -0500
Finished: Mon, 11 Dec 2017 15:03:56 -0500
Ready: False
Restart Count: 78
Environment:
APISERVER_FQDN: t_presto-rgakspresto-1b9b4d-9a5bbdbb.hcp.eastus.azmk8s.io
KUBERNETES_SVC_IP: 10.0.0.1
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-3t4rg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-3t4rg
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/master=true:NoSchedule
node.alpha.kubernetes.io/notReady:NoExecute
node.alpha.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 1m (x79 over 6h) kubelet, aks-nodepool1-19361140-0 pulling image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
Normal Pulled 1m (x79 over 6h) kubelet, aks-nodepool1-19361140-0 Successfully pulled image "dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"
Normal Created 1m (x79 over 6h) kubelet, aks-nodepool1-19361140-0 Created container
Normal Started 1m (x79 over 6h) kubelet, aks-nodepool1-19361140-0 Started container
Warning BackOff 14s (x1686 over 6h) kubelet, aks-nodepool1-19361140-0 Back-off restarting failed container
Warning FailedSync 14s (x1686 over 6h) kubelet, aks-nodepool1-19361140-0 Error syncing pod
$ kubectl describe pod kubernetes-dashboard-1672970692-bfn8z -n kube-system
Name: kubernetes-dashboard-1672970692-bfn8z
Namespace: kube-system
Node: aks-nodepool1-19361140-0/10.240.0.4
Start Time: Mon, 11 Dec 2017 08:49:40 -0500
Labels: k8s-app=kubernetes-dashboard
kubernetes.io/cluster-service=true
pod-template-hash=1672970692
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kubernetes-dashboard-1672970692","uid":"d8ed8d20-de79-11e7-9...
Status: Running
IP: 10.244.0.2
Created By: ReplicaSet/kubernetes-dashboard-1672970692
Controlled By: ReplicaSet/kubernetes-dashboard-1672970692
Containers:
main:
Container ID: docker://5c3600ddff4eee7ca8913577af09fa63c9a23176b064207d58ce2f6cca0fba59
Image: gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3
Image ID: docker-pullable://gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64@sha256:2c4421ed80358a0ee97b44357b6cd6dc09be6ccc27dfe9d50c9bfc39a760e5fe
Port: 9090/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 11 Dec 2017 15:03:41 -0500
Finished: Mon, 11 Dec 2017 15:04:12 -0500
Ready: False
Restart Count: 76
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 100m
memory: 50Mi
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-3t4rg (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-3t4rg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-3t4rg
Optional: false
QoS Class: Guaranteed
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 20m (x7 over 6h) kubelet, aks-nodepool1-19361140-0 Killing container with id docker://main:pod "kubernetes-dashboard-1672970692-bfn8z_kube-system(d8ef4236-de79-11e7-9d8d-0a58ac1f102b)" container "main" is unhealthy, it will be killed and re-created.
Warning Unhealthy 14m (x16 over 6h) kubelet, aks-nodepool1-19361140-0 Liveness probe failed: Get http://10.244.0.2:9090/: dial tcp 10.244.0.2:9090: getsockopt: connection refused
Normal Pulled 4m (x76 over 6h) kubelet, aks-nodepool1-19361140-0 Container image "gcrio.azureedge.net/google_containers/kubernetes-dashboard-amd64:v1.6.3" already present on machine
Normal Created 4m (x77 over 6h) kubelet, aks-nodepool1-19361140-0 Created container
Normal Started 4m (x77 over 6h) kubelet, aks-nodepool1-19361140-0 Started container
Warning BackOff 6s (x1565 over 6h) kubelet, aks-nodepool1-19361140-0 Back-off restarting failed container
Warning FailedSync 6s (x1565 over 6h) kubelet, aks-nodepool1-19361140-0 Error syncing pod
East US has the same issue:
Thanks for your patience during the preview. We are up in East US, Central US, and West Europe. Additional details here: #56 (comment). I'm going to close out this umbrella ticket. Feel free to open new issues as you experience problems.
Update Nov 6, 15:50 PST
Capacity in westus2 has been increased; if you continue having difficulties with existing clusters, please try deleting your cluster(s) and re-creating them.
Update Nov 5, 12:05 PM PST
Users should be able to create new AKS clusters in westus2. Please report any issues on this thread, thanks!
Update Nov 3, 2017 21:01 PDT
While base compute/network capacity has been addressed, persistent HTTP errors with ARM in westus2 are preventing Azure Load Balancers created via Kubernetes from obtaining public IPs. We're working with the ARM team to resolve this.
Update Nov 3, 2017 17:10 PDT
We're still in the process of rolling out additional compute and networking capacity in West US 2. We recommend deleting existing clusters and monitoring this issue for updates on when to try again.
Update October 25, 2017 19:07 PDT
We received some good news from our capacity team and plan to both expand capacity in West US 2 and deploy AKS in additional US regions by the end of the week. Thanks for your patience with our literal growing pains!
October 25, 2017 11:00 am PDT
The AKS team is currently adding AKS capacity in West US 2 to keep up with demand. Until new capacity is in place, users on new AKS clusters won't be able to run kubectl logs, kubectl exec, or kubectl proxy.
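For reference, the impaired commands are the ones that reach from the API server back to the nodes, presumably via the tunnelfront component discussed above; with a placeholder pod name they look like:
$ kubectl logs <pod-name>                  # stream a pod's logs from its node
$ kubectl exec -it <pod-name> -- /bin/sh   # open an interactive shell in a container
$ kubectl proxy                            # proxy local requests to the cluster API server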