azure disk dangling attach issue on VMSS which would cause API throttling #90762

andyzhangx · 2020-05-05T13:32:53Z

What happened:

PR #81266 does not convert the VMSS node name, which causes errors like this:

failed to get azure instance id for node \"k8s-agentpool1-32474172-vmss_1216\" (not a vmss instance)

This does not happen often, since the dangling attach error only occurs when a previous disk detach failed: the disk is still attached to the node while no pod on that node is using it.
The disk detach can fail because of 1) a busy VM, 2) a kube-controller-manager restart, etc.

Original VMSS nodeName value

disk.ManagedBy value: /subscriptions/0e46bd28-a80f-4d3a-8200-d9eb8d80cb2e/resourceGroups/kubetest-b28d2392-498e-11ea-af8c-fab712ce9f74/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool-36841236-vmss/virtualMachines/k8s-agentpool-36841236-vmss_1
correct nodeName value: k8s-agentpool-36841236-vmss000001
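
To illustrate the conversion between the two values above, here is a minimal sketch. It assumes the default VMSS computer-name convention (scale set name followed by the instance ID as a six-character, zero-padded base-36 string); `nodeNameFromManagedBy` is a hypothetical helper, not a function from the cloud provider, and the actual fix may resolve the name differently (for example, by looking up the VM itself).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeNameFromManagedBy is a hypothetical helper for illustration only.
// It derives a Kubernetes node name from a disk.ManagedBy resource ID,
// ASSUMING the default VMSS computer-name convention:
// <scale set name> + instance ID in base 36, zero-padded to 6 characters.
func nodeNameFromManagedBy(managedBy string) (string, error) {
	idx := strings.LastIndex(managedBy, "/")
	if idx < 0 || idx == len(managedBy)-1 {
		return "", fmt.Errorf("unexpected resource ID %q", managedBy)
	}
	vmName := managedBy[idx+1:]

	// VMSS instances are named <scale set name>_<decimal instance ID>.
	sep := strings.LastIndex(vmName, "_")
	if sep < 0 {
		// Not a VMSS instance: the VM name is used as the node name.
		return strings.ToLower(vmName), nil
	}
	id, err := strconv.ParseInt(vmName[sep+1:], 10, 64)
	if err != nil || id < 0 {
		return "", fmt.Errorf("invalid instance ID in %q", vmName)
	}

	// Encode the instance ID in base 36 and left-pad with zeros to 6 chars.
	suffix := strconv.FormatInt(id, 36)
	if len(suffix) < 6 {
		suffix = strings.Repeat("0", 6-len(suffix)) + suffix
	}
	return strings.ToLower(vmName[:sep] + suffix), nil
}

func main() {
	managedBy := "/subscriptions/0e46bd28-a80f-4d3a-8200-d9eb8d80cb2e/resourceGroups/kubetest-b28d2392-498e-11ea-af8c-fab712ce9f74/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool-36841236-vmss/virtualMachines/k8s-agentpool-36841236-vmss_1"
	name, err := nodeNameFromManagedBy(managedBy)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // k8s-agentpool-36841236-vmss000001
}
```

Under the same assumption, the node in the error message above (`k8s-agentpool1-32474172-vmss_1216`) would map to a name ending in the base-36 encoding of 1216 rather than the decimal instance ID.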

As a result, the dangling attach check returns an error and the k8s volume attach/detach controller calls getVmssInstance. Because the nodeName is in an incorrect format, the node is never found and the vmss cache is cleaned on every lookup, which incurs a storm of get vmss API calls.

kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go

Lines 172 to 174 in 3365ed5

    
    if !found {
        klog.V(2).Infof("Couldn't find VMSS VM with nodeName %s, refreshing the cache", nodeName)
        vmssName, instanceID, vm, found, err = getter(nodeName, azcache.CacheReadTypeForceRefresh)
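
The refresh-on-miss behaviour described above can be sketched with a toy cache. This is a simplified model, not the provider's code: `vmssCache`, `refresh`, `lookup` and `countRefreshes` are hypothetical names, and the API call is simulated by a counter. It shows why a node name that can never match turns every retry into a forced refresh.

```go
package main

import "fmt"

// vmssCache is a hypothetical stand-in for the provider's VMSS VM cache.
type vmssCache struct {
	entries  map[string]string // nodeName -> instance ID
	apiCalls int               // number of simulated "get vmss" API calls
}

// refresh simulates repopulating the cache from the Azure API.
func (c *vmssCache) refresh() {
	c.apiCalls++
	c.entries = map[string]string{"k8s-agentpool-36841236-vmss000001": "1"}
}

// lookup mirrors the pattern in the snippet above: on a miss, force a
// refresh and try once more.
func (c *vmssCache) lookup(nodeName string) (string, bool) {
	if id, ok := c.entries[nodeName]; ok {
		return id, true
	}
	c.refresh() // a miss always costs one API call
	id, ok := c.entries[nodeName]
	return id, ok
}

// countRefreshes returns how many API calls `attempts` lookups trigger
// once the cache is already warm.
func countRefreshes(nodeName string, attempts int) int {
	c := &vmssCache{}
	c.refresh()    // warm the cache
	c.apiCalls = 0 // count only calls caused by the lookups
	for i := 0; i < attempts; i++ {
		c.lookup(nodeName)
	}
	return c.apiCalls
}

func main() {
	// Correctly formatted node name: served from the cache.
	fmt.Println(countRefreshes("k8s-agentpool-36841236-vmss000001", 5)) // 0
	// Unconverted name taken from disk.ManagedBy: every retry refreshes.
	fmt.Println(countRefreshes("k8s-agentpool-36841236-vmss_1", 5)) // 5
}
```

In this model the number of API calls grows linearly with the number of retries for the malformed name, which is the throttling risk the issue describes.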

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Network plugin and version (if this is a network-related bug):
Others:

/kind bug
/assign
/priority important-soon
/sig cloud-provider
/area provider/azure


andyzhangx added the kind/bug Categorizes issue or PR as related to a bug. label May 5, 2020

k8s-ci-robot assigned andyzhangx May 5, 2020

andyzhangx mentioned this issue May 5, 2020

fix: azure disk dangling attach issue on VMSS which would cause API throttling #90749

Merged

k8s-ci-robot closed this as completed in #90749 May 6, 2020
