add retry for detach azure disk #74398

andyzhangx · 2019-02-22T07:17:42Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Current azure cloud provider would fail to detach azure disk when there is server side error, need to add retry mechanism for detach disk operation, while for attach disk operation, it's not necessary since k8s pv-controller will try if first try failed.

Which issue(s) this PR fixes:

Fixes #74396

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix issue: fail to detach azure disk when there is server side error

/kind bug
/assign @feiskyer
/priority important-soon
/sig azure

k8s-ci-robot · 2019-02-22T07:18:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cloudprovider/providers/azure/OWNERS~~ [andyzhangx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

add more logging info in detach disk

feiskyer

/lgtm

andyzhangx · 2019-02-22T08:52:52Z

BTW, if move 8 disks from one node to another in parellel, there will be always one disk not detched(only one), the error is like following:

I0222 08:38:22.503087       1 azure_controller_vmss.go:140] azureDisk - update(andy-vmss1124) backing off: vm(k8s-agentpool-24194600-vmss000001) detach disk(, /subscriptions/xxx/resourceGroups/andy-vmss1124/providers/Microsoft.Compute/disks/andy-vmss1124-dynamic-pvc-66815035-35ac-11e9-921c-000d3a002125), err: compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

The detach action finally failed, it's more like attach a disk to node#1 would cause detach that disk from node#2 to fail which is not reasonable, I will contact with azure disk RP to check with this issue.

feiskyer · 2019-02-22T09:05:17Z

Then let's hold a while for root causes.

/hold

feiskyer · 2019-02-22T09:06:59Z

/test pull-kubernetes-e2e-aks-engine-azure

andyzhangx · 2019-02-22T14:18:32Z

the funny thing is Cannot attach data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' to VM 'k8s-agentpool-24194600-vmss_1', while I never have data disk '763c0d05-4ae5-4699-8ce8-9c8ced5283da' in my resource group, and I could repro that easily in two vmss k8s clusters.

andyzhangx · 2019-02-25T07:18:17Z

@feiskyer The issue I could repro is related to VMSS, while I still insist on adding this retry logic only for detach disk (not necessary for attach disk since k8s controller will retry if failed) in case there is any potential issue, thus azure cloud provider could have more chance to retry, although in this case, it does not work perfectly. what's your opinion?

feiskyer · 2019-02-25T09:14:49Z

The issue I could repro is related to VMSS, while I still insist on adding this retry logic only for detach disk (not necessary for attach disk since k8s controller will retry if failed) in case there is any potential issue, thus azure cloud provider could have more chance to retry, although in this case, it does not work perfectly. what's your opinion?

Agreed. Let's wait a while for VMSS responses, in case there're still other potential issues.

andyzhangx · 2019-02-26T07:05:05Z

the failed error is most likely due to slow disk attach/detach on VMSS, let's merge this PR first since it would mitigate the issue a little
/hold cancel

…4398-upstream-release-1.13 Automated cherry pick of #74398: add retry for detach azure disk

…4398-upstream-release-1.11 Automated cherry pick of #74398: add retry for detach azure disk

…4398-upstream-release-1.12 Automated cherry pick of #74398: add retry for detach azure disk

antoineco · 2019-05-20T13:17:33Z

@andyzhangx @feiskyer I see there was no final conclusion on the issue, but in some clusters we can't seem to be able to get out of that "AttachDiskWhileBeingDetached" error loop without deleting affected Pods and giving Azure enough time to clean up the mess (detach disks from their VMSS instance).

andyzhangx · 2019-05-20T13:26:44Z

@antoineco the AttachDiskWhileBeingDetached issue on VMSS is already fixed in VMSS RP in mid April. This PR fixed the issue when disk detach operation failed and the disk is never detached on the original node.

antoineco · 2019-05-20T13:52:25Z

@andyzhangx thanks for the quick answer, unfortunately I'm not familiar with the "RP" acronym sorry (😅).

What I'm currently observing is a seemingly infinite loop of:

Code="AttachDiskWhileBeingDetached"
Message="Cannot attach data disk 'foo-bar-123' to VM 'mycluster-vmss_0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

I've tried waiting for over an hour sometimes, but for some reason the retry logic is not enough. Detaching disks manually (e.g. via the az CLI) returns the same error.

andyzhangx · 2019-05-20T14:00:35Z

@antoineco can you open an issue and also provide details info:

k8s version
region

And what's the status of disk foo-bar-123? is it attached or unattached?
if that disk is attached, you could also provide info by az vmss show -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID(number)> to show which VM is that disk attached to.

andyzhangx · 2019-05-20T14:07:51Z

@antoineco you may also file an azure support ticket for this issue, and paste this github link, and azure team would analyze your issue first, pls also provide my suggested info, thanks.

antoineco · 2019-05-20T14:12:17Z

@andyzhangx I will open an issue, just let me drop that here first for future reference:

Kubernetes version: it's complicated (I'll explain, but we're testing 1.12 too)
Regions: eastus, westeurope

"diskState": "Unattached"
(sometimes the disk is attached and impossible to detach for > 20 min)

andyzhangx · 2019-05-20T15:28:13Z

@andyzhangx I will open an issue, just let me drop that here first for future reference:

Kubernetes version: it's complicated (I'll explain, but we're testing 1.12 too)

Regions: eastus, westeurope

"diskState": "Unattached"
(sometimes the disk is attached and impossible to detach for > 20 min)

pls also cherry pick this PR: #74398, it fixed this issue: (sometimes the disk is attached and impossible to detach for > 20 min), only need to patch the kube-controller-manager

BTW, recently we found slow disk attach/detach issue when disk num is large, how many disks are attached to one VM?

antoineco · 2019-05-20T15:36:28Z

It's been cherry-picked but I documented it as #74581 (all references are to 1.12 cherry-picks). Maybe I lost something important in the conflict resolution and need to review again.

Each node usually has about 5 disks attached, which should be fairly reasonable.

antoineco · 2019-05-21T13:37:10Z

Ref. #78172

k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 22, 2019

k8s-ci-robot assigned feiskyer Feb 22, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2019

k8s-ci-robot requested review from justaugustus and karataliu February 22, 2019 07:18

k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Feb 22, 2019

add retry for detach azure disk

8c53db0

add more logging info in detach disk

andyzhangx force-pushed the detach-azuredisk-retry branch from df2edd2 to 8c53db0 Compare February 22, 2019 07:45

feiskyer reviewed Feb 22, 2019

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2019

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 26, 2019

k8s-ci-robot merged commit d500740 into kubernetes:master Feb 26, 2019

This was referenced Feb 26, 2019

Automated cherry pick of #74398: add retry for detach azure disk #74579

Merged

Automated cherry pick of #74398: add retry for detach azure disk #74581

Merged

Automated cherry pick of #74398: add retry for detach azure disk #74593

Merged

k8s-ci-robot added a commit that referenced this pull request Feb 27, 2019

Merge pull request #74579 from andyzhangx/automated-cherry-pick-of-#7…

b1a3a65

…4398-upstream-release-1.13 Automated cherry pick of #74398: add retry for detach azure disk

k8s-ci-robot added a commit that referenced this pull request Mar 5, 2019

Merge pull request #74593 from andyzhangx/automated-cherry-pick-of-#7…

4de1b07

…4398-upstream-release-1.11 Automated cherry pick of #74398: add retry for detach azure disk

k8s-ci-robot added a commit that referenced this pull request Mar 7, 2019

Merge pull request #74581 from andyzhangx/automated-cherry-pick-of-#7…

3efde81

…4398-upstream-release-1.12 Automated cherry pick of #74398: add retry for detach azure disk

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add retry for detach azure disk #74398

add retry for detach azure disk #74398

andyzhangx commented Feb 22, 2019

k8s-ci-robot commented Feb 22, 2019

feiskyer left a comment

andyzhangx commented Feb 22, 2019 •

edited

Loading

feiskyer commented Feb 22, 2019

feiskyer commented Feb 22, 2019

andyzhangx commented Feb 22, 2019

andyzhangx commented Feb 25, 2019

feiskyer commented Feb 25, 2019

andyzhangx commented Feb 26, 2019

antoineco commented May 20, 2019

andyzhangx commented May 20, 2019

antoineco commented May 20, 2019 •

edited

Loading

andyzhangx commented May 20, 2019

andyzhangx commented May 20, 2019

antoineco commented May 20, 2019 •

edited

Loading

andyzhangx commented May 20, 2019 •

edited

Loading

antoineco commented May 20, 2019

antoineco commented May 21, 2019

add retry for detach azure disk #74398

add retry for detach azure disk #74398

Conversation

andyzhangx commented Feb 22, 2019

k8s-ci-robot commented Feb 22, 2019

feiskyer left a comment

Choose a reason for hiding this comment

andyzhangx commented Feb 22, 2019 • edited Loading

feiskyer commented Feb 22, 2019

feiskyer commented Feb 22, 2019

andyzhangx commented Feb 22, 2019

andyzhangx commented Feb 25, 2019

feiskyer commented Feb 25, 2019

andyzhangx commented Feb 26, 2019

antoineco commented May 20, 2019

andyzhangx commented May 20, 2019

antoineco commented May 20, 2019 • edited Loading

andyzhangx commented May 20, 2019

andyzhangx commented May 20, 2019

antoineco commented May 20, 2019 • edited Loading

andyzhangx commented May 20, 2019 • edited Loading

antoineco commented May 20, 2019

antoineco commented May 21, 2019

andyzhangx commented Feb 22, 2019 •

edited

Loading

antoineco commented May 20, 2019 •

edited

Loading

antoineco commented May 20, 2019 •

edited

Loading

andyzhangx commented May 20, 2019 •

edited

Loading