Fixed ControllerUnpublish error handling#165
Fixed ControllerUnpublish error handling#165k8s-ci-robot merged 1 commit intokubernetes-csi:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
bertinatto
left a comment
There was a problem hiding this comment.
LGTM (I might not be aware of all consequences of this change, though)
| var success error | ||
| var readWrite = false | ||
| var readOnly = true | ||
| var ignored = false // the vlaue is irrelevant for given call |
|
|
||
| _, err = client.ControllerUnpublishVolume(ctx, &req) | ||
| if err != nil { | ||
| return isFinalError(err), err |
There was a problem hiding this comment.
I'm worried this will break detach for plugins that interpret the CSI spec as such:
CSI spec says sp must return NotFound if volume or node is not found: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume-errors.
I guess here we were assuming if NotFound was returned, the detach is actually irrelevant/successful. Two scenarios to consider:
-
SP automatically handles volume or node not found as detached: no Unpublish call is actually needed to clean up. Should the sp go against the CSI spec and return success?
-
SP needs ControllerUnpublish called to clean up its state. Still a question of if it should return success or NotFound error afterwards
There was a problem hiding this comment.
I added code that interprets NotFound as success and filled container-storage-interface/spec#373 to clarify it in the spec.
There was a problem hiding this comment.
Based on CSI spec meeting this morning, it sounds like there is agreement to relax the NotFound error code so that plugins can decide if the unpublish needs to be retried. Based on that, I think we should just retry for any error.
And given the fact that all of the csi plugins we've looked at so far are handling this wrong already, we probably should add an "ACTION REQUIRED" to the release note.
76d3701 to
679ffb8
Compare
| // This is not gRPC error. | ||
| return err | ||
| } | ||
| if st.Code() == codes.NotFound { |
There was a problem hiding this comment.
what about a case where SP temporarily cannot "find" the volume or the disk, but the disk is still attached. This should be NOT_FOUND error that could be remedied by retrying - disk is not actually detached yet.
I would argue that the SP should return OK if it has determined that the volume/node is gone in a way that we can assume the volume is detached. This should be up to the SP to decide.
There was a problem hiding this comment.
We're assuming all volume plugins automatically detach volumes when a node is deleted, and I'm not confident that we can assume that.
There was a problem hiding this comment.
I think it is a reasonable compromise until this behaviour is clarified in spec. I do not know any volume type where attachment outlives lifecycle of node.
There was a problem hiding this comment.
Sorry ignore my last comment, I misread SP as CO. We're on the same page :-)
There was a problem hiding this comment.
Either way I think we are asking plugin authors to reinterpret spec in a way which was previously not documented.
Esp. for node deletion case - do we have a solution in the meanwhile?
There was a problem hiding this comment.
I think vsphere may have some interesting behaviors around node failure/deletion handling. cc @codenrhoden @vladimirvivien
There was a problem hiding this comment.
afaik - in vsphere too, when you delete a node in vcenter, the vmdk files are marked as detached.
|
I checked AWS, it most probably returns |
|
Cinder: it returns |
|
Filed kubernetes/cloud-provider-openstack#718 for OpenStack / Cinder |
Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).
11ac689 to
452b089
Compare
|
I removed special handling of |
|
And updated release-note. |
|
/lgtm @davidz627, I and a few others discussed this, and we came to the conclusion that this change will require a major version bump because we're significantly changing behavior that may require drivers to change. |
At the same time it fixes pretty serious bug that should be fixed in all supported branches. Volumes are marked as detached after "final error" (e.g. I could backport safer version of this patch, marking a volume as detached after |
The behavior of the external-attacher recently changed (kubernetes-csi/external-attacher#165) such that it now treats "not found" as real error. The effect was that some Kubernetes E2E tests (like "CSI mock volume CSI workload information using mock driver should not be passed when podInfoOnMount=false") sometimes ran for over 2 minutes, just waiting for detatch. That the test then proceeds without marking the test as failed is a bug in the test cleanup code which will be fixed. This slowdown is not deterministic: sometimes the detach is done early enough while the volume still exists. With this change, the same test completes in under 30 seconds.
Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).
What type of PR is this?
/kind bug
Which issue(s) this PR fixes:
Fixes #164
Does this PR introduce a user-facing change?:
cc @kubernetes-csi/csi-misc