Volume create delete hardening #1592
Welcome @dluthcke!
Hi @dluthcke. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// Take the lock to prevent this access point from being deleted while creating volume
d.lockManager.lockMutex(accessPoint.AccessPointId)
defer d.lockManager.unlockMutex(accessPoint.AccessPointId)
if d.lockManager.lockMutex(accessPoint.AccessPointId, ApLockWaitTimeSec * time.Second) {
It would be better to have a random timeout, for example a random value between 3 and 5?
I'm not sure a random timeout buys us much. The intention of the timeout here is to prevent a deadlock. If threads fail and all get rescheduled at the same time, the mutex should still serialize calls to the same AP, so I think the length of the timeout wouldn't matter too much.
That is true. We can leave it with a constant timeout for now.
/ok-to-test
/test pull-aws-efs-csi-driver-unit
Looks good to me. Can you squash the commits into one?
…ent createVolume and deleteVolume calls are handled correctly
a207ccf to 76324ab
/test pull-aws-efs-csi-driver-unit
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dluthcke, mskanth972, wangnyue.
Is this a bug fix or adding new feature?
This change hardens the volume creation and deletion logic to ensure idempotency and avoid race conditions.
What is this PR about? / Why do we need it?
In rare cases, volume creation and deletion can race against each other if Kubernetes retries a request while either operation is still in progress. This PR serializes concurrent requests to the same access point and adds cleanup logic so that each request completes and cleans up after itself before a subsequent operation on that access point is allowed.
What testing is done?