Volume create delete hardening by dluthcke · Pull Request #1592 · kubernetes-sigs/aws-efs-csi-driver

dluthcke · 2025-02-20T15:30:26Z

Is this a bug fix or adding new feature?
This change hardens the volume creation and deletion logic to ensure idempotency and avoid race conditions between the two operations.

What is this PR about? / Why do we need it?
In rare cases, the volume creation and deletion functions can race against each other if Kubernetes attempts a retry while either creation or deletion is in progress. This PR serializes concurrent requests to the same access point and adds cleanup logic so that each individual request can complete and clean up after itself before a subsequent operation on that access point is allowed.
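
A minimal sketch of what serializing requests per access point can look like, assuming one mutex per access point ID. The `keyedLocks` type, its methods, and the ID `fsap-example` are hypothetical names for illustration and are not the driver's `lockManager`.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocks is a hypothetical helper (not the driver's lockManager) that
// hands out one mutex per key, here an access point ID.
type keyedLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedLocks() *keyedLocks {
	return &keyedLocks{locks: make(map[string]*sync.Mutex)}
}

// get returns the mutex for key, creating it on first use.
func (k *keyedLocks) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	return m
}

// withLock runs fn while holding the mutex for key, so calls that share a
// key run one at a time while calls on different keys do not block each other.
func (k *keyedLocks) withLock(key string, fn func()) {
	m := k.get(key)
	m.Lock()
	defer m.Unlock()
	fn()
}

// runConcurrently starts n goroutines that each increment an unsynchronized
// counter under the same key; the result equals n only if calls are serialized.
func runConcurrently(k *keyedLocks, key string, n int) int {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			k.withLock(key, func() { counter++ })
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	k := newKeyedLocks()
	fmt.Println(runConcurrently(k, "fsap-example", 100)) // prints 100
}
```

Keying the lock on the access point ID means a retried create and an in-flight delete for the same access point cannot interleave, while unrelated access points proceed in parallel.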

What testing is done?

New unit tests were added to verify that concurrent calls to the create and delete functions are handled correctly.
e2e and driver upgrade tests were run and passed successfully.
A workload with many concurrent PV creations and deletions was run on a cluster with no adverse effects.
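
As a rough illustration of the kind of concurrency check described in this list, the sketch below fires many identical create and delete requests at an in-memory fake and confirms that repeats are harmless. The `fakeStore` type, the `hammer` helper, and the name `pvc-example` are hypothetical and are not taken from the PR's test code.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeStore is a hypothetical in-memory stand-in for the storage backend; it
// is not part of the driver. It counts how many volumes were really created.
type fakeStore struct {
	mu      sync.Mutex
	volumes map[string]bool
	creates int
}

func newFakeStore() *fakeStore {
	return &fakeStore{volumes: make(map[string]bool)}
}

// create is idempotent: a repeated request for the same name succeeds
// without creating a second volume.
func (s *fakeStore) create(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.volumes[name] {
		return
	}
	s.volumes[name] = true
	s.creates++
}

// remove is idempotent: deleting a volume that is already gone is a no-op.
func (s *fakeStore) remove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.volumes, name)
}

// hammer issues n concurrent calls of op and waits for all of them.
func hammer(n int, op func()) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			op()
		}()
	}
	wg.Wait()
}

func main() {
	s := newFakeStore()
	hammer(50, func() { s.create("pvc-example") })
	fmt.Println(s.creates, len(s.volumes)) // prints "1 1"
	hammer(50, func() { s.remove("pvc-example") })
	fmt.Println(len(s.volumes)) // prints "0"
}
```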

linux-foundation-easycla · 2025-02-20T15:30:31Z

The committers listed above are authorized under a signed CLA.

✅ login: dluthcke (76324ab)

k8s-ci-robot · 2025-02-20T15:30:35Z

Welcome @dluthcke!

It looks like this is your first PR to kubernetes-sigs/aws-efs-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-efs-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-02-20T15:30:36Z

Hi @dluthcke. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wangnyue · 2025-03-12T13:34:09Z

 			// Take the lock to prevent this access point from being deleted while creating volume
-			d.lockManager.lockMutex(accessPoint.AccessPointId)
-			defer d.lockManager.unlockMutex(accessPoint.AccessPointId)
+			if d.lockManager.lockMutex(accessPoint.AccessPointId, ApLockWaitTimeSec * time.Second) {


Would it be better to use a random timeout, for example a random value between 3 and 5?

I'm not sure a random timeout buys us much. The intention of the timeout here is to prevent a deadlock. If threads fail and all get rescheduled at the same time, the mutex should still serialize calls to the same AP, so I think the length of the timeout wouldn't matter too much.

That is true. We can keep a constant timeout for now.

wangnyue

LGTM

mskanth972 · 2025-03-13T14:56:46Z

/ok-to-test

mskanth972 · 2025-03-13T16:12:32Z

/test pull-aws-efs-csi-driver-unit

mskanth972 · 2025-03-17T14:26:09Z

looks good to me, can you squash the commits to one?

mskanth972 · 2025-03-17T15:01:39Z

/test pull-aws-efs-csi-driver-unit

mskanth972 · 2025-03-17T15:42:11Z

/lgtm
/approve

k8s-ci-robot · 2025-03-17T15:42:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dluthcke, mskanth972, wangnyue

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mskanth972]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) Feb 20, 2025

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Feb 20, 2025

k8s-ci-robot requested review from Ashley-wenyizha and leakingtapan February 20, 2025 15:30

k8s-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels and removed the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) Feb 20, 2025

wangnyue reviewed Mar 12, 2025

wangnyue approved these changes Mar 12, 2025

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Mar 13, 2025

Adding additional checks and multi threaded testing to ensure concurrent createVolume and deleteVolume calls are handled correctly

76324ab

dluthcke force-pushed the volume-create-delete-hardening branch from a207ccf to 76324ab March 17, 2025 14:40

k8s-ci-robot assigned mskanth972 Mar 17, 2025

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Mar 17, 2025

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 17, 2025

k8s-ci-robot merged commit 6b2a4a5 into kubernetes-sigs:master Mar 17, 2025

dankova22 mentioned this pull request Mar 18, 2025

Buggy check for non-negative gid #1602

Closed

hlhl040 mentioned this pull request Jun 14, 2025

read-only file system error during volume deletion #1646

Closed

rhrmo mentioned this pull request Jun 26, 2025

STOR-2403: Rebase to upstream v2.1.8 for OCP 4.20 openshift/aws-efs-csi-driver#97

Merged

dankova22 mentioned this pull request Aug 15, 2025

Added checks to make sure delete-access-point should not delete entir… #1201

Closed

4 participants
