[v3.30] Fix CNI delete timer to start after acquiring IPAM lock #11943
Start the 90-second timeout after acquiring the lock in cmdDel, matching the pattern used in ADD operations. Previously, the timer started before lock acquisition, causing "context deadline exceeded" errors when DELETE operations waited in queue for the lock.
Pull request overview
This PR cherry-picks a critical bug fix from the main branch to the v3.30 release branch. The fix starts the 90-second context timeout in CNI DELETE operations after the IPAM lock is acquired, rather than before. The old ordering was inconsistent with ADD operations and caused DELETE operations to time out during high pod churn, when multiple CNI processes queued for the host-wide IPAM lock.
Changes:
- Moved context timeout initialization in cmdDel to occur after IPAM lock acquisition, matching the pattern used in ADD operations
- This ensures DELETE operations get the full 90 seconds for API calls, regardless of lock wait time
    unlock := acquireIPAMLockBestEffort(conf.IPAMLockFile)
    defer unlock()

    ctx := context.Background()
The timeout initialization should include an explanatory comment similar to the one in the ADD operations (lines 183-184 and 284-285). This comment explains why the timeout is started after acquiring the lock. Consider adding: "Only start the timeout after we get the lock. When there's a thundering herd of pod deletions, acquiring the lock can take a while."
    ctx := context.Background()
    // Only start the timeout after we get the lock. When there's a thundering herd of pod deletions, acquiring the lock can take a while.
Cherry-pick history
Description
Type: Bug fix
Why this should be merged:
This PR fixes a critical bug in the CNI plugin's DELETE operation where the 90-second context timeout was incorrectly started before acquiring the IPAM lock, rather than after. This inconsistency with ADD operations (AssignIP and AutoAssign) caused DELETE operations to time out while waiting for the lock during high pod churn.
Problem:
Solution:
In cni-plugin/pkg/ipamplugin/ipam_plugin.go.
Components affected:
- cmdDel function
- ReleaseByHandle flow
Impact:
Related issues/PRs
8822f57
Todos
Release Note