Skip to content

[v3.30] Fix CNI delete timer to start after acquiring IPAM lock#11943

Merged
fasaxc merged 1 commit into
projectcalico:release-v3.30from
sridhartigera:auto-pick-of-#11824-upstream-release-v3.30
Feb 27, 2026
Merged

[v3.30] Fix CNI delete timer to start after acquiring IPAM lock#11943
fasaxc merged 1 commit into
projectcalico:release-v3.30from
sridhartigera:auto-pick-of-#11824-upstream-release-v3.30

Conversation

@sridhartigera
Copy link
Copy Markdown
Member

@sridhartigera sridhartigera commented Feb 26, 2026

Cherry-pick history

Description

Type: Bug fix

Why this should be merged:

This PR fixes a critical bug in the CNI plugin's DELETE operation where the 90-second context timeout was incorrectly started before acquiring the IPAM lock, rather than after. This inconsistency
with ADD operations (AssignIP and AutoAssign) caused DELETE operations to timeout when waiting for the lock during high pod churn.

Problem:

  • During concurrent pod deletions, CNI processes queue for the host-wide IPAM lock
  • The DELETE operation started its 90-second timer before acquiring the lock
  • Lock wait time counted against the timeout budget
  • By the time a DELETE acquired the lock, the context was often already expired or near expiration
  • This resulted in "client rate limiter Wait returned an error: context deadline exceeded" errors

Solution:
In cni-plugin/pkg/ipamplugin/ipam_plugin.go.

  • Lock is now acquired first, then the 90-second timer starts
  • This matches the pattern already used in ADD operations
  • DELETE operations now get the full 90 seconds for API calls, regardless of lock wait time

Components affected:

  • CNI plugin IPAM delete operations (cmdDel function)
  • Specifically the ReleaseByHandle flow

Impact:

  • Low risk: Only reorders existing code, no logic changes
  • High benefit: Eliminates cascading DELETE failures during high pod churn
  • No API changes, backward compatible

Related issues/PRs

8822f57

Todos

  • Tests - Existing tests should pass; new tests for lock/timer ordering would be beneficial
  • Documentation - Code comment added explaining the timer ordering rationale
  • Release note - Included below

Release Note

Fix CNI delete timeout to start after IPAM lock acquisition, preventing "context deadline exceeded" failures during high pod churn.

Start the 90-second timeout after acquiring the lock in cmdDel, matching the pattern used in ADD operations. Previously, the timer started before lock acquisition, causing "context deadline exceeded" errors when DELETE operations waited in queue for the lock.
@sridhartigera sridhartigera requested a review from a team as a code owner February 26, 2026 18:35
Copilot AI review requested due to automatic review settings February 26, 2026 18:35
@sridhartigera sridhartigera added release-note-required Change has user-facing impact (no matter how small) docs-not-required Docs not required for this change labels Feb 26, 2026
@marvin-tigera marvin-tigera added this to the Calico v3.30.6 milestone Feb 26, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cherry-picks a critical bug fix from the main branch to the v3.30 release branch. The fix corrects the timing of the 90-second context timeout in CNI DELETE operations to start after acquiring the IPAM lock, rather than before. This inconsistency with ADD operations was causing DELETE operations to timeout during high pod churn scenarios when multiple CNI processes queued for the host-wide IPAM lock.

Changes:

  • Moved context timeout initialization in cmdDel to occur after IPAM lock acquisition, matching the pattern used in ADD operations
  • This ensures DELETE operations get the full 90 seconds for API calls, regardless of lock wait time

unlock := acquireIPAMLockBestEffort(conf.IPAMLockFile)
defer unlock()

ctx := context.Background()
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout initialization should include an explanatory comment similar to the one in the ADD operations (lines 183-184 and 284-285). This comment explains why the timeout is started after acquiring the lock. Consider adding: "Only start the timeout after we get the lock. When there's a thundering herd of pod deletions, acquiring the lock can take a while."

Suggested change
ctx := context.Background()
ctx := context.Background()
// Only start the timeout after we get the lock. When there's a thundering herd of pod deletions, acquiring the lock can take a while.

Copilot uses AI. Check for mistakes.
@fasaxc fasaxc merged commit 3da8fa3 into projectcalico:release-v3.30 Feb 27, 2026
8 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants