fix(advisory-lock): make TryAttainLockAsync idempotent against re-entrant calls #2691

Merged
jeremydmiller merged 1 commit into main from fix-advisory-lock-stacking
May 7, 2026

Conversation

@jeremydmiller
Member

Summary

  • Postgres session-level advisory locks stack (Postgres docs: "Multiple lock requests stack, so that if the same resource is locked three times it must then be unlocked three times to be released."). SQL Server session-scoped application locks have the same reentrancy semantics. The heartbeat lease-renewal change in a84d6a262 calls TryAttainLeadershipLockAsync on every tick — including ticks where the leader already holds the lock — to refresh the lease for lease-based backends (RavenDb, Cosmos). For advisory-lock backends (Postgres, SQL Server) this stacks the lock by one count per heartbeat. The single ReleaseLeadershipLockAsync call during DisableAgentsAsync / stepDownAsync then only decrements once, leaving the lock held server-side and silently blocking failover — no error logged, just a stalled election.
  • Surfaced in SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale, which passes 0 of 10 runs in isolation on main (a consistent regression since a84d6a262, confirmed by bisection) and 3 of 3 runs with this fix. The full Postgres LeaderElection compliance suite (12 tests) is green with this change.
  • Both AdvisoryLock.TryAttainLockAsync (Postgres) and SqlServerAdvisoryLock.TryAttainLockAsync get the same idempotent short-circuit: if _locks already contains the id and HasLock confirms the held connection is still alive, return true without calling pg_try_advisory_lock / sp_getapplock again. The HasLock liveness ping keeps the split-brain detection from #2602 ("Leader split-brain: stale Postgres advisory lock causes two leaders to assign agents concurrently") intact: if the lock was lost server-side, HasLock clears _locks and the next TryAttainLockAsync will actually re-attain.
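
The stacking failure and the short-circuit can be illustrated with a small in-memory model. This is a sketch in Python rather than the project's C#, and it assumes a single session only. `FakeServer`, `NaiveLock`, `IdempotentLock`, and `simulate` are hypothetical names that mimic session-level stacking semantics; they are not Wolverine, Postgres, or SQL Server APIs.

```python
from collections import defaultdict


class FakeServer:
    """Hypothetical stand-in for one database session with stacking locks."""

    def __init__(self):
        # lock id -> number of times this session has acquired it
        self.counts = defaultdict(int)

    def try_lock(self, lock_id):
        self.counts[lock_id] += 1  # every successful request adds a level
        return True

    def unlock(self, lock_id):
        if self.counts[lock_id] > 0:
            self.counts[lock_id] -= 1  # one unlock removes one level only

    def is_held(self, lock_id):
        return self.counts[lock_id] > 0


class NaiveLock:
    """Models the pre-fix behavior: every attain call reaches the server."""

    def __init__(self, server):
        self.server = server
        self.locks = set()  # ids this process believes it holds

    def try_attain(self, lock_id):
        if self.server.try_lock(lock_id):
            self.locks.add(lock_id)
            return True
        return False

    def release(self, lock_id):
        self.server.unlock(lock_id)
        self.locks.discard(lock_id)


class IdempotentLock(NaiveLock):
    """Models the fix: skip the server call when the lock is already held."""

    def has_lock(self, lock_id):
        # Liveness check: forget the id if the server no longer holds it.
        if lock_id in self.locks and not self.server.is_held(lock_id):
            self.locks.discard(lock_id)
        return lock_id in self.locks

    def try_attain(self, lock_id):
        if lock_id in self.locks and self.has_lock(lock_id):
            return True  # already held, so do not stack another level
        return super().try_attain(lock_id)


def simulate(lock_cls, heartbeats=5, lock_id=42):
    """Attain once per heartbeat, release once, report whether still held."""
    server = FakeServer()
    lock = lock_cls(server)
    for _ in range(heartbeats):
        lock.try_attain(lock_id)
    lock.release(lock_id)
    return server.is_held(lock_id)
```

In this model `simulate(NaiveLock)` leaves the lock held after the single release, while `simulate(IdempotentLock)` frees it, which corresponds to the failover case a standby node is waiting on.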

Test plan

  • New Bug_advisory_lock_stacking_blocks_failover regression test for both backends — verified to fail on unpatched main (proves they exercise the bug) and pass with the fix
  • Bug_split_brain_advisory_lock_state_divergence (the regression test for #2602, "Leader split-brain: stale Postgres advisory lock causes two leaders to assign agents concurrently") still passes: the HasLock liveness ping is still exercised by the next TryAttainLockAsync after a backend kill
  • Bug_2518_concurrent_migration_advisory_lock still passes
  • SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale — 3/3 pass (was 0/10 on unpatched main)
  • Full Postgres LeaderElection compliance suite — 12/12 passed, 0 failed (~2m13s)
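
The interaction the split-brain test depends on (a backend kill followed by a real re-attain) can be sketched the same way. This is an illustrative Python model, not the test code from this PR. `Session`, `LockTracker`, and `server_calls` are hypothetical names, and the model assumes that killing a session drops every lock it held.

```python
class Session:
    """Hypothetical database session; killing it drops every lock it held."""

    def __init__(self):
        self.alive = True
        self.held = {}  # lock id -> stack count

    def try_lock(self, lock_id):
        if not self.alive:
            raise ConnectionError("session terminated")
        self.held[lock_id] = self.held.get(lock_id, 0) + 1
        return True

    def kill(self):
        self.alive = False
        self.held.clear()


class LockTracker:
    """Local bookkeeping with the idempotent short-circuit and liveness check."""

    def __init__(self, connect):
        self.connect = connect  # factory that opens a new session
        self.session = None
        self.locks = set()
        self.server_calls = 0  # counts real lock requests, for illustration

    def has_lock(self, lock_id):
        # If the session is gone, local state is stale: clear it.
        if self.session is None or not self.session.alive:
            self.locks.clear()
            self.session = None
            return False
        return lock_id in self.locks

    def try_attain(self, lock_id):
        if lock_id in self.locks and self.has_lock(lock_id):
            return True  # held and alive: no second server request
        if self.session is None:
            self.session = self.connect()
        self.server_calls += 1
        if self.session.try_lock(lock_id):
            self.locks.add(lock_id)
            return True
        return False


tracker = LockTracker(Session)
tracker.try_attain(7)
tracker.try_attain(7)   # short-circuits, so server_calls stays at 1
tracker.session.kill()  # simulate the backend being terminated
tracker.try_attain(7)   # liveness check fails, so this re-attains for real
```

After the kill, the short-circuit does not fire because the liveness check clears the local state first, so the final call opens a new session and takes the lock at a stack depth of one.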

🤖 Generated with Claude Code

…rant calls

Postgres session-level advisory locks STACK ("Multiple lock requests
stack, so that if the same resource is locked three times it must then
be unlocked three times to be released" — Postgres docs). SQL Server
session-scoped application locks have the same reentrancy semantics
("If a lock has been requested ... by the current session,
sp_getapplock can be called multiple times for it ... For each request
that returns success ... sp_releaseapplock must also be called.").

The heartbeat lease-renewal change in a84d6a2 calls
`TryAttainLeadershipLockAsync` on every tick — including ticks where
the leader already holds the lock — to refresh the lease for
lease-based backends (RavenDb, Cosmos). For advisory-lock backends
(Postgres, SQL Server) this stacks the lock by one count per
heartbeat. The single `ReleaseLeadershipLockAsync` call during
`DisableAgentsAsync` / `stepDownAsync` then only decrements once,
leaving the lock held server-side and silently blocking failover —
no error logged, just a stalled election.

Surfaced in
`SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale`,
which passes 0/10 runs in isolation on main (consistent regression, not
flake) and passes 3/3 with this fix. The full Postgres LeaderElection
compliance suite (12 tests) is green with this change.

Both `AdvisoryLock.TryAttainLockAsync` (Postgres) and
`SqlServerAdvisoryLock.TryAttainLockAsync` get the same idempotent
short-circuit: if `_locks` already contains the id and `HasLock`
confirms the held connection is still alive, return true without
calling `pg_try_advisory_lock` / `sp_getapplock` again. The `HasLock`
liveness ping keeps the GH-2602 split-brain detection intact — if the
lock was lost server-side, `HasLock` clears `_locks` and the next
`TryAttainLockAsync` will actually re-attain.

Regression coverage in `Bug_advisory_lock_stacking_blocks_failover`
for both backends. Both tests fail on unpatched main and pass with
the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>