fix(advisory-lock): make TryAttainLockAsync idempotent against re-entrant calls #2691
Merged
jeremydmiller merged 1 commit into main (May 7, 2026)
Conversation
fix(advisory-lock): make TryAttainLockAsync idempotent against re-entrant calls
Postgres session-level advisory locks STACK ("Multiple lock requests
stack, so that if the same resource is locked three times it must then
be unlocked three times to be released" — Postgres docs). SQL Server
session-scoped application locks have the same reentrancy semantics
("If a lock has been requested ... by the current session,
sp_getapplock can be called multiple times for it ... For each request
that returns success ... sp_releaseapplock must also be called.").
The heartbeat lease-renewal change in a84d6a2 calls
`TryAttainLeadershipLockAsync` on every tick — including ticks where
the leader already holds the lock — to refresh the lease for
lease-based backends (RavenDb, Cosmos). For advisory-lock backends
(Postgres, SQL Server) this stacks the lock by one count per
heartbeat. The single `ReleaseLeadershipLockAsync` call during
`DisableAgentsAsync` / `stepDownAsync` then only decrements once,
leaving the lock held server-side and silently blocking failover —
no error logged, just a stalled election.
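The failure mode above can be modeled in a few lines. The sketch below is a minimal Python simulation (not the actual C# code; `SessionAdvisoryLock`, `try_acquire`, and `is_held` are illustrative stand-ins) of session-level stacking semantics, a heartbeat that re-acquires on every tick, and the single release on step-down:

```python
class SessionAdvisoryLock:
    """Minimal model of Postgres/SQL Server session-level advisory
    lock semantics: each successful acquire stacks one count, and
    each release decrements by exactly one."""

    def __init__(self):
        self._counts = {}  # lock id -> stack depth

    def try_acquire(self, lock_id):
        # Same session re-acquiring always succeeds and stacks.
        self._counts[lock_id] = self._counts.get(lock_id, 0) + 1
        return True

    def release(self, lock_id):
        if self._counts.get(lock_id, 0) > 0:
            self._counts[lock_id] -= 1
            if self._counts[lock_id] == 0:
                del self._counts[lock_id]

    def is_held(self, lock_id):
        return self._counts.get(lock_id, 0) > 0


server = SessionAdvisoryLock()

# The leader's heartbeat re-attains the lock on every tick.
for _tick in range(5):
    server.try_acquire(42)

# Stepping down releases exactly once...
server.release(42)

# ...but four stacked counts remain, so failover stays blocked.
print(server.is_held(42))  # True
```

The model makes the asymmetry explicit: N acquires against one release leaves a depth of N-1, with no error anywhere.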
Surfaced in
`SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale`,
which fails consistently in isolation on main (0/10 runs pass: a
regression, not a flake) and passes 3/3 with this fix. The full
Postgres LeaderElection compliance suite (12 tests) is green with
this change.
Both `AdvisoryLock.TryAttainLockAsync` (Postgres) and
`SqlServerAdvisoryLock.TryAttainLockAsync` get the same idempotent
short-circuit: if `_locks` already contains the id and `HasLock`
confirms the held connection is still alive, return true without
calling `pg_try_advisory_lock` / `sp_getapplock` again. The `HasLock`
liveness ping keeps the GH-2602 split-brain detection intact — if the
lock was lost server-side, `HasLock` clears `_locks` and the next
`TryAttainLockAsync` will actually re-attain.
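The short-circuit described above can be sketched as a self-contained Python model (not the shipped C#; `FakeLockServer`, `try_attain`, and `is_held` are hypothetical stand-ins for the real connection, `TryAttainLockAsync`, and the `HasLock` ping):

```python
class FakeLockServer:
    """Stand-in for the database: session-level advisory locks
    stack one count per successful acquire."""
    def __init__(self):
        self.counts = {}

    def try_acquire(self, lock_id):
        self.counts[lock_id] = self.counts.get(lock_id, 0) + 1
        return True

    def release(self, lock_id):
        if self.counts.get(lock_id, 0) > 0:
            self.counts[lock_id] -= 1

    def is_held(self, lock_id):
        return self.counts.get(lock_id, 0) > 0


class IdempotentAdvisoryLock:
    """Model of the fixed behavior: if the client already recorded
    the id AND a liveness ping confirms the lock is still held
    server-side, return True without acquiring again."""
    def __init__(self, server):
        self.server = server
        self._locks = set()  # client-side record of held ids

    def try_attain(self, lock_id):
        if lock_id in self._locks:
            if self.server.is_held(lock_id):  # liveness ping (HasLock)
                return True                   # short-circuit: no re-stack
            # Lock was lost server-side: clear the stale record and
            # fall through to genuinely re-attain (split-brain path).
            self._locks.discard(lock_id)
        if self.server.try_acquire(lock_id):
            self._locks.add(lock_id)
            return True
        return False

    def release(self, lock_id):
        self.server.release(lock_id)
        self._locks.discard(lock_id)


server = FakeLockServer()
lock = IdempotentAdvisoryLock(server)

for _tick in range(5):       # heartbeat ticks
    lock.try_attain(42)

lock.release(42)             # single release on step-down
print(server.is_held(42))    # False: the count never stacked past one
```

With the short-circuit in place, five heartbeat ticks leave the server-side count at one, so the single release during step-down actually frees the lock and failover can proceed.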
Regression coverage in `Bug_advisory_lock_stacking_blocks_failover`
for both backends. Both tests fail on unpatched main and pass with
the fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

a84d6a262 calls `TryAttainLeadershipLockAsync` on every tick — including ticks where the leader already holds the lock — to refresh the lease for lease-based backends (RavenDb, Cosmos). For advisory-lock backends (Postgres, SQL Server) this stacks the lock by one count per heartbeat. The single `ReleaseLeadershipLockAsync` call during `DisableAgentsAsync` / `stepDownAsync` then only decrements once, leaving the lock held server-side and silently blocking failover — no error logged, just a stalled election.

`SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale` fails consistently in isolation on main (0/10 runs pass; a regression since a84d6a262, confirmed by bisect) and passes 3/3 with this fix. The full Postgres LeaderElection compliance suite (12 tests) is green with this change.

Both `AdvisoryLock.TryAttainLockAsync` (Postgres) and `SqlServerAdvisoryLock.TryAttainLockAsync` get the same idempotent short-circuit: if `_locks` already contains the id and `HasLock` confirms the held connection is still alive, return `true` without calling `pg_try_advisory_lock` / `sp_getapplock` again. The `HasLock` liveness ping keeps the #2602 split-brain detection (Leader split-brain: stale Postgres advisory lock causes two leaders to assign agents concurrently) intact — if the lock was lost server-side, `HasLock` clears `_locks` and the next `TryAttainLockAsync` will actually re-attain.

Test plan

- `Bug_advisory_lock_stacking_blocks_failover` regression tests for both backends — verified to fail on unpatched main (proving they exercise the bug) and pass with the fix
- `Bug_split_brain_advisory_lock_state_divergence` (#2602 regression) still passes — the `HasLock` liveness ping is still exercised by the next `TryAttainLockAsync` after a backend kill
- `Bug_2518_concurrent_migration_advisory_lock` still passes
- `SlowTests.SharedMemory.leadership_compliance.take_over_leader_ship_if_leader_becomes_stale` — 3/3 pass (was 0/10 on unpatched main)

🤖 Generated with Claude Code