Fix #2602: leader split-brain via stale advisory-lock state #2607
Merged
jeremydmiller merged 1 commit into main on Apr 27, 2026
Conversation
In DurabilityMode.Balanced with the PostgreSQL persistence (and the same
pattern in SQL Server / MySQL / Oracle / SQLite), leader election relied
on a session-level advisory lock. When the holder's underlying database
backend was terminated server-side — network blip, idle-connection cull,
pg_terminate_backend, AlwaysOn / RAC failover, Azure flexserver
maintenance — the database released the lock and another node legitimately
acquired it, but the original leader's in-process AdvisoryLock.HasLock
kept returning true. Two nodes simultaneously believed they were the
leader, both ran EvaluateAssignmentsAsync, both dispatched AssignAgent
commands, and the same agent ended up running twice.
Three layered fixes:
1. Server-side liveness check in AdvisoryLock.HasLock. Each call now
pings the held connection (`SELECT 1`, 2-second timeout). On failure
the in-memory _locks list is cleared and the broken connection is
disposed. Applied across all five Wolverine-owned implementations:
- Wolverine.Postgresql.AdvisoryLock
- Wolverine.MySql.MySqlAdvisoryLock
- Wolverine.Oracle.OracleAdvisoryLock (per-lock connection ping)
- Wolverine.Sqlite.SqliteAdvisoryLock
- Wolverine.SqlServer.SqlServerAdvisoryLock (NEW; Wolverine-owned
replacement for Weasel.SqlServer.AdvisoryLock so we can ship the
fix without a Weasel coordination round-trip — Marten / other
Weasel consumers should land the same fix upstream)
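The Layer 1 liveness check can be sketched roughly as follows. This is a minimal illustration of the pattern described above, not Wolverine's literal code; the member names (`_locks`, `_conn`) and the method shape are assumptions based on the PR description.

```csharp
// Sketch only — illustrates the Layer 1 "ping the held connection" pattern.
public bool HasLock(int lockId)
{
    if (!_locks.Contains(lockId)) return false;

    try
    {
        // Ping the session that holds the lock. If the backend was
        // terminated server-side, this throws — and because advisory
        // locks are session-scoped, the lock is already gone.
        using var cmd = _conn.CreateCommand();
        cmd.CommandText = "SELECT 1";
        cmd.CommandTimeout = 2; // 2-second timeout, per the PR description
        cmd.ExecuteScalar();
        return true;
    }
    catch
    {
        // The session died, so every session-scoped lock it held is gone.
        // Clear the stale in-memory state and dispose the broken connection
        // so the next acquisition attempt starts from a clean slate.
        _locks.Clear();
        _conn.Dispose();
        _conn = null;
        return false;
    }
}
```

The key point is that the in-memory `_locks` list is only trusted after a successful round-trip to the server, so `HasLock` can no longer return a stale `true`.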
2. Heartbeat step-down + re-election. NodeAgentController.HeartBeat now
detects "I was the leader last tick, but the lock is gone now" and
calls a new stepDownAsync(reason): clears IsLeader, stops the local
LeaderUri agent so this node stops dispatching assignments, best-
effort releases the persistence-layer lock, notifies the observer
(new IWolverineObserver.LostLeadership() with a default no-op so
third-party observers are unaffected), and falls through to the
normal TryAttainLeadershipLockAsync path so the same tick triggers
a fresh leadership election. Logs a NodeRecordType.LeadershipLost
record.
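The Layer 2 heartbeat flow described above looks roughly like this. It is a sketch assembled from the PR description; the persistence and agent-control method names here are assumptions, not Wolverine's exact signatures.

```csharp
// Sketch of the Layer 2 heartbeat step-down flow (method names assumed).
public async Task HeartBeatAsync()
{
    if (_wasLeaderLastTick && !_advisoryLock.HasLock(LeadershipLockId))
    {
        // "I was the leader last tick, but the lock is gone now."
        await stepDownAsync("Leadership advisory lock lost");
    }

    // Fall through to the normal election path, so the same tick that
    // detected the loss also triggers a fresh leadership election.
    await TryAttainLeadershipLockAsync();
}

private async Task stepDownAsync(string reason)
{
    IsLeader = false;

    // Stop the local leader agent so this node stops dispatching assignments.
    await StopAgentAsync(LeaderUri);

    // Best-effort release of the persistence-layer lock; it may already be gone.
    try { await _persistence.ReleaseLeadershipLockAsync(); } catch { }

    _observer.LostLeadership(); // default no-op for third-party observers
    await WriteNodeRecordAsync(NodeRecordType.LeadershipLost, reason);
}
```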
3. AssignmentGrid duplicate detection + heal. AssignmentGrid.Node.Running
used to silently overwrite the dictionary when two nodes both reported
the same agent in their ActiveAgents — the smoking gun for split-brain
that FindDelta could never see. The grid now records each duplicate
in DuplicateAgentReports, and EvaluateAssignmentsAsync emits a
StopRemoteAgent for the older copy on each detected duplicate so the
split-brain residue self-heals on the next leadership tick even if
Layers 1 and 2 had a hole.
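The Layer 3 idea in sketch form — record the collision instead of silently overwriting, then let the evaluator stop the older copy. The types and collection shapes here are hypothetical stand-ins for the real `AssignmentGrid` internals.

```csharp
// Sketch of Layer 3 duplicate detection (hypothetical shapes, not the real API).
public void Running(Uri agentUri)
{
    if (_parent._agents.TryGetValue(agentUri, out var existing))
    {
        // Two nodes both reported the same agent in ActiveAgents —
        // split-brain residue that a plain overwrite would have hidden.
        _parent.DuplicateAgentReports.Add(
            new DuplicateAgentReport(agentUri, OlderNode: existing, NewerNode: this));
    }

    _parent._agents[agentUri] = this; // newest report wins the dictionary slot
}

// During EvaluateAssignmentsAsync, the leader heals each duplicate by
// stopping the older copy, so only one instance survives the tick:
foreach (var duplicate in grid.DuplicateAgentReports)
{
    commands.Add(new StopRemoteAgent(duplicate.AgentUri, duplicate.OlderNode));
}
```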
Tests:
- PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence
(the reporter's exact reproducer, dropped in verbatim) — fails on
main, passes with Layer 1.
- CoreTests.Runtime.Agents.duplicate_agent_split_brain_detection — 3
tests covering Layer 3.
- Full CoreTests: 1367/1367 pass.
- Postgres agent / leader / advisory-lock tests: 21/21.
- SqlServer agent / leader / advisory-lock tests: 17/17.
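For orientation, the split-brain reproducer follows this general shape: hold the advisory lock on one connection, terminate that backend from a second connection, and assert the stale state is detected. This is a hypothetical sketch of the scenario, not the verbatim test; the `AdvisoryLock` constructor and `TryAttainLockAsync` call are assumed shapes.

```csharp
// Hypothetical sketch of the reproducer scenario (not the verbatim test).
[Fact]
public async Task stale_lock_state_is_detected_after_backend_termination()
{
    await using var holder = new NpgsqlConnection(ConnectionString);
    await holder.OpenAsync();

    // Capture the holder's backend pid so we can kill exactly that session.
    await using (var pidCmd = holder.CreateCommand())
    {
        pidCmd.CommandText = "select pg_backend_pid()";
        _holderPid = (int)(await pidCmd.ExecuteScalarAsync())!;
    }

    var advisoryLock = new AdvisoryLock(holder, _logger); // assumed shape
    (await advisoryLock.TryAttainLockAsync(12000)).ShouldBeTrue();

    // Simulate the server-side kill (network blip, idle cull, failover...).
    await using var killer = new NpgsqlConnection(ConnectionString);
    await killer.OpenAsync();
    await using var killCmd = killer.CreateCommand();
    killCmd.CommandText = "select pg_terminate_backend(@pid)";
    killCmd.Parameters.AddWithValue("pid", _holderPid);
    await killCmd.ExecuteNonQueryAsync();

    // On main this still returned true; with Layer 1 the SELECT 1 ping
    // fails, the stale in-memory state is cleared, and HasLock is false.
    advisoryLock.HasLock(12000).ShouldBeFalse();
}
```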
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremydmiller added a commit that referenced this pull request on Apr 27, 2026:
Release v5.33.0 includes:
- Fix #2602: leader split-brain via stale Postgres advisory lock (#2607)
- Port Polecat 2.x event store integration from Marten (#2598)
- Fix #2571: preserve context fields on scheduled-send wrap/unwrap (#2605)
- Add launchSettings.json to sample projects (#2600)
- gRPC: middleware weaving, validate convention, user exception mapping, bidirectional streaming, code-first codegen, new samples (#2565)
- Move non-sticky-handlers guard inside the compile lock (#2556)
- Allow RabbitMQ exchanges to be declared passive (#2574)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #2602.

In DurabilityMode.Balanced with PostgreSQL (and the same pattern in SQL Server / MySQL / Oracle / SQLite), leader election relied on a session-level advisory lock. When the holder's underlying database backend was terminated server-side — network blip, idle-connection cull, pg_terminate_backend, AlwaysOn / RAC failover, Azure flexserver maintenance — the database released the lock and another node legitimately acquired it, but the original leader's in-process AdvisoryLock.HasLock kept returning true. Two nodes simultaneously believed they were the leader, both ran EvaluateAssignmentsAsync, both dispatched AssignAgent commands, and the same agent ended up running on two nodes at once.

The reporter's exact xunit reproducer ships with the PR as PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence.

Three layered fixes

Layer 1 — AdvisoryLock.HasLock pings the held connection
Applied across all five Wolverine-owned implementations: Wolverine.Postgresql.AdvisoryLock, Wolverine.MySql.MySqlAdvisoryLock, Wolverine.Oracle.OracleAdvisoryLock (per-lock connection ping since Oracle holds one connection per lock), Wolverine.Sqlite.SqliteAdvisoryLock, and a new Wolverine.SqlServer.Persistence.SqlServerAdvisoryLock. The new SQL Server class is a Wolverine-owned drop-in replacement for Weasel.SqlServer.AdvisoryLock (Wolverine.SqlServer was previously calling Weasel's class directly), so the fix lands without coordinating a Weasel release — see "Weasel follow-up" below.

Layer 2 — heartbeat step-down + re-election
NodeAgentController.HeartBeat.DoHealthChecksAsync now detects "I was leader last tick, but the lock is gone now" before the existing leader-fast-path check and calls a new stepDownAsync(reason), which:
- clears IsLeader
- stops the local LeaderUri agent so this node stops dispatching AssignAgent / ReassignAgent
- notifies IWolverineObserver.LostLeadership() (default no-op so third-party observers compile unchanged)
- logs NodeRecordType.LeadershipLost (new enum value) for CritterWatch
- falls through to the normal TryAttainLeadershipLockAsync path so a fresh leadership election happens on the same tick — directly answering the reporter's directive that we should "stop the leadership and request a new leadership election".

Layer 3 — AssignmentGrid duplicate detection + heal
AssignmentGrid.Node.Running used to silently run `_parent._agents[agentUri] = agent` even if another node already reported the same agent — the smoking gun for split-brain that FindDelta could never see. The grid now records every duplicate in a DuplicateAgentReports collection, and EvaluateAssignmentsAsync emits a StopRemoteAgent for the older copy. This self-heals split-brain residue on the next leadership tick even if Layer 1 or 2 had a hole.

Weasel follow-up (your call when you're ready)
Weasel.SqlServer.AdvisoryLock.HasLock (weasel/src/Weasel.SqlServer/AdvisoryLock.cs:24-27) has the exact same buggy pattern as the Wolverine-side classes did before this PR. SQL Server's sp_getapplock is session-scoped just like PG's advisory locks, so the same split-brain hits any Weasel.SqlServer consumer (Marten, etc.). Wolverine sidesteps it via the new override, but the upstream class should get the same SELECT 1 ping treatment. Happy to open a Weasel PR mirroring this fix when you say go.

Weasel.Postgresql.AdvisoryLock is a different (Medallion-based) implementation with its own HandleLostToken mechanism — it already handles this case via LockMonitoringEnabled. Not affected.

Test plan
- PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence — reporter's exact reproducer; fails on main, passes with Layer 1.
- CoreTests.Runtime.Agents.duplicate_agent_split_brain_detection — 3 tests covering Layer 3.
- dotnet test src/Testing/CoreTests --framework net9.0 — 1367/1367 pass.
- dotnet test src/Persistence/PostgresqlTests --filter "Bug_split_brain|Agents|Leader|AdvisoryLock" — 21/21.
- dotnet test src/Persistence/SqlServerTests --filter "Agent|Leader|AdvisoryLock" — 17/17.

CritterWatch
Will open a companion CritterWatch issue covering the new NodeRecordType.LeadershipLost event so the UI can render leadership transitions, including stale-leader step-downs.

🤖 Generated with Claude Code