
Fix #2602: leader split-brain via stale advisory-lock state#2607

Merged
jeremydmiller merged 1 commit into main from fix/2602-leader-split-brain on Apr 27, 2026

Conversation

@jeremydmiller
Member

Summary

Closes #2602.

In DurabilityMode.Balanced with PostgreSQL (and the same pattern in SQL Server / MySQL / Oracle / SQLite), leader election relied on a session-level advisory lock. When the holder's underlying database backend was terminated server-side — a network blip, an idle-connection cull, pg_terminate_backend, an AlwaysOn / RAC failover, Azure Flexible Server maintenance — the database released the lock and another node legitimately acquired it, but the original leader's in-process AdvisoryLock.HasLock kept returning true. Two nodes then simultaneously believed they were the leader: both ran EvaluateAssignmentsAsync, both dispatched AssignAgent commands, and the same agent ended up running on two nodes at once.

The reporter's exact xUnit reproducer ships with the PR as PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence.

Three layered fixes

Layer 1 — AdvisoryLock.HasLock pings the held connection

public bool HasLock(int lockId)
{
    if (_conn is null) return false;
    if (!_locks.Contains(lockId)) return false;

    try
    {
        using var cmd = _conn.CreateCommand();
        cmd.CommandText = "select 1";
        cmd.CommandTimeout = 2;
        cmd.ExecuteScalar();
        return true;
    }
    catch (Exception e)
    {
        _logger.LogWarning(e,
            "Lost advisory-lock connection for database {Database}; clearing held lock ids {Locks}",
            _databaseName, _locks);
        _locks.Clear();
        try { _conn.Dispose(); } catch { /* already broken */ }
        _conn = null;
        return false;
    }
}

Applied across all five Wolverine-owned implementations: Wolverine.Postgresql.AdvisoryLock, Wolverine.MySql.MySqlAdvisoryLock, Wolverine.Oracle.OracleAdvisoryLock (per-lock connection ping since Oracle holds one connection per lock), Wolverine.Sqlite.SqliteAdvisoryLock, and a new Wolverine.SqlServer.Persistence.SqlServerAdvisoryLock. The new SQL Server class is a Wolverine-owned drop-in replacement for Weasel.SqlServer.AdvisoryLock (Wolverine.SqlServer was previously calling Weasel's class directly), so the fix lands without coordinating a Weasel release — see "Weasel follow-up" below.
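The Oracle variant differs only in bookkeeping: since OracleAdvisoryLock holds one connection per lock, the liveness ping targets the specific connection that owns the queried lock id, and a failed ping invalidates only that lock. A minimal sketch of that shape (the class, Track, and _connections are illustrative names, not the actual Wolverine.Oracle source):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical sketch of the per-lock-connection variant: one connection
// per lock id, and HasLock pings the connection owning that specific lock.
public class PerLockAdvisoryLockSketch
{
    private readonly Dictionary<int, IDbConnection> _connections = new();

    // Register the connection that acquired a given lock id
    public void Track(int lockId, IDbConnection conn) => _connections[lockId] = conn;

    public bool HasLock(int lockId)
    {
        if (!_connections.TryGetValue(lockId, out var conn)) return false;

        try
        {
            // Ping the connection that holds *this* lock, not a shared one
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "select 1 from dual"; // Oracle requires FROM DUAL
            cmd.CommandTimeout = 2;
            cmd.ExecuteScalar();
            return true;
        }
        catch (Exception)
        {
            // The backend session died, so the server-side lock is gone;
            // drop only this lock's connection and report the lock as lost
            try { conn.Dispose(); } catch { /* already broken */ }
            _connections.Remove(lockId);
            return false;
        }
    }
}
```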

Layer 2 — heartbeat step-down + re-election

NodeAgentController.HeartBeat.DoHealthChecksAsync now detects "I was leader last tick, but the lock is gone now" before the existing leader-fast-path check and calls a new stepDownAsync(reason):

  • Clears IsLeader
  • Stops the local LeaderUri agent so this node stops dispatching AssignAgent / ReassignAgent
  • Best-effort releases the persistence-layer lock
  • Notifies the observer via a new IWolverineObserver.LostLeadership() (default no-op so third-party observers compile unchanged)
  • Logs a NodeRecordType.LeadershipLost (new enum value) for CritterWatch
  • Falls through to the existing TryAttainLeadershipLockAsync path so a fresh leadership election happens on the same tick — directly answering the reporter's directive that we should "stop the leadership and request a new leadership election".
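Condensed, the step-down sequence above looks roughly like this. Only IWolverineObserver.LostLeadership() and TryAttainLeadershipLockAsync come from this PR; StepDownSketch and its injected delegates are hypothetical stand-ins for NodeAgentController internals, shown only to make the ordering concrete:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of the Layer 2 step-down ordering. The delegates stand
// in for the real agent-stop, lock-release, and observer-notify internals.
public class StepDownSketch
{
    public bool IsLeader { get; private set; } = true;

    private readonly Func<Task> _stopLeaderAgent;
    private readonly Func<Task> _releaseLock;
    private readonly Action _notifyLostLeadership;

    public StepDownSketch(Func<Task> stopLeaderAgent, Func<Task> releaseLock,
        Action notifyLostLeadership)
    {
        _stopLeaderAgent = stopLeaderAgent;
        _releaseLock = releaseLock;
        _notifyLostLeadership = notifyLostLeadership;
    }

    public async Task StepDownAsync(string reason)
    {
        IsLeader = false;                      // stop acting as leader first

        await _stopLeaderAgent();              // halt AssignAgent / ReassignAgent
                                               // dispatching from this node

        try { await _releaseLock(); }          // best-effort: the server-side
        catch { /* likely already gone */ }    // lock may already be released

        _notifyLostLeadership();               // IWolverineObserver.LostLeadership()

        Console.WriteLine($"Stepped down as leader: {reason}");
        // the caller then falls through to TryAttainLeadershipLockAsync so a
        // fresh election happens on the same heartbeat tick
    }
}
```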

Layer 3 — AssignmentGrid duplicate detection + heal

AssignmentGrid.Node.Running used to silently execute `_parent._agents[agentUri] = agent` even when another node had already reported the same agent — the smoking gun for split-brain that FindDelta could never see. The grid now records every duplicate in a DuplicateAgentReports collection, and EvaluateAssignmentsAsync emits a StopRemoteAgent for the older copy. This self-heals split-brain residue on the next leadership tick even if Layer 1 or Layer 2 had a hole.
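A minimal sketch of the duplicate bookkeeping described above. GridSketch and its tuple shape are illustrative, not the actual AssignmentGrid source, and the real grid additionally decides which copy is older before emitting StopRemoteAgent:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: instead of silently overwriting the agent -> node
// entry, a second report of the same agent is recorded as a duplicate.
public class GridSketch
{
    private readonly Dictionary<Uri, Uri> _agents = new(); // agent -> owning node

    public List<(Uri Agent, Uri FirstNode, Uri DuplicateNode)>
        DuplicateAgentReports { get; } = new();

    public void Running(Uri nodeUri, Uri agentUri)
    {
        if (_agents.TryGetValue(agentUri, out var existingNode))
        {
            // Previously this path silently overwrote the entry; recording it
            // lets the leader emit a StopRemoteAgent for the stale copy
            DuplicateAgentReports.Add((agentUri, existingNode, nodeUri));
            return;
        }

        _agents[agentUri] = nodeUri;
    }
}
```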

Weasel follow-up (your call when you're ready)

Weasel.SqlServer.AdvisoryLock.HasLock (weasel/src/Weasel.SqlServer/AdvisoryLock.cs:24-27) has the exact same buggy pattern that the Wolverine-side classes had before this PR. SQL Server's sp_getapplock is session-scoped just like PG's advisory locks, so the same split-brain can hit any Weasel.SqlServer consumer (Marten, etc.). Wolverine sidesteps it via the new override, but the upstream class should get the same SELECT 1 ping treatment. Happy to open a Weasel PR mirroring this fix when you say go.

Weasel.Postgresql.AdvisoryLock is a different (Medallion-based) implementation with its own HandleLostToken mechanism — already handles this case via LockMonitoringEnabled. Not affected.

Test plan

  • PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence — reporter's exact reproducer; fails on main, passes with Layer 1.
  • CoreTests.Runtime.Agents.duplicate_agent_split_brain_detection — 3 tests covering Layer 3.
  • dotnet test src/Testing/CoreTests --framework net9.0 — 1367/1367 pass.
  • dotnet test src/Persistence/PostgresqlTests --filter "Bug_split_brain|Agents|Leader|AdvisoryLock" — 21/21.
  • dotnet test src/Persistence/SqlServerTests --filter "Agent|Leader|AdvisoryLock" — 17/17.

CritterWatch

Will open a companion CritterWatch issue covering the new NodeRecordType.LeadershipLost event so the UI can render leadership transitions including stale-leader step-downs.

🤖 Generated with Claude Code

@jeremydmiller jeremydmiller merged commit 3708f4a into main Apr 27, 2026
19 of 22 checks passed
jeremydmiller added a commit that referenced this pull request Apr 27, 2026
Release v5.33.0 includes:
- Fix #2602: leader split-brain via stale Postgres advisory lock (#2607)
- Port Polecat 2.x event store integration from Marten (#2598)
- Fix #2571: preserve context fields on scheduled-send wrap/unwrap (#2605)
- Add launchSettings.json to sample projects (#2600)
- gRPC: middleware weaving, validate convention, user exception mapping,
  bidirectional streaming, code-first codegen, new samples (#2565)
- Move non-sticky-handlers guard inside the compile lock (#2556)
- Allow RabbitMQ exchanges to be declared passive (#2574)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Leader split-brain: stale Postgres advisory lock causes two leaders to assign agents concurrently
