
Fix #2602: leader split-brain via stale advisory-lock state#2607

Merged
jeremydmiller merged 1 commit into main from fix/2602-leader-split-brain on Apr 27, 2026

Conversation

@jeremydmiller
Member

Summary

Closes #2602.

In DurabilityMode.Balanced with PostgreSQL (and the same pattern in SQL Server / MySQL / Oracle / SQLite), leader election relied on a session-level advisory lock. When the holder's underlying database backend was terminated server-side — a network blip, an idle-connection cull, pg_terminate_backend, an AlwaysOn / RAC failover, Azure Flexible Server maintenance — the database released the lock and another node legitimately acquired it, but the original leader's in-process AdvisoryLock.HasLock kept returning true. Two nodes then simultaneously believed they were the leader: both ran EvaluateAssignmentsAsync, both dispatched AssignAgent commands, and the same agent ended up running on two nodes at once.

The reporter's exact xUnit reproducer ships with the PR as PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence.

Three layered fixes

Layer 1 — AdvisoryLock.HasLock pings the held connection

public bool HasLock(int lockId)
{
    if (_conn is null) return false;
    if (!_locks.Contains(lockId)) return false;

    try
    {
        using var cmd = _conn.CreateCommand();
        cmd.CommandText = "select 1";
        cmd.CommandTimeout = 2;
        cmd.ExecuteScalar();
        return true;
    }
    catch (Exception e)
    {
        _logger.LogWarning(e,
            "Lost advisory-lock connection for database {Database}; clearing held lock ids {Locks}",
            _databaseName, _locks);
        _locks.Clear();
        try { _conn.Dispose(); } catch { /* already broken */ }
        _conn = null;
        return false;
    }
}

Applied across all five Wolverine-owned implementations: Wolverine.Postgresql.AdvisoryLock, Wolverine.MySql.MySqlAdvisoryLock, Wolverine.Oracle.OracleAdvisoryLock (per-lock connection ping since Oracle holds one connection per lock), Wolverine.Sqlite.SqliteAdvisoryLock, and a new Wolverine.SqlServer.Persistence.SqlServerAdvisoryLock. The new SQL Server class is a Wolverine-owned drop-in replacement for Weasel.SqlServer.AdvisoryLock (Wolverine.SqlServer was previously calling Weasel's class directly), so the fix lands without coordinating a Weasel release — see "Weasel follow-up" below.
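The Oracle variant differs only in bookkeeping: since OracleAdvisoryLock holds one connection per lock, the liveness ping targets the specific connection that owns the queried lock id, and a failed ping invalidates only that lock. A minimal sketch of that shape (the class, Track, and _connections are illustrative names, not the actual Wolverine.Oracle source):

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical sketch of the per-lock-connection variant: one connection
// per lock id, and HasLock pings the connection owning that specific lock.
public class PerLockAdvisoryLockSketch
{
    private readonly Dictionary<int, IDbConnection> _connections = new();

    // Register the connection that acquired a given lock id
    public void Track(int lockId, IDbConnection conn) => _connections[lockId] = conn;

    public bool HasLock(int lockId)
    {
        if (!_connections.TryGetValue(lockId, out var conn)) return false;

        try
        {
            // Ping the connection that holds *this* lock, not a shared one
            using var cmd = conn.CreateCommand();
            cmd.CommandText = "select 1 from dual"; // Oracle requires FROM DUAL
            cmd.CommandTimeout = 2;
            cmd.ExecuteScalar();
            return true;
        }
        catch (Exception)
        {
            // The backend session died, so the server-side lock is gone;
            // drop only this lock's connection and report the lock as lost
            try { conn.Dispose(); } catch { /* already broken */ }
            _connections.Remove(lockId);
            return false;
        }
    }
}
```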

Layer 2 — heartbeat step-down + re-election

NodeAgentController.HeartBeat.DoHealthChecksAsync now detects "I was leader last tick, but the lock is gone now" before the existing leader-fast-path check and calls a new stepDownAsync(reason):

  • Clears IsLeader
  • Stops the local LeaderUri agent so this node stops dispatching AssignAgent / ReassignAgent
  • Best-effort releases the persistence-layer lock
  • Notifies the observer via a new IWolverineObserver.LostLeadership() (default no-op so third-party observers compile unchanged)
  • Logs a NodeRecordType.LeadershipLost (new enum value) for CritterWatch
  • Falls through to the existing TryAttainLeadershipLockAsync path so a fresh leadership election happens on the same tick — directly answering the reporter's directive that we should "stop the leadership and request a new leadership election".
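Condensed, the step-down sequence above looks roughly like this. Only IWolverineObserver.LostLeadership() and TryAttainLeadershipLockAsync come from this PR; StepDownSketch and its injected delegates are hypothetical stand-ins for NodeAgentController internals, shown only to make the ordering concrete:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of the Layer 2 step-down ordering. The delegates stand
// in for the real agent-stop, lock-release, and observer-notify internals.
public class StepDownSketch
{
    public bool IsLeader { get; private set; } = true;

    private readonly Func<Task> _stopLeaderAgent;
    private readonly Func<Task> _releaseLock;
    private readonly Action _notifyLostLeadership;

    public StepDownSketch(Func<Task> stopLeaderAgent, Func<Task> releaseLock,
        Action notifyLostLeadership)
    {
        _stopLeaderAgent = stopLeaderAgent;
        _releaseLock = releaseLock;
        _notifyLostLeadership = notifyLostLeadership;
    }

    public async Task StepDownAsync(string reason)
    {
        IsLeader = false;                      // stop acting as leader first

        await _stopLeaderAgent();              // halt AssignAgent / ReassignAgent
                                               // dispatching from this node

        try { await _releaseLock(); }          // best-effort: the server-side
        catch { /* likely already gone */ }    // lock may already be released

        _notifyLostLeadership();               // IWolverineObserver.LostLeadership()

        Console.WriteLine($"Stepped down as leader: {reason}");
        // the caller then falls through to TryAttainLeadershipLockAsync so a
        // fresh election happens on the same heartbeat tick
    }
}
```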

Layer 3 — AssignmentGrid duplicate detection + heal

AssignmentGrid.Node.Running used to silently execute `_parent._agents[agentUri] = agent` even when another node had already reported the same agent — the smoking gun for split-brain that FindDelta could never see. The grid now records every duplicate in a DuplicateAgentReports collection, and EvaluateAssignmentsAsync emits a StopRemoteAgent for the older copy. This self-heals split-brain residue on the next leadership tick even if Layer 1 or Layer 2 had a hole.
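A minimal sketch of the duplicate bookkeeping described above. GridSketch and its tuple shape are illustrative, not the actual AssignmentGrid source, and the real grid additionally decides which copy is older before emitting StopRemoteAgent:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: instead of silently overwriting the agent -> node
// entry, a second report of the same agent is recorded as a duplicate.
public class GridSketch
{
    private readonly Dictionary<Uri, Uri> _agents = new(); // agent -> owning node

    public List<(Uri Agent, Uri FirstNode, Uri DuplicateNode)>
        DuplicateAgentReports { get; } = new();

    public void Running(Uri nodeUri, Uri agentUri)
    {
        if (_agents.TryGetValue(agentUri, out var existingNode))
        {
            // Previously this path silently overwrote the entry; recording it
            // lets the leader emit a StopRemoteAgent for the stale copy
            DuplicateAgentReports.Add((agentUri, existingNode, nodeUri));
            return;
        }

        _agents[agentUri] = nodeUri;
    }
}
```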

Weasel follow-up (your call when you're ready)

Weasel.SqlServer.AdvisoryLock.HasLock (weasel/src/Weasel.SqlServer/AdvisoryLock.cs:24-27) has the exact same buggy pattern that the Wolverine-side classes had before this PR. SQL Server's sp_getapplock is session-scoped just like PG's advisory locks, so the same split-brain can hit any Weasel.SqlServer consumer (Marten, etc.). Wolverine sidesteps it via the new override, but the upstream class should get the same SELECT 1 ping treatment. Happy to open a Weasel PR mirroring this fix when you say go.

Weasel.Postgresql.AdvisoryLock is a different (Medallion-based) implementation with its own HandleLostToken mechanism — already handles this case via LockMonitoringEnabled. Not affected.

Test plan

  • PostgresqlTests.Bugs.Bug_split_brain_advisory_lock_state_divergence — reporter's exact reproducer; fails on main, passes with Layer 1.
  • CoreTests.Runtime.Agents.duplicate_agent_split_brain_detection — 3 tests covering Layer 3.
  • dotnet test src/Testing/CoreTests --framework net9.0 — 1367/1367 pass.
  • dotnet test src/Persistence/PostgresqlTests --filter "Bug_split_brain|Agents|Leader|AdvisoryLock" — 21/21.
  • dotnet test src/Persistence/SqlServerTests --filter "Agent|Leader|AdvisoryLock" — 17/17.

CritterWatch

Will open a companion CritterWatch issue covering the new NodeRecordType.LeadershipLost event so the UI can render leadership transitions including stale-leader step-downs.

🤖 Generated with Claude Code

@jeremydmiller jeremydmiller merged commit 3708f4a into main Apr 27, 2026
19 of 22 checks passed
jeremydmiller added a commit that referenced this pull request Apr 27, 2026
Release v5.33.0 includes:
- Fix #2602: leader split-brain via stale Postgres advisory lock (#2607)
- Port Polecat 2.x event store integration from Marten (#2598)
- Fix #2571: preserve context fields on scheduled-send wrap/unwrap (#2605)
- Add launchSettings.json to sample projects (#2600)
- gRPC: middleware weaving, validate convention, user exception mapping,
  bidirectional streaming, code-first codegen, new samples (#2565)
- Move non-sticky-handlers guard inside the compile lock (#2556)
- Allow RabbitMQ exchanges to be declared passive (#2574)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Leader split-brain: stale Postgres advisory lock causes two leaders to assign agents concurrently
