
Prevent leadership re-election every 5 minutes #2625

Merged
jeremydmiller merged 1 commit into JasperFx:main from Bishbulb:fix/heartbeat-renews-leadership-lease
May 1, 2026

Conversation

@Bishbulb
Contributor

Overview

On lease-based persistence backends (RavenDb, Cosmos) a single-node deployment fires a spurious LostLeadership / AssumedLeadership pair every 5 minutes.

The heartbeat logic in NodeAgentController.HeartBeat.cs returned early when HasLeadershipLock() was true and never called TryAttainLeadershipLockAsync, so the existing renewal branch in TryAttainLeadershipLockAsync (the path taken when _leaderLock != null) was unreachable. The in-memory lease aged out at the 5-minute mark, HasLeadershipLock() flipped to false on its next read, and the leader stepped down, only to immediately re-elect itself in the same heartbeat tick.
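To make the failure mode concrete, here is a minimal Python model of the behavior described above. It is not Wolverine code: `LeaseStore`, `buggy_heartbeat`, `simulate`, and the 5-second heartbeat interval are hypothetical names and values chosen for illustration. Only the 5-minute lease and the event names come from the description.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 300     # 5-minute lease, per the description above
HEARTBEAT_SECONDS = 5   # hypothetical heartbeat interval


@dataclass
class LeaseStore:
    """Hypothetical stand-in for a lease-based backend with a single node."""
    expires_at: Optional[float] = None

    def has_lock(self, now):
        return self.expires_at is not None and now < self.expires_at

    def try_attain(self, now):
        # Acquires a fresh lease, or renews one that is already held.
        self.expires_at = now + LEASE_SECONDS
        return True


def buggy_heartbeat(store, state, now, events):
    if state["is_leader"] and store.has_lock(now):
        return  # early return: try_attain (and so renewal) never runs
    if state["is_leader"]:
        # The lease aged out, so the node steps down...
        state["is_leader"] = False
        events.append((now, "LostLeadership"))
    if store.try_attain(now):
        # ...and re-elects itself in the same tick.
        state["is_leader"] = True
        events.append((now, "AssumedLeadership"))


def simulate(heartbeat, duration):
    store, state, events = LeaseStore(), {"is_leader": False}, []
    for now in range(0, duration + 1, HEARTBEAT_SECONDS):
        heartbeat(store, state, now, events)
    return events
```

Running `simulate(buggy_heartbeat, 900)` in this model yields one initial AssumedLeadership event, followed by a LostLeadership / AssumedLeadership pair at each 300-second boundary.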

From what I could see, the SQL persistence backends didn't expose this because the lock is bound to the open connection, so there is no lease to renew.

Fix

Always go through TryAttainLeadershipLockAsync, and then branch on IsLeader afterwards:

  • If IsLeader == true, we go through the existing renewal branch and refresh the lease
  • If IsLeader == false, we go through a normal election attempt
  • If we fail to renew, we explicitly step down
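The three branches above can be sketched with a small Python model. This illustrates the control flow under stated assumptions and is not the actual change: `LeaseStore`, its `available` flag, `fixed_heartbeat`, `simulate`, and the 5-second heartbeat interval are all hypothetical.

```python
from typing import Optional

LEASE_SECONDS = 300     # 5-minute lease, per the overview
HEARTBEAT_SECONDS = 5   # hypothetical heartbeat interval


class LeaseStore:
    """Hypothetical stand-in for a lease-based backend with a single node."""

    def __init__(self):
        self.expires_at: Optional[float] = None
        self.available = True  # set to False to simulate a failed renewal

    def try_attain(self, now):
        """Acquire a new lease or renew the held one; return False on failure."""
        if not self.available:
            return False
        self.expires_at = now + LEASE_SECONDS
        return True


def fixed_heartbeat(store, state, now, events):
    was_leader = state["is_leader"]
    attained = store.try_attain(now)  # always called, so a held lease is renewed
    if was_leader and not attained:
        # Renewal failed: step down explicitly.
        state["is_leader"] = False
        events.append((now, "LostLeadership"))
    elif not was_leader and attained:
        # Normal election attempt succeeded.
        state["is_leader"] = True
        events.append((now, "AssumedLeadership"))
    # was_leader and attained: lease refreshed, no event raised.


def simulate(duration):
    store, state, events = LeaseStore(), {"is_leader": False}, []
    for now in range(0, duration + 1, HEARTBEAT_SECONDS):
        fixed_heartbeat(store, state, now, events)
    return store, state, events
```

In this model, 15 minutes of heartbeats produce a single AssumedLeadership event at startup, and a LostLeadership event appears only when `try_attain` fails for a node that was already leader.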

@Bishbulb
Contributor Author

Bishbulb commented Apr 29, 2026

@jeremydmiller Let me know if you have questions about this. I was looking into a way to do this specifically inside the Wolverine.RavenDb persistence layer, but it was going to require much broader changes. The root cause here is that the Raven (and likely Cosmos) heartbeats were triggering a spurious leader re-election every 5 minutes. The visible side effects were agent reassignment churn and NodeRecord noise, but I didn't trace it to any specific failures or bugs.

I would put this as a lower priority item.

@jeremydmiller
Member

@Bishbulb I'm deeply unenthusiastic about any changes to these internals. This one is going to be slower because I wouldn't trust AI on this.

@Bishbulb
Contributor Author

@jeremydmiller Completely understood. Like I said, this mostly resulted in awkward noise in the logs as they were indicating a new leader was being elected every 5 mins.

@jeremydmiller
Member

Let's get this in the next go around. As it turned out, we're kicking out yet another one today because there was something I wanted for CritterWatch.


2 participants