Prevent leadership re-election every 5 minutes #2625
jeremydmiller merged 1 commit into JasperFx:main
Conversation
@jeremydmiller Let me know if you have questions about this. I was looking into a way to do this specifically inside the Wolverine.RavenDb persistence layer, but it was going to require much broader changes. The root cause of the issue here is that the Raven (and likely Cosmos) heartbeats were causing a leader re-election every 5 minutes which seemed spurious. The visible side effects were agent reassignment churn and NodeRecord noise, but I didn't trace it to any specific failures/bugs. I would put this as a lower priority item.
@Bishbulb I'm deeply unenthusiastic about any changes to these internals. This one is going to be slower because I wouldn't trust AI on this.
@jeremydmiller Completely understood. Like I said, this mostly resulted in awkward noise in the logs as they were indicating a new leader was being elected every 5 mins.
Let's get this in the next go around. As it turned out, we're kicking out yet another one today because there was something I wanted for CritterWatch.
Overview
On lease-based persistence backends (RavenDb, Cosmos) a single-node deployment fires a spurious `LostLeadership`/`AssumedLeadership` pair every 5 minutes. `NodeAgentController.HeartBeat.cs` returned early when `HasLeadershipLock()` was true and never called `TryAttainLeadershipLockAsync`, so the existing renewal branch in `TryAttain` (the path taken when `_leaderLock != null`) was unreachable. The in-memory lease aged out at the 5-minute mark, `HasLeadershipLock` flipped to false on its next read, and the leader would step down, only to immediately re-elect itself in the same heartbeat tick.

From what I could see, the SQL persistence backend didn't expose this because the lock is bound to the open connection, and thus there is no lease to renew.
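To make the failure mode concrete, here is a minimal toy model of the old heartbeat flow. It is a language-neutral sketch in Python, not Wolverine code: `ToyNode`, `heartbeat_old`, `simulate_old`, and the 5-second heartbeat interval are all hypothetical names and values chosen for illustration. Only the 5-minute lease and the early-return behaviour come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LEASE_SECONDS = 300     # 5-minute lease, as described above
HEARTBEAT_SECONDS = 5   # assumed heartbeat interval (hypothetical value)


@dataclass
class ToyNode:
    """Single node holding an in-memory lease (illustrative model only)."""
    lease_expires_at: Optional[int] = None   # None means no lease is held
    is_leader: bool = False
    events: List[Tuple[int, str]] = field(default_factory=list)

    def has_leadership_lock(self, now: int) -> bool:
        return self.lease_expires_at is not None and now < self.lease_expires_at

    def try_attain_leadership_lock(self, now: int) -> bool:
        # Single-node model: acquiring (or renewing) the lease always succeeds.
        self.lease_expires_at = now + LEASE_SECONDS
        return True

    def heartbeat_old(self, now: int) -> None:
        if self.has_leadership_lock(now):
            return  # early return: the lease is never renewed on this path
        if self.is_leader:
            # The lease has aged out, so the node steps down...
            self.is_leader = False
            self.events.append((now, "LostLeadership"))
        if self.try_attain_leadership_lock(now):
            # ...and re-elects itself within the same tick.
            self.is_leader = True
            self.events.append((now, "AssumedLeadership"))


def simulate_old(duration_seconds: int) -> List[Tuple[int, str]]:
    """Run heartbeats from t=0 up to (but excluding) duration_seconds."""
    node = ToyNode()
    for now in range(0, duration_seconds, HEARTBEAT_SECONDS):
        node.heartbeat_old(now)
    return node.events


if __name__ == "__main__":
    for timestamp, name in simulate_old(900):
        print(timestamp, name)
```

In this model, a 15-minute run records the initial election at t=0 and then a step-down immediately followed by a re-election at t=300 and again at t=600, which matches the log noise described in the conversation.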
Fix
Always go through `TryAttainLeadershipLockAsync`, and then branch on `IsLeader` afterwards:

- `IsLeader == true`: we go through the existing renewal branch and refresh the lease
- `IsLeader == false`: we go through a normal election attempt
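The control flow of the fix can be sketched with a similar toy model. Again, this is a hedged Python illustration and not the actual Wolverine implementation: `ToyNode`, `heartbeat_fixed`, `simulate_fixed`, the `renewals` counter, and the 5-second heartbeat interval are hypothetical. It assumes a single node for which acquiring or renewing the lease always succeeds.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LEASE_SECONDS = 300     # 5-minute lease, as in the overview
HEARTBEAT_SECONDS = 5   # assumed heartbeat interval (hypothetical value)


@dataclass
class ToyNode:
    """Single node holding an in-memory lease (illustrative model only)."""
    lease_expires_at: Optional[int] = None   # None means no lease is held
    is_leader: bool = False
    renewals: int = 0
    events: List[Tuple[int, str]] = field(default_factory=list)

    def try_attain_leadership_lock(self, now: int) -> bool:
        if self.lease_expires_at is not None and now < self.lease_expires_at:
            self.renewals += 1   # renewal branch: a live lease is refreshed
        # Single-node model: acquiring or renewing always succeeds.
        self.lease_expires_at = now + LEASE_SECONDS
        return True

    def heartbeat_fixed(self, now: int) -> None:
        # Always go through the lock attempt first, then branch on leadership.
        holds_lock = self.try_attain_leadership_lock(now)
        if self.is_leader:
            if not holds_lock:
                # Renewal failed: only now does the leader step down.
                self.is_leader = False
                self.events.append((now, "LostLeadership"))
            # Otherwise the lease was refreshed and nothing is announced.
        elif holds_lock:
            # Normal election attempt succeeded.
            self.is_leader = True
            self.events.append((now, "AssumedLeadership"))


def simulate_fixed(duration_seconds: int) -> Tuple[List[Tuple[int, str]], int]:
    """Run heartbeats from t=0 up to (but excluding) duration_seconds."""
    node = ToyNode()
    for now in range(0, duration_seconds, HEARTBEAT_SECONDS):
        node.heartbeat_fixed(now)
    return node.events, node.renewals


if __name__ == "__main__":
    events, renewals = simulate_fixed(900)
    print(events, renewals)
```

Because every heartbeat refreshes the lease before it can age out, the same 15-minute run in this model records a single election at t=0 and no further leadership events.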