fix(agents): never let self fall into the staleNodes filter (GH-2682)#2689
Merged
jeremydmiller merged 1 commit intomainfrom May 7, 2026
Merged
fix(agents): never let self fall into the staleNodes filter (GH-2682)#2689jeremydmiller merged 1 commit intomainfrom
jeremydmiller merged 1 commit intomainfrom
Conversation
Stale snapshot reads (read replica lag, snapshot isolation, GC pause between the heartbeat write and the read, Oracle session-TZ round-trip, an aggressive `StaleNodeTimeout`) used to fold the current node into the staleNodes filter inside `NodeAgentController.DoHealthChecksAsync`. The tick then crashed in `tryStartLeadershipAsync` with NRE on `self!.AssignAgents([LeaderUri])` *after* `IsLeader=true`, the leadership lock was held, `AssumedLeadership` had fired, and the assignment row was written — leaving the cluster half-elected with no agent dispatch and other nodes constantly falling off. We just wrote our own heartbeat in this same tick, so by definition self is not stale. Two complementary defensive changes: 1. `NodeAgentController.HeartBeat.cs`: exclude self from the staleness filter unconditionally; also inject self into `nodes` if the snapshot omitted it entirely (handles read-after-write lag against the upsert above and brand-new-node propagation). 2. `NodeAgentController.EvaluateAssignments.cs`: defense in depth — the pre-existing self-injection guard only fired on an empty `nodes` list. Now it also fires on a non-empty list missing self, for any caller that happens to slip a self-less list through. `ejectStaleNodes` already protected self from DB deletion via the `AssignedNodeNumber` check (GH-1116); this just plugs the in-memory gap that survived that protection. Regression tests in `leader_election_self_visibility_tests.cs` cover both the stale-self and missing-self paths through `DoHealthChecksAsync`, plus the standalone `EvaluateAssignmentsAsync` self-injection branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2682.
Summary
StaleNodeTimeout) used to fold the current node into the staleNodes filter insideNodeAgentController.DoHealthChecksAsync. The tick then crashed intryStartLeadershipAsyncwithNREonself!.AssignAgents([LeaderUri])afterIsLeader=true, the leadership lock was held,AssumedLeadershiphad fired, and the assignment row was written — leaving the cluster half-elected with no agent dispatch and other nodes constantly falling off.NodeAgentController.HeartBeat.cs— exclude self from the staleness filter unconditionally; inject self intonodesif the snapshot omitted it entirely (handles read-after-write lag against the upsert above and brand-new-node propagation).NodeAgentController.EvaluateAssignments.cs(defense in depth) — the pre-existing self-injection guard only fired on an emptynodeslist. Now it also fires on a non-empty list missing self, for any caller that happens to slip a self-less list through.ejectStaleNodesalready protected self from DB deletion via theAssignedNodeNumbercheck (Local dev issue with zero nodes #1116); this just plugs the in-memory gap that survived that protection.Test plan
src/Testing/CoreTests/Runtime/Agents/leader_election_self_visibility_tests.cscovering both the stale-self and missing-self paths throughDoHealthChecksAsync, plus the standaloneEvaluateAssignmentsAsyncself-injection branch — 4 passeddotnet test src/Testing/CoreTests --framework net9.0— 1544 passed, 0 failedSlowTests.SharedMemory.leadership_compliance(run one at a time):the_only_known_node_is_automatically_the_leader,eject_a_stale_node,add_second_node_see_balanced_nodes,spin_up_several_nodes_take_away_non_leader_node— all pass.take_over_leader_ship_if_leader_becomes_stale) reproduces the sameTimeoutExceptionon clean main (verified viagit stash); not introduced by this change.🤖 Generated with Claude Code