fix(agents): never let self fall into the staleNodes filter (GH-2682) #2689

Merged
jeremydmiller merged 1 commit into main from issue-2682-research on May 7, 2026

Conversation

@jeremydmiller
Member

Closes #2682.

Summary

  • Stale snapshot reads (read replica lag, snapshot isolation, a GC pause between the heartbeat write and the read, an Oracle session-TZ round-trip, or an aggressive StaleNodeTimeout) used to fold the current node into the staleNodes filter inside NodeAgentController.DoHealthChecksAsync. The tick then crashed in tryStartLeadershipAsync with a NullReferenceException on self!.AssignAgents([LeaderUri]). By that point IsLeader was already true, the leadership lock was held, AssumedLeadership had fired, and the assignment row had been written, so the cluster was left half-elected, with no agent dispatch and other nodes constantly falling off.
  • We just wrote our own heartbeat in this same tick, so by definition we're not stale. Two complementary defensive changes:
    1. NodeAgentController.HeartBeat.cs — exclude self from the staleness filter unconditionally; inject self into nodes if the snapshot omitted it entirely (handles read-after-write lag against the upsert above and brand-new-node propagation).
    2. NodeAgentController.EvaluateAssignments.cs (defense in depth) — the pre-existing self-injection guard only fired on an empty nodes list. Now it also fires on a non-empty list missing self, for any caller that happens to slip a self-less list through.
  • ejectStaleNodes already protected self from DB deletion via the AssignedNodeNumber check (Local dev issue with zero nodes #1116); this just plugs the in-memory gap that survived that protection.
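
A minimal sketch of the two guards described above, for readers who want the shape of the logic without opening the diff. This is not the code from NodeAgentController: NodeSnapshot, LastHealthCheck, SelfVisibilitySketch, Partition and EnsureSelf are hypothetical names invented for illustration, and the real types and signatures may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a node row read from the snapshot.
public sealed record NodeSnapshot(Guid NodeId, DateTimeOffset LastHealthCheck);

public static class SelfVisibilitySketch
{
    // Guard 1 (heartbeat side): self is never evaluated for staleness,
    // and is always present in the live list even if the snapshot omitted it.
    public static (List<NodeSnapshot> Live, List<NodeSnapshot> Stale) Partition(
        IReadOnlyList<NodeSnapshot> snapshot,
        NodeSnapshot self,
        DateTimeOffset now,
        TimeSpan staleNodeTimeout)
    {
        var cutoff = now - staleNodeTimeout;

        // Only other nodes are candidates for the stale filter.
        var others = snapshot.Where(n => n.NodeId != self.NodeId).ToList();
        var stale = others.Where(n => n.LastHealthCheck < cutoff).ToList();
        var live = others.Where(n => n.LastHealthCheck >= cutoff).ToList();

        // We wrote our own heartbeat in this tick, so self is live by definition.
        live.Add(self);
        return (live, stale);
    }

    // Guard 2 (assignment side): inject self whenever the list lacks it,
    // whether the list is empty or not.
    public static IReadOnlyList<NodeSnapshot> EnsureSelf(
        IReadOnlyList<NodeSnapshot> nodes,
        NodeSnapshot self)
    {
        return nodes.Any(n => n.NodeId == self.NodeId)
            ? nodes
            : nodes.Append(self).ToList();
    }
}
```

In this sketch, a snapshot that reports an outdated heartbeat for self still yields a live list containing self, and a snapshot with no self row at all does the same. Any later lookup of self in that list therefore cannot come back null.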

Test plan

  • New regression tests in src/Testing/CoreTests/Runtime/Agents/leader_election_self_visibility_tests.cs covering both the stale-self and missing-self paths through DoHealthChecksAsync, plus the standalone EvaluateAssignmentsAsync self-injection branch — 4 passed
  • dotnet test src/Testing/CoreTests --framework net9.0: 1544 passed, 0 failed
  • SlowTests.SharedMemory.leadership_compliance (run one at a time): the_only_known_node_is_automatically_the_leader, eject_a_stale_node, add_second_node_see_balanced_nodes, spin_up_several_nodes_take_away_non_leader_node — all pass.
    • One pre-existing flake (take_over_leader_ship_if_leader_becomes_stale) reproduces the same TimeoutException on clean main (verified via git stash); not introduced by this change.

🤖 Generated with Claude Code

Stale snapshot reads (read replica lag, snapshot isolation, GC pause
between the heartbeat write and the read, Oracle session-TZ round-trip,
an aggressive `StaleNodeTimeout`) used to fold the current node into the
staleNodes filter inside `NodeAgentController.DoHealthChecksAsync`. The
tick then crashed in `tryStartLeadershipAsync` with an NRE on
`self!.AssignAgents([LeaderUri])` *after* `IsLeader` was set to true,
the leadership lock was held, `AssumedLeadership` had fired, and the
assignment row was written, leaving the cluster half-elected with no
agent dispatch and other nodes constantly falling off.

We just wrote our own heartbeat in this same tick, so by definition
self is not stale. Two complementary defensive changes:

1. `NodeAgentController.HeartBeat.cs`: exclude self from the staleness
   filter unconditionally; also inject self into `nodes` if the
   snapshot omitted it entirely (handles read-after-write lag against
   the upsert above and brand-new-node propagation).

2. `NodeAgentController.EvaluateAssignments.cs`: defense in depth —
   the pre-existing self-injection guard only fired on an empty
   `nodes` list. Now it also fires on a non-empty list missing self,
   for any caller that happens to slip a self-less list through.

`ejectStaleNodes` already protected self from DB deletion via the
`AssignedNodeNumber` check (GH-1116); this just plugs the in-memory
gap that survived that protection.

Regression tests in `leader_election_self_visibility_tests.cs` cover
both the stale-self and missing-self paths through `DoHealthChecksAsync`,
plus the standalone `EvaluateAssignmentsAsync` self-injection branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

NullReferenceException in tryStartLeadershipAsync when current node is filtered out as stale.
