Prevent concurrent health check race causing leadership lock corruption#2999
Merged
jeremydmiller merged 1 commit intoJun 1, 2026
Conversation
Member
|
@dmytro-pryvedeniuk Good catch! |
This was referenced Jun 2, 2026
This was referenced Jun 3, 2026
Merged
This was referenced Jun 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR started as an attempt to fix flaky
RavenDbTests.LeaderElection.leadership_election_compliance.take_over_leader_ship_if_leader_becomes_staletest, but turned out to be a fix for the issue that may occur in non-testing environment. Relevant for the storages using lease-based locks (RavenDB or CosmosDB).Problem
DoHealthChecksAsynccan be called from two concurrent paths:CheckAgentHealth->DoHealthChecksAsync->TryAttainLockAsync(index = 0)DoHealthChecksAsync->TryAttainLockAsync(index = 0)One wins and updates the index (
_lastLockIndexfor RavenDB), another one uses the same initial (stale now) index, fails and callsstepDownAsync->ReleaseLeadershipLockAsync.Result: Leader kills itself. Election stalls until the lease expires.
Fix
Non-blocking re-entrancy guard is added to
NodeAgentController.DoHealthChecksAsync.concurrent_DoHealthChecksAsync_guard_prevents_spurious_stepdowntest demonstrates the leader survives the concurrent calls.The tests
leader_switchover_between_nodesandtake_over_leader_ship_if_leader_becomes_stalewere hardened.The default heartbeat in tests is 1 second so all hosts compete for the leadership. Using combination of default fast (1s) and custom slow (10mins) heartbeats gives more control.
Also, added a new test -
take_over_leader_ship_if_leader_becomes_stale_with_racing_nodes- which relies only on the heartbeats without explicitCheckAgentHealthmessage.