
Prevent concurrent health check race causing leadership lock corruption #2999

Merged
jeremydmiller merged 1 commit into JasperFx:main from dmytro-pryvedeniuk:prevent-leadership-lock-corruption on Jun 1, 2026

Conversation

@dmytro-pryvedeniuk
Contributor

@dmytro-pryvedeniuk dmytro-pryvedeniuk commented Jun 1, 2026

This PR started as an attempt to fix the flaky RavenDbTests.LeaderElection.leadership_election_compliance.take_over_leader_ship_if_leader_becomes_stale test, but it turned out to fix an issue that can also occur outside of tests. It is relevant for storages that use lease-based locks (RavenDB or CosmosDB).

Problem

DoHealthChecksAsync can be called from two concurrent paths:

  • CheckAgentHealth -> DoHealthChecksAsync -> TryAttainLockAsync (index = 0)
  • heartbeat -> DoHealthChecksAsync -> TryAttainLockAsync (index = 0)

One call wins and updates the index (_lastLockIndex for RavenDB). The other call still uses the same initial, now stale, index, so it fails and calls stepDownAsync -> ReleaseLeadershipLockAsync.
Result: the leader kills itself, and the election stalls until the lease expires.
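
To make the interleaving concrete, here is a minimal toy model of the race in Python with asyncio. It is not Wolverine code: `LeaseStore`, `Node`, and `simulate_race` are hypothetical names, and the index check only stands in for whatever the real lease store does. Both callers read the same index before either write completes, so the loser releases the lock the winner just took.

```python
import asyncio


class LeaseStore:
    """Hypothetical lease store: a write succeeds only if the caller's index is current."""

    def __init__(self):
        self.index = 0
        self.holder = None

    async def try_attain(self, node, expected_index):
        await asyncio.sleep(0)  # stand-in for network I/O; lets the other caller run
        if expected_index != self.index:
            return None  # the caller's index is stale
        self.index += 1
        self.holder = node
        return self.index

    async def release(self, node):
        await asyncio.sleep(0)
        if self.holder == node:
            self.holder = None


class Node:
    """Hypothetical node whose health check has no re-entrancy guard."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.last_lock_index = 0
        self.is_leader = False

    async def do_health_checks(self):
        expected = self.last_lock_index  # both concurrent callers read 0 here
        new_index = await self.store.try_attain(self.name, expected)
        if new_index is None:
            # Step down: this releases the lock that the other call just acquired.
            await self.store.release(self.name)
            self.is_leader = False
            return
        self.last_lock_index = new_index
        self.is_leader = True


def simulate_race():
    async def run():
        store = LeaseStore()
        node = Node("node-1", store)
        # Two paths (explicit health check and heartbeat) hit the same node at once.
        await asyncio.gather(node.do_health_checks(), node.do_health_checks())
        return store.holder, node.is_leader

    return asyncio.run(run())


if __name__ == "__main__":
    print(simulate_race())
```

In this model the node ends up holding no lock and no longer considers itself leader, even though no other node competed for leadership.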

Fix

A non-blocking re-entrancy guard is added to NodeAgentController.DoHealthChecksAsync. The concurrent_DoHealthChecksAsync_guard_prevents_spurious_stepdown test demonstrates that the leader survives concurrent calls.
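
The idea of a non-blocking re-entrancy guard can be sketched with a toy Python model. This is an illustration under assumptions, not the actual change in NodeAgentController: `LeaseStore`, `GuardedNode`, and `simulate_guarded` are hypothetical names, and a plain boolean is enough here only because asyncio runs on a single thread (a multi-threaded implementation would need an atomic flag).

```python
import asyncio


class LeaseStore:
    """Hypothetical lease store: a write succeeds only if the caller's index is current."""

    def __init__(self):
        self.index = 0
        self.holder = None

    async def try_attain(self, node, expected_index):
        await asyncio.sleep(0)  # stand-in for network I/O
        if expected_index != self.index:
            return None
        self.index += 1
        self.holder = node
        return self.index

    async def release(self, node):
        await asyncio.sleep(0)
        if self.holder == node:
            self.holder = None


class GuardedNode:
    """Hypothetical node whose health check skips re-entrant calls instead of waiting."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.last_lock_index = 0
        self.is_leader = False
        self._health_check_running = False

    async def do_health_checks(self):
        # Non-blocking guard: if a check is already in flight, return immediately.
        # No await sits between the test and the set, so the pair cannot interleave.
        if self._health_check_running:
            return
        self._health_check_running = True
        try:
            expected = self.last_lock_index
            new_index = await self.store.try_attain(self.name, expected)
            if new_index is None:
                await self.store.release(self.name)
                self.is_leader = False
                return
            self.last_lock_index = new_index
            self.is_leader = True
        finally:
            self._health_check_running = False


def simulate_guarded():
    async def run():
        store = LeaseStore()
        node = GuardedNode("node-1", store)
        await asyncio.gather(node.do_health_checks(), node.do_health_checks())
        return store.holder, node.is_leader

    return asyncio.run(run())


if __name__ == "__main__":
    print(simulate_guarded())
```

In this model the second call is dropped rather than queued, so only one caller ever reads the lock index at a time. Dropping looks safe in the sketch on the assumption that health checks repeat on the next heartbeat.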

The tests leader_switchover_between_nodes and take_over_leader_ship_if_leader_becomes_stale were hardened.
The default heartbeat in tests is 1 second, so all hosts compete for leadership. Combining the default fast (1 s) heartbeat with a custom slow (10 min) heartbeat gives more control.

Also added a new test, take_over_leader_ship_if_leader_becomes_stale_with_racing_nodes, which relies only on heartbeats, without an explicit CheckAgentHealth message.

@jeremydmiller
Member

@dmytro-pryvedeniuk Good catch!

@dmytro-pryvedeniuk dmytro-pryvedeniuk marked this pull request as ready for review June 1, 2026 13:48
@jeremydmiller jeremydmiller merged commit 245014c into JasperFx:main Jun 1, 2026
24 checks passed
@dmytro-pryvedeniuk dmytro-pryvedeniuk deleted the prevent-leadership-lock-corruption branch June 2, 2026 18:23
