Fix built-in roles sync to retry on lock contention instead of silently discarding pending updates #142433
Merged
ebarlas merged 7 commits into elastic:main on Feb 23, 2026
…ly discarding pending updates
Collaborator
Hi @ebarlas, I've created a changelog YAML for you.
Collaborator
Pinging @elastic/es-security (Team:Security)
…for active shards on ResourceAlreadyExistsException
slobodanadamovic approved these changes Feb 23, 2026
slobodanadamovic (Contributor) left a comment
LGTM 🚀
This is a very neat way to fix the issue!
ebarlas added a commit to ebarlas/elasticsearch that referenced this pull request Feb 23, 2026
Built-in role sync requests arriving while another sync was in progress were silently discarded. This caused roles to go missing when cluster change events fired in quick succession during bootstrap. Replace the synchronizationInProgress AtomicBoolean with a lock-free AtomicInteger state machine (RolesSync) that tracks idle, syncing, and syncing_pending states. When a sync completes and updates are pending, it automatically retries with the latest roles.
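The idle, syncing, and syncing_pending states described in this commit message can be sketched as a small compare-and-set state machine. This is a minimal illustration, not the code merged in this PR: the class name RolesSyncSketch, the integer constants, and the retry loops are assumptions. Only the three state names and the return values of startSync() and endSync() are taken from the PR description.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a lock-free sync state machine; not the merged RolesSync class. */
final class RolesSyncSketch {
    static final int IDLE = 0;
    static final int SYNCING = 1;
    static final int SYNCING_PENDING = 2;

    private final AtomicInteger state = new AtomicInteger(IDLE);

    /**
     * Returns true if the caller acquired the right to run a sync (idle -> syncing).
     * Returns false if a sync is already running; in that case the request is
     * recorded as pending (syncing -> syncing_pending) rather than dropped.
     */
    boolean startSync() {
        while (true) {
            int current = state.get();
            if (current == IDLE) {
                if (state.compareAndSet(IDLE, SYNCING)) {
                    return true;
                }
            } else if (current == SYNCING) {
                if (state.compareAndSet(SYNCING, SYNCING_PENDING)) {
                    return false;
                }
            } else {
                // Already syncing_pending: one retry is queued, nothing more to record.
                return false;
            }
            // A CAS lost a race with another thread; re-read the state and try again.
        }
    }

    /**
     * Called by the thread that finished a sync. Returns true if a request arrived
     * in the meantime (syncing_pending -> syncing), meaning the caller should sync
     * again with the latest roles. Returns false if nothing is pending (syncing -> idle).
     */
    boolean endSync() {
        while (true) {
            int current = state.get();
            if (current == SYNCING_PENDING) {
                if (state.compareAndSet(SYNCING_PENDING, SYNCING)) {
                    return true;
                }
            } else if (current == SYNCING) {
                if (state.compareAndSet(SYNCING, IDLE)) {
                    return false;
                }
            } else {
                throw new IllegalStateException("endSync() called while idle");
            }
        }
    }

    int currentState() {
        return state.get();
    }
}
```

Under these assumptions, a caller would run a sync only when startSync() returns true, and on completion would repeat the sync for as long as endSync() returns true. Collapsing any number of concurrent requests into a single pending retry is sufficient here because the retry uses the latest roles.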
Replace the synchronizationInProgress AtomicBoolean with a RolesSync state machine that tracks a pending flag, so completing a sync automatically retries with the latest roles if any requests were dropped.

    stateDiagram-v2
        [*] --> idle
        idle --> syncing : startSync() → true
        syncing --> idle : endSync() → false
        syncing --> syncing_pending : startSync() → false
        syncing_pending --> syncing : endSync() → true
        syncing_pending --> syncing_pending : startSync() → false

SecurityIntegTestCase and NativeRealmIntegTestCase
The built-in roles sync retry mechanism can now race with SecurityIntegTestCase.createSecurityIndexWithWaitForActiveShards() to create the .security index. When the synchronizer creates the index first, the test catches ResourceAlreadyExistsException but previously did not wait for active shards before proceeding. This left a window where the index existed but its primary shard was still initializing, causing ReservedRealm to fail with UnavailableShardsException when authenticating the elastic user during setupReservedPasswords. The fix adds a ClusterHealthRequest.waitForActiveShards call in the catch block, matching the fix already applied to SecuritySingleNodeTestCase in #128825.
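The shape of that test fix can be sketched without the Elasticsearch test framework. Everything below is a hypothetical stand-in: IndexClient, AlreadyExistsException, and createIndexAndWait are illustrative names, not Elasticsearch APIs. The sketch only shows the control flow, in which losing the creation race still leads to a wait for active shards.

```java
/** Hypothetical sketch of the catch-and-wait pattern; none of these types are Elasticsearch APIs. */
final class CreateIndexSketch {

    /** Stand-in for ResourceAlreadyExistsException. */
    static final class AlreadyExistsException extends RuntimeException {
        AlreadyExistsException(String index) {
            super("index [" + index + "] already exists");
        }
    }

    /** Stand-in for the client calls the test helper would make. */
    interface IndexClient {
        /** Creates the index and waits for its shards; throws if the index already exists. */
        void createIndexWithWait(String index);

        /** Blocks until the index's shards are active (a cluster health wait in the real test). */
        void waitForActiveShards(String index);
    }

    static void createIndexAndWait(IndexClient client, String index) {
        try {
            client.createIndexWithWait(index);
        } catch (AlreadyExistsException e) {
            // Another component (here, the roles synchronizer) created the index first.
            // The index exists, but its primary shard may still be initializing, so wait
            // explicitly instead of proceeding straight away.
            client.waitForActiveShards(index);
        }
    }
}
```

Without the wait in the catch block, the caller would return as soon as it saw that the index existed, which is the window described above.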