Fix built-in roles sync to retry on lock contention instead of silently discarding pending updates #142433

Merged
ebarlas merged 7 commits into elastic:main from ebarlas:fix-silent-discard-of-roles-update-sync
Feb 23, 2026

@ebarlas
Contributor

@ebarlas ebarlas commented Feb 12, 2026

  • Built-in role sync requests that arrived while another sync was in progress were silently discarded, causing roles to go missing when cluster change events fired in quick succession during bootstrap.
  • Replaced the synchronizationInProgress AtomicBoolean with a RolesSync state machine that tracks a pending flag, so completing a sync automatically retries with the latest roles if any requests were dropped.
stateDiagram-v2
    [*] --> idle

    idle --> syncing : startSync() → true

    syncing --> idle : endSync() → false
    syncing --> syncing_pending : startSync() → false

    syncing_pending --> syncing : endSync() → true
    syncing_pending --> syncing_pending : startSync() → false
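
The transitions in the diagram can be sketched as a small lock-free state machine. This is a minimal illustration built on `AtomicInteger`, not the code merged in this PR: the class name `RolesSyncSketch`, the integer constants, the `state()` accessor, and the exception thrown when `endSync()` is called while idle are all assumptions. Only the `startSync()`/`endSync()` names and their return values come from the diagram.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; names other than startSync()/endSync() are assumptions.
class RolesSyncSketch {
    static final int IDLE = 0;
    static final int SYNCING = 1;
    static final int SYNCING_PENDING = 2;

    private final AtomicInteger state = new AtomicInteger(IDLE);

    /** Returns true if the caller acquired the sync and should run it now. */
    boolean startSync() {
        while (true) {
            int current = state.get();
            if (current == IDLE) {
                if (state.compareAndSet(IDLE, SYNCING)) {
                    return true;              // idle -> syncing
                }
            } else if (current == SYNCING) {
                if (state.compareAndSet(SYNCING, SYNCING_PENDING)) {
                    return false;             // syncing -> syncing_pending
                }
            } else {
                return false;                 // already pending, nothing to record
            }
            // A failed compareAndSet means another thread moved the state; re-read and retry.
        }
    }

    /** Returns true if a request arrived during the sync and the caller should retry. */
    boolean endSync() {
        while (true) {
            int current = state.get();
            if (current == SYNCING_PENDING) {
                if (state.compareAndSet(SYNCING_PENDING, SYNCING)) {
                    return true;              // syncing_pending -> syncing
                }
            } else if (current == SYNCING) {
                if (state.compareAndSet(SYNCING, IDLE)) {
                    return false;             // syncing -> idle
                }
            } else {
                throw new IllegalStateException("endSync() called while idle");
            }
        }
    }

    int state() {
        return state.get();
    }

    public static void main(String[] args) {
        RolesSyncSketch sync = new RolesSyncSketch();
        System.out.println(sync.startSync()); // true: this caller runs the sync
        System.out.println(sync.startSync()); // false: recorded as pending
        System.out.println(sync.startSync()); // false: still pending
        System.out.println(sync.endSync());   // true: retry with the latest roles
        System.out.println(sync.endSync());   // false: back to idle
    }
}
```

Under this sketch, a caller would run the sync when `startSync()` returns true and, when the sync completes, run it again for as long as `endSync()` returns true.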

SecurityIntegTestCase and NativeRealmIntegTestCase

The built-in roles sync retry mechanism can now race with SecurityIntegTestCase.createSecurityIndexWithWaitForActiveShards() to create the .security index. When the synchronizer creates the index first, the test catches ResourceAlreadyExistsException but previously did not wait for active shards before proceeding. This left a window where the index existed but its primary shard was still initializing, causing ReservedRealm to fail with UnavailableShardsException when authenticating the elastic user during setupReservedPasswords. The fix adds a ClusterHealthRequest.waitForActiveShards call in the catch block, matching the fix already applied to SecuritySingleNodeTestCase in #128825.

[2026-02-13T04:52:46,790][ERROR][o.e.x.s.a.e.ReservedRealm][node_s2][transport_worker][T#1] failed to retrieve password hash for reserved user [elastic]
org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable
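
The shape of the test fix described above can be sketched without any Elasticsearch dependencies. Everything in this block is a hypothetical stand-in: `FakeCluster`, `AlreadyExists`, and `createIndexAndWait` are invented for illustration and are not Elasticsearch APIs. For brevity the sketch waits for the primary on both paths, whereas the PR describes adding the wait inside the catch block.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; none of these types are Elasticsearch classes.
class CreateOrWaitSketch {
    /** Hypothetical stand-in for ResourceAlreadyExistsException. */
    static class AlreadyExists extends RuntimeException {}

    /** Hypothetical stand-in for a cluster that holds a single index. */
    static class FakeCluster {
        private final AtomicBoolean created = new AtomicBoolean();
        private final CountDownLatch primaryActive = new CountDownLatch(1);

        void createIndex() {
            if (!created.compareAndSet(false, true)) {
                throw new AlreadyExists();
            }
        }

        void activatePrimary() {
            primaryActive.countDown();
        }

        /** Blocks until the primary is active or the timeout elapses. */
        boolean waitForActiveShards(long timeoutMillis) {
            try {
                return primaryActive.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Creates the index if needed, then waits until its primary shard is usable. */
    static boolean createIndexAndWait(FakeCluster cluster, long timeoutMillis) {
        try {
            cluster.createIndex();
        } catch (AlreadyExists e) {
            // Lost the race: the index exists, but its primary may still be initializing.
            // Returning here without waiting is the gap the PR description identifies.
        }
        return cluster.waitForActiveShards(timeoutMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        FakeCluster cluster = new FakeCluster();
        cluster.createIndex(); // the synchronizer wins the race

        // The primary becomes active a little later, on another thread.
        Thread starter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cluster.activatePrimary();
        });
        starter.start();

        System.out.println("ready=" + createIndexAndWait(cluster, 5000));
        starter.join();
    }
}
```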

@ebarlas ebarlas self-assigned this Feb 12, 2026
@ebarlas ebarlas added the >bug, :Security/Security, and Team:Cloud Security labels Feb 12, 2026
@elasticsearchmachine
Collaborator

Hi @ebarlas, I've created a changelog YAML for you.

@ebarlas ebarlas marked this pull request as ready for review February 12, 2026 23:42
@elasticsearchmachine elasticsearchmachine added the Team:Security label Feb 12, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@ebarlas ebarlas removed the Team:Cloud Security label Feb 13, 2026
@slobodanadamovic slobodanadamovic self-requested a review February 23, 2026 09:57
Contributor

@slobodanadamovic slobodanadamovic left a comment

LGTM 🚀

This is a very neat way to fix the issue!

@ebarlas ebarlas added the auto-backport, v9.3.2, v8.19.13, and v9.2.7 labels Feb 23, 2026
@ebarlas ebarlas merged commit 7d4b56e into elastic:main Feb 23, 2026
40 of 41 checks passed
ebarlas added a commit to ebarlas/elasticsearch that referenced this pull request Feb 23, 2026
Built-in role sync requests arriving while another sync was
in progress were silently discarded. This caused roles to go
missing when cluster change events fired in quick succession
during bootstrap.

Replace the synchronizationInProgress AtomicBoolean with a
lock-free AtomicInteger state machine (RolesSync) that tracks
idle, syncing, and syncing_pending states. When a sync
completes and updates are pending, it automatically retries
with the latest roles.
@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 9.3, 8.19, 9.2

szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 23, 2026
…on-sliced-reindex

* upstream/main: (110 commits)
  Add search task watchdog to log hot threads on slow search (elastic#142746)
  Fix return_intermediate_results query param on get async search results (elastic#142875)
  Mute org.elasticsearch.compute.operator.exchange.BatchDriverTests testSinglePageSingleBatch elastic#142895
  Cancel reindex body always has status (elastic#142766)
  Fix built-in roles sync losing updates (elastic#142433)
  ESQL: Clarify docs and add csv test for WHERE in STATS (elastic#133629)
  Fix and unmute ReindexResumeIT (elastic#142788)
  Fix broken release notes
  Mute org.elasticsearch.benchmark.vector.scorer.VectorScorerOSQBenchmarkTests testSingleScalarVsVectorized {p0=384 p1=4 p2=NIO p3=COSINE} elastic#142883
  ES|QL: fix Generative tests for commands that don't change the output schema (elastic#142864)
  Mute org.elasticsearch.benchmark.vector.scorer.VectorScorerOSQBenchmarkTests testSingleScalarVsVectorized {p0=1024 p1=1 p2=NIO p3=DOT_PRODUCT} elastic#142881
  SQL: Fix QlIllegalArgumentException with non-foldable date range queries (elastic#142386)
  Add more errors to the allowed_errors with github issue links (elastic#142862)
  ESQL: reapply "NDJSON datasource" (elastic#142855)
  Add implementation to update service settings method for Alibaba Cloud Search service (elastic#142738)
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testStartRemoveNodeButDoNotComplete elastic#142871
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testDeleteSnapshotWithPausedShardSnapshots elastic#142870
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testAbortSnapshotWhileRemovingNode elastic#142869
  Mute org.elasticsearch.snapshots.SnapshotShutdownIT testRemoveNodeDuringSnapshot elastic#142868
  ES|QL: Guard exponential_histogram TO_STRING against too large inputs (elastic#140718)
  ...
elasticsearchmachine pushed a commit that referenced this pull request Feb 23, 2026
jdconrad pushed a commit to jdconrad/elasticsearch that referenced this pull request Feb 24, 2026
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
elasticsearchmachine pushed a commit that referenced this pull request Apr 3, 2026
Labels

auto-backport, >bug, :Security/Security, Team:Security, v8.19.13, v9.2.7, v9.3.2, v9.4.0

3 participants