[StorageIndexAdapter] Set auto_expand_replicas to fix yellow health on single-node ES clusters #263096
…n single-node ES clusters

StorageIndexAdapter did not include index settings in its template, causing all managed indices to default to number_of_replicas: 1. On single-node Elasticsearch clusters, the replica shard cannot be allocated, leaving cluster health yellow indefinitely. This adds auto_expand_replicas: '0-1' and number_of_shards: 1 to the index template and updates existing indices on write if their settings differ.

Fixes elastic#263048
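To make the effect of `auto_expand_replicas: '0-1'` concrete, here is a minimal sketch of how a replica count could be derived from the number of data nodes. The function name `effectiveReplicas` is hypothetical, and the model is deliberately simplified: it assumes only the data node count matters, whereas Elasticsearch also takes allocation filtering and awareness into account.

```typescript
// Simplified model of how `index.auto_expand_replicas` picks a replica count.
// Assumption: only the number of data nodes matters.
function effectiveReplicas(autoExpand: string, dataNodes: number): number {
  const [minRaw, maxRaw] = autoExpand.split('-');
  const min = Number(minRaw);
  // 'all' means "as many replicas as there are other data nodes"
  const max = maxRaw === 'all' ? Number.POSITIVE_INFINITY : Number(maxRaw);
  // A replica never shares a node with its primary, so at most nodes - 1 fit
  const allocatable = Math.max(dataNodes - 1, 0);
  // Clamp the allocatable count into the configured [min, max] range
  return Math.min(Math.max(allocatable, min), max);
}

// Single node: zero replicas, so nothing stays unassigned
console.log(effectiveReplicas('0-1', 1));
// Three nodes: capped at one replica
console.log(effectiveReplicas('0-1', 3));
```

With a fixed `number_of_replicas: 1` the single-node case would instead leave one replica shard unassigned, which is what keeps the cluster yellow.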
…gIndex

With flat_settings: true, Elasticsearch returns dot-notation keys like 'index.auto_expand_replicas' instead of nested objects. This caused currentAutoExpandReplicas to always be undefined, making putSettings run on every write even when the setting was already correct.
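The flat versus nested distinction described in this commit can be illustrated with a small sketch. The helper names `readAutoExpandReplicas` and `needsSettingsUpdate` are hypothetical and not taken from the PR; the sketch only assumes the two settings layouts named above.

```typescript
// Settings object as returned for one index; only the keys used here matter.
type IndexSettings = Record<string, unknown>;

// Hypothetical helper: reads auto_expand_replicas from either layout.
function readAutoExpandReplicas(settings: IndexSettings | undefined): string | undefined {
  if (!settings) return undefined;
  // flat_settings: true  -> { 'index.auto_expand_replicas': '0-1' }
  const flat = settings['index.auto_expand_replicas'];
  if (typeof flat === 'string') return flat;
  // flat_settings: false -> { index: { auto_expand_replicas: '0-1' } }
  const nested = settings['index'];
  if (typeof nested === 'object' && nested !== null) {
    const value = (nested as Record<string, unknown>)['auto_expand_replicas'];
    if (typeof value === 'string') return value;
  }
  return undefined;
}

// Hypothetical helper: true only when a settings update is actually required.
function needsSettingsUpdate(settings: IndexSettings | undefined, desired = '0-1'): boolean {
  return readAutoExpandReplicas(settings) !== desired;
}
```

Reading only the nested path against a flat response always yields `undefined`, which matches the bug this commit describes: the comparison never succeeds, so the update runs on every write.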
I tested this locally with the streams index, including the upgrade scenario, and it seemed to work fine. However, it only changes the settings on the first write call, so it does not auto-heal existing problems. Do you think that's the right approach?

Pinging @elastic/kibana-core (Team:Core)
    this.logger.debug(`Updating mappings of existing index due to schema version mismatch`);
    await this.updateMappingsOfExistingIndex({
    } else {
    await this.updateSettingsOfExistingIndex({
We already get the index on line 308, which includes the index settings. Is it necessary to get the index again inside updateSettingsOfExistingIndex?
If it's necessary to fix the write index, isn't it also necessary to fix all backing indices, or can we safely assume no consumers have rolled over to a new index?
Good question, ties into the one I posted above about how far we should go with proactively fixing the existing configurations or whether we should just fix it forward.
I don't have a strong opinion on it, we can also just make it a thing for newly created backing indices and ignore existing ones, wdyt?
I'm leaning towards just fixing it for new indices, not for existing ones.
yeah my assumption is that single node clusters are test/demo/dev clusters only. So it's unlikely that we have production customers impacted and fixing for the write index/new indices would be sufficient.
I think it's still worth reusing the settings we already have
Ralph applied changes for: simplify the implementation so the auto_expand_replicas as rudolf suggested by reusing the settings we already have

Updated by Ralph Engine.

@rudolf simplified to just changing this for future indices - less moving parts and as you say it shouldn't have an impact on production systems.
💛 Build succeeded, but was flaky
Starting backport for target branches: 9.4 https://github.com/elastic/kibana/actions/runs/24599997646
…n single-node ES clusters (elastic#263096)

## Summary

Fixes elastic#263048

`StorageIndexAdapter` did not include index settings in its template, causing all 24 managed indices (`.kibana_streams`, `.chat-conversations`, `kibana-evaluation-datasets`, etc.) to default to `number_of_replicas: 1`. On single-node Elasticsearch clusters, the replica shard cannot be allocated, leaving cluster health yellow indefinitely.

This is the same class of issue as elastic#261933 (`.workflows-events`), but affecting all indices managed by `StorageIndexAdapter`.

### Changes

- Added `settings: { auto_expand_replicas: '0-1', number_of_shards: 1 }` to the index template in `createOrUpdateIndexTemplate()`. This is the standard pattern used by all other Kibana system indices (`.kibana`, `.kibana_task_manager`, event log, lock manager, blob storage, etc.)
- Added `updateSettingsOfExistingIndex()` method that checks the current `auto_expand_replicas` value on an existing write index and updates it to `'0-1'` if it differs. This fixes existing deployments that already have indices with `number_of_replicas: 1`
- Wired `updateSettingsOfExistingIndex()` into `validateComponentsBeforeWriting()` so it runs on every write to an existing index

### Affected indices (all 24 automatically benefit)

| Plugin | Indices |
|--------|---------|
| streams (10) | `.chat-memory`, `.chat-memhistory`, `.kibana_streams`, `.kibana_streams_settings`, `.kibana_streams_features`, `.kibana_streams_assets`, `.kibana_streams_attachments`, `.kibana_streams_insights`, `.kibana_streams_tasks`, `.kibana_streams_content_packs` |
| agent_builder (10) | `.chat-conversations`, `.chat-skills`, `.chat-tools`, `.chat-tool-health`, `.chat-plugins`, `.chat-agent-executions`, `.chat-agents`, `.chat-sml-data`, `.chat-sml-crawler-state`, `.chat-user-prompts` |
| evals (2) | `kibana-evaluation-datasets`, `kibana-evaluation-dataset-examples` |
| automatic_import (1) | `.kibana-automatic-import-samples` |
| workflows_management (1) | `.workflows-workflows` |

### Test plan

- [x] Unit tests: 9 passing (3 new tests for template settings, settings update, and no-op when already correct)
- [x] Integration tests: 20 passing (1 new test verifying existing index gets `auto_expand_replicas` updated on next write)
- [x] Type check passes

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

(cherry picked from commit b805e2e)
…alth on single-node ES clusters (#263096) (#264262) # Backport This will backport the following commits from `main` to `9.4`: - [StorageIndexAdapter] Set auto_expand_replicas to fix yellow health on single-node ES clusters (#263096) (b805e2e) <!--- Backport version: 9.6.6 --> ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport) <!--BACKPORT [{"author":{"name":"Joe Reuter","email":"johannes.reuter@elastic.co"},"sourceCommit":{"committedDate":"2026-04-18T07:38:46Z","message":"[StorageIndexAdapter] Set auto_expand_replicas to fix yellow health on single-node ES clusters (#263096)\n\n## Summary\n\nFixes #263048\n\n`StorageIndexAdapter` did not include index settings in its template,\ncausing all 24 managed indices (`.kibana_streams`,\n`.chat-conversations`, `kibana-evaluation-datasets`, etc.) to default to\n`number_of_replicas: 1`. On single-node Elasticsearch clusters, the\nreplica shard cannot be allocated, leaving cluster health yellow\nindefinitely.\n\nThis is the same class of issue as #261933 (`.workflows-events`), but\naffecting all indices managed by `StorageIndexAdapter`.\n\n### Changes\n\n- Added `settings: { auto_expand_replicas: '0-1', number_of_shards: 1 }`\nto the index template in `createOrUpdateIndexTemplate()` — this is the\nstandard pattern used by all other Kibana system indices (`.kibana`,\n`.kibana_task_manager`, event log, lock manager, blob storage, etc.)\n- Added `updateSettingsOfExistingIndex()` method that checks the current\n`auto_expand_replicas` value on an existing write index and updates it\nto `'0-1'` if it differs — this fixes existing deployments that already\nhave indices with `number_of_replicas: 1`\n- Wired `updateSettingsOfExistingIndex()` into\n`validateComponentsBeforeWriting()` so it runs on every write to an\nexisting index\n\n### Affected indices (all 24 automatically benefit)\n\n| Plugin | Indices |\n|--------|---------|\n| streams (10) | `.chat-memory`, `.chat-memhistory`, 
`.kibana_streams`,\n`.kibana_streams_settings`, `.kibana_streams_features`,\n`.kibana_streams_assets`, `.kibana_streams_attachments`,\n`.kibana_streams_insights`, `.kibana_streams_tasks`,\n`.kibana_streams_content_packs` |\n| agent_builder (10) | `.chat-conversations`, `.chat-skills`,\n`.chat-tools`, `.chat-tool-health`, `.chat-plugins`,\n`.chat-agent-executions`, `.chat-agents`, `.chat-sml-data`,\n`.chat-sml-crawler-state`, `.chat-user-prompts` |\n| evals (2) | `kibana-evaluation-datasets`,\n`kibana-evaluation-dataset-examples` |\n| automatic_import (1) | `.kibana-automatic-import-samples` |\n| workflows_management (1) | `.workflows-workflows` |\n\n### Test plan\n\n- [x] Unit tests: 9 passing (3 new tests for template settings, settings\nupdate, and no-op when already correct)\n- [x] Integration tests: 20 passing (1 new test verifying existing index\ngets `auto_expand_replicas` updated on next write)\n- [x] Type check passes\n\n---------\n\nCo-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>","sha":"b805e2e703e8c385da2386a819a1fbb727a71720"},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[]}] BACKPORT--> Co-authored-by: Joe Reuter <johannes.reuter@elastic.co>
* main: (114 commits)
  Fix observability_ai_assistant_tool_call EBT error when connector is an inference endpoint (elastic#263334)
  init on install (elastic#263732)
  [One Workflow] fail-fast TaskRecovery for interrupted runs (elastic#261275)
  [Entity Store] Reset state error after successful task run (elastic#263087)
  [api-docs] 2026-04-19 Daily api_docs build (elastic#264280)
  [UII] Fix integration card row height calculation (elastic#264212)
  [scout] migrate FTR logstash api tests (elastic#262953)
  [StorageIndexAdapter] Set auto_expand_replicas to fix yellow health on single-node ES clusters (elastic#263096)
  [api-docs] 2026-04-18 Daily api_docs build (elastic#264260)
  [Scout] Update test config manifests (elastic#264257)
  [Security Solution][Detection Engine] enables AI rule creation feature flag (elastic#264036)
  [dashboards as code] only validate id on PUT route when creating new dashboard (elastic#264161)
  chore(NA): bump version to 9.5.0 (elastic#262165)
  skip failing test suite (elastic#263649)
  skip failing test suite (elastic#264236)
  [Discover] Convert remaining Enzyme tests to RTL (elastic#259676)
  auto-implement: Labels in model endpoints table of the model details flyout look misaligned (elastic#263770)
  [ci] Promote ES docker image after verification (elastic#263890)
  [Observability:Onboarding] Remove suppress global announcements that was breaking ensemble tests (elastic#264169)
  [Cases][AttachmentV2] Migrate persistable state part 2 - ML and AIOps charts (elastic#262597)
  ...
…64760)

Closes #264845

## Summary

Fixes index template creation on Serverless for the indices `kibana-evaluation-datasets` and `kibana-evaluation-dataset-examples`.

PR #263096 added `auto_expand_replicas` and `number_of_shards` to index templates in `StorageIndexAdapter`. Serverless ES rejects these settings on non-system indices with an `illegal_argument_exception`, while hidden indices (e.g. those used by Streams) are unaffected because Kibana manages them as system indices.

### Dataset upsert error for Kibana evaluation runs

<img width="1247" height="473" alt="image" src="https://github.com/user-attachments/assets/10e75668-7a1d-462e-9594-37fbee0f08e3" />

### Error in logs:

```
Failed to upsert evaluation dataset: ResponseError: illegal_argument_exception
  Root causes:
    illegal_argument_exception: Settings [index.auto_expand_replicas,index.number_of_shards] are not available when running in serverless mode
```

## Fix

Serverless detection for index template settings was introduced in three tiers:

- Explicit: a new `isServerless` option in `StorageIndexAdapterOptions`. When provided, the adapter skips or includes settings without any extra calls.
- Proactive: if `isServerless` is not provided, the adapter calls `esClient.info()` on the first write and checks `version.build_flavor`. The result is cached for the adapter's lifetime.
- Reactive: if both of the above are unavailable (e.g. `info()` fails due to insufficient privileges), the adapter catches the `illegal_argument_exception` on the first write, retries without settings, and caches the result.

The Evals plugin passes `isServerless` explicitly because the evals route handler creates `StorageIndexAdapter` with `esClient.asCurrentUser`, which is scoped to the caller's API key. This API key may lack the `monitor` cluster privilege needed for `esClient.info()`, making tier 2 unreliable. There, `buildFlavor` is passed from the plugin context.

## Test Plan

- [x] Deploy the fix to a serverless project from this PR
- [x] Create a config file (e.g. `config.testcluster.json`) and add the serverless project URL as the dataset target
- [x] Run evals with `node scripts/evals start --suite significant-events --project eis-anthropic-claude-4-6-sonnet --judge eis-google-gemini-3-1-pro --export-profile local --datasets-profile testcluster`

### With this fix, the dataset upsert works as expected

<img width="1531" height="877" alt="image" src="https://github.com/user-attachments/assets/84c2a5cd-138b-457e-85d3-bd87bff4867c" />
<img width="1710" height="556" alt="image" src="https://github.com/user-attachments/assets/bbfeb03a-405f-4551-8326-e12b0192d332" />

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
- [x] Review the [backport guidelines](https://docs.google.com/document/d/1VyN5k91e5OVumlc0Gb9RPa3h1ewuPE705nRtioPiTvY/edit?usp=sharing) and apply applicable `backport:*` labels.
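The three detection tiers described in the Fix section above can be sketched roughly as follows. This is not the adapter's actual code: `createTemplateWriter`, `InfoClient`, and the `putTemplate` callback are hypothetical stand-ins, and the sketch assumes that a serverless cluster reports `build_flavor` as `'serverless'` and that the rejection can be recognised by `illegal_argument_exception` appearing in the error message.

```typescript
// Minimal stand-in for the part of the ES client this sketch touches.
interface InfoClient {
  info(): Promise<{ version: { build_flavor: string } }>;
}
type TemplateSettings = { auto_expand_replicas: string; number_of_shards: number };
// Hypothetical callback that performs the actual put-index-template call.
type PutTemplate = (settings?: TemplateSettings) => Promise<void>;

const SETTINGS: TemplateSettings = { auto_expand_replicas: '0-1', number_of_shards: 1 };

function createTemplateWriter(opts: {
  client: InfoClient;
  putTemplate: PutTemplate;
  isServerless?: boolean;
}): () => Promise<void> {
  // Tier 1: an explicit option seeds the cache, so no extra calls are made
  let cached: boolean | undefined = opts.isServerless;

  async function detect(): Promise<boolean | undefined> {
    if (cached !== undefined) return cached;
    try {
      // Tier 2: proactive check of the build flavor (assumed value: 'serverless')
      const { version } = await opts.client.info();
      cached = version.build_flavor === 'serverless';
    } catch {
      // info() may fail, e.g. without the monitor privilege; stay undecided
    }
    return cached;
  }

  return async function write(): Promise<void> {
    const serverless = await detect();
    if (serverless === true) {
      await opts.putTemplate();
      return;
    }
    try {
      await opts.putTemplate(SETTINGS);
    } catch (error) {
      // Tier 3: reactive fallback, only when detection was inconclusive
      const message = error instanceof Error ? error.message : String(error);
      const undecided = serverless === undefined;
      if (!undecided || !message.includes('illegal_argument_exception')) throw error;
      cached = true;
      await opts.putTemplate();
    }
  };
}
```

Caching the outcome in a closure variable means the extra `info()` call or the failed first attempt happens at most once per writer, which mirrors the "cached for the adapter's lifetime" behaviour the description mentions.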
## Summary

Fixes #263048

`StorageIndexAdapter` did not include index settings in its template, causing all 24 managed indices (`.kibana_streams`, `.chat-conversations`, `kibana-evaluation-datasets`, etc.) to default to `number_of_replicas: 1`. On single-node Elasticsearch clusters, the replica shard cannot be allocated, leaving cluster health yellow indefinitely.

This is the same class of issue as #261933 (`.workflows-events`), but affecting all indices managed by `StorageIndexAdapter`.

### Changes

- Added `settings: { auto_expand_replicas: '0-1', number_of_shards: 1 }` to the index template in `createOrUpdateIndexTemplate()`. This is the standard pattern used by all other Kibana system indices (`.kibana`, `.kibana_task_manager`, event log, lock manager, blob storage, etc.)
- Added `updateSettingsOfExistingIndex()` method that checks the current `auto_expand_replicas` value on an existing write index and updates it to `'0-1'` if it differs. This fixes existing deployments that already have indices with `number_of_replicas: 1`
- Wired `updateSettingsOfExistingIndex()` into `validateComponentsBeforeWriting()` so it runs on every write to an existing index

### Affected indices (all 24 automatically benefit)

| Plugin | Indices |
|--------|---------|
| streams (10) | `.chat-memory`, `.chat-memhistory`, `.kibana_streams`, `.kibana_streams_settings`, `.kibana_streams_features`, `.kibana_streams_assets`, `.kibana_streams_attachments`, `.kibana_streams_insights`, `.kibana_streams_tasks`, `.kibana_streams_content_packs` |
| agent_builder (10) | `.chat-conversations`, `.chat-skills`, `.chat-tools`, `.chat-tool-health`, `.chat-plugins`, `.chat-agent-executions`, `.chat-agents`, `.chat-sml-data`, `.chat-sml-crawler-state`, `.chat-user-prompts` |
| evals (2) | `kibana-evaluation-datasets`, `kibana-evaluation-dataset-examples` |
| automatic_import (1) | `.kibana-automatic-import-samples` |
| workflows_management (1) | `.workflows-workflows` |

### Test plan

- [x] Unit tests: 9 passing (3 new tests for template settings, settings update, and no-op when already correct)
- [x] Integration tests: 20 passing (1 new test verifying existing index gets `auto_expand_replicas` updated on next write)
- [x] Type check passes
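As an illustration of the template change in this description, a template body carrying the two settings might be assembled as in the sketch below. The function name `buildIndexTemplate`, the `IndexTemplateBody` type, and the `${indexName}*` pattern are assumptions made for this example; only the `settings` values come from the PR text.

```typescript
// Hypothetical shape of the request body for an index template.
interface IndexTemplateBody {
  name: string;
  index_patterns: string[];
  template: {
    settings?: { auto_expand_replicas: string; number_of_shards: number };
    mappings: Record<string, unknown>;
  };
}

// Hypothetical builder; the index pattern below is an assumption.
function buildIndexTemplate(
  indexName: string,
  mappings: Record<string, unknown>
): IndexTemplateBody {
  return {
    name: indexName,
    index_patterns: [`${indexName}*`],
    template: {
      // Values quoted from the PR description: replicas scale between 0 and 1
      // with the number of data nodes, and each index gets a single primary shard
      settings: { auto_expand_replicas: '0-1', number_of_shards: 1 },
      mappings,
    },
  };
}

const example = buildIndexTemplate('.kibana_streams', { dynamic: 'strict' });
console.log(JSON.stringify(example.template.settings));
```

Because the settings live in the template, they apply to indices created after the template is written; indices that already exist keep whatever settings they were created with.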