[kbn-es] Add --docker flag to yarn es snapshot #254306
patrykkopycinski merged 22 commits into elastic:main
ce9fbe2 to 854ee1a
4d33ac0 to 39730fd
When pRetry retries runElasticsearch after a failure, the container from the previous attempt still exists (stopped or dead). Docker refuses to create a new container with the same name. Force-remove it before docker run.
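A minimal sketch of the force-remove step described in the commit above, assuming the Docker CLI is on `PATH`. The names `buildForceRemoveArgs` and `removeStaleContainer` are hypothetical and are not taken from kbn-es; the runner is injectable so the logic can be exercised without Docker.

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Hypothetical helper: arguments for `docker rm --force <name>`.
export function buildForceRemoveArgs(containerName: string): string[] {
  return ['rm', '--force', containerName];
}

// Hypothetical helper: remove a leftover container from a previous attempt so
// that a following `docker run --name <name>` does not fail on a name clash.
export async function removeStaleContainer(
  containerName: string,
  run: (cmd: string, args: string[]) => Promise<unknown> = (cmd, args) =>
    execFileAsync(cmd, args)
): Promise<void> {
  try {
    await run('docker', buildForceRemoveArgs(containerName));
  } catch {
    // Best effort: if the container does not exist, or the removal fails for
    // another reason, the subsequent `docker run` reports the real problem.
  }
}
```

Calling `await removeStaleContainer('es01')` at the start of each retry attempt keeps the attempt idempotent with respect to the container name.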
ES Docker containers crash when esArgs reference config files (e.g. JWT JWKS keys) that only exist on the host. Detect file-path values in esArgs, volume-mount them, and rewrite the path to the container-internal location using getDockerFileMountPath.
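The detect, mount, and rewrite steps from the commit above could look roughly like the sketch below. It is an illustration under stated assumptions: `planFileMounts` and `HYPOTHETICAL_MOUNT_DIR` are made-up names, and the real container-side path is produced by `getDockerFileMountPath`, whose behavior is not shown in this conversation.

```typescript
import { existsSync } from 'node:fs';
import { basename, isAbsolute } from 'node:path';

// Assumption: a directory inside the container where host files are mounted.
// The actual location is decided by getDockerFileMountPath in kbn-es.
const HYPOTHETICAL_MOUNT_DIR = '/usr/share/elasticsearch/config/mounted';

interface MountPlan {
  esArgs: string[]; // esArgs with host file paths rewritten to container paths
  volumes: string[]; // values to pass to `docker run --volume`
}

export function planFileMounts(
  esArgs: string[],
  fileExists: (path: string) => boolean = existsSync
): MountPlan {
  const rewritten: string[] = [];
  const volumes: string[] = [];

  for (const arg of esArgs) {
    const eq = arg.indexOf('=');
    const key = eq === -1 ? arg : arg.slice(0, eq);
    const value = eq === -1 ? '' : arg.slice(eq + 1);

    // Treat a value as a host file only if it is an absolute path that exists.
    if (eq !== -1 && isAbsolute(value) && fileExists(value)) {
      const containerPath = `${HYPOTHETICAL_MOUNT_DIR}/${basename(value)}`;
      volumes.push(`${value}:${containerPath}:ro`); // read-only bind mount
      rewritten.push(`${key}=${containerPath}`);
    } else {
      rewritten.push(arg);
    }
  }

  return { esArgs: rewritten, volumes };
}
```

This sketch keys the container path on the file's base name, so two host files with the same name in different directories would collide; a real implementation would need to disambiguate them.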
Only default to Docker ES for non-serverless tests. Serverless Cypress tests have esArgs (e.g. serverless.search.enable_replicas_for_instant_failover) that are incompatible with the standard ES Docker image.
Fleet Server runs in Docker on the `elastic` network and needs to reach ES. When ES also runs in Docker, localhostRealIp may not be routable from containers (e.g. on macOS where it can be a VM bridge IP). Use host.docker.internal (set via --add-host) for both the Fleet Server bootstrap env var and the preconfigured Fleet output so that Fleet Server can reach ES after enrollment.
…tput change

- Add transportPort option to DockerSnapshotOptions so CCS tests can map the correct host-side port to container port 9300
- Pass transportPort from test_es_cluster to runDockerSnapshot
- Add --add-host host.docker.internal:host-gateway to ES Docker containers so cross-cluster seeds resolve via host networking
- Replace localhost with host.docker.internal in cluster.remote.*.seeds esArgs inside Docker containers
- Revert the host.docker.internal Fleet output change since Defend Workflows failures are pre-existing (unrelated to Docker ES changes)
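The seed rewrite listed in the commit above can be sketched as a pure function. This is an illustration only: `rewriteRemoteSeedsForDocker` is a hypothetical name, and the actual matching rules in kbn-es may differ (for example in how they treat `127.0.0.1`).

```typescript
// Matches keys such as `cluster.remote.my-cluster.seeds`.
const REMOTE_SEEDS_KEY = /^cluster\.remote\.[^.]+\.seeds$/;

// Hypothetical helper: inside a container, `localhost` points at the container
// itself, so remote-cluster seeds that target the host are redirected to
// `host.docker.internal` (resolvable when the container is started with
// `--add-host host.docker.internal:host-gateway`).
export function rewriteRemoteSeedsForDocker(esArgs: string[]): string[] {
  return esArgs.map((arg) => {
    const eq = arg.indexOf('=');
    if (eq === -1) return arg;

    const key = arg.slice(0, eq);
    if (!REMOTE_SEEDS_KEY.test(key)) return arg;

    const seeds = arg
      .slice(eq + 1)
      .split(',')
      .map((seed) => seed.trim().replace(/^localhost(?=:|$)/, 'host.docker.internal'));

    return `${key}=${seeds.join(',')}`;
  });
}
```

Only `cluster.remote.*.seeds` keys are touched, so other settings that legitimately mention `localhost` pass through unchanged.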
- Rename loadModules to tryLoadModules for clarity
- Move CreateVagrantVmOptions interface next to createVagrantVm
- Log success message after VirtualBox 7.1 upgrade
5efac7e to 945347c
gergoabraham
left a comment
looks amazing, thanks for improving on the test env! 🙇
@elastic/security-defend-workflows related changes look good 👍
ashokaditya
left a comment
Thanks for adding changes from the previous PR review! Tested this out and it works as expected.
const containerName = options.name || 'es01';
const port = options.port || DEFAULT_PORT;
const password = options.password || 'changeme';
const transportPort = options.transportPort ?? port + 100;
this is not following the existing logic, maybe we can unify it:
kibana/src/platform/packages/shared/kbn-test/src/es/es_test_config.ts
Lines 35 to 37 in 773bb6a
dmlemeshko
left a comment
LGTM, left a nit about TEST_ES_TRANSPORT_PORT variable.
Please double check with Ops Team about potential impact.
Resolve the transport port from esTestConfig.getTransportPort() for the Docker ES path, matching the existing behavior of the local snapshot flow. Parses the first port from range strings (e.g. '9300-9400') since Docker requires a single numeric port binding.
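The range parsing described in the commit above can be sketched as follows. The helper names `firstPortFromRange` and `transportPublishArg` are hypothetical; the real resolution goes through `esTestConfig.getTransportPort()`, and only the "take the first port of a range" rule and the container port 9300 are taken from this conversation.

```typescript
// Hypothetical helper: Docker port bindings need one numeric host port, so a
// range such as '9300-9400' is reduced to its first port.
export function firstPortFromRange(portOrRange: string | number): number {
  const [first = ''] = String(portOrRange).split('-');
  const candidate = first.trim();

  if (!/^\d+$/.test(candidate)) {
    throw new Error(`Cannot parse a port from "${portOrRange}"`);
  }

  const port = Number.parseInt(candidate, 10);
  if (port < 1 || port > 65535) {
    throw new Error(`Port out of range: ${port}`);
  }
  return port;
}

// Hypothetical helper: value for `docker run --publish`, mapping the host-side
// transport port to the container's transport port 9300.
export function transportPublishArg(portOrRange: string | number): string {
  return `${firstPortFromRange(portOrRange)}:9300`;
}
```

For example, `transportPublishArg('9300-9400')` yields `9300:9300`, while a single value such as `9301` yields `9301:9300`.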
💚 Build Succeeded
delanni
left a comment
Unfortunately this introduces even more modes to run ES in Kibana, which will surely cause more confusion, but I understand it's beneficial for a multi-repo use case 👍
…ps-config-rebase

* commit 'f135f030951237c5e9b0251931441aee3121b31d': (163 commits)
  [CPS] Support data view requests and do not sanitize project_routing in data plugin/resolve indices (elastic#253654)
  [One Workflow] Execute workflow from historical (elastic#253396)
  [streams][background tasks] gracefully handle non existing stream (elastic#254683)
  [Lens API] Waffle/Mosaic get green as a default color (elastic#254304)
  [Security Solution] Remove prebuilt rules customization callout on Rule Management page (elastic#254386)
  [Workflows] support passing attachments to run_agent step (elastic#251291)
  [One Discover][Logs UX] Update OpenTelemetry Semantic Conventions (elastic#254367)
  [kbn-es] Add --docker flag to yarn es snapshot (elastic#254306)
  [Workplace AI] Remove Data Source Config (elastic#254521)
  [Entity Store v2] Add CRUD API (elastic#252052)
  [CI] Increase type checking machine (elastic#254676)
  [main] Sync bundled packages with Package Storage (elastic#254232)
  Skip flaky test elastic#254625 (elastic#254662)
  Upgrade `@elastic/elasticsearch` to 9.3.1 (elastic#253660)
  [One Workflow] Migrate http step to new connector (elastic#249004)
  [Entity Store] Store EUID Scripts (elastic#254515)
  [APM] Fix Otel missing fields undefined errors (elastic#254271)
  [Console] Add support for documentation links on Serverless (elastic#254489)
  Create edit ILM flow (elastic#253393)
  [Agent Builder] Mid term: minimal recommended model set elastic#12875 (elastic#254560)
  ...
## Summary

Adds a `--docker` flag to `yarn es snapshot` that runs Elasticsearch in a Docker container with **1:1 equivalent behavior** to the local snapshot flow, and wires Docker-based ES as the **default mode for non-serverless Cypress tests** (opt-out via `CYPRESS_ES_FROM=snapshot`).

### What changed

**Cypress tests: Docker ES as default for non-serverless**

- Non-serverless Cypress tests now use `esFrom: 'docker'` by default, starting ES in a Docker container instead of downloading and extracting a local snapshot
- Set `CYPRESS_ES_FROM=snapshot` to revert to the previous behavior
- Serverless Cypress tests are unaffected: they continue using `esFrom: 'serverless'`

**New `esFrom: 'docker'` mode for test infrastructure**

- `createTestEsCluster` now supports `esFrom: 'docker'`, starting ES in a Docker container instead of downloading and extracting a local snapshot
- `runDockerSnapshotContainer` accepts `name` (unique container naming for parallel safety) and `background` (non-blocking mode for test integration)
- `Cluster` class tracks Docker snapshot containers and handles cleanup (kill + rm) on stop
- New `stopDockerSnapshotContainer()` utility for explicit container teardown
- Transport port resolution unified with `TEST_ES_TRANSPORT_PORT` env var, matching the existing `esTestConfig.getTransportPort()` behavior (parses first port from range strings like `'9300-9400'`)

**New `--docker` mode for `yarn es snapshot`**

- `--docker`: run ES in Docker instead of downloading locally
- `--port`: bind port (default 9200, Docker mode only)
- `--kill`: kill existing ES containers before starting (Docker mode only)
- All existing flags work: `--license`, `--password`, `--ssl`, `-E`, `--skip-ready-check`, `--ready-timeout`
- `-E path.data=<relative-path>` is automatically mapped to a Docker volume mount for data persistence
- `-E` values that reference host files (e.g. JWT JWKS keys) are automatically volume-mounted into the container with path rewriting
- `--license=trial` is mapped to `xpack.license.self_generated.type=trial`
- Same default esArgs as the local snapshot flow (`action.destructive_requires_name`, `cluster.routing.allocation.disk.threshold_enabled=false`, etc.)
- Waits for cluster readiness and sets up native realm passwords, matching local behavior

**Usage example:**

```bash
# Default for non-serverless (Docker ES)
node scripts/run_cypress parallel --config ...

# Opt-out to local snapshot
CYPRESS_ES_FROM=snapshot node scripts/run_cypress parallel --config ...

# yarn es snapshot with Docker
yarn es snapshot --docker -E path.data=../my-data --license=trial
```

### Docker networking & container management

- ES binds to `0.0.0.0` inside the container for Fleet Server / Elastic Agent connectivity from other containers and VMs
- `--add-host host.docker.internal:host-gateway` added to ES containers for cross-cluster seed resolution
- `cluster.remote.*.seeds` esArgs automatically rewritten from `localhost` to `host.docker.internal` inside Docker
- Fleet Server uses `host.docker.internal` for ES connectivity when ES runs in Docker
- CCS transport port (`transportPort`) properly mapped to container port 9300
- Stale containers are force-removed before starting a new one (handles pRetry retries cleanly)
- Only `verifyDockerInstalled()` + `maybeCreateDockerNetwork()` are called for Docker snapshot setup, which avoids interfering with serverless containers (no `detectRunningNodes` / `cleanUpDanglingContainers`)

### Expected boot time gains (when using Docker)

Running ES via Docker eliminates the snapshot download + extraction overhead. With a pre-pulled image the gains are significant:

| Phase | Snapshot (local) | Docker | Savings |
|-------|-----------------|--------|---------|
| Download archive | ~30-60s (cold) / 0s (cached) | 0s (image cached) | up to ~60s |
| Extract / install | ~10-20s | 0s | ~10-20s |
| Container startup | N/A | ~2-5s | - |
| Cluster ready check | ~10-15s | ~10-15s | ~0s |
| Native realm setup | ~2-3s | ~2-3s | ~0s |
| **Total (cold)** | **~55-100s** | **~15-25s** | **~40-75s** |
| **Total (warm)** | **~25-40s** | **~15-25s** | **~10-20s** |

Key observations:

- **Cold start** (first run / CI with no cache): Docker avoids the full snapshot download (~200MB) and extraction. The Docker image is pulled once and cached across runs.
- **Warm start** (snapshot already cached locally): Docker still saves ~10-20s by skipping archive extraction and JVM bootstrap overhead.
- **CI impact**: Requires Docker-in-Docker support on CI agents. When available, the Docker image can be pre-cached in the base image or pulled in parallel with other setup steps.
- **Per-spec overhead**: Since the Cypress parallel runner starts/stops ES for each spec file, savings compound, e.g. 10 spec files × 15s saved = ~2.5 minutes total.

**`native_realm.js` → `native_realm.ts` migration**

- Full TypeScript conversion with typed constructor (`NativeRealmOptions`), retry options (`RetryOpts`), and all method signatures
- Removed legacy `body` wrapper on ES client API calls (not needed with newer `@elastic/elasticsearch` client)
- Converted from mixed CJS/ESM to pure ESM, removing the `@ts-expect-error` suppression in the barrel export

**Cleanup**

- Removed stale arm64 `-XX:UseSVE=0` JVM workaround ([elastic/elasticsearch#118583](elastic/elasticsearch#118583)), no longer needed at ES 9.x
- Extracted duplicated Java opts logic into the inline arrays directly

### Files changed

| File | Change |
|------|--------|
| `kbn-es/src/cli_commands/snapshot.ts` | Added `--docker`, `--port`, `--kill` flags |
| `kbn-es/src/cluster.ts` | Added `runDockerSnapshot()` method, Docker container tracking and cleanup |
| `kbn-es/src/cluster_exec_options.ts` | Added Docker-related exec option types |
| `kbn-es/src/utils/docker.ts` | Added `runDockerSnapshotContainer()` with `name`/`background`/`transportPort` options, `stopDockerSnapshotContainer()`, host file volume mounting, `host.docker.internal` support, removed arm64 workaround |
| `kbn-es/src/utils/docker_uiam.ts` | Updated for Docker snapshot compatibility |
| `kbn-es/src/utils/index.ts` | Removed `@ts-expect-error` for native_realm |
| `kbn-es/src/utils/native_realm.js → .ts` | TypeScript migration |
| `kbn-es/src/utils/native_realm.test.js → .ts` | TypeScript migration |
| `kbn-test/src/es/test_es_cluster.ts` | Added `esFrom: 'docker'` branch, unified transport port resolution with `TEST_ES_TRANSPORT_PORT` |
| `security_solution/scripts/run_cypress/parallel.ts` | Default to Docker ES for non-serverless tests, `CYPRESS_ES_FROM` env var for opt-out |

## Risk

Low: additive feature behind new flags/env vars. Serverless tests are untouched. Non-serverless Cypress tests default to Docker but can opt out via `CYPRESS_ES_FROM=snapshot`. The native_realm migration preserves identical runtime behavior.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
…264218)

## Summary

Default Cypress stateful Elasticsearch provisioning to `snapshot` on CI and keep `docker` for local development.

The earlier switch to Docker as the universal default (#254306) was motivated by:

- making local dev match shipped artifacts,
- multi-arch support for Apple Silicon,
- avoiding per-spec snapshot extraction,
- faster warm starts on developer machines.

All four are genuine wins **for local dev**. On CI they either don't apply, are neutral, or are actively counter-productive. After gathering empirical data from Buildkite, the right default on CI is `snapshot`; on workstations the right default stays `docker`.

## Why snapshot on CI

1. **No version-skew race.** Kibana CI already resolves an ES snapshot manifest once per build in [`.buildkite/scripts/lifecycle/pre_build.sh`](https://github.com/elastic/kibana/blob/main/.buildkite/scripts/lifecycle/pre_build.sh) against `kibana-ci-es-snapshots-daily`, Kibana's own daily-verified bucket, version-locked to Kibana by construction. The post-version-bump window (`9.5.0`, `9.6.0`, …) that my earlier auto-detect probe tried to guard against doesn't actually exist for stateful Cypress on CI: the tar.gz is already there, or `pre_build.sh` has already failed the build before any Cypress agent starts. A Docker image for that same version is _not_ guaranteed to exist at the same moment, which is the exact failure mode we kept running into.
2. **Docker-on-CI is not meaningfully faster on the same hardware.** I pulled job durations from Buildkite for `kibana-on-merge` Security Solution Cypress jobs before and after #254306 and reconciled them against the Buildkite agent machine-type change (`n2-standard-4` → `n2-highmem-4`) that landed in the same window. Controlling for that hardware change, ES start-up on a warm CI agent is ~5s different between snapshot tar.gz and Docker, within noise for a 20–40 minute Cypress group. The speedups originally attributed to Docker were largely a hardware upgrade.
3. **ES starts once per FTR config group, not per spec.** `parallel.ts` provisions ES once for each group in `specGroups`, runs all specs in that group against the same cluster, then shuts down (see [`runSpecGroup`](https://github.com/elastic/kibana/blob/main/x-pack/solutions/security/plugins/security_solution/scripts/run_cypress/parallel.ts)). Only retry runs go per-spec. So the "Docker avoids per-spec extraction on CI" argument is mostly about retries, which are a tiny fraction of total runtime.
4. **Fewer moving parts on CI.** No Docker registry auth, no Docker pull on every agent, no fallback logic between Docker and snapshot, no GCS probe script. Snapshot tar.gz is already pre-fetched/cached by the standard Kibana CI lifecycle.

## Why keep Docker for local dev

1. Matches shipped artifacts byte-for-byte.
2. Native multi-arch (Apple Silicon) without a separate tar.gz pipeline.
3. Warm starts are fast once the image is cached on the workstation.
4. `CYPRESS_ES_FROM=snapshot` (or `docker`) still works as an explicit override for both environments.

## Change

```ts
const defaultEsFrom = process.env.CI ? 'snapshot' : 'docker';
const esFrom =
  configEsFrom === 'serverless' ? 'serverless' : esFromEnv || defaultEsFrom;
```

Also drops the earlier `detect_cypress_es_from.sh` probe and its hook in `setup_job_env.sh`; `pre_build.sh` already covers the version-skew concern at a better layer.

The serverless routing fix (`configEsFrom === 'serverless'` wins over `CYPRESS_ES_FROM`) is retained from the first commit and is independent of the default flip: it prevents stateful `CYPRESS_ES_FROM=snapshot` from accidentally booting serverless suites against a stateful snapshot tar.gz and blowing up with `unknown setting [xpack.security.authc.native_roles.enabled]`.

## Test plan

- [ ] Green `kibana-on-merge` Security Solution Cypress jobs (stateful + serverless).
- [ ] Green `kibana-pull-request` Security Solution Cypress jobs with no `CYPRESS_ES_FROM` set.
- [ ] Local: `yarn cypress:run ...` still uses Docker by default.
- [ ] Local: `CYPRESS_ES_FROM=snapshot yarn cypress:run ...` uses snapshot.
- [ ] Serverless suites remain on `serverless` regardless of `CYPRESS_ES_FROM`.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
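The selection rule from the Change section of #264218 can be wrapped in a pure function so that the precedence cases in the test plan are easy to check in isolation. This is a sketch only: `resolveEsFrom` and its input shape are hypothetical, and `isCi` stands in for the truthiness of `process.env.CI`.

```typescript
interface ResolveEsFromInput {
  configEsFrom?: string; // `esFrom` declared by the FTR config, if any
  esFromEnv?: string; // value of CYPRESS_ES_FROM, if set
  isCi: boolean; // stand-in for Boolean(process.env.CI)
}

// Hypothetical wrapper around the expression shown in the PR description.
export function resolveEsFrom({ configEsFrom, esFromEnv, isCi }: ResolveEsFromInput): string {
  const defaultEsFrom = isCi ? 'snapshot' : 'docker';
  // Serverless configs always win; otherwise the env override, then the default.
  return configEsFrom === 'serverless' ? 'serverless' : esFromEnv || defaultEsFrom;
}
```

With this shape, a serverless config paired with `CYPRESS_ES_FROM=snapshot` still resolves to `serverless`, an unset override resolves to `snapshot` on CI and `docker` locally, and an explicit override applies in both environments.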
…lly (#264218) (#267726)

# Backport

This will backport the following commits from `main` to `9.4`:

- [ci(cypress): default stateful ES to snapshot on CI, docker locally (#264218)](#264218)

<!--- Backport version: 9.6.6 -->

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sorenlouv/backport)

<!--BACKPORT [{"author":{"name":"Patryk Kopyciński","email":"contact@patrykkopycinski.com"},"sourceCommit":{"committedDate":"2026-05-05T12:40:20Z","message":"ci(cypress): default stateful ES to snapshot on CI, docker locally (#264218)\n\n…
snapshot extraction,\n- faster warm starts on developer machines.\n\nAll four are genuine wins **for local dev**. On CI they either don't\napply, are neutral, or are actively counter-productive. After gathering\nempirical data from Buildkite, the right default on CI is `snapshot`; on\nworkstations the right default stays `docker`.\n\n## Why snapshot on CI\n\n1. **No version-skew race.** Kibana CI already resolves an ES snapshot\nmanifest once per build in\n[`.buildkite/scripts/lifecycle/pre_build.sh`](https://github.com/elastic/kibana/blob/main/.buildkite/scripts/lifecycle/pre_build.sh)\nagainst `kibana-ci-es-snapshots-daily` — Kibana's own daily-verified\nbucket, version-locked to Kibana by construction. The post-version-bump\nwindow (`9.5.0`, `9.6.0`, …) that my earlier auto-detect probe tried to\nguard against doesn't actually exist for stateful Cypress on CI: the\ntar.gz is already there, or `pre_build.sh` has already failed the build\nbefore any Cypress agent starts. A Docker image for that same version is\n_not_ guaranteed to exist at the same moment — which is the exact\nfailure mode we kept running into.\n\n2. **Docker-on-CI is not meaningfully faster on the same hardware.** I\npulled job durations from Buildkite for `kibana-on-merge` Security\nSolution Cypress jobs before and after #254306 and reconciled them\nagainst the Buildkite agent machine-type change (`n2-standard-4` →\n`n2-highmem-4`) that landed in the same window. Controlling for that\nhardware change, ES start-up on a warm CI agent is ~5s different between\nsnapshot tar.gz and Docker — within noise for a 20–40 minute Cypress\ngroup. The speedups originally attributed to Docker were largely a\nhardware upgrade.\n\n3. 
**ES starts once per FTR config group, not per spec.** `parallel.ts`\nprovisions ES once for each group in `specGroups`, runs all specs in\nthat group against the same cluster, then shuts down (see\n[`runSpecGroup`](https://github.com/elastic/kibana/blob/main/x-pack/solutions/security/plugins/security_solution/scripts/run_cypress/parallel.ts)).\nOnly retry runs go per-spec. So the \"Docker avoids per-spec extraction\non CI\" argument is mostly about retries, which are a tiny fraction of\ntotal runtime.\n\n4. **Fewer moving parts on CI.** No Docker registry auth, no Docker pull\non every agent, no fallback logic between Docker and snapshot, no GCS\nprobe script. Snapshot tar.gz is already pre-fetched/cached by the\nstandard Kibana CI lifecycle.\n\n## Why keep Docker for local dev\n\n1. Matches shipped artifacts byte-for-byte.\n2. Native multi-arch (Apple Silicon) without a separate tar.gz pipeline.\n3. Warm starts are fast once the image is cached on the workstation.\n4. `CYPRESS_ES_FROM=snapshot` (or `docker`) still works as an explicit\noverride for both environments.\n\n## Change\n\n```ts\nconst defaultEsFrom = process.env.CI ? 'snapshot' : 'docker';\nconst esFrom =\n configEsFrom === 'serverless' ? 
'serverless' : esFromEnv || defaultEsFrom;\n```\n\nAlso drops the earlier `detect_cypress_es_from.sh` probe and its hook in\n`setup_job_env.sh` — `pre_build.sh` already covers the version-skew\nconcern at a better layer.\n\nThe serverless routing fix (`configEsFrom === 'serverless'` wins over\n`CYPRESS_ES_FROM`) is retained from the first commit and is independent\nof the default flip — it prevents stateful `CYPRESS_ES_FROM=snapshot`\nfrom accidentally booting serverless suites against a stateful snapshot\ntar.gz and blowing up with `unknown setting\n[xpack.security.authc.native_roles.enabled]`.\n\n## Test plan\n\n- [ ] Green `kibana-on-merge` Security Solution Cypress jobs (stateful +\nserverless).\n- [ ] Green `kibana-pull-request` Security Solution Cypress jobs with no\n`CYPRESS_ES_FROM` set.\n- [ ] Local: `yarn cypress:run ...` still uses Docker by default.\n- [ ] Local: `CYPRESS_ES_FROM=snapshot yarn cypress:run ...` uses\nsnapshot.\n- [ ] Serverless suites remain on `serverless` regardless of\n`CYPRESS_ES_FROM`.\n\n---------\n\nCo-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>","sha":"66c8e08c9b8dec386784e7af9f2a981464ae43f1"}}]}] BACKPORT--> Co-authored-by: Patryk Kopyciński <contact@patrykkopycinski.com>
Summary
Adds a `--docker` flag to `yarn es snapshot` that runs Elasticsearch in a Docker container with 1:1 equivalent behavior to the local snapshot flow, and wires Docker-based ES as the default mode for non-serverless Cypress tests (opt-out via `CYPRESS_ES_FROM=snapshot`).

What changed
Cypress tests: Docker ES as default for non-serverless
- `esFrom: 'docker'` is now the default, starting ES in a Docker container instead of downloading and extracting a local snapshot
- `CYPRESS_ES_FROM=snapshot` reverts to the previous behavior
- `esFrom: 'serverless'` is unchanged for serverless tests

New `esFrom: 'docker'` mode for test infrastructure

- `createTestEsCluster` now supports `esFrom: 'docker'`, starting ES in a Docker container instead of downloading and extracting a local snapshot
- `runDockerSnapshotContainer` accepts `name` (unique container naming for parallel safety) and `background` (non-blocking mode for test integration)
- `Cluster` class tracks Docker snapshot containers and handles cleanup (kill + rm) on stop
- `stopDockerSnapshotContainer()` utility for explicit container teardown
- Transport port resolution unified with the `TEST_ES_TRANSPORT_PORT` env var, matching the existing `esTestConfig.getTransportPort()` behavior (parses the first port from range strings like `'9300-9400'`)

New `--docker` mode for `yarn es snapshot`

- `--docker`: run ES in Docker instead of downloading locally
- `--port`: bind port (default 9200, Docker mode only)
- `--kill`: kill existing ES containers before starting (Docker mode only)
- Also accepted in Docker mode: `--license`, `--password`, `--ssl`, `-E`, `--skip-ready-check`, `--ready-timeout`
- `-E path.data=<relative-path>` is automatically mapped to a Docker volume mount for data persistence
- `-E` values that reference host files (e.g. JWT JWKS keys) are automatically volume-mounted into the container with path rewriting
- `--license=trial` is mapped to `xpack.license.self_generated.type=trial`
- ES settings applied in Docker mode (`action.destructive_requires_name`, `cluster.routing.allocation.disk.threshold_enabled=false`, etc.)

Usage example:
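A minimal TypeScript sketch of how the flags listed above might be combined into a single invocation. `buildEsSnapshotArgs` and `DockerSnapshotFlags` are hypothetical names used only for illustration and are not part of `kbn-es`; only the flag names come from this description, and the values (port `9220`, the relative data path) are made up.

```typescript
// Hypothetical helper (not part of kbn-es): assembles the argument list for
// `yarn es snapshot` in Docker mode from a small options object.
interface DockerSnapshotFlags {
  port?: number; // --port, Docker mode only
  kill?: boolean; // --kill, Docker mode only
  license?: string; // e.g. 'trial'
  ssl?: boolean;
  esArgs?: Record<string, string>; // passed through as -E key=value
}

function buildEsSnapshotArgs(flags: DockerSnapshotFlags): string[] {
  const args: string[] = ['es', 'snapshot', '--docker'];
  if (flags.port !== undefined) args.push(`--port=${flags.port}`);
  if (flags.kill) args.push('--kill');
  if (flags.license) args.push(`--license=${flags.license}`);
  if (flags.ssl) args.push('--ssl');
  for (const [key, value] of Object.entries(flags.esArgs ?? {})) {
    args.push('-E', `${key}=${value}`);
  }
  return args;
}

// Illustrative values only.
const exampleArgs = buildEsSnapshotArgs({
  port: 9220,
  kill: true,
  license: 'trial',
  esArgs: { 'path.data': '../es-data' },
});
console.log(['yarn', ...exampleArgs].join(' '));
```

The printed line is a command that could be pasted into a shell from the Kibana repository root, assuming Docker is installed and running.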
Docker networking & container management
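One of the behaviours listed in this section, rewriting `cluster.remote.*.seeds` from `localhost` to `host.docker.internal`, can be sketched as below. This is an illustration under assumptions, not the code from `kbn-es/src/utils/docker.ts`: the function name `rewriteRemoteSeedsForDocker`, the `key=value` string form of esArgs, and the regular expressions are all hypothetical.

```typescript
// Assumption: esArgs are plain `key=value` strings, as passed via `-E`.
// Matches settings such as `cluster.remote.my_remote.seeds=localhost:9310`.
const REMOTE_SEEDS_SETTING = /^cluster\.remote\.[^.=]+\.seeds=/;

function rewriteRemoteSeedsForDocker(esArgs: string[]): string[] {
  return esArgs.map((arg) => {
    if (!REMOTE_SEEDS_SETTING.test(arg)) {
      return arg; // leave unrelated settings untouched
    }
    const separator = arg.indexOf('=');
    const key = arg.slice(0, separator);
    const seeds = arg
      .slice(separator + 1)
      .split(',')
      // Only a leading `localhost` host is replaced; the port is preserved.
      .map((seed) => seed.trim().replace(/^localhost(?=:|$)/, 'host.docker.internal'));
    return `${key}=${seeds.join(',')}`;
  });
}
```

Inside a container, `localhost` refers to the container itself, which is why a seed pointing at a port published on the host needs a hostname that resolves to the host, here the one provided by `--add-host host.docker.internal:host-gateway`.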
- Binds to `0.0.0.0` inside the container for Fleet Server / Elastic Agent connectivity from other containers and VMs
- `--add-host host.docker.internal:host-gateway` added to ES containers for cross-cluster seed resolution
- `cluster.remote.*.seeds` esArgs automatically rewritten from `localhost` to `host.docker.internal` inside Docker
- `host.docker.internal` used for ES connectivity when ES runs in Docker
- Transport port (`transportPort`) properly mapped to container port 9300
- Only `verifyDockerInstalled()` + `maybeCreateDockerNetwork()` are called for Docker snapshot setup, which avoids interfering with serverless containers (no `detectRunningNodes`/`cleanUpDanglingContainers`)

Expected boot time gains (when using Docker)
Running ES via Docker eliminates the snapshot download + extraction overhead. With a pre-pulled image the gains are significant:
Key observations:
`native_realm.js` → `native_realm.ts` migration

- Typed options (`NativeRealmOptions`), retry options (`RetryOpts`), and all method signatures
- Removed the `body` wrapper on ES client API calls (not needed with newer `@elastic/elasticsearch` client)
- Removed the `@ts-expect-error` suppression in the barrel export

Cleanup

- Removed the arm64 `-XX:UseSVE=0` JVM workaround (elastic/elasticsearch#118583), no longer needed at ES 9.x

Files changed
| File | Change |
| --- | --- |
| `kbn-es/src/cli_commands/snapshot.ts` | `--docker`, `--port`, `--kill` flags |
| `kbn-es/src/cluster.ts` | `runDockerSnapshot()` method, Docker container tracking and cleanup |
| `kbn-es/src/cluster_exec_options.ts` | |
| `kbn-es/src/utils/docker.ts` | `runDockerSnapshotContainer()` with `name`/`background`/`transportPort` options, `stopDockerSnapshotContainer()`, host file volume mounting, `host.docker.internal` support, removed arm64 workaround |
| `kbn-es/src/utils/docker_uiam.ts` | |
| `kbn-es/src/utils/index.ts` | `@ts-expect-error` for native_realm |
| `kbn-es/src/utils/native_realm.js → .ts` | |
| `kbn-es/src/utils/native_realm.test.js → .ts` | |
| `kbn-test/src/es/test_es_cluster.ts` | `esFrom: 'docker'` branch, unified transport port resolution with `TEST_ES_TRANSPORT_PORT` |
| `security_solution/scripts/run_cypress/parallel.ts` | `CYPRESS_ES_FROM` env var for opt-out |

Risk
Low: additive feature behind new flags/env vars. Serverless tests are untouched. Non-serverless Cypress tests default to Docker but can opt out via `CYPRESS_ES_FROM=snapshot`. The `native_realm` migration preserves identical runtime behavior.
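
The opt-out described above can be sketched as a small resolution function. This is an illustration under assumptions rather than the logic in `parallel.ts`: `resolveCypressEsFrom`, `configEsFrom`, and `envValue` are hypothetical names, and the rule that serverless configs ignore the env var is an assumption drawn from "Serverless tests are untouched".

```typescript
type EsFrom = 'docker' | 'snapshot' | 'serverless';

// Hypothetical helper: decides how ES is provisioned for one Cypress config.
function resolveCypressEsFrom(
  configEsFrom: string | undefined, // value declared by the FTR config, if any
  envValue: string | undefined // e.g. process.env.CYPRESS_ES_FROM
): EsFrom {
  // Assumption: serverless configs are never redirected to Docker or snapshot.
  if (configEsFrom === 'serverless') {
    return 'serverless';
  }
  // Explicit opt-out restores the previous download-and-extract flow.
  if (envValue === 'snapshot') {
    return 'snapshot';
  }
  // Default for non-serverless Cypress tests in this change.
  return 'docker';
}

console.log(resolveCypressEsFrom(undefined, process.env.CYPRESS_ES_FROM));
```

Keeping the decision in one pure function makes the default and the opt-out easy to unit-test without starting a container.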