[AO] realign-kibana-265798-inline-local-into-kbn-evals-retire-kbn#268862

Draft
patrykkopycinski wants to merge 9 commits into elastic:main from patrykkopycinski:ao/realign-kibana-265798-inl-f9be30

@patrykkopycinski
Contributor

Auto-generated by patryks-treadmill (per-plan worktree)

Realign kibana#265798: inline --local into @kbn/evals, retire @kbn/evals-local

Why

PR #265798 put too much responsibility on the framework too early — auto-provisioning a runtime, maintaining a model registry, running a tool-calling probe before every eval, auto-generating recommendatio

Tasks completed

  • Add export in x-pack/platform/packages/shared/kbn-evals/index.ts: export { injectLocalConnector } from './src/cli/inject_local_connector'
  • Delete the entire x-pack/platform/packages/shared/kbn-evals-local/ directory via git rm -r
  • Move x-pack/platform/packages/shared/kbn-evals-local/src/cli/inject.ts to x-pack/platform/packages/shared/kbn-evals/src/cli/inject_local_connector.ts, preserving all logic verbatim (hard-fail guard, process.argv sync, execFileSync model-list call)
  • Remove kbn-evals-local entry from .github/CODEOWNERS
  • Remove @kbn/evals-local path mapping from tsconfig.base.json
  • Remove @kbn/evals-local workspace entry from root package.json
  • Run yarn install to regenerate yarn.lock without stale @kbn/evals-local entries
  • Update scripts/evals.js to require from @kbn/evals instead of @kbn/evals-local for the --local branch, preserving the .then() chaining
  • In x-pack/platform/packages/shared/kbn-evals/src/cli/commands/run.ts, replace the --dry-run early-return stub with logic that: sets EVALUATION_REPETITIONS=1 in envOverrides, sets EVALUATION_DRY_RUN=true in envOverrides, prints the [DRY-RUN] sampling 1 example per dataset, repetitions=1 banner, and falls through to spawn Playwright (~15 LOC)
  • Update dataset fixture helpers to check for EVALUATION_DRY_RUN=true and slice the examples array to [examples[0]] when true (~15–35 LOC across fixture files)
  • Add "Local model quick start" section to x-pack/platform/packages/shared/kbn-evals/README.md containing: one-line install command (brew install ollama && ollama pull <model>), one recommended model per RAM tier (16 GB / 32 GB / 48 GB / 64 GB+), required env var (EVAL_TASK_TIMEOUT_MS=600000), guidance on when to use --local vs --dry-run, and pointer to the local-evals skill in elastic-agent-builder-skill-dev for automated orchestration
  • Re-request review from @SrdjanLL and @viduni94
  • Run node scripts/eslint --fix on all changed files and verify no remaining lint errors
  • Run node scripts/type_check --project x-pack/platform/packages/shared/kbn-evals/tsconfig.json and verify zero exit code
  • Smoke test: execute node scripts/evals run --suite agent-builder --dry-run and verify 1 example per dataset executes with EVALUATION_REPETITIONS=1
  • Smoke test: execute node scripts/evals run --suite agent-builder --local with no Ollama running and verify hard-fail with actionable error message and non-zero exit
  • Smoke test: with running local Ollama, execute node scripts/evals run --suite agent-builder --local end-to-end
  • Update PR description with: "(a) --local connector injection in @kbn/evals, (b) configurable timeout, (c) --dry-run, (d) doc. Orchestrator + benchmark + registry moved to elastic-agent-builder-skill-dev — see ."

One commit per task on a single shared plan branch.
This PR was autonomously generated and verified by the patryks-treadmill pipeline.

patrykkopycinski and others added 9 commits May 12, 2026 09:14
Move src/cli/inject.ts from @kbn/evals-local into @kbn/evals as
src/cli/inject_local_connector.ts. The file is made self-contained by
inlining the runtime-detection helpers (probeEndpoint, getOllamaModels,
getLmStudioModel, commandExists, detect) and the connector env-setter
from connector_factory, both of which are being deleted in the broader
kbn-evals-local retirement. The ModelRegistry dependency is dropped in
favour of accepting --local-model values as plain model-name strings
(bare-connector path). The three load-bearing behaviours are preserved
verbatim: the hard-fail guard when no endpoint is detected, the
process.argv strip-and-sync after --local removal, and the
execFileSync-based commandExists call that prevents shell injection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
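The three behaviours this commit preserves can be sketched roughly as follows. This is a minimal illustration, not the code in inject_local_connector.ts: the helper names `stripLocalFlags` and `assertRuntimeDetected`, the use of `which`, and the error text are assumptions made for the example; only the flag names and the use of `execFileSync` come from the commit message.

```typescript
import { execFileSync } from "node:child_process";

// Removes --local, --local-endpoint <v> and --local-model <v> from argv IN PLACE,
// so any later consumer of the same array sees a clean argument list.
function stripLocalFlags(argv: string[]): { endpoint?: string; model?: string } {
  const parsed: { endpoint?: string; model?: string } = {};
  let i = 0;
  while (i < argv.length) {
    const arg = argv[i];
    if (arg === "--local") {
      argv.splice(i, 1);
      continue;
    }
    if (arg === "--local-endpoint" || arg === "--local-model") {
      const [, value] = argv.splice(i, 2); // drop the flag and its value
      if (arg === "--local-endpoint") parsed.endpoint = value;
      else parsed.model = value;
      continue;
    }
    i++;
  }
  return parsed;
}

// execFileSync passes `name` as a single argument to the binary; no shell
// ever parses it, so metacharacters such as `;` or `$()` are inert.
// Assumption: `which` is used for lookup; the real helper may differ.
function commandExists(name: string): boolean {
  try {
    execFileSync("which", [name], { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

// Hard-fail guard: refuse to continue when no endpoint was detected.
// The message text is hypothetical.
function assertRuntimeDetected(endpoint: string | undefined): string {
  if (!endpoint) {
    throw new Error(
      "No local model runtime detected. Start Ollama or LM Studio, or pass --local-endpoint <url>."
    );
  }
  return endpoint;
}
```

Throwing instead of silently falling back keeps a misconfigured `--local` run from proceeding against an unintended connector.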
Adds the named re-export so scripts/evals.js can require it directly
from @kbn/evals instead of @kbn/evals-local.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When scripts/evals.js is invoked with --local, call
injectLocalConnector(process.argv) from @kbn/evals before handing off
to cli.run(). Passing process.argv directly (not a slice) lets the
function strip --local / --local-endpoint / --local-model in-place so
cli.run() sees a clean argv. The .then() chain ensures cli.run() only
starts after connector env vars are set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
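The ordering guarantee described in this commit might look like the sketch below. The `Inject` and `Cli` types and the `runEvals` wrapper are hypothetical stand-ins for illustration; the real entry point is scripts/evals.js, which calls `injectLocalConnector` and `cli.run()` from @kbn/evals.

```typescript
// Hypothetical shapes standing in for the real @kbn/evals exports.
type Inject = (argv: string[]) => Promise<void>;
type Cli = { run: (argv: string[]) => void };

function runEvals(argv: string[], inject: Inject, cli: Cli): Promise<void> {
  if (!argv.includes("--local")) {
    cli.run(argv);
    return Promise.resolve();
  }
  // Pass the same array rather than argv.slice(): the injector mutates it
  // in place, so the CLI below sees argv without the --local* flags.
  // Chaining with .then() ensures the CLI starts only after the injector
  // has finished setting the connector environment variables.
  return inject(argv).then(() => cli.run(argv));
}
```

Had a copy of `argv` been passed, the injector would strip flags from the copy and `cli.run()` would still receive `--local`, which it presumably does not recognise.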
…anner

Replace the early-return stub with env-var injection so --dry-run falls
through to spawn Playwright: sets EVALUATION_REPETITIONS=1 and
EVALUATION_DRY_RUN=true in envOverrides, prints the
'[DRY-RUN] sampling 1 example per dataset, repetitions=1' banner, then
proceeds to the existing spawn block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
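A rough sketch of the fall-through behaviour, assuming `envOverrides` is a plain string map later merged into the environment of the spawned Playwright process. The `applyDryRun` helper and its `log` parameter are hypothetical names; the two variable names and the banner text are taken from the commit message.

```typescript
// Mutates envOverrides when --dry-run is set and prints the banner.
// It deliberately does NOT return early from the caller: the run continues
// to the existing spawn step with the reduced workload.
function applyDryRun(
  dryRun: boolean,
  envOverrides: Record<string, string>,
  log: (message: string) => void
): void {
  if (!dryRun) return;
  envOverrides.EVALUATION_REPETITIONS = "1";
  envOverrides.EVALUATION_DRY_RUN = "true";
  log("[DRY-RUN] sampling 1 example per dataset, repetitions=1");
}
```

Environment variable values are always strings, which is why the overrides are `"1"` and `"true"` rather than a number and a boolean.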
…RUN=true

In KibanaEvalsClient.runExperiment(), check process.env.EVALUATION_DRY_RUN
and slice resolvedDataset.examples to [examples[0]] before the run loop.
This wires the --dry-run flag end-to-end: CLI sets EVALUATION_DRY_RUN=true,
Playwright inherits it, and the executor limits each dataset to one example.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
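The executor-side check could be approximated like this. `selectExamples` is a hypothetical helper, not a method of `KibanaEvalsClient`, and the environment is passed in as a parameter purely to keep the sketch testable; the commit itself reads `process.env.EVALUATION_DRY_RUN` inside `runExperiment()`.

```typescript
// Returns only the first example when EVALUATION_DRY_RUN is exactly "true",
// otherwise returns the dataset unchanged.
function selectExamples<T>(
  examples: T[],
  env: Record<string, string | undefined>
): T[] {
  return env.EVALUATION_DRY_RUN === "true" ? examples.slice(0, 1) : examples;
}
```

This sketch uses `slice(0, 1)` where the commit describes `[examples[0]]`. The two agree for any non-empty dataset; they differ only for an empty one, where `slice` yields `[]` instead of `[undefined]`.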
Covers: one-line Ollama install, one model recommendation per RAM tier
(16/32/48/64 GB+), EVAL_TASK_TIMEOUT_MS=600000 requirement, --local vs
--dry-run guidance, and pointer to elastic-agent-builder-skill-dev for
advanced benchmarking orchestration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…view

Move the --dry-run envOverrides block above the commandPreview snapshot so
EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true appear in the logged
"Running: ..." line. Previously the preview was built before the dry-run
mutation, making the logged command unreproducible when copy-pasted.

Runtime behavior is unchanged — spawn() always received the correct env.

Smoke test verified: node scripts/evals run --suite agent-builder --dry-run
--local prints the [DRY-RUN] banner, shows all overrides in the Running:
line, and Playwright starts 12 tests (1 per dataset spec) with
EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
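The ordering issue fixed here can be illustrated with a small sketch. `formatCommandPreview` and the `npx playwright test` command line are hypothetical; the point is only that a preview string is a snapshot, so it has to be built after every mutation of `envOverrides`.

```typescript
// Renders env overrides as KEY=value prefixes in front of the command,
// producing a line that can be copy-pasted into a shell.
function formatCommandPreview(
  envOverrides: Record<string, string>,
  command: string,
  args: string[]
): string {
  const envPart = Object.entries(envOverrides).map(([key, value]) => `${key}=${value}`);
  return [...envPart, command, ...args].join(" ");
}

const envOverrides: Record<string, string> = {};
const dryRun = true;

// 1. Apply the dry-run overrides first...
if (dryRun) {
  envOverrides.EVALUATION_REPETITIONS = "1";
  envOverrides.EVALUATION_DRY_RUN = "true";
}

// 2. ...then take the snapshot, so the logged line reflects what is spawned.
const commandPreview = formatCommandPreview(envOverrides, "npx", ["playwright", "test"]);
console.log(`Running: ${commandPreview}`);
```

Swapping steps 1 and 2 reproduces the earlier behaviour: the spawned process still receives the right environment, but the logged line omits the two variables.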
…d; fix timer leak

Adds inject_local_connector.test.ts with 7 unit tests covering the
hard-fail path (no Ollama, no LM Studio, no binary installed), the
env-var injection happy path, and the binary-installed-but-not-running
path. All assertions verified: throws with actionable message, strips
--local from args before detection, never sets EVALUATION_CONNECTOR_ID
or KIBANA_TESTING_AI_CONNECTORS when no runtime is found.

Also fixes a timer resource leak in probeEndpoint / getOllamaModels /
getLmStudioModel: clearTimeout was only called on the success path; moved
to finally{} so it fires on rejection too. This eliminated the
"Jest did not exit" open-handle warning that surfaced during test authoring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
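The timer fix follows a common pattern, sketched below under the assumption that the probes use an abort-on-timeout wrapper. `withTimeout` is a hypothetical helper, not the code in `probeEndpoint`; it only shows why `clearTimeout` belongs in `finally`.

```typescript
// Runs `work` with an abort signal that fires after timeoutMs.
// clearTimeout sits in `finally`, so the timer is released whether `work`
// resolves or rejects. If it were cleared only on the success path, a
// rejected probe (for example, connection refused) would leave a pending
// timer that keeps the process, and Jest, alive until it fires.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would pass the signal on to the request, for example `withTimeout((signal) => fetch(url, { signal }), 2000)`, where `url` is whatever endpoint is being probed.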
…lsClient

Verifies that runExperiment() limits execution to the first example
when EVALUATION_DRY_RUN=true (regardless of repetitions), and runs
all examples when the var is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
