feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local by patrykkopycinski · Pull Request #265798 · elastic/kibana

patrykkopycinski · 2026-04-27T14:00:48Z

Summary

What's in this PR

(a) `--local` connector injection in `@kbn/evals`

src/cli/inject_local_connector.ts is moved from the deleted @kbn/evals-local package into @kbn/evals. It probes running Ollama (localhost:11434) or LM Studio (localhost:1234), discovers the loaded model name, and injects KIBANA_TESTING_AI_CONNECTORS + EVALUATION_CONNECTOR_ID so that any existing eval command picks up the local model.

scripts/evals.js routes --local via require('@kbn/evals').injectLocalConnector (was @kbn/evals-local).

Hard-fail behaviour is preserved: if no runtime is reachable, the process exits with a non-zero code and prints:

--local requires a running local runtime, but none was detected.
Start Ollama (`ollama serve`) or LM Studio.
Refusing to silently fall back to the cloud connector.

(b) Configurable eval task timeout (`EVAL_TASK_TIMEOUT_MS`)

create_playwright_eval_config.ts reads process.env.EVAL_TASK_TIMEOUT_MS and uses it as the Playwright task timeout, falling back to 5 minutes. Local models are significantly slower than EIS — this allows callers to bump the timeout without modifying test fixtures.

(c) `--dry-run`

node scripts/evals run --suite <id> --dry-run slices each registered dataset to its first example, sets EVALUATION_REPETITIONS=1, and prints a banner at startup:

[DRY-RUN] sampling 1 example per dataset, repetitions=1

This lets developers verify that the full eval pipeline (Kibana → Scout → evaluators → export) works end-to-end in ~1 minute before committing to a full run.

(d) Documentation

@kbn/evals/README.md gains a "Local model quick-start" section: one-line install (brew install ollama && ollama pull <model>), one model recommendation per RAM tier (16 GB / 32 GB / 48 GB / 64 GB+), and the required env var (EVAL_TASK_TIMEOUT_MS=120000).

What was removed

The entire x-pack/platform/packages/shared/kbn-evals-local/ package is deleted:

src/cli/local.ts, src/cli/benchmark.ts — auto-provisioning + benchmark orchestrator
src/local/provision.ts, teardown.ts, model_negotiator.ts, model_validator.ts — runtime lifecycle
src/local/model_registry.ts, models.json, tier_config.ts — curated model registry
src/benchmark/result_writer.ts, benchmark-results/ — benchmark result storage
RECOMMENDATIONS.md, .cursor/skills/local-evals/SKILL.md — derived artefacts

References cleaned up in package.json (workspace entry), tsconfig.base.json (path mapping), .github/CODEOWNERS, and yarn.lock.

Orchestrator, benchmark pipeline, and model registry move to elastic/agent-builder-skill-dev-cursor-plugin#2 (feat(local-evals): port @kbn/evals-local orchestrator + benchmark + registry), the sibling PR to this one. Tracked under the realign-pr-265798 treadmill plan group.

Usage after this PR

# Run evals locally — assumes Ollama or LM Studio is already running
node scripts/evals run --suite agent-builder --local

# Dry-run: verify the pipeline in ~1 min before a full run
node scripts/evals run --suite agent-builder --local --dry-run

# Longer timeout for slow local models (120 s per task)
EVAL_TASK_TIMEOUT_MS=120000 node scripts/evals run --suite agent-builder --local

Test plan

node scripts/evals run --suite agent-builder --local with Ollama running → connector injected, suite starts
node scripts/evals run --suite agent-builder --local with no Ollama or LM Studio running → non-zero exit, actionable error message (7 unit tests in inject_local_connector.test.ts)
node scripts/evals run --suite agent-builder --dry-run → [DRY-RUN] banner printed, EVALUATION_DRY_RUN=true in Running: preview, dataset sliced to 1 example
Type check: node scripts/type_check --project x-pack/platform/packages/shared/kbn-evals/tsconfig.json
Lint: node scripts/eslint --fix $(git diff --name-only HEAD)

elasticmachine · 2026-04-27T14:01:03Z

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

Click to trigger kibana-pull-request for this PR!
Click to trigger kibana-deploy-project-from-pr for this PR!
Click to trigger kibana-deploy-cloud-from-pr for this PR!
Click to trigger kibana-entity-store-performance-from-pr for this PR!
Click to trigger kibana-storybooks-from-pr for this PR!

patrykkopycinski · 2026-04-27T15:21:37Z

Test Plan Results

Environment: macOS 25.4.0, Apple Silicon, 48GB RAM, Node v24.14.1, Ollama 0.21.1
Model used: qwen3:8b (Qwen3-8B, 5.2GB, Q4_K_M)

1. Type checking — PASS

node scripts/type_check --project x-pack/platform/packages/shared/kbn-evals-local/tsconfig.json

Zero errors from @kbn/evals-local. (6 pre-existing errors in upstream get_agents_sockets_stats.test.ts — unrelated.)

2. `--list-models` — PASS

$ node scripts/evals local --list-models
[evals-local] System RAM: 48 GB
[evals-local] Available models for this machine:

  qwen2.5-32b-instruct [RECOMMENDED]
    Qwen2.5-32B-Instruct (32B, Q4_K_M, 20GB VRAM)
    Judge quality: very-good | Speed: ~15 tok/s

  mistral-small-3.1-24b
    Mistral Small 3.1 24B (24B, Q4_K_M, 15GB VRAM)
    Judge quality: good | Speed: ~20 tok/s

  devstral-small-2-24b
    Devstral Small 2 24B (24B, Q4_K_M, 15GB VRAM)
    Judge quality: good | Speed: ~23 tok/s

  qwen2.5-14b-instruct
    Qwen2.5-14B-Instruct (14B, Q4_K_M, 9GB VRAM)
    Judge quality: good | Speed: ~35 tok/s

  gemma-3-27b
    Gemma 3 27B (27B, Q4_K_M, 17GB VRAM)
    Judge quality: good | Speed: ~14 tok/s

  qwen3-8b
    Qwen3-8B (8B, Q4_K_M, 5GB VRAM)
    Judge quality: fair | Speed: ~63 tok/s

Shows 6 models eligible for 48GB system, sorted by priority with recommended flag.

3. `--validate-only` with Ollama — PASS

$ node scripts/evals local --validate-only --model qwen3-8b
[evals-local] Runtime: ollama
[evals-local] Endpoint: http://localhost:11434/v1
[evals-local] Swapping to requested model: Qwen3-8B
[evals-local] Model Qwen3-8B already available.
[evals-local] Validation complete. Model is ready for evals.

Detects Ollama, resolves model from registry, confirms model is pulled and server is healthy.

4. `--local` flag connector injection — PASS

// Test: inject connector when Ollama is running with qwen3:8b loaded
const args = ['run', '--suite', 'agent-builder', '--local'];
await injectLocalConnector(args);

// Result:
// [evals-local] Local connector injected: qwen3:8b at http://localhost:11434/v1
// KIBANA_TESTING_AI_CONNECTORS set (292 chars base64)
// Decoded connector:
{
  "local-eval-model": {
    "name": "Local: qwen3:8b",
    "actionTypeId": ".gen-ai",
    "config": {
      "apiUrl": "http://localhost:11434/v1/chat/completions",
      "apiProvider": "Other",
      "defaultModel": "qwen3:8b"
    },
    "secrets": { "apiKey": "local-eval" }
  }
}
// --local flag stripped from args: ['run', '--suite', 'agent-builder']

Auto-detects loaded model, generates correct connector config, injects env var, and cleans args.

5. Teardown — PASS

// Before teardown:
//   ollama ps → qwen3:8b loaded (10 GB, 100% GPU)

await teardown({
  runtime: 'ollama', modelTag: 'qwen3:8b',
  keepLoaded: false, serverWasRunning: true
});

// [evals-local] Model scheduled for unload -- memory freed.
// After teardown:
//   ollama ps → NAME (empty) — model fully unloaded

6. Full e2e eval run — PASS (partial)

$ node scripts/evals local --suite agent-builder --model qwen3-8b

[evals-local] Runtime: ollama
[evals-local] Endpoint: http://localhost:11434/v1
[evals-local] Swapping to requested model: Qwen3-8B
[evals-local] Model Qwen3-8B already available.
[evals-local] Running evals: node scripts/evals run --suite agent-builder

Running 12 tests using 1 worker
[scout-worker] Running tests against local stateful cluster
[scout-worker] Creating connector: local-eval-model as a7f5e5ba-1ebd-5767-b0e5-51fca96c42ad
[scout-worker] Loading inference client
[scout-worker] 🧪 Starting experiment with 1 evaluators and 5 concurrent runs
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=0) completed
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=1) completed
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=2) completed
[scout-worker] ✅ Experiment e804300c-27ee-4f18-8cbe-89258818803e completed

The full orchestration pipeline works end-to-end:

Detect → Ollama found at localhost:11434
Negotiate → resolved qwen3-8b from registry
Provision → model already pulled, skipped
Connector → injected env vars for @kbn/evals
Execute → spawned node scripts/evals run, Playwright launched, first experiment (ES|QL analytical queries, 3 examples) completed with all evaluators passing
Teardown → model scheduled for unload

Subsequent experiments (agent-builder converse tests) fail with ECONNREFUSED localhost:5620 because those tests require a local Kibana instance which is not running. This is expected — those tests would pass when running against a local dev Kibana.

Bug found & fixed during testing

scripts/evals.js was calling async injectLocalConnector() without awaiting it, so env vars wouldn't be set before @kbn/evals CLI started. Fixed by chaining via .then() (commit ac86147).

macroscopeapp · 2026-04-28T08:40:58Z

+plugins:
+  - name: verifier
+    source: local
+    path: /Users/patrykkopycinski/Projects/work/ao-verifier-plugin


🟡 Medium agent-orchestrator.yaml:14

The configuration file embeds hardcoded personal filesystem paths (/Users/patrykkopycinski/Projects/work/ao-verifier-plugin, /Users/patrykkopycinski/Projects/ao-workspace), a private machine hostname (worker-m1max), and a username (mac). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate .env file that is .gitignored.

Also found in 3 other location(s)

openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3

Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (/Users/patrykkopycinski/Projects/kibana). This auto-generated script is committed to the repository but will fail immediately with cd: /Users/patrykkopycinski/Projects/kibana: No such file or directory on any other developer's machine or CI environment. The set -e on line 2 ensures the script exits on the first failed cd command.

openspec/changes/ao-health-panel/.ao-generate.sh:3

Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (/Users/patrykkopycinski/Projects/kibana). This script is committed to the shared repository but will fail immediately for any other developer or CI environment because cd "/Users/patrykkopycinski/Projects/kibana" will error out under set -e, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g., $(git rev-parse --show-toplevel)).

eval-ad-remaining.sh:6

Line 6 hardcodes cd /Users/patrykkopycinski/Projects/kibana, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing the cd to fail. With set -u (but not set -e), the script will continue running from whatever directory it was invoked in, likely causing source ./eval-env.sh and node scripts/evals.js to fail with confusing errors.

🤖 Copy this AI Prompt to have your agent fix this:

In file agent-orchestrator.yaml around line 14: The configuration file embeds hardcoded personal filesystem paths (`/Users/patrykkopycinski/Projects/work/ao-verifier-plugin`, `/Users/patrykkopycinski/Projects/ao-workspace`), a private machine hostname (`worker-m1max`), and a username (`mac`). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate `.env` file that is `.gitignore`d. Also found in 3 other location(s): - openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3 -- Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (`/Users/patrykkopycinski/Projects/kibana`). This auto-generated script is committed to the repository but will fail immediately with `cd: /Users/patrykkopycinski/Projects/kibana: No such file or directory` on any other developer's machine or CI environment. The `set -e` on line 2 ensures the script exits on the first failed `cd` command. - openspec/changes/ao-health-panel/.ao-generate.sh:3 -- Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (`/Users/patrykkopycinski/Projects/kibana`). This script is committed to the shared repository but will fail immediately for any other developer or CI environment because `cd "/Users/patrykkopycinski/Projects/kibana"` will error out under `set -e`, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g., `$(git rev-parse --show-toplevel)`). - eval-ad-remaining.sh:6 -- Line 6 hardcodes `cd /Users/patrykkopycinski/Projects/kibana`, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing the `cd` to fail. With `set -u` (but not `set -e`), the script will continue running from whatever directory it was invoked in, likely causing `source ./eval-env.sh` and `node scripts/evals.js` to fail with confusing errors.

macroscopeapp · 2026-04-28T08:40:58Z

+  ) > "$log" 2>&1 || echo "  NOTE: eval exited non-zero (see $log)"
+
+  echo "  exit-code:  $?"


🟢 Low eval-nfc-ab.sh:126

echo " exit-code: $?" on line 128 always prints 0. The subshell exit code on line 126 is lost because when it fails, the || echo ... succeeds and resets $? to 0; when it succeeds, $? is naturally 0. The actual exit code is never captured.

Consider saving the subshell exit code before the || branch: capture rc=$? immediately after the subshell, then use $rc for both the error note and the final exit-code print. Note that with set -e, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.

+ ) > "$log" 2>&1; rc=$? + if [[ $rc -ne 0 ]]; then + echo " NOTE: eval exited non-zero (see $log)" + fi + + echo " exit-code: $rc"

🤖 Copy this AI Prompt to have your agent fix this:

In file eval-nfc-ab.sh around lines 126-128: `echo " exit-code: $?"` on line 128 always prints `0`. The subshell exit code on line 126 is lost because when it fails, the `|| echo ...` succeeds and resets `$?` to `0`; when it succeeds, `$?` is naturally `0`. The actual exit code is never captured. Consider saving the subshell exit code before the `||` branch: capture `rc=$?` immediately after the subshell, then use `$rc` for both the error note and the final exit-code print. Note that with `set -e`, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.

macroscopeapp · 2026-04-28T08:40:58Z

+  exit 1
+fi
+
+echo ">> ES        = $ES_URL"


🟢 Low eval-restore-kbs.sh:38

ES_URL (which may contain credentials like user:password@host) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with sed or parameter expansion.

-echo ">> ES = $ES_URL" +echo ">> ES = ${ES_URL//@*/**SECRET**@}"

🤖 Copy this AI Prompt to have your agent fix this:

In file eval-restore-kbs.sh around line 38: `ES_URL` (which may contain credentials like `user:password@host`) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with `sed` or parameter expansion.

macroscopeapp · 2026-04-28T08:40:58Z

+# Detect: gh pr create
+if [[ "$clean_command" =~ ^gh[[:space:]]+pr[[:space:]]+create ]]; then
+  # Extract PR URL from output
+  pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1)


🟢 Low .claude/metadata-updater.sh:106

When gh pr create output doesn't match the expected URL pattern, grep -Eo returns exit code 1, which with pipefail causes the script to terminate before reaching the [[ -n "$pr_url" ]] check. Consider appending || true to the pipeline so the script gracefully handles missing URLs instead of crashing.

- pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1) + pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1) || true

🤖 Copy this AI Prompt to have your agent fix this:

In file .claude/metadata-updater.sh around line 106: When `gh pr create` output doesn't match the expected URL pattern, `grep -Eo` returns exit code 1, which with `pipefail` causes the script to terminate before reaching the `[[ -n "$pr_url" ]]` check. Consider appending `|| true` to the pipeline so the script gracefully handles missing URLs instead of crashing.

macroscopeapp · 2026-04-28T08:40:58Z

+  BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || echo 0)
+  RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || echo 0)
+  RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || echo 0)
+  EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || echo 0)
+  RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || echo 0)


🟢 Low eval-ad-remaining.sh:38

Lines 38–42 use || echo 0 as a fallback when grep -c finds no matches, but grep -c already prints 0 to stdout before exiting with code 1. The command substitution therefore captures the concatenated output "0\n0" (two lines), which causes garbled meta-file values like basic_done=0\n0/6. Consider removing the || echo 0 (or replacing it with || true) since the script doesn't use set -e.

+ BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || true) + RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || true) + RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || true) + EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || true) + RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || true)

🤖 Copy this AI Prompt to have your agent fix this:

In file eval-ad-remaining.sh around lines 38-42: Lines 38–42 use `|| echo 0` as a fallback when `grep -c` finds no matches, but `grep -c` already prints `0` to stdout before exiting with code 1. The command substitution therefore captures the concatenated output `"0\n0"` (two lines), which causes garbled meta-file values like `basic_done=0\n0/6`. Consider removing the `|| echo 0` (or replacing it with `|| true`) since the script doesn't use `set -e`.

…andExists Discovered via patryks-treadmill ship-readiness audit on PR elastic#265798: 1. RECOMMENDATIONS.md was labeled 'Auto-generated, do not edit' but contained richer tables (5-col Quick Pick with Why, 8-col All Models with Active) that generateRecommendations() couldn't reproduce. Next --update-registry run would silently overwrite the enriched content. Fix: extend ModelConfig with whyShortReason + activeParamsBillions, update generator to emit matching schema, and move the one-time 'What changed from v1' migration notes to CHANGELOG.md (their structural home). 2. commandExists() in detect.ts and provision.ts shell-interpolated its argument (\`command -v ${cmd}\`). Currently safe (only called with hardcoded literals), but a future caller passing user input would create a shell-injection sink. Replaced with execFileSync + sh -c "command -v \$1" + positional argument so cmd is never re-tokenized by the shell. Both findings were verified by the gate against the actual source files (anti-hallucination rules in the audit prompt held).

injectLocalConnector() previously logged a warning and resolved without setting any connector env vars when neither Ollama nor LM Studio was running. The caller in scripts/evals.js chains `.then(() => cli.run())` unconditionally on the resolved promise, so the warn-and-return path let @kbn/evals start with the default CLOUD connector — silently producing eval reports that look like local model output but actually used Anthropic/OpenAI. A developer reading the report could not tell their local model was never invoked. Throw an explicit Error with actionable remediation instead. Refusing to silently fall back to cloud is a data-trust requirement for an "--local" flag. Surfaced by patryks-treadmill ship-readiness audit run #7 against PR elastic#265798 (verdict warn / score 74, single major finding, verified by tracing inject.ts:54 → scripts/evals.js .then() chain).

…them benchmarkSingleModel's catch bound `e` but never referenced it. When ensureRuntime() or ensureModel() threw (Ollama not installed, model pull failed, disk full), the printed results table showed `evalExitCode: 1` with zero explanation. Operators had to dig through the source to guess what failed. Log the underlying error message before returning the failure record so the failure is actionable on first read of the table. Surfaced by patryks-treadmill ship-readiness audit run #9 against PR elastic#265798 (verdict warn / score 74, verified at benchmark.ts:93-105).

patrykkopycinski · 2026-05-06T07:52:18Z

/ci

viduni94

Hey @patrykkopycinski
Did a quick skim through of the PR. I'll do a deeper review tomorrow.

Out of curiosity, what's the advantage of auto provisioning the models via Ollama or LM studio vs. just pointing a connector config at an already running Ollama/LM Studio instance?

patrykkopycinski · 2026-05-11T08:55:58Z

@viduni94 great question — and to be clear, the PR ships both options. There are actually three pieces in here, and each one answers part of your question:

node scripts/evals run --suite … --local → just probes a running Ollama/LM Studio, builds the connector config, runs the eval. No provisioning. This is the "I already have my own setup" path (src/cli/inject.ts).
node scripts/evals local --suite … → the orchestrator (detect → negotiate → provision → run → tear down) — src/cli/local.ts.
node scripts/evals local benchmark … → the benchmark pipeline that produces the data the orchestrator relies on (src/cli/benchmark.ts, src/benchmark/result_writer.ts).

So the question is really why does the orchestrator exist on top of the bare connector path, and why is there a benchmark pipeline at all. The two things are linked.

Why auto-provisioning is the default for `local`

The bare "point a connector at a running runtime" path has a few foot-guns that show up in practice:

1. Wrong model loaded → silent garbage results. If a dev has qwen2.5:7b loaded and runs evals, a bare-connector path will happily start a 20–30 minute run and produce numbers that look fine but are meaningless because that model doesn't reliably tool-call. The orchestrator validates tool-calling against a tiny 2 + 2 function-calling probe (model_validator.ts) before the run and either swaps to a working model, prompts the user, or falls back to CODE-only evaluators. In inject.ts we even hard-fail when no local runtime is detected, specifically to prevent silently falling back to the cloud connector — a "data-trust regression" the comment there calls out.

2. No model / wrong tier for the machine. Many Kibana devs have never installed Ollama at all, and those who have don't necessarily know which model fits 16 GB vs 48 GB vs 96 GB. The orchestrator reads os.totalmem() and picks from a curated registry with a 14 B floor (model_registry.autoSelect() + MIN_EVAL_PARAMS_BILLIONS). The connector-only path assumes you already made all those decisions correctly.

3. VRAM hygiene. After the run, teardown.ts posts keep_alive: 0 so the model is unloaded from VRAM. Without that, a 20 GB model sits in GPU memory indefinitely and chokes the Kibana dev server, Cursor, Docker, etc. — and developers don't reliably remember to ollama stop manually.

4. Tier-aware execution. The orchestrator sets EVAL_TASK_TIMEOUT_MS, EVAL_THRESHOLD_MULTIPLIER, EVAL_MAX_SAMPLES, and SELECTED_EVALUATORS for local runs (tier_config.ts). The default 5-minute Playwright task timeout doesn't survive a 15 tok/s local model, so a bare connector setup just times out. We also touched create_playwright_eval_config.ts to honor EVAL_TASK_TIMEOUT_MS.

5. Zero-config first run. The whole point of the PR is "use local to reduce EIS token consumption before pushing to CI". That only works if the friction to running local is near-zero — brew install Ollama → pull right model → run eval → unload, in one command. If the first step is "set up Ollama, pick a model from this big list, figure out the connector JSON, then run the eval", most people will just keep using EIS.

Why the benchmarker matters (and why the orchestrator depends on it)

Steps 1, 2 and 4 above all rely on the registry knowing which models actually work — toolCalling, judgeQuality, tokPerSecEstimate, minRamGb. Those are the dials the negotiator uses to pick a model. The benchmarker is how we keep that data honest:

node scripts/evals local benchmark --model <id> runs the full pipeline (provision → warm up → tool-calling probe → run a real eval suite → measure → teardown) and writes a per-model benchmark-results/<id>-<date>.json.
--all does it across every model that fits the current machine — so a dev with 48 GB and a dev with 16 GB each contribute to a different tier of the table.
--update-registry folds the results back into models.json and regenerates RECOMMENDATIONS.md from the merged data (the file is explicitly marked auto-generated for that reason — result_writer.generateRecommendations).
--create-pr opens a draft, data-only PR with the updated models.json, the rendered RECOMMENDATIONS.md, and the new benchmark-results/ files. We capture the original branch so a failed gh pr create doesn't strand the dev on the benchmark branch.

The intent is to make the recommendations system crowd-hardenable: when a new Qwen / Gemma / Mistral version drops, or when someone wants to validate that a model still passes tool calling after an Ollama upgrade, they run one command and contribute a PR. The orchestrator immediately starts auto-selecting the better model for everyone else on that RAM tier. Without that loop, the registry is just a static table that goes stale within weeks and silently degrades the auto-select decision back into the same "wrong model loaded → silent garbage results" failure mode we built the orchestrator to prevent.

tl;dr

Bare connector → power-user path, opt in via --local.
Orchestrator → defensive defaults so the common path doesn't silently produce bad eval data.
Benchmarker → the feedback loop that keeps the orchestrator's defaults trustworthy as the OSS model landscape moves.

patrykkopycinski · 2026-05-11T09:00:39Z

/ci

kibanamachine · 2026-05-11T09:07:20Z

💔 Build Failed

Buildkite Build
Commit: 626ce9e

Failed CI Steps

Metrics [docs]

✅ unchanged

History

💔 Build #438569 failed 9947167

SrdjanLL

Hey Patryk, thanks for drafting this - it's a good nudge in the right direction and consideration of the tokens we'll be burning through with evals.

Top of my mind, besides what Viduni's mentioned, is I'm worried that we're putting a lot of the responsibility on the framework, early. I'm wondering if an internal documentation with a clear "local model of choice" for smoke testing eval suites (published here) should be sufficient to get people started with running local evals (e.g we can choose one or two models depending on what machine you're running with).

From my understanding - our goal is to reduce the cost of the developer feedback loop by running locally, but still evaluate against EIS models using the standard process as we expect our users to use the same models in their workloads. If that's the right goal, I think we can start simple and see how that goes with dev adoption and reduce friction if any arises.

There's a lot to uncover here, but I'll also jump in on individual points.

Wrong model loaded → silent garbage results. If a dev has qwen2.5:7b loaded and runs evals, a bare-connector path will happily start a 20–30 minute run and produce numbers that look fine but are meaningless because that model doesn't reliably tool-call. The orchestrator validates tool-calling against a tiny 2 + 2 function-calling probe (model_validator.ts) before the run and either swaps to a working model, prompts the user, or falls back to CODE-only evaluators. In inject.ts we even hard-fail when no local runtime is detected, specifically to prevent silently falling back to the cloud connector — a "data-trust regression" the comment there calls out.

What does a wrong model mean here? A model that can't call tools? Could this be captured by the evaluations in the standard process - either via the tool call inspection or the groundedness evaluator? If we go with the above "local model of choice" that can call tools we can expect consistent results without maintaining an orchestrator that validates probes into this. Feels like a "figure it out" once kind of a problem, but let me know if there's something I'm missing.

No model / wrong tier for the machine. Many Kibana devs have never installed Ollama at all, and those who have don't necessarily know which model fits 16 GB vs 48 GB vs 96 GB. The orchestrator reads os.totalmem() and picks from a curated registry with a 14 B floor (model_registry.autoSelect() + MIN_EVAL_PARAMS_BILLIONS). The connector-only path assumes you already made all those decisions correctly.

Could this be sorted by having better internal documentation/getting started? To guide our peers which models to set up with (e.g depending on the machine they're using)

VRAM hygiene. After the run, teardown.ts posts keep_alive: 0 so the model is unloaded from VRAM. Without that, a 20 GB model sits in GPU memory indefinitely and chokes the Kibana dev server, Cursor, Docker, etc. — and developers don't reliably remember to ollama stop manually.

It's a good point and if we were to make this automatic it would make sense to unload. My thinking is that we're just putting a lot of responsibility on the framework itself, where this feels more of a dev env hygiene.

Tier-aware execution. The orchestrator sets EVAL_TASK_TIMEOUT_MS, EVAL_THRESHOLD_MULTIPLIER, EVAL_MAX_SAMPLES, and SELECTED_EVALUATORS for local runs (tier_config.ts). The default 5-minute Playwright task timeout doesn't survive a 15 tok/s local model, so a bare connector setup just times out. We also touched create_playwright_eval_config.ts to honor EVAL_TASK_TIMEOUT_MS.

++ Makes sense to have this a bit more configurable for local runs, it could blow up indeed.

Zero-config first run. The whole point of the PR is "use local to reduce EIS token consumption before pushing to CI". That only works if the friction to running local is near-zero — brew install Ollama → pull right model → run eval → unload, in one command. If the first step is "set up Ollama, pick a model from this big list, figure out the connector JSON, then run the eval", most people will just keep using EIS.

It makes sense to reduce friction, I'm just wondering whether full automation of the setup and having this exist in Kibana codebase is the right call here? I'd also imagine that a lot of the folks wanting to run evals with local models will already have ollama or similar set up in their environment anyways and may just need a nudge towards the right model + how to set up the connector.

Please let me know what you think and whether we can start a bit lighter while achieving the same outcome.

SrdjanLL · 2026-05-11T09:42:05Z

Also, other frameworks have dry run concepts, such as this: https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/run-experiments#dry-run

Maybe this is something we could try out and build into our experiments - pick a single example on each dataset, override repetitions to 1 and run.
I've done this with Phoenix before and helped me have a tighter feedback loop when I'm not worried about inspecting the results from the entire experiment.

Move src/cli/inject.ts from @kbn/evals-local into @kbn/evals as src/cli/inject_local_connector.ts. The file is made self-contained by inlining the runtime-detection helpers (probeEndpoint, getOllamaModels, getLmStudioModel, commandExists, detect) and the connector env-setter from connector_factory, both of which are being deleted in the broader kbn-evals-local retirement. The ModelRegistry dependency is dropped in favour of accepting --local-model values as plain model-name strings (bare-connector path). The three load-bearing behaviours are preserved verbatim: the hard-fail guard when no endpoint is detected, the process.argv strip-and-sync after --local removal, and the execFileSync-based commandExists call that prevents shell injection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds the named re-export so scripts/evals.js can require it directly from @kbn/evals instead of @kbn/evals-local. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When scripts/evals.js is invoked with --local, call injectLocalConnector(process.argv) from @kbn/evals before handing off to cli.run(). Passing process.argv directly (not a slice) lets the function strip --local / --local-endpoint / --local-model in-place so cli.run() sees a clean argv. The .then() chain ensures cli.run() only starts after connector env vars are set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…anner Replace the early-return stub with env-var injection so --dry-run falls through to spawn Playwright: sets EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true in envOverrides, prints the '[DRY-RUN] sampling 1 example per dataset, repetitions=1' banner, then proceeds to the existing spawn block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…RUN=true In KibanaEvalsClient.runExperiment(), check process.env.EVALUATION_DRY_RUN and slice resolvedDataset.examples to [examples[0]] before the run loop. This wires the --dry-run flag end-to-end: CLI sets EVALUATION_DRY_RUN=true, Playwright inherits it, and the executor limits each dataset to one example. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Covers: one-line Ollama install, one model recommendation per RAM tier (16/32/48/64 GB+), EVAL_TASK_TIMEOUT_MS=600000 requirement, --local vs --dry-run guidance, and pointer to elastic-agent-builder-skill-dev for advanced benchmarking orchestration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cla-checker-service · 2026-05-12T10:35:29Z

💚 CLA has been signed

…view Move the --dry-run envOverrides block above the commandPreview snapshot so EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true appear in the logged "Running: ..." line. Previously the preview was built before the dry-run mutation, making the logged command unreproducible when copy-pasted. Runtime behavior is unchanged — spawn() always received the correct env. Smoke test verified: node scripts/evals run --suite agent-builder --dry-run --local prints the [DRY-RUN] banner, shows all overrides in the Running: line, and Playwright starts 12 tests (1 per dataset spec) with EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…d; fix timer leak Adds inject_local_connector.test.ts with 7 unit tests covering the hard-fail path (no Ollama, no LM Studio, no binary installed), the env-var injection happy path, and the binary-installed-but-not-running path. All assertions verified: throws with actionable message, strips --local from args before detection, never sets EVALUATION_CONNECTOR_ID or KIBANA_TESTING_AI_CONNECTORS when no runtime is found. Also fixes a timer resource leak in probeEndpoint / getOllamaModels / getLmStudioModel: clearTimeout was only called on the success path; moved to finally{} so it fires on rejection too. This eliminated the "Jest did not exit" open-handle warning that surfaced during test authoring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lsClient Verifies that runExperiment() limits execution to the first example when EVALUATION_DRY_RUN=true (regardless of repetitions), and runs all examples when the var is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…amily tools Replaces the monolithic `runEmulationCommand` tool with four narrower, typed-by-discriminated-union tools — one per command family: - `security.detection-emulation.run-process-command` — kill-process, suspend-process, running-processes, memory-dump - `security.detection-emulation.run-file-command` — get-file, scan, upload - `security.detection-emulation.run-network-command` — isolate, unisolate - `security.detection-emulation.run-execution-command` — execute, runscript, cancel Each tool's schema is a Zod `discriminatedUnion('command', …)` over the exported parameter shapes (`processRefParams`, `pathParams`, `executeParams`, `runscriptParams`, `cancelParams`, `optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so misspelled fields, extra keys, or wrong types fail at the boundary before reaching the EDR connector. Per-family handlers all delegate to the new `withCommandGates` helper that consolidates the security gate sequence the monolithic tool used to inline (feature flag → per-command RBAC → host allowlist → atomic rate-limit acquire → authenticated caller → runner dispatch → typed runner-error mapping → rate-limit release on failure). The helper's gate order, error classifications, status codes, and side effects match the original 1:1. Skill body updated: - Tool-IDs alias paragraph now lists the four per-family tool IDs (replaces the single `runEmulationCommand → run-command` entry). - "Low-level command dispatch" guidance now picks the family by command, with a per-family table of required `parameters` shapes. - Guardrails table collapses the four `run*Command` columns into one shared column (the gates are identical across families). - Skill `description` extended with the four-family taxonomy hint so activation tokens still resolve. Tests: - New `run_command_tools.test.ts` exercises each family's schema discrimination — accept-valid / reject-invalid per command + cross-family rejection (each tool refuses commands from the other three families) + shared rules (empty endpointIds, unknown agentType, empty emulationId). 45 tests, all green. - Old `run_emulation_command_tool.test.ts` deleted (its scope is covered by the new per-family tests + the shared `RunEmulationCommandInputSchema` round-trip in `validate_rule_input.test.ts`). Files: - common/detection_emulation/schemas/run_emulation_command_input.ts: exported `commonFields`, `optionalCommentParams`, `processRefParams`, `memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`, `runscriptParams`, `uploadParams` so the per-family tools can reuse them. Single source of truth for parameter shapes. - New: with_command_gates.ts, run_process_command_tool.ts, run_file_command_tool.ts, run_network_command_tool.ts, run_execution_command_tool.ts - Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts - Updated: detection_emulation_skill.ts (registration array + skill body), index.ts (re-exports the four per-family factories) Addresses Srdjan's I5 review item from PR elastic#265798.

macroscopeapp Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread scripts/evals.js Outdated

macroscopeapp Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/benchmark/result_writer.ts Outdated

patrykkopycinski added release_note:skip Skip the PR/issue when compiling release notes v9.1.0 labels Apr 28, 2026

macroscopeapp Bot reviewed Apr 28, 2026

View reviewed changes

patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from a47298c to b73ba71 Compare April 28, 2026 15:52

macroscopeapp Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/benchmark/result_writer.ts Outdated

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated

patrykkopycinski added v9.5.0 and removed v9.1.0 labels May 4, 2026

macroscopeapp Bot reviewed May 4, 2026

View reviewed changes

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated

macroscopeapp Bot reviewed May 4, 2026

View reviewed changes

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated

viduni94 reviewed May 7, 2026

View reviewed changes

SrdjanLL reviewed May 11, 2026

View reviewed changes

patrykkopycinski and others added 6 commits May 12, 2026 09:14

feat(kbn-evals): export injectLocalConnector from package index

6073da3

Adds the named re-export so scripts/evals.js can require it directly from @kbn/evals instead of @kbn/evals-local. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

patrykkopycinski changed the title ~~feat(@kbn/evals-local): auto-provisioned local model support for eval runner~~ feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local May 12, 2026

patrykkopycinski requested review from SrdjanLL and viduni94 and removed request for SrdjanLL and viduni94 May 12, 2026 09:25

patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from ddcce6e to 012354a Compare May 12, 2026 10:35

patrykkopycinski and others added 3 commits May 12, 2026 12:53

patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from 012354a to 2904710 Compare May 12, 2026 10:54

patrykkopycinski mentioned this pull request May 12, 2026

[AO] realign-kibana-265798-inline-local-into-kbn-evals-retire-kbn #268862

Draft

18 tasks

		) > "$log" 2>&1 \|\| echo " NOTE: eval exited non-zero (see $log)"

		echo " exit-code: $?"

Conversation

patrykkopycinski commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

(a) --local connector injection in @kbn/evals

(b) Configurable eval task timeout (EVAL_TASK_TIMEOUT_MS)

(c) --dry-run

(d) Documentation

What was removed

Usage after this PR

Test plan

Uh oh!

elasticmachine commented Apr 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

patrykkopycinski commented Apr 27, 2026

Test Plan Results

1. Type checking — PASS

2. --list-models — PASS

3. --validate-only with Ollama — PASS

4. --local flag connector injection — PASS

5. Teardown — PASS

6. Full e2e eval run — PASS (partial)

Bug found & fixed during testing

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

patrykkopycinski commented May 6, 2026

Uh oh!

viduni94 left a comment

Choose a reason for hiding this comment

Uh oh!

patrykkopycinski commented May 11, 2026

Why auto-provisioning is the default for local

Why the benchmarker matters (and why the orchestrator depends on it)

tl;dr

Uh oh!

patrykkopycinski commented May 11, 2026

Uh oh!

kibanamachine commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💔 Build Failed

Failed CI Steps

Metrics [docs]

History

Uh oh!

SrdjanLL left a comment

Choose a reason for hiding this comment

Uh oh!

SrdjanLL commented May 11, 2026

Uh oh!

cla-checker-service Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

patrykkopycinski commented Apr 27, 2026 •

edited

Loading

(a) `--local` connector injection in `@kbn/evals`

(b) Configurable eval task timeout (`EVAL_TASK_TIMEOUT_MS`)

(c) `--dry-run`

2. `--list-models` — PASS

3. `--validate-only` with Ollama — PASS

4. `--local` flag connector injection — PASS

Why auto-provisioning is the default for `local`

kibanamachine commented May 11, 2026 •

edited

Loading

cla-checker-service Bot commented May 12, 2026 •

edited

Loading