Skip to content

feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local#265798

Draft
patrykkopycinski wants to merge 9 commits into
elastic:mainfrom
patrykkopycinski:feat/local-evals-with-ollama
Draft

feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local#265798
patrykkopycinski wants to merge 9 commits into
elastic:mainfrom
patrykkopycinski:feat/local-evals-with-ollama

Conversation

@patrykkopycinski
Copy link
Copy Markdown
Contributor

@patrykkopycinski patrykkopycinski commented Apr 27, 2026

Summary

What's in this PR

(a) --local connector injection in @kbn/evals

src/cli/inject_local_connector.ts is moved from the deleted @kbn/evals-local package into @kbn/evals. It probes running Ollama (localhost:11434) or LM Studio (localhost:1234), discovers the loaded model name, and injects KIBANA_TESTING_AI_CONNECTORS + EVALUATION_CONNECTOR_ID so that any existing eval command picks up the local model.

scripts/evals.js routes --local via require('@kbn/evals').injectLocalConnector (was @kbn/evals-local).

Hard-fail behaviour is preserved: if no runtime is reachable, the process exits with a non-zero code and prints:

--local requires a running local runtime, but none was detected.
Start Ollama (`ollama serve`) or LM Studio.
Refusing to silently fall back to the cloud connector.

(b) Configurable eval task timeout (EVAL_TASK_TIMEOUT_MS)

create_playwright_eval_config.ts reads process.env.EVAL_TASK_TIMEOUT_MS and uses it as the Playwright task timeout, falling back to 5 minutes. Local models are significantly slower than EIS — this allows callers to bump the timeout without modifying test fixtures.

(c) --dry-run

node scripts/evals run --suite <id> --dry-run slices each registered dataset to its first example, sets EVALUATION_REPETITIONS=1, and prints a banner at startup:

[DRY-RUN] sampling 1 example per dataset, repetitions=1

This lets developers verify that the full eval pipeline (Kibana → Scout → evaluators → export) works end-to-end in ~1 minute before committing to a full run.

(d) Documentation

@kbn/evals/README.md gains a "Local model quick-start" section: one-line install (brew install ollama && ollama pull <model>), one model recommendation per RAM tier (16 GB / 32 GB / 48 GB / 64 GB+), and the required env var (EVAL_TASK_TIMEOUT_MS=120000).


What was removed

The entire x-pack/platform/packages/shared/kbn-evals-local/ package is deleted:

  • src/cli/local.ts, src/cli/benchmark.ts — auto-provisioning + benchmark orchestrator
  • src/local/provision.ts, teardown.ts, model_negotiator.ts, model_validator.ts — runtime lifecycle
  • src/local/model_registry.ts, models.json, tier_config.ts — curated model registry
  • src/benchmark/result_writer.ts, benchmark-results/ — benchmark result storage
  • RECOMMENDATIONS.md, .cursor/skills/local-evals/SKILL.md — derived artefacts

References cleaned up in package.json (workspace entry), tsconfig.base.json (path mapping), .github/CODEOWNERS, and yarn.lock.

Orchestrator, benchmark pipeline, and model registry move to elastic/agent-builder-skill-dev-cursor-plugin#2 (feat(local-evals): port @kbn/evals-local orchestrator + benchmark + registry), the sibling PR to this one. Tracked under the realign-pr-265798 treadmill plan group.


Usage after this PR

# Run evals locally — assumes Ollama or LM Studio is already running
node scripts/evals run --suite agent-builder --local

# Dry-run: verify the pipeline in ~1 min before a full run
node scripts/evals run --suite agent-builder --local --dry-run

# Longer timeout for slow local models (120 s per task)
EVAL_TASK_TIMEOUT_MS=120000 node scripts/evals run --suite agent-builder --local

Test plan

  • node scripts/evals run --suite agent-builder --local with Ollama running → connector injected, suite starts
  • node scripts/evals run --suite agent-builder --local with no Ollama or LM Studio running → non-zero exit, actionable error message (7 unit tests in inject_local_connector.test.ts)
  • node scripts/evals run --suite agent-builder --dry-run[DRY-RUN] banner printed, EVALUATION_DRY_RUN=true in Running: preview, dataset sliced to 1 example
  • Type check: node scripts/type_check --project x-pack/platform/packages/shared/kbn-evals/tsconfig.json
  • Lint: node scripts/eslint --fix $(git diff --name-only HEAD)

@elasticmachine
Copy link
Copy Markdown
Contributor

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

  • Click to trigger kibana-pull-request for this PR!
  • Click to trigger kibana-deploy-project-from-pr for this PR!
  • Click to trigger kibana-deploy-cloud-from-pr for this PR!
  • Click to trigger kibana-entity-store-performance-from-pr for this PR!
  • Click to trigger kibana-storybooks-from-pr for this PR!

Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/inject.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/local/model_negotiator.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/local/model_registry.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/local.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/local/connector_factory.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/local/provision.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/local/detect.ts Outdated
@patrykkopycinski
Copy link
Copy Markdown
Contributor Author

Test Plan Results

Environment: macOS 25.4.0, Apple Silicon, 48GB RAM, Node v24.14.1, Ollama 0.21.1
Model used: qwen3:8b (Qwen3-8B, 5.2GB, Q4_K_M)


1. Type checking — PASS

node scripts/type_check --project x-pack/platform/packages/shared/kbn-evals-local/tsconfig.json

Zero errors from @kbn/evals-local. (6 pre-existing errors in upstream get_agents_sockets_stats.test.ts — unrelated.)

2. --list-models — PASS

$ node scripts/evals local --list-models
[evals-local] System RAM: 48 GB
[evals-local] Available models for this machine:

  qwen2.5-32b-instruct [RECOMMENDED]
    Qwen2.5-32B-Instruct (32B, Q4_K_M, 20GB VRAM)
    Judge quality: very-good | Speed: ~15 tok/s

  mistral-small-3.1-24b
    Mistral Small 3.1 24B (24B, Q4_K_M, 15GB VRAM)
    Judge quality: good | Speed: ~20 tok/s

  devstral-small-2-24b
    Devstral Small 2 24B (24B, Q4_K_M, 15GB VRAM)
    Judge quality: good | Speed: ~23 tok/s

  qwen2.5-14b-instruct
    Qwen2.5-14B-Instruct (14B, Q4_K_M, 9GB VRAM)
    Judge quality: good | Speed: ~35 tok/s

  gemma-3-27b
    Gemma 3 27B (27B, Q4_K_M, 17GB VRAM)
    Judge quality: good | Speed: ~14 tok/s

  qwen3-8b
    Qwen3-8B (8B, Q4_K_M, 5GB VRAM)
    Judge quality: fair | Speed: ~63 tok/s

Shows 6 models eligible for 48GB system, sorted by priority with recommended flag.

3. --validate-only with Ollama — PASS

$ node scripts/evals local --validate-only --model qwen3-8b
[evals-local] Runtime: ollama
[evals-local] Endpoint: http://localhost:11434/v1
[evals-local] Swapping to requested model: Qwen3-8B
[evals-local] Model Qwen3-8B already available.
[evals-local] Validation complete. Model is ready for evals.

Detects Ollama, resolves model from registry, confirms model is pulled and server is healthy.

4. --local flag connector injection — PASS

// Test: inject connector when Ollama is running with qwen3:8b loaded
const args = ['run', '--suite', 'agent-builder', '--local'];
await injectLocalConnector(args);

// Result:
// [evals-local] Local connector injected: qwen3:8b at http://localhost:11434/v1
// KIBANA_TESTING_AI_CONNECTORS set (292 chars base64)
// Decoded connector:
{
  "local-eval-model": {
    "name": "Local: qwen3:8b",
    "actionTypeId": ".gen-ai",
    "config": {
      "apiUrl": "http://localhost:11434/v1/chat/completions",
      "apiProvider": "Other",
      "defaultModel": "qwen3:8b"
    },
    "secrets": { "apiKey": "local-eval" }
  }
}
// --local flag stripped from args: ['run', '--suite', 'agent-builder']

Auto-detects loaded model, generates correct connector config, injects env var, and cleans args.

5. Teardown — PASS

// Before teardown:
//   ollama ps → qwen3:8b loaded (10 GB, 100% GPU)

await teardown({
  runtime: 'ollama', modelTag: 'qwen3:8b',
  keepLoaded: false, serverWasRunning: true
});

// [evals-local] Model scheduled for unload -- memory freed.
// After teardown:
//   ollama ps → NAME (empty) — model fully unloaded

6. Full e2e eval run — PASS (partial)

$ node scripts/evals local --suite agent-builder --model qwen3-8b

[evals-local] Runtime: ollama
[evals-local] Endpoint: http://localhost:11434/v1
[evals-local] Swapping to requested model: Qwen3-8B
[evals-local] Model Qwen3-8B already available.
[evals-local] Running evals: node scripts/evals run --suite agent-builder

Running 12 tests using 1 worker
[scout-worker] Running tests against local stateful cluster
[scout-worker] Creating connector: local-eval-model as a7f5e5ba-1ebd-5767-b0e5-51fca96c42ad
[scout-worker] Loading inference client
[scout-worker] 🧪 Starting experiment with 1 evaluators and 5 concurrent runs
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=0) completed
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=1) completed
[scout-worker] ✅ Evaluator "ES|QL Functional Equivalence" on run (exampleIndex=2) completed
[scout-worker] ✅ Experiment e804300c-27ee-4f18-8cbe-89258818803e completed

The full orchestration pipeline works end-to-end:

  • Detect → Ollama found at localhost:11434
  • Negotiate → resolved qwen3-8b from registry
  • Provision → model already pulled, skipped
  • Connector → injected env vars for @kbn/evals
  • Execute → spawned node scripts/evals run, Playwright launched, first experiment (ES|QL analytical queries, 3 examples) completed with all evaluators passing
  • Teardown → model scheduled for unload

Subsequent experiments (agent-builder converse tests) fail with ECONNREFUSED localhost:5620 because those tests require a local Kibana instance which is not running. This is expected — those tests would pass when running against a local dev Kibana.

Bug found & fixed during testing

scripts/evals.js was calling async injectLocalConnector() without awaiting it, so env vars wouldn't be set before @kbn/evals CLI started. Fixed by chaining via .then() (commit ac86147).

Comment thread scripts/evals.js Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/benchmark/result_writer.ts Outdated
@patrykkopycinski patrykkopycinski added release_note:skip Skip the PR/issue when compiling release notes v9.1.0 labels Apr 28, 2026
Comment thread agent-orchestrator.yaml Outdated
plugins:
- name: verifier
source: local
path: /Users/patrykkopycinski/Projects/work/ao-verifier-plugin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium agent-orchestrator.yaml:14

The configuration file embeds hardcoded personal filesystem paths (/Users/patrykkopycinski/Projects/work/ao-verifier-plugin, /Users/patrykkopycinski/Projects/ao-workspace), a private machine hostname (worker-m1max), and a username (mac). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate .env file that is .gitignored.

Also found in 3 other location(s)

openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3

Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (/Users/patrykkopycinski/Projects/kibana). This auto-generated script is committed to the repository but will fail immediately with cd: /Users/patrykkopycinski/Projects/kibana: No such file or directory on any other developer's machine or CI environment. The set -e on line 2 ensures the script exits on the first failed cd command.

openspec/changes/ao-health-panel/.ao-generate.sh:3

Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (/Users/patrykkopycinski/Projects/kibana). This script is committed to the shared repository but will fail immediately for any other developer or CI environment because cd &#34;/Users/patrykkopycinski/Projects/kibana&#34; will error out under set -e, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g., $(git rev-parse --show-toplevel)).

eval-ad-remaining.sh:6

Line 6 hardcodes cd /Users/patrykkopycinski/Projects/kibana, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing the cd to fail. With set -u (but not set -e), the script will continue running from whatever directory it was invoked in, likely causing source ./eval-env.sh and node scripts/evals.js to fail with confusing errors.

🤖 Copy this AI Prompt to have your agent fix this:
In file agent-orchestrator.yaml around line 14:

The configuration file embeds hardcoded personal filesystem paths (`/Users/patrykkopycinski/Projects/work/ao-verifier-plugin`, `/Users/patrykkopycinski/Projects/ao-workspace`), a private machine hostname (`worker-m1max`), and a username (`mac`). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate `.env` file that is `.gitignore`d.

Also found in 3 other location(s):
- openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3 -- Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (`/Users/patrykkopycinski/Projects/kibana`). This auto-generated script is committed to the repository but will fail immediately with `cd: /Users/patrykkopycinski/Projects/kibana: No such file or directory` on any other developer's machine or CI environment. The `set -e` on line 2 ensures the script exits on the first failed `cd` command.
- openspec/changes/ao-health-panel/.ao-generate.sh:3 -- Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (`/Users/patrykkopycinski/Projects/kibana`). This script is committed to the shared repository but will fail immediately for any other developer or CI environment because `cd "/Users/patrykkopycinski/Projects/kibana"` will error out under `set -e`, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g., `$(git rev-parse --show-toplevel)`).
- eval-ad-remaining.sh:6 -- Line 6 hardcodes `cd /Users/patrykkopycinski/Projects/kibana`, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing the `cd` to fail. With `set -u` (but not `set -e`), the script will continue running from whatever directory it was invoked in, likely causing `source ./eval-env.sh` and `node scripts/evals.js` to fail with confusing errors.

Comment thread eval-nfc-ab.sh Outdated
Comment on lines +126 to +128
) > "$log" 2>&1 || echo " NOTE: eval exited non-zero (see $log)"

echo " exit-code: $?"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low eval-nfc-ab.sh:126

echo " exit-code: $?" on line 128 always prints 0. The subshell exit code on line 126 is lost because when it fails, the || echo ... succeeds and resets $? to 0; when it succeeds, $? is naturally 0. The actual exit code is never captured.

Consider saving the subshell exit code before the || branch: capture rc=$? immediately after the subshell, then use $rc for both the error note and the final exit-code print. Note that with set -e, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.

+  ) > "$log" 2>&1; rc=$?
+  if [[ $rc -ne 0 ]]; then
+    echo "  NOTE: eval exited non-zero (see $log)"
+  fi
+
+  echo "  exit-code:  $rc"
🤖 Copy this AI Prompt to have your agent fix this:
In file eval-nfc-ab.sh around lines 126-128:

`echo "  exit-code:  $?"` on line 128 always prints `0`. The subshell exit code on line 126 is lost because when it fails, the `|| echo ...` succeeds and resets `$?` to `0`; when it succeeds, `$?` is naturally `0`. The actual exit code is never captured.

Consider saving the subshell exit code before the `||` branch: capture `rc=$?` immediately after the subshell, then use `$rc` for both the error note and the final exit-code print. Note that with `set -e`, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.

Comment thread eval-restore-kbs.sh Outdated
exit 1
fi

echo ">> ES = $ES_URL"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low eval-restore-kbs.sh:38

ES_URL (which may contain credentials like user:password@host) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with sed or parameter expansion.

-echo ">> ES        = $ES_URL"
+echo ">> ES        = ${ES_URL//@*/**SECRET**@}"
🤖 Copy this AI Prompt to have your agent fix this:
In file eval-restore-kbs.sh around line 38:

`ES_URL` (which may contain credentials like `user:password@host`) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with `sed` or parameter expansion.

Comment thread .claude/metadata-updater.sh Outdated
# Detect: gh pr create
if [[ "$clean_command" =~ ^gh[[:space:]]+pr[[:space:]]+create ]]; then
# Extract PR URL from output
pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low .claude/metadata-updater.sh:106

When gh pr create output doesn't match the expected URL pattern, grep -Eo returns exit code 1, which with pipefail causes the script to terminate before reaching the [[ -n "$pr_url" ]] check. Consider appending || true to the pipeline so the script gracefully handles missing URLs instead of crashing.

-  pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1)
+  pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1) || true
🤖 Copy this AI Prompt to have your agent fix this:
In file .claude/metadata-updater.sh around line 106:

When `gh pr create` output doesn't match the expected URL pattern, `grep -Eo` returns exit code 1, which with `pipefail` causes the script to terminate before reaching the `[[ -n "$pr_url" ]]` check. Consider appending `|| true` to the pipeline so the script gracefully handles missing URLs instead of crashing.

Comment thread eval-ad-remaining.sh Outdated
Comment on lines +38 to +42
BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || echo 0)
RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || echo 0)
RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || echo 0)
EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || echo 0)
RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || echo 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low eval-ad-remaining.sh:38

Lines 38–42 use || echo 0 as a fallback when grep -c finds no matches, but grep -c already prints 0 to stdout before exiting with code 1. The command substitution therefore captures the concatenated output "0\n0" (two lines), which causes garbled meta-file values like basic_done=0\n0/6. Consider removing the || echo 0 (or replacing it with || true) since the script doesn't use set -e.

+  BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || true)
+  RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || true)
+  RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || true)
+  EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || true)
+  RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || true)
🤖 Copy this AI Prompt to have your agent fix this:
In file eval-ad-remaining.sh around lines 38-42:

Lines 38–42 use `|| echo 0` as a fallback when `grep -c` finds no matches, but `grep -c` already prints `0` to stdout before exiting with code 1. The command substitution therefore captures the concatenated output `"0\n0"` (two lines), which causes garbled meta-file values like `basic_done=0\n0/6`. Consider removing the `|| echo 0` (or replacing it with `|| true`) since the script doesn't use `set -e`.

@patrykkopycinski patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from a47298c to b73ba71 Compare April 28, 2026 15:52
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/benchmark/result_writer.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 4, 2026
…andExists

Discovered via patryks-treadmill ship-readiness audit on PR elastic#265798:

1. RECOMMENDATIONS.md was labeled 'Auto-generated, do not edit' but contained
   richer tables (5-col Quick Pick with Why, 8-col All Models with Active)
   that generateRecommendations() couldn't reproduce. Next --update-registry
   run would silently overwrite the enriched content. Fix: extend
   ModelConfig with whyShortReason + activeParamsBillions, update generator
   to emit matching schema, and move the one-time 'What changed from v1'
   migration notes to CHANGELOG.md (their structural home).

2. commandExists() in detect.ts and provision.ts shell-interpolated its
   argument (\`command -v ${cmd}\`). Currently safe (only called with
   hardcoded literals), but a future caller passing user input would create
   a shell-injection sink. Replaced with execFileSync + sh -c "command -v
   \$1" + positional argument so cmd is never re-tokenized by the shell.

Both findings were verified by the gate against the actual source files
(anti-hallucination rules in the audit prompt held).
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 4, 2026
injectLocalConnector() previously logged a warning and resolved without
setting any connector env vars when neither Ollama nor LM Studio was
running. The caller in scripts/evals.js chains `.then(() => cli.run())`
unconditionally on the resolved promise, so the warn-and-return path let
@kbn/evals start with the default CLOUD connector — silently producing
eval reports that look like local model output but actually used
Anthropic/OpenAI. A developer reading the report could not tell their
local model was never invoked.

Throw an explicit Error with actionable remediation instead. Refusing
to silently fall back to cloud is a data-trust requirement for an
"--local" flag.

Surfaced by patryks-treadmill ship-readiness audit run #7 against
PR elastic#265798 (verdict warn / score 74, single major finding, verified
by tracing inject.ts:54 → scripts/evals.js .then() chain).
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 4, 2026
…them

benchmarkSingleModel's catch bound `e` but never referenced it. When
ensureRuntime() or ensureModel() threw (Ollama not installed, model pull
failed, disk full), the printed results table showed `evalExitCode: 1`
with zero explanation. Operators had to dig through the source to guess
what failed.

Log the underlying error message before returning the failure record
so the failure is actionable on first read of the table.

Surfaced by patryks-treadmill ship-readiness audit run #9 against
PR elastic#265798 (verdict warn / score 74, verified at benchmark.ts:93-105).
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated
Comment thread x-pack/platform/packages/shared/kbn-evals-local/src/cli/benchmark.ts Outdated
@patrykkopycinski
Copy link
Copy Markdown
Contributor Author

/ci

Copy link
Copy Markdown
Contributor

@viduni94 viduni94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @patrykkopycinski
Did a quick skim through of the PR. I'll do a deeper review tomorrow.

Out of curiosity, what's the advantage of auto provisioning the models via Ollama or LM studio vs. just pointing a connector config at an already running Ollama/LM Studio instance?

@patrykkopycinski
Copy link
Copy Markdown
Contributor Author

@viduni94 great question — and to be clear, the PR ships both options. There are actually three pieces in here, and each one answers part of your question:

  1. node scripts/evals run --suite … --local → just probes a running Ollama/LM Studio, builds the connector config, runs the eval. No provisioning. This is the "I already have my own setup" path (src/cli/inject.ts).
  2. node scripts/evals local --suite … → the orchestrator (detect → negotiate → provision → run → tear down) — src/cli/local.ts.
  3. node scripts/evals local benchmark … → the benchmark pipeline that produces the data the orchestrator relies on (src/cli/benchmark.ts, src/benchmark/result_writer.ts).

So the question is really why does the orchestrator exist on top of the bare connector path, and why is there a benchmark pipeline at all. The two things are linked.

Why auto-provisioning is the default for local

The bare "point a connector at a running runtime" path has a few foot-guns that show up in practice:

1. Wrong model loaded → silent garbage results. If a dev has qwen2.5:7b loaded and runs evals, a bare-connector path will happily start a 20–30 minute run and produce numbers that look fine but are meaningless because that model doesn't reliably tool-call. The orchestrator validates tool-calling against a tiny 2 + 2 function-calling probe (model_validator.ts) before the run and either swaps to a working model, prompts the user, or falls back to CODE-only evaluators. In inject.ts we even hard-fail when no local runtime is detected, specifically to prevent silently falling back to the cloud connector — a "data-trust regression" the comment there calls out.

2. No model / wrong tier for the machine. Many Kibana devs have never installed Ollama at all, and those who have don't necessarily know which model fits 16 GB vs 48 GB vs 96 GB. The orchestrator reads os.totalmem() and picks from a curated registry with a 14 B floor (model_registry.autoSelect() + MIN_EVAL_PARAMS_BILLIONS). The connector-only path assumes you already made all those decisions correctly.

3. VRAM hygiene. After the run, teardown.ts posts keep_alive: 0 so the model is unloaded from VRAM. Without that, a 20 GB model sits in GPU memory indefinitely and chokes the Kibana dev server, Cursor, Docker, etc. — and developers don't reliably remember to ollama stop manually.

4. Tier-aware execution. The orchestrator sets EVAL_TASK_TIMEOUT_MS, EVAL_THRESHOLD_MULTIPLIER, EVAL_MAX_SAMPLES, and SELECTED_EVALUATORS for local runs (tier_config.ts). The default 5-minute Playwright task timeout doesn't survive a 15 tok/s local model, so a bare connector setup just times out. We also touched create_playwright_eval_config.ts to honor EVAL_TASK_TIMEOUT_MS.

5. Zero-config first run. The whole point of the PR is "use local to reduce EIS token consumption before pushing to CI". That only works if the friction to running local is near-zero — brew install Ollama → pull right model → run eval → unload, in one command. If the first step is "set up Ollama, pick a model from this big list, figure out the connector JSON, then run the eval", most people will just keep using EIS.

Why the benchmarker matters (and why the orchestrator depends on it)

Steps 1, 2 and 4 above all rely on the registry knowing which models actually work — toolCalling, judgeQuality, tokPerSecEstimate, minRamGb. Those are the dials the negotiator uses to pick a model. The benchmarker is how we keep that data honest:

  • node scripts/evals local benchmark --model <id> runs the full pipeline (provision → warm up → tool-calling probe → run a real eval suite → measure → teardown) and writes a per-model benchmark-results/<id>-<date>.json.
  • --all does it across every model that fits the current machine — so a dev with 48 GB and a dev with 16 GB each contribute to a different tier of the table.
  • --update-registry folds the results back into models.json and regenerates RECOMMENDATIONS.md from the merged data (the file is explicitly marked auto-generated for that reason — result_writer.generateRecommendations).
  • --create-pr opens a draft, data-only PR with the updated models.json, the rendered RECOMMENDATIONS.md, and the new benchmark-results/ files. We capture the original branch so a failed gh pr create doesn't strand the dev on the benchmark branch.

The intent is to make the recommendations system crowd-hardenable: when a new Qwen / Gemma / Mistral version drops, or when someone wants to validate that a model still passes tool calling after an Ollama upgrade, they run one command and contribute a PR. The orchestrator immediately starts auto-selecting the better model for everyone else on that RAM tier. Without that loop, the registry is just a static table that goes stale within weeks and silently degrades the auto-select decision back into the same "wrong model loaded → silent garbage results" failure mode we built the orchestrator to prevent.

tl;dr

  • Bare connector → power-user path, opt in via --local.
  • Orchestrator → defensive defaults so the common path doesn't silently produce bad eval data.
  • Benchmarker → the feedback loop that keeps the orchestrator's defaults trustworthy as the OSS model landscape moves.

@patrykkopycinski
Copy link
Copy Markdown
Contributor Author

/ci

@kibanamachine
Copy link
Copy Markdown
Contributor

kibanamachine commented May 11, 2026

💔 Build Failed

Failed CI Steps

Metrics [docs]

✅ unchanged

History

Copy link
Copy Markdown
Contributor

@SrdjanLL SrdjanLL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Patryk, thanks for drafting this - it's a good nudge in the right direction and consideration of the tokens we'll be burning through with evals.

Top of my mind, besides what Viduni's mentioned, is I'm worried that we're putting a lot of the responsibility on the framework, early. I'm wondering if an internal documentation with a clear "local model of choice" for smoke testing eval suites (published here) should be sufficient to get people started with running local evals (e.g we can choose one or two models depending on what machine you're running with).

From my understanding - our goal is to reduce the cost of the developer feedback loop by running locally, but still evaluate against EIS models using the standard process as we expect our users to use the same models in their workloads. If that's the right goal, I think we can start simple and see how that goes with dev adoption and reduce friction if any arises.

There's a lot to uncover here, but I'll also jump in on individual points.

  1. Wrong model loaded → silent garbage results. If a dev has qwen2.5:7b loaded and runs evals, a bare-connector path will happily start a 20–30 minute run and produce numbers that look fine but are meaningless because that model doesn't reliably tool-call. The orchestrator validates tool-calling against a tiny 2 + 2 function-calling probe (model_validator.ts) before the run and either swaps to a working model, prompts the user, or falls back to CODE-only evaluators. In inject.ts we even hard-fail when no local runtime is detected, specifically to prevent silently falling back to the cloud connector — a "data-trust regression" the comment there calls out.

What does a wrong model mean here? A model that can't call tools? Could this be captured by the evaluations in the standard process - either via the tool call inspection or the groundedness evaluator? If we go with the above "local model of choice" that can call tools we can expect consistent results without maintaining an orchestrator that validates probes into this. Feels like a "figure it out" once kind of a problem, but let me know if there's something I'm missing.

  1. No model / wrong tier for the machine. Many Kibana devs have never installed Ollama at all, and those who have don't necessarily know which model fits 16 GB vs 48 GB vs 96 GB. The orchestrator reads os.totalmem() and picks from a curated registry with a 14 B floor (model_registry.autoSelect() + MIN_EVAL_PARAMS_BILLIONS). The connector-only path assumes you already made all those decisions correctly.

Could this be sorted by having better internal documentation/getting started? To guide our peers which models to set up with (e.g depending on the machine they're using)

  1. VRAM hygiene. After the run, teardown.ts posts keep_alive: 0 so the model is unloaded from VRAM. Without that, a 20 GB model sits in GPU memory indefinitely and chokes the Kibana dev server, Cursor, Docker, etc. — and developers don't reliably remember to ollama stop manually.

It's a good point and if we were to make this automatic it would make sense to unload. My thinking is that we're just putting a lot of responsibility on the framework itself, where this feels more of a dev env hygiene.

  1. Tier-aware execution. The orchestrator sets EVAL_TASK_TIMEOUT_MS, EVAL_THRESHOLD_MULTIPLIER, EVAL_MAX_SAMPLES, and SELECTED_EVALUATORS for local runs (tier_config.ts). The default 5-minute Playwright task timeout doesn't survive a 15 tok/s local model, so a bare connector setup just times out. We also touched create_playwright_eval_config.ts to honor EVAL_TASK_TIMEOUT_MS.

++ Makes sense to have this a bit more configurable for local runs, it could blow up indeed.

  1. Zero-config first run. The whole point of the PR is "use local to reduce EIS token consumption before pushing to CI". That only works if the friction to running local is near-zero — brew install Ollama → pull right model → run eval → unload, in one command. If the first step is "set up Ollama, pick a model from this big list, figure out the connector JSON, then run the eval", most people will just keep using EIS.

It makes sense to reduce friction, I'm just wondering whether full automation of the setup and having this exist in Kibana codebase is the right call here? I'd also imagine that a lot of the folks wanting to run evals with local models will already have ollama or similar set up in their environment anyways and may just need a nudge towards the right model + how to set up the connector.

Please let me know what you think and whether we can start a bit lighter while achieving the same outcome.

@SrdjanLL
Copy link
Copy Markdown
Contributor

Also, other frameworks have dry run concepts, such as this: https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/run-experiments#dry-run

Maybe this is something we could try out and build into our experiments - pick a single example on each dataset, override repetitions to 1 and run.
I've done this with Phoenix before and helped me have a tighter feedback loop when I'm not worried about inspecting the results from the entire experiment.

patrykkopycinski and others added 6 commits May 12, 2026 09:14
Move src/cli/inject.ts from @kbn/evals-local into @kbn/evals as
src/cli/inject_local_connector.ts. The file is made self-contained by
inlining the runtime-detection helpers (probeEndpoint, getOllamaModels,
getLmStudioModel, commandExists, detect) and the connector env-setter
from connector_factory, both of which are being deleted in the broader
kbn-evals-local retirement. The ModelRegistry dependency is dropped in
favour of accepting --local-model values as plain model-name strings
(bare-connector path). The three load-bearing behaviours are preserved
verbatim: the hard-fail guard when no endpoint is detected, the
process.argv strip-and-sync after --local removal, and the
execFileSync-based commandExists call that prevents shell injection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the named re-export so scripts/evals.js can require it directly
from @kbn/evals instead of @kbn/evals-local.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When scripts/evals.js is invoked with --local, call
injectLocalConnector(process.argv) from @kbn/evals before handing off
to cli.run(). Passing process.argv directly (not a slice) lets the
function strip --local / --local-endpoint / --local-model in-place so
cli.run() sees a clean argv. The .then() chain ensures cli.run() only
starts after connector env vars are set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…anner

Replace the early-return stub with env-var injection so --dry-run falls
through to spawn Playwright: sets EVALUATION_REPETITIONS=1 and
EVALUATION_DRY_RUN=true in envOverrides, prints the
'[DRY-RUN] sampling 1 example per dataset, repetitions=1' banner, then
proceeds to the existing spawn block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…RUN=true

In KibanaEvalsClient.runExperiment(), check process.env.EVALUATION_DRY_RUN
and slice resolvedDataset.examples to [examples[0]] before the run loop.
This wires the --dry-run flag end-to-end: CLI sets EVALUATION_DRY_RUN=true,
Playwright inherits it, and the executor limits each dataset to one example.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers: one-line Ollama install, one model recommendation per RAM tier
(16/32/48/64 GB+), EVAL_TASK_TIMEOUT_MS=600000 requirement, --local vs
--dry-run guidance, and pointer to elastic-agent-builder-skill-dev for
advanced benchmarking orchestration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@patrykkopycinski patrykkopycinski changed the title feat(@kbn/evals-local): auto-provisioned local model support for eval runner feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local May 12, 2026
@patrykkopycinski patrykkopycinski requested review from SrdjanLL and viduni94 and removed request for SrdjanLL and viduni94 May 12, 2026 09:25
@patrykkopycinski patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from ddcce6e to 012354a Compare May 12, 2026 10:35
@cla-checker-service
Copy link
Copy Markdown

cla-checker-service Bot commented May 12, 2026

💚 CLA has been signed

patrykkopycinski and others added 3 commits May 12, 2026 12:53
…view

Move the --dry-run envOverrides block above the commandPreview snapshot so
EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true appear in the logged
"Running: ..." line. Previously the preview was built before the dry-run
mutation, making the logged command unreproducible when copy-pasted.

Runtime behavior is unchanged — spawn() always received the correct env.

Smoke test verified: node scripts/evals run --suite agent-builder --dry-run
--local prints the [DRY-RUN] banner, shows all overrides in the Running:
line, and Playwright starts 12 tests (1 per dataset spec) with
EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d; fix timer leak

Adds inject_local_connector.test.ts with 7 unit tests covering the
hard-fail path (no Ollama, no LM Studio, no binary installed), the
env-var injection happy path, and the binary-installed-but-not-running
path. All assertions verified: throws with actionable message, strips
--local from args before detection, never sets EVALUATION_CONNECTOR_ID
or KIBANA_TESTING_AI_CONNECTORS when no runtime is found.

Also fixes a timer resource leak in probeEndpoint / getOllamaModels /
getLmStudioModel: clearTimeout was only called on the success path; moved
to finally{} so it fires on rejection too. This eliminated the
"Jest did not exit" open-handle warning that surfaced during test authoring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lsClient

Verifies that runExperiment() limits execution to the first example
when EVALUATION_DRY_RUN=true (regardless of repetitions), and runs
all examples when the var is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@patrykkopycinski patrykkopycinski force-pushed the feat/local-evals-with-ollama branch from 012354a to 2904710 Compare May 12, 2026 10:54
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 13, 2026
…amily tools

Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:

- `security.detection-emulation.run-process-command` — kill-process,
  suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command`    — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
  runscript, cancel

Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.

Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.

Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
  (replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
  command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
  shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
  activation tokens still resolve.

Tests:
- New `run_command_tools.test.ts` exercises each family's schema
  discrimination — accept-valid / reject-invalid per command +
  cross-family rejection (each tool refuses commands from the other
  three families) + shared rules (empty endpointIds, unknown
  agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
  covered by the new per-family tests + the shared
  `RunEmulationCommandInputSchema` round-trip in
  `validate_rule_input.test.ts`).

Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
  exported `commonFields`, `optionalCommentParams`, `processRefParams`,
  `memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
  `runscriptParams`, `uploadParams` so the per-family tools can reuse
  them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
  run_file_command_tool.ts, run_network_command_tool.ts,
  run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
  index.ts (re-exports the four per-family factories)

Addresses Srdjan's I5 review item from PR elastic#265798.
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 13, 2026
…amily tools

Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:

- `security.detection-emulation.run-process-command` — kill-process,
  suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command`    — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
  runscript, cancel

Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.

Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.

Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
  (replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
  command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
  shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
  activation tokens still resolve.

Tests:
- New `run_command_tools.test.ts` exercises each family's schema
  discrimination — accept-valid / reject-invalid per command +
  cross-family rejection (each tool refuses commands from the other
  three families) + shared rules (empty endpointIds, unknown
  agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
  covered by the new per-family tests + the shared
  `RunEmulationCommandInputSchema` round-trip in
  `validate_rule_input.test.ts`).

Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
  exported `commonFields`, `optionalCommentParams`, `processRefParams`,
  `memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
  `runscriptParams`, `uploadParams` so the per-family tools can reuse
  them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
  run_file_command_tool.ts, run_network_command_tool.ts,
  run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
  index.ts (re-exports the four per-family factories)

Addresses Srdjan's I5 review item from PR elastic#265798.
patrykkopycinski added a commit to patrykkopycinski/kibana that referenced this pull request May 15, 2026
…amily tools

Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:

- `security.detection-emulation.run-process-command` — kill-process,
  suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command`    — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
  runscript, cancel

Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.

Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.

Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
  (replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
  command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
  shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
  activation tokens still resolve.

Tests:
- New `run_command_tools.test.ts` exercises each family's schema
  discrimination — accept-valid / reject-invalid per command +
  cross-family rejection (each tool refuses commands from the other
  three families) + shared rules (empty endpointIds, unknown
  agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
  covered by the new per-family tests + the shared
  `RunEmulationCommandInputSchema` round-trip in
  `validate_rule_input.test.ts`).

Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
  exported `commonFields`, `optionalCommentParams`, `processRefParams`,
  `memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
  `runscriptParams`, `uploadParams` so the per-family tools can reuse
  them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
  run_file_command_tool.ts, run_network_command_tool.ts,
  run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
  index.ts (re-exports the four per-family factories)

Addresses Srdjan's I5 review item from PR elastic#265798.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release_note:skip Skip the PR/issue when compiling release notes v9.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants