feat(@kbn/evals): --local connector injection, --dry-run, configurable timeout; retire @kbn/evals-local#265798
Conversation
|
🤖 Jobs for this PR can be triggered through checkboxes. 🚧
ℹ️ To trigger the CI, please tick the checkbox below 👇
|
Test Plan ResultsEnvironment: macOS 25.4.0, Apple Silicon, 48GB RAM, Node v24.14.1, Ollama 0.21.1 1. Type checking — PASSZero errors from 2.
|
| plugins: | ||
| - name: verifier | ||
| source: local | ||
| path: /Users/patrykkopycinski/Projects/work/ao-verifier-plugin |
There was a problem hiding this comment.
🟡 Medium agent-orchestrator.yaml:14
The configuration file embeds hardcoded personal filesystem paths (/Users/patrykkopycinski/Projects/work/ao-verifier-plugin, /Users/patrykkopycinski/Projects/ao-workspace), a private machine hostname (worker-m1max), and a username (mac). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate .env file that is .gitignored.
Also found in 3 other location(s)
openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3
Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (
/Users/patrykkopycinski/Projects/kibana). This auto-generated script is committed to the repository but will fail immediately withcd: /Users/patrykkopycinski/Projects/kibana: No such file or directoryon any other developer's machine or CI environment. Theset -eon line 2 ensures the script exits on the first failedcdcommand.
openspec/changes/ao-health-panel/.ao-generate.sh:3
Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (
/Users/patrykkopycinski/Projects/kibana). This script is committed to the shared repository but will fail immediately for any other developer or CI environment becausecd "/Users/patrykkopycinski/Projects/kibana"will error out underset -e, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g.,$(git rev-parse --show-toplevel)).
eval-ad-remaining.sh:6
Line 6 hardcodes
cd /Users/patrykkopycinski/Projects/kibana, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing thecdto fail. Withset -u(but notset -e), the script will continue running from whatever directory it was invoked in, likely causingsource ./eval-env.shandnode scripts/evals.jsto fail with confusing errors.
🤖 Copy this AI Prompt to have your agent fix this:
In file agent-orchestrator.yaml around line 14:
The configuration file embeds hardcoded personal filesystem paths (`/Users/patrykkopycinski/Projects/work/ao-verifier-plugin`, `/Users/patrykkopycinski/Projects/ao-workspace`), a private machine hostname (`worker-m1max`), and a username (`mac`). This commits private infrastructure details to the repository and breaks portability—no other developer can use this configuration without modification, and CI/automation environments will fail. Consider externalizing these values to environment variables or a separate `.env` file that is `.gitignore`d.
Also found in 3 other location(s):
- openspec/changes/ad-triage-eval-suite/.ao-generate.sh:3 -- Hardcoded absolute paths at lines 3, 5, and 6 reference a specific developer's local machine path (`/Users/patrykkopycinski/Projects/kibana`). This auto-generated script is committed to the repository but will fail immediately with `cd: /Users/patrykkopycinski/Projects/kibana: No such file or directory` on any other developer's machine or CI environment. The `set -e` on line 2 ensures the script exits on the first failed `cd` command.
- openspec/changes/ao-health-panel/.ao-generate.sh:3 -- Hardcoded absolute paths to a specific developer's home directory at lines 3, 5, and 6 (`/Users/patrykkopycinski/Projects/kibana`). This script is committed to the shared repository but will fail immediately for any other developer or CI environment because `cd "/Users/patrykkopycinski/Projects/kibana"` will error out under `set -e`, aborting execution. This appears to be a generated local artifact that should not have been committed, or should use relative/dynamic paths (e.g., `$(git rev-parse --show-toplevel)`).
- eval-ad-remaining.sh:6 -- Line 6 hardcodes `cd /Users/patrykkopycinski/Projects/kibana`, a developer-specific absolute path. Any other user or CI environment running this script will fail immediately because the directory won't exist, causing the `cd` to fail. With `set -u` (but not `set -e`), the script will continue running from whatever directory it was invoked in, likely causing `source ./eval-env.sh` and `node scripts/evals.js` to fail with confusing errors.
| ) > "$log" 2>&1 || echo " NOTE: eval exited non-zero (see $log)" | ||
|
|
||
| echo " exit-code: $?" |
There was a problem hiding this comment.
🟢 Low eval-nfc-ab.sh:126
echo " exit-code: $?" on line 128 always prints 0. The subshell exit code on line 126 is lost because when it fails, the || echo ... succeeds and resets $? to 0; when it succeeds, $? is naturally 0. The actual exit code is never captured.
Consider saving the subshell exit code before the || branch: capture rc=$? immediately after the subshell, then use $rc for both the error note and the final exit-code print. Note that with set -e, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.
+ ) > "$log" 2>&1; rc=$?
+ if [[ $rc -ne 0 ]]; then
+ echo " NOTE: eval exited non-zero (see $log)"
+ fi
+
+ echo " exit-code: $rc"🤖 Copy this AI Prompt to have your agent fix this:
In file eval-nfc-ab.sh around lines 126-128:
`echo " exit-code: $?"` on line 128 always prints `0`. The subshell exit code on line 126 is lost because when it fails, the `|| echo ...` succeeds and resets `$?` to `0`; when it succeeds, `$?` is naturally `0`. The actual exit code is never captured.
Consider saving the subshell exit code before the `||` branch: capture `rc=$?` immediately after the subshell, then use `$rc` for both the error note and the final exit-code print. Note that with `set -e`, you'll need to restructure to avoid the subshell triggering an immediate exit on failure.
| exit 1 | ||
| fi | ||
|
|
||
| echo ">> ES = $ES_URL" |
There was a problem hiding this comment.
🟢 Low eval-restore-kbs.sh:38
ES_URL (which may contain credentials like user:password@host) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with sed or parameter expansion.
-echo ">> ES = $ES_URL"
+echo ">> ES = ${ES_URL//@*/**SECRET**@}"🤖 Copy this AI Prompt to have your agent fix this:
In file eval-restore-kbs.sh around line 38:
`ES_URL` (which may contain credentials like `user:password@host`) is printed verbatim to stdout on line 38, exposing secrets to terminal output and CI logs. Consider sanitizing the URL before display, e.g., by masking the password portion with `sed` or parameter expansion.
| # Detect: gh pr create | ||
| if [[ "$clean_command" =~ ^gh[[:space:]]+pr[[:space:]]+create ]]; then | ||
| # Extract PR URL from output | ||
| pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1) |
There was a problem hiding this comment.
🟢 Low .claude/metadata-updater.sh:106
When gh pr create output doesn't match the expected URL pattern, grep -Eo returns exit code 1, which with pipefail causes the script to terminate before reaching the [[ -n "$pr_url" ]] check. Consider appending || true to the pipeline so the script gracefully handles missing URLs instead of crashing.
- pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1)
+ pr_url=$(echo "$output" | grep -Eo 'https://github[.]com/[^/]+/[^/]+/pull/[0-9]+' | head -1) || true🤖 Copy this AI Prompt to have your agent fix this:
In file .claude/metadata-updater.sh around line 106:
When `gh pr create` output doesn't match the expected URL pattern, `grep -Eo` returns exit code 1, which with `pipefail` causes the script to terminate before reaching the `[[ -n "$pr_url" ]]` check. Consider appending `|| true` to the pipeline so the script gracefully handles missing URLs instead of crashing.
| BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || echo 0) | ||
| RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || echo 0) | ||
| RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || echo 0) | ||
| EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || echo 0) | ||
| RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || echo 0) |
There was a problem hiding this comment.
🟢 Low eval-ad-remaining.sh:38
Lines 38–42 use || echo 0 as a fallback when grep -c finds no matches, but grep -c already prints 0 to stdout before exiting with code 1. The command substitution therefore captures the concatenated output "0\n0" (two lines), which causes garbled meta-file values like basic_done=0\n0/6. Consider removing the || echo 0 (or replacing it with || true) since the script doesn't use set -e.
+ BASIC=$(grep -c 'Evaluator "AttackDiscoveryBasic".*completed' "$LOG" 2>/dev/null || true)
+ RUBRIC=$(grep -c 'Evaluator "AttackDiscoveryRubric".*completed' "$LOG" 2>/dev/null || true)
+ RAN=$(grep -c 'Evaluator "Ran".*completed' "$LOG" 2>/dev/null || true)
+ EXPORT=$(grep -c 'Evaluation scores exported' "$LOG" 2>/dev/null || true)
+ RETRIES=$(grep -cE 'ECONNRESET|ERR_CANCELED|retrying' "$LOG" 2>/dev/null || true)🤖 Copy this AI Prompt to have your agent fix this:
In file eval-ad-remaining.sh around lines 38-42:
Lines 38–42 use `|| echo 0` as a fallback when `grep -c` finds no matches, but `grep -c` already prints `0` to stdout before exiting with code 1. The command substitution therefore captures the concatenated output `"0\n0"` (two lines), which causes garbled meta-file values like `basic_done=0\n0/6`. Consider removing the `|| echo 0` (or replacing it with `|| true`) since the script doesn't use `set -e`.
a47298c to
b73ba71
Compare
…andExists Discovered via patryks-treadmill ship-readiness audit on PR elastic#265798: 1. RECOMMENDATIONS.md was labeled 'Auto-generated, do not edit' but contained richer tables (5-col Quick Pick with Why, 8-col All Models with Active) that generateRecommendations() couldn't reproduce. Next --update-registry run would silently overwrite the enriched content. Fix: extend ModelConfig with whyShortReason + activeParamsBillions, update generator to emit matching schema, and move the one-time 'What changed from v1' migration notes to CHANGELOG.md (their structural home). 2. commandExists() in detect.ts and provision.ts shell-interpolated its argument (\`command -v ${cmd}\`). Currently safe (only called with hardcoded literals), but a future caller passing user input would create a shell-injection sink. Replaced with execFileSync + sh -c "command -v \$1" + positional argument so cmd is never re-tokenized by the shell. Both findings were verified by the gate against the actual source files (anti-hallucination rules in the audit prompt held).
injectLocalConnector() previously logged a warning and resolved without setting any connector env vars when neither Ollama nor LM Studio was running. The caller in scripts/evals.js chains `.then(() => cli.run())` unconditionally on the resolved promise, so the warn-and-return path let @kbn/evals start with the default CLOUD connector — silently producing eval reports that look like local model output but actually used Anthropic/OpenAI. A developer reading the report could not tell their local model was never invoked. Throw an explicit Error with actionable remediation instead. Refusing to silently fall back to cloud is a data-trust requirement for an "--local" flag. Surfaced by patryks-treadmill ship-readiness audit run #7 against PR elastic#265798 (verdict warn / score 74, single major finding, verified by tracing inject.ts:54 → scripts/evals.js .then() chain).
…them benchmarkSingleModel's catch bound `e` but never referenced it. When ensureRuntime() or ensureModel() threw (Ollama not installed, model pull failed, disk full), the printed results table showed `evalExitCode: 1` with zero explanation. Operators had to dig through the source to guess what failed. Log the underlying error message before returning the failure record so the failure is actionable on first read of the table. Surfaced by patryks-treadmill ship-readiness audit run #9 against PR elastic#265798 (verdict warn / score 74, verified at benchmark.ts:93-105).
|
/ci |
viduni94
left a comment
There was a problem hiding this comment.
Hey @patrykkopycinski
Did a quick skim through of the PR. I'll do a deeper review tomorrow.
Out of curiosity, what's the advantage of auto provisioning the models via Ollama or LM studio vs. just pointing a connector config at an already running Ollama/LM Studio instance?
|
@viduni94 great question — and to be clear, the PR ships both options. There are actually three pieces in here, and each one answers part of your question:
So the question is really why does the orchestrator exist on top of the bare connector path, and why is there a benchmark pipeline at all. The two things are linked. Why auto-provisioning is the default for
|
|
/ci |
💔 Build Failed
Failed CI StepsMetrics [docs]
History
|
SrdjanLL
left a comment
There was a problem hiding this comment.
Hey Patryk, thanks for drafting this - it's a good nudge in the right direction and consideration of the tokens we'll be burning through with evals.
Top of my mind, besides what Viduni's mentioned, is I'm worried that we're putting a lot of the responsibility on the framework, early. I'm wondering if an internal documentation with a clear "local model of choice" for smoke testing eval suites (published here) should be sufficient to get people started with running local evals (e.g we can choose one or two models depending on what machine you're running with).
From my understanding - our goal is to reduce the cost of the developer feedback loop by running locally, but still evaluate against EIS models using the standard process as we expect our users to use the same models in their workloads. If that's the right goal, I think we can start simple and see how that goes with dev adoption and reduce friction if any arises.
There's a lot to uncover here, but I'll also jump in on individual points.
- Wrong model loaded → silent garbage results. If a dev has qwen2.5:7b loaded and runs evals, a bare-connector path will happily start a 20–30 minute run and produce numbers that look fine but are meaningless because that model doesn't reliably tool-call. The orchestrator validates tool-calling against a tiny 2 + 2 function-calling probe (model_validator.ts) before the run and either swaps to a working model, prompts the user, or falls back to CODE-only evaluators. In inject.ts we even hard-fail when no local runtime is detected, specifically to prevent silently falling back to the cloud connector — a "data-trust regression" the comment there calls out.
What does a wrong model mean here? A model that can't call tools? Could this be captured by the evaluations in the standard process - either via the tool call inspection or the groundedness evaluator? If we go with the above "local model of choice" that can call tools we can expect consistent results without maintaining an orchestrator that validates probes into this. Feels like a "figure it out" once kind of a problem, but let me know if there's something I'm missing.
- No model / wrong tier for the machine. Many Kibana devs have never installed Ollama at all, and those who have don't necessarily know which model fits 16 GB vs 48 GB vs 96 GB. The orchestrator reads os.totalmem() and picks from a curated registry with a 14 B floor (model_registry.autoSelect() + MIN_EVAL_PARAMS_BILLIONS). The connector-only path assumes you already made all those decisions correctly.
Could this be sorted by having better internal documentation/getting started? To guide our peers which models to set up with (e.g depending on the machine they're using)
- VRAM hygiene. After the run, teardown.ts posts keep_alive: 0 so the model is unloaded from VRAM. Without that, a 20 GB model sits in GPU memory indefinitely and chokes the Kibana dev server, Cursor, Docker, etc. — and developers don't reliably remember to ollama stop manually.
It's a good point and if we were to make this automatic it would make sense to unload. My thinking is that we're just putting a lot of responsibility on the framework itself, where this feels more of a dev env hygiene.
- Tier-aware execution. The orchestrator sets EVAL_TASK_TIMEOUT_MS, EVAL_THRESHOLD_MULTIPLIER, EVAL_MAX_SAMPLES, and SELECTED_EVALUATORS for local runs (tier_config.ts). The default 5-minute Playwright task timeout doesn't survive a 15 tok/s local model, so a bare connector setup just times out. We also touched create_playwright_eval_config.ts to honor EVAL_TASK_TIMEOUT_MS.
++ Makes sense to have this a bit more configurable for local runs, it could blow up indeed.
- Zero-config first run. The whole point of the PR is "use local to reduce EIS token consumption before pushing to CI". That only works if the friction to running local is near-zero — brew install Ollama → pull right model → run eval → unload, in one command. If the first step is "set up Ollama, pick a model from this big list, figure out the connector JSON, then run the eval", most people will just keep using EIS.
It makes sense to reduce friction, I'm just wondering whether full automation of the setup and having this exist in Kibana codebase is the right call here? I'd also imagine that a lot of the folks wanting to run evals with local models will already have ollama or similar set up in their environment anyways and may just need a nudge towards the right model + how to set up the connector.
Please let me know what you think and whether we can start a bit lighter while achieving the same outcome.
|
Also, other frameworks have dry run concepts, such as this: https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/run-experiments#dry-run Maybe this is something we could try out and build into our experiments - pick a single example on each dataset, override repetitions to 1 and run. |
Move src/cli/inject.ts from @kbn/evals-local into @kbn/evals as src/cli/inject_local_connector.ts. The file is made self-contained by inlining the runtime-detection helpers (probeEndpoint, getOllamaModels, getLmStudioModel, commandExists, detect) and the connector env-setter from connector_factory, both of which are being deleted in the broader kbn-evals-local retirement. The ModelRegistry dependency is dropped in favour of accepting --local-model values as plain model-name strings (bare-connector path). The three load-bearing behaviours are preserved verbatim: the hard-fail guard when no endpoint is detected, the process.argv strip-and-sync after --local removal, and the execFileSync-based commandExists call that prevents shell injection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the named re-export so scripts/evals.js can require it directly from @kbn/evals instead of @kbn/evals-local. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When scripts/evals.js is invoked with --local, call injectLocalConnector(process.argv) from @kbn/evals before handing off to cli.run(). Passing process.argv directly (not a slice) lets the function strip --local / --local-endpoint / --local-model in-place so cli.run() sees a clean argv. The .then() chain ensures cli.run() only starts after connector env vars are set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…anner Replace the early-return stub with env-var injection so --dry-run falls through to spawn Playwright: sets EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true in envOverrides, prints the '[DRY-RUN] sampling 1 example per dataset, repetitions=1' banner, then proceeds to the existing spawn block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…RUN=true In KibanaEvalsClient.runExperiment(), check process.env.EVALUATION_DRY_RUN and slice resolvedDataset.examples to [examples[0]] before the run loop. This wires the --dry-run flag end-to-end: CLI sets EVALUATION_DRY_RUN=true, Playwright inherits it, and the executor limits each dataset to one example. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers: one-line Ollama install, one model recommendation per RAM tier (16/32/48/64 GB+), EVAL_TASK_TIMEOUT_MS=600000 requirement, --local vs --dry-run guidance, and pointer to elastic-agent-builder-skill-dev for advanced benchmarking orchestration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ddcce6e to
012354a
Compare
|
💚 CLA has been signed |
…view Move the --dry-run envOverrides block above the commandPreview snapshot so EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true appear in the logged "Running: ..." line. Previously the preview was built before the dry-run mutation, making the logged command unreproducible when copy-pasted. Runtime behavior is unchanged — spawn() always received the correct env. Smoke test verified: node scripts/evals run --suite agent-builder --dry-run --local prints the [DRY-RUN] banner, shows all overrides in the Running: line, and Playwright starts 12 tests (1 per dataset spec) with EVALUATION_REPETITIONS=1 and EVALUATION_DRY_RUN=true set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d; fix timer leak
Adds inject_local_connector.test.ts with 7 unit tests covering the
hard-fail path (no Ollama, no LM Studio, no binary installed), the
env-var injection happy path, and the binary-installed-but-not-running
path. All assertions verified: throws with actionable message, strips
--local from args before detection, never sets EVALUATION_CONNECTOR_ID
or KIBANA_TESTING_AI_CONNECTORS when no runtime is found.
Also fixes a timer resource leak in probeEndpoint / getOllamaModels /
getLmStudioModel: clearTimeout was only called on the success path; moved
to finally{} so it fires on rejection too. This eliminated the
"Jest did not exit" open-handle warning that surfaced during test authoring.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lsClient Verifies that runExperiment() limits execution to the first example when EVALUATION_DRY_RUN=true (regardless of repetitions), and runs all examples when the var is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
012354a to
2904710
Compare
…amily tools
Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:
- `security.detection-emulation.run-process-command` — kill-process,
suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command` — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
runscript, cancel
Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.
Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.
Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
(replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
activation tokens still resolve.
Tests:
- New `run_command_tools.test.ts` exercises each family's schema
discrimination — accept-valid / reject-invalid per command +
cross-family rejection (each tool refuses commands from the other
three families) + shared rules (empty endpointIds, unknown
agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
covered by the new per-family tests + the shared
`RunEmulationCommandInputSchema` round-trip in
`validate_rule_input.test.ts`).
Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
exported `commonFields`, `optionalCommentParams`, `processRefParams`,
`memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
`runscriptParams`, `uploadParams` so the per-family tools can reuse
them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
run_file_command_tool.ts, run_network_command_tool.ts,
run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
index.ts (re-exports the four per-family factories)
Addresses Srdjan's I5 review item from PR elastic#265798.
…amily tools
Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:
- `security.detection-emulation.run-process-command` — kill-process,
suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command` — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
runscript, cancel
Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.
Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.
Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
(replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
activation tokens still resolve.
Tests:
- New `run_command_tools.test.ts` exercises each family's schema
discrimination — accept-valid / reject-invalid per command +
cross-family rejection (each tool refuses commands from the other
three families) + shared rules (empty endpointIds, unknown
agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
covered by the new per-family tests + the shared
`RunEmulationCommandInputSchema` round-trip in
`validate_rule_input.test.ts`).
Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
exported `commonFields`, `optionalCommentParams`, `processRefParams`,
`memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
`runscriptParams`, `uploadParams` so the per-family tools can reuse
them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
run_file_command_tool.ts, run_network_command_tool.ts,
run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
index.ts (re-exports the four per-family factories)
Addresses Srdjan's I5 review item from PR elastic#265798.
…amily tools
Replaces the monolithic `runEmulationCommand` tool with four narrower,
typed-by-discriminated-union tools — one per command family:
- `security.detection-emulation.run-process-command` — kill-process,
suspend-process, running-processes, memory-dump
- `security.detection-emulation.run-file-command` — get-file, scan, upload
- `security.detection-emulation.run-network-command` — isolate, unisolate
- `security.detection-emulation.run-execution-command` — execute,
runscript, cancel
Each tool's schema is a Zod `discriminatedUnion('command', …)` over the
exported parameter shapes (`processRefParams`, `pathParams`,
`executeParams`, `runscriptParams`, `cancelParams`,
`optionalCommentParams`, `memoryDumpParams`, `uploadParams`) so
misspelled fields, extra keys, or wrong types fail at the boundary
before reaching the EDR connector.
Per-family handlers all delegate to the new `withCommandGates` helper
that consolidates the security gate sequence the monolithic tool used
to inline (feature flag → per-command RBAC → host allowlist → atomic
rate-limit acquire → authenticated caller → runner dispatch → typed
runner-error mapping → rate-limit release on failure). The helper's
gate order, error classifications, status codes, and side effects
match the original 1:1.
Skill body updated:
- Tool-IDs alias paragraph now lists the four per-family tool IDs
(replaces the single `runEmulationCommand → run-command` entry).
- "Low-level command dispatch" guidance now picks the family by
command, with a per-family table of required `parameters` shapes.
- Guardrails table collapses the four `run*Command` columns into one
shared column (the gates are identical across families).
- Skill `description` extended with the four-family taxonomy hint so
activation tokens still resolve.
Tests:
- New `run_command_tools.test.ts` exercises each family's schema
discrimination — accept-valid / reject-invalid per command +
cross-family rejection (each tool refuses commands from the other
three families) + shared rules (empty endpointIds, unknown
agentType, empty emulationId). 45 tests, all green.
- Old `run_emulation_command_tool.test.ts` deleted (its scope is
covered by the new per-family tests + the shared
`RunEmulationCommandInputSchema` round-trip in
`validate_rule_input.test.ts`).
Files:
- common/detection_emulation/schemas/run_emulation_command_input.ts:
exported `commonFields`, `optionalCommentParams`, `processRefParams`,
`memoryDumpParams`, `cancelParams`, `pathParams`, `executeParams`,
`runscriptParams`, `uploadParams` so the per-family tools can reuse
them. Single source of truth for parameter shapes.
- New: with_command_gates.ts, run_process_command_tool.ts,
run_file_command_tool.ts, run_network_command_tool.ts,
run_execution_command_tool.ts
- Deleted: run_emulation_command_tool.ts, run_emulation_command_tool.test.ts
- Updated: detection_emulation_skill.ts (registration array + skill body),
index.ts (re-exports the four per-family factories)
Addresses Srdjan's I5 review item from PR elastic#265798.
Summary
What's in this PR
(a)
--localconnector injection in@kbn/evalssrc/cli/inject_local_connector.tsis moved from the deleted@kbn/evals-localpackage into@kbn/evals. It probes running Ollama (localhost:11434) or LM Studio (localhost:1234), discovers the loaded model name, and injectsKIBANA_TESTING_AI_CONNECTORS+EVALUATION_CONNECTOR_IDso that any existing eval command picks up the local model.scripts/evals.jsroutes--localviarequire('@kbn/evals').injectLocalConnector(was@kbn/evals-local).Hard-fail behaviour is preserved: if no runtime is reachable, the process exits with a non-zero code and prints:
(b) Configurable eval task timeout (
EVAL_TASK_TIMEOUT_MS)create_playwright_eval_config.tsreadsprocess.env.EVAL_TASK_TIMEOUT_MSand uses it as the Playwright task timeout, falling back to 5 minutes. Local models are significantly slower than EIS — this allows callers to bump the timeout without modifying test fixtures.(c)
--dry-runnode scripts/evals run --suite <id> --dry-runslices each registered dataset to its first example, setsEVALUATION_REPETITIONS=1, and prints a banner at startup:This lets developers verify that the full eval pipeline (Kibana → Scout → evaluators → export) works end-to-end in ~1 minute before committing to a full run.
(d) Documentation
@kbn/evals/README.mdgains a "Local model quick-start" section: one-line install (brew install ollama && ollama pull <model>), one model recommendation per RAM tier (16 GB / 32 GB / 48 GB / 64 GB+), and the required env var (EVAL_TASK_TIMEOUT_MS=120000).What was removed
The entire
x-pack/platform/packages/shared/kbn-evals-local/package is deleted:src/cli/local.ts,src/cli/benchmark.ts— auto-provisioning + benchmark orchestratorsrc/local/provision.ts,teardown.ts,model_negotiator.ts,model_validator.ts— runtime lifecyclesrc/local/model_registry.ts,models.json,tier_config.ts— curated model registrysrc/benchmark/result_writer.ts,benchmark-results/— benchmark result storageRECOMMENDATIONS.md,.cursor/skills/local-evals/SKILL.md— derived artefactsReferences cleaned up in
package.json(workspace entry),tsconfig.base.json(path mapping),.github/CODEOWNERS, andyarn.lock.Orchestrator, benchmark pipeline, and model registry move to
elastic/agent-builder-skill-dev-cursor-plugin#2(feat(local-evals): port @kbn/evals-local orchestrator + benchmark + registry), the sibling PR to this one. Tracked under therealign-pr-265798treadmill plan group.Usage after this PR
Test plan
node scripts/evals run --suite agent-builder --localwith Ollama running → connector injected, suite startsnode scripts/evals run --suite agent-builder --localwith no Ollama or LM Studio running → non-zero exit, actionable error message (7 unit tests ininject_local_connector.test.ts)node scripts/evals run --suite agent-builder --dry-run→[DRY-RUN]banner printed,EVALUATION_DRY_RUN=trueinRunning:preview, dataset sliced to 1 examplenode scripts/type_check --project x-pack/platform/packages/shared/kbn-evals/tsconfig.jsonnode scripts/eslint --fix $(git diff --name-only HEAD)