Closed
Commits
97 commits
f81ce84
Adds Attack Discovery eval suite
spong Mar 10, 2026
cf28ebd
Changes from node scripts/lint.js --fix
kibanamachine Mar 10, 2026
28cda74
Changes from node scripts/generate codeowners
kibanamachine Mar 10, 2026
b6a8718
Changes from node scripts/regenerate_moon_projects.js --update
kibanamachine Mar 10, 2026
b86008d
Update type def
spong Mar 11, 2026
01b3191
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 11, 2026
85c6d65
Merge branch 'main' into evals-attack-discovery
spong Mar 11, 2026
6614014
Merge branch 'main' into evals-attack-discovery
spong Mar 11, 2026
1668076
Fix linter
spong Mar 11, 2026
76f46b0
Merge branch 'main' into evals-attack-discovery
spong Mar 13, 2026
43d4dc4
Update the cheese in the moon
spong Mar 13, 2026
8112c67
Merge branch 'main' into evals-attack-discovery
spong Mar 16, 2026
0058981
Add support for running against online datasets
spong Mar 17, 2026
9616059
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 17, 2026
fabd55a
Add safe set
spong Mar 18, 2026
4ecc1a6
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 18, 2026
0aff31c
Changes from node scripts/lint_ts_projects --fix
kibanamachine Mar 18, 2026
b42795f
Changes from node scripts/regenerate_moon_projects.js --update
kibanamachine Mar 18, 2026
e3b2d57
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 19, 2026
22fe69f
Add suites label updater and dataset uploader safeguards
spong Mar 19, 2026
1549ac4
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 19, 2026
b82d1aa
Some eval run details ux enhancements
spong Mar 19, 2026
7c07141
Add profiles and clear index helper
spong Mar 20, 2026
e932e84
Merge branch 'main' of github.com:elastic/kibana into evals-attack-di…
spong Mar 20, 2026
9c2f439
Add RFC validation design spec for LLM batch processing
patrykkopycinski Mar 21, 2026
c2064c6
fix: address plan review issues - dependency management and function …
patrykkopycinski Mar 21, 2026
3faa622
feat: create @kbn/llm-batch-processing package structure
patrykkopycinski Mar 21, 2026
9b94942
fix(llm-batch): correct package type to shared-server and add CODEOWNERS
patrykkopycinski Mar 21, 2026
142c549
feat(llm-batch): add package type definitions
patrykkopycinski Mar 21, 2026
34a445a
feat(llm-batch): add split logic with token-based strategy
patrykkopycinski Mar 21, 2026
e1a599a
feat(llm-batch): add hierarchical merge logic
patrykkopycinski Mar 21, 2026
f6aaf46
feat(llm-batch): add batch orchestrator with concurrency control
patrykkopycinski Mar 21, 2026
7c195c1
feat(llm-batch): add public API and README
patrykkopycinski Mar 21, 2026
8a55d60
feat(evals-ad): add latency and token usage evaluators
patrykkopycinski Mar 21, 2026
0025693
feat(evals-ad): capture latency and token metrics in task
patrykkopycinski Mar 21, 2026
991bb4f
feat(evals-ad): register latency and token evaluators
patrykkopycinski Mar 21, 2026
23d377c
feat(evals-ad): integrate @kbn/llm-batch-processing for large alert sets
patrykkopycinski Mar 21, 2026
c80549c
fix(llm-batch): use source imports instead of built output
patrykkopycinski Mar 21, 2026
a2b7e04
docs: add RFC SEC-2026-002 validation results
patrykkopycinski Mar 21, 2026
f92f0a1
docs: update RFC validation with 100-alert benchmark results
patrykkopycinski Mar 21, 2026
fc02813
fix(evals-ad): implement semantic LLM-based merge for batch insights
patrykkopycinski Mar 21, 2026
b1dac9b
fix(evals-ad): revert to concatenation merge - semantic merge causes …
patrykkopycinski Mar 21, 2026
54dd634
docs: add final RFC validation summary with 500-alert benchmark
patrykkopycinski Mar 21, 2026
591bdd8
docs: add raw metrics analysis and final validation summary
patrykkopycinski Mar 21, 2026
2d30f3c
fix(evals-ad): use raw metrics instead of tiered scores
patrykkopycinski Mar 21, 2026
cf75409
feat(evals-ad): make concurrency configurable via env var
patrykkopycinski Mar 21, 2026
76e27e3
feat(evals-ad): dynamic concurrency + enhanced token tracking
patrykkopycinski Mar 21, 2026
3cc1853
debug(evals-ad): add detailed logging for token usage investigation
patrykkopycinski Mar 21, 2026
58dfea6
fix(evals-ad): correct token extraction - use response.tokens not res…
patrykkopycinski Mar 21, 2026
ccd9f93
docs: add LLM benchmarker agent design spec
patrykkopycinski Mar 21, 2026
2574996
docs: add OSS model validation plan with Ollama
patrykkopycinski Mar 21, 2026
a2ae2af
docs: comprehensive final validation report with all optimizations
patrykkopycinski Mar 21, 2026
9ef8231
docs: add LLM benchmarker agent implementation plan
patrykkopycinski Mar 21, 2026
a7d50e7
docs: honest final RFC validation with all findings
patrykkopycinski Mar 21, 2026
84fa296
docs: complete validation summary + OSS prompt research
patrykkopycinski Mar 21, 2026
175c710
docs: add Incremental Attack Discovery design spec
patrykkopycinski Mar 21, 2026
55462df
docs: complete analysis - batch processing vs incremental AD
patrykkopycinski Mar 21, 2026
a4a21cd
feat: create RFC validator skill for systematic validation
patrykkopycinski Mar 21, 2026
b855e19
wip: save incremental AD work
patrykkopycinski Mar 21, 2026
339ffc5
docs: unified Incremental AD spec (delta + progressive modes)
patrykkopycinski Mar 21, 2026
614f008
docs: complete unified incremental AD implementation plan
patrykkopycinski Mar 21, 2026
9b9adf8
feat(ad): add incremental AD type definitions
patrykkopycinski Mar 21, 2026
16ea191
feat(ad): add ES-backed state tracker for incremental AD
patrykkopycinski Mar 21, 2026
a790149
feat(ad): add delta computer for incremental mode
patrykkopycinski Mar 21, 2026
0b2d613
feat(ad): add rule-based insight merger
patrykkopycinski Mar 21, 2026
32e0291
feat(ad): add round-based processor
patrykkopycinski Mar 21, 2026
1765551
feat(ad): add unified incremental AD API (delta + progressive)
patrykkopycinski Mar 21, 2026
5735996
feat(ad): add incremental AD integration helper and documentation
patrykkopycinski Mar 21, 2026
061452d
feat(ad): add telemetry for incremental attack discovery
patrykkopycinski Mar 21, 2026
5e634bb
docs(ad): add comprehensive API documentation for incremental mode
patrykkopycinski Mar 21, 2026
03f2802
test(ad): add comprehensive validation suite for incremental AD
patrykkopycinski Mar 21, 2026
f8b0b74
docs(ad): add complete implementation summary
patrykkopycinski Mar 21, 2026
41965bf
feat(ad): extend API schema with incremental mode support
patrykkopycinski Mar 21, 2026
66cacf0
feat(ad): wire up incremental mode to Attack Discovery endpoints
patrykkopycinski Mar 21, 2026
9bc6369
feat(ad): add monitoring infrastructure for incremental AD
patrykkopycinski Mar 21, 2026
df2bd4b
feat(ad): add feature flags and production rollout plan
patrykkopycinski Mar 21, 2026
c10ca51
feat(ad): add real LLM validation scripts and testing tools
patrykkopycinski Mar 21, 2026
6ef2326
docs(ad): add final implementation report
patrykkopycinski Mar 21, 2026
960d64f
docs(ad): add comprehensive validation report
patrykkopycinski Mar 21, 2026
5a47ecd
docs(ad): add real LLM validation guide and status
patrykkopycinski Mar 21, 2026
a5a0bec
docs(ad): add customer beta readiness materials
patrykkopycinski Mar 22, 2026
9b6b896
docs(ad): track real LLM validation progress
patrykkopycinski Mar 22, 2026
0bd8535
feat(ad): complete real LLM validation and performance benchmarks
patrykkopycinski Mar 22, 2026
990d04b
fix(ad): fix import path and type errors in incremental attack discovery
patrykkopycinski Mar 23, 2026
a23a121
Merge remote-tracking branch 'upstream/main' into feature/incremental…
patrykkopycinski Mar 23, 2026
82bf9d5
feat(ad): fix integration bug + add real LLM eval for incremental AD
patrykkopycinski Mar 23, 2026
9358e12
fix(ad): make usage optional in incremental runner to match generateI…
patrykkopycinski Mar 23, 2026
8da04b6
feat(ad): add JSON fallback parsing, fix alert index, enable UI incre…
patrykkopycinski Mar 24, 2026
4d9ece6
feat(ad): add delta mode eval + quality evaluators for incremental AD
patrykkopycinski Mar 24, 2026
0fccca8
feat(ad): add quality comparison eval (single-pass vs incremental)
patrykkopycinski Mar 24, 2026
a4274f8
fix(ad): address all eval findings — auto-tune, better merging, confi…
patrykkopycinski Mar 24, 2026
17548e0
fix(ad): update context budget from 8K to 32K, auto-tune alertsPerRound
patrykkopycinski Mar 24, 2026
51f3ccf
fix(ad): robust JSON parsing for OSS models + smarter merge logic
patrykkopycinski Mar 24, 2026
3dfd7b2
feat(ad): add quality experiments — granular prompt + smaller batch A…
patrykkopycinski Mar 24, 2026
4780950
feat(ad): add 4 quality improvements with isolated A/B eval specs
patrykkopycinski Mar 24, 2026
9a3af15
feat(ad): apply eval-validated quality improvements to production code
patrykkopycinski Mar 24, 2026
602b7b4
feat(ad): data stream migration, model-aware UI, delta TODO, eval CI …
patrykkopycinski Mar 24, 2026
176 changes: 176 additions & 0 deletions .agents/skills/llm-benchmarker/SKILL.md
@@ -0,0 +1,176 @@
---
name: LLM Benchmarker
description: Deploy vLLM models and run benchmarks using elastic-llm-benchmarker. Use when you need to deploy a model for @kbn/evals, evaluate model tool-calling, or benchmark throughput/latency.
---

# LLM Benchmarker

Use this skill when:

- You need to **deploy a vLLM model** on the GPU VM for testing or `@kbn/evals` runs
- You need to **benchmark** a model's throughput, latency, or tool-calling capabilities
- You need to **enqueue a model** for automated evaluation
- You need to **check benchmark results** or model status

## Prerequisites

The benchmarker lives at `elastic-llm-benchmarker/` (symlinked from the main repo). If the symlink is missing:

```bash
BENCHMARKER="$HOME/Projects/automaker/elastic-llm-benchmarker"
[ -e "elastic-llm-benchmarker" ] || ln -s "$BENCHMARKER" elastic-llm-benchmarker
```

The benchmarker requires a configured `.env` file. Check if it exists before running commands:

```bash
ls elastic-llm-benchmarker/.env
```

## Workflows

### 1. Deploy a Model for `@kbn/evals`

When you need a running model endpoint for Kibana evaluation tests:

```bash
cd elastic-llm-benchmarker

# Print the vLLM docker run command (dry run)
npx tsx src/cli.ts print-deploy-command \
--model <model-id> \
--port 8000

# Deploy model AND run tool-call benchmark in one step
npx tsx src/cli.ts deploy-and-test-tool-calls \
--model <model-id> \
--port 8000 \
--no-stop # keep the container running after tests
```

The `--no-stop` flag is critical for `@kbn/evals` — it keeps the vLLM container running after the tool-call tests complete so the model remains available.

After deployment, the model is accessible at `http://<SSH_HOST>:8000/v1` (OpenAI-compatible API).
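Once the container is up, a quick smoke test confirms the endpoint responds. This is a sketch, not part of the benchmarker CLI: it assumes `SSH_HOST` is exported from your `.env` (falling back to localhost) and that a model is already deployed; replace `<model-id>` with the deployed model ID.

```shell
# Sketch: smoke-test the OpenAI-compatible vLLM endpoint.
# Assumes SSH_HOST is set from .env; falls back to localhost.
VLLM_URL="http://${SSH_HOST:-localhost}:8000/v1"

# List served models; the deployed model ID should appear in the response.
curl -s "$VLLM_URL/models" || echo "endpoint not reachable yet"

# Minimal chat completion to confirm the API answers.
curl -s "$VLLM_URL/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model":"<model-id>","messages":[{"role":"user","content":"ping"}],"max_tokens":8}' \
  || echo "endpoint not reachable yet"
```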

To expose the model publicly (for Kibana connectors that can't reach the VM directly), enable the tunnel in `.env`:

```
TUNNEL_ENABLED=true
NGROK_AUTH_TOKEN=<token>
```

### 2. Run Tool-Call Benchmark Against a Running Model

When a model is already deployed and you want to validate tool-calling:

```bash
cd elastic-llm-benchmarker
npx tsx src/cli.ts tool-call-benchmark \
--base-url http://<host>:8000 \
--model <model-id>
```

This runs sequential and parallel tool-call tests and reports success rates.

### 3. Enqueue a Model for Automated Evaluation

Start the Queue API, then submit a model:

```bash
cd elastic-llm-benchmarker

# Start Queue API (port 3100)
npx tsx src/api/queue-server.ts &

# Enqueue a model
curl -X POST http://localhost:3100/api/queue \
-H "Content-Type: application/json" \
-d '{"modelId": "<model-id>", "priority": 100}'
```

The LangGraph agent (`npm run dev`) picks up queued models and runs the full pipeline: deploy → benchmark → store results → create Kibana connector → run eval.
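To watch progress after enqueueing, you can read the queue back. Note this is a hedged sketch: a `GET` on `/api/queue` and its response shape are assumptions here, not verified Queue API surface.

```shell
# Sketch: read back queue state after enqueueing a model.
# GET /api/queue is an assumed endpoint; adjust to the actual Queue API.
QUEUE_URL="http://localhost:3100/api/queue"

read_queue() {
  curl -s "$QUEUE_URL" || echo '{"error":"queue API not reachable"}'
}

read_queue
```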

### 4. Check Status and Results

```bash
cd elastic-llm-benchmarker

# Check ES status, result counts, queue state
npx tsx src/cli.ts status

# Query results for a specific model
npx tsx src/cli.ts results --model <model-id> --summary

# Export results
npx tsx src/cli.ts export --format csv --output results.csv
```

### 5. Start Local Elasticsearch + Kibana (for Dashboards)

```bash
cd elastic-llm-benchmarker

# Start ES + Kibana
npm run infra:up

# Create Kibana dashboards and visualizations
python3 scripts/create-kibana-objects.py

# Access dashboards at http://localhost:5601 (elastic/changeme)
```

## Model ID Format

Model IDs follow HuggingFace naming: `org/model-name`, e.g.:
- `meta-llama/Llama-3.3-70B-Instruct`
- `Qwen/Qwen3-4B`
- `mistralai/Devstral-Small-2505`
- `NousResearch/Hermes-3-Llama-3.1-8B`

## Connecting to `@kbn/evals`

After deploying a model with the benchmarker, create a Kibana connector that points to the vLLM endpoint:

1. Deploy the model: `npx tsx src/cli.ts deploy-and-test-tool-calls --model <id> --port 8000 --no-stop`
2. The vLLM API is at `http://<SSH_HOST>:8000/v1`
3. In Kibana, create an OpenAI connector pointing to that URL, or configure it in `kibana.dev.yml`:

```yaml
xpack.actions.preconfigured:
vllm-local:
name: 'vLLM Local'
actionTypeId: .gen-ai
config:
apiProvider: 'Other'
apiUrl: 'http://<SSH_HOST>:8000/v1/chat/completions'
defaultModel: '<model-id>'
secrets:
apiKey: 'not-needed'
```

4. Run `@kbn/evals` with the connector:

```bash
KIBANA_TESTING_AI_CONNECTORS='[{"id":"vllm-local"}]' \
npx playwright test --project vllm-local
```

## Environment Variables

Key settings in `elastic-llm-benchmarker/.env`:

| Variable | Description |
|----------|-------------|
| `SSH_HOST` | GPU VM IP address |
| `SSH_USERNAME` / `SSH_PASSWORD` | VM credentials |
| `HUGGINGFACE_TOKEN` | HuggingFace API token for model discovery |
| `ENGINE_TYPE` | `vllm` (default) or `ollama` |
| `TUNNEL_ENABLED` | `true` to expose via ngrok |
| `ES_URL` / `ES_API_KEY` | Elasticsearch connection for results storage |

## Guardrails

- Always use `--no-stop` when deploying for `@kbn/evals` — the model must stay running
- Check `npx tsx src/cli.ts status` before deploying to see if a model is already running
- Large models (70B+) require multi-GPU VMs; check `HARDWARE_PROFILE_ID` in `.env`
- The VM has limited disk; old containers should be cleaned up after use
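For the last point, a cleanup sketch. It is dry-run by default; the docker commands are assumptions about how containers are managed on the VM, so review what would be removed before confirming.

```shell
# Sketch: reclaim disk on the GPU VM after benchmark runs.
# Dry-run by default; set CONFIRM=1 to actually prune (assumes docker on the VM).
if [ "${CONFIRM:-0}" = "1" ]; then
  docker container prune -f   # remove stopped containers
  docker image prune -f       # remove dangling images
else
  MSG="dry run: would prune stopped containers and dangling images"
  echo "$MSG"
fi
```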
156 changes: 156 additions & 0 deletions .agents/skills/openspec-apply-change/SKILL.md
@@ -0,0 +1,156 @@
---
name: openspec-apply-change
description: Implement tasks from an OpenSpec change. Use when the user wants to start implementing, continue implementation, or work through tasks.
license: MIT
compatibility: Requires openspec CLI.
metadata:
author: openspec
version: "1.0"
generatedBy: "1.1.1"
---

Implement tasks from an OpenSpec change.

**Input**: Optionally specify a change name. If omitted, check whether it can be inferred from conversation context. If the request is vague or ambiguous, you MUST prompt with the available changes.

**Steps**

1. **Select the change**

If a name is provided, use it. Otherwise:
- Infer from conversation context if the user mentioned a change
- Auto-select if only one active change exists
- If ambiguous, run `openspec list --json` to get available changes and use the **AskUserQuestion tool** to let the user select

Always announce: "Using change: <name>" and how to override (e.g., `/opsx:apply <other>`).

2. **Check status to understand the schema**
```bash
openspec status --change "<name>" --json
```
Parse the JSON to understand:
- `schemaName`: The workflow being used (e.g., "spec-driven")
- Which artifact contains the tasks (typically "tasks" for spec-driven, check status for others)
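As a sketch of that parsing step (the sample JSON below is illustrative, not real CLI output; real output from `openspec status` may be nested, in which case use a proper JSON tool):

```shell
# Sketch: pull schemaName out of `openspec status --json` output.
# Illustrative flat JSON; sed-based extraction only works for this simple shape.
STATUS_JSON='{"schemaName":"spec-driven","tasksArtifact":"tasks"}'

SCHEMA=$(printf '%s' "$STATUS_JSON" | sed -n 's/.*"schemaName":"\([^"]*\)".*/\1/p')
echo "$SCHEMA"
```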

3. **Get apply instructions**

```bash
openspec instructions apply --change "<name>" --json
```

This returns:
- Context file paths (these vary by schema: proposal/specs/design/tasks for spec-driven, or spec/tests/implementation/docs for others)
- Progress (total, complete, remaining)
- Task list with status
- Dynamic instruction based on current state

**Handle states:**
- If `state: "blocked"` (missing artifacts): show message, suggest using openspec-continue-change
- If `state: "all_done"`: congratulate, suggest archive
- Otherwise: proceed to implementation
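The state handling above can be sketched as a simple dispatch. The sample JSON is illustrative, not real CLI output:

```shell
# Sketch: dispatch on the `state` field from the apply instructions JSON.
STATE_JSON='{"state":"blocked"}'
STATE=$(printf '%s' "$STATE_JSON" | sed -n 's/.*"state":"\([^"]*\)".*/\1/p')

case "$STATE" in
  blocked)  echo "missing artifacts: suggest openspec-continue-change" ;;
  all_done) echo "all tasks complete: suggest archive" ;;
  *)        echo "proceed to implementation" ;;
esac
```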

4. **Read context files**

Read the files listed in `contextFiles` from the apply instructions output.
The files depend on the schema being used:
- **spec-driven**: proposal, specs, design, tasks
- Other schemas: follow the contextFiles from CLI output

5. **Show current progress**

Display:
- Schema being used
- Progress: "N/M tasks complete"
- Remaining tasks overview
- Dynamic instruction from CLI

6. **Implement tasks (loop until done or blocked)**

For each pending task:
- Show which task is being worked on
- Make the code changes required
- Keep changes minimal and focused
- Mark task complete in the tasks file: `- [ ]` → `- [x]`
- Continue to next task

**Pause if:**
- Task is unclear → ask for clarification
- Implementation reveals a design issue → suggest updating artifacts
- Error or blocker encountered → report and wait for guidance
- User interrupts
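Marking a task complete is a one-line edit to the tasks file. As a sketch (GNU sed; the file path and task names are illustrative, and this marks the first pending task only):

```shell
# Illustrative tasks file.
printf -- '- [x] Task 1\n- [ ] Task 2\n- [ ] Task 3\n' > /tmp/tasks.md

# Toggle only the first pending checkbox: `- [ ]` -> `- [x]` (GNU sed 0,/re/ range).
sed -i '0,/- \[ \]/s//- [x]/' /tmp/tasks.md
cat /tmp/tasks.md
```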

7. **On completion or pause, show status**

Display:
- Tasks completed this session
- Overall progress: "N/M tasks complete"
- If all done: suggest archive
- If paused: explain why and wait for guidance

**Output During Implementation**

```
## Implementing: <change-name> (schema: <schema-name>)

Working on task 3/7: <task description>
[...implementation happening...]
✓ Task complete

Working on task 4/7: <task description>
[...implementation happening...]
✓ Task complete
```

**Output On Completion**

```
## Implementation Complete

**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 7/7 tasks complete ✓

### Completed This Session
- [x] Task 1
- [x] Task 2
...

All tasks complete! Ready to archive this change.
```

**Output On Pause (Issue Encountered)**

```
## Implementation Paused

**Change:** <change-name>
**Schema:** <schema-name>
**Progress:** 4/7 tasks complete

### Issue Encountered
<description of the issue>

**Options:**
1. <option 1>
2. <option 2>
3. Other approach

What would you like to do?
```

**Guardrails**
- Keep going through tasks until done or blocked
- Always read context files before starting (from the apply instructions output)
- If task is ambiguous, pause and ask before implementing
- If implementation reveals issues, pause and suggest artifact updates
- Keep code changes minimal and scoped to each task
- Update task checkbox immediately after completing each task
- Pause on errors, blockers, or unclear requirements - don't guess
- Use contextFiles from CLI output, don't assume specific file names

**Fluid Workflow Integration**

This skill supports the "actions on a change" model:

- **Can be invoked anytime**: Before all artifacts are done (if tasks exist), after partial implementation, interleaved with other actions
- **Allows artifact updates**: If implementation reveals design issues, suggest updating artifacts - not phase-locked, work fluidly