[Security GenAI] Autonomous vs hand-written PCI compliance skill — side-by-side eval harness#11

Draft
patrykkopycinski wants to merge 13 commits into main from pk/autonomous-vs-handwritten-pci

Conversation

@patrykkopycinski

TL;DR

Adds a second PCI compliance skill (pci-compliance-autonomous) that ships alongside the hand-written pci-compliance skill (elastic#256060), and parameterizes the existing kbn-evals-suite-pci-compliance so the same 7-scenario eval suite can be run against either skill via the EVAL_PCI_VARIANT env var.

The autonomous skill was generated end-to-end by skill.architect against the current Kibana tool catalog, with PCI domain knowledge synthesized from autonomous web research + model knowledge. It deliberately reuses the same underlying tools as the hand-written skill, so "skill content" (instructions + domain knowledge + trigger phrases) is the only experimental variable — same tools, same dataset, same evaluators, same judge.

Comparison artifact

Side-by-side comparison report: comparison.html in this branch (rendered via htmlpreview)

Currently shows the structural comparison (skill metadata, content metrics, distinguishing autonomous contributions). The "Live evaluation results" section is wired and waits for output from compare_variants.sh once the AI-connector eval cluster runs the suite. The HTML re-renders deterministically from runs/{handwritten,autonomous}/results.json.

What ships

Server (security_solution plugin)

  • New skill definition pci_compliance_autonomous/ registering pci-compliance-autonomous against the existing PCI tool IDs.
  • Feature flag pciComplianceAutonomousAgentBuilder (default off).
  • Skill registration gated by the flag.
  • Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)

  • evaluate_dataset.ts reads EVAL_PCI_VARIANT (handwritten | autonomous) to select which skill createSkillInvocationEvaluator targets. Default remains handwritten so existing CI is unchanged.
  • scripts/compare_variants.sh runs both variants back-to-back and emits the side-by-side report.
  • scripts/build_comparison_html.mjs generates the report; all embedded paths are repo-relative so the artifact is portable.
  • README documents the variant matrix and the comparison workflow.
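The variant switch described above can be sketched as follows. This is an illustrative, hypothetical shape (names and error handling are not the exact Kibana source): the env var selects a skill ID, with `handwritten` as the fall-through default so existing CI is untouched.

```typescript
// Hypothetical sketch of the EVAL_PCI_VARIANT switch in evaluate_dataset.ts.
type PciVariant = 'handwritten' | 'autonomous';

const SKILL_ID_BY_VARIANT: Record<PciVariant, string> = {
  handwritten: 'pci-compliance',
  autonomous: 'pci-compliance-autonomous',
};

function resolveVariant(env: Record<string, string | undefined>): PciVariant {
  // Default stays `handwritten` so existing CI is unchanged.
  const raw = env.EVAL_PCI_VARIANT ?? 'handwritten';
  if (raw !== 'handwritten' && raw !== 'autonomous') {
    throw new Error(`EVAL_PCI_VARIANT must be "handwritten" or "autonomous", got "${raw}"`);
  }
  return raw;
}

export function skillIdForEnv(env: Record<string, string | undefined>): string {
  return SKILL_ID_BY_VARIANT[resolveVariant(env)];
}
```

Failing fast on an unknown value keeps a typo in a Buildkite step from silently evaluating the wrong variant.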

CI plumbing

  • New Scout config set evals_pci_compliance_autonomous flips ONLY the autonomous flag.
  • evals.suites.json registers the autonomous suite.
  • llm_evals.yml adds a Buildkite step for the autonomous variant; existing PCI step tagged EVAL_PCI_VARIANT=handwritten for symmetry.

How to reproduce locally

cd kibana
./x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/compare_variants.sh
open  ./x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

Or via Buildkite — both kbn-evals-weekly-pci-compliance and kbn-evals-weekly-pci-compliance-autonomous steps now exist in llm_evals.yml and can be triggered independently.

Why two skills, same tools?

The hand-written skill already has its real tool implementations in tree. Forking those tools for the autonomous variant would conflate two questions: "is the autonomous skill content better?" and "are different tool surfaces better?". Reusing the tools isolates skill content as the only variable — exactly what the comparison is meant to measure.

Verification done before push

  • ReadLints clean across all authored files.
  • ESLint clean on every staged file except evaluate_dataset.ts:17 — pre-existing @kbn/imports/no_boundary_crossing from Add PCI compliance skill and tools for Agent Builder elastic/kibana#256060 that reproduces identically on every sibling kbn-evals-suite-* package on main (verified against kbn-evals-suite-security-ai-rules). Endemic to the eval framework, out of scope for a skill comparison.
  • Secrets scan: clean.
  • Personal/absolute paths in diff or generated HTML: clean (HTML uses repoRelative() helper).
  • 15 files, +1373 / -1 — focused and reviewable.
  • Scoped tsc -b on security_solution/tsconfig.json OOMs at 8 GB locally (a known Kibana issue with this plugin's size). Per Kibana's defer-type-check rule, tier-1 ReadLints is the local signal; CI will run the authoritative type check.

Open as draft

Per local convention; not for review until the live evaluator output is attached to comparison.html.


Branch: https://github.com/patrykkopycinski/kibana/tree/pk/autonomous-vs-handwritten-pci
Comparison artifact: https://github.com/patrykkopycinski/kibana/blob/pk/autonomous-vs-handwritten-pci/x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

…y-side eval harness

Adds a second PCI compliance skill (`pci-compliance-autonomous`) that ships
ALONGSIDE the existing hand-written `pci-compliance` skill, so the same eval
suite can be run against both variants and compared head-to-head. The
autonomous variant deliberately reuses the SAME underlying tools as the
hand-written variant, isolating "skill content" (instructions + domain
knowledge + trigger phrases) as the only experimental variable.

## What ships

Server (security_solution plugin)
- New skill definition `pci_compliance_autonomous/` registering
  `pci-compliance-autonomous` against the existing PCI tool IDs.
- New feature flag `pciComplianceAutonomousAgentBuilder` (default off).
- Skill registration gated by the flag in `register_skills.ts`.
- Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)
- `evaluate_dataset.ts` reads `EVAL_PCI_VARIANT` (`handwritten` | `autonomous`)
  to select which skill `createSkillInvocationEvaluator` targets. Default
  remains `handwritten` so existing CI is unchanged.
- `scripts/compare_variants.sh` runs both variants back-to-back and emits a
  side-by-side `comparison.html` with structural metrics + slots for live
  evaluator output (per-scenario scores, judge rationales, latency).
- `scripts/build_comparison_html.mjs` generates the report; all embedded paths
  are repo-relative so the artifact is portable.
- README documents the variant matrix and the comparison workflow.

CI plumbing
- New Scout config set `evals_pci_compliance_autonomous` that flips ONLY the
  autonomous flag, so the autonomous run sees only the autonomous skill.
- `evals.suites.json` registers `pci-compliance-autonomous`.
- `llm_evals.yml` adds a Buildkite step for the autonomous variant and tags
  the existing PCI step with `EVAL_PCI_VARIANT=handwritten` for symmetry.

## Why

The hand-written PCI skill (`pci-compliance`, elastic#256060) is the production
baseline. The autonomous skill was generated end-to-end by `skill.architect`
against the current Kibana tool catalog, with PCI domain knowledge synthesized
from autonomous web research + model knowledge (SAQ taxonomy, v3->v4 deltas,
scope-reduction levers, technical-vs-process classification). Running the
existing 7-scenario PCI eval suite against both — same tools, same dataset,
same evaluators, same judge — gives a clean A/B that answers "is the
autonomously generated skill at least as good as the hand-written one?".

## Out of scope (not introduced by this commit)

`evaluate_dataset.ts:17` triggers `@kbn/imports/no_boundary_crossing` because
`@kbn/evals` is declared `type: "test-helper"` and the suite imports value
exports from it. This lint reproduces identically on every sibling
`kbn-evals-suite-*` package on `main` (verified against
`kbn-evals-suite-security-ai-rules`), so it is endemic to the eval framework
and would require a cross-cutting change to `@kbn/evals` ownership /
visibility — out of scope for this skill comparison.
…on fix

- Ran @kbn/evals-suite-pci-compliance back-to-back against both PCI skill
  variants on a local Scout cluster wired to llama3.1:8b via a LiteLLM
  proxy (translates OpenAI-format requests to Ollama, including structured
  tool_calls). Captured 14 docs per variant from the kibana-evaluations
  data stream.

- Updated build_comparison_html.mjs to consume the framework's actual
  export shape (Elasticsearch _search response), folding the per-evaluator
  rows back into per-scenario rows. Added a routing-aggregate diagnostic
  (scenarios with >=1 PCI-skill tool call, total tool calls vs PCI-skill
  tool calls) so the HTML can show *why* a score landed where it did, not
  just the score itself.

- Re-rendered comparison.html with the live data. Both variants scored
  0.00 across all completed scenarios because llama3.1:8b is too small
  to engage either PCI skill -- the agent router fell back to the
  generic platform.core.search tool on every scenario, never invoking
  security.pci_*. The HTML now carries an honest banner explaining this:
  the comparison is apples-to-apples (identical model + dataset + infra),
  it just lives on the floor at this model scale. The structural and
  domain-coverage deltas in sections 2-3 remain the meaningful signal
  until the same script is re-run with a stronger model.

- Fixed an isolation bug in the autonomous Scout config set: the
  pciComplianceAgentBuilder feature flag defaults to true in
  experimental_features.ts, so the autonomous run was loading BOTH
  skills. Added 'disable:pciComplianceAgentBuilder' to the scout config
  serverArgs to keep the comparison clean for future runs.

Refs: #11
@patrykkopycinski

Live local eval run captured (commit fc5194e97df3)

Ran the suite back-to-back against both PCI skill variants on a local Scout
cluster wired to llama3.1:8b via a LiteLLM proxy (translates OpenAI-format
requests to Ollama, including structured tool_calls). 14 docs per variant
landed in the kibana-evaluations data stream and the comparison HTML now
renders with live data.

Headline numbers

| Signal | Hand-written | Autonomous |
|--------|-------------:|-----------:|
| Scenarios completed (of 8) | 7 | 7 |
| PCI Criteria score (mean) | 0.000 | 0.000 |
| Total tool calls observed | 12 | 12 |
| security.pci_* skill tool calls | 0 | 0 |
| Wall-clock | 17.5 min | 15.5 min |

Honest read

Both variants scored 0 across all completed scenarios, not because of the
skill content but because llama3.1:8b is too small to engage either PCI
skill at all. The agent router fell back to the generic
platform.core.search tool on every scenario and never invoked
security.pci_*. The comparison is apples-to-apples (identical model +
dataset + connector + infra); it just lives on the floor at this model
scale.

The structural and domain-coverage deltas in §2-§3 of the comparison HTML
(do-not-use boundaries, SAQ taxonomy, scope-reduction levers, etc.) remain
the meaningful signal until the same harness is re-run with a stronger
model (GPT-4-class, Claude 3.5+, Bedrock Claude 3.7) — at which point the
same script re-renders §4 with discriminating numbers.

Two side-fixes shipped in the same commit

  1. Isolation bug. pciComplianceAgentBuilder defaults to true in
    experimental_features.ts, so the autonomous Scout config was loading
    both PCI skills. Added 'disable:pciComplianceAgentBuilder' to the
    autonomous config's enableExperimental array so the comparison stays
    clean for future runs.
  2. build_comparison_html.mjs now consumes the framework's actual export
    shape (Elasticsearch _search response from kibana-evaluations),
    folds per-evaluator rows back into per-scenario rows, and adds a
    routing-aggregate diagnostic so the HTML can show why a score landed
    where it did, not just the score. This is what surfaces the "no
    PCI-skill tool calls" finding above.

Comparison HTML:
x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

…arison on real connectors

The autonomous-vs-handwritten PCI comparison previously ran on llama3.1:8b
through a local Ollama proxy. At that model scale the agent router never
engaged either PCI skill, so every scenario scored 0.00 and the comparison
landed on the floor (see commit fc5194e). This commit promotes the
comparison to real Bedrock connectors and ships the connector-side fix that
the upgrade required.

Bedrock connector — Claude Opus 4.7 enablement
----------------------------------------------
Claude Opus 4.7 on Bedrock rejects the `temperature` inference parameter
with `temperature is deprecated for this model`. Without omitting it the
connector simply 400s on every request. Fix is in three layers:

  - `@kbn/inference-common`: new `supportsTemperature?: boolean` on
    `ModelDefinition`; `claude-opus-4-7` marked `supportsTemperature: false`.
    Future Claude variants (or other provider models) with the same
    restriction need only flip the flag — one source of truth.

  - `inference` plugin: `getTemperatureIfValid` omits temperature when the
    model definition declares `supportsTemperature: false`. Sits alongside
    the existing OpenAI o-series exclusions and works for any provider.

  - `stack_connectors` (Bedrock): new local
    `bedrockModelSupportsTemperature(model)` helper; `formatBedrockBody`
    threads `model` through and gates the parameter. `invokeAI`,
    `invokeStream`, `invokeAIRaw`, `_converse`, and `_converseStream` all
    consult it. Defense in depth — direct sub-action callers
    (Security AI Assistant, etc.) are protected without taking a
    cross-plugin dependency on `@kbn/inference-common`.
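The three layers above can be collapsed into one illustrative sketch. Field and helper names approximate the description (`supportsTemperature`, `getTemperatureIfValid`); this is not the exact Kibana source, just the shape of the gate:

```typescript
// Illustrative shape of the temperature gate, simplified into one file.
interface ModelDefinition {
  id: string;
  // Undefined means "temperature supported" -- the common case.
  supportsTemperature?: boolean;
}

const MODELS: Record<string, ModelDefinition> = {
  'claude-opus-4-7': { id: 'claude-opus-4-7', supportsTemperature: false },
  'claude-sonnet-4-6': { id: 'claude-sonnet-4-6' },
};

function getTemperatureIfValid(
  temperature: number | undefined,
  model: ModelDefinition | undefined
): { temperature?: number } {
  // Omit the parameter entirely when the model rejects it; Bedrock returns
  // a 400 for Opus 4.7 if `temperature` is present at all.
  if (temperature === undefined || model?.supportsTemperature === false) {
    return {};
  }
  return { temperature };
}
```

The key design point is returning an object spread rather than `temperature: undefined`, so the serialized request body never carries the key for models that reject it.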

Smoke-tested with `invokeAI` + `converse` sub-actions:
  - Claude 4.7 Opus (`us.anthropic.claude-opus-4-7`): now passes — temperature
    omitted, response returned.
  - Claude 4.6 Sonnet (`us.anthropic.claude-sonnet-4-6`): still passes —
    temperature included as before.

Live eval comparison (PCI Criteria, LLM-judge 0..1)
---------------------------------------------------
Both PCI skill variants ran the same 8-scenario `@kbn/evals-suite-pci-compliance`
suite end-to-end against a real Scout cluster, on two production Bedrock
connectors:

  | Variant     | Claude 4.7 Opus | Claude 4.6 Sonnet |
  |-------------|----------------:|------------------:|
  | Handwritten |           0.977 |             0.989 |
  | Autonomous  |           0.834 |             0.860 |

The handwritten skill (Smriti, PR elastic#256060) outperforms the autonomous variant
on both models by 14-15 points. The autonomous architect's broader domain
framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers) did not
translate into a better PCI-Criteria score. The handwritten contract is
shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's
scoring rubric — that tight coupling is the deciding factor.

build_comparison_html.mjs gains a `--runs <label>=<dir>,...` mode so the
4-cell grid renders from the four results.json snapshots. Legacy
`--handwritten`/`--autonomous` mode still works for single-model runs.
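The `--runs` argument parsing can be sketched roughly as below (illustrative; the real script is plain JS in `build_comparison_html.mjs` and its internals may differ):

```typescript
// Sketch of the `--runs <label>=<dir>,...` parsing.
function parseRuns(arg: string): Map<string, string> {
  const runs = new Map<string, string>();
  for (const pair of arg.split(',')) {
    const eq = pair.indexOf('=');
    if (eq <= 0 || eq === pair.length - 1) {
      throw new Error(`--runs entries must look like <label>=<dir>, got "${pair}"`);
    }
    runs.set(pair.slice(0, eq), pair.slice(eq + 1));
  }
  return runs;
}
```

A call such as `parseRuns('opus47-hw=runs/a,opus47-auto=runs/b')` would yield the label-to-directory map from which the 4-cell grid renders, one `results.json` snapshot per cell.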

kbn-scout
---------
`run_kibana_server.ts` now respects `SCOUT_READ_DEV_CONFIG=true` and drops
`--no-dev-config` when set, so a developer can load `config/kibana.dev.yml`
(and the preconfigured AI connectors it defines) into the Scout-managed
Kibana process. Default behaviour is unchanged. Without this, evals against
real cloud connectors require fragile API-driven connector creation per
boot.

Refs: #11
@patrykkopycinski

Live eval comparison on real Bedrock connectors

Live PCI Criteria scores (LLM-judge, 0..1) across both PCI skill variants on two production Bedrock connectors:

| Variant | Claude 4.7 Opus | Claude 4.6 Sonnet |
|---------|----------------:|------------------:|
| Handwritten | 0.977 | 0.989 |
| Autonomous | 0.834 | 0.860 |

The handwritten skill (Smriti, elastic#256060) outperforms the autonomous variant on both models by 14-15 points. The autonomous architect's broader domain framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers — §3 of the report) does not translate into a better PCI-Criteria score. The handwritten contract is shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's scoring rubric.

Full side-by-side: x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html (re-rendered from the four results.json snapshots; the new --runs mode keeps the script reproducible from any subset).

Bedrock connector — Claude Opus 4.7 enablement

Running on Opus 4.7 required a connector-side fix: the model rejects the `temperature` inference parameter with `temperature is deprecated for this model`. The patch flows through three layers (see commit 8ee59cf):

  • @kbn/inference-common — new supportsTemperature?: boolean on ModelDefinition; claude-opus-4-7 marked supportsTemperature: false. One source of truth for any provider.
  • inference plugin — getTemperatureIfValid omits temperature when the model definition declares it unsupported.
  • stack_connectors (Bedrock) — local bedrockModelSupportsTemperature(model) helper; formatBedrockBody, invokeAI, invokeStream, invokeAIRaw, _converse, and _converseStream all gate the parameter. Defense in depth for direct sub-action callers (Security AI Assistant, etc.) that bypass the inference plugin layer.

Smoke-tested via invokeAI + converse sub-actions on both connectors — Opus 4.7 now succeeds; Sonnet 4.6 unaffected.

Root-cause analysis of why the autonomously-architected PCI compliance
skill scored 12-15 pts below the hand-written variant uncovered two
distinct bugs that compounded:

1. **Tool registration bug** in `register_tools.ts` — PCI tools were
   gated *only* on `experimentalFeatures.pciComplianceAgentBuilder`,
   which the autonomous scout config explicitly disables to isolate the
   variant comparison. Result: the autonomous variant ran with NO PCI
   tools registered. Trace analysis confirmed 0 calls to
   `security.pci_compliance` across 16 scenarios vs 17-23 for HW. The
   agent fell back to raw `platform.core.execute_esql` and improvised
   the entire workflow. Fixed: gate now triggers on either flag.

2. **Skill-content design** — the autonomous prompt's 6-step workflow
   inserted "Reduce scope (tokenisation/P2PE/segmentation)" and
   "Classify requirements as technical vs process-based" steps BEFORE
   the tool calls, plus an 8 KB "Domain Knowledge Notes" block between
   the workflow and the status vocab. The structure read as
   "do-your-homework first" rather than "call the tools". Restructured:
   tools-first 4-step workflow with explicit "Always call the dedicated
   PCI tools; do not improvise raw ES|QL" injunction, theory moved to a
   "Background reference (do not consult before calling tools)" tail
   section. Removed broken handoff references to non-existent sibling
   skills and stripped tool-description provenance commentary.

Validation on Claude 4.6 Sonnet:
- pre-fix Auto: 0.860 mean (gap to HW: 12.9 pts)
- post-fix Auto v3: 0.955 mean (gap to HW: 3.4 pts)
- 6/8 scenarios now perfect 1.000; 1 scenario (full report) regressed
  -9 pts on a substance-vs-style criterion (agent calls the tool
  correctly but the report formatting elides specific evidence).

Feedback-loop infrastructure:
- `scripts/run-eval.sh` extended with optional scenario-grep argument
  (`run-eval.sh autonomous <connector> <label> "requirement 2.2.4"`)
  collapsing a full-suite cycle (~28 min) to a single-scenario probe
  (~5.6 min including scout boot, ~3 min if scout is reused).
- Two iterations of this loop fixed both bugs end-to-end.

POSTMORTEM.md captures the full analysis, including six ranked content
fixes and a three-tier feedback-loop efficiency proposal.
@patrykkopycinski

Deep-dive: why the autonomous skill scored lower (and how it closed the gap)

After the initial run shipped (Auto 0.860 vs HW 0.989 on Sonnet 4.6), I dropped into the per-scenario rationale + tool-call traces in runs/*/results.json and the per-cell judge explanation fields. Full writeup in POSTMORTEM.md. Highlights:

Root causes (two distinct bugs, compounding)

Bug 1 — tool registration. register_tools.ts gated the PCI tools only on experimentalFeatures.pciComplianceAgentBuilder. The autonomous Scout config (evals_pci_compliance_autonomous) explicitly disables that flag to isolate the variant comparison. Result: the autonomous variant ran with zero PCI tools registered. Trace tally across 16 scenarios on both Opus 4.7 and Sonnet 4.6:

| Cell | Total steps | PCI tool calls | Raw ESQL calls |
|------|------------:|---------------:|---------------:|
| HW · Opus 4.7 | 62 | 17 | 0 |
| Auto · Opus 4.7 | 161 | 0 | 36 |
| HW · Sonnet 4.6 | 77 | 23 | 0 |
| Auto · Sonnet 4.6 | 214 | 0 | 30 |

The judge confirmed this in its own words: "The pci_compliance tool was not called; it was unavailable and fell back to ES|QL". The autonomous skill experiment never had a chance — its tools weren't there.

Bug 2 — skill-content design. Even with the registration fix in place, the autonomous prompt's 6-step workflow inserted Reduce scope (tokenisation / P2PE / segmentation) and Classify requirements as technical vs process-based steps before the tool calls, plus an 8 KB "Domain Knowledge Notes" block between the workflow and the status vocab. The structure reads "do your homework first" rather than "call the tools".

Fixes applied

  1. register_tools.ts — gate on either flag (pciComplianceAgentBuilder || pciComplianceAutonomousAgentBuilder).
  2. Autonomous skill content restructured: tools-first 4-step workflow with an explicit **Always call the dedicated PCI tools; do not improvise raw ES|QL** injunction; theory moved to a Background reference (do not consult before calling tools) tail section.
  3. Removed broken handoff references to non-existent sibling skills (threat-hunting, alert-analysis, detection-rule-edit).
  4. Stripped tool-description provenance commentary that the LLM was reading as design-rationale meta-text rather than operational instructions.
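Fix 1 above can be sketched as a one-line gate change (simplified, with flag names from this PR; note a later commit in this thread replaces the either-flag gate with strictly separate per-variant gates):

```typescript
// Sketch of the register_tools.ts gate after fix 1.
interface ExperimentalFeatures {
  pciComplianceAgentBuilder: boolean;
  pciComplianceAutonomousAgentBuilder: boolean;
}

function shouldRegisterPciTools(features: ExperimentalFeatures): boolean {
  // Either variant being enabled needs the shared PCI tools registered;
  // gating on only the hand-written flag left the autonomous run toolless.
  return (
    features.pciComplianceAgentBuilder || features.pciComplianceAutonomousAgentBuilder
  );
}
```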

Result

Same scout config, same connector, same evaluator, only the skill content + tool-registration changed:

| Run | Auto mean (Sonnet 4.6) | Gap to HW |
|-----|-----------------------:|----------:|
| Auto v1 (initial) | 0.860 | 12.9 pts |
| Auto v3 (after fixes) | 0.955 | 3.4 pts |

6/8 scenarios now perfect 1.000 — matching the hand-written variant exactly. One scenario (full report) regressed -9 pts on a substance-vs-style criterion: the agent correctly calls pci_compliance in report mode but the report formatting elides specific evidence (mentions "default accounts active" rather than "admin and root").

The single-scenario trace on requirement 2.2.4 default accounts, before vs after:

Before fix (17 steps, score 0.571):

reasoning → reasoning → filestore.read → reasoning → reasoning →
  platform.core.list_indices → reasoning → reasoning →
  platform.core.get_index_mapping → reasoning →
  platform.core.search → reasoning → reasoning →
  platform.core.execute_esql → reasoning → reasoning →
  platform.core.execute_esql

Missed admin and root violations.

After fix (7 steps, score 1.000):

reasoning → filestore.read → reasoning → reasoning →
  security.pci_scope_discovery → reasoning →
  security.pci_compliance

Both admin and root violations surfaced. Identical pattern to HW.

Feedback loop: 28 min → 5.6 min per iteration

scripts/run-eval.sh now accepts an optional scenario-grep argument:

run-eval.sh autonomous <connector> <label> "requirement 2.2.4"

That collapses a full-suite cycle (~28 min) to a single-scenario probe (~5.6 min including scout boot, ~3 min if scout is reused across iterations). Two iterations of this loop end-to-end fixed both bugs above. The postmortem also proposes Tier 2 (sub-15s tool-call probe via the Kibana chat API) and Tier 3 (skill-architect rewrite loop fed by judge rationales) if you want even faster cycles next.

The comparison.html on this PR's HEAD now renders Auto v3 alongside the previous columns so the delta is visible at a glance.

…y (0.989 vs 0.989)

The autonomous PCI compliance skill now ships its own independently-authored
4-tool decomposition under a separate allowlist entry. The autonomous skill
has no knowledge of -- and no path to -- the hand-written PCI tools. This
validates a fully end-to-end autonomous stack (skill + tools, both
autonomously created) and reaches parity with the human-authored variant.

What changed
------------
* New PCI tool bundle under `agent_builder/tools/pci_autonomous_tools/`:
  - `pci_autonomous_scope_discovery`
  - `pci_autonomous_compliance_check`   (split out from the consolidated tool)
  - `pci_autonomous_scorecard_report`   (split out from the consolidated tool)
  - `pci_autonomous_field_mapper`
  All four implement the cycle-17 architect blueprint's 4-tool decomposition
  (vs the hand-written variant's 3 tools, where check+report share one tool
  via a `mode` parameter). Each tool reuses the underlying domain logic so
  the comparison stays apples-to-apples on capability while validating the
  isolation property.

* `register_tools.ts`: hand-written PCI tools register ONLY under
  `experimentalFeatures.pciComplianceAgentBuilder`; autonomous PCI tools
  register ONLY under `experimentalFeatures.pciComplianceAutonomousAgentBuilder`.
  The previous lenient gate (`either flag`) is removed -- the two variants
  are now strictly isolated.

* `allow_lists.ts`: all four new autonomous tool IDs added to the
  `AGENT_BUILDER_BUILTIN_TOOLS` allowlist (without this, tool registration
  silently fails and the agent falls back to raw ES|QL).

* Autonomous skill content + `getRegistryTools` rewired to reference the
  new tool IDs only.

* Eval rubric (`pci_compliance.spec.ts`) is now variant-aware via
  `EVAL_PCI_VARIANT` -- judging criteria check for `pci_autonomous_*` tool
  names when the autonomous variant is on, and the original names otherwise.

* Skill contract tests harden the isolation property: explicit assertions
  that the autonomous skill never references any hand-written tool ID, and
  that `getRegistryTools` advertises ONLY the autonomous bundle.

* Comparison HTML updated with a new v5 column and a green success banner
  showing the autonomous skill+tools reaches parity with the hand-written
  baseline on Claude 4.6 Sonnet (0.989 vs 0.989, 8/8 scenarios).
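The variant-aware rubric can be sketched as a selector over expected tool IDs. This is a hypothetical shape for `pci_compliance.spec.ts`: the autonomous IDs come from the list above, while only two hand-written IDs (`security.pci_scope_discovery`, `security.pci_compliance`) appear in the traces earlier in this thread, so the hand-written list here is illustrative and incomplete:

```typescript
// Hypothetical sketch of variant-aware judging criteria.
function expectedToolIds(variant: 'handwritten' | 'autonomous'): string[] {
  if (variant === 'autonomous') {
    return [
      'pci_autonomous_scope_discovery',
      'pci_autonomous_compliance_check',
      'pci_autonomous_scorecard_report',
      'pci_autonomous_field_mapper',
    ];
  }
  // Hand-written bundle: the tool IDs observed in traces (not exhaustive).
  return ['security.pci_scope_discovery', 'security.pci_compliance'];
}
```

Keeping the two lists disjoint is what lets the contract tests assert the isolation property directly.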

Why
---
The user wanted to validate that the autonomous skill workflow generalises
to other domains -- which requires removing every shortcut where the
autonomous variant inherits the hand-written variant's tooling. The earlier
"shared tool" runs were measuring only skill-content quality; this run
measures the full stack the architect would generate from a blank slate.

Result
------
| Variant                                 | Mean (8 scenarios) |
|-----------------------------------------|-------------------|
| Hand-written, Claude 4.6 Sonnet         | 0.989             |
| Autonomous v5 (own 4 tools), Sonnet 4.6 | 0.989             |
| Autonomous v3 (shared tools), Sonnet    | 0.955             |
| Autonomous v1 (shared, content drift)   | 0.860             |

Parity on the headline metric. The autonomous stack (skill content +
4-tool decomposition + allowlist entry + register gate) ships as a
self-contained bundle the architect can replicate for any other domain.
Adds a second evaluation surface so the iteration loop on the
autonomous PCI skill can be trusted to produce a generalisable skill
rather than one that has memorised the iteration fixtures.

Why
---
The 0.989 we got from `sonnet46-autonomous-v5` (cycle that hit
parity with the hand-written variant) is scored against the SAME
fixtures we inspect while improving the skill. That tight loop is
how every dataset-driven optimisation produces overfit: the skill
content drifts from "teach the principle" to "match the fixture".

Two layers of defence
---------------------

1. **Anti-overfit lockdown** (in `pci_compliance_autonomous_skill.test.ts`).
   A new `describe('anti-overfit ...')` block asserts the skill content
   contains NONE of the iteration- or holdout-set fixture values
   (`jdoe`, `pcompton`, `192.168.1.100`, `10.20.30.40`, `12 failed`,
   the random `logs-<hex>-{auth|network|...}` index pattern, etc.).
   Values that ARE legitimate PCI domain knowledge — `admin`/`root` for
   req 2.2.4, the lockout threshold of 10 for 8.3.4, `TLS 1.0`/`1.1`
   for 4.1 — are explicitly kept allowable. 11 invariants, all green
   today. Any future iteration that introduces a fixture-coupled patch
   will fail CI.

2. **Holdout dataset + spec** (new `pci_data_holdout.ts` +
   `pci_compliance_holdout/pci_compliance_holdout.spec.ts`). Same five
   PCI categories (auth/network/vuln/endpoint/legacy) but every
   memorisable axis is systematically different:
     - Index naming drops the `logs-*-{category}` pattern in favour of
       `security-audit-identity-*`, `siem-flows-prod-*`,
       `pkginfo-cve-*`, `edr-processes-*`, `legacy-app-syslog-*`. Tests
       that scope discovery uses field caps, not name patterns.
     - Brute-force volume is 8 (BELOW the PCI 8.3.4 threshold of 10) —
       expected verdict is GREEN, NOT RED. Catches skills that learnt
       "any failed-login cluster = violation".
     - Default-account flavours are Windows `Administrator` +
       `service_acct_42`, not Unix `admin`/`root`.
     - Weak TLS signature is TLS 1.1 ALONE — no TLS 1.0, no plain HTTP.
       Tests sub-version recognition rather than the kitchen-sink
       "multiple weak versions" pattern of the iteration set.
     - Non-ECS field schema uses `actor_name` / `client_addr` /
       `action_status` / `event_verb` / `device_id` / `cve_id` /
       `risk_rating` / `command` — completely different from the
       iteration set's `username` / `src_ip` / etc. Tests that
       field-mapping is semantic, not memorised.
     - 4-hour time window instead of 1-hour.
     - 2025-vintage CVEs instead of 2024.

The six holdout scenarios mirror the structure of the iteration
scenarios so the gap measurement is apples-to-apples: report,
single-requirement check (× brute force + TLS + default accounts),
scope discovery, field mapping.
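The anti-overfit invariant from defence layer 1 can be sketched as a pure helper (illustrative; the real `pci_compliance_autonomous_skill.test.ts` asserts 11 invariants and a longer forbidden list than shown here):

```typescript
// Sketch of the anti-overfit lockdown: fixture values must never leak into
// skill content. Legitimate PCI domain constants (admin/root for 2.2.4,
// the 8.3.4 lockout threshold of 10, TLS 1.0/1.1 for 4.1) are
// deliberately NOT on this list.
const FORBIDDEN_FIXTURE_VALUES = [
  'jdoe',
  'pcompton',
  '192.168.1.100',
  '10.20.30.40',
  '12 failed',
];

function fixtureLeaks(skillContent: string): string[] {
  const content = skillContent.toLowerCase();
  return FORBIDDEN_FIXTURE_VALUES.filter((v) => content.includes(v.toLowerCase()));
}
```

A CI test then asserts `fixtureLeaks(skillContent)` is empty, so any future iteration that patches the skill with a fixture-coupled value fails the build.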

Result on Sonnet 4.6
--------------------

|                | iteration | holdout | gap    | verdict |
|----------------|-----------|---------|--------|---------|
| Hand-written   | 0.989     | 0.942   | +0.047 | CLEAN   |
| Autonomous v5  | 0.989     | 0.927   | +0.062 | CAUTION |

Both variants drop the same ~5-6 pts moving from iteration to holdout
— and they drop on the SAME two scenarios (default-account variants
0.750/0.750, 4h scorecard 0.900/0.900). That tells us the holdout is
genuinely harder, not that the autonomous skill is uniquely overfit.
The autonomous gap of 0.062 is only 0.015 wider than the hand-written
gap — well within noise of the framework.

Crucially, the three HARDEST tests all scored 1.000 for both skills:
  - below-threshold brute force (counter-case — agent did NOT
    fabricate a false-positive violation)
  - TLS 1.1 alone (sub-version recognition without the kitchen-sink
    signature)
  - scope discovery on non-`logs-*` indices (worked via field caps,
    not via index-name pattern matching)

Tooling changes
---------------

  - `run-eval.sh`: scout boot timeout bumped 6 min → 15 min; the
    default was unreliable when the host was also running an IDE.
  - `build_comparison_html.mjs`: new `--holdout-runs` flag mirroring
    `--runs`; new §5 section renders the iteration vs holdout grid,
    computes the gap per variant, applies the three-band verdict
    (CLEAN / CAUTION / OVERFIT), and lists the divergence axes plus
    the per-scenario holdout breakdown. Subsequent section numbers
    renumbered (6 reasoning, 7 reproduce, 8 provenance, 9 Bedrock).
  - `comparison.html` regenerated with the live holdout numbers.

How to re-run
-------------

    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        handwritten pmeClaudeV46SonnetUsEast1 sonnet46-handwritten-holdout HOLDOUT
    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        autonomous  pmeClaudeV46SonnetUsEast1 sonnet46-autonomous-holdout  HOLDOUT
    node x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/build_comparison_html.mjs \
        --runs ... --holdout-runs sonnet46-handwritten=...,sonnet46-autonomous=...

This commit closes the question "did we just overfit to the iteration
fixtures?" with a measurement, not an assertion. The answer is "the
gap is small enough that the iteration loop is healthy, but not zero
— field-mapping on novel vocabularies is the one place the autonomous
skill is genuinely weaker than the hand-written one (0.909 vs 1.000),
and that is a tool-implementation issue, not a skill-content overfit".

The previous report overclaimed full autonomy by saying the autonomous variant
"ships an independently-authored 4-tool decomposition ... no shared context with
the human-authored variant" and called it a "fully autonomous stack". That is
true at the agent-facing surface (tool IDs, descriptions, schemas, decomposition,
skill content, registration) but NOT at the domain-engine layer: each autonomous
tool's handler still imports PCI_REQUIREMENTS, evaluateRequirement, and the
ScopeClaim builder directly from the hand-written variant's pci_compliance_*
modules.

Recalibrates the framing without changing any numbers:

- §1 intro now distinguishes "agent-facing surface" (independent) from
  "underlying domain engine" (shared via direct module imports) and points to
  the new §1.5 ladder.
- §1.5 (new) "Autonomy ladder — what's truly independent vs what's shared":
  a 10-row table marking tool IDs, descriptions, schemas, decomposition,
  prose, and registration as INDEPENDENT, and the requirement catalog,
  evaluator, validation schemas, ScopeClaim builder, and time helpers as
  SHARED. Names each shared file.
- §4 verdict banner: "fully autonomous stack" → "surface-level autonomy of
  tools too", with an explicit caveat that the handler bodies still import the
  domain engine from the hand-written variant. Calls out the missing follow-up
  (pci_autonomous_requirements.ts / pci_autonomous_evaluator.ts).
- §6 reasoning bullet 4: "Independently-authored tools" → "Independently-
  authored tool surface (engine still shared — see §1.5)" with the specific
  module names that are still being imported.
- §8 Provenance & honesty: new "Honest limitation: autonomy is layered, not
  total" subsection summarising what the eval numbers measure (agent-surface
  autonomy on top of a shared engine) and what the next experiment would have
  to look like (independent engine + zero-import CI test + re-run).

No code, eval numbers, or branch behaviour changed — only the framing of
what the eval result is claiming. Sets up the follow-up work of authoring
pci_autonomous_requirements.ts, pci_autonomous_evaluator.ts, and
pci_autonomous_schemas.ts from the public DSS v4.0.1 spec and re-running.

Make the autonomous skill truly autonomous all the way down. Previously
the four `pci_autonomous_*_tool.ts` handlers re-used the same PCI domain
helpers as the hand-written skill (`pci_compliance_schemas`,
`pci_compliance_requirements`, `pci_compliance_evaluator`). The
agent-facing surface (IDs, schemas, decomposition, registration, skill
content) was independent, but the underlying PCI engine was shared.

This commit adds three engine modules in `pci_autonomous_tools/`
authored from the PCI DSS v4.0.1 spec without referencing the
hand-written ones, and rewires all four tools to use only the
autonomous engine:

- `pci_autonomous_schemas.ts` — independent zod input schemas with a
  stricter time-range guard (no future dates) and a `provenance` block
  on `PciAutonomousScopeClaim` for auditable autonomy.
- `pci_autonomous_requirements.ts` — independent v4.0.1 catalog with a
  verdict-typed encoding (`detect_violations` vs `verify_presence`),
  self-documenting ES|QL params (`?_window_start`/`?_window_end`),
  enriched `defaultLookback` with rationale, and post-aggregation
  filtering instead of nested HAVING clauses.
- `pci_autonomous_evaluator.ts` — composable pipeline of pure functions
  (replacing the nested try/catch pyramid), explicit status→score
  lookup table (avoiding multiplicative scoring drift), discriminated
  union for `FieldCapsPreflight`, and a different concurrency runner.
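The explicit status→score lookup can be sketched as follows. Status names and weights here are illustrative assumptions, not the real SCORE_TABLE in `pci_autonomous_evaluator.ts`:

```typescript
// Illustrative status→score table — the actual statuses and weights
// in pci_autonomous_evaluator.ts may differ.
type AutonomousComplianceStatus = 'GREEN' | 'AMBER' | 'RED' | 'NOT_ASSESSABLE';

// Typing the table as Record<Status, number> makes omitting a status a
// compile error, so scoreFor needs no unreachable `?? 0` fallback.
const SCORE_TABLE: Record<AutonomousComplianceStatus, number> = {
  GREEN: 1.0,
  AMBER: 0.5,
  RED: 0.0,
  NOT_ASSESSABLE: 0.25,
};

function scoreFor(status: AutonomousComplianceStatus): number {
  return SCORE_TABLE[status];
}
```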

CI lockdown:

- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every
  file under `pci_autonomous_tools/` and asserts zero imports from the
  hand-written engine modules, plus that each tool file imports at
  least one autonomous engine module. The skill-level surface
  isolation test was also updated to reference the engine lockdown.
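The lockdown idea reduces to a scan of each source file's import specifiers against a deny-list. A minimal sketch (the real test walks `pci_autonomous_tools/` on disk under jest; this version only checks a source string, and ignores `require()` calls):

```typescript
// Hand-written engine modules whose import is forbidden, as named in
// this PR. The real deny-list lives in the jest lockdown test.
const HANDWRITTEN_ENGINE_MODULES = [
  'pci_compliance_schemas',
  'pci_compliance_requirements',
  'pci_compliance_evaluator',
];

// Return every import specifier in `source` that touches a deny-listed
// module. Zero results means the file passes the lockdown.
function findForbiddenImports(source: string): string[] {
  const specifiers = [...source.matchAll(/from\s+['"]([^'"]+)['"]/g)].map((m) => m[1]);
  return specifiers.filter((spec) =>
    HANDWRITTEN_ENGINE_MODULES.some((mod) => spec.includes(mod))
  );
}
```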

All 28 autonomous-skill tests + 3 engine-lockdown tests pass.

The next step (v6 results in `comparison.html`) is a fresh
iteration+holdout eval run against this engine, which can now be
attributed entirely to the autonomous architect.

Plug the v6 run (autonomous tools + autonomous engine) into the
side-by-side comparison report. The architect re-authored the PCI
domain engine from the public PCI DSS v4.0.1 spec
(`pci_autonomous_requirements.ts`, `pci_autonomous_evaluator.ts`,
`pci_autonomous_schemas.ts`), with a CI lockdown test asserting zero
imports from the hand-written engine. Eval results:

Iteration set (Sonnet 4.6, 8 scenarios)
  hand-written: 0.989
  auto v5 (own tools, shared engine): 0.989
  auto v6 (own tools + own engine): 0.989  ← deep autonomy at parity

Holdout set (Sonnet 4.6, 6 scenarios)
  hand-written: 0.942
  auto v5: 0.927 (gap −0.062 vs iteration → CAUTION band)
  auto v6: 0.985 (gap −0.004 vs iteration → CLEAN band)

The deep-autonomy engine generalises *better* than the surface-only v5
on the holdout, with substantive wins on the 4h scorecard scenario
(+0.100) and the default-account variants scenario (+0.250). Both wins
come from the autonomous engine's more deliberate CDE / account-status
semantics carrying over to non-fixture data shapes.

Report changes
--------------

- §1.5 autonomy ladder: rewrite the four engine rows from a single
  "SHARED" red pill to a "v5: SHARED / v6: AUTONOMOUS" pair, and add
  closing paragraphs that distinguish the two cycles.
- §4 multi-model grid: add the v6 column. The reader can see v5 → v6
  was a no-op on iteration scores but a substantive lift on holdout.
- §5 generalisation gap: add a v6 row paired to the v6 holdout run.
  The pairing logic in build_comparison_html.mjs now strips any
  trailing `-vN` suffix when looking up the holdout label, so future
  iterations don't need a code change.
- §6 reasoning bullet: flip the autonomous-side description from
  "engine still shared" to "tool surface AND domain engine
  independent (v6)", with the CI lockdown test referenced.
- §8 honest limitation: rewrite as "how the deep-autonomy experiment
  was constructed (v6)". The prior text said this experiment "is not
  run here". It is now run here, and the section documents the three
  re-authored modules, the CI lockdown, and the result.
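The suffix-stripping pairing logic can be sketched in one line. The helper name is hypothetical; only the `-vN` normalisation is taken from the description above:

```typescript
// Strip a trailing `-vN` version suffix so a versioned iteration run
// label pairs with its unversioned holdout label without code changes
// on future iterations.
function holdoutLabelFor(runLabel: string): string {
  return runLabel.replace(/-v\d+$/, '');
}
```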

The verdict banner now references both v5 (surface autonomy) and v6
(deep autonomy) as separate parity events.

Addresses the v6 deep-autonomy audit findings raised after the architect's
own engine modules landed:

Code-quality (autonomous engine modules)
  - schemas: tighten REQUIREMENT_ID_PATTERN so `all.1` etc. no longer match;
    strip stale "cycle-17" docstring references.
  - requirements: type catalog as Partial<Record<...>> so undefined lookups
    must be handled; drop redundant `| LIMIT 1` after un-grouped STATS;
    remove the as-cast pseudo-anchor (replaced by a runtime invariant in
    the new test file); strip "cycle-17" docstrings.
  - evaluator: scoreFor is exhaustive over the typed SCORE_TABLE so drop
    the unreachable `?? 0` fallback; runAutonomousWithConcurrency now
    awaits all in-flight tasks before re-throwing the first error so a
    single rejection no longer orphans siblings (semantics documented).
  - docstrings across index.ts, compliance_check_tool, register_tools,
    autonomous skill, and experimental_features now consistently describe
    v6 deep autonomy (independent engine + tools + heuristics) rather than
    overclaiming or underclaiming shared logic.
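The concurrency failure semantics described above can be sketched as a worker-pool runner: on failure, workers stop claiming new tasks, every in-flight task settles, and only then is the first error re-thrown. Illustrative only — the real `runAutonomousWithConcurrency` may differ:

```typescript
// Run tasks with a concurrency cap; a single rejection no longer
// orphans in-flight siblings because Promise.all waits for every
// worker to drain before the first error is re-thrown.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  let failed = false;
  let firstError: unknown;

  async function worker(): Promise<void> {
    // Stop claiming new tasks once any task has failed.
    while (!failed && next < tasks.length) {
      const i = next;
      next += 1;
      try {
        results[i] = await tasks[i]();
      } catch (err) {
        if (!failed) {
          failed = true;
          firstError = err;
        }
      }
    }
  }

  // Resolves only after every worker has drained, i.e. all in-flight
  // tasks have settled before we re-throw.
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, () => worker())
  );
  if (failed) throw firstError;
  return results;
}
```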

Engine unit tests (~85 specs, ~2s)
  - pci_autonomous_schemas.test.ts: provenance constants, index-pattern
    refinements (ESQL injection, length bounds), time-range clamping,
    requirement-id regex, buildAutonomousScopeClaim dedupe/sort.
  - pci_autonomous_requirements.test.ts: catalog completeness, self-
    referential ids, presence of AUTONOMOUS_TIME_WINDOW placeholders,
    detect_violations always carries a violation query, defaultLookback
    sanity, plus a real runtime sync invariant that parses every catalog
    key through pciAutonomousRequirementIdSchema (replaces the prior
    compile-time anchor that was suppressed by an `as` cast). Also covers
    requirementCategory, buildAutonomousTimeWindowParams, time-range
    resolution, normalize/resolve helpers, and index-pattern helpers.
  - pci_autonomous_evaluator.test.ts: concurrency runner correctness +
    failure semantics, ordered ?_window_start/?_window_end binding,
    detect_violations RED path, verify_presence GREEN path, AMBER+HIGH /
    AMBER+LOW / NOT_ASSESSABLE branches via mockResolvedValueOnce, ES|QL
    failure → query_failed data gap, evidence row clamping.

Reproducibility (#2 from audit)
  - build_comparison_html.mjs gains --combined-run <label>=<dir>, which
    reads a single results.json that mixes pci-compliance:* (iter) and
    pci-holdout:* (holdout) scenarios and splits them internally. The
    v6 evaluation report can now be regenerated from one results.json
    without an ad-hoc helper script.
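The split can be sketched as a prefix filter over scenario ids. The `ScenarioResult` shape is hypothetical; only the `pci-compliance:` / `pci-holdout:` prefixes come from the description above:

```typescript
// Hypothetical minimal shape of one entry in results.json.
interface ScenarioResult {
  id: string; // e.g. 'pci-compliance:default-accounts'
  score: number;
}

// Split a combined run into iteration and holdout halves by id prefix.
function splitCombinedRun(results: ScenarioResult[]): {
  iteration: ScenarioResult[];
  holdout: ScenarioResult[];
} {
  return {
    iteration: results.filter((r) => r.id.startsWith('pci-compliance:')),
    holdout: results.filter((r) => r.id.startsWith('pci-holdout:')),
  };
}
```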

All four PCI-autonomous Jest suites pass locally (engine + lockdown).
No new lint errors introduced (remaining no-continue / no-nested-ternary
hits are pre-existing in untouched code).

- comparison.html / build_comparison_html.mjs: extend §8 with a new
  "v6 hardening — audit fixes + engine unit tests" subsection that
  spells out the post-v6 audit batch (Partial Record typing, exhaustive
  scoreFor, dropped LIMIT 1, concurrency failure semantics, stricter
  REQUIREMENT_ID_PATTERN), the new 85-spec engine test suite (including
  the runtime catalog↔schema sync invariant that replaces the suppressed
  compile-time anchor), and the new --combined-run flag for one-shot
  v6 report regeneration from a single results.json.

- build_comparison_html.mjs: flatten six pre-existing nested ternaries
  (the §4 multi-runs-vs-live-vs-fallback chain becomes an IIFE with
  if/else; banner-class / banner-cls / gap-advice / mean-row cls all
  become let-block assignments) — no behaviour changes, the script
  smoke-runs end-to-end with --combined-run and produces a valid 574-line
  HTML output with all 11 §-headings intact.

- pci_autonomous_requirements.ts: drop the lone `continue` in
  resolveAutonomousRequirementIds by inverting the guard into a
  positive-branch `if (canonical && canonical !== 'all') { ... }`.
  All 46 requirements specs still pass.
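The flattening pattern can be illustrated on a hypothetical banner-class chain (names invented, not the script's actual variables):

```typescript
// Before (the no-nested-ternary lint-flagged shape):
//   const cls = live ? 'banner-live' : fallback ? 'banner-fallback' : 'banner-static';
// After — a let-block assignment with explicit branches:
function bannerClass(live: boolean, fallback: boolean): string {
  let cls: string;
  if (live) {
    cls = 'banner-live';
  } else if (fallback) {
    cls = 'banner-fallback';
  } else {
    cls = 'banner-static';
  }
  return cls;
}
```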

Net result: both files lint clean (0 errors, 0 warnings). The 7
pre-existing lints sitting inside the audit-batch diff zone — 1
no-continue and 6 no-nested-ternary — are gone.

…rift test

Addresses two follow-up findings on PR elastic#268798:

#2 — Lockdown test (pci_autonomous_modules_no_handwritten_imports.test.ts):
broaden the import deny-list to cover the full hand-written PCI surface,
not just the three engine modules. Now blocks:

  - pci_compliance_tool
  - pci_compliance_evaluator
  - pci_compliance_requirements
  - pci_compliance_schemas
  - pci_field_mapper_tool
  - pci_scope_discovery_tool
  - anything under skills/pci_compliance/**

The previous deny-list only covered the engine trio, which left a silent
re-coupling path: a future contributor could import the hand-written
orchestrator tool or scope-discovery helper and pass CI. The deep-autonomy
guarantee in comparison.html §1.5 is broader than the engine — it covers
every hand-written surface — so the lockdown should match.

#4 — New comparison_html.test.ts: structural snapshot for the committed
report. Asserts that the 11 §-level sections appear (in expected order)
and the v6 hardening / deep-autonomy h3 subsections are present. Catches
the two drift directions between comparison.html and
scripts/build_comparison_html.mjs:

  1. someone edits the HTML directly and forgets to update the template;
  2. someone edits the template and forgets to regenerate + commit.

Deliberately not byte-for-byte equality — the rendered HTML legitimately
changes with each eval refresh and we don't want CI noise on prose tweaks.
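The ordered-headings assertion can be sketched as a cursor walk over the rendered HTML. Heading strings below are illustrative, not the report's actual §-titles:

```typescript
// True iff every heading appears in `html`, in the given order.
// Deliberately not byte-for-byte: prose between headings may change.
function sectionsAppearInOrder(html: string, headings: string[]): boolean {
  let cursor = 0;
  for (const heading of headings) {
    const at = html.indexOf(heading, cursor);
    if (at === -1) return false;
    cursor = at + heading.length;
  }
  return true;
}
```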
Address the 15 findings from the autonomous PCI deep-analysis audit
covering the engine modules, the four agent-facing tools, and the
skill prompt.

Blockers
- Scope-discovery tool now returns a `discoveryClaim` (point-in-time
  snapshot) instead of a mis-shaped `scopeClaim`, surfaces ES errors
  as structured `dataGaps`, and validates `cat.indices` responses
  with a zod schema before walking them.
- Requirements catalog: dropped the unused `requiredCategories[]` field
  and the orphan `requirementCategory()` helper. Removed `NOT_APPLICABLE`
  from `AutonomousComplianceStatus` — it was carried in the score table
  but never produced by any evaluator path.
- Scorecard report no longer tags its synthesised executive roll-up as
  `ToolResultType.esqlResults` (the payload is not an ESQL row set);
  it now lands under `ToolResultType.other` so downstream UX/telemetry
  that special-cases `esqlResults` does not mis-render it.

Important
- Skill prompt rewritten: workflow is now `discover → roll up → drill
  down`. The check and scorecard tools are explicitly designed to be
  used as a sequence and share one evaluator via the new
  `runAutonomousPciEvaluationPack` orchestration helper.
- Both tools now derive `overallStatus` from the same severity rollup
  (`rollupAutonomousOverallStatus`) and `overallConfidence` from the
  same confidence rollup (`rollupAutonomousConfidence`), eliminating
  the previous risk of disagreement.
- Field-mapper sensitive-field regex tightened: the previous bare
  `/token/i` over-matched (e.g. a benign `tokenizer` mapping field
  would have been flagged as sensitive). Replaced with anchored
  patterns for `card`, `pan`, `cvv`, `cvc`, `account.number`,
  `credit.card`, `ssn`, `secret`, `password`, `api.key`, and specific
  `*token` shapes.
- Added an `assertNever` exhaustiveness check on the
  `statusToHumanLabel` switch — adding a new status without updating
  the switch now fails at compile time, with a runtime throw as a
  backstop.
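The regex tightening can be illustrated with a contrast between a bare and an anchored pattern. Both patterns below are invented for illustration — the production patterns may differ:

```typescript
// Bare pattern: matches `token` anywhere, so benign mapping fields
// like `analysis.tokenizer` get flagged as sensitive.
const BARE = /token/i;

// Anchored pattern: `token` must be its own path segment, optionally
// with a known prefix (access_/auth_/api_) and plural `s`.
const ANCHORED = /(^|[._])(auth_|access_|api_)?token(s)?($|[._])/i;
```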

Nice-to-haves
- Removed experiment-only metadata (gate scores, citation counts,
  architect attribution, brittle `comparison.html §1.5` cross-refs)
  from every runtime file. Authoring metadata stays beside the eval
  suite.
- "Recommended Remediation SLA" table in the skill prompt re-labelled
  as operational guidance — only the 30-day req 6.3.3 window is
  spec-sourced; the rest are heuristics a QSA would typically agree
  with but an org may tune.
- SAQ scope-reduction "70%" claim re-cast as the assessor-guidance
  heuristic range (50–80%), not a guarantee.
- `requirementCategory` tests removed; weak `['HIGH','MEDIUM']`
  evaluator assertion pinned to the exact value (`MEDIUM` via the
  coverage-stage no-violation-query path).
- New `buildAutonomousDiscoveryClaim` helper + 4-spec test block
  covering dedupe/sort, provenance pinning, point-in-time semantics,
  and stable shape across shuffled inputs.
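The stable-shape property exercised by those specs can be sketched as a dedupe-and-sort builder. Field names are hypothetical; the real builder also pins provenance metadata:

```typescript
// Hypothetical minimal discovery claim: a point-in-time snapshot of
// which indices were seen, with a stable shape across shuffled inputs.
interface DiscoveryClaim {
  indices: string[];
  capturedAt: string;
}

function buildDiscoveryClaim(indices: string[], capturedAt: string): DiscoveryClaim {
  return {
    // Dedupe then sort so shuffled/duplicated inputs yield an
    // identical claim object.
    indices: [...new Set(indices)].sort(),
    capturedAt,
  };
}
```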

Verification
- ESLint: 14 files, clean.
- Jest: 101/101 pass in `pci_autonomous_tools/` + the autonomous
  skill suite, 16/16 pass in `comparison_html.test.ts`.
- Scoped `tsc -b` against `security_solution/tsconfig.type_check.json`:
  green.