[Security GenAI] Autonomous vs hand-written PCI compliance skill — side-by-side eval harness#11

Draft
patrykkopycinski wants to merge 13 commits into main from pk/autonomous-vs-handwritten-pci

Conversation

@patrykkopycinski

TL;DR

Adds a second PCI compliance skill (pci-compliance-autonomous) that ships alongside the hand-written pci-compliance skill (elastic#256060), and parameterizes the existing kbn-evals-suite-pci-compliance so the same 7-scenario eval suite can be run against either skill via the EVAL_PCI_VARIANT env var.

The autonomous skill was generated end-to-end by skill.architect against the current Kibana tool catalog, with PCI domain knowledge synthesized from autonomous web research + model knowledge. It deliberately reuses the same underlying tools as the hand-written skill, so "skill content" (instructions + domain knowledge + trigger phrases) is the only experimental variable — same tools, same dataset, same evaluators, same judge.

Comparison artifact

Side-by-side comparison report: comparison.html in this branch (rendered via htmlpreview)

Currently shows the structural comparison (skill metadata, content metrics, distinguishing autonomous contributions). The "Live evaluation results" section is wired and waits for output from compare_variants.sh once the AI-connector eval cluster runs the suite. The HTML re-renders deterministically from runs/{handwritten,autonomous}/results.json.

What ships

Server (security_solution plugin)

  • New skill definition pci_compliance_autonomous/ registering pci-compliance-autonomous against the existing PCI tool IDs.
  • Feature flag pciComplianceAutonomousAgentBuilder (default off).
  • Skill registration gated by the flag.
  • Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)

  • evaluate_dataset.ts reads EVAL_PCI_VARIANT (handwritten | autonomous) to select which skill createSkillInvocationEvaluator targets. Default remains handwritten so existing CI is unchanged.
  • scripts/compare_variants.sh runs both variants back-to-back and emits the side-by-side report.
  • scripts/build_comparison_html.mjs generates the report; all embedded paths are repo-relative so the artifact is portable.
  • README documents the variant matrix and the comparison workflow.
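The variant switch described above can be sketched as follows. This is an illustrative, hypothetical shape (names and error handling are not the exact Kibana source): the env var selects a skill ID, with `handwritten` as the fall-through default so existing CI is untouched.

```typescript
// Hypothetical sketch of the EVAL_PCI_VARIANT switch in evaluate_dataset.ts.
type PciVariant = 'handwritten' | 'autonomous';

const SKILL_ID_BY_VARIANT: Record<PciVariant, string> = {
  handwritten: 'pci-compliance',
  autonomous: 'pci-compliance-autonomous',
};

function resolveVariant(env: Record<string, string | undefined>): PciVariant {
  // Default stays `handwritten` so existing CI is unchanged.
  const raw = env.EVAL_PCI_VARIANT ?? 'handwritten';
  if (raw !== 'handwritten' && raw !== 'autonomous') {
    throw new Error(`EVAL_PCI_VARIANT must be "handwritten" or "autonomous", got "${raw}"`);
  }
  return raw;
}

export function skillIdForEnv(env: Record<string, string | undefined>): string {
  return SKILL_ID_BY_VARIANT[resolveVariant(env)];
}
```

Failing fast on an unknown value keeps a typo in a Buildkite step from silently evaluating the wrong variant.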

CI plumbing

  • New Scout config set evals_pci_compliance_autonomous flips ONLY the autonomous flag.
  • evals.suites.json registers the autonomous suite.
  • llm_evals.yml adds a Buildkite step for the autonomous variant; existing PCI step tagged EVAL_PCI_VARIANT=handwritten for symmetry.

How to reproduce locally

cd kibana
./x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/compare_variants.sh
open  ./x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

Or via Buildkite — both kbn-evals-weekly-pci-compliance and kbn-evals-weekly-pci-compliance-autonomous steps now exist in llm_evals.yml and can be triggered independently.

Why two skills, same tools?

The hand-written skill already has its real tool implementations in tree. Forking those tools for the autonomous variant would conflate two questions: "is the autonomous skill content better?" and "are different tool surfaces better?". Reusing the tools isolates skill content as the only variable — exactly what the comparison is meant to measure.

Verification done before push

  • ReadLints clean across all authored files.
  • ESLint clean on every staged file except evaluate_dataset.ts:17 — pre-existing @kbn/imports/no_boundary_crossing from Add PCI compliance skill and tools for Agent Builder elastic/kibana#256060 that reproduces identically on every sibling kbn-evals-suite-* package on main (verified against kbn-evals-suite-security-ai-rules). Endemic to the eval framework, out of scope for a skill comparison.
  • Secrets scan: clean.
  • Personal/absolute paths in diff or generated HTML: clean (HTML uses repoRelative() helper).
  • 15 files, +1373 / -1 — focused and reviewable.
  • Scoped tsc -b on security_solution/tsconfig.json OOMs at 8 GB locally (a known Kibana issue with this plugin's size). Per Kibana's defer-type-check rule, tier-1 ReadLints is the local signal; CI will run the authoritative type check.

Open as draft

Per local convention; not for review until the live evaluator output is attached to comparison.html.


Branch: https://github.com/patrykkopycinski/kibana/tree/pk/autonomous-vs-handwritten-pci
Comparison artifact: https://github.com/patrykkopycinski/kibana/blob/pk/autonomous-vs-handwritten-pci/x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

…y-side eval harness

Adds a second PCI compliance skill (`pci-compliance-autonomous`) that ships
ALONGSIDE the existing hand-written `pci-compliance` skill, so the same eval
suite can be run against both variants and compared head-to-head. The
autonomous variant deliberately reuses the SAME underlying tools as the
hand-written variant, isolating "skill content" (instructions + domain
knowledge + trigger phrases) as the only experimental variable.

## What ships

Server (security_solution plugin)
- New skill definition `pci_compliance_autonomous/` registering
  `pci-compliance-autonomous` against the existing PCI tool IDs.
- New feature flag `pciComplianceAutonomousAgentBuilder` (default off).
- Skill registration gated by the flag in `register_skills.ts`.
- Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)
- `evaluate_dataset.ts` reads `EVAL_PCI_VARIANT` (`handwritten` | `autonomous`)
  to select which skill `createSkillInvocationEvaluator` targets. Default
  remains `handwritten` so existing CI is unchanged.
- `scripts/compare_variants.sh` runs both variants back-to-back and emits a
  side-by-side `comparison.html` with structural metrics + slots for live
  evaluator output (per-scenario scores, judge rationales, latency).
- `scripts/build_comparison_html.mjs` generates the report; all embedded paths
  are repo-relative so the artifact is portable.
- README documents the variant matrix and the comparison workflow.

CI plumbing
- New Scout config set `evals_pci_compliance_autonomous` that flips ONLY the
  autonomous flag, so the autonomous run sees only the autonomous skill.
- `evals.suites.json` registers `pci-compliance-autonomous`.
- `llm_evals.yml` adds a Buildkite step for the autonomous variant and tags
  the existing PCI step with `EVAL_PCI_VARIANT=handwritten` for symmetry.

## Why

The hand-written PCI skill (`pci-compliance`, elastic#256060) is the production
baseline. The autonomous skill was generated end-to-end by `skill.architect`
against the current Kibana tool catalog, with PCI domain knowledge synthesized
from autonomous web research + model knowledge (SAQ taxonomy, v3->v4 deltas,
scope-reduction levers, technical-vs-process classification). Running the
existing 7-scenario PCI eval suite against both — same tools, same dataset,
same evaluators, same judge — gives a clean A/B that answers "is the
autonomously generated skill at least as good as the hand-written one?".

## Out of scope (not introduced by this commit)

`evaluate_dataset.ts:17` triggers `@kbn/imports/no_boundary_crossing` because
`@kbn/evals` is declared `type: "test-helper"` and the suite imports value
exports from it. This lint reproduces identically on every sibling
`kbn-evals-suite-*` package on `main` (verified against
`kbn-evals-suite-security-ai-rules`), so it is endemic to the eval framework
and would require a cross-cutting change to `@kbn/evals` ownership /
visibility — out of scope for this skill comparison.
…on fix

- Ran @kbn/evals-suite-pci-compliance back-to-back against both PCI skill
  variants on a local Scout cluster wired to llama3.1:8b via a LiteLLM
  proxy (translates OpenAI-format requests to Ollama, including structured
  tool_calls). Captured 14 docs per variant from the kibana-evaluations
  data stream.

- Updated build_comparison_html.mjs to consume the framework's actual
  export shape (Elasticsearch _search response), folding the per-evaluator
  rows back into per-scenario rows. Added a routing-aggregate diagnostic
  (scenarios with >=1 PCI-skill tool call, total tool calls vs PCI-skill
  tool calls) so the HTML can show *why* a score landed where it did, not
  just the score itself.

- Re-rendered comparison.html with the live data. Both variants scored
  0.00 across all completed scenarios because llama3.1:8b is too small
  to engage either PCI skill -- the agent router fell back to the
  generic platform.core.search tool on every scenario, never invoking
  security.pci_*. The HTML now carries an honest banner explaining this:
  the comparison is apples-to-apples (identical model + dataset + infra),
  it just lives on the floor at this model scale. The structural and
  domain-coverage deltas in sections 2-3 remain the meaningful signal
  until the same script is re-run with a stronger model.

- Fixed an isolation bug in the autonomous Scout config set: the
  pciComplianceAgentBuilder feature flag defaults to true in
  experimental_features.ts, so the autonomous run was loading BOTH
  skills. Added 'disable:pciComplianceAgentBuilder' to the scout config
  serverArgs to keep the comparison clean for future runs.

Refs: #11
@patrykkopycinski

Live local eval run captured (commit fc5194e97df3)

Ran the suite back-to-back against both PCI skill variants on a local Scout
cluster wired to llama3.1:8b via a LiteLLM proxy (translates OpenAI-format
requests to Ollama, including structured tool_calls). 14 docs per variant
landed in the kibana-evaluations data stream and the comparison HTML now
renders with live data.

Headline numbers

| Signal | Hand-written | Autonomous |
|--------|-------------:|-----------:|
| Scenarios completed (of 8) | 7 | 7 |
| PCI Criteria score (mean) | 0.000 | 0.000 |
| Total tool calls observed | 12 | 12 |
| security.pci_* skill tool calls | 0 | 0 |
| Wall-clock | 17.5 min | 15.5 min |

Honest read

Both variants scored 0 across all completed scenarios, not because of the
skill content but because llama3.1:8b is too small to engage either PCI
skill at all. The agent router fell back to the generic
platform.core.search tool on every scenario and never invoked
security.pci_*. The comparison is apples-to-apples (identical model +
dataset + connector + infra); it just lives on the floor at this model
scale.

The structural and domain-coverage deltas in §2-§3 of the comparison HTML
(do-not-use boundaries, SAQ taxonomy, scope-reduction levers, etc.) remain
the meaningful signal until the same harness is re-run with a stronger
model (GPT-4-class, Claude 3.5+, Bedrock Claude 3.7) — at which point the
same script re-renders §4 with discriminating numbers.

Two side-fixes shipped in the same commit

  1. Isolation bug. pciComplianceAgentBuilder defaults to true in
    experimental_features.ts, so the autonomous Scout config was loading
    both PCI skills. Added 'disable:pciComplianceAgentBuilder' to the
    autonomous config's enableExperimental array so the comparison stays
    clean for future runs.
  2. build_comparison_html.mjs now consumes the framework's actual export
    shape (Elasticsearch _search response from kibana-evaluations),
    folds per-evaluator rows back into per-scenario rows, and adds a
    routing-aggregate diagnostic so the HTML can show why a score landed
    where it did, not just the score. This is what surfaces the "no
    PCI-skill tool calls" finding above.

Comparison HTML:
x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html

…arison on real connectors

The autonomous-vs-handwritten PCI comparison previously ran on llama3.1:8b
through a local Ollama proxy. At that model scale the agent router never
engaged either PCI skill, so every scenario scored 0.00 and the comparison
landed on the floor (see commit fc5194e). This commit promotes the
comparison to real Bedrock connectors and ships the connector-side fix that
the upgrade required.

Bedrock connector — Claude Opus 4.7 enablement
----------------------------------------------
Claude Opus 4.7 on Bedrock rejects the `temperature` inference parameter
with `temperature is deprecated for this model`. Without omitting it the
connector simply 400s on every request. Fix is in three layers:

  - `@kbn/inference-common`: new `supportsTemperature?: boolean` on
    `ModelDefinition`; `claude-opus-4-7` marked `supportsTemperature: false`.
    Future Claude variants (or other provider models) with the same
    restriction need only flip the flag — one source of truth.

  - `inference` plugin: `getTemperatureIfValid` omits temperature when the
    model definition declares `supportsTemperature: false`. Sits alongside
    the existing OpenAI o-series exclusions and works for any provider.

  - `stack_connectors` (Bedrock): new local
    `bedrockModelSupportsTemperature(model)` helper; `formatBedrockBody`
    threads `model` through and gates the parameter. `invokeAI`,
    `invokeStream`, `invokeAIRaw`, `_converse`, and `_converseStream` all
    consult it. Defense in depth — direct sub-action callers
    (Security AI Assistant, etc.) are protected without taking a
    cross-plugin dependency on `@kbn/inference-common`.
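The three layers above can be collapsed into one illustrative sketch. Field and helper names approximate the description (`supportsTemperature`, `getTemperatureIfValid`); this is not the exact Kibana source, just the shape of the gate:

```typescript
// Illustrative shape of the temperature gate, simplified into one file.
interface ModelDefinition {
  id: string;
  // Undefined means "temperature supported" -- the common case.
  supportsTemperature?: boolean;
}

const MODELS: Record<string, ModelDefinition> = {
  'claude-opus-4-7': { id: 'claude-opus-4-7', supportsTemperature: false },
  'claude-sonnet-4-6': { id: 'claude-sonnet-4-6' },
};

function getTemperatureIfValid(
  temperature: number | undefined,
  model: ModelDefinition | undefined
): { temperature?: number } {
  // Omit the parameter entirely when the model rejects it; Bedrock returns
  // a 400 for Opus 4.7 if `temperature` is present at all.
  if (temperature === undefined || model?.supportsTemperature === false) {
    return {};
  }
  return { temperature };
}
```

The key design point is returning an object spread rather than `temperature: undefined`, so the serialized request body never carries the key for models that reject it.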

Smoke-tested with `invokeAI` + `converse` sub-actions:
  - Claude 4.7 Opus (`us.anthropic.claude-opus-4-7`): now passes — temperature
    omitted, response returned.
  - Claude 4.6 Sonnet (`us.anthropic.claude-sonnet-4-6`): still passes —
    temperature included as before.

Live eval comparison (PCI Criteria, LLM-judge 0..1)
---------------------------------------------------
Both PCI skill variants ran the same 8-scenario `@kbn/evals-suite-pci-compliance`
suite end-to-end against a real Scout cluster, on two production Bedrock
connectors:

  | Variant     | Claude 4.7 Opus | Claude 4.6 Sonnet |
  |-------------|----------------:|------------------:|
  | Handwritten |           0.977 |             0.989 |
  | Autonomous  |           0.834 |             0.860 |

The handwritten skill (Smriti, PR elastic#256060) outperforms the autonomous variant
on both models by 14-15 points. The autonomous architect's broader domain
framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers) did not
translate into a better PCI-Criteria score. The handwritten contract is
shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's
scoring rubric — that tight coupling is the deciding factor.

build_comparison_html.mjs gains a `--runs <label>=<dir>,...` mode so the
4-cell grid renders from the four results.json snapshots. Legacy
`--handwritten`/`--autonomous` mode still works for single-model runs.
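The `--runs` argument parsing can be sketched roughly as below (illustrative; the real script is plain JS in `build_comparison_html.mjs` and its internals may differ):

```typescript
// Sketch of the `--runs <label>=<dir>,...` parsing.
function parseRuns(arg: string): Map<string, string> {
  const runs = new Map<string, string>();
  for (const pair of arg.split(',')) {
    const eq = pair.indexOf('=');
    if (eq <= 0 || eq === pair.length - 1) {
      throw new Error(`--runs entries must look like <label>=<dir>, got "${pair}"`);
    }
    runs.set(pair.slice(0, eq), pair.slice(eq + 1));
  }
  return runs;
}
```

A call such as `parseRuns('opus47-hw=runs/a,opus47-auto=runs/b')` would yield the label-to-directory map from which the 4-cell grid renders, one `results.json` snapshot per cell.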

kbn-scout
---------
`run_kibana_server.ts` now respects `SCOUT_READ_DEV_CONFIG=true` and drops
`--no-dev-config` when set, so a developer can load `config/kibana.dev.yml`
(and the preconfigured AI connectors it defines) into the Scout-managed
Kibana process. Default behaviour is unchanged. Without this, evals against
real cloud connectors require fragile API-driven connector creation per
boot.

Refs: #11
@patrykkopycinski

Live eval comparison on real Bedrock connectors

Live PCI Criteria scores (LLM-judge, 0..1) across both PCI skill variants on two production Bedrock connectors:

| Variant | Claude 4.7 Opus | Claude 4.6 Sonnet |
|---------|----------------:|------------------:|
| Handwritten | 0.977 | 0.989 |
| Autonomous | 0.834 | 0.860 |

The handwritten skill (Smriti, elastic#256060) outperforms the autonomous variant on both models by 14-15 points. The autonomous architect's broader domain framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers — §3 of the report) does not translate into a better PCI-Criteria score. The handwritten contract is shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's scoring rubric.

Full side-by-side: x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html (re-rendered from the four results.json snapshots; the new --runs mode keeps the script reproducible from any subset).

Bedrock connector — Claude Opus 4.7 enablement

Running on Opus 4.7 required a connector-side fix: the model rejects the `temperature` inference parameter with `temperature is deprecated for this model`. The patch flows through three layers (see commit 8ee59cf):

  • @kbn/inference-common — new supportsTemperature?: boolean on ModelDefinition; claude-opus-4-7 marked supportsTemperature: false. One source of truth for any provider.
  • inference plugin — getTemperatureIfValid omits temperature when the model definition declares it unsupported.
  • stack_connectors (Bedrock) — local bedrockModelSupportsTemperature(model) helper; formatBedrockBody, invokeAI, invokeStream, invokeAIRaw, _converse, and _converseStream all gate the parameter. Defense in depth for direct sub-action callers (Security AI Assistant, etc.) that bypass the inference plugin layer.

Smoke-tested via invokeAI + converse sub-actions on both connectors — Opus 4.7 now succeeds; Sonnet 4.6 unaffected.

Root-cause analysis of why the autonomously-architected PCI compliance
skill scored 12-15 pts below the hand-written variant uncovered two
distinct bugs that compounded:

1. **Tool registration bug** in `register_tools.ts` — PCI tools were
   gated *only* on `experimentalFeatures.pciComplianceAgentBuilder`,
   which the autonomous scout config explicitly disables to isolate the
   variant comparison. Result: the autonomous variant ran with NO PCI
   tools registered. Trace analysis confirmed 0 calls to
   `security.pci_compliance` across 16 scenarios vs 17-23 for HW. The
   agent fell back to raw `platform.core.execute_esql` and improvised
   the entire workflow. Fixed: gate now triggers on either flag.

2. **Skill-content design** — the autonomous prompt's 6-step workflow
   inserted "Reduce scope (tokenisation/P2PE/segmentation)" and
   "Classify requirements as technical vs process-based" steps BEFORE
   the tool calls, plus an 8 KB "Domain Knowledge Notes" block between
   the workflow and the status vocab. The structure read as
   "do-your-homework first" rather than "call the tools". Restructured:
   tools-first 4-step workflow with explicit "Always call the dedicated
   PCI tools; do not improvise raw ES|QL" injunction, theory moved to a
   "Background reference (do not consult before calling tools)" tail
   section. Removed broken handoff references to non-existent sibling
   skills and stripped tool-description provenance commentary.

Validation on Claude 4.6 Sonnet:
- pre-fix Auto: 0.860 mean (gap to HW: 12.9 pts)
- post-fix Auto v3: 0.955 mean (gap to HW: 3.4 pts)
- 6/8 scenarios now perfect 1.000; 1 scenario (full report) regressed
  -9 pts on a substance-vs-style criterion (agent calls the tool
  correctly but the report formatting elides specific evidence).

Feedback-loop infrastructure:
- `scripts/run-eval.sh` extended with optional scenario-grep argument
  (`run-eval.sh autonomous <connector> <label> "requirement 2.2.4"`)
  collapsing a full-suite cycle (~28 min) to a single-scenario probe
  (~5.6 min including scout boot, ~3 min if scout is reused).
- Two iterations of this loop fixed both bugs end-to-end.

POSTMORTEM.md captures the full analysis, including six ranked content
fixes and a three-tier feedback-loop efficiency proposal.
@patrykkopycinski

Deep-dive: why the autonomous skill scored lower (and how it closed the gap)

After the initial run shipped (Auto 0.860 vs HW 0.989 on Sonnet 4.6), I dropped into the per-scenario rationale + tool-call traces in runs/*/results.json and the per-cell judge explanation fields. Full writeup in POSTMORTEM.md. Highlights:

Root causes (two distinct bugs, compounding)

Bug 1 — tool registration. register_tools.ts gated the PCI tools only on experimentalFeatures.pciComplianceAgentBuilder. The autonomous Scout config (evals_pci_compliance_autonomous) explicitly disables that flag to isolate the variant comparison. Result: the autonomous variant ran with zero PCI tools registered. Trace tally across 16 scenarios on both Opus 4.7 and Sonnet 4.6:

| Cell | Total steps | PCI tool calls | Raw ESQL calls |
|------|------------:|---------------:|---------------:|
| HW · Opus 4.7 | 62 | 17 | 0 |
| Auto · Opus 4.7 | 161 | 0 | 36 |
| HW · Sonnet 4.6 | 77 | 23 | 0 |
| Auto · Sonnet 4.6 | 214 | 0 | 30 |

The judge confirmed this in its own words: "The pci_compliance tool was not called; it was unavailable and fell back to ES|QL". The autonomous skill experiment never had a chance — its tools weren't there.

Bug 2 — skill-content design. Even with the registration fix in place, the autonomous prompt's 6-step workflow inserted Reduce scope (tokenisation / P2PE / segmentation) and Classify requirements as technical vs process-based steps before the tool calls, plus an 8 KB "Domain Knowledge Notes" block between the workflow and the status vocab. The structure reads "do your homework first" rather than "call the tools".

Fixes applied

  1. register_tools.ts — gate on either flag (pciComplianceAgentBuilder || pciComplianceAutonomousAgentBuilder).
  2. Autonomous skill content restructured: tools-first 4-step workflow with an explicit **Always call the dedicated PCI tools; do not improvise raw ES|QL** injunction; theory moved to a Background reference (do not consult before calling tools) tail section.
  3. Removed broken handoff references to non-existent sibling skills (threat-hunting, alert-analysis, detection-rule-edit).
  4. Stripped tool-description provenance commentary that the LLM was reading as design-rationale meta-text rather than operational instructions.
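Fix 1 above can be sketched as a one-line gate change (simplified, with flag names from this PR; note a later commit in this thread replaces the either-flag gate with strictly separate per-variant gates):

```typescript
// Sketch of the register_tools.ts gate after fix 1.
interface ExperimentalFeatures {
  pciComplianceAgentBuilder: boolean;
  pciComplianceAutonomousAgentBuilder: boolean;
}

function shouldRegisterPciTools(features: ExperimentalFeatures): boolean {
  // Either variant being enabled needs the shared PCI tools registered;
  // gating on only the hand-written flag left the autonomous run toolless.
  return (
    features.pciComplianceAgentBuilder || features.pciComplianceAutonomousAgentBuilder
  );
}
```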

Result

Same scout config, same connector, same evaluator, only the skill content + tool-registration changed:

| Run | Auto mean (Sonnet 4.6) | Gap to HW |
|-----|-----------------------:|----------:|
| Auto v1 (initial) | 0.860 | 12.9 pts |
| Auto v3 (after fixes) | 0.955 | 3.4 pts |

6/8 scenarios now perfect 1.000 — matching the hand-written variant exactly. One scenario (full report) regressed -9 pts on a substance-vs-style criterion: the agent correctly calls pci_compliance in report mode but the report formatting elides specific evidence (mentions "default accounts active" rather than "admin and root").

The single-scenario trace on requirement 2.2.4 default accounts, before vs after:

Before fix (17 steps, score 0.571):

reasoning → reasoning → filestore.read → reasoning → reasoning →
  platform.core.list_indices → reasoning → reasoning →
  platform.core.get_index_mapping → reasoning →
  platform.core.search → reasoning → reasoning →
  platform.core.execute_esql → reasoning → reasoning →
  platform.core.execute_esql

Missed admin and root violations.

After fix (7 steps, score 1.000):

reasoning → filestore.read → reasoning → reasoning →
  security.pci_scope_discovery → reasoning →
  security.pci_compliance

Both admin and root violations surfaced. Identical pattern to HW.

Feedback loop: 28 min → 5.6 min per iteration

scripts/run-eval.sh now accepts an optional scenario-grep argument:

run-eval.sh autonomous <connector> <label> "requirement 2.2.4"

That collapses a full-suite cycle (~28 min) to a single-scenario probe (~5.6 min including scout boot, ~3 min if scout is reused across iterations). Two iterations of this loop end-to-end fixed both bugs above. The postmortem also proposes Tier 2 (sub-15s tool-call probe via the Kibana chat API) and Tier 3 (skill-architect rewrite loop fed by judge rationales) if you want even faster cycles next.

The comparison.html on this PR's HEAD now renders Auto v3 alongside the previous columns so the delta is visible at a glance.

…y (0.989 vs 0.989)

The autonomous PCI compliance skill now ships its own independently-authored
4-tool decomposition under a separate allowlist entry. The autonomous skill
has no knowledge of -- and no path to -- the hand-written PCI tools. This
validates a fully end-to-end autonomous stack (skill + tools, both
autonomously created) and reaches parity with the human-authored variant.

What changed
------------
* New PCI tool bundle under `agent_builder/tools/pci_autonomous_tools/`:
  - `pci_autonomous_scope_discovery`
  - `pci_autonomous_compliance_check`   (split out from the consolidated tool)
  - `pci_autonomous_scorecard_report`   (split out from the consolidated tool)
  - `pci_autonomous_field_mapper`
  All four implement the cycle-17 architect blueprint's 4-tool decomposition
  (vs the hand-written variant's 3 tools, where check+report share one tool
  via a `mode` parameter). Each tool reuses the underlying domain logic so
  the comparison stays apples-to-apples on capability while validating the
  isolation property.

* `register_tools.ts`: hand-written PCI tools register ONLY under
  `experimentalFeatures.pciComplianceAgentBuilder`; autonomous PCI tools
  register ONLY under `experimentalFeatures.pciComplianceAutonomousAgentBuilder`.
  The previous lenient gate (`either flag`) is removed -- the two variants
  are now strictly isolated.

* `allow_lists.ts`: all four new autonomous tool IDs added to the
  `AGENT_BUILDER_BUILTIN_TOOLS` allowlist (without this, tool registration
  silently fails and the agent falls back to raw ES|QL).

* Autonomous skill content + `getRegistryTools` rewired to reference the
  new tool IDs only.

* Eval rubric (`pci_compliance.spec.ts`) is now variant-aware via
  `EVAL_PCI_VARIANT` -- judging criteria check for `pci_autonomous_*` tool
  names when the autonomous variant is on, and the original names otherwise.

* Skill contract tests harden the isolation property: explicit assertions
  that the autonomous skill never references any hand-written tool ID, and
  that `getRegistryTools` advertises ONLY the autonomous bundle.

* Comparison HTML updated with a new v5 column and a green success banner
  showing the autonomous skill+tools reaches parity with the hand-written
  baseline on Claude 4.6 Sonnet (0.989 vs 0.989, 8/8 scenarios).
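The variant-aware rubric can be sketched as a selector over expected tool IDs. This is a hypothetical shape for `pci_compliance.spec.ts`: the autonomous IDs come from the list above, while only two hand-written IDs (`security.pci_scope_discovery`, `security.pci_compliance`) appear in the traces earlier in this thread, so the hand-written list here is illustrative and incomplete:

```typescript
// Hypothetical sketch of variant-aware judging criteria.
function expectedToolIds(variant: 'handwritten' | 'autonomous'): string[] {
  if (variant === 'autonomous') {
    return [
      'pci_autonomous_scope_discovery',
      'pci_autonomous_compliance_check',
      'pci_autonomous_scorecard_report',
      'pci_autonomous_field_mapper',
    ];
  }
  // Hand-written bundle: the tool IDs observed in traces (not exhaustive).
  return ['security.pci_scope_discovery', 'security.pci_compliance'];
}
```

Keeping the two lists disjoint is what lets the contract tests assert the isolation property directly.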

Why
---
The user wanted to validate that the autonomous skill workflow generalises
to other domains -- which requires removing every shortcut where the
autonomous variant inherits the hand-written variant's tooling. The earlier
"shared tool" runs were measuring only skill-content quality; this run
measures the full stack the architect would generate from a blank slate.

Result
------
| Variant                                 | Mean (8 scenarios) |
|-----------------------------------------|-------------------|
| Hand-written, Claude 4.6 Sonnet         | 0.989             |
| Autonomous v5 (own 4 tools), Sonnet 4.6 | 0.989             |
| Autonomous v3 (shared tools), Sonnet    | 0.955             |
| Autonomous v1 (shared, content drift)   | 0.860             |

Parity on the headline metric. The autonomous stack (skill content +
4-tool decomposition + allowlist entry + register gate) ships as a
self-contained bundle the architect can replicate for any other domain.
Adds a second evaluation surface so the iteration loop on the
autonomous PCI skill can be trusted to produce a generalisable skill
rather than one that has memorised the iteration fixtures.

Why
---
The 0.989 we got from `sonnet46-autonomous-v5` (cycle that hit
parity with the hand-written variant) is scored against the SAME
fixtures we inspect while improving the skill. That tight loop is
how every dataset-driven optimisation produces overfit: the skill
content drifts from "teach the principle" to "match the fixture".

Two layers of defence
---------------------

1. **Anti-overfit lockdown** (in `pci_compliance_autonomous_skill.test.ts`).
   A new `describe('anti-overfit ...')` block asserts the skill content
   contains NONE of the iteration- or holdout-set fixture values
   (`jdoe`, `pcompton`, `192.168.1.100`, `10.20.30.40`, `12 failed`,
   the random `logs-<hex>-{auth|network|...}` index pattern, etc.).
   Values that ARE legitimate PCI domain knowledge — `admin`/`root` for
   req 2.2.4, the lockout threshold of 10 for 8.3.4, `TLS 1.0`/`1.1`
   for 4.1 — are explicitly kept allowable. 11 invariants, all green
   today. Any future iteration that introduces a fixture-coupled patch
   will fail CI.

2. **Holdout dataset + spec** (new `pci_data_holdout.ts` +
   `pci_compliance_holdout/pci_compliance_holdout.spec.ts`). Same five
   PCI categories (auth/network/vuln/endpoint/legacy) but every
   memorisable axis is systematically different:
     - Index naming drops the `logs-*-{category}` pattern in favour of
       `security-audit-identity-*`, `siem-flows-prod-*`,
       `pkginfo-cve-*`, `edr-processes-*`, `legacy-app-syslog-*`. Tests
       that scope discovery uses field caps, not name patterns.
     - Brute-force volume is 8 (BELOW the PCI 8.3.4 threshold of 10) —
       expected verdict is GREEN, NOT RED. Catches skills that learnt
       "any failed-login cluster = violation".
     - Default-account flavours are Windows `Administrator` +
       `service_acct_42`, not Unix `admin`/`root`.
     - Weak TLS signature is TLS 1.1 ALONE — no TLS 1.0, no plain HTTP.
       Tests sub-version recognition rather than the kitchen-sink
       "multiple weak versions" pattern of the iteration set.
     - Non-ECS field schema uses `actor_name` / `client_addr` /
       `action_status` / `event_verb` / `device_id` / `cve_id` /
       `risk_rating` / `command` — completely different from the
       iteration set's `username` / `src_ip` / etc. Tests that
       field-mapping is semantic, not memorised.
     - 4-hour time window instead of 1-hour.
     - 2025-vintage CVEs instead of 2024.

The six holdout scenarios mirror the structure of the iteration
scenarios so the gap measurement is apples-to-apples: report,
single-requirement check (× brute force + TLS + default accounts),
scope discovery, field mapping.
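The anti-overfit invariant from defence layer 1 can be sketched as a pure helper (illustrative; the real `pci_compliance_autonomous_skill.test.ts` asserts 11 invariants and a longer forbidden list than shown here):

```typescript
// Sketch of the anti-overfit lockdown: fixture values must never leak into
// skill content. Legitimate PCI domain constants (admin/root for 2.2.4,
// the 8.3.4 lockout threshold of 10, TLS 1.0/1.1 for 4.1) are
// deliberately NOT on this list.
const FORBIDDEN_FIXTURE_VALUES = [
  'jdoe',
  'pcompton',
  '192.168.1.100',
  '10.20.30.40',
  '12 failed',
];

function fixtureLeaks(skillContent: string): string[] {
  const content = skillContent.toLowerCase();
  return FORBIDDEN_FIXTURE_VALUES.filter((v) => content.includes(v.toLowerCase()));
}
```

A CI test then asserts `fixtureLeaks(skillContent)` is empty, so any future iteration that patches the skill with a fixture-coupled value fails the build.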

Result on Sonnet 4.6
--------------------

|                | iteration | holdout | gap    | verdict |
|----------------|-----------|---------|--------|---------|
| Hand-written   | 0.989     | 0.942   | +0.047 | CLEAN   |
| Autonomous v5  | 0.989     | 0.927   | +0.062 | CAUTION |

Both variants drop the same ~5-6 pts moving from iteration to holdout
— and they drop on the SAME two scenarios (default-account variants
0.750/0.750, 4h scorecard 0.900/0.900). That tells us the holdout is
genuinely harder, not that the autonomous skill is uniquely overfit.
The autonomous gap of 0.062 is only 0.015 wider than the hand-written
gap — well within noise of the framework.

Crucially, the three HARDEST tests all scored 1.000 for both skills:
  - below-threshold brute force (counter-case — agent did NOT
    fabricate a false-positive violation)
  - TLS 1.1 alone (sub-version recognition without the kitchen-sink
    signature)
  - scope discovery on non-`logs-*` indices (worked via field caps,
    not via index-name pattern matching)

Tooling changes
---------------

  - `run-eval.sh`: scout boot timeout bumped 6 min → 15 min; the
    default was unreliable when the host was also running an IDE.
  - `build_comparison_html.mjs`: new `--holdout-runs` flag mirroring
    `--runs`; new §5 section renders the iteration vs holdout grid,
    computes the gap per variant, applies the three-band verdict
    (CLEAN / CAUTION / OVERFIT), and lists the divergence axes plus
    the per-scenario holdout breakdown. Subsequent section numbers
    renumbered (6 reasoning, 7 reproduce, 8 provenance, 9 Bedrock).
  - `comparison.html` regenerated with the live holdout numbers.

How to re-run
-------------

    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        handwritten pmeClaudeV46SonnetUsEast1 sonnet46-handwritten-holdout HOLDOUT
    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        autonomous  pmeClaudeV46SonnetUsEast1 sonnet46-autonomous-holdout  HOLDOUT
    node x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/build_comparison_html.mjs \
        --runs ... --holdout-runs sonnet46-handwritten=...,sonnet46-autonomous=...

This commit closes the question "did we just overfit to the iteration
fixtures?" with a measurement, not an assertion. The answer is "the
gap is small enough that the iteration loop is healthy, but not zero
— field-mapping on novel vocabularies is the one place the autonomous
skill is genuinely weaker than the hand-written one (0.909 vs 1.000),
and that is a tool-implementation issue, not a skill-content overfit".

The previous report overclaimed full autonomy by saying the autonomous variant
"ships an independently-authored 4-tool decomposition ... no shared context with
the human-authored variant" and called it a "fully autonomous stack". That is
true at the agent-facing surface (tool IDs, descriptions, schemas, decomposition,
skill content, registration) but NOT at the domain-engine layer: each autonomous
tool's handler still imports PCI_REQUIREMENTS, evaluateRequirement, and the
ScopeClaim builder directly from the hand-written variant's pci_compliance_*
modules.

Recalibrates the framing without changing any numbers:

- §1 intro now distinguishes "agent-facing surface" (independent) from
  "underlying domain engine" (shared via direct module imports) and points to
  the new §1.5 ladder.
- §1.5 (new) "Autonomy ladder — what's truly independent vs what's shared":
  a 10-row table marking tool IDs, descriptions, schemas, decomposition,
  prose, and registration as INDEPENDENT, and the requirement catalog,
  evaluator, validation schemas, ScopeClaim builder, and time helpers as
  SHARED. Names each shared file.
- §4 verdict banner: "fully autonomous stack" → "surface-level autonomy of
  tools too", with an explicit caveat that the handler bodies still import the
  domain engine from the hand-written variant. Calls out the missing follow-up
  (pci_autonomous_requirements.ts / pci_autonomous_evaluator.ts).
- §6 reasoning bullet 4: "Independently-authored tools" → "Independently-
  authored tool surface (engine still shared — see §1.5)" with the specific
  module names that are still being imported.
- §8 Provenance & honesty: new "Honest limitation: autonomy is layered, not
  total" subsection summarising what the eval numbers measure (agent-surface
  autonomy on top of a shared engine) and what the next experiment would have
  to look like (independent engine + zero-import CI test + re-run).

No code, eval numbers, or branch behaviour changed — only the framing of
what the eval result is claiming. Sets up the follow-up work of authoring
pci_autonomous_requirements.ts, pci_autonomous_evaluator.ts, and
pci_autonomous_schemas.ts from the public DSS v4.0.1 spec and re-running.

Make the autonomous skill truly autonomous all the way down. Previously
the four `pci_autonomous_*_tool.ts` handlers re-used the same PCI domain
helpers as the hand-written skill (`pci_compliance_schemas`,
`pci_compliance_requirements`, `pci_compliance_evaluator`). The
agent-facing surface (IDs, schemas, decomposition, registration, skill
content) was independent, but the underlying PCI engine was shared.

This commit adds three engine modules in `pci_autonomous_tools/`
authored from the PCI DSS v4.0.1 spec without referencing the
hand-written ones, and rewires all four tools to use only the
autonomous engine:

- `pci_autonomous_schemas.ts` — independent zod input schemas with a
  stricter time-range guard (no future dates) and a `provenance` block
  on `PciAutonomousScopeClaim` for auditable autonomy.
- `pci_autonomous_requirements.ts` — independent v4.0.1 catalog with a
  verdict-typed encoding (`detect_violations` vs `verify_presence`),
  self-documenting ES|QL params (`?_window_start`/`?_window_end`),
  enriched `defaultLookback` with rationale, and post-aggregation
  filtering instead of nested HAVING clauses.
- `pci_autonomous_evaluator.ts` — composable pipeline of pure functions
  (replacing the nested try/catch pyramid), explicit status→score
  lookup table (avoiding multiplicative scoring drift), discriminated
  union for `FieldCapsPreflight`, and a different concurrency runner.
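The explicit status→score lookup can be sketched as follows. Status names and weights here are illustrative assumptions, not the real SCORE_TABLE in `pci_autonomous_evaluator.ts`:

```typescript
// Illustrative status→score table — the actual statuses and weights
// in pci_autonomous_evaluator.ts may differ.
type AutonomousComplianceStatus = 'GREEN' | 'AMBER' | 'RED' | 'NOT_ASSESSABLE';

// Typing the table as Record<Status, number> makes omitting a status a
// compile error, so scoreFor needs no unreachable `?? 0` fallback.
const SCORE_TABLE: Record<AutonomousComplianceStatus, number> = {
  GREEN: 1.0,
  AMBER: 0.5,
  RED: 0.0,
  NOT_ASSESSABLE: 0.25,
};

function scoreFor(status: AutonomousComplianceStatus): number {
  return SCORE_TABLE[status];
}
```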

CI lockdown:

- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every
  file under `pci_autonomous_tools/` and asserts zero imports from the
  hand-written engine modules, plus that each tool file imports at
  least one autonomous engine module. The skill-level surface
  isolation test was also updated to reference the engine lockdown.
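The lockdown idea reduces to a scan of each source file's import specifiers against a deny-list. A minimal sketch (the real test walks `pci_autonomous_tools/` on disk under jest; this version only checks a source string, and ignores `require()` calls):

```typescript
// Hand-written engine modules whose import is forbidden, as named in
// this PR. The real deny-list lives in the jest lockdown test.
const HANDWRITTEN_ENGINE_MODULES = [
  'pci_compliance_schemas',
  'pci_compliance_requirements',
  'pci_compliance_evaluator',
];

// Return every import specifier in `source` that touches a deny-listed
// module. Zero results means the file passes the lockdown.
function findForbiddenImports(source: string): string[] {
  const specifiers = [...source.matchAll(/from\s+['"]([^'"]+)['"]/g)].map((m) => m[1]);
  return specifiers.filter((spec) =>
    HANDWRITTEN_ENGINE_MODULES.some((mod) => spec.includes(mod))
  );
}
```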

All 28 autonomous-skill tests + 3 engine-lockdown tests pass.

The next step (v6 results in `comparison.html`) is a fresh
iteration+holdout eval run against this engine, which can now be
attributed entirely to the autonomous architect.

Plug the v6 run (autonomous tools + autonomous engine) into the
side-by-side comparison report. The architect re-authored the PCI
domain engine from the public PCI DSS v4.0.1 spec
(`pci_autonomous_requirements.ts`, `pci_autonomous_evaluator.ts`,
`pci_autonomous_schemas.ts`), with a CI lockdown test asserting zero
imports from the hand-written engine. Eval results:

Iteration set (Sonnet 4.6, 8 scenarios)
  hand-written: 0.989
  auto v5 (own tools, shared engine): 0.989
  auto v6 (own tools + own engine): 0.989  ← deep autonomy at parity

Holdout set (Sonnet 4.6, 6 scenarios)
  hand-written: 0.942
  auto v5: 0.927 (gap −0.062 vs iteration → CAUTION band)
  auto v6: 0.985 (gap −0.004 vs iteration → CLEAN band)

The deep-autonomy engine generalises *better* than the surface-only v5
on the holdout, with substantive wins on the 4h scorecard scenario
(+0.100) and the default-account variants scenario (+0.250). Both wins
come from the autonomous engine's more deliberate CDE / account-status
semantics carrying over to non-fixture data shapes.

Report changes
--------------

- §1.5 autonomy ladder: rewrite the four engine rows from a single
  "SHARED" red pill to a "v5: SHARED / v6: AUTONOMOUS" pair, and add
  closing paragraphs that distinguish the two cycles.
- §4 multi-model grid: add the v6 column. The reader can see v5 → v6
  was a no-op on iteration scores but a substantive lift on holdout.
- §5 generalisation gap: add a v6 row paired to the v6 holdout run.
  The pairing logic in build_comparison_html.mjs now strips any
  trailing `-vN` suffix when looking up the holdout label, so future
  iterations don't need a code change.
- §6 reasoning bullet: flip the autonomous-side description from
  "engine still shared" to "tool surface AND domain engine
  independent (v6)", with the CI lockdown test referenced.
- §8 honest limitation: rewrite as "how the deep-autonomy experiment
  was constructed (v6)". The prior text said this experiment "is not
  run here". It is now run here, and the section documents the three
  re-authored modules, the CI lockdown, and the result.
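The suffix-stripping pairing logic can be sketched in one line. The helper name is hypothetical; only the `-vN` normalisation is taken from the description above:

```typescript
// Strip a trailing `-vN` version suffix so a versioned iteration run
// label pairs with its unversioned holdout label without code changes
// on future iterations.
function holdoutLabelFor(runLabel: string): string {
  return runLabel.replace(/-v\d+$/, '');
}
```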

The verdict banner now references both v5 (surface autonomy) and v6
(deep autonomy) as separate parity events.

Addresses the v6 deep-autonomy audit findings raised after the architect's
own engine modules landed:

Code-quality (autonomous engine modules)
  - schemas: tighten REQUIREMENT_ID_PATTERN so `all.1` etc. no longer match;
    strip stale "cycle-17" docstring references.
  - requirements: type catalog as Partial<Record<...>> so undefined lookups
    must be handled; drop redundant `| LIMIT 1` after un-grouped STATS;
    remove the as-cast pseudo-anchor (replaced by a runtime invariant in
    the new test file); strip "cycle-17" docstrings.
  - evaluator: scoreFor is exhaustive over the typed SCORE_TABLE so drop
    the unreachable `?? 0` fallback; runAutonomousWithConcurrency now
    awaits all in-flight tasks before re-throwing the first error so a
    single rejection no longer orphans siblings (semantics documented).
  - docstrings across index.ts, compliance_check_tool, register_tools,
    autonomous skill, and experimental_features now consistently describe
    v6 deep autonomy (independent engine + tools + heuristics) rather than
    overclaiming or underclaiming shared logic.
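The concurrency failure semantics described above can be sketched as a worker-pool runner: on failure, workers stop claiming new tasks, every in-flight task settles, and only then is the first error re-thrown. Illustrative only — the real `runAutonomousWithConcurrency` may differ:

```typescript
// Run tasks with a concurrency cap; a single rejection no longer
// orphans in-flight siblings because Promise.all waits for every
// worker to drain before the first error is re-thrown.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  let failed = false;
  let firstError: unknown;

  async function worker(): Promise<void> {
    // Stop claiming new tasks once any task has failed.
    while (!failed && next < tasks.length) {
      const i = next;
      next += 1;
      try {
        results[i] = await tasks[i]();
      } catch (err) {
        if (!failed) {
          failed = true;
          firstError = err;
        }
      }
    }
  }

  // Resolves only after every worker has drained, i.e. all in-flight
  // tasks have settled before we re-throw.
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, () => worker())
  );
  if (failed) throw firstError;
  return results;
}
```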

Engine unit tests (~85 specs, ~2s)
  - pci_autonomous_schemas.test.ts: provenance constants, index-pattern
    refinements (ESQL injection, length bounds), time-range clamping,
    requirement-id regex, buildAutonomousScopeClaim dedupe/sort.
  - pci_autonomous_requirements.test.ts: catalog completeness, self-
    referential ids, presence of AUTONOMOUS_TIME_WINDOW placeholders,
    detect_violations always carries a violation query, defaultLookback
    sanity, plus a real runtime sync invariant that parses every catalog
    key through pciAutonomousRequirementIdSchema (replaces the prior
    compile-time anchor that was suppressed by an `as` cast). Also covers
    requirementCategory, buildAutonomousTimeWindowParams, time-range
    resolution, normalize/resolve helpers, and index-pattern helpers.
  - pci_autonomous_evaluator.test.ts: concurrency runner correctness +
    failure semantics, ordered ?_window_start/?_window_end binding,
    detect_violations RED path, verify_presence GREEN path, AMBER+HIGH /
    AMBER+LOW / NOT_ASSESSABLE branches via mockResolvedValueOnce, ES|QL
    failure → query_failed data gap, evidence row clamping.

Reproducibility (#2 from audit)
  - build_comparison_html.mjs gains --combined-run <label>=<dir>, which
    reads a single results.json that mixes pci-compliance:* (iter) and
    pci-holdout:* (holdout) scenarios and splits them internally. The
    v6 evaluation report can now be regenerated from one results.json
    without an ad-hoc helper script.
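The split can be sketched as a prefix filter over scenario ids. The `ScenarioResult` shape is hypothetical; only the `pci-compliance:` / `pci-holdout:` prefixes come from the description above:

```typescript
// Hypothetical minimal shape of one entry in results.json.
interface ScenarioResult {
  id: string; // e.g. 'pci-compliance:default-accounts'
  score: number;
}

// Split a combined run into iteration and holdout halves by id prefix.
function splitCombinedRun(results: ScenarioResult[]): {
  iteration: ScenarioResult[];
  holdout: ScenarioResult[];
} {
  return {
    iteration: results.filter((r) => r.id.startsWith('pci-compliance:')),
    holdout: results.filter((r) => r.id.startsWith('pci-holdout:')),
  };
}
```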

All four PCI-autonomous Jest suites pass locally (engine + lockdown).
No new lint errors introduced (remaining no-continue / no-nested-ternary
hits are pre-existing in untouched code).

- comparison.html / build_comparison_html.mjs: extend §8 with a new
  "v6 hardening — audit fixes + engine unit tests" subsection that
  spells out the post-v6 audit batch (Partial Record typing, exhaustive
  scoreFor, dropped LIMIT 1, concurrency failure semantics, stricter
  REQUIREMENT_ID_PATTERN), the new 85-spec engine test suite (including
  the runtime catalog↔schema sync invariant that replaces the suppressed
  compile-time anchor), and the new --combined-run flag for one-shot
  v6 report regeneration from a single results.json.

- build_comparison_html.mjs: flatten six pre-existing nested ternaries
  (the §4 multi-runs-vs-live-vs-fallback chain becomes an IIFE with
  if/else; banner-class / banner-cls / gap-advice / mean-row cls all
  become let-block assignments) — no behaviour changes, the script
  smoke-runs end-to-end with --combined-run and produces a valid 574-line
  HTML output with all 11 §-headings intact.

- pci_autonomous_requirements.ts: drop the lone `continue` in
  resolveAutonomousRequirementIds by inverting the guard into a
  positive-branch `if (canonical && canonical !== 'all') { ... }`.
  All 46 requirements specs still pass.
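The flattening pattern can be illustrated on a hypothetical banner-class chain (names invented, not the script's actual variables):

```typescript
// Before (the no-nested-ternary lint-flagged shape):
//   const cls = live ? 'banner-live' : fallback ? 'banner-fallback' : 'banner-static';
// After — a let-block assignment with explicit branches:
function bannerClass(live: boolean, fallback: boolean): string {
  let cls: string;
  if (live) {
    cls = 'banner-live';
  } else if (fallback) {
    cls = 'banner-fallback';
  } else {
    cls = 'banner-static';
  }
  return cls;
}
```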

Net result: both files lint clean (0 errors, 0 warnings). The 7
pre-existing lints sitting inside the audit-batch diff zone — 1
no-continue and 6 no-nested-ternary — are gone.

…rift test

Addresses two follow-up findings on PR elastic#268798:

#2 — Lockdown test (pci_autonomous_modules_no_handwritten_imports.test.ts):
broaden the import deny-list to cover the full hand-written PCI surface,
not just the three engine modules. Now blocks:

  - pci_compliance_tool
  - pci_compliance_evaluator
  - pci_compliance_requirements
  - pci_compliance_schemas
  - pci_field_mapper_tool
  - pci_scope_discovery_tool
  - anything under skills/pci_compliance/**

The previous deny-list only covered the engine trio, which left a silent
re-coupling path: a future contributor could import the hand-written
orchestrator tool or scope-discovery helper and pass CI. The deep-autonomy
guarantee in comparison.html §1.5 is broader than the engine — it covers
every hand-written surface — so the lockdown should match.

#4 — New comparison_html.test.ts: structural snapshot for the committed
report. Asserts that the 11 §-level sections appear (in expected order)
and the v6 hardening / deep-autonomy h3 subsections are present. Catches
the two drift directions between comparison.html and
scripts/build_comparison_html.mjs:

  1. someone edits the HTML directly and forgets to update the template;
  2. someone edits the template and forgets to regenerate + commit.

Deliberately not byte-for-byte equality — the rendered HTML legitimately
changes with each eval refresh and we don't want CI noise on prose tweaks.
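The ordered-headings assertion can be sketched as a cursor walk over the rendered HTML. Heading strings below are illustrative, not the report's actual §-titles:

```typescript
// True iff every heading appears in `html`, in the given order.
// Deliberately not byte-for-byte: prose between headings may change.
function sectionsAppearInOrder(html: string, headings: string[]): boolean {
  let cursor = 0;
  for (const heading of headings) {
    const at = html.indexOf(heading, cursor);
    if (at === -1) return false;
    cursor = at + heading.length;
  }
  return true;
}
```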
Address the 15 findings from the autonomous PCI deep-analysis audit
covering the engine modules, the four agent-facing tools, and the
skill prompt.

Blockers
- Scope-discovery tool now returns a `discoveryClaim` (point-in-time
  snapshot) instead of a mis-shaped `scopeClaim`, surfaces ES errors
  as structured `dataGaps`, and validates `cat.indices` responses
  with a zod schema before walking them.
- Requirements catalog: dropped the unused `requiredCategories[]` field
  and the orphan `requirementCategory()` helper. Removed `NOT_APPLICABLE`
  from `AutonomousComplianceStatus` — it was carried in the score table
  but never produced by any evaluator path.
- Scorecard report no longer tags its synthesised executive roll-up as
  `ToolResultType.esqlResults` (the payload is not an ESQL row set);
  it now lands under `ToolResultType.other` so downstream UX/telemetry
  that special-cases `esqlResults` does not mis-render it.

Important
- Skill prompt rewritten: workflow is now `discover → roll up → drill
  down`. The check and scorecard tools are explicitly designed to be
  used as a sequence and share one evaluator via the new
  `runAutonomousPciEvaluationPack` orchestration helper.
- Both tools now derive `overallStatus` from the same severity rollup
  (`rollupAutonomousOverallStatus`) and `overallConfidence` from the
  same confidence rollup (`rollupAutonomousConfidence`), eliminating
  the previous risk of disagreement.
- Field-mapper sensitive-field regex tightened: the previous bare
  `/token/i` over-matched (e.g. a benign `tokenizer` mapping field
  would have been flagged as sensitive). Replaced with anchored
  patterns for `card`, `pan`, `cvv`, `cvc`, `account.number`,
  `credit.card`, `ssn`, `secret`, `password`, `api.key`, and specific
  `*token` shapes.
- Added an `assertNever` exhaustiveness check on the
  `statusToHumanLabel` switch — adding a new status without updating
  the switch now fails at compile time, with a runtime throw as a
  backstop.
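The regex tightening can be illustrated with a contrast between a bare and an anchored pattern. Both patterns below are invented for illustration — the production patterns may differ:

```typescript
// Bare pattern: matches `token` anywhere, so benign mapping fields
// like `analysis.tokenizer` get flagged as sensitive.
const BARE = /token/i;

// Anchored pattern: `token` must be its own path segment, optionally
// with a known prefix (access_/auth_/api_) and plural `s`.
const ANCHORED = /(^|[._])(auth_|access_|api_)?token(s)?($|[._])/i;
```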

Nice-to-haves
- Removed experiment-only metadata (gate scores, citation counts,
  architect attribution, brittle `comparison.html §1.5` cross-refs)
  from every runtime file. Authoring metadata stays beside the eval
  suite.
- "Recommended Remediation SLA" table in the skill prompt re-labelled
  as operational guidance — only the 30-day req 6.3.3 window is
  spec-sourced; the rest are heuristics a QSA would typically agree
  with but an org may tune.
- SAQ scope-reduction "70%" claim re-cast as the assessor-guidance
  heuristic range (50–80%), not a guarantee.
- `requirementCategory` tests removed; weak `['HIGH','MEDIUM']`
  evaluator assertion pinned to the exact value (`MEDIUM` via the
  coverage-stage no-violation-query path).
- New `buildAutonomousDiscoveryClaim` helper + 4-spec test block
  covering dedupe/sort, provenance pinning, point-in-time semantics,
  and stable shape across shuffled inputs.
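The stable-shape property exercised by those specs can be sketched as a dedupe-and-sort builder. Field names are hypothetical; the real builder also pins provenance metadata:

```typescript
// Hypothetical minimal discovery claim: a point-in-time snapshot of
// which indices were seen, with a stable shape across shuffled inputs.
interface DiscoveryClaim {
  indices: string[];
  capturedAt: string;
}

function buildDiscoveryClaim(indices: string[], capturedAt: string): DiscoveryClaim {
  return {
    // Dedupe then sort so shuffled/duplicated inputs yield an
    // identical claim object.
    indices: [...new Set(indices)].sort(),
    capturedAt,
  };
}
```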

Verification
- ESLint: 14 files, clean.
- Jest: 101/101 pass in `pci_autonomous_tools/` + the autonomous
  skill suite, 16/16 pass in `comparison_html.test.ts`.
- Scoped `tsc -b` against `security_solution/tsconfig.type_check.json`:
  green.