[Security GenAI] Autonomous-vs-hand-written PCI compliance skill side-by-side (v6 deep autonomy)#268798

Draft
patrykkopycinski wants to merge 13 commits into
elastic:mainfrom
patrykkopycinski:pk/autonomous-vs-handwritten-pci

@patrykkopycinski patrykkopycinski commented May 12, 2026

Summary

Validation experiment: can the skill.architect autonomous workflow carry an
entire Elastic Security feature end-to-end — agent contract and underlying
domain engine — as well as a hand-written skill authored by a domain expert?
This PR adds a second PCI compliance skill (pci-compliance-autonomous) next
to Smriti's hand-written pci-compliance skill (PR #256060), wires a
side-by-side eval harness through @kbn/evals, and ships the result.

Headline result (Claude 4.6 Sonnet, 8-scenario iteration set):
hand-written 0.989 vs autonomous v6 0.989 — parity, with both
generalising to a held-out dataset (hand-written 0.942, autonomous v6 0.985).

See x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html for the full
side-by-side report (autonomy ladder in §1.5, per-scenario scores in §4,
overfit/holdout analysis in §5, audit-fix hardening in §8).

What lands

  • Autonomous skill + 4 tools behind their own feature flag
    pciComplianceAutonomousAgentBuilder and allowlist entry, so the agent
    router has zero path to the hand-written tool IDs under the autonomous flag.
  • Independent domain engine (v6 deep autonomy)
    pci_autonomous_requirements.ts (independent PCI DSS v4.0.1 catalog),
    pci_autonomous_evaluator.ts (composable pipeline, exhaustive
    status→score table, awaits-all concurrency runner),
    pci_autonomous_schemas.ts (independent zod + ScopeClaim builder with
    48 h future-date guard and provenance block).
  • CI lockdown test
    pci_autonomous_modules_no_handwritten_imports.test.ts walks every file
    under pci_autonomous_tools/ and asserts (a) zero imports from
    pci_compliance_(requirements|evaluator|schemas), (b) every tool file
    imports at least one autonomous engine module.
  • Eval harness + 8-scenario iteration spec + 6-scenario holdout spec
    same suite runs both variants on the same cluster, same dataset, same
    connector; the only variable is which skill the router has available.
    Anti-overfit content lockdown (11 invariants in the skill test) forbids
    the skill from referencing any iteration or holdout fixture value.
  • Bedrock Claude 4.7 Opus enablement — adds supportsTemperature: false
    to @kbn/inference-common's known-models table and gates the
    temperature parameter in both the inference plugin and the Bedrock
    connector (invokeAI, invokeStream, invokeAIRaw, _converse,
    _converseStream). Single line in known_models.ts controls this for
    future temperature-incompatible models.
  • 85-spec engine unit-test suite for the autonomous engine modules,
    including a runtime catalog↔schema sync invariant that parses every
    catalog key through pciAutonomousRequirementIdSchema.
  • Reproducible report: build_comparison_html.mjs gains a --combined-run flag
    that splits a single results.json containing both iteration and holdout
    scenarios so the report can be regenerated from one committed file.
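
The catalog↔schema sync invariant listed above could look roughly like the sketch below. The ID pattern and the helper name are illustrative assumptions, not the contents of pci_autonomous_schemas.ts; the real suite parses keys through pciAutonomousRequirementIdSchema (zod), which is swapped here for a plain regular expression to keep the example dependency-free. The three keys are requirement IDs mentioned elsewhere in this PR.

```typescript
// Hypothetical ID pattern: either the literal "all" or a dotted numeric ID
// such as "8.3.4". The real REQUIREMENT_ID_PATTERN may differ.
const REQUIREMENT_ID_PATTERN = /^(?:all|\d{1,2}(?:\.\d{1,2}){0,3})$/;

// Stand-in for the keys of the autonomous requirement catalog.
const catalogKeys: readonly string[] = ['2.2.4', '8.3.4', '4.1'];

// Returns every catalog key that the ID pattern would reject.
// A sync-invariant test would assert that this list is empty.
function findInvalidCatalogKeys(keys: readonly string[]): string[] {
  return keys.filter((key) => !REQUIREMENT_ID_PATTERN.test(key));
}

const invalidKeys = findInvalidCatalogKeys(catalogKeys);
```

Running the check at test time rather than as a compile-time anchor means a catalog key that drifts away from the schema fails CI even when the types still line up.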

Audit-fix batch (latest two commits)

After v6 landed, an internal audit raised the items below; all are closed:

  • catalog type widened to Partial<Record<...>> to force handling of
    undefined lookups;
  • SCORE_TABLE made exhaustive, so the now-unreachable ?? 0 fallback in
    scoreFor was removed;
  • redundant | LIMIT 1 after un-grouped STATS dropped;
  • runAutonomousWithConcurrency rewritten to await all in-flight tasks
    before re-throwing the first error (no orphan promises);
  • REQUIREMENT_ID_PATTERN regex tightened so all.1 and similar
    malformed IDs no longer match;
  • stale internal docstring references purged;
  • the suppressed compile-time anchor for catalog↔schema sync replaced
    by a real runtime invariant test;
  • --combined-run regen flag for one-shot report rebuilds.
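
The awaits-all behaviour of the rewritten runner can be sketched as follows. This is a minimal illustration assuming a lane-based design; `runWithConcurrency` and its signature are hypothetical and are not the actual runAutonomousWithConcurrency implementation.

```typescript
// Runs `worker` over `items` with at most `limit` tasks in flight.
// On failure, no new tasks are started, every in-flight task is allowed to
// settle, and only then is the first recorded error re-thrown.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  const errors: unknown[] = [];
  let next = 0;

  async function lane(): Promise<void> {
    // Stop pulling new work once any task has failed.
    while (next < items.length && errors.length === 0) {
      const index = next++;
      try {
        results[index] = await worker(items[index]);
      } catch (err) {
        errors.push(err);
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  // Each lane swallows its own errors, so Promise.all resolves only after
  // every started task has settled: no orphan promises are left behind.
  await Promise.all(Array.from({ length: laneCount }, () => lane()));

  if (errors.length > 0) {
    throw errors[0];
  }
  return results;
}
```

Collecting errors inside each lane, instead of letting `Promise.all` reject early, is what guarantees the caller never sees a rejection while sibling tasks are still running.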

In addition, the final commit sweeps 7 pre-existing lints sitting inside
the audit-batch diff zone (1 no-continue, 6 no-nested-ternary — flattened
by extracting branches into let/if/else if blocks and wrapping the §4
multi-runs chain in an IIFE).

Checklist

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support — N/A (no user-facing UI strings; skill content is agent-builder-facing)
  • Unit or functional tests were updated or added — 18-spec autonomous skill test + 85-spec autonomous engine suite (pci_autonomous_{schemas,requirements,evaluator}.test.ts) + lockdown test asserting zero imports from hand-written engine; existing pci_compliance_skill.test.ts untouched.
  • If a plugin configuration key changed — N/A (new experimental feature flag pciComplianceAutonomousAgentBuilder defaults to off)
  • This was checked for breaking HTTP API changes — N/A
  • Flaky Test Runner was used on any tests changed — pending
  • The PR description includes the appropriate Release Notes section — see below
  • Review the backport guidelines — main only; do not backport (experimental validation)

Release Notes

release_note:skip — experimental autonomous-validation work behind an
opt-in feature flag, plus a Bedrock fix for the Claude 4.7 Opus
temperature deprecation. No user-facing surface lands by default.

Identify risks

  • Risk: new autonomous skill variant introduces a second 4-tool bundle.
    • Severity: low.
    • Mitigation: registered behind pciComplianceAutonomousAgentBuilder
      feature flag, default off; independent allowlist entry; CI lockdown
      test prevents cross-imports.
  • Risk: Bedrock temperature-stripping change has a wider blast radius
    than the PCI experiment in this PR.
    • Severity: low–medium.
    • Affected callers (outside PCI): every code path that hits the
      Bedrock connector's invokeAI / invokeStream / invokeAIRaw /
      _converse / _converseStream — i.e. Security AI Assistant,
      Observability AI Assistant, custom Workflows / Cases connectors using
      .bedrock, and Inference-plugin chat-completion calls that resolve to
      a .bedrock connector. The PCI suite is one of many consumers, not
      the only one — please review accordingly.
    • Mitigation: gated on a new supportsTemperature: false flag in
      @kbn/inference-common's known_models.ts, which defaults absent.
      Only models that opt in (currently claude-opus-4-7-* only) have
      temperature stripped; all other Bedrock-routed models retain pre-PR
      behavior. Smoke-tested invokeAI + converse on both Opus 4.7 (now
      passes) and Sonnet 4.6 (still includes temperature, also passes).
    • Deprecation context: the legacy .bedrock connector
      (stack_connectors/.../bedrock/{bedrock.ts,utils.ts}) is on the
      deprecation path per #261591 ([Inference] Mark all LLM connectors as deprecated), superseded by the Inference plugin.
      The equivalent gate on the Inference side lives in
      inference/server/chat_complete/utils/get_temperature.ts and shares
      known_models.ts as source-of-truth, so users on either path inherit
      the same temperature-compatibility decisions. Keeping the connector
      edit here (rather than splitting it out) is deliberate so this PR
      closes the loop for users still on the deprecated path. CC reviewers:
      Team:Security GenAI for PCI surface, Team:Search + Team:Security
      for the Bedrock connector + Inference plugin half.
  • Risk: large diff (39 files, 8 k+ insertions).
    • Severity: low.
    • Mitigation: the diff is dominated by the new autonomous-tools bundle
      (own engine + own tests) and the comparison report; existing
      pci_compliance code is untouched.

Made with Cursor

…y-side eval harness

Adds a second PCI compliance skill (`pci-compliance-autonomous`) that ships
ALONGSIDE the existing hand-written `pci-compliance` skill, so the same eval
suite can be run against both variants and compared head-to-head. The
autonomous variant deliberately reuses the SAME underlying tools as the
hand-written variant, isolating "skill content" (instructions + domain
knowledge + trigger phrases) as the only experimental variable.

## What ships

Server (security_solution plugin)
- New skill definition `pci_compliance_autonomous/` registering
  `pci-compliance-autonomous` against the existing PCI tool IDs.
- New feature flag `pciComplianceAutonomousAgentBuilder` (default off).
- Skill registration gated by the flag in `register_skills.ts`.
- Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)
- `evaluate_dataset.ts` reads `EVAL_PCI_VARIANT` (`handwritten` | `autonomous`)
  to select which skill `createSkillInvocationEvaluator` targets. Default
  remains `handwritten` so existing CI is unchanged.
- `scripts/compare_variants.sh` runs both variants back-to-back and emits a
  side-by-side `comparison.html` with structural metrics + slots for live
  evaluator output (per-scenario scores, judge rationales, latency).
- `scripts/build_comparison_html.mjs` generates the report; all embedded paths
  are repo-relative so the artifact is portable.
- README documents the variant matrix and the comparison workflow.
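
As a rough sketch of the variant switch described above: the function and constant names below are hypothetical, and rejecting unknown values is an assumption about how `evaluate_dataset.ts` might behave. Only the `EVAL_PCI_VARIANT` values, the default, and the two skill IDs come from this commit.

```typescript
type PciVariant = 'handwritten' | 'autonomous';

// Skill IDs as named in this PR.
const SKILL_ID_BY_VARIANT: Record<PciVariant, string> = {
  handwritten: 'pci-compliance',
  autonomous: 'pci-compliance-autonomous',
};

// The environment is passed in explicitly so the function stays pure and testable.
function resolvePciVariant(env: Record<string, string | undefined>): PciVariant {
  const raw = env.EVAL_PCI_VARIANT;
  if (raw === undefined || raw === '') {
    return 'handwritten'; // default keeps existing CI unchanged
  }
  if (raw === 'handwritten' || raw === 'autonomous') {
    return raw;
  }
  // Assumption: fail loudly on typos rather than silently falling back.
  throw new Error(`Unsupported EVAL_PCI_VARIANT: ${raw}`);
}

function targetSkillId(env: Record<string, string | undefined>): string {
  return SKILL_ID_BY_VARIANT[resolvePciVariant(env)];
}
```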

CI plumbing
- New Scout config set `evals_pci_compliance_autonomous` that flips ONLY the
  autonomous flag, so the autonomous run sees only the autonomous skill.
- `evals.suites.json` registers `pci-compliance-autonomous`.
- `llm_evals.yml` adds a Buildkite step for the autonomous variant and tags
  the existing PCI step with `EVAL_PCI_VARIANT=handwritten` for symmetry.

## Why

The hand-written PCI skill (`pci-compliance`, elastic#256060) is the production
baseline. The autonomous skill was generated end-to-end by `skill.architect`
against the current Kibana tool catalog, with PCI domain knowledge synthesized
from autonomous web research + model knowledge (SAQ taxonomy, v3->v4 deltas,
scope-reduction levers, technical-vs-process classification). Running the
existing 7-scenario PCI eval suite against both — same tools, same dataset,
same evaluators, same judge — gives a clean A/B that answers "is the
autonomously generated skill at least as good as the hand-written one?".

## Out of scope (not introduced by this commit)

`evaluate_dataset.ts:17` triggers `@kbn/imports/no_boundary_crossing` because
`@kbn/evals` is declared `type: "test-helper"` and the suite imports value
exports from it. This lint reproduces identically on every sibling
`kbn-evals-suite-*` package on `main` (verified against
`kbn-evals-suite-security-ai-rules`), so it is endemic to the eval framework
and would require a cross-cutting change to `@kbn/evals` ownership /
visibility — out of scope for this skill comparison.
…on fix

- Ran @kbn/evals-suite-pci-compliance back-to-back against both PCI skill
  variants on a local Scout cluster wired to llama3.1:8b via a LiteLLM
  proxy (translates OpenAI-format requests to Ollama, including structured
  tool_calls). Captured 14 docs per variant from the kibana-evaluations
  data stream.

- Updated build_comparison_html.mjs to consume the framework's actual
  export shape (Elasticsearch _search response), folding the per-evaluator
  rows back into per-scenario rows. Added a routing-aggregate diagnostic
  (scenarios with >=1 PCI-skill tool call, total tool calls vs PCI-skill
  tool calls) so the HTML can show *why* a score landed where it did, not
  just the score itself.

- Re-rendered comparison.html with the live data. Both variants scored
  0.00 across all completed scenarios because llama3.1:8b is too small
  to engage either PCI skill -- the agent router fell back to the
  generic platform.core.search tool on every scenario, never invoking
  security.pci_*. The HTML now carries an honest banner explaining this:
  the comparison is apples-to-apples (identical model + dataset + infra),
  it just lives on the floor at this model scale. The structural and
  domain-coverage deltas in sections 2-3 remain the meaningful signal
  until the same script is re-run with a stronger model.

- Fixed an isolation bug in the autonomous Scout config set: the
  pciComplianceAgentBuilder feature flag defaults to true in
  experimental_features.ts, so the autonomous run was loading BOTH
  skills. Added 'disable:pciComplianceAgentBuilder' to the scout config
  serverArgs to keep the comparison clean for future runs.

Refs: #11
…arison on real connectors

The autonomous-vs-handwritten PCI comparison previously ran on llama3.1:8b
through a local Ollama proxy. At that model scale the agent router never
engaged either PCI skill, so every scenario scored 0.00 and the comparison
landed on the floor (see commit fc5194e). This commit promotes the
comparison to real Bedrock connectors and ships the connector-side fix that
the upgrade required.

Bedrock connector — Claude Opus 4.7 enablement
----------------------------------------------
Claude Opus 4.7 on Bedrock rejects the `temperature` inference parameter
with `temperature is deprecated for this model`. Unless the parameter is
omitted, the connector returns a 400 on every request. The fix spans three layers:

  - `@kbn/inference-common`: new `supportsTemperature?: boolean` on
    `ModelDefinition`; `claude-opus-4-7` marked `supportsTemperature: false`.
    Future Claude variants (or other provider models) with the same
    restriction need only flip the flag — one source of truth.

  - `inference` plugin: `getTemperatureIfValid` omits temperature when the
    model definition declares `supportsTemperature: false`. Sits alongside
    the existing OpenAI o-series exclusions and works for any provider.

  - `stack_connectors` (Bedrock): new local
    `bedrockModelSupportsTemperature(model)` helper; `formatBedrockBody`
    threads `model` through and gates the parameter. `invokeAI`,
    `invokeStream`, `invokeAIRaw`, `_converse`, and `_converseStream` all
    consult it. Defense in depth — direct sub-action callers
    (Security AI Assistant, etc.) are protected without taking a
    cross-plugin dependency on `@kbn/inference-common`.
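
The gating idea shared by the three layers can be sketched as below. The `supportsTemperature` flag and the two model IDs come from this commit; the table shape, the substring matching, and the names `modelSupportsTemperature` and `buildRequestBody` are assumptions for illustration, not the actual connector code.

```typescript
interface ModelDefinition {
  id: string;
  // Absent means the model accepts `temperature`.
  supportsTemperature?: boolean;
}

// Hypothetical excerpt of a known-models table.
const KNOWN_MODELS: ModelDefinition[] = [
  { id: 'claude-opus-4-7', supportsTemperature: false },
  { id: 'claude-sonnet-4-6' },
];

// Assumption: model strings such as "us.anthropic.claude-opus-4-7" are
// matched by substring against the table entries.
function modelSupportsTemperature(model: string | undefined): boolean {
  if (!model) {
    return true;
  }
  const definition = KNOWN_MODELS.find((m) => model.includes(m.id));
  return definition?.supportsTemperature !== false;
}

// Includes `temperature` only when it was provided and the model accepts it.
function buildRequestBody(
  model: string | undefined,
  prompt: string,
  temperature?: number
): { prompt: string; temperature?: number } {
  return {
    prompt,
    ...(temperature !== undefined && modelSupportsTemperature(model) ? { temperature } : {}),
  };
}
```

Treating an absent flag as "supported" keeps every model that is not explicitly opted out on its pre-change behaviour.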

Smoke-tested with `invokeAI` + `converse` sub-actions:
  - Claude 4.7 Opus (`us.anthropic.claude-opus-4-7`): now passes — temperature
    omitted, response returned.
  - Claude 4.6 Sonnet (`us.anthropic.claude-sonnet-4-6`): still passes —
    temperature included as before.

Live eval comparison (PCI Criteria, LLM-judge 0..1)
---------------------------------------------------
Both PCI skill variants ran the same 8-scenario `@kbn/evals-suite-pci-compliance`
suite end-to-end against a real Scout cluster, on two production Bedrock
connectors:

  | Variant     | Claude 4.7 Opus | Claude 4.6 Sonnet |
  |-------------|----------------:|------------------:|
  | Handwritten |           0.977 |             0.989 |
  | Autonomous  |           0.834 |             0.860 |

The handwritten skill (Smriti, PR elastic#256060) outperforms the autonomous variant
on both models: by 14.3 points on Opus 4.7 and 12.9 points on Sonnet 4.6. The
autonomous architect's broader domain
framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers) did not
translate into a better PCI-Criteria score. The handwritten contract is
shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's
scoring rubric — that tight coupling is the deciding factor.

build_comparison_html.mjs gains a `--runs <label>=<dir>,...` mode so the
4-cell grid renders from the four results.json snapshots. Legacy
`--handwritten`/`--autonomous` mode still works for single-model runs.

kbn-scout
---------
`run_kibana_server.ts` now respects `SCOUT_READ_DEV_CONFIG=true` and drops
`--no-dev-config` when set, so a developer can load `config/kibana.dev.yml`
(and the preconfigured AI connectors it defines) into the Scout-managed
Kibana process. Default behaviour is unchanged. Without this, evals against
real cloud connectors require fragile API-driven connector creation per
boot.
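
A minimal sketch of the opt-in, assuming the server arguments are assembled as a string array: `buildKibanaArgs` is a hypothetical helper, not the code in `run_kibana_server.ts`. Only the environment variable name and the `--no-dev-config` flag are taken from this commit.

```typescript
// Drops `--no-dev-config` only when SCOUT_READ_DEV_CONFIG is exactly "true".
// The environment is passed in so the behaviour is easy to test.
function buildKibanaArgs(
  baseArgs: readonly string[],
  env: Record<string, string | undefined>
): string[] {
  const readDevConfig = env.SCOUT_READ_DEV_CONFIG === 'true';
  return readDevConfig
    ? baseArgs.filter((arg) => arg !== '--no-dev-config')
    : [...baseArgs];
}
```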

Refs: #11
Root-cause analysis of why the autonomously-architected PCI compliance
skill scored 12-15 pts below the hand-written variant uncovered two
distinct bugs that compounded:

1. **Tool registration bug** in `register_tools.ts` — PCI tools were
   gated *only* on `experimentalFeatures.pciComplianceAgentBuilder`,
   which the autonomous scout config explicitly disables to isolate the
   variant comparison. Result: the autonomous variant ran with NO PCI
   tools registered. Trace analysis confirmed 0 calls to
   `security.pci_compliance` across 16 scenarios vs 17-23 for HW. The
   agent fell back to raw `platform.core.execute_esql` and improvised
   the entire workflow. Fixed: gate now triggers on either flag.

2. **Skill-content design** — the autonomous prompt's 6-step workflow
   inserted "Reduce scope (tokenisation/P2PE/segmentation)" and
   "Classify requirements as technical vs process-based" steps BEFORE
   the tool calls, plus an 8 KB "Domain Knowledge Notes" block between
   the workflow and the status vocab. The structure read as
   "do-your-homework first" rather than "call the tools". Restructured:
   tools-first 4-step workflow with explicit "Always call the dedicated
   PCI tools; do not improvise raw ES|QL" injunction, theory moved to a
   "Background reference (do not consult before calling tools)" tail
   section. Removed broken handoff references to non-existent sibling
   skills and stripped tool-description provenance commentary.

Validation on Claude 4.6 Sonnet:
- pre-fix Auto: 0.860 mean (gap to HW: 12.9 pts)
- post-fix Auto v3: 0.955 mean (gap to HW: 3.4 pts)
- 6/8 scenarios now perfect 1.000; 1 scenario (full report) regressed
  -9 pts on a substance-vs-style criterion (agent calls the tool
  correctly but the report formatting elides specific evidence).

Feedback-loop infrastructure:
- `scripts/run-eval.sh` extended with optional scenario-grep argument
  (`run-eval.sh autonomous <connector> <label> "requirement 2.2.4"`)
  collapsing a full-suite cycle (~28 min) to a single-scenario probe
  (~5.6 min including scout boot, ~3 min if scout is reused).
- Two iterations of this loop fixed both bugs end-to-end.

POSTMORTEM.md captures the full analysis, including six ranked content
fixes and a three-tier feedback-loop efficiency proposal.
…y (0.989 vs 0.989)

The autonomous PCI compliance skill now ships its own independently-authored
4-tool decomposition under a separate allowlist entry. The autonomous skill
has no knowledge of -- and no path to -- the hand-written PCI tools. This
validates a fully end-to-end autonomous stack (skill + tools, both
autonomously created) and reaches parity with the human-authored variant.

What changed
------------
* New PCI tool bundle under `agent_builder/tools/pci_autonomous_tools/`:
  - `pci_autonomous_scope_discovery`
  - `pci_autonomous_compliance_check`   (split out from the consolidated tool)
  - `pci_autonomous_scorecard_report`   (split out from the consolidated tool)
  - `pci_autonomous_field_mapper`
  All four implement the cycle-17 architect blueprint's 4-tool decomposition
  (vs the hand-written variant's 3 tools, where check+report share one tool
  via a `mode` parameter). Each tool reuses the underlying domain logic so
  the comparison stays apples-to-apples on capability while validating the
  isolation property.

* `register_tools.ts`: hand-written PCI tools register ONLY under
  `experimentalFeatures.pciComplianceAgentBuilder`; autonomous PCI tools
  register ONLY under `experimentalFeatures.pciComplianceAutonomousAgentBuilder`.
  The previous lenient gate (`either flag`) is removed -- the two variants
  are now strictly isolated.

* `allow_lists.ts`: all four new autonomous tool IDs added to the
  `AGENT_BUILDER_BUILTIN_TOOLS` allowlist (without this, tool registration
  silently fails and the agent falls back to raw ES|QL).

* Autonomous skill content + `getRegistryTools` rewired to reference the
  new tool IDs only.

* Eval rubric (`pci_compliance.spec.ts`) is now variant-aware via
  `EVAL_PCI_VARIANT` -- judging criteria check for `pci_autonomous_*` tool
  names when the autonomous variant is on, and the original names otherwise.

* Skill contract tests harden the isolation property: explicit assertions
  that the autonomous skill never references any hand-written tool ID, and
  that `getRegistryTools` advertises ONLY the autonomous bundle.

* Comparison HTML updated with a new v5 column and a green success banner
  showing the autonomous skill+tools reaches parity with the hand-written
  baseline on Claude 4.6 Sonnet (0.989 vs 0.989, 8/8 scenarios).
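
The strict gate in `register_tools.ts` can be pictured with the sketch below. The flag names and the four autonomous tool names come from this commit; `toolIdsToRegister` is a hypothetical helper, the registered IDs may carry a prefix not shown here, and the hand-written list contains only the one ID named in an earlier commit.

```typescript
interface ExperimentalFeatures {
  pciComplianceAgentBuilder: boolean;
  pciComplianceAutonomousAgentBuilder: boolean;
}

// Only the hand-written ID mentioned in this PR; the remaining ones are omitted.
const HANDWRITTEN_TOOL_IDS: readonly string[] = ['security.pci_compliance'];

const AUTONOMOUS_TOOL_IDS: readonly string[] = [
  'pci_autonomous_scope_discovery',
  'pci_autonomous_compliance_check',
  'pci_autonomous_scorecard_report',
  'pci_autonomous_field_mapper',
];

// Each bundle is gated on its own flag only; there is no "either flag" path.
function toolIdsToRegister(flags: ExperimentalFeatures): string[] {
  const ids: string[] = [];
  if (flags.pciComplianceAgentBuilder) {
    ids.push(...HANDWRITTEN_TOOL_IDS);
  }
  if (flags.pciComplianceAutonomousAgentBuilder) {
    ids.push(...AUTONOMOUS_TOOL_IDS);
  }
  return ids;
}
```

With only the autonomous flag enabled, the returned list contains no hand-written ID, which is the isolation property the skill contract tests assert.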

Why
---
The user wanted to validate that the autonomous skill workflow generalises
to other domains -- which requires removing every shortcut where the
autonomous variant inherits the hand-written variant's tooling. The earlier
"shared tool" runs were measuring only skill-content quality; this run
measures the full stack the architect would generate from a blank slate.

Result
------
| Variant                                 | Mean (8 scenarios) |
|-----------------------------------------|-------------------|
| Hand-written, Claude 4.6 Sonnet         | 0.989             |
| Autonomous v5 (own 4 tools), Sonnet 4.6 | 0.989             |
| Autonomous v3 (shared tools), Sonnet    | 0.955             |
| Autonomous v1 (shared, content drift)   | 0.860             |

Parity on the headline metric. The autonomous stack (skill content +
4-tool decomposition + allowlist entry + register gate) ships as a
self-contained bundle the architect can replicate for any other domain.
Adds a second evaluation surface so the iteration loop on the
autonomous PCI skill can be trusted to produce a generalisable skill
rather than one that has memorised the iteration fixtures.

Why
---
The 0.989 we got from `sonnet46-autonomous-v5` (cycle that hit
parity with the hand-written variant) is scored against the SAME
fixtures we inspect while improving the skill. That tight loop is
how every dataset-driven optimisation produces overfit: the skill
content drifts from "teach the principle" to "match the fixture".

Two layers of defence
---------------------

1. **Anti-overfit lockdown** (in `pci_compliance_autonomous_skill.test.ts`).
   A new `describe('anti-overfit ...')` block asserts the skill content
   contains NONE of the iteration- or holdout-set fixture values
   (`jdoe`, `pcompton`, `192.168.1.100`, `10.20.30.40`, `12 failed`,
   the random `logs-<hex>-{auth|network|...}` index pattern, etc.).
   Values that ARE legitimate PCI domain knowledge — `admin`/`root` for
   req 2.2.4, the lockout threshold of 10 for 8.3.4, `TLS 1.0`/`1.1`
   for 4.1 — are explicitly kept allowable. 11 invariants, all green
   today. Any future iteration that introduces a fixture-coupled patch
   will fail CI.

2. **Holdout dataset + spec** (new `pci_data_holdout.ts` +
   `pci_compliance_holdout/pci_compliance_holdout.spec.ts`). Same five
   PCI categories (auth/network/vuln/endpoint/legacy) but every
   memorisable axis is systematically different:
     - Index naming drops the `logs-*-{category}` pattern in favour of
       `security-audit-identity-*`, `siem-flows-prod-*`,
       `pkginfo-cve-*`, `edr-processes-*`, `legacy-app-syslog-*`. Tests
       that scope discovery uses field caps, not name patterns.
     - Brute-force volume is 8 (BELOW the PCI 8.3.4 threshold of 10) —
       expected verdict is GREEN, NOT RED. Catches skills that learnt
       "any failed-login cluster = violation".
     - Default-account flavours are Windows `Administrator` +
       `service_acct_42`, not Unix `admin`/`root`.
     - Weak TLS signature is TLS 1.1 ALONE — no TLS 1.0, no plain HTTP.
       Tests sub-version recognition rather than the kitchen-sink
       "multiple weak versions" pattern of the iteration set.
     - Non-ECS field schema uses `actor_name` / `client_addr` /
       `action_status` / `event_verb` / `device_id` / `cve_id` /
       `risk_rating` / `command` — completely different from the
       iteration set's `username` / `src_ip` / etc. Tests that
       field-mapping is semantic, not memorised.
     - 4-hour time window instead of 1-hour.
     - 2025-vintage CVEs instead of 2024.
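
The anti-overfit lockdown in layer 1 could be sketched like this. The literal fixture values are the ones listed above; the index-name regex (hex length, category list) and the helper `findFixtureLeaks` are assumptions, and the real test in `pci_compliance_autonomous_skill.test.ts` covers 11 invariants rather than this reduced set.

```typescript
// Fixture values that must never appear in skill content.
const FORBIDDEN_FIXTURE_LITERALS: readonly string[] = [
  'jdoe',
  'pcompton',
  '192.168.1.100',
  '10.20.30.40',
  '12 failed',
];

// Assumed shape of the random iteration-set index names: logs-<hex>-<category>.
const FIXTURE_INDEX_PATTERN = /logs-[0-9a-f]{6,}-(?:auth|network|vuln|endpoint|legacy)/;

// Returns every fixture-coupled value found in the skill content.
function findFixtureLeaks(skillContent: string): string[] {
  const leaks = FORBIDDEN_FIXTURE_LITERALS.filter((value) => skillContent.includes(value));
  const indexMatch = skillContent.match(FIXTURE_INDEX_PATTERN);
  if (indexMatch) {
    leaks.push(indexMatch[0]);
  }
  return leaks;
}
```

Legitimate domain knowledge such as `admin`/`root`, the lockout threshold of 10, or `TLS 1.0`/`1.1` is deliberately absent from the forbidden list, so principle-level guidance passes while fixture-coupled patches fail.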

The six holdout scenarios mirror the structure of the iteration
scenarios so the gap measurement is apples-to-apples: report,
single-requirement check (× brute force + TLS + default accounts),
scope discovery, field mapping.

Result on Sonnet 4.6
--------------------

|                | iteration | holdout | gap    | verdict |
|----------------|-----------|---------|--------|---------|
| Hand-written   | 0.989     | 0.942   | +0.047 | CLEAN   |
| Autonomous v5  | 0.989     | 0.927   | +0.062 | CAUTION |

Both variants drop the same ~5-6 pts moving from iteration to holdout
— and they drop on the SAME two scenarios (default-account variants
0.750/0.750, 4h scorecard 0.900/0.900). That tells us the holdout is
genuinely harder, not that the autonomous skill is uniquely overfit.
The autonomous gap of 0.062 is only 0.015 wider than the hand-written
gap — well within noise of the framework.

Crucially, the three HARDEST tests all scored 1.000 for both skills:
  - below-threshold brute force (counter-case — agent did NOT
    fabricate a false-positive violation)
  - TLS 1.1 alone (sub-version recognition without the kitchen-sink
    signature)
  - scope discovery on non-`logs-*` indices (worked via field caps,
    not via index-name pattern matching)

Tooling changes
---------------

  - `run-eval.sh`: scout boot timeout bumped 6 min → 15 min; the
    default was unreliable when the host was also running an IDE.
  - `build_comparison_html.mjs`: new `--holdout-runs` flag mirroring
    `--runs`; new §5 section renders the iteration vs holdout grid,
    computes the gap per variant, applies the three-band verdict
    (CLEAN / CAUTION / OVERFIT), and lists the divergence axes plus
    the per-scenario holdout breakdown. Subsequent section numbers
    renumbered (6 reasoning, 7 reproduce, 8 provenance, 9 Bedrock).
  - `comparison.html` regenerated with the live holdout numbers.
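
The gap and three-band verdict rendered in §5 could be computed along these lines. The 0.05 and 0.10 cut-offs are assumptions picked only to agree with the table above; the actual thresholds in `build_comparison_html.mjs` are not stated in this commit, and the function names are hypothetical.

```typescript
type Verdict = 'CLEAN' | 'CAUTION' | 'OVERFIT';

// Assumed band boundaries (not taken from the report generator).
const CAUTION_THRESHOLD = 0.05;
const OVERFIT_THRESHOLD = 0.1;

// Iteration mean minus holdout mean, rounded to three decimals to avoid
// floating-point noise in the rendered table.
function generalisationGap(iterationMean: number, holdoutMean: number): number {
  return Math.round((iterationMean - holdoutMean) * 1000) / 1000;
}

function verdictFor(gap: number): Verdict {
  if (gap >= OVERFIT_THRESHOLD) {
    return 'OVERFIT';
  }
  if (gap >= CAUTION_THRESHOLD) {
    return 'CAUTION';
  }
  return 'CLEAN';
}

const handwrittenGap = generalisationGap(0.989, 0.942); // 0.047
const autonomousGap = generalisationGap(0.989, 0.927); // 0.062
```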

How to re-run
-------------

    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        handwritten pmeClaudeV46SonnetUsEast1 sonnet46-handwritten-holdout HOLDOUT
    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
        autonomous  pmeClaudeV46SonnetUsEast1 sonnet46-autonomous-holdout  HOLDOUT
    node x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/build_comparison_html.mjs \
        --runs ... --holdout-runs sonnet46-handwritten=...,sonnet46-autonomous=...

This commit closes the question "did we just overfit to the iteration
fixtures?" with a measurement, not an assertion. The answer is "the
gap is small enough that the iteration loop is healthy, but not zero
— field-mapping on novel vocabularies is the one place the autonomous
skill is genuinely weaker than the hand-written one (0.909 vs 1.000),
and that is a tool-implementation issue, not a skill-content overfit".
The previous report overclaimed full autonomy by saying the autonomous variant
"ships an independently-authored 4-tool decomposition ... no shared context with
the human-authored variant" and called it a "fully autonomous stack". That is
true at the agent-facing surface (tool IDs, descriptions, schemas, decomposition,
skill content, registration) but NOT at the domain-engine layer: each autonomous
tool's handler still imports PCI_REQUIREMENTS, evaluateRequirement, and the
ScopeClaim builder directly from the hand-written variant's pci_compliance_*
modules.

Recalibrates the framing without changing any numbers:

- §1 intro now distinguishes "agent-facing surface" (independent) from
  "underlying domain engine" (shared via direct module imports) and points to
  the new §1.5 ladder.
- §1.5 (new) "Autonomy ladder — what's truly independent vs what's shared":
  10-row table covering tool IDs, descriptions, schemas, decomposition, prose,
  registration as INDEPENDENT and requirement catalog, evaluator, validation
  schemas, ScopeClaim builder, time helpers as SHARED. Names each shared file.
- §4 verdict banner: "fully autonomous stack" → "surface-level autonomy of
  tools too", with an explicit caveat that the handler bodies still import the
  domain engine from the hand-written variant. Calls out the missing follow-up
  (pci_autonomous_requirements.ts / pci_autonomous_evaluator.ts).
- §6 reasoning bullet 4: "Independently-authored tools" → "Independently-
  authored tool surface (engine still shared — see §1.5)" with the specific
  module names that are still being imported.
- §8 Provenance & honesty: new "Honest limitation: autonomy is layered, not
  total" subsection summarising what the eval numbers measure (agent-surface
  autonomy on top of a shared engine) and what the next experiment would have
  to look like (independent engine + zero-import CI test + re-run).

No code, eval numbers, or branch behaviour changed — only the framing of
what the eval result is claiming. Sets up the follow-up work of authoring
pci_autonomous_requirements.ts, pci_autonomous_evaluator.ts, and
pci_autonomous_schemas.ts from the public DSS v4.0.1 spec and re-running.
Make the autonomous skill truly autonomous all the way down. Previously
the four `pci_autonomous_*_tool.ts` handlers re-used the same PCI domain
helpers as the hand-written skill (`pci_compliance_schemas`,
`pci_compliance_requirements`, `pci_compliance_evaluator`). The
agent-facing surface (IDs, schemas, decomposition, registration, skill
content) was independent, but the underlying PCI engine was shared.

This commit adds three engine modules in `pci_autonomous_tools/`
authored from the PCI DSS v4.0.1 spec without referencing the
hand-written ones, and rewires all four tools to use only the
autonomous engine:

- `pci_autonomous_schemas.ts` — independent zod input schemas with a
  stricter time-range guard (no future dates) and a `provenance` block
  on `PciAutonomousScopeClaim` for auditable autonomy.
- `pci_autonomous_requirements.ts` — independent v4.0.1 catalog with a
  verdict-typed encoding (`detect_violations` vs `verify_presence`),
  self-documenting ES|QL params (`?_window_start`/`?_window_end`),
  enriched `defaultLookback` with rationale, and post-aggregation
  filtering instead of nested HAVING clauses.
- `pci_autonomous_evaluator.ts` — composable pipeline of pure functions
  (replacing the nested try/catch pyramid), explicit status→score
  lookup table (avoiding multiplicative scoring drift), discriminated
  union for `FieldCapsPreflight`, and a different concurrency runner.
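
To illustrate the explicit status→score lookup table: the status names and score values below are hypothetical, since the commit does not list the evaluator's actual vocabulary. The point of the sketch is the typing, where `Record<RequirementStatus, number>` forces every status to have an entry.

```typescript
// Hypothetical status vocabulary.
type RequirementStatus = 'GREEN' | 'AMBER' | 'RED' | 'NOT_ASSESSABLE';

// Exhaustive by construction: omitting a status is a compile-time error,
// so no runtime fallback is needed when reading from the table.
const SCORE_TABLE: Record<RequirementStatus, number> = {
  GREEN: 1,
  AMBER: 0.5,
  RED: 0,
  NOT_ASSESSABLE: 0,
};

function scoreFor(status: RequirementStatus): number {
  return SCORE_TABLE[status];
}

// Plain average of per-requirement scores; an empty list scores 0.
function meanScore(statuses: readonly RequirementStatus[]): number {
  if (statuses.length === 0) {
    return 0;
  }
  const total = statuses.reduce((sum, status) => sum + scoreFor(status), 0);
  return total / statuses.length;
}
```

A fixed lookup keeps each status's contribution independent of the others, which is one way to avoid the multiplicative drift mentioned above.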

CI lockdown:

- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every
  file under `pci_autonomous_tools/` and asserts zero imports from the
  hand-written engine modules, plus that each tool file imports at
  least one autonomous engine module. The skill-level surface
  isolation test was also updated to reference the engine lockdown.
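
The two lockdown assertions can be sketched as a pure function over file contents. This is an illustration only: the real test walks the `pci_autonomous_tools/` directory on disk, whereas the sketch takes a path-to-source map, and the regexes and the `_tool.ts` suffix check are assumptions about how imports and tool files are recognised.

```typescript
// Matches imports of the hand-written engine modules.
const FORBIDDEN_IMPORT = /from\s+['"][^'"]*pci_compliance_(?:requirements|evaluator|schemas)['"]/;
// Matches imports of the autonomous engine modules.
const AUTONOMOUS_ENGINE_IMPORT = /from\s+['"][^'"]*pci_autonomous_(?:requirements|evaluator|schemas)['"]/;

interface LockdownReport {
  forbidden: string[]; // files importing the hand-written engine
  missingEngineImport: string[]; // tool files importing no autonomous engine module
}

function checkLockdown(files: Record<string, string>): LockdownReport {
  const report: LockdownReport = { forbidden: [], missingEngineImport: [] };
  for (const [path, source] of Object.entries(files)) {
    if (FORBIDDEN_IMPORT.test(source)) {
      report.forbidden.push(path);
    }
    if (path.endsWith('_tool.ts') && !AUTONOMOUS_ENGINE_IMPORT.test(source)) {
      report.missingEngineImport.push(path);
    }
  }
  return report;
}
```

A CI test built on this would assert that both arrays are empty, so a handler that quietly re-imports `pci_compliance_evaluator` fails the build.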

All 28 autonomous-skill tests + 3 engine-lockdown tests pass.

The next step (v6 results in `comparison.html`) is a fresh
iteration+holdout eval run against this engine, which can now be
attributed entirely to the autonomous architect.
Plug the v6 run (autonomous tools + autonomous engine) into the
side-by-side comparison report. The architect re-authored the PCI
domain engine from the public PCI DSS v4.0.1 spec
(`pci_autonomous_requirements.ts`, `pci_autonomous_evaluator.ts`,
`pci_autonomous_schemas.ts`), with a CI lockdown test asserting zero
imports from the hand-written engine. Eval results:

Iteration set (Sonnet 4.6, 8 scenarios)
  hand-written: 0.989
  auto v5 (own tools, shared engine): 0.989
  auto v6 (own tools + own engine): 0.989  ← deep autonomy at parity

Holdout set (Sonnet 4.6, 6 scenarios)
  hand-written: 0.942
  auto v5: 0.927 (gap −0.062 vs iteration → CAUTION band)
  auto v6: 0.985 (gap −0.004 vs iteration → CLEAN band)

The deep-autonomy engine generalises *better* than the surface-only v5
on the holdout, with substantive wins on the 4h scorecard scenario
(+0.100) and the default-account variants scenario (+0.250). Both wins
come from the autonomous engine's more deliberate CDE / account-status
semantics carrying over to non-fixture data shapes.
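
The gap figures quoted above are holdout minus iteration. A minimal sketch of that arithmetic follows; the band thresholds and the third band name are assumptions for illustration, since this message does not state where the report draws its CLEAN / CAUTION boundaries.

```typescript
// Round to three decimals so floating-point noise (for example
// -0.06199999999999994) does not leak into the reported gap.
function generalisationGap(iteration: number, holdout: number): number {
  return Math.round((holdout - iteration) * 1000) / 1000;
}

// ASSUMPTION: 'OVERFIT' and both thresholds are hypothetical; only the
// CLEAN and CAUTION labels appear in the results above.
type GapBand = 'CLEAN' | 'CAUTION' | 'OVERFIT';

function gapBand(gap: number, cleanLimit = 0.02, cautionLimit = 0.1): GapBand {
  // Only a drop on the holdout set counts against the skill.
  const drop = Math.max(0, -gap);
  if (drop <= cleanLimit) return 'CLEAN';
  if (drop <= cautionLimit) return 'CAUTION';
  return 'OVERFIT';
}

const v5Gap = generalisationGap(0.989, 0.927); // -0.062
const v6Gap = generalisationGap(0.989, 0.985); // -0.004
```

With these illustrative thresholds, `gapBand(v5Gap)` lands in CAUTION and `gapBand(v6Gap)` in CLEAN, matching the labels in the results block.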

Report changes
--------------

- §1.5 autonomy ladder: rewrite the four engine rows from a single
  "SHARED" red pill to a "v5: SHARED / v6: AUTONOMOUS" pair, and add
  closing paragraphs that distinguish the two cycles.
- §4 multi-model grid: add the v6 column. The reader can see v5 → v6
  was a no-op on iteration scores but a substantive lift on holdout.
- §5 generalisation gap: add a v6 row paired to the v6 holdout run.
  The pairing logic in build_comparison_html.mjs now strips any
  trailing `-vN` suffix when looking up the holdout label, so future
  iterations don't need a code change.
- §6 reasoning bullet: flip the autonomous-side description from
  "engine still shared" to "tool surface AND domain engine
  independent (v6)", with the CI lockdown test referenced.
- §8 honest limitation: rewrite as "how the deep-autonomy experiment
  was constructed (v6)". The prior text said this experiment "is not
  run here". It is now run here, and the section documents the three
  re-authored modules, the CI lockdown, and the result.
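
The §5 suffix-stripping rule can be sketched as a one-line helper. The function name and the example labels are hypothetical; only the "strip any trailing `-vN`" behaviour comes from the description above.

```typescript
// Hypothetical helper: drop a trailing "-v<digits>" so that run labels for
// different iterations resolve to the same base label.
function baseRunLabel(label: string): string {
  return label.replace(/-v\d+$/, '');
}
```

Because the pattern is anchored at the end of the string, a label such as `v6-run` is left untouched, and a future `-v7` or `-v10` run needs no code change.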

The verdict banner now references both v5 (surface autonomy) and v6
(deep autonomy) as separate parity events.

Addresses the v6 deep-autonomy audit findings raised after the architect's
own engine modules landed:

Code-quality (autonomous engine modules)
  - schemas: tighten REQUIREMENT_ID_PATTERN so `all.1` etc. no longer match;
    strip stale "cycle-17" docstring references.
  - requirements: type catalog as Partial<Record<...>> so undefined lookups
    must be handled; drop redundant `| LIMIT 1` after un-grouped STATS;
    remove the as-cast pseudo-anchor (replaced by a runtime invariant in
    the new test file); strip "cycle-17" docstrings.
  - evaluator: scoreFor is exhaustive over the typed SCORE_TABLE so drop
    the unreachable `?? 0` fallback; runAutonomousWithConcurrency now
    awaits all in-flight tasks before re-throwing the first error so a
    single rejection no longer orphans siblings (semantics documented).
  - docstrings across index.ts, compliance_check_tool, register_tools,
    autonomous skill, and experimental_features now consistently describe
    v6 deep autonomy (independent engine + tools + heuristics) rather than
    overclaiming or underclaiming shared logic.
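
The "await all in-flight tasks, then re-throw the first error" semantics can be sketched as follows. This is not the body of `runAutonomousWithConcurrency`; the name `runWithConcurrency`, its signature, and the choice to stop scheduling queued work after a failure are assumptions.

```typescript
// Runs `worker` over `items` with at most `limit` tasks in flight.
// If a task rejects, tasks already running are still awaited before the
// first error is re-thrown. ASSUMPTION: queued tasks are not started
// once a failure has been recorded.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  let failed = false;
  let firstError: unknown;

  // Each lane pulls the next index until the queue drains or a task fails.
  async function lane(): Promise<void> {
    while (!failed && next < items.length) {
      const index = next++;
      try {
        results[index] = await worker(items[index], index);
      } catch (err) {
        if (!failed) {
          failed = true;
          firstError = err;
        }
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(lane());

  // Lanes never reject (errors are captured above), so this resolves only
  // after every in-flight task has settled.
  await Promise.all(lanes);

  if (failed) throw firstError;
  return results;
}
```

Results are written by index, so output order matches input order regardless of completion order.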

Engine unit tests (~85 specs, ~2s)
  - pci_autonomous_schemas.test.ts: provenance constants, index-pattern
    refinements (ESQL injection, length bounds), time-range clamping,
    requirement-id regex, buildAutonomousScopeClaim dedupe/sort.
  - pci_autonomous_requirements.test.ts: catalog completeness, self-
    referential ids, presence of AUTONOMOUS_TIME_WINDOW placeholders,
    detect_violations always carries a violation query, defaultLookback
    sanity, plus a real runtime sync invariant that parses every catalog
    key through pciAutonomousRequirementIdSchema (replaces the prior
    compile-time anchor that was suppressed by an `as` cast). Also covers
    requirementCategory, buildAutonomousTimeWindowParams, time-range
    resolution, normalize/resolve helpers, and index-pattern helpers.
  - pci_autonomous_evaluator.test.ts: concurrency runner correctness +
    failure semantics, ordered ?_window_start/?_window_end binding,
    detect_violations RED path, verify_presence GREEN path, AMBER+HIGH /
    AMBER+LOW / NOT_ASSESSABLE branches via mockResolvedValueOnce, ES|QL
    failure → query_failed data gap, evidence row clamping.
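
For context on the time-window behaviour these specs exercise, here is a minimal sketch of a no-future-dates guard and an ordered parameter builder. The helper names and the `TimeRange` shape are hypothetical. The list-of-single-key-objects layout follows the named-parameter form of the ES|QL query API, but whether `buildAutonomousTimeWindowParams` returns exactly this shape is an assumption.

```typescript
// ASSUMPTION: ISO-8601 strings; the real schema may accept other forms.
interface TimeRange {
  from: string;
  to: string;
}

// Rejects unparseable, inverted, and future-ending ranges.
function assertValidTimeRange(range: TimeRange, now: Date): void {
  const from = Date.parse(range.from);
  const to = Date.parse(range.to);
  if (Number.isNaN(from) || Number.isNaN(to)) throw new Error('Invalid timestamp');
  if (from >= to) throw new Error('`from` must be earlier than `to`');
  if (to > now.getTime()) throw new Error('Time range must not end in the future');
}

type EsqlNamedParam = Record<string, string>;

// Binds the ?_window_start / ?_window_end placeholders, start first.
function buildTimeWindowParams(range: TimeRange): EsqlNamedParam[] {
  return [{ _window_start: range.from }, { _window_end: range.to }];
}
```

Passing `now` in explicitly keeps the guard deterministic under test.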

Reproducibility (#2 from audit)
  - build_comparison_html.mjs gains --combined-run <label>=<dir>, which
    reads a single results.json that mixes pci-compliance:* (iter) and
    pci-holdout:* (holdout) scenarios and splits them internally. The
    v6 evaluation report can now be regenerated from one results.json
    without an ad-hoc helper script.
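
The split that `--combined-run` performs can be sketched as below. The `ScenarioResult` shape and the function names are assumptions, since the layout of results.json is not shown here; only the two scenario prefixes come from the description above.

```typescript
// ASSUMPTION: each scenario result carries a namespaced id and a score.
interface ScenarioResult {
  id: string;
  score: number;
}

interface SplitRun {
  iteration: ScenarioResult[];
  holdout: ScenarioResult[];
}

// Partition one mixed result list by scenario prefix.
function splitCombinedRun(results: ScenarioResult[]): SplitRun {
  return {
    iteration: results.filter((r) => r.id.startsWith('pci-compliance:')),
    holdout: results.filter((r) => r.id.startsWith('pci-holdout:')),
  };
}

function meanScore(results: ScenarioResult[]): number {
  if (results.length === 0) return 0;
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}
```

Scenarios matching neither prefix are dropped in this sketch; the real script may treat them differently.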

All four PCI-autonomous Jest suites pass locally (engine + lockdown).
No new lint errors introduced (remaining no-continue / no-nested-ternary
hits are pre-existing in untouched code).

- comparison.html / build_comparison_html.mjs: extend §8 with a new
  "v6 hardening — audit fixes + engine unit tests" subsection that
  spells out the post-v6 audit batch (Partial Record typing, exhaustive
  scoreFor, dropped LIMIT 1, concurrency failure semantics, stricter
  REQUIREMENT_ID_PATTERN), the new 85-spec engine test suite (including
  the runtime catalog↔schema sync invariant that replaces the suppressed
  compile-time anchor), and the new --combined-run flag for one-shot
  v6 report regeneration from a single results.json.

- build_comparison_html.mjs: flatten six pre-existing nested ternaries
  (the §4 multi-runs-vs-live-vs-fallback chain becomes an IIFE with
  if/else; banner-class / banner-cls / gap-advice / mean-row cls all
  become let-block assignments) — no behaviour changes, the script
  smoke-runs end-to-end with --combined-run and produces a valid 574-line
  HTML output with all 11 §-headings intact.

- pci_autonomous_requirements.ts: drop the lone `continue` in
  resolveAutonomousRequirementIds by inverting the guard into a
  positive-branch `if (canonical && canonical !== 'all') { ... }`.
  All 46 requirements specs still pass.

Net result: both files lint clean (0 errors, 0 warnings). The 7
pre-existing lints sitting inside the audit-batch diff zone — 1
no-continue and 6 no-nested-ternary — are gone.

…rift test

Addresses two follow-up findings on PR elastic#268798:

#2 — Lockdown test (pci_autonomous_modules_no_handwritten_imports.test.ts):
broaden the import deny-list to cover the full hand-written PCI surface,
not just the three engine modules. Now blocks:

  - pci_compliance_tool
  - pci_compliance_evaluator
  - pci_compliance_requirements
  - pci_compliance_schemas
  - pci_field_mapper_tool
  - pci_scope_discovery_tool
  - anything under skills/pci_compliance/**

The previous deny-list only covered the engine trio, which left a silent
re-coupling path: a future contributor could import the hand-written
orchestrator tool or scope-discovery helper and pass CI. The deep-autonomy
guarantee in comparison.html §1.5 is broader than the engine — it covers
every hand-written surface — so the lockdown should match.
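
A deny-list check of this kind can be sketched as a pure function over file contents. The module names are taken from the list above; the regex-based import extraction and the matching rules (module basename, or a `skills/pci_compliance/` path segment) are assumptions about how the real test works.

```typescript
const DENIED_MODULES = [
  'pci_compliance_tool',
  'pci_compliance_evaluator',
  'pci_compliance_requirements',
  'pci_compliance_schemas',
  'pci_field_mapper_tool',
  'pci_scope_discovery_tool',
];
const DENIED_PATH_SEGMENT = 'skills/pci_compliance/';

// Collects module specifiers from `... from '...'`, `import '...'`,
// `import('...')`, and `require('...')`. A regex is a simplification:
// it can also match specifiers inside comments or strings.
function importSpecifiers(source: string): string[] {
  const pattern = /(?:\bfrom\s+|\bimport\s*\(?\s*|\brequire\s*\(\s*)['"]([^'"]+)['"]/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Returns every specifier that points at the hand-written PCI surface.
function forbiddenImports(source: string): string[] {
  return importSpecifiers(source).filter((spec) => {
    const basename = (spec.split('/').pop() ?? '').replace(/\.(ts|js)$/, '');
    const deniedModule = DENIED_MODULES.indexOf(basename) !== -1;
    // Append '/' so a directory import such as '../skills/pci_compliance' matches.
    const deniedPath = (spec + '/').indexOf(DENIED_PATH_SEGMENT) !== -1;
    return deniedModule || deniedPath;
  });
}
```

A test would read each file under `pci_autonomous_tools/` and assert that `forbiddenImports(contents)` is empty. Matching on the full `skills/pci_compliance/` segment keeps a sibling directory whose name merely starts with `pci_compliance` from tripping the check.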

#4 — New comparison_html.test.ts: structural snapshot for the committed
report. Asserts that the 11 §-level sections appear (in expected order)
and the v6 hardening / deep-autonomy h3 subsections are present. Catches
the two drift directions between comparison.html and
scripts/build_comparison_html.mjs:

  1. someone edits the HTML directly and forgets to update the template;
  2. someone edits the template and forgets to regenerate + commit.

Deliberately not byte-for-byte equality — the rendered HTML legitimately
changes with each eval refresh and we don't want CI noise on prose tweaks.
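
The ordered-headings assertion can be sketched as below. The helper name and the heading strings in the usage note are hypothetical; the real `comparison_html.test.ts` may match headings differently.

```typescript
// Returns true when every expected heading occurs in `html`, each one
// strictly after the previous match. Content between headings is ignored,
// so prose and score changes do not affect the result.
function headingsInOrder(html: string, expected: string[]): boolean {
  let cursor = 0;
  for (const heading of expected) {
    const at = html.indexOf(heading, cursor);
    if (at === -1) return false;
    cursor = at + heading.length;
  }
  return true;
}
```

For example, `headingsInOrder(html, ['<h2>§1 ', '<h2>§2 '])` stays true when the text under either heading is edited, and turns false if a heading is removed or the two are swapped.
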
Address the 15 findings from the autonomous PCI deep-analysis audit
covering the engine modules, the four agent-facing tools, and the
skill prompt.

Blockers
- Scope-discovery tool now returns a `discoveryClaim` (point-in-time
  snapshot) instead of a mis-shaped `scopeClaim`, surfaces ES errors
  as structured `dataGaps`, and validates `cat.indices` responses
  with a zod schema before walking them.
- Requirements catalog: dropped the unused `requiredCategories[]` field
  and the orphan `requirementCategory()` helper. Removed `NOT_APPLICABLE`
  from `AutonomousComplianceStatus` — it was carried in the score table
  but never produced by any evaluator path.
- Scorecard report no longer tags its synthesised executive roll-up as
  `ToolResultType.esqlResults` (the payload is not an ESQL row set);
  it now lands under `ToolResultType.other` so downstream UX/telemetry
  that special-cases `esqlResults` does not mis-render it.

Important
- Skill prompt rewritten: workflow is now `discover → roll up → drill
  down`. The check and scorecard tools are explicitly designed to be
  used as a sequence and share one evaluator via the new
  `runAutonomousPciEvaluationPack` orchestration helper.
- Both tools now derive `overallStatus` from the same severity rollup
  (`rollupAutonomousOverallStatus`) and `overallConfidence` from the
  same confidence rollup (`rollupAutonomousConfidence`), eliminating
  the previous risk of disagreement.
- Field-mapper sensitive-field regex tightened: the previous bare
  `/token/i` matched any field name containing the substring `token`,
  so unrelated names such as `tokenizer` were flagged. Replaced with anchored patterns
  for `card`, `pan`, `cvv`, `cvc`, `account.number`, `credit.card`,
  `ssn`, `secret`, `password`, `api.key`, and specific `*token`
  shapes.
- Added an `assertNever` exhaustiveness check (with a runtime guard) on the
  `statusToHumanLabel` switch — adding a new status without
  updating the switch now fails at compile time.
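
The shared rollup and the exhaustiveness check can be sketched together. The function names mirror the ones mentioned above, but the status union, the label text, and the worst-status-wins ordering are assumptions, not the shipped implementation.

```typescript
// ASSUMPTION: hypothetical status set.
type Status = 'GREEN' | 'AMBER' | 'RED' | 'NOT_ASSESSABLE';

// Accepting `never` makes the call a compile error if any union member is
// unhandled; the throw is a backstop for values that bypass the types.
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

function statusToHumanLabel(status: Status): string {
  switch (status) {
    case 'GREEN':
      return 'Compliant';
    case 'AMBER':
      return 'Partially compliant';
    case 'RED':
      return 'Non-compliant';
    case 'NOT_ASSESSABLE':
      return 'Not assessable';
    default:
      return assertNever(status);
  }
}

// ASSUMPTION: worst status wins, ranked from least to most severe.
const SEVERITY_ORDER: Status[] = ['GREEN', 'NOT_ASSESSABLE', 'AMBER', 'RED'];

// One rollup used by every caller, so two tools given the same
// per-requirement statuses cannot report different overall results.
function rollupOverallStatus(statuses: Status[]): Status {
  if (statuses.length === 0) return 'NOT_ASSESSABLE';
  let worst: Status = statuses[0];
  for (const s of statuses) {
    if (SEVERITY_ORDER.indexOf(s) > SEVERITY_ORDER.indexOf(worst)) worst = s;
  }
  return worst;
}
```

Adding a fifth member to `Status` without a matching `case` makes `assertNever(status)` a type error, because `status` is no longer narrowed to `never` in the default branch.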

Nice-to-haves
- Removed experiment-only metadata (gate scores, citation counts,
  architect attribution, brittle `comparison.html §1.5` cross-refs)
  from every runtime file. Authoring metadata stays beside the eval
  suite.
- "Recommended Remediation SLA" table in the skill prompt re-labelled
  as operational guidance — only the 30-day req 6.3.3 window is
  spec-sourced; the rest are heuristics a QSA would typically agree
  with but an org may tune.
- SAQ scope-reduction "70%" claim re-cast as the assessor-guidance
  heuristic range (50–80%), not a guarantee.
- `requirementCategory` tests removed; weak `['HIGH','MEDIUM']`
  evaluator assertion pinned to the exact value (`MEDIUM` via the
  coverage-stage no-violation-query path).
- New `buildAutonomousDiscoveryClaim` helper + 4-spec test block
  covering dedupe/sort, provenance pinning, point-in-time semantics,
  and stable shape across shuffled inputs.
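
The dedupe/sort and shuffle-stability properties can be sketched as below. Every field name in `DiscoveryClaim`, the provenance value, and the helper signature are hypothetical; only the properties under test come from the list above.

```typescript
// ASSUMPTION: illustrative shape, not the real `discoveryClaim` contract.
interface DiscoveryClaim {
  indices: string[];
  discoveredAt: string; // point-in-time snapshot timestamp (ISO-8601)
  provenance: { engine: string };
}

function buildDiscoveryClaim(indices: string[], discoveredAt: Date): DiscoveryClaim {
  // Dedupe via Set, then sort so input order cannot affect the output.
  const unique = Array.from(new Set(indices)).sort();
  return {
    indices: unique,
    discoveredAt: discoveredAt.toISOString(),
    provenance: { engine: 'autonomous' }, // hypothetical pinned constant
  };
}
```

Serialising two claims built from shuffled copies of the same input then yields identical JSON, which is a convenient way to assert stable shape.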

Verification
- ESLint: 14 files, clean.
- Jest: 101/101 pass in `pci_autonomous_tools/` + the autonomous
  skill suite, 16/16 pass in `comparison_html.test.ts`.
- Scoped `tsc -b` against `security_solution/tsconfig.type_check.json`:
  green.