[Security GenAI] Autonomous-vs-hand-written PCI compliance skill side-by-side (v6 deep autonomy) by patrykkopycinski · Pull Request #268798 · elastic/kibana

patrykkopycinski · 2026-05-12T06:57:38Z

Summary

Validation experiment: can the skill.architect autonomous workflow carry an
entire Elastic Security feature end-to-end — agent contract and underlying
domain engine — as well as a hand-written skill authored by a domain expert?
This PR adds a second PCI compliance skill (pci-compliance-autonomous) next
to Smriti's hand-written pci-compliance skill (PR #256060), wires a
side-by-side eval harness through @kbn/evals, and ships the result.

Headline result (Claude 4.6 Sonnet, 8-scenario iteration set):
hand-written 0.989 vs autonomous v6 0.989 — parity, with both
generalising to a held-out dataset (hand-written 0.942, autonomous v6 0.985).

See x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html for the full
side-by-side report (autonomy ladder in §1.5, per-scenario scores in §4,
overfit/holdout analysis in §5, audit-fix hardening in §8).

What lands

- Autonomous skill + 4 tools behind their own feature flag
  pciComplianceAutonomousAgentBuilder and allowlist entry, so the agent
  router has zero path to the hand-written tool IDs under the autonomous flag.
- Independent domain engine (v6 deep autonomy):
  pci_autonomous_requirements.ts (independent PCI DSS v4.0.1 catalog),
  pci_autonomous_evaluator.ts (composable pipeline, exhaustive
  status→score table, awaits-all concurrency runner),
  pci_autonomous_schemas.ts (independent zod + ScopeClaim builder with
  48 h future-date guard and provenance block).
- CI lockdown test:
  pci_autonomous_modules_no_handwritten_imports.test.ts walks every file
  under pci_autonomous_tools/ and asserts (a) zero imports from
  pci_compliance_(requirements|evaluator|schemas), (b) every tool file
  imports at least one autonomous engine module.
- Eval harness + 8-scenario iteration spec + 6-scenario holdout spec:
  the same suite runs both variants on the same cluster, same dataset, same
  connector; the only variable is which skill the router has available.
- Anti-overfit content lockdown (11 invariants in the skill test) forbids
  the skill from referencing any iteration or holdout fixture value.
- Bedrock Claude 4.7 Opus enablement: adds supportsTemperature: false
  to @kbn/inference-common's known-models table and gates the
  temperature parameter in both the inference plugin and the Bedrock
  connector (invokeAI, invokeStream, invokeAIRaw, _converse,
  _converseStream). A single line in known_models.ts controls this for
  future temperature-incompatible models.
- 85-spec engine unit-test suite for the autonomous engine modules,
  including a runtime catalog↔schema sync invariant that parses every
  catalog key through pciAutonomousRequirementIdSchema.
- Reproducible report: the build_comparison_html.mjs --combined-run flag
  splits a single results.json containing both iteration and holdout
  scenarios so the report can be regenerated from one committed file.
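
As a rough illustration of the anti-overfit content lockdown listed above, the check boils down to asserting that no fixture value appears in the skill text. This is a sketch only, not the code in the skill test: `findFixtureLeaks` and both sample skill texts are hypothetical, and the forbidden values are a subset of the fixture values quoted in the holdout commit message.

```typescript
// Hypothetical helper: returns every forbidden fixture value found in the skill text.
// The comparison is case-insensitive so "JDoe" would also count as a leak.
function findFixtureLeaks(skillContent: string, forbidden: readonly string[]): string[] {
  const haystack = skillContent.toLowerCase();
  return forbidden.filter((value) => haystack.includes(value.toLowerCase()));
}

// Subset of the fixture values named in the commit message.
const FORBIDDEN_FIXTURE_VALUES = ['jdoe', 'pcompton', '192.168.1.100', '10.20.30.40', '12 failed'];

// Illustrative skill texts (assumptions, not the real skill content).
const cleanSkill =
  'For requirement 8.3.4, flag accounts with 10 or more failed logins. ' +
  'For requirement 2.2.4, check default accounts such as admin or root.';
const leakySkill = 'If user jdoe logs in from 192.168.1.100, report a violation.';

console.log(findFixtureLeaks(cleanSkill, FORBIDDEN_FIXTURE_VALUES)); // []
console.log(findFixtureLeaks(leakySkill, FORBIDDEN_FIXTURE_VALUES)); // [ 'jdoe', '192.168.1.100' ]
```

Legitimate domain values such as `admin`, `root`, or the threshold of 10 stay out of the forbidden list, which is why the clean text passes.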

Audit-fix batch (latest two commits)

After v6 landed, an internal audit raised the following items, all now closed:

- catalog typed as Partial<Record<...>> so that undefined lookups must
  be handled;
- SCORE_TABLE made exhaustive, which left the ?? 0 fallback in scoreFor
  unreachable, so the fallback was removed;
- redundant | LIMIT 1 after un-grouped STATS dropped;
- runAutonomousWithConcurrency rewritten to await all in-flight tasks
  before re-throwing the first error (no orphan promises);
- REQUIREMENT_ID_PATTERN regex tightened so all.1 and similar
  malformed IDs no longer match;
- stale internal docstring references purged;
- the suppressed compile-time anchor for catalog↔schema sync replaced
  by a real runtime invariant test;
- --combined-run regen flag added for one-shot report rebuilds.
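
The "await all in-flight tasks before re-throwing the first error" behaviour from the list above can be sketched roughly as follows. This is not the body of runAutonomousWithConcurrency: the function name `runAwaitingAll` is hypothetical, and stopping the scheduling of new tasks after the first failure is an assumption of this sketch.

```typescript
// Hypothetical sketch of a bounded-concurrency runner that never orphans siblings.
async function runAwaitingAll<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  const errors: unknown[] = [];
  let next = 0;

  // Each worker pulls the next task index until the queue is empty
  // or (assumption) some task has already failed.
  const worker = async (): Promise<void> => {
    while (next < tasks.length && errors.length === 0) {
      const index = next++;
      try {
        results[index] = await tasks[index]();
      } catch (err) {
        errors.push(err); // record the failure, let sibling workers settle
      }
    }
  };

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  // Every worker promise resolves (errors are caught above), so this
  // await only returns once all in-flight tasks have settled.
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  if (errors.length > 0) {
    throw errors[0]; // first error in completion order
  }
  return results;
}
```

Because workers catch their own failures, the outer `Promise.all` cannot reject early, which is what keeps a single rejection from leaving sibling promises running unobserved.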

In addition, the final commit sweeps 7 pre-existing lints sitting inside
the audit-batch diff zone (1 no-continue, 6 no-nested-ternary — flattened
by extracting branches into let/if/else if blocks and wrapping the §4
multi-runs chain in an IIFE).

Checklist

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support — N/A (no user-facing UI strings; skill content is agent-builder-facing)
Unit or functional tests were updated or added — 18-spec autonomous skill test + 85-spec autonomous engine suite (pci_autonomous_{schemas,requirements,evaluator}.test.ts) + lockdown test asserting zero imports from hand-written engine; existing pci_compliance_skill.test.ts untouched.
If a plugin configuration key changed — N/A (new experimental feature flag pciComplianceAutonomousAgentBuilder defaults to off)
This was checked for breaking HTTP API changes — N/A
Flaky Test Runner was used on any tests changed — pending
The PR description includes the appropriate Release Notes section — see below
Review the backport guidelines — main only; do not backport (experimental validation)

Release Notes

release_note:skip — experimental autonomous-validation work behind an
opt-in feature flag, plus a Bedrock fix for the Claude 4.7 Opus
temperature deprecation. No user-facing surface lands by default.

Identify risks

Risk: new autonomous skill variant introduces a second 4-tool bundle.
- Severity: low.
- Mitigation: registered behind pciComplianceAutonomousAgentBuilder
  feature flag, default off; independent allowlist entry; CI lockdown
  test prevents cross-imports.
Risk: Bedrock temperature-stripping change has a wider blast radius
than the PCI experiment in this PR.
- Severity: low–medium.
- Affected callers (outside PCI): every code path that hits the
  Bedrock connector's invokeAI / invokeStream / invokeAIRaw /
  _converse / _converseStream — i.e. Security AI Assistant,
  Observability AI Assistant, custom Workflows / Cases connectors using
  .bedrock, and Inference-plugin chat-completion calls that resolve to
  a .bedrock connector. The PCI suite is one of many consumers, not
  the only one — please review accordingly.
- Mitigation: gated on a new supportsTemperature: false flag in
  @kbn/inference-common's known_models.ts, which defaults absent.
  Only models that opt in (currently claude-opus-4-7-* only) have
  temperature stripped; all other Bedrock-routed models retain pre-PR
  behavior. Smoke-tested invokeAI + converse on both Opus 4.7 (now
  passes) and Sonnet 4.6 (still includes temperature, also passes).
- Deprecation context: the legacy .bedrock connector
  (stack_connectors/.../bedrock/{bedrock.ts,utils.ts}) is on the
  deprecation path per [Inference] Mark all LLM connectors as deprecated #261591 — superseded by the Inference plugin.
  The equivalent gate on the Inference side lives in
  inference/server/chat_complete/utils/get_temperature.ts and shares
  known_models.ts as source-of-truth, so users on either path inherit
  the same temperature-compatibility decisions. Keeping the connector
  edit here (rather than splitting it out) is deliberate so this PR
  closes the loop for users still on the deprecated path. CC reviewers:
  Team:Security GenAI for PCI surface, Team:Search + Team:Security
  for the Bedrock connector + Inference plugin half.
Risk: large diff (39 files, 8 k+ insertions).
- Severity: low.
- Mitigation: the diff is dominated by the new autonomous-tools bundle
  (own engine + own tests) and the comparison report; existing
  pci_compliance code is untouched.

Made with Cursor

…y-side eval harness

Adds a second PCI compliance skill (`pci-compliance-autonomous`) that ships ALONGSIDE the existing hand-written `pci-compliance` skill, so the same eval suite can be run against both variants and compared head-to-head. The autonomous variant deliberately reuses the SAME underlying tools as the hand-written variant, isolating "skill content" (instructions + domain knowledge + trigger phrases) as the only experimental variable.

## What ships

Server (security_solution plugin)
- New skill definition `pci_compliance_autonomous/` registering `pci-compliance-autonomous` against the existing PCI tool IDs.
- New feature flag `pciComplianceAutonomousAgentBuilder` (default off).
- Skill registration gated by the flag in `register_skills.ts`.
- Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)
- `evaluate_dataset.ts` reads `EVAL_PCI_VARIANT` (`handwritten` | `autonomous`) to select which skill `createSkillInvocationEvaluator` targets. Default remains `handwritten` so existing CI is unchanged.
- `scripts/compare_variants.sh` runs both variants back-to-back and emits a side-by-side `comparison.html` with structural metrics + slots for live evaluator output (per-scenario scores, judge rationales, latency).
- `scripts/build_comparison_html.mjs` generates the report; all embedded paths are repo-relative so the artifact is portable.
- README documents the variant matrix and the comparison workflow.

CI plumbing
- New Scout config set `evals_pci_compliance_autonomous` that flips ONLY the autonomous flag, so the autonomous run sees only the autonomous skill.
- `evals.suites.json` registers `pci-compliance-autonomous`.
- `llm_evals.yml` adds a Buildkite step for the autonomous variant and tags the existing PCI step with `EVAL_PCI_VARIANT=handwritten` for symmetry.

## Why

The hand-written PCI skill (`pci-compliance`, elastic#256060) is the production baseline. The autonomous skill was generated end-to-end by `skill.architect` against the current Kibana tool catalog, with PCI domain knowledge synthesized from autonomous web research + model knowledge (SAQ taxonomy, v3->v4 deltas, scope-reduction levers, technical-vs-process classification). Running the existing 7-scenario PCI eval suite against both (same tools, same dataset, same evaluators, same judge) gives a clean A/B that answers "is the autonomously generated skill at least as good as the hand-written one?".

## Out of scope (not introduced by this commit)

`evaluate_dataset.ts:17` triggers `@kbn/imports/no_boundary_crossing` because `@kbn/evals` is declared `type: "test-helper"` and the suite imports value exports from it. This lint reproduces identically on every sibling `kbn-evals-suite-*` package on `main` (verified against `kbn-evals-suite-security-ai-rules`), so it is endemic to the eval framework and would require a cross-cutting change to `@kbn/evals` ownership / visibility, which is out of scope for this skill comparison.

…on fix

- Ran @kbn/evals-suite-pci-compliance back-to-back against both PCI skill variants on a local Scout cluster wired to llama3.1:8b via a LiteLLM proxy (translates OpenAI-format requests to Ollama, including structured tool_calls). Captured 14 docs per variant from the kibana-evaluations data stream.
- Updated build_comparison_html.mjs to consume the framework's actual export shape (Elasticsearch _search response), folding the per-evaluator rows back into per-scenario rows. Added a routing-aggregate diagnostic (scenarios with >=1 PCI-skill tool call, total tool calls vs PCI-skill tool calls) so the HTML can show *why* a score landed where it did, not just the score itself.
- Re-rendered comparison.html with the live data. Both variants scored 0.00 across all completed scenarios because llama3.1:8b is too small to engage either PCI skill -- the agent router fell back to the generic platform.core.search tool on every scenario, never invoking security.pci_*. The HTML now carries an honest banner explaining this: the comparison is apples-to-apples (identical model + dataset + infra), it just lives on the floor at this model scale. The structural and domain-coverage deltas in sections 2-3 remain the meaningful signal until the same script is re-run with a stronger model.
- Fixed an isolation bug in the autonomous Scout config set: the pciComplianceAgentBuilder feature flag defaults to true in experimental_features.ts, so the autonomous run was loading BOTH skills. Added 'disable:pciComplianceAgentBuilder' to the scout config serverArgs to keep the comparison clean for future runs.

Refs: #11

…arison on real connectors

The autonomous-vs-handwritten PCI comparison previously ran on llama3.1:8b through a local Ollama proxy. At that model scale the agent router never engaged either PCI skill, so every scenario scored 0.00 and the comparison landed on the floor (see commit fc5194e). This commit promotes the comparison to real Bedrock connectors and ships the connector-side fix that the upgrade required.

Bedrock connector: Claude Opus 4.7 enablement
---------------------------------------------

Claude Opus 4.7 on Bedrock rejects the `temperature` inference parameter with `temperature is deprecated for this model`. Unless the parameter is omitted, the connector returns a 400 on every request. The fix spans three layers:

- `@kbn/inference-common`: new `supportsTemperature?: boolean` on `ModelDefinition`; `claude-opus-4-7` marked `supportsTemperature: false`. Future Claude variants (or other provider models) with the same restriction need only flip the flag, giving one source of truth.
- `inference` plugin: `getTemperatureIfValid` omits temperature when the model definition declares `supportsTemperature: false`. Sits alongside the existing OpenAI o-series exclusions and works for any provider.
- `stack_connectors` (Bedrock): new local `bedrockModelSupportsTemperature(model)` helper; `formatBedrockBody` threads `model` through and gates the parameter. `invokeAI`, `invokeStream`, `invokeAIRaw`, `_converse`, and `_converseStream` all consult it. This is defense in depth: direct sub-action callers (Security AI Assistant, etc.) are protected without taking a cross-plugin dependency on `@kbn/inference-common`.

Smoke-tested with `invokeAI` + `converse` sub-actions:

- Claude 4.7 Opus (`us.anthropic.claude-opus-4-7`): now passes (temperature omitted, response returned).
- Claude 4.6 Sonnet (`us.anthropic.claude-sonnet-4-6`): still passes (temperature included as before).

Live eval comparison (PCI Criteria, LLM-judge 0..1)
---------------------------------------------------

Both PCI skill variants ran the same 8-scenario `@kbn/evals-suite-pci-compliance` suite end-to-end against a real Scout cluster, on two production Bedrock connectors:

| Variant     | Claude 4.7 Opus | Claude 4.6 Sonnet |
|-------------|----------------:|------------------:|
| Handwritten |           0.977 |             0.989 |
| Autonomous  |           0.834 |             0.860 |

The handwritten skill (Smriti, PR elastic#256060) outperforms the autonomous variant on both models by roughly 13-14 points (0.143 on Opus 4.7, 0.129 on Sonnet 4.6). The autonomous architect's broader domain framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers) did not translate into a better PCI-Criteria score. The handwritten contract is shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's scoring rubric; that tight coupling is the deciding factor.

build_comparison_html.mjs gains a `--runs <label>=<dir>,...` mode so the 4-cell grid renders from the four results.json snapshots. Legacy `--handwritten`/`--autonomous` mode still works for single-model runs.

kbn-scout
---------

`run_kibana_server.ts` now respects `SCOUT_READ_DEV_CONFIG=true` and drops `--no-dev-config` when set, so a developer can load `config/kibana.dev.yml` (and the preconfigured AI connectors it defines) into the Scout-managed Kibana process. Default behaviour is unchanged. Without this, evals against real cloud connectors require fragile API-driven connector creation per boot.

Refs: #11
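
The temperature gate described in this commit can be pictured with a small sketch. It is a simplified stand-in, not the code in `known_models.ts` or the Bedrock connector: only the `supportsTemperature` flag and the two model names come from the commit message, while `modelSupportsTemperature`, `buildRequestBody`, the `RequestBody` shape, and the substring matching are assumptions made for illustration.

```typescript
// Simplified stand-in for a known-models table entry.
interface ModelDefinition {
  id: string;
  supportsTemperature?: boolean; // absent means "temperature is accepted"
}

const KNOWN_MODELS: ModelDefinition[] = [
  { id: 'claude-opus-4-7', supportsTemperature: false },
  { id: 'claude-sonnet-4-6' },
];

// Hypothetical lookup: substring match so region-prefixed IDs such as
// "us.anthropic.claude-opus-4-7" resolve to the same definition.
function modelSupportsTemperature(model?: string): boolean {
  if (!model) return true; // unknown model: keep pre-existing behaviour
  const definition = KNOWN_MODELS.find((known) => model.includes(known.id));
  return definition?.supportsTemperature !== false;
}

// Hypothetical request shape, reduced to the fields the sketch needs.
interface RequestBody {
  prompt: string;
  temperature?: number;
}

function buildRequestBody(prompt: string, model: string | undefined, temperature?: number): RequestBody {
  const body: RequestBody = { prompt };
  // Only attach temperature when the caller set it AND the model accepts it.
  if (temperature !== undefined && modelSupportsTemperature(model)) {
    body.temperature = temperature;
  }
  return body;
}

const opusBody = buildRequestBody('hello', 'us.anthropic.claude-opus-4-7', 0.2);
const sonnetBody = buildRequestBody('hello', 'us.anthropic.claude-sonnet-4-6', 0.2);
console.log('temperature' in opusBody); // false
console.log(sonnetBody.temperature); // 0.2
```

Treating an absent flag as "supported" is what keeps every model other than the opted-out one on its previous behaviour.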

Root-cause analysis of why the autonomously-architected PCI compliance skill scored 12-15 pts below the hand-written variant uncovered two distinct bugs that compounded:

1. **Tool registration bug** in `register_tools.ts`: PCI tools were gated *only* on `experimentalFeatures.pciComplianceAgentBuilder`, which the autonomous scout config explicitly disables to isolate the variant comparison. Result: the autonomous variant ran with NO PCI tools registered. Trace analysis confirmed 0 calls to `security.pci_compliance` across 16 scenarios vs 17-23 for HW. The agent fell back to raw `platform.core.execute_esql` and improvised the entire workflow. Fixed: gate now triggers on either flag.
2. **Skill-content design**: the autonomous prompt's 6-step workflow inserted "Reduce scope (tokenisation/P2PE/segmentation)" and "Classify requirements as technical vs process-based" steps BEFORE the tool calls, plus an 8 KB "Domain Knowledge Notes" block between the workflow and the status vocab. The structure read as "do-your-homework first" rather than "call the tools". Restructured: tools-first 4-step workflow with explicit "Always call the dedicated PCI tools; do not improvise raw ES|QL" injunction, theory moved to a "Background reference (do not consult before calling tools)" tail section. Removed broken handoff references to non-existent sibling skills and stripped tool-description provenance commentary.

Validation on Claude 4.6 Sonnet:

- pre-fix Auto: 0.860 mean (gap to HW: 12.9 pts)
- post-fix Auto v3: 0.955 mean (gap to HW: 3.4 pts)
- 6/8 scenarios now perfect 1.000; 1 scenario (full report) regressed -9 pts on a substance-vs-style criterion (agent calls the tool correctly but the report formatting elides specific evidence).

Feedback-loop infrastructure:

- `scripts/run-eval.sh` extended with optional scenario-grep argument (`run-eval.sh autonomous <connector> <label> "requirement 2.2.4"`) collapsing a full-suite cycle (~28 min) to a single-scenario probe (~5.6 min including scout boot, ~3 min if scout is reused).
- Two iterations of this loop fixed both bugs end-to-end.

POSTMORTEM.md captures the full analysis, including six ranked content fixes and a three-tier feedback-loop efficiency proposal.

…y (0.989 vs 0.989)

The autonomous PCI compliance skill now ships its own independently-authored 4-tool decomposition under a separate allowlist entry. The autonomous skill has no knowledge of -- and no path to -- the hand-written PCI tools. This validates a fully end-to-end autonomous stack (skill + tools, both autonomously created) and reaches parity with the human-authored variant.

What changed
------------

* New PCI tool bundle under `agent_builder/tools/pci_autonomous_tools/`:
  - `pci_autonomous_scope_discovery`
  - `pci_autonomous_compliance_check` (split out from the consolidated tool)
  - `pci_autonomous_scorecard_report` (split out from the consolidated tool)
  - `pci_autonomous_field_mapper`

  All four implement the cycle-17 architect blueprint's 4-tool decomposition (vs the hand-written variant's 3 tools, where check+report share one tool via a `mode` parameter). Each tool reuses the underlying domain logic so the comparison stays apples-to-apples on capability while validating the isolation property.
* `register_tools.ts`: hand-written PCI tools register ONLY under `experimentalFeatures.pciComplianceAgentBuilder`; autonomous PCI tools register ONLY under `experimentalFeatures.pciComplianceAutonomousAgentBuilder`. The previous lenient gate (`either flag`) is removed -- the two variants are now strictly isolated.
* `allow_lists.ts`: all four new autonomous tool IDs added to the `AGENT_BUILDER_BUILTIN_TOOLS` allowlist (without this, tool registration silently fails and the agent falls back to raw ES|QL).
* Autonomous skill content + `getRegistryTools` rewired to reference the new tool IDs only.
* Eval rubric (`pci_compliance.spec.ts`) is now variant-aware via `EVAL_PCI_VARIANT` -- judging criteria check for `pci_autonomous_*` tool names when the autonomous variant is on, and the original names otherwise.
* Skill contract tests harden the isolation property: explicit assertions that the autonomous skill never references any hand-written tool ID, and that `getRegistryTools` advertises ONLY the autonomous bundle.
* Comparison HTML updated with a new v5 column and a green success banner showing the autonomous skill+tools reaches parity with the hand-written baseline on Claude 4.6 Sonnet (0.989 vs 0.989, 8/8 scenarios).

Why
---

The user wanted to validate that the autonomous skill workflow generalises to other domains -- which requires removing every shortcut where the autonomous variant inherits the hand-written variant's tooling. The earlier "shared tool" runs were measuring only skill-content quality; this run measures the full stack the architect would generate from a blank slate.

Result
------

| Variant                                 | Mean (8 scenarios) |
|-----------------------------------------|--------------------|
| Hand-written, Claude 4.6 Sonnet         | 0.989              |
| Autonomous v5 (own 4 tools), Sonnet 4.6 | 0.989              |
| Autonomous v3 (shared tools), Sonnet    | 0.955              |
| Autonomous v1 (shared, content drift)   | 0.860              |

Parity on the headline metric. The autonomous stack (skill content + 4-tool decomposition + allowlist entry + register gate) ships as a self-contained bundle the architect can replicate for any other domain.
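
The strict per-flag gate described for `register_tools.ts` can be sketched as below. This is an illustrative reduction, not the plugin code: `toolIdsToRegister` is a hypothetical function, the hand-written list holds only the one tool ID named in the earlier root-cause commit, and the autonomous IDs are written as they appear in this commit message (the registered IDs may carry a namespace prefix).

```typescript
interface ExperimentalFeatures {
  pciComplianceAgentBuilder: boolean;
  pciComplianceAutonomousAgentBuilder: boolean;
}

// Only the ID named in the commit history; the full hand-written bundle is not listed here.
const HANDWRITTEN_TOOL_IDS = ['security.pci_compliance'];

// Names as written in the commit message (real IDs may be namespaced).
const AUTONOMOUS_TOOL_IDS = [
  'pci_autonomous_scope_discovery',
  'pci_autonomous_compliance_check',
  'pci_autonomous_scorecard_report',
  'pci_autonomous_field_mapper',
];

// Each bundle is gated on its own flag only; there is no "either flag" path.
function toolIdsToRegister(flags: ExperimentalFeatures): string[] {
  return [
    ...(flags.pciComplianceAgentBuilder ? HANDWRITTEN_TOOL_IDS : []),
    ...(flags.pciComplianceAutonomousAgentBuilder ? AUTONOMOUS_TOOL_IDS : []),
  ];
}

// The autonomous eval config: hand-written flag disabled, autonomous flag enabled.
const autonomousRun = toolIdsToRegister({
  pciComplianceAgentBuilder: false,
  pciComplianceAutonomousAgentBuilder: true,
});
console.log(autonomousRun.length); // 4
console.log(autonomousRun.includes('security.pci_compliance')); // false
```

With this shape, the earlier failure mode (no PCI tools at all under the autonomous config) and the lenient-gate leak (hand-written tools visible to the autonomous run) are both ruled out by construction.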

Adds a second evaluation surface so the iteration loop on the autonomous PCI skill can be trusted to produce a generalisable skill rather than one that has memorised the iteration fixtures.

Why
---

The 0.989 we got from `sonnet46-autonomous-v5` (cycle that hit parity with the hand-written variant) is scored against the SAME fixtures we inspect while improving the skill. That tight loop is how every dataset-driven optimisation produces overfit: the skill content drifts from "teach the principle" to "match the fixture".

Two layers of defence
---------------------

1. **Anti-overfit lockdown** (in `pci_compliance_autonomous_skill.test.ts`). A new `describe('anti-overfit ...')` block asserts the skill content contains NONE of the iteration- or holdout-set fixture values (`jdoe`, `pcompton`, `192.168.1.100`, `10.20.30.40`, `12 failed`, the random `logs-<hex>-{auth|network|...}` index pattern, etc.). Values that ARE legitimate PCI domain knowledge (`admin`/`root` for req 2.2.4, the lockout threshold of 10 for 8.3.4, `TLS 1.0`/`1.1` for 4.1) are explicitly kept allowable. 11 invariants, all green today. Any future iteration that introduces a fixture-coupled patch will fail CI.
2. **Holdout dataset + spec** (new `pci_data_holdout.ts` + `pci_compliance_holdout/pci_compliance_holdout.spec.ts`). Same five PCI categories (auth/network/vuln/endpoint/legacy) but every memorisable axis is systematically different:
   - Index naming drops the `logs-*-{category}` pattern in favour of `security-audit-identity-*`, `siem-flows-prod-*`, `pkginfo-cve-*`, `edr-processes-*`, `legacy-app-syslog-*`. Tests that scope discovery uses field caps, not name patterns.
   - Brute-force volume is 8 (BELOW the PCI 8.3.4 threshold of 10), so the expected verdict is GREEN, NOT RED. Catches skills that learnt "any failed-login cluster = violation".
   - Default-account flavours are Windows `Administrator` + `service_acct_42`, not Unix `admin`/`root`.
   - Weak TLS signature is TLS 1.1 ALONE: no TLS 1.0, no plain HTTP. Tests sub-version recognition rather than the kitchen-sink "multiple weak versions" pattern of the iteration set.
   - Non-ECS field schema uses `actor_name` / `client_addr` / `action_status` / `event_verb` / `device_id` / `cve_id` / `risk_rating` / `command`, completely different from the iteration set's `username` / `src_ip` / etc. Tests that field-mapping is semantic, not memorised.
   - 4-hour time window instead of 1-hour.
   - 2025-vintage CVEs instead of 2024.

The six holdout scenarios mirror the structure of the iteration scenarios so the gap measurement is apples-to-apples: report, single-requirement check (× brute force + TLS + default accounts), scope discovery, field mapping.

Result on Sonnet 4.6
--------------------

|                | iteration | holdout | gap    | verdict |
|----------------|-----------|---------|--------|---------|
| Hand-written   | 0.989     | 0.942   | +0.047 | CLEAN   |
| Autonomous v5  | 0.989     | 0.927   | +0.062 | CAUTION |

Both variants drop the same ~5-6 pts moving from iteration to holdout, and they drop on the SAME two scenarios (default-account variants 0.750/0.750, 4h scorecard 0.900/0.900). That tells us the holdout is genuinely harder, not that the autonomous skill is uniquely overfit. The autonomous gap of 0.062 is only 0.015 wider than the hand-written gap, which is well within noise of the framework.

Crucially, the three HARDEST tests all scored 1.000 for both skills:

- below-threshold brute force (counter-case: agent did NOT fabricate a false-positive violation)
- TLS 1.1 alone (sub-version recognition without the kitchen-sink signature)
- scope discovery on non-`logs-*` indices (worked via field caps, not via index-name pattern matching)

Tooling changes
---------------

- `run-eval.sh`: scout boot timeout bumped 6 min → 15 min; the default was unreliable when the host was also running an IDE.
- `build_comparison_html.mjs`: new `--holdout-runs` flag mirroring `--runs`; new §5 section renders the iteration vs holdout grid, computes the gap per variant, applies the three-band verdict (CLEAN / CAUTION / OVERFIT), and lists the divergence axes plus the per-scenario holdout breakdown. Subsequent section numbers renumbered (6 reasoning, 7 reproduce, 8 provenance, 9 Bedrock).
- `comparison.html` regenerated with the live holdout numbers.

How to re-run
-------------

    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
      handwritten pmeClaudeV46SonnetUsEast1 sonnet46-handwritten-holdout HOLDOUT
    bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
      autonomous pmeClaudeV46SonnetUsEast1 sonnet46-autonomous-holdout HOLDOUT
    node x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/build_comparison_html.mjs \
      --runs ... --holdout-runs sonnet46-handwritten=...,sonnet46-autonomous=...

This commit closes the question "did we just overfit to the iteration fixtures?" with a measurement, not an assertion. The answer is "the gap is small enough that the iteration loop is healthy, but not zero: field-mapping on novel vocabularies is the one place the autonomous skill is genuinely weaker than the hand-written one (0.909 vs 1.000), and that is a tool-implementation issue, not a skill-content overfit".
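
The gap arithmetic behind the table in this commit is iteration mean minus holdout mean, followed by a three-band verdict. The sketch below reproduces that arithmetic for the two reported rows. The band edges (0.05 and 0.10) are assumptions chosen only so the sketch agrees with the reported verdicts; the commit message does not state the thresholds that `build_comparison_html.mjs` actually uses, and `generalisationGap` and `verdictFor` are hypothetical names.

```typescript
type Verdict = 'CLEAN' | 'CAUTION' | 'OVERFIT';

// ASSUMPTION: band edges are invented for this sketch, not taken from the script.
const CAUTION_THRESHOLD = 0.05;
const OVERFIT_THRESHOLD = 0.1;

// Positive gap means the variant scored lower on the holdout set.
function generalisationGap(iteration: number, holdout: number): number {
  return iteration - holdout;
}

function verdictFor(gap: number): Verdict {
  if (gap >= OVERFIT_THRESHOLD) return 'OVERFIT';
  if (gap >= CAUTION_THRESHOLD) return 'CAUTION';
  return 'CLEAN';
}

// Means reported in the commit message (Sonnet 4.6).
const rows = [
  { variant: 'Hand-written', iteration: 0.989, holdout: 0.942 },
  { variant: 'Autonomous v5', iteration: 0.989, holdout: 0.927 },
];

for (const row of rows) {
  const gap = generalisationGap(row.iteration, row.holdout);
  console.log(`${row.variant}: gap ${gap.toFixed(3)} -> ${verdictFor(gap)}`);
}
// Hand-written: gap 0.047 -> CLEAN
// Autonomous v5: gap 0.062 -> CAUTION
```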

The previous report overclaimed full autonomy by saying the autonomous variant "ships an independently-authored 4-tool decomposition ... no shared context with the human-authored variant" and called it a "fully autonomous stack". That is true at the agent-facing surface (tool IDs, descriptions, schemas, decomposition, skill content, registration) but NOT at the domain-engine layer: each autonomous tool's handler still imports PCI_REQUIREMENTS, evaluateRequirement, and the ScopeClaim builder directly from the hand-written variant's pci_compliance_* modules.

Recalibrates the framing without changing any numbers:

- §1 intro now distinguishes "agent-facing surface" (independent) from "underlying domain engine" (shared via direct module imports) and points to the new §1.5 ladder.
- §1.5 (new) "Autonomy ladder: what's truly independent vs what's shared": 10-row table covering tool IDs, descriptions, schemas, decomposition, prose, registration as INDEPENDENT and requirement catalog, evaluator, validation schemas, ScopeClaim builder, time helpers as SHARED. Names each shared file.
- §4 verdict banner: "fully autonomous stack" → "surface-level autonomy of tools too", with an explicit caveat that the handler bodies still import the domain engine from the hand-written variant. Calls out the missing follow-up (pci_autonomous_requirements.ts / pci_autonomous_evaluator.ts).
- §6 reasoning bullet 4: "Independently-authored tools" → "Independently-authored tool surface (engine still shared, see §1.5)" with the specific module names that are still being imported.
- §8 Provenance & honesty: new "Honest limitation: autonomy is layered, not total" subsection summarising what the eval numbers measure (agent-surface autonomy on top of a shared engine) and what the next experiment would have to look like (independent engine + zero-import CI test + re-run).

No code, eval numbers, or branch behaviour changed; only the framing of what the eval result is claiming. Sets up the follow-up work of authoring pci_autonomous_requirements.ts, pci_autonomous_evaluator.ts, and pci_autonomous_schemas.ts from the public DSS v4.0.1 spec and re-running.

Make the autonomous skill truly autonomous all the way down.

Previously the four `pci_autonomous_*_tool.ts` handlers re-used the same PCI domain helpers as the hand-written skill (`pci_compliance_schemas`, `pci_compliance_requirements`, `pci_compliance_evaluator`). The agent-facing surface (IDs, schemas, decomposition, registration, skill content) was independent, but the underlying PCI engine was shared.

This commit adds three engine modules in `pci_autonomous_tools/` authored from the PCI DSS v4.0.1 spec without referencing the hand-written ones, and rewires all four tools to use only the autonomous engine:

- `pci_autonomous_schemas.ts`: independent zod input schemas with a stricter time-range guard (no future dates) and a `provenance` block on `PciAutonomousScopeClaim` for auditable autonomy.
- `pci_autonomous_requirements.ts`: independent v4.0.1 catalog with a verdict-typed encoding (`detect_violations` vs `verify_presence`), self-documenting ES|QL params (`?_window_start`/`?_window_end`), enriched `defaultLookback` with rationale, and post-aggregation filtering instead of nested HAVING clauses.
- `pci_autonomous_evaluator.ts`: composable pipeline of pure functions (replacing the nested try/catch pyramid), explicit status→score lookup table (avoiding multiplicative scoring drift), discriminated union for `FieldCapsPreflight`, and a different concurrency runner.

CI lockdown:

- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every file under `pci_autonomous_tools/` and asserts zero imports from the hand-written engine modules, plus that each tool file imports at least one autonomous engine module. The skill-level surface isolation test was also updated to reference the engine lockdown.

All 28 autonomous-skill tests + 3 engine-lockdown tests pass. The next step (v6 results in `comparison.html`) is a fresh iteration+holdout eval run against this engine, which can now be attributed entirely to the autonomous architect.
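
The two assertions of the CI lockdown can be sketched as a pure function over file contents. This is an assumed approach (regex scanning of in-memory sources), not the actual test: `lockdownViolations`, the `SourceFile` shape, and the sample files are hypothetical, while the module names and the `_tool.ts` suffix come from the commit message.

```typescript
// Import specifiers ending in one of the hand-written engine modules.
const FORBIDDEN_IMPORT = /from\s+['"][^'"]*pci_compliance_(requirements|evaluator|schemas)['"]/;
// Import specifiers ending in one of the autonomous engine modules.
const AUTONOMOUS_IMPORT = /from\s+['"][^'"]*pci_autonomous_(requirements|evaluator|schemas)['"]/;

interface SourceFile {
  path: string;
  text: string;
}

function lockdownViolations(files: SourceFile[]): string[] {
  const violations: string[] = [];
  for (const file of files) {
    // (a) no file may import the hand-written engine
    if (FORBIDDEN_IMPORT.test(file.text)) {
      violations.push(`${file.path}: imports hand-written engine`);
    }
    // (b) every tool file must import at least one autonomous engine module
    if (file.path.endsWith('_tool.ts') && !AUTONOMOUS_IMPORT.test(file.text)) {
      violations.push(`${file.path}: no autonomous engine import`);
    }
  }
  return violations;
}

// Illustrative in-memory sources (not the real file contents).
const isolatedTool: SourceFile = {
  path: 'pci_autonomous_tools/pci_autonomous_field_mapper_tool.ts',
  text: "import { buildAutonomousScopeClaim } from './pci_autonomous_schemas';",
};
const leakingTool: SourceFile = {
  path: 'pci_autonomous_tools/pci_autonomous_compliance_check_tool.ts',
  text: "import { PCI_REQUIREMENTS } from '../pci_compliance_requirements';",
};

console.log(lockdownViolations([isolatedTool])); // []
console.log(lockdownViolations([leakingTool]).length); // 2
```

A real test would read the files from disk; keeping the check pure makes the two rules easy to exercise on their own.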

Plug the v6 run (autonomous tools + autonomous engine) into the side-by-side comparison report.

The architect re-authored the PCI domain engine from the public PCI DSS v4.0.1 spec (`pci_autonomous_requirements.ts`, `pci_autonomous_evaluator.ts`, `pci_autonomous_schemas.ts`), with a CI lockdown test asserting zero imports from the hand-written engine.

Eval results:

Iteration set (Sonnet 4.6, 8 scenarios)
- hand-written: 0.989
- auto v5 (own tools, shared engine): 0.989
- auto v6 (own tools + own engine): 0.989 ← deep autonomy at parity

Holdout set (Sonnet 4.6, 6 scenarios)
- hand-written: 0.942
- auto v5: 0.927 (gap −0.062 vs iteration → CAUTION band)
- auto v6: 0.985 (gap −0.004 vs iteration → CLEAN band)

The deep-autonomy engine generalises *better* than the surface-only v5 on the holdout, with substantive wins on the 4h scorecard scenario (+0.100) and the default-account variants scenario (+0.250). Both wins come from the autonomous engine's more deliberate CDE / account-status semantics carrying over to non-fixture data shapes.

Report changes
--------------

- §1.5 autonomy ladder: rewrite the four engine rows from a single "SHARED" red pill to a "v5: SHARED / v6: AUTONOMOUS" pair, and add closing paragraphs that distinguish the two cycles.
- §4 multi-model grid: add the v6 column. The reader can see v5 → v6 was a no-op on iteration scores but a substantive lift on holdout.
- §5 generalisation gap: add a v6 row paired to the v6 holdout run. The pairing logic in build_comparison_html.mjs now strips any trailing `-vN` suffix when looking up the holdout label, so future iterations don't need a code change.
- §6 reasoning bullet: flip the autonomous-side description from "engine still shared" to "tool surface AND domain engine independent (v6)", with the CI lockdown test referenced.
- §8 honest limitation: rewrite as "how the deep-autonomy experiment was constructed (v6)". The prior text said this experiment "is not run here". It is now run here, and the section documents the three re-authored modules, the CI lockdown, and the result.

The verdict banner now references both v5 (surface autonomy) and v6 (deep autonomy) as separate parity events.
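
The `-vN` suffix stripping mentioned for the §5 pairing logic is a one-line regex. The sketch below shows only that normalisation step: `stripVersionSuffix` is a hypothetical name, the labels follow the pattern of the run labels quoted in earlier commits, and how the real script resolves v5 and v6 to their distinct holdout runs is not described in the commit message.

```typescript
// Hypothetical helper: removes a trailing "-v<digits>" version suffix, if present.
function stripVersionSuffix(iterationLabel: string): string {
  return iterationLabel.replace(/-v\d+$/, '');
}

console.log(stripVersionSuffix('sonnet46-autonomous-v6')); // sonnet46-autonomous
console.log(stripVersionSuffix('sonnet46-autonomous-v12')); // sonnet46-autonomous
console.log(stripVersionSuffix('sonnet46-handwritten')); // sonnet46-handwritten
```

Anchoring the pattern with `$` matters: a label that merely contains `-v5` in the middle is left untouched.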

Addresses the v6 deep-autonomy audit findings raised after the architect's own engine modules landed:

Code-quality (autonomous engine modules)
- schemas: tighten REQUIREMENT_ID_PATTERN so `all.1` etc. no longer match; strip stale "cycle-17" docstring references.
- requirements: type catalog as Partial<Record<...>> so undefined lookups must be handled; drop redundant `| LIMIT 1` after un-grouped STATS; remove the as-cast pseudo-anchor (replaced by a runtime invariant in the new test file); strip "cycle-17" docstrings.
- evaluator: scoreFor is exhaustive over the typed SCORE_TABLE so drop the unreachable `?? 0` fallback; runAutonomousWithConcurrency now awaits all in-flight tasks before re-throwing the first error so a single rejection no longer orphans siblings (semantics documented).
- docstrings across index.ts, compliance_check_tool, register_tools, autonomous skill, and experimental_features now consistently describe v6 deep autonomy (independent engine + tools + heuristics) rather than overclaiming or underclaiming shared logic.

Engine unit tests (~85 specs, ~2s)
- pci_autonomous_schemas.test.ts: provenance constants, index-pattern refinements (ESQL injection, length bounds), time-range clamping, requirement-id regex, buildAutonomousScopeClaim dedupe/sort.
- pci_autonomous_requirements.test.ts: catalog completeness, self-referential ids, presence of AUTONOMOUS_TIME_WINDOW placeholders, detect_violations always carries a violation query, defaultLookback sanity, plus a real runtime sync invariant that parses every catalog key through pciAutonomousRequirementIdSchema (replaces the prior compile-time anchor that was suppressed by an `as` cast). Also covers requirementCategory, buildAutonomousTimeWindowParams, time-range resolution, normalize/resolve helpers, and index-pattern helpers.
- pci_autonomous_evaluator.test.ts: concurrency runner correctness + failure semantics, ordered ?_window_start/?_window_end binding, detect_violations RED path, verify_presence GREEN path, AMBER+HIGH / AMBER+LOW / NOT_ASSESSABLE branches via mockResolvedValueOnce, ES|QL failure → query_failed data gap, evidence row clamping.

Reproducibility (#2 from audit)
- build_comparison_html.mjs gains --combined-run <label>=<dir>, which reads a single results.json that mixes pci-compliance:* (iter) and pci-holdout:* (holdout) scenarios and splits them internally. The v6 evaluation report can now be regenerated from one results.json without an ad-hoc helper script.

All four PCI-autonomous Jest suites pass locally (engine + lockdown). No new lint errors introduced (remaining no-continue / no-nested-ternary hits are pre-existing in untouched code).
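The "await all in-flight tasks, then re-throw the first error" semantics described for `runAutonomousWithConcurrency` can be sketched as a small worker pool. This is an illustration under assumptions, not the PR's implementation: the name `runWithConcurrency`, its signature, and the choice to stop picking up new work after the first failure are all hypothetical.

```typescript
// Hypothetical sketch of a bounded-concurrency runner. Name, signature, and the
// "stop scheduling after first failure" policy are assumptions for illustration.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;
  let hasError = false;
  let firstError: unknown;

  const worker = async (): Promise<void> => {
    // Each worker pulls the next index until the queue is empty or a task has failed.
    while (!hasError && next < items.length) {
      const index = next++;
      try {
        results[index] = await task(items[index], index);
      } catch (err) {
        // Record only the first failure; later failures are swallowed.
        if (!hasError) {
          hasError = true;
          firstError = err;
        }
      }
    }
  };

  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());

  // Workers never reject, so this resolves only once every in-flight task has settled.
  await Promise.all(workers);

  if (hasError) throw firstError;
  return results;
}
```

Because each worker catches its own task's rejection, `Promise.all` cannot short-circuit, which is what keeps sibling tasks from being orphaned when one of them fails.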

- comparison.html / build_comparison_html.mjs: extend §8 with a new "v6 hardening: audit fixes + engine unit tests" subsection that spells out the post-v6 audit batch (Partial Record typing, exhaustive scoreFor, dropped LIMIT 1, concurrency failure semantics, stricter REQUIREMENT_ID_PATTERN), the new 85-spec engine test suite (including the runtime catalog↔schema sync invariant that replaces the suppressed compile-time anchor), and the new --combined-run flag for one-shot v6 report regeneration from a single results.json.
- build_comparison_html.mjs: flatten six pre-existing nested ternaries (the §4 multi-runs-vs-live-vs-fallback chain becomes an IIFE with if/else; banner-class / banner-cls / gap-advice / mean-row cls all become let-block assignments). No behaviour changes: the script smoke-runs end-to-end with --combined-run and produces a valid 574-line HTML output with all 11 §-headings intact.
- pci_autonomous_requirements.ts: drop the lone `continue` in resolveAutonomousRequirementIds by inverting the guard into a positive-branch `if (canonical && canonical !== 'all') { ... }`. All 46 requirements specs still pass.

Net result: both files lint clean (0 errors, 0 warnings). The 7 pre-existing lints sitting inside the audit-batch diff zone (1 no-continue and 6 no-nested-ternary) are gone.

infra-vault-gh-plugin-prod · 2026-05-12T06:57:53Z


…rift test

Addresses two follow-up findings on PR elastic#268798:

#2: Lockdown test (pci_autonomous_modules_no_handwritten_imports.test.ts): broaden the import deny-list to cover the full hand-written PCI surface, not just the three engine modules. Now blocks:
- pci_compliance_tool
- pci_compliance_evaluator
- pci_compliance_requirements
- pci_compliance_schemas
- pci_field_mapper_tool
- pci_scope_discovery_tool
- anything under skills/pci_compliance/**

The previous deny-list only covered the engine trio, which left a silent re-coupling path: a future contributor could import the hand-written orchestrator tool or scope-discovery helper and pass CI. The deep-autonomy guarantee in comparison.html §1.5 is broader than the engine (it covers every hand-written surface), so the lockdown should match.

#4: New comparison_html.test.ts: structural snapshot for the committed report. Asserts that the 11 §-level sections appear (in expected order) and the v6 hardening / deep-autonomy h3 subsections are present. Catches the two drift directions between comparison.html and scripts/build_comparison_html.mjs:
1. someone edits the HTML directly and forgets to update the template;
2. someone edits the template and forgets to regenerate + commit.

Deliberately not byte-for-byte equality: the rendered HTML legitimately changes with each eval refresh and we don't want CI noise on prose tweaks.
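The broadened deny-list check can be sketched as a pure function over a file's source text. This is a minimal illustration, assuming a regex-based scan of import specifiers; the helper name `findForbiddenImports`, the regexes, and the example import paths are hypothetical, and the real test may walk the directory and match imports differently.

```typescript
// Hypothetical deny-list mirroring the module names listed in the commit message.
const FORBIDDEN_IMPORT_PATTERNS: readonly RegExp[] = [
  /pci_compliance_(tool|evaluator|requirements|schemas)/,
  /pci_field_mapper_tool/,
  /pci_scope_discovery_tool/,
  /skills\/pci_compliance\//,
];

// Return every import specifier in `source` that matches the deny-list.
// Covers `import ... from '...'`, side-effect and dynamic `import`, and `require('...')`.
function findForbiddenImports(source: string): string[] {
  const importSpecifier =
    /(?:\bfrom\s*|\bimport\s*\(?\s*|\brequire\s*\(\s*)['"]([^'"]+)['"]/g;
  const hits: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importSpecifier.exec(source)) !== null) {
    const specifier = match[1];
    if (FORBIDDEN_IMPORT_PATTERNS.some((pattern) => pattern.test(specifier))) {
      hits.push(specifier);
    }
  }
  return hits;
}
```

A lockdown test built this way would read each file under `pci_autonomous_tools/` and assert that `findForbiddenImports` returns an empty array for every one of them.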

Address the 15 findings from the autonomous PCI deep-analysis audit covering the engine modules, the four agent-facing tools, and the skill prompt.

Blockers
- Scope-discovery tool now returns a `discoveryClaim` (point-in-time snapshot) instead of a mis-shaped `scopeClaim`, surfaces ES errors as structured `dataGaps`, and validates `cat.indices` responses with a zod schema before walking them.
- Requirements catalog: dropped the unused `requiredCategories[]` field and the orphan `requirementCategory()` helper. Removed `NOT_APPLICABLE` from `AutonomousComplianceStatus`: it was carried in the score table but never produced by any evaluator path.
- Scorecard report no longer tags its synthesised executive roll-up as `ToolResultType.esqlResults` (the payload is not an ESQL row set); it now lands under `ToolResultType.other` so downstream UX/telemetry that special-cases `esqlResults` does not mis-render it.

Important
- Skill prompt rewritten: workflow is now `discover → roll up → drill down`. The check and scorecard tools are explicitly designed to be used as a sequence and share one evaluator via the new `runAutonomousPciEvaluationPack` orchestration helper.
- Both tools now derive `overallStatus` from the same severity rollup (`rollupAutonomousOverallStatus`) and `overallConfidence` from the same confidence rollup (`rollupAutonomousConfidence`), eliminating the previous risk of disagreement.
- Field-mapper sensitive-field regex tightened: the previous bare `/token/i` over-matched (e.g. it would have flagged `tokenizer`). Replaced with anchored patterns for `card`, `pan`, `cvv`, `cvc`, `account.number`, `credit.card`, `ssn`, `secret`, `password`, `api.key`, and specific `*token` shapes.
- Added an `assertNever` exhaustiveness check on the `statusToHumanLabel` switch: adding a new status without updating the switch now fails at compile time, with the runtime assertion as a backstop.

Nice-to-haves
- Removed experiment-only metadata (gate scores, citation counts, architect attribution, brittle `comparison.html §1.5` cross-refs) from every runtime file. Authoring metadata stays beside the eval suite.
- "Recommended Remediation SLA" table in the skill prompt re-labelled as operational guidance: only the 30-day req 6.3.3 window is spec-sourced; the rest are heuristics a QSA would typically agree with but an org may tune.
- SAQ scope-reduction "70%" claim re-cast as the assessor-guidance heuristic range (50–80%), not a guarantee.
- `requirementCategory` tests removed; weak `['HIGH','MEDIUM']` evaluator assertion pinned to the exact value (`MEDIUM` via the coverage-stage no-violation-query path).
- New `buildAutonomousDiscoveryClaim` helper + 4-spec test block covering dedupe/sort, provenance pinning, point-in-time semantics, and stable shape across shuffled inputs.

Verification
- ESLint: 14 files, clean.
- Jest: 101/101 pass in `pci_autonomous_tools/` + the autonomous skill suite, 16/16 pass in `comparison_html.test.ts`.
- Scoped `tsc -b` against `security_solution/tsconfig.type_check.json`: green.
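The `assertNever` pattern mentioned for `statusToHumanLabel` can be sketched as follows. The union members are inferred from the statuses named in these commit messages (with `NOT_APPLICABLE` removed), and the human-readable labels are invented for illustration; neither is confirmed to match the PR's code.

```typescript
// Assumed status union; the real AutonomousComplianceStatus may differ.
type AutonomousComplianceStatus = "GREEN" | "AMBER" | "RED" | "NOT_ASSESSABLE";

// Compile-time: the argument must be narrowed to `never`, so a missing case is a type error.
// Runtime: throws if an unexpected value reaches the default branch anyway.
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

// Hypothetical labels, for illustration only.
function statusToHumanLabel(status: AutonomousComplianceStatus): string {
  switch (status) {
    case "GREEN":
      return "Compliant";
    case "AMBER":
      return "Partially compliant";
    case "RED":
      return "Non-compliant";
    case "NOT_ASSESSABLE":
      return "Not assessable";
    default:
      return assertNever(status);
  }
}
```

If a new member were added to the union without a matching `case`, `status` would no longer narrow to `never` in the default branch and the call to `assertNever` would stop type-checking.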

patrykkopycinski added 11 commits May 10, 2026 20:50

patrykkopycinski added 2 commits May 12, 2026 09:43

patrykkopycinski wants to merge 13 commits into elastic:main from patrykkopycinski:pk/autonomous-vs-handwritten-pci
