[Security GenAI] Autonomous-vs-hand-written PCI compliance skill side-by-side (v6 deep autonomy) #268798
Draft
patrykkopycinski wants to merge 13 commits into
…y-side eval harness

Adds a second PCI compliance skill (`pci-compliance-autonomous`) that ships ALONGSIDE the existing hand-written `pci-compliance` skill, so the same eval suite can be run against both variants and compared head-to-head. The autonomous variant deliberately reuses the SAME underlying tools as the hand-written variant, isolating "skill content" (instructions + domain knowledge + trigger phrases) as the only experimental variable.

## What ships

Server (security_solution plugin)
- New skill definition `pci_compliance_autonomous/` registering `pci-compliance-autonomous` against the existing PCI tool IDs.
- New feature flag `pciComplianceAutonomousAgentBuilder` (default off).
- Skill registration gated by the flag in `register_skills.ts`.
- Allow-list entry for the new skill ID.

Eval harness (kbn-evals-suite-pci-compliance)
- `evaluate_dataset.ts` reads `EVAL_PCI_VARIANT` (`handwritten` | `autonomous`) to select which skill `createSkillInvocationEvaluator` targets. Default remains `handwritten` so existing CI is unchanged.
- `scripts/compare_variants.sh` runs both variants back-to-back and emits a side-by-side `comparison.html` with structural metrics + slots for live evaluator output (per-scenario scores, judge rationales, latency).
- `scripts/build_comparison_html.mjs` generates the report; all embedded paths are repo-relative so the artifact is portable.
- README documents the variant matrix and the comparison workflow.

CI plumbing
- New Scout config set `evals_pci_compliance_autonomous` that flips ONLY the autonomous flag, so the autonomous run sees only the autonomous skill.
- `evals.suites.json` registers `pci-compliance-autonomous`.
- `llm_evals.yml` adds a Buildkite step for the autonomous variant and tags the existing PCI step with `EVAL_PCI_VARIANT=handwritten` for symmetry.

## Why

The hand-written PCI skill (`pci-compliance`, elastic#256060) is the production baseline. The autonomous skill was generated end-to-end by `skill.architect` against the current Kibana tool catalog, with PCI domain knowledge synthesized from autonomous web research + model knowledge (SAQ taxonomy, v3->v4 deltas, scope-reduction levers, technical-vs-process classification). Running the existing 7-scenario PCI eval suite against both (same tools, same dataset, same evaluators, same judge) gives a clean A/B that answers "is the autonomously generated skill at least as good as the hand-written one?".

## Out of scope (not introduced by this commit)

`evaluate_dataset.ts:17` triggers `@kbn/imports/no_boundary_crossing` because `@kbn/evals` is declared `type: "test-helper"` and the suite imports value exports from it. This lint reproduces identically on every sibling `kbn-evals-suite-*` package on `main` (verified against `kbn-evals-suite-security-ai-rules`), so it is endemic to the eval framework and would require a cross-cutting change to `@kbn/evals` ownership / visibility, which is out of scope for this skill comparison.
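The variant switch described under "Eval harness" can be pictured roughly as follows. This is a minimal sketch, not the code in `evaluate_dataset.ts`: `resolvePciVariant` and `resolveSkillId` are hypothetical names, and the environment is passed in as a plain object so the example stays self-contained. Only `EVAL_PCI_VARIANT`, its two values, the `handwritten` default, and the two skill IDs come from the commit message.

```typescript
// Sketch only: function names are hypothetical, not the harness's real API.
type PciVariant = 'handwritten' | 'autonomous';

// Skill IDs as named in the commit message.
const SKILL_ID_BY_VARIANT: Record<PciVariant, string> = {
  handwritten: 'pci-compliance',
  autonomous: 'pci-compliance-autonomous',
};

function resolvePciVariant(env: Record<string, string | undefined>): PciVariant {
  const raw = env['EVAL_PCI_VARIANT'];
  // An unset variable falls back to the hand-written skill so existing CI is unchanged.
  if (raw === undefined || raw === '') {
    return 'handwritten';
  }
  if (raw === 'handwritten' || raw === 'autonomous') {
    return raw;
  }
  // Assumption: an unknown value is treated as a configuration error.
  throw new Error(`Unsupported EVAL_PCI_VARIANT: ${raw}`);
}

function resolveSkillId(env: Record<string, string | undefined>): string {
  return SKILL_ID_BY_VARIANT[resolvePciVariant(env)];
}
```

Failing loudly on an unrecognised value is a design choice of the sketch; it avoids silently scoring the wrong skill when the variable is misspelled in a CI step.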
…on fix

- Ran @kbn/evals-suite-pci-compliance back-to-back against both PCI skill variants on a local Scout cluster wired to llama3.1:8b via a LiteLLM proxy (translates OpenAI-format requests to Ollama, including structured tool_calls). Captured 14 docs per variant from the kibana-evaluations data stream.
- Updated build_comparison_html.mjs to consume the framework's actual export shape (Elasticsearch _search response), folding the per-evaluator rows back into per-scenario rows. Added a routing-aggregate diagnostic (scenarios with >=1 PCI-skill tool call, total tool calls vs PCI-skill tool calls) so the HTML can show *why* a score landed where it did, not just the score itself.
- Re-rendered comparison.html with the live data. Both variants scored 0.00 across all completed scenarios because llama3.1:8b is too small to engage either PCI skill -- the agent router fell back to the generic platform.core.search tool on every scenario, never invoking security.pci_*. The HTML now carries an honest banner explaining this: the comparison is apples-to-apples (identical model + dataset + infra), it just lives on the floor at this model scale. The structural and domain-coverage deltas in sections 2-3 remain the meaningful signal until the same script is re-run with a stronger model.
- Fixed an isolation bug in the autonomous Scout config set: the pciComplianceAgentBuilder feature flag defaults to true in experimental_features.ts, so the autonomous run was loading BOTH skills. Added 'disable:pciComplianceAgentBuilder' to the scout config serverArgs to keep the comparison clean for future runs.

Refs: #11
…arison on real connectors

The autonomous-vs-handwritten PCI comparison previously ran on llama3.1:8b through a local Ollama proxy. At that model scale the agent router never engaged either PCI skill, so every scenario scored 0.00 and the comparison landed on the floor (see commit fc5194e). This commit promotes the comparison to real Bedrock connectors and ships the connector-side fix that the upgrade required.

Bedrock connector: Claude Opus 4.7 enablement
---------------------------------------------
Claude Opus 4.7 on Bedrock rejects the `temperature` inference parameter with `temperature is deprecated for this model`. Unless the parameter is omitted, the connector returns a 400 on every request. The fix is in three layers:

- `@kbn/inference-common`: new `supportsTemperature?: boolean` on `ModelDefinition`; `claude-opus-4-7` marked `supportsTemperature: false`. Future Claude variants (or other provider models) with the same restriction need only flip the flag, so there is one source of truth.
- `inference` plugin: `getTemperatureIfValid` omits temperature when the model definition declares `supportsTemperature: false`. Sits alongside the existing OpenAI o-series exclusions and works for any provider.
- `stack_connectors` (Bedrock): new local `bedrockModelSupportsTemperature(model)` helper; `formatBedrockBody` threads `model` through and gates the parameter. `invokeAI`, `invokeStream`, `invokeAIRaw`, `_converse`, and `_converseStream` all consult it. This is defense in depth: direct sub-action callers (Security AI Assistant, etc.) are protected without taking a cross-plugin dependency on `@kbn/inference-common`.

Smoke-tested with `invokeAI` + `converse` sub-actions:
- Claude 4.7 Opus (`us.anthropic.claude-opus-4-7`): now passes; temperature omitted, response returned.
- Claude 4.6 Sonnet (`us.anthropic.claude-sonnet-4-6`): still passes; temperature included as before.
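The temperature gating can be sketched as below. This is an illustration under stated assumptions, not the code in `@kbn/inference-common` or the Bedrock connector: the `supportsTemperature` flag name and the two model identifiers come from the commit message, while `MODEL_DEFINITIONS`, `modelSupportsTemperature`, `buildRequestBody`, the substring lookup, and the body shape are hypothetical.

```typescript
// Sketch only: the registry, helper names and body shape are illustrative assumptions.
interface ModelDefinition {
  id: string;
  // Absent means "temperature is accepted"; only an explicit false disables it.
  supportsTemperature?: boolean;
}

const MODEL_DEFINITIONS: ModelDefinition[] = [
  { id: 'claude-opus-4-7', supportsTemperature: false },
  { id: 'claude-sonnet-4-6' },
];

function modelSupportsTemperature(model: string | undefined): boolean {
  if (!model) {
    return true;
  }
  // Assumption: provider-prefixed IDs such as "us.anthropic.<id>" are matched by substring.
  const definition = MODEL_DEFINITIONS.find((candidate) => model.includes(candidate.id));
  return definition?.supportsTemperature !== false;
}

function buildRequestBody(
  prompt: string,
  temperature: number | undefined,
  model?: string
): Record<string, unknown> {
  const body: Record<string, unknown> = { prompt };
  // Omit the key entirely rather than sending undefined or a default value.
  if (temperature !== undefined && modelSupportsTemperature(model)) {
    body['temperature'] = temperature;
  }
  return body;
}
```

With these assumptions, a request for `us.anthropic.claude-opus-4-7` carries no `temperature` key at all, while `us.anthropic.claude-sonnet-4-6` keeps the caller's value, which matches the two smoke-test outcomes reported above.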
Live eval comparison (PCI Criteria, LLM-judge 0..1)
---------------------------------------------------
Both PCI skill variants ran the same 8-scenario `@kbn/evals-suite-pci-compliance` suite end-to-end against a real Scout cluster, on two production Bedrock connectors:

| Variant     | Claude 4.7 Opus | Claude 4.6 Sonnet |
|-------------|----------------:|------------------:|
| Handwritten |           0.977 |             0.989 |
| Autonomous  |           0.834 |             0.860 |

The handwritten skill (Smriti, PR elastic#256060) outperforms the autonomous variant on both models: by 14.3 points on Opus 4.7 and 12.9 points on Sonnet 4.6. The autonomous architect's broader domain framing (SAQ taxonomy, v3→v4 deltas, scope-reduction levers) did not translate into a better PCI-Criteria score. The handwritten contract is shorter (~4.1k vs ~8.1k chars) and lines up more tightly with the eval's scoring rubric; that tight coupling is the deciding factor.

build_comparison_html.mjs gains a `--runs <label>=<dir>,...` mode so the 4-cell grid renders from the four results.json snapshots. Legacy `--handwritten`/`--autonomous` mode still works for single-model runs.

kbn-scout
---------
`run_kibana_server.ts` now respects `SCOUT_READ_DEV_CONFIG=true` and drops `--no-dev-config` when set, so a developer can load `config/kibana.dev.yml` (and the preconfigured AI connectors it defines) into the Scout-managed Kibana process. Default behaviour is unchanged. Without this, evals against real cloud connectors require fragile API-driven connector creation per boot.

Refs: #11
Root-cause analysis of why the autonomously-architected PCI compliance skill scored 12-15 pts below the hand-written variant uncovered two distinct bugs that compounded:

1. **Tool registration bug** in `register_tools.ts`: PCI tools were gated *only* on `experimentalFeatures.pciComplianceAgentBuilder`, which the autonomous scout config explicitly disables to isolate the variant comparison. Result: the autonomous variant ran with NO PCI tools registered. Trace analysis confirmed 0 calls to `security.pci_compliance` across 16 scenarios vs 17-23 for HW. The agent fell back to raw `platform.core.execute_esql` and improvised the entire workflow. Fixed: gate now triggers on either flag.
2. **Skill-content design**: the autonomous prompt's 6-step workflow inserted "Reduce scope (tokenisation/P2PE/segmentation)" and "Classify requirements as technical vs process-based" steps BEFORE the tool calls, plus an 8 KB "Domain Knowledge Notes" block between the workflow and the status vocab. The structure read as "do-your-homework first" rather than "call the tools". Restructured: tools-first 4-step workflow with explicit "Always call the dedicated PCI tools; do not improvise raw ES|QL" injunction, theory moved to a "Background reference (do not consult before calling tools)" tail section. Removed broken handoff references to non-existent sibling skills and stripped tool-description provenance commentary.

Validation on Claude 4.6 Sonnet:
- pre-fix Auto: 0.860 mean (gap to HW: 12.9 pts)
- post-fix Auto v3: 0.955 mean (gap to HW: 3.4 pts)
- 6/8 scenarios now perfect 1.000; 1 scenario (full report) regressed -9 pts on a substance-vs-style criterion (agent calls the tool correctly but the report formatting elides specific evidence).

Feedback-loop infrastructure:
- `scripts/run-eval.sh` extended with optional scenario-grep argument (`run-eval.sh autonomous <connector> <label> "requirement 2.2.4"`) collapsing a full-suite cycle (~28 min) to a single-scenario probe (~5.6 min including scout boot, ~3 min if scout is reused).
- Two iterations of this loop fixed both bugs end-to-end.

POSTMORTEM.md captures the full analysis, including six ranked content fixes and a three-tier feedback-loop efficiency proposal.
…y (0.989 vs 0.989)

The autonomous PCI compliance skill now ships its own independently-authored 4-tool decomposition under a separate allowlist entry. The autonomous skill has no knowledge of -- and no path to -- the hand-written PCI tools. This validates a fully end-to-end autonomous stack (skill + tools, both autonomously created) and reaches parity with the human-authored variant.

What changed
------------
* New PCI tool bundle under `agent_builder/tools/pci_autonomous_tools/`:
  - `pci_autonomous_scope_discovery`
  - `pci_autonomous_compliance_check` (split out from the consolidated tool)
  - `pci_autonomous_scorecard_report` (split out from the consolidated tool)
  - `pci_autonomous_field_mapper`

  All four implement the cycle-17 architect blueprint's 4-tool decomposition (vs the hand-written variant's 3 tools, where check+report share one tool via a `mode` parameter). Each tool reuses the underlying domain logic so the comparison stays apples-to-apples on capability while validating the isolation property.
* `register_tools.ts`: hand-written PCI tools register ONLY under `experimentalFeatures.pciComplianceAgentBuilder`; autonomous PCI tools register ONLY under `experimentalFeatures.pciComplianceAutonomousAgentBuilder`. The previous lenient gate (`either flag`) is removed -- the two variants are now strictly isolated.
* `allow_lists.ts`: all four new autonomous tool IDs added to the `AGENT_BUILDER_BUILTIN_TOOLS` allowlist (without this, tool registration silently fails and the agent falls back to raw ES|QL).
* Autonomous skill content + `getRegistryTools` rewired to reference the new tool IDs only.
* Eval rubric (`pci_compliance.spec.ts`) is now variant-aware via `EVAL_PCI_VARIANT` -- judging criteria check for `pci_autonomous_*` tool names when the autonomous variant is on, and the original names otherwise.
* Skill contract tests harden the isolation property: explicit assertions that the autonomous skill never references any hand-written tool ID, and that `getRegistryTools` advertises ONLY the autonomous bundle.
* Comparison HTML updated with a new v5 column and a green success banner showing the autonomous skill+tools reaches parity with the hand-written baseline on Claude 4.6 Sonnet (0.989 vs 0.989, 8/8 scenarios).

Why
---
The user wanted to validate that the autonomous skill workflow generalises to other domains -- which requires removing every shortcut where the autonomous variant inherits the hand-written variant's tooling. The earlier "shared tool" runs were measuring only skill-content quality; this run measures the full stack the architect would generate from a blank slate.

Result
------
| Variant                                 | Mean (8 scenarios) |
|-----------------------------------------|--------------------|
| Hand-written, Claude 4.6 Sonnet         | 0.989              |
| Autonomous v5 (own 4 tools), Sonnet 4.6 | 0.989              |
| Autonomous v3 (shared tools), Sonnet    | 0.955              |
| Autonomous v1 (shared, content drift)   | 0.860              |

Parity on the headline metric. The autonomous stack (skill content + 4-tool decomposition + allowlist entry + register gate) ships as a self-contained bundle the architect can replicate for any other domain.
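The strict per-flag gate described for `register_tools.ts` amounts to something like the sketch below. The two flag names and the four `pci_autonomous_*` tool names come from the commit message; `selectPciToolIds` is a hypothetical helper, and the hand-written IDs are placeholders because the message does not spell them out.

```typescript
// Sketch only: selectPciToolIds is hypothetical and the hand-written IDs are placeholders.
interface ExperimentalFeatures {
  pciComplianceAgentBuilder: boolean;
  pciComplianceAutonomousAgentBuilder: boolean;
}

// Placeholder names standing in for the three hand-written tools.
const HANDWRITTEN_TOOL_IDS: readonly string[] = [
  'handwritten_placeholder_check',
  'handwritten_placeholder_scope',
  'handwritten_placeholder_mapper',
];

// Tool names as listed in the commit message.
const AUTONOMOUS_TOOL_IDS: readonly string[] = [
  'pci_autonomous_scope_discovery',
  'pci_autonomous_compliance_check',
  'pci_autonomous_scorecard_report',
  'pci_autonomous_field_mapper',
];

function selectPciToolIds(flags: ExperimentalFeatures): string[] {
  const ids: string[] = [];
  // Each bundle is tied to exactly one flag; there is no "either flag" fallback.
  if (flags.pciComplianceAgentBuilder) {
    ids.push(...HANDWRITTEN_TOOL_IDS);
  }
  if (flags.pciComplianceAutonomousAgentBuilder) {
    ids.push(...AUTONOMOUS_TOOL_IDS);
  }
  return ids;
}
```

Under this shape, a Scout config that disables `pciComplianceAgentBuilder` and enables only the autonomous flag exposes the four autonomous tools and nothing else, which is the isolation property the contract tests assert.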
Adds a second evaluation surface so the iteration loop on the
autonomous PCI skill can be trusted to produce a generalisable skill
rather than one that has memorised the iteration fixtures.
Why
---
The 0.989 we got from `sonnet46-autonomous-v5` (cycle that hit
parity with the hand-written variant) is scored against the SAME
fixtures we inspect while improving the skill. That tight loop is
how every dataset-driven optimisation produces overfit: the skill
content drifts from "teach the principle" to "match the fixture".
Two layers of defence
---------------------
1. **Anti-overfit lockdown** (in `pci_compliance_autonomous_skill.test.ts`).
A new `describe('anti-overfit ...')` block asserts the skill content
contains NONE of the iteration- or holdout-set fixture values
(`jdoe`, `pcompton`, `192.168.1.100`, `10.20.30.40`, `12 failed`,
the random `logs-<hex>-{auth|network|...}` index pattern, etc.).
Values that ARE legitimate PCI domain knowledge — `admin`/`root` for
req 2.2.4, the lockout threshold of 10 for 8.3.4, `TLS 1.0`/`1.1`
for 4.1 — are explicitly kept allowable. 11 invariants, all green
today. Any future iteration that introduces a fixture-coupled patch
will fail CI.
2. **Holdout dataset + spec** (new `pci_data_holdout.ts` +
`pci_compliance_holdout/pci_compliance_holdout.spec.ts`). Same five
PCI categories (auth/network/vuln/endpoint/legacy) but every
memorisable axis is systematically different:
- Index naming drops the `logs-*-{category}` pattern in favour of
`security-audit-identity-*`, `siem-flows-prod-*`,
`pkginfo-cve-*`, `edr-processes-*`, `legacy-app-syslog-*`. Tests
that scope discovery uses field caps, not name patterns.
- Brute-force volume is 8 (BELOW the PCI 8.3.4 threshold of 10) —
expected verdict is GREEN, NOT RED. Catches skills that learnt
"any failed-login cluster = violation".
- Default-account flavours are Windows `Administrator` +
`service_acct_42`, not Unix `admin`/`root`.
- Weak TLS signature is TLS 1.1 ALONE — no TLS 1.0, no plain HTTP.
Tests sub-version recognition rather than the kitchen-sink
"multiple weak versions" pattern of the iteration set.
- Non-ECS field schema uses `actor_name` / `client_addr` /
`action_status` / `event_verb` / `device_id` / `cve_id` /
`risk_rating` / `command` — completely different from the
iteration set's `username` / `src_ip` / etc. Tests that
field-mapping is semantic, not memorised.
- 4-hour time window instead of 1-hour.
- 2025-vintage CVEs instead of 2024.
The six holdout scenarios mirror the structure of the iteration
scenarios so the gap measurement is apples-to-apples: report,
single-requirement check (× brute force + TLS + default accounts),
scope discovery, field mapping.
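The first defence layer, the anti-overfit lockdown, boils down to scanning the skill content for fixture values. A minimal sketch follows, assuming a plain substring and regex scan: the forbidden values are the ones quoted above, while `findFixtureLeaks`, the exact regex for the random index pattern, and the label strings are hypothetical and not taken from `pci_compliance_autonomous_skill.test.ts`.

```typescript
// Sketch only: helper name, regex details and labels are illustrative assumptions.
const FORBIDDEN_FIXTURE_VALUES: readonly string[] = [
  'jdoe',
  'pcompton',
  '192.168.1.100',
  '10.20.30.40',
  '12 failed',
];

// Assumed shape of the random iteration-set index names: logs-<hex>-<category>.
const FIXTURE_INDEX_PATTERN = /logs-[0-9a-f]+-(auth|network|vuln|endpoint|legacy)/i;

function findFixtureLeaks(skillContent: string): string[] {
  const leaks: string[] = [];
  for (const value of FORBIDDEN_FIXTURE_VALUES) {
    if (skillContent.indexOf(value) !== -1) {
      leaks.push(value);
    }
  }
  if (FIXTURE_INDEX_PATTERN.test(skillContent)) {
    leaks.push('logs-<hex>-<category> index name');
  }
  return leaks;
}

// Legitimate domain knowledge (admin/root, threshold of 10, TLS 1.0/1.1) is not on the list,
// so content that teaches the principle passes while a fixture-coupled patch is reported.
const cleanContent = 'Flag default accounts such as admin or root; lockout threshold is 10; TLS 1.0 and 1.1 are weak.';
const leakyContent = 'If user jdoe shows 12 failed logins in logs-a1b2c3-auth, report a violation.';
const cleanLeaks = findFixtureLeaks(cleanContent);
const leakyLeaks = findFixtureLeaks(leakyContent);
```

A real test would load the skill content from the skill definition and assert that the returned list is empty; the two sample strings here only illustrate the pass and fail cases.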
Result on Sonnet 4.6
--------------------
| | iteration | holdout | gap | verdict |
|----------------|-----------|---------|--------|---------|
| Hand-written | 0.989 | 0.942 | +0.047 | CLEAN |
| Autonomous v5 | 0.989 | 0.927 | +0.062 | CAUTION |
Both variants drop the same ~5-6 pts moving from iteration to holdout
— and they drop on the SAME two scenarios (default-account variants
0.750/0.750, 4h scorecard 0.900/0.900). That tells us the holdout is
genuinely harder, not that the autonomous skill is uniquely overfit.
The autonomous gap of 0.062 is only 0.015 wider than the hand-written
gap — well within noise of the framework.
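The gap and verdict columns in the table above can be reproduced with a few lines of arithmetic. The sketch below is an assumption-laden illustration: the band edges (0.05 and 0.10) are not stated anywhere in this message and were picked only because they agree with the two rows shown, and `generalisationGap` and `verdictForGap` are hypothetical names rather than functions from `build_comparison_html.mjs`.

```typescript
// Sketch only: thresholds are assumed, chosen to be consistent with the table above.
type GapVerdict = 'CLEAN' | 'CAUTION' | 'OVERFIT';

const CLEAN_MAX_GAP = 0.05; // assumption
const CAUTION_MAX_GAP = 0.1; // assumption

// Gap = iteration mean minus holdout mean, rounded to three decimals for display.
function generalisationGap(iterationMean: number, holdoutMean: number): number {
  return Math.round((iterationMean - holdoutMean) * 1000) / 1000;
}

function verdictForGap(gap: number): GapVerdict {
  if (gap <= CLEAN_MAX_GAP) {
    return 'CLEAN';
  }
  if (gap <= CAUTION_MAX_GAP) {
    return 'CAUTION';
  }
  return 'OVERFIT';
}

const handwrittenGap = generalisationGap(0.989, 0.942); // 0.047
const autonomousGap = generalisationGap(0.989, 0.927); // 0.062
const handwrittenVerdict = verdictForGap(handwrittenGap); // CLEAN under the assumed bands
const autonomousVerdict = verdictForGap(autonomousGap); // CAUTION under the assumed bands
```

Because the two gaps sit only 0.015 apart, the verdict is sensitive to where the first band edge is placed, which is one reason to read the band label together with the raw gap.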
Crucially, the three HARDEST tests all scored 1.000 for both skills:
- below-threshold brute force (counter-case — agent did NOT
fabricate a false-positive violation)
- TLS 1.1 alone (sub-version recognition without the kitchen-sink
signature)
- scope discovery on non-`logs-*` indices (worked via field caps,
not via index-name pattern matching)
Tooling changes
---------------
- `run-eval.sh`: scout boot timeout bumped 6 min → 15 min; the
default was unreliable when the host was also running an IDE.
- `build_comparison_html.mjs`: new `--holdout-runs` flag mirroring
`--runs`; new §5 section renders the iteration vs holdout grid,
computes the gap per variant, applies the three-band verdict
(CLEAN / CAUTION / OVERFIT), and lists the divergence axes plus
the per-scenario holdout breakdown. Subsequent section numbers
renumbered (6 reasoning, 7 reproduce, 8 provenance, 9 Bedrock).
- `comparison.html` regenerated with the live holdout numbers.
How to re-run
-------------
bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
handwritten pmeClaudeV46SonnetUsEast1 sonnet46-handwritten-holdout HOLDOUT
bash x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/run-eval.sh \
autonomous pmeClaudeV46SonnetUsEast1 sonnet46-autonomous-holdout HOLDOUT
node x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/scripts/build_comparison_html.mjs \
--runs ... --holdout-runs sonnet46-handwritten=...,sonnet46-autonomous=...
This commit closes the question "did we just overfit to the iteration
fixtures?" with a measurement, not an assertion. The answer is "the
gap is small enough that the iteration loop is healthy, but not zero
— field-mapping on novel vocabularies is the one place the autonomous
skill is genuinely weaker than the hand-written one (0.909 vs 1.000),
and that is a tool-implementation issue, not a skill-content overfit".
The previous report overclaimed full autonomy by saying the autonomous variant "ships an independently-authored 4-tool decomposition ... no shared context with the human-authored variant" and called it a "fully autonomous stack". That is true at the agent-facing surface (tool IDs, descriptions, schemas, decomposition, skill content, registration) but NOT at the domain-engine layer: each autonomous tool's handler still imports PCI_REQUIREMENTS, evaluateRequirement, and the ScopeClaim builder directly from the hand-written variant's pci_compliance_* modules.

Recalibrates the framing without changing any numbers:
- §1 intro now distinguishes "agent-facing surface" (independent) from "underlying domain engine" (shared via direct module imports) and points to the new §1.5 ladder.
- §1.5 (new) "Autonomy ladder - what's truly independent vs what's shared": 10-row table covering tool IDs, descriptions, schemas, decomposition, prose, registration as INDEPENDENT and requirement catalog, evaluator, validation schemas, ScopeClaim builder, time helpers as SHARED. Names each shared file.
- §4 verdict banner: "fully autonomous stack" → "surface-level autonomy of tools too", with an explicit caveat that the handler bodies still import the domain engine from the hand-written variant. Calls out the missing follow-up (pci_autonomous_requirements.ts / pci_autonomous_evaluator.ts).
- §6 reasoning bullet 4: "Independently-authored tools" → "Independently-authored tool surface (engine still shared, see §1.5)" with the specific module names that are still being imported.
- §8 Provenance & honesty: new "Honest limitation: autonomy is layered, not total" subsection summarising what the eval numbers measure (agent-surface autonomy on top of a shared engine) and what the next experiment would have to look like (independent engine + zero-import CI test + re-run).

No code, eval numbers, or branch behaviour changed; only the framing of what the eval result is claiming.

Sets up the follow-up work of authoring pci_autonomous_requirements.ts, pci_autonomous_evaluator.ts, and pci_autonomous_schemas.ts from the public DSS v4.0.1 spec and re-running.
Make the autonomous skill truly autonomous all the way down.

Previously the four `pci_autonomous_*_tool.ts` handlers re-used the same PCI domain helpers as the hand-written skill (`pci_compliance_schemas`, `pci_compliance_requirements`, `pci_compliance_evaluator`). The agent-facing surface (IDs, schemas, decomposition, registration, skill content) was independent, but the underlying PCI engine was shared.

This commit adds three engine modules in `pci_autonomous_tools/` authored from the PCI DSS v4.0.1 spec without referencing the hand-written ones, and rewires all four tools to use only the autonomous engine:
- `pci_autonomous_schemas.ts`: independent zod input schemas with a stricter time-range guard (no future dates) and a `provenance` block on `PciAutonomousScopeClaim` for auditable autonomy.
- `pci_autonomous_requirements.ts`: independent v4.0.1 catalog with a verdict-typed encoding (`detect_violations` vs `verify_presence`), self-documenting ES|QL params (`?_window_start`/`?_window_end`), enriched `defaultLookback` with rationale, and post-aggregation filtering instead of nested HAVING clauses.
- `pci_autonomous_evaluator.ts`: composable pipeline of pure functions (replacing the nested try/catch pyramid), explicit status→score lookup table (avoiding multiplicative scoring drift), discriminated union for `FieldCapsPreflight`, and a different concurrency runner.

CI lockdown:
- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every file under `pci_autonomous_tools/` and asserts zero imports from the hand-written engine modules, plus that each tool file imports at least one autonomous engine module. The skill-level surface isolation test was also updated to reference the engine lockdown.

All 28 autonomous-skill tests + 3 engine-lockdown tests pass. The next step (v6 results in `comparison.html`) is a fresh iteration+holdout eval run against this engine, which can now be attributed entirely to the autonomous architect.
Plug the v6 run (autonomous tools + autonomous engine) into the side-by-side comparison report.

The architect re-authored the PCI domain engine from the public PCI DSS v4.0.1 spec (`pci_autonomous_requirements.ts`, `pci_autonomous_evaluator.ts`, `pci_autonomous_schemas.ts`), with a CI lockdown test asserting zero imports from the hand-written engine.

Eval results:

Iteration set (Sonnet 4.6, 8 scenarios)
  hand-written:                       0.989
  auto v5 (own tools, shared engine): 0.989
  auto v6 (own tools + own engine):   0.989  ← deep autonomy at parity

Holdout set (Sonnet 4.6, 6 scenarios)
  hand-written: 0.942
  auto v5:      0.927  (gap −0.062 vs iteration → CAUTION band)
  auto v6:      0.985  (gap −0.004 vs iteration → CLEAN band)

The deep-autonomy engine generalises *better* than the surface-only v5 on the holdout, with substantive wins on the 4h scorecard scenario (+0.100) and the default-account variants scenario (+0.250). Both wins come from the autonomous engine's more deliberate CDE / account-status semantics carrying over to non-fixture data shapes.

Report changes
--------------
- §1.5 autonomy ladder: rewrite the four engine rows from a single "SHARED" red pill to a "v5: SHARED / v6: AUTONOMOUS" pair, and add closing paragraphs that distinguish the two cycles.
- §4 multi-model grid: add the v6 column. The reader can see v5 → v6 was a no-op on iteration scores but a substantive lift on holdout.
- §5 generalisation gap: add a v6 row paired to the v6 holdout run. The pairing logic in build_comparison_html.mjs now strips any trailing `-vN` suffix when looking up the holdout label, so future iterations don't need a code change.
- §6 reasoning bullet: flip the autonomous-side description from "engine still shared" to "tool surface AND domain engine independent (v6)", with the CI lockdown test referenced.
- §8 honest limitation: rewrite as "how the deep-autonomy experiment was constructed (v6)". The prior text said this experiment "is not run here". It is now run here, and the section documents the three re-authored modules, the CI lockdown, and the result.

The verdict banner now references both v5 (surface autonomy) and v6 (deep autonomy) as separate parity events.
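The `-vN` suffix handling mentioned for the §5 pairing logic could look roughly like this. It is a guess at the shape rather than the code in `build_comparison_html.mjs`: `stripVersionSuffix` and `findHoldoutRun` are hypothetical names, the holdout runs are modelled as a plain label-to-value record, and trying the exact label before the stripped one is an assumption of the sketch.

```typescript
// Sketch only: names and lookup order are illustrative assumptions.
function stripVersionSuffix(label: string): string {
  // Removes one trailing "-v<digits>" group, e.g. "sonnet46-autonomous-v6" -> "sonnet46-autonomous".
  return label.replace(/-v\d+$/, '');
}

function findHoldoutRun<T>(iterationLabel: string, holdoutRuns: Record<string, T>): T | undefined {
  // Assumption: an exact label match wins, then the version-stripped label is tried.
  if (Object.prototype.hasOwnProperty.call(holdoutRuns, iterationLabel)) {
    return holdoutRuns[iterationLabel];
  }
  const baseLabel = stripVersionSuffix(iterationLabel);
  if (Object.prototype.hasOwnProperty.call(holdoutRuns, baseLabel)) {
    return holdoutRuns[baseLabel];
  }
  return undefined;
}

// Holdout labels follow the form shown for --holdout-runs; the means are the reported v5 holdout values.
const holdoutMeans: Record<string, number> = {
  'sonnet46-handwritten': 0.942,
  'sonnet46-autonomous': 0.927,
};
const pairedMean = findHoldoutRun('sonnet46-autonomous-v5', holdoutMeans);
```

The regex is anchored at the end of the label, so a `-v` segment in the middle of a name is left alone.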
Addresses the v6 deep-autonomy audit findings raised after the architect's
own engine modules landed:
Code-quality (autonomous engine modules)
- schemas: tighten REQUIREMENT_ID_PATTERN so `all.1` etc. no longer match;
strip stale "cycle-17" docstring references.
- requirements: type catalog as Partial<Record<...>> so undefined lookups
must be handled; drop redundant `| LIMIT 1` after un-grouped STATS;
remove the as-cast pseudo-anchor (replaced by a runtime invariant in
the new test file); strip "cycle-17" docstrings.
- evaluator: scoreFor is exhaustive over the typed SCORE_TABLE so drop
the unreachable `?? 0` fallback; runAutonomousWithConcurrency now
awaits all in-flight tasks before re-throwing the first error so a
single rejection no longer orphans siblings (semantics documented).
- docstrings across index.ts, compliance_check_tool, register_tools,
autonomous skill, and experimental_features now consistently describe
v6 deep autonomy (independent engine + tools + heuristics) rather than
overclaiming or underclaiming shared logic.
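The failure semantics described for `runAutonomousWithConcurrency` (wait for every in-flight task, then re-throw the first error) can be illustrated with a small lane-based runner. This is a generic sketch of that technique, not the function from `pci_autonomous_evaluator.ts`; `runWithConcurrency` and its signature are hypothetical.

```typescript
// Sketch only: a generic bounded-concurrency runner, not the autonomous engine's implementation.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  const errors: unknown[] = [];
  let nextIndex = 0;

  // Each lane pulls the next unclaimed item until the queue is empty or a failure was recorded.
  const lane = async (): Promise<void> => {
    while (errors.length === 0 && nextIndex < items.length) {
      const index = nextIndex;
      nextIndex += 1;
      try {
        results[index] = await worker(items[index] as T);
      } catch (err) {
        // Record the error instead of rejecting, so sibling lanes finish their current task.
        errors.push(err);
      }
    }
  };

  const laneCount = Math.max(1, Math.min(limit, items.length));
  // Lanes never reject, so this resolves only after every in-flight task has settled.
  await Promise.all(Array.from({ length: laneCount }, () => lane()));

  if (errors.length > 0) {
    throw errors[0];
  }
  return results;
}
```

Collecting errors inside the lanes is what prevents orphaned siblings: with a bare `Promise.all` over the workers, the first rejection would surface immediately while other tasks kept running unobserved.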
Engine unit tests (~85 specs, ~2s)
- pci_autonomous_schemas.test.ts: provenance constants, index-pattern
refinements (ESQL injection, length bounds), time-range clamping,
requirement-id regex, buildAutonomousScopeClaim dedupe/sort.
- pci_autonomous_requirements.test.ts: catalog completeness, self-
referential ids, presence of AUTONOMOUS_TIME_WINDOW placeholders,
detect_violations always carries a violation query, defaultLookback
sanity, plus a real runtime sync invariant that parses every catalog
key through pciAutonomousRequirementIdSchema (replaces the prior
compile-time anchor that was suppressed by an `as` cast). Also covers
requirementCategory, buildAutonomousTimeWindowParams, time-range
resolution, normalize/resolve helpers, and index-pattern helpers.
- pci_autonomous_evaluator.test.ts: concurrency runner correctness +
failure semantics, ordered ?_window_start/?_window_end binding,
detect_violations RED path, verify_presence GREEN path, AMBER+HIGH /
AMBER+LOW / NOT_ASSESSABLE branches via mockResolvedValueOnce, ES|QL
failure → query_failed data gap, evidence row clamping.
Reproducibility (#2 from audit)
- build_comparison_html.mjs gains --combined-run <label>=<dir>, which
reads a single results.json that mixes pci-compliance:* (iter) and
pci-holdout:* (holdout) scenarios and splits them internally. The
v6 evaluation report can now be regenerated from one results.json
without an ad-hoc helper script.
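The internal split that `--combined-run` performs can be sketched as a prefix partition. The two prefixes (`pci-compliance:` and `pci-holdout:`) come from the text above; the `ScenarioResult` shape, `splitCombinedRun`, and `meanScore` are hypothetical, since the actual layout of `results.json` is not shown here.

```typescript
// Sketch only: the record shape and helper names are assumptions, not the real results.json schema.
interface ScenarioResult {
  scenario: string;
  score: number;
}

const ITERATION_PREFIX = 'pci-compliance:';
const HOLDOUT_PREFIX = 'pci-holdout:';

function splitCombinedRun(results: readonly ScenarioResult[]): {
  iteration: ScenarioResult[];
  holdout: ScenarioResult[];
} {
  return {
    iteration: results.filter((row) => row.scenario.indexOf(ITERATION_PREFIX) === 0),
    holdout: results.filter((row) => row.scenario.indexOf(HOLDOUT_PREFIX) === 0),
  };
}

function meanScore(rows: readonly ScenarioResult[]): number {
  if (rows.length === 0) {
    return NaN; // no scenarios: leave the mean undefined rather than reporting 0
  }
  return rows.reduce((sum, row) => sum + row.score, 0) / rows.length;
}

// Made-up scenario names and scores, used only to show the partition.
const combined: ScenarioResult[] = [
  { scenario: 'pci-compliance:example-a', score: 1 },
  { scenario: 'pci-compliance:example-b', score: 0.5 },
  { scenario: 'pci-holdout:example-c', score: 0.75 },
];
const split = splitCombinedRun(combined);
```

Rows that match neither prefix are dropped by both filters; a stricter version might report them instead.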
All four PCI-autonomous Jest suites pass locally (engine + lockdown).
No new lint errors introduced (remaining no-continue / no-nested-ternary
hits are pre-existing in untouched code).
- comparison.html / build_comparison_html.mjs: extend §8 with a new
"v6 hardening — audit fixes + engine unit tests" subsection that
spells out the post-v6 audit batch (Partial Record typing, exhaustive
scoreFor, dropped LIMIT 1, concurrency failure semantics, stricter
REQUIREMENT_ID_PATTERN), the new 85-spec engine test suite (including
the runtime catalog↔schema sync invariant that replaces the suppressed
compile-time anchor), and the new --combined-run flag for one-shot
v6 report regeneration from a single results.json.
- build_comparison_html.mjs: flatten six pre-existing nested ternaries
(the §4 multi-runs-vs-live-vs-fallback chain becomes an IIFE with
if/else; banner-class / banner-cls / gap-advice / mean-row cls all
become let-block assignments) — no behaviour changes, the script
smoke-runs end-to-end with --combined-run and produces a valid 574-line
HTML output with all 11 §-headings intact.
- pci_autonomous_requirements.ts: drop the lone `continue` in
resolveAutonomousRequirementIds by inverting the guard into a
positive-branch `if (canonical && canonical !== 'all') { ... }`.
All 46 requirements specs still pass.
Net result: both files lint clean (0 errors, 0 warnings). The 7
pre-existing lints sitting inside the audit-batch diff zone — 1
no-continue and 6 no-nested-ternary — are gone.
…rift test

Addresses two follow-up findings on PR elastic#268798:

#2: Lockdown test (pci_autonomous_modules_no_handwritten_imports.test.ts): broaden the import deny-list to cover the full hand-written PCI surface, not just the three engine modules. Now blocks:
- pci_compliance_tool
- pci_compliance_evaluator
- pci_compliance_requirements
- pci_compliance_schemas
- pci_field_mapper_tool
- pci_scope_discovery_tool
- anything under skills/pci_compliance/**

The previous deny-list only covered the engine trio, which left a silent re-coupling path: a future contributor could import the hand-written orchestrator tool or scope-discovery helper and pass CI. The deep-autonomy guarantee in comparison.html §1.5 is broader than the engine (it covers every hand-written surface), so the lockdown should match.

#4: New comparison_html.test.ts: structural snapshot for the committed report. Asserts that the 11 §-level sections appear (in expected order) and the v6 hardening / deep-autonomy h3 subsections are present. Catches the two drift directions between comparison.html and scripts/build_comparison_html.mjs:
1. someone edits the HTML directly and forgets to update the template;
2. someone edits the template and forgets to regenerate + commit.

Deliberately not byte-for-byte equality: the rendered HTML legitimately changes with each eval refresh and we don't want CI noise on prose tweaks.
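The broadened deny-list check can be sketched as a scan over import specifiers. The six module names and the `skills/pci_compliance/` path are the ones listed above; `findDeniedImports`, the regex for static imports, and working on an in-memory string (instead of walking the directory) are assumptions of this sketch rather than details of the real lockdown test.

```typescript
// Sketch only: helper name and matching rules are illustrative assumptions.
const DENIED_MODULES: readonly string[] = [
  'pci_compliance_tool',
  'pci_compliance_evaluator',
  'pci_compliance_requirements',
  'pci_compliance_schemas',
  'pci_field_mapper_tool',
  'pci_scope_discovery_tool',
];

// Matches the hand-written skill directory but not e.g. skills/pci_compliance_autonomous/.
const DENIED_PATH = /(^|\/)skills\/pci_compliance(\/|$)/;

function findDeniedImports(source: string): string[] {
  // Assumption: only static `... from '<specifier>'` forms are scanned, not require() or import().
  const importPattern = /from\s+['"]([^'"]+)['"]/g;
  const denied: string[] = [];
  let match: RegExpExecArray | null = importPattern.exec(source);
  while (match !== null) {
    const specifier = match[1] ?? '';
    const lastSegment = specifier.split('/').pop() ?? '';
    if (DENIED_MODULES.indexOf(lastSegment) !== -1 || DENIED_PATH.test(specifier)) {
      denied.push(specifier);
    }
    match = importPattern.exec(source);
  }
  return denied;
}

const sampleToolSource = [
  "import { evaluate } from './pci_autonomous_evaluator';",
  "import { legacyEvaluate } from '../pci_compliance_evaluator';",
  "import { prompt } from '../../skills/pci_compliance/content';",
].join('\n');
const offendingImports = findDeniedImports(sampleToolSource);
```

A CI test built on this idea would read each file under `pci_autonomous_tools/` and fail when the returned list is non-empty; the import names in the sample source are invented for the example.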
Address the 15 findings from the autonomous PCI deep-analysis audit covering the engine modules, the four agent-facing tools, and the skill prompt. Blockers - Scope-discovery tool now returns a `discoveryClaim` (point-in-time snapshot) instead of a mis-shaped `scopeClaim`, surfaces ES errors as structured `dataGaps`, and validates `cat.indices` responses with a zod schema before walking them. - Requirements catalog: dropped the unused `requiredCategories[]` field and the orphan `requirementCategory()` helper. Removed `NOT_APPLICABLE` from `AutonomousComplianceStatus` — it was carried in the score table but never produced by any evaluator path. - Scorecard report no longer tags its synthesised executive roll-up as `ToolResultType.esqlResults` (the payload is not an ESQL row set); it now lands under `ToolResultType.other` so downstream UX/telemetry that special-cases `esqlResults` does not mis-render it. Importants - Skill prompt rewritten: workflow is now `discover → roll up → drill down`. The check and scorecard tools are explicitly designed to be used as a sequence and share one evaluator via the new `runAutonomousPciEvaluationPack` orchestration helper. - Both tools now derive `overallStatus` from the same severity rollup (`rollupAutonomousOverallStatus`) and `overallConfidence` from the same confidence rollup (`rollupAutonomousConfidence`), eliminating the previous risk of disagreement. - Field-mapper sensitive-field regex tightened: the previous bare `/token/i` over-matched (e.g. `subscription` contains no token but `tokenizer` would have flagged). Replaced with anchored patterns for `card`, `pan`, `cvv`, `cvc`, `account.number`, `credit.card`, `ssn`, `secret`, `password`, `api.key`, and specific `*token` shapes. - Added a runtime `assertNever` exhaustiveness check on the `statusToHumanLabel` switch — adding a new status without updating the switch now fails at compile time. 
Nice-to-haves
- Removed experiment-only metadata (gate scores, citation counts, architect attribution, brittle `comparison.html §1.5` cross-refs) from every runtime file. Authoring metadata stays beside the eval suite.
- "Recommended Remediation SLA" table in the skill prompt re-labelled as operational guidance: only the 30-day req 6.3.3 window is spec-sourced; the rest are heuristics a QSA would typically agree with but an org may tune.
- SAQ scope-reduction "70%" claim re-cast as the assessor-guidance heuristic range (50–80%), not a guarantee.
- `requirementCategory` tests removed; weak `['HIGH','MEDIUM']` evaluator assertion pinned to the exact value (`MEDIUM` via the coverage-stage no-violation-query path).
- New `buildAutonomousDiscoveryClaim` helper + 4-spec test block covering dedupe/sort, provenance pinning, point-in-time semantics, and stable shape across shuffled inputs.

Verification
- ESLint: 14 files, clean.
- Jest: 101/101 pass in `pci_autonomous_tools/` + the autonomous skill suite, 16/16 pass in `comparison_html.test.ts`.
- Scoped `tsc -b` against `security_solution/tsconfig.type_check.json`: green.
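The `assertNever` exhaustiveness check listed under Importants follows a common TypeScript idiom. The sketch below shows the pattern only: the `SketchStatus` members and the label strings are hypothetical (the real `AutonomousComplianceStatus` values are not shown in this PR text), and `statusToHumanLabelSketch` stands in for the real `statusToHumanLabel`.

```typescript
// Hypothetical status union, for illustration only.
type SketchStatus = 'COMPLIANT' | 'PARTIAL' | 'NON_COMPLIANT';

// If every case is handled, `value` has type `never` here and the call type-checks.
// If a new union member is added without a matching case, the call site below
// stops compiling. The throw is the runtime backstop for values that bypass the
// type system (for example, unvalidated JSON).
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

function statusToHumanLabelSketch(status: SketchStatus): string {
  switch (status) {
    case 'COMPLIANT':
      return 'Compliant';
    case 'PARTIAL':
      return 'Partially compliant';
    case 'NON_COMPLIANT':
      return 'Not compliant';
    default:
      return assertNever(status);
  }
}
```

Adding a fourth member to `SketchStatus` without a new `case` makes `assertNever(status)` a type error, which is the compile-time failure the commit message describes.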
Summary
Validation experiment: can the `skill.architect` autonomous workflow carry an entire Elastic Security feature end-to-end (agent contract and underlying domain engine) as well as a hand-written skill authored by a domain expert?

This PR adds a second PCI compliance skill (`pci-compliance-autonomous`) next to Smriti's hand-written `pci-compliance` skill (PR #256060), wires a side-by-side eval harness through `@kbn/evals`, and ships the result.

Headline result (Claude 4.6 Sonnet, 8-scenario iteration set): hand-written 0.989 vs autonomous v6 0.989 (parity), with both generalising to a held-out dataset (hand-written 0.942, autonomous v6 0.985).

See `x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/comparison.html` for the full side-by-side report (autonomy ladder in §1.5, per-scenario scores in §4, overfit/holdout analysis in §5, audit-fix hardening in §8).
What lands
- `pciComplianceAutonomousAgentBuilder` and allowlist entry, so the agent router has zero path to the hand-written tool IDs under the autonomous flag.
- `pci_autonomous_requirements.ts` (independent PCI DSS v4.0.1 catalog), `pci_autonomous_evaluator.ts` (composable pipeline, exhaustive status→score table, awaits-all concurrency runner), `pci_autonomous_schemas.ts` (independent zod + ScopeClaim builder with 48 h future-date guard and provenance block).
- `pci_autonomous_modules_no_handwritten_imports.test.ts` walks every file under `pci_autonomous_tools/` and asserts (a) zero imports from `pci_compliance_(requirements|evaluator|schemas)`, (b) every tool file imports at least one autonomous engine module.
- same suite runs both variants on the same cluster, same dataset, same connector; the only variable is which skill the router has available. Anti-overfit content lockdown (11 invariants in the skill test) forbids the skill from referencing any iteration or holdout fixture value.
- `supportsTemperature: false` to `@kbn/inference-common`'s known-models table and gates the `temperature` parameter in both the inference plugin and the Bedrock connector (`invokeAI`, `invokeStream`, `invokeAIRaw`, `_converse`, `_converseStream`). A single line in `known_models.ts` controls this for future temperature-incompatible models.
- including a runtime catalog↔schema sync invariant that parses every catalog key through `pciAutonomousRequirementIdSchema`.
- `build_comparison_html.mjs --combined-run` flag splits a single `results.json` containing both iteration and holdout scenarios so the report can be regenerated from one committed file.
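The `supportsTemperature: false` gate described in the list above can be sketched as a lookup plus a conditional strip. This is an illustration under stated assumptions, not the code in `known_models.ts` or the Bedrock connector: the table shape, the `idFragment` substring match, the example model IDs, and the names `KNOWN_MODELS_SKETCH`, `modelSupportsTemperatureSketch`, and `applyTemperatureGate` are all hypothetical.

```typescript
// Hypothetical table shape: an absent flag means "temperature is supported".
interface KnownModelSketch {
  idFragment: string; // assumed matching key, not the real schema
  supportsTemperature?: boolean;
}

const KNOWN_MODELS_SKETCH: KnownModelSketch[] = [
  { idFragment: 'claude-opus-4-7', supportsTemperature: false },
  { idFragment: 'claude-sonnet-4-6' },
];

interface RequestParamsSketch {
  modelId: string;
  temperature?: number;
  maxTokens?: number;
}

// Unknown models and models without the flag keep their temperature.
function modelSupportsTemperatureSketch(modelId: string): boolean {
  const entry = KNOWN_MODELS_SKETCH.find((model) => modelId.includes(model.idFragment));
  return entry?.supportsTemperature !== false;
}

// Returns a copy of the params; `temperature` is removed only for models
// that explicitly declare `supportsTemperature: false`.
function applyTemperatureGate(params: RequestParamsSketch): RequestParamsSketch {
  if (modelSupportsTemperatureSketch(params.modelId)) {
    return params;
  }
  const gated: RequestParamsSketch = { ...params };
  delete gated.temperature;
  return gated;
}
```

Treating a missing flag as "supported" means only models that opt out change behavior, which matches the PR's statement that all other Bedrock-routed models retain pre-PR behavior.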
Audit-fix batch (latest two commits)
After v6 landed, an internal audit raised seven items; all closed:
- `Partial<Record<...>>` to force handling of undefined lookups;
- `SCORE_TABLE` so the `?? 0` fallback in `scoreFor` was unreachable and removed;
- `| LIMIT 1` after un-grouped `STATS` dropped;
- `runAutonomousWithConcurrency` rewritten to await all in-flight tasks before re-throwing the first error (no orphan promises);
- `REQUIREMENT_ID_PATTERN` regex tightened so `all.1` and similar malformed IDs no longer match;
- by a real runtime invariant test;
- `--combined-run` regen flag for one-shot report rebuilds.

In addition, the final commit sweeps 7 pre-existing lints sitting inside the audit-batch diff zone (1 `no-continue`, 6 `no-nested-ternary`), flattened by extracting branches into `let`/`if`/`else if` blocks and wrapping the §4 multi-runs chain in an IIFE.
Checklist
- `pci_autonomous_{schemas,requirements,evaluator}.test.ts`) + lockdown test asserting zero imports from hand-written engine; existing `pci_compliance_skill.test.ts` untouched.
- `pciComplianceAutonomousAgentBuilder` defaults to off)

Release Notes
`release_note:skip`: experimental autonomous-validation work behind an opt-in feature flag, plus a Bedrock fix for the Claude 4.7 Opus `temperature` deprecation. No user-facing surface lands by default.

Identify risks
- `pciComplianceAutonomousAgentBuilder` feature flag, default off; independent allowlist entry; CI lockdown test prevents cross-imports.
- `temperature`-stripping change has a wider blast radius than the PCI experiment in this PR.
- Bedrock connector's `invokeAI` / `invokeStream` / `invokeAIRaw` / `_converse` / `_converseStream`, i.e. Security AI Assistant, Observability AI Assistant, custom Workflows / Cases connectors using `.bedrock`, and Inference-plugin chat-completion calls that resolve to a `.bedrock` connector. The PCI suite is one of many consumers, not the only one; please review accordingly.
- `supportsTemperature: false` flag in `@kbn/inference-common`'s `known_models.ts`, which is absent by default. Only models that opt in (currently `claude-opus-4-7-*` only) have `temperature` stripped; all other Bedrock-routed models retain pre-PR behavior. Smoke-tested `invokeAI` + `converse` on both Opus 4.7 (now passes) and Sonnet 4.6 (still includes temperature, also passes).
- `.bedrock` connector (`stack_connectors/.../bedrock/{bedrock.ts,utils.ts}`) is on the deprecation path per [Inference] Mark all LLM connectors as deprecated #261591, superseded by the Inference plugin. The equivalent gate on the Inference side lives in `inference/server/chat_complete/utils/get_temperature.ts` and shares `known_models.ts` as source-of-truth, so users on either path inherit the same temperature-compatibility decisions. Keeping the connector edit here (rather than splitting it out) is deliberate so this PR closes the loop for users still on the deprecated path. CC reviewers: `Team:Security GenAI` for PCI surface, `Team:Search` + `Team:Security` for the Bedrock connector + Inference plugin half.
- (own engine + own tests) and the comparison report; existing `pci_compliance` code is untouched.

Made with Cursor