…eval suite (#264378)
> Built on top of #256060 (`smriti/pci-compliance-agent`) to land the
review changes there before it's marked ready. Targets Smriti's branch
directly so all work lands in a single upstream PR.
## Summary
Makes the PCI DSS v4.0.1 Agent Builder skill production-ready by
hardening the tools against ES|QL injection, consolidating them, gating
the feature behind an experimental flag, and adding an end-to-end eval
suite. The skill now:
* Ships **dark** by default (`pciComplianceAgentBuilder` experimental
flag, off).
* Uses **ES|QL parameter binding** everywhere — time ranges go through
`?_tstart` / `?_tend` bound params, and index patterns are strictly
validated by Zod before being interpolated into `FROM` (`FROM` clauses
cannot be parameterised).
* Exposes a single `pci_compliance` tool (`mode: "check" | "report"`)
plus `pci_scope_discovery` and `pci_field_mapper`, keeping the skill at
**3 registry tool references** (well under the 5-tool guideline).
* Attaches a structured **scope claim** to every tool response so the
LLM can cite DSS version, indices, time range, evaluated requirements,
and checked fields — and always includes a non-attestation disclaimer so
the skill never implies QSA certification.
## Eval Results — Before & After
The original PR (#256060) shipped **no eval suite and no automated
quality measurement**. This PR adds a new
`@kbn/evals-suite-pci-compliance` package with 8 scenarios. To quantify
the impact of the hardening changes, we retroactively ran the same eval
suite against Smriti's original tool code and then against this PR's
hardened tools — same model (Claude Sonnet 4.5), same judge (Claude Opus
4.6), same seed data, same criteria.
| Dataset | Original tools (#256060) | Hardened tools (this PR) | Delta
|
|---|---|---|---|
| requirement 2.2.4 default accounts | 0.83 | **1.00** | **+0.17** |
| scope discovery | 0.63 | **0.88** | **+0.25** |
| field mapping | 0.60 | **0.90** | **+0.30** |
| requirement 4.1 weak TLS | 0.88 | **1.00** | **+0.12** |
| full report | 0.60 | **0.90** | **+0.30** |
| requirement 8.3.4 brute force | 0.50 | **0.63** | **+0.13** |
| scoped to auth index | 0.86 | **1.00** | **+0.14** |
| no matching data | 1.00 | **1.00** | 0.00 |
| **Overall PCI Criteria mean** | **0.74** | **0.90** | **+0.16** |
| **Skill Invoked (trace-based)** | n/a (no eval existed) | **1.00** | ✅
new |
### What was missing before this PR
- **No eval suite** — no automated way to measure whether the PCI skill
produced correct, complete, or safe outputs
- **No scope claims** — tool responses had no structured metadata (DSS
version, time range, indices, disclaimer), making it impossible for the
LLM to cite audit scope or for a judge to verify correctness
- **No ES|QL injection protection** — time ranges were
string-interpolated into queries
- **No input validation** — index patterns were passed directly to
`FROM` without Zod sanitisation
- **No feature flag** — the skill was always registered regardless of
environment
- **No vulnerability scope** — `pci_scope_discovery` couldn't classify
vuln/CVE indices
- **No HTTP detection** — requirement 4.2.1 only checked for weak TLS,
not plain HTTP
- **Sequential `fieldCaps`** — O(n) round-trips per index in scope
discovery
- **No trace-based skill activation check** — no way to verify the PCI
SKILL.md was actually read
### Key improvements driving the score delta
1. **Structured scope claims + disclaimer** — every tool response now
includes `scopeClaim` with DSS version, indices, time range, evaluated
requirements, and non-attestation disclaimer. The judge validates these
are present (+0.17 on default accounts, contributes to all scenarios).
2. **Vulnerability scope type** — `pci_scope_discovery` now classifies
`vuln`/`cve`/`ids` indices under a dedicated `vulnerability` category
(+0.25 on scope discovery).
3. **Field mapping hints** — `pci_field_mapper` now carries explicit
`severity → vulnerability.severity` hints and returns ECS-aligned
suggestions (+0.30 on field mapping).
4. **ES|QL parameter binding** — the judge can verify queries reference
the correct time window (+0.30 on full report, +0.14 on scoped auth).
5. **Weak TLS / HTTP detection** — requirement 4.2.1 now also flags
plain HTTP traffic (`tls.version IS NULL AND network.protocol ==
"http"`) (+0.12 on weak TLS).
6. **Brute-force threshold alignment** — seed data increased from 7 → 12
failed logins to exceed the `> 10` threshold defined in
`pci_compliance_requirements.ts` (+0.13).
7. **Trace-based skill activation** — new deterministic evaluator
confirms the PCI SKILL.md is always read (1.00 across all 8 scenarios).
### Methodology
- **Original tools:** Smriti's tool code (`d28bbfb`, tip of
`smriti/pci-compliance-agent`) temporarily checked out on this branch,
with the eval suite running against it
- **Hardened tools:** this PR's tool code (`049b7be`) with the same eval
suite
- Both runs used the same Scout server config, same EDOT collector, same
LLM connectors
## Review guide
This PR is built to be reviewed in layers. Recommended order:
1. **Security / schemas** — `pci_compliance_schemas.ts` +
`pci_compliance_requirements.ts` (+ tests).
2. **Consolidation** — `pci_compliance_evaluator.ts`,
`pci_compliance_tool.ts`, and the skill wiring (old `check`/`report`
tools are deleted).
3. **Tool hardening** — `pci_field_mapper_tool.ts`,
`pci_scope_discovery_tool.ts` (batched fieldCaps, time-window scoping,
scope claims).
4. **Feature flag + registration** — `experimental_features.ts`,
`register_skills.ts`, `register_tools.ts`, `allow_lists.ts`,
`pci_compliance_skill.ts`.
5. **Eval suite** —
`x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/` +
the new `evals_pci_compliance` Scout config set.
## What changed
### 🔒 Security hardening
* **ES|QL parameter binding (A2)**: every coverage and violation query
now uses `executeEsql({ query, params: [{ _tstart, _tend }] })` with
`?_tstart` / `?_tend` placeholders instead of `$from`/`$to` string
interpolation. Verified by `pci_compliance_requirements.test.ts` — every
query in the registry must contain both placeholders and must not
contain the literal time-range values.
* **Strict Zod schemas (A1)**: a new `pci_compliance_schemas.ts`
centralises `pciIndexPatternSchema`, `pciTimeRangeSchema`, and
`pciRequirementIdSchema`. Rejects wildcards that would scan everything,
control chars, commas, spaces, backslashes, and any character outside
the Elasticsearch index-name allow-list — before the value ever reaches
ES. Used by every tool.
* **Scope claims (C)**: every PCI tool response includes a `scopeClaim`
object (`pciDssVersion`, `indices`, `timeRange`,
`requirementsEvaluated`, `requiredFieldsChecked`, `disclaimer`) produced
by a shared `buildScopeClaim()` helper. The evaluator and eval suite
both assert on it; the disclaimer is pinned in tests so it can't drift.
### 🧩 Tool consolidation (B)
* **Deleted**: `pci_compliance_check_tool.{ts,test.ts}`,
`pci_compliance_report_tool.{ts,test.ts}`.
* **Added**: `pci_compliance_tool.{ts,test.ts}` — a single tool with a
`mode: "check" | "report"` discriminator.
* **Added**: `pci_compliance_evaluator.{ts,test.ts}` — extracts shared
core evaluation logic (`evaluateRequirement`, `runWithConcurrency`) used
by both modes. Centralises preflight checks and bound ES|QL execution.
* **Skill tool budget**: `pci_compliance_skill.ts` now references 3
tools (`pci_compliance`, `pci_scope_discovery`, `pci_field_mapper`) —
below the 5-tool soft cap that improves LLM tool selection.
* **allow_lists.ts** updated to reflect the consolidated tool id.
### 🚩 Feature flag (D)
* New `pciComplianceAgentBuilder: false` in `experimental_features.ts`.
* `register_skills.ts` and `register_tools.ts` both gate registration on
the flag — no surface area is exposed when off.
* The `evals_pci_compliance` Scout config set enables the flag +
`agentBuilder:experimentalFeatures` UI setting for evals.
### ⚡ Performance fixes
* `pci_scope_discovery_tool.ts`: replaced per-index sequential
`fieldCaps` calls (O(n) round-trips across potentially thousands of
indices) with a **single batched `fieldCaps`** request. Custom wildcard
patterns are now resolved to concrete index names before querying.
Tested.
* `pci_field_mapper_tool.ts`: every `search()` now includes the
requested time-range filter so we don't scan cold data when mapping
fields.
### 🧪 Tests (G)
New/updated unit tests:
* `pci_compliance_schemas.test.ts` — Zod coverage (malicious patterns,
control chars, etc.) + scope-claim shape pinning (DSS version,
disclaimer, dedup/sort).
* `pci_compliance_requirements.test.ts` — ES|QL query safety: asserts
every registry query uses `?_tstart` / `?_tend` and never interpolates
the raw time range.
* `pci_compliance_evaluator.test.ts` — parameter binding is forwarded,
concurrency is bounded, `NOT_ASSESSABLE` vs `GREEN` classification.
* `pci_compliance_tool.test.ts` — input schema rejection, injection
protection, scope claim, `check` vs `report` response shape.
* `pci_field_mapper_tool.test.ts` + `pci_scope_discovery_tool.test.ts` —
hardened schemas, batched `fieldCaps`, wildcard pattern resolution,
scope claims.
* `pci_compliance_skill.test.ts` — tool count ≤ 5, tool IDs match
registry, flag-gated registration.
### 🧠 Eval suite (E)
New package: **`@kbn/evals-suite-pci-compliance`** (registered in
`evals.suites.json`, `tsconfig.base.json`, root `package.json`, and
`CODEOWNERS`).
Eight scenarios, each with PCI-specific criteria evaluator + trace-based
skill invocation evaluator:
| Scenario | Skill / Tool | What it asserts |
|---|---|---|
| Requirement 8 — brute force | `pci_compliance` (`check`) | Flags ≥12
failed-login evidence + surfaces source IP |
| Requirement 4 — weak TLS | `pci_compliance` (`check`) | Flags legacy
TLS/SSL + plain HTTP destinations |
| Requirement 2 — default accounts | `pci_compliance` (`check`) | Flags
default/service accounts |
| Scoped auth check | `pci_compliance` (`check`) | Correct scoping to
auth index only |
| Scope discovery | `pci_scope_discovery` | Identifies + classifies
PCI-relevant indices incl. vulnerability |
| Field mapping | `pci_field_mapper` | Suggests correct ECS targets for
legacy fields |
| Posture report | `pci_compliance` (`report`) | Scorecard across
requirements, confidence rollup |
| No matching data | `pci_compliance` (`check`) | Graceful handling when
no data exists |
A baseline criterion (`BASELINE_PCI_CRITERIA`) pins the DSS version, the
non-attestation disclaimer, and the `scopeClaim` shape for every
scenario. Seed data (`src/data_generators/pci_data.ts`) provisions five
small, self-contained index patterns with **randomized names per run**
(`logs-${random}-auth`, etc.) so specs own their lifecycle and the skill
can't rely on predictable index names.
Runs via:
```sh
node scripts/scout start-server --arch stateful --domain classic --serverConfigSet evals_pci_compliance
node scripts/evals start --suite pci-compliance
```
## On the testing strategy
* **Unit tests** cover schemas, ES|QL query safety, evaluator logic,
tool shape, and skill wiring — all deterministic, all in-process.
* **Eval suite** covers the LLM-driven integration path end-to-end with
seeded data and scope-claim assertions.
* **Scout API tests were intentionally descoped**: the tool HTTP
contract is exercised by the eval suite through the public
`/api/agent_builder/tools/_execute` endpoint, and every
security-critical path (schema rejection, injection protection, scope
claim) is pinned at the unit-test level. Happy to add an API-only Scout
spec in a follow-up if you'd prefer explicit coverage — the
`evals_pci_compliance` config set is already there to reuse.
## Checklist
- [x] Written or updated unit tests
- [x] Added eval coverage (new `kbn-evals-suite-pci-compliance` with 8
scenarios)
- [x] Eval comparison: original tools 0.74 → hardened tools 0.90 (+0.16
overall)
- [x] Feature gated behind experimental flag
(`pciComplianceAgentBuilder`)
- [x] Scope claim + non-attestation disclaimer pinned in tests
- [x] Trace-based skill invocation validation (1.00 across all
scenarios)
---------
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Summary
Makes the PCI DSS v4.0.1 Agent Builder skill production-ready by hardening the tools against ES|QL injection, consolidating them, gating the feature behind an experimental flag, and adding an end-to-end eval suite. The skill now:
pciComplianceAgentBuilderexperimental flag, off).?_tstart/?_tendbound params, and index patterns are strictly validated by Zod before being interpolated intoFROM(FROMclauses cannot be parameterised).pci_compliancetool (mode: "check" | "report") pluspci_scope_discoveryandpci_field_mapper, keeping the skill at 3 registry tool references (well under the 5-tool guideline).Eval Results — Before & After
The original PR (#256060) shipped no eval suite and no automated quality measurement. This PR adds a new
@kbn/evals-suite-pci-compliancepackage with 8 scenarios. To quantify the impact of the hardening changes, we retroactively ran the same eval suite against Smriti's original tool code and then against this PR's hardened tools — same model (Claude Sonnet 4.5), same judge (Claude Opus 4.6), same seed data, same criteria.What was missing before this PR
FROMwithout Zod sanitisationpci_scope_discoverycouldn't classify vuln/CVE indicesfieldCaps— O(n) round-trips per index in scope discoveryKey improvements driving the score delta
scopeClaimwith DSS version, indices, time range, evaluated requirements, and non-attestation disclaimer. The judge validates these are present (+0.17 on default accounts, contributes to all scenarios).pci_scope_discoverynow classifiesvuln/cve/idsindices under a dedicatedvulnerabilitycategory (+0.25 on scope discovery).pci_field_mappernow carries explicitseverity → vulnerability.severityhints and returns ECS-aligned suggestions (+0.30 on field mapping).tls.version IS NULL AND network.protocol == "http") (+0.12 on weak TLS).> 10threshold defined inpci_compliance_requirements.ts(+0.13).Methodology
d28bbfb, tip ofsmriti/pci-compliance-agent) temporarily checked out on this branch, with the eval suite running against it049b7be) with the same eval suiteReview guide
This PR is built to be reviewed in layers. Recommended order:
pci_compliance_schemas.ts+pci_compliance_requirements.ts(+ tests).pci_compliance_evaluator.ts,pci_compliance_tool.ts, and the skill wiring (oldcheck/reporttools are deleted).pci_field_mapper_tool.ts,pci_scope_discovery_tool.ts(batched fieldCaps, time-window scoping, scope claims).experimental_features.ts,register_skills.ts,register_tools.ts,allow_lists.ts,pci_compliance_skill.ts.x-pack/solutions/security/packages/kbn-evals-suite-pci-compliance/+ the newevals_pci_complianceScout config set.What changed
🔒 Security hardening
executeEsql({ query, params: [{ _tstart, _tend }] })with?_tstart/?_tendplaceholders instead of$from/$tostring interpolation. Verified bypci_compliance_requirements.test.ts— every query in the registry must contain both placeholders and must not contain the literal time-range values.pci_compliance_schemas.tscentralisespciIndexPatternSchema,pciTimeRangeSchema, andpciRequirementIdSchema. Rejects wildcards that would scan everything, control chars, commas, spaces, backslashes, and any character outside the Elasticsearch index-name allow-list — before the value ever reaches ES. Used by every tool.scopeClaimobject (pciDssVersion,indices,timeRange,requirementsEvaluated,requiredFieldsChecked,disclaimer) produced by a sharedbuildScopeClaim()helper. The evaluator and eval suite both assert on it; the disclaimer is pinned in tests so it can't drift.🧩 Tool consolidation (B)
pci_compliance_check_tool.{ts,test.ts},pci_compliance_report_tool.{ts,test.ts}.pci_compliance_tool.{ts,test.ts}— a single tool with amode: "check" | "report"discriminator.pci_compliance_evaluator.{ts,test.ts}— extracts shared core evaluation logic (evaluateRequirement,runWithConcurrency) used by both modes. Centralises preflight checks and bound ES|QL execution.pci_compliance_skill.tsnow references 3 tools (pci_compliance,pci_scope_discovery,pci_field_mapper) — below the 5-tool soft cap that improves LLM tool selection.🚩 Feature flag (D)
pciComplianceAgentBuilder: falseinexperimental_features.ts.register_skills.tsandregister_tools.tsboth gate registration on the flag — no surface area is exposed when off.evals_pci_complianceScout config set enables the flag +agentBuilder:experimentalFeaturesUI setting for evals.⚡ Performance fixes
pci_scope_discovery_tool.ts: replaced per-index sequentialfieldCapscalls (O(n) round-trips across potentially thousands of indices) with a single batchedfieldCapsrequest. Custom wildcard patterns are now resolved to concrete index names before querying. Tested.pci_field_mapper_tool.ts: everysearch()now includes the requested time-range filter so we don't scan cold data when mapping fields.🧪 Tests (G)
New/updated unit tests:
pci_compliance_schemas.test.ts— Zod coverage (malicious patterns, control chars, etc.) + scope-claim shape pinning (DSS version, disclaimer, dedup/sort).pci_compliance_requirements.test.ts— ES|QL query safety: asserts every registry query uses?_tstart/?_tendand never interpolates the raw time range.pci_compliance_evaluator.test.ts— parameter binding is forwarded, concurrency is bounded,NOT_ASSESSABLEvsGREENclassification.pci_compliance_tool.test.ts— input schema rejection, injection protection, scope claim,checkvsreportresponse shape.pci_field_mapper_tool.test.ts+pci_scope_discovery_tool.test.ts— hardened schemas, batchedfieldCaps, wildcard pattern resolution, scope claims.pci_compliance_skill.test.ts— tool count ≤ 5, tool IDs match registry, flag-gated registration.🧠 Eval suite (E)
New package:
@kbn/evals-suite-pci-compliance(registered inevals.suites.json,tsconfig.base.json, rootpackage.json, andCODEOWNERS).Eight scenarios, each with PCI-specific criteria evaluator + trace-based skill invocation evaluator:
pci_compliance(check)pci_compliance(check)pci_compliance(check)pci_compliance(check)pci_scope_discoverypci_field_mapperpci_compliance(report)pci_compliance(check)A baseline criterion (
BASELINE_PCI_CRITERIA) pins the DSS version, the non-attestation disclaimer, and thescopeClaimshape for every scenario. Seed data (src/data_generators/pci_data.ts) provisions five small, self-contained index patterns with randomized names per run (logs-${random}-auth, etc.) so specs own their lifecycle and the skill can't rely on predictable index names.Runs via:
On the testing strategy
/api/agent_builder/tools/_executeendpoint, and every security-critical path (schema rejection, injection protection, scope claim) is pinned at the unit-test level. Happy to add an API-only Scout spec in a follow-up if you'd prefer explicit coverage — theevals_pci_complianceconfig set is already there to reuse.Checklist
kbn-evals-suite-pci-compliancewith 8 scenarios)pciComplianceAgentBuilder)