Merged
42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -6,7 +6,47 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht

## [Unreleased]

## [0.37.0] - 2026-05-27

**Theme: third attacker family added to cross-model held-out eval, v8
classifier lifts the worst v0.36 sub-cell.** 900 adversarial entries
generated by `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` on
AMD-backed MI300X via `rocm/vllm:latest` extend the cross-model fold
to a third family. v8 retrains on v035 TRAIN plus v036 Mixtral and
Claude TM/PE folded in, with both v036 DE legs and the full v037
Llama-3.3 leg held out. v8 holds in-distribution at 86.6% recall on
v035 TEST (vs v7's 85.3%, +1.3 pp) with 5.0% FPR. On the new v037
holdout (2,277 entries), overall recall lands at 66.8%, a 7.6 pp lift
on the comparable v036 number (59.2%). The Llama-3.3 leg covers a third
attacker family at 85.8% overall. The worst v0.36 sub-cell (data_exfil
× Claude) lifts from 26.0% (v7) to 38.9% (v8) on the same 700 entries,
+12.9 pp, confirming the v0.36 diagnosis that the constraint was
training-corpus distribution and not feature space. Folding v036
TM/PE into TRAIN reweights the existing destination features against
the closed-weight attacker patterns the v7 fold was missing.

### Added
- `tests/adversarial/generated/{TM,PE,DE}-v037-llama33.jsonl`: 900
third-family entries (13 schema-invalid TM dropped, 887 valid).
- `tests/adversarial/v037_split.json`: v0.37 split manifest. Inherits
every v035_split assignment unchanged, folds v036 Mixtral and Claude
TM/PE into the train fold, marks v036 Mixtral and Claude DE plus the
full v037 Llama-3.3 leg as held out.
- `src/vaara/data/adversarial_classifier_v8.joblib`: new production
bundle, 638 features (254 hand plus 384 MiniLM embedding), trained
on the 11,287-entry union fold at threshold 0.9006.
- `scripts/_v037_common.py`, `scripts/generate_targeted_v037.py`,
`scripts/v037_droplet_run.sh`, `scripts/v037_local_watcher.sh`:
Llama-3.3 generation pipeline mirroring the v0.36 shape with the
third-family swap and a continuous-monitoring local watcher.
- `scripts/build_v037_split.py`: builds `v037_split.json` from the
v035 inheritance plus the v036 and v037 generated entries.
- `scripts/validate_v037.py`, `scripts/eval_v037_holdout.py`: schema
check and three-leg holdout eval (mixtral, claude, llama33).
- `bench/v037_eval_v8_holdout.json`: full per-category and
per-category-per-leg eval results.
- `bench/vaara-bench-v0.37.md`: ship-gate record, chain of custody,
reproduction recipe, named limits.
- `vaara.attestation.sep2787`: reference implementation of the SEP-2787
Tool Call Attestation envelope (MCP spec PR
`modelcontextprotocol/modelcontextprotocol#2787`), proposed shape.
@@ -28,6 +68,8 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
tampering rejection, canonicalization invariants, and TTL handling.

### Changed
- Production classifier: v7 → v8. v7 retained on disk for cross-eval
reproducibility. Threshold unchanged at 0.9006.
- `attestation` optional extra: adds `rfc8785>=0.1.4` for JCS
canonicalization.

6 changes: 6 additions & 0 deletions COMPLIANCE.md
@@ -600,6 +600,12 @@ relevant measurement primitive that ties everything above together.
complement of) Clopper-Pearson. The extension rides in a single
field in the signed metadata. Standard OVERT verifiers ignore it.

## Position relative to the MIT AI Risk Repository

The [MIT AI Risk Repository v4](https://airisk.mit.edu/) (MIT FutureTech, Slattery et al., updated 2025-12-03, CC BY 4.0) is a meta-taxonomy of 1,835 risk-bearing entries drawn from 74 source papers, organised into 7 domains. Vaara has direct runtime evidence shape against roughly 740 of those entries (~46% of the sub-domain-tagged set), concentrated in Privacy & Security, Malicious Actors & Misuse, Human-Computer Interaction, parts of AI System Safety, and the Governance Failure sub-domain. Vaara does not cover the model-side, content-level, and structural risks that live elsewhere in the taxonomy.

The full per-sub-domain map lives at [docs/mit_ai_risk_repository_mapping.md](docs/mit_ai_risk_repository_mapping.md). Local copies of the v4 database and the companion AI Risk Mitigations sheet are tracked under `research/external/` for reproducibility.

## EU Product Liability Directive 2024/2853

Directive (EU) 2024/2853 of 23 October 2024 on liability for defective
3 changes: 2 additions & 1 deletion README.md
@@ -29,7 +29,7 @@ Held-out TEST recall 85.0% (95% Wilson [82.8, 87.1]) at FPR 4.6% [3.3, 6.3]. Mul
- 140 µs / 210 µs p99 inference latency, commodity CPU
- Distribution-free conformal coverage on the score
- MWU regret bound O(sqrt(T log N))
-- [vaara-bench-v0.36](bench/vaara-bench-v0.36.md): current methodology, chain of custody, ship-gate record. Cross-model held-out evaluation (4,176 entries generated by Mixtral-8x7B and Claude Sonnet 4.6, never folded into TRAIN), v7 production classifier with 18 destination-aware features, honest training-corpus diagnosis named as v0.37 scope. Historical bench docs live under `bench/` for chain-of-custody continuity.
+- [vaara-bench-v0.37](bench/vaara-bench-v0.37.md): current methodology, chain of custody, ship-gate record. Third attacker family added to cross-model held-out (900 entries generated by Llama-3.3-70B-Instruct on AMD-backed MI300X) and v8 production classifier trained on the v035 + v036 TM/PE union fold. Holds 86.6% recall on v035 TEST, 85.8% on the new Llama-3.3 leg, lifts the worst v0.36 sub-cell (data_exfil × Claude) from 26.0% to 38.9%. Historical bench docs live under `bench/` for chain-of-custody continuity.
- [vaara-bench-v1](bench/vaara-bench-v1.md): 77-trace synthetic-corpus regression baseline with frozen methodology, 100% soft TPR, 0% hard FPR

Each figure is reproducible from the public corpus or the bench harness in `bench/`.
@@ -266,6 +266,7 @@ See [COMPLIANCE.md](COMPLIANCE.md) "Position relative to open runtime-attestatio
| [PRIOR_ART.md](PRIOR_ART.md) | When each Vaara concept first shipped, and a neutral list of adjacent published work |
| [OWASP_AGENTIC.md](OWASP_AGENTIC.md) | Vaara mapping to OWASP Top 10 for Agentic Applications 2026 (ASI01 to ASI10) |
| [OVERT_CONTROLS.md](OVERT_CONTROLS.md) | Vaara mapping to OVERT 1.0 Part 3 Agentic AI Controls (TOOL-*, MCP-*, MULTI-*, CAP-*, DISC-*, HITL-*, DRIFT-*) |
| [docs/mit_ai_risk_repository_mapping.md](docs/mit_ai_risk_repository_mapping.md) | Vaara coverage map against the MIT AI Risk Repository v4 (1,835 risk-bearing entries across 7 domains) |
| [docs/signing-keys.md](docs/signing-keys.md) | Release signing and verification |
| [SECURITY.md](SECURITY.md) | Security policy and reporting |
| [CONTRIBUTING.md](CONTRIBUTING.md) | Contribution guidelines |
76 changes: 76 additions & 0 deletions bench/v037_eval_v8_holdout.json
@@ -0,0 +1,76 @@
{
  "bundle": "src/vaara/data/adversarial_classifier_v8.joblib",
  "bundle_version": "v0.37",
  "threshold": 0.9006,
  "split_manifest": "tests/adversarial/v037_split.json",
  "n": 2277,
  "pos": 2277,
  "tp": 1522,
  "fn": 755,
  "recall": 0.668423364075538,
  "recall_ci": [
    0.6488167555363148,
    0.687462624846786
  ],
  "per_category": {
    "data_exfil": {
      "n": 1690,
      "tp": 968,
      "recall": 0.5727810650887574
    },
    "privilege_escalation": {
      "n": 300,
      "tp": 291,
      "recall": 0.97
    },
    "tool_misuse": {
      "n": 287,
      "tp": 263,
      "recall": 0.9163763066202091
    }
  },
  "per_leg": {
    "claude": {
      "n": 700,
      "tp": 272,
      "recall": 0.38857142857142857
    },
    "mixtral": {
      "n": 690,
      "tp": 489,
      "recall": 0.7086956521739131
    },
    "llama33": {
      "n": 887,
      "tp": 761,
      "recall": 0.8579481397970687
    }
  },
  "per_category_per_leg": {
    "data_exfil__claude": {
      "n": 700,
      "tp": 272,
      "recall": 0.38857142857142857
    },
    "data_exfil__mixtral": {
      "n": 690,
      "tp": 489,
      "recall": 0.7086956521739131
    },
    "data_exfil__llama33": {
      "n": 300,
      "tp": 207,
      "recall": 0.69
    },
    "privilege_escalation__llama33": {
      "n": 300,
      "tp": 291,
      "recall": 0.97
    },
    "tool_misuse__llama33": {
      "n": 287,
      "tp": 263,
      "recall": 0.9163763066202091
    }
  }
}
174 changes: 174 additions & 0 deletions bench/vaara-bench-v0.37.md
@@ -0,0 +1,174 @@
# vaara-bench-v0.37

Methodology delta against [vaara-bench-v0.36](vaara-bench-v0.36.md).
v0.37 is a corpus-augmentation release with two contributions:

1. **Third attacker family added to held-out eval.** 900 adversarial
entries generated by `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic`
on AMD-backed MI300X SR-IOV under `rocm/vllm:latest`. Three
categories (`tool_misuse`, `privilege_escalation`, `data_exfil`),
300 per category, held out from TRAIN.
2. **v8 retrain on union corpus.** v036 Mixtral TM/PE plus v036 Claude
TM/PE entries folded into TRAIN. The v036 DE subset stays held out
so the worst sub-cell from v0.36 (data_exfil × Claude) remains a
pure cross-model eval. The new v037 holdout combines that carried
v036 DE fold with the full v037 Llama-3.3 leg.
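
The fold-assignment rule described in item 2 can be sketched as a small function. This is a minimal illustration under stated assumptions, not the logic of `scripts/build_v037_split.py`: the `corpus` and `category` string labels and the function name `assign_fold` are hypothetical, and v035 entries are left out because they simply inherit their existing assignment.

```python
def assign_fold(corpus: str, category: str) -> str:
    """Return the v0.37 fold for a generated entry.

    corpus:   "v036" or "v037" (hypothetical labels for the generation run)
    category: "TM", "PE", or "DE"
    """
    if category not in {"TM", "PE", "DE"}:
        raise ValueError(f"unknown category: {category!r}")
    if corpus == "v037":
        # The full Llama-3.3 leg is held out, all three categories.
        return "holdout"
    if corpus == "v036":
        # TM/PE from both v036 legs fold into TRAIN; DE stays held out.
        return "train" if category in {"TM", "PE"} else "holdout"
    raise ValueError(f"unknown corpus: {corpus!r}")


if __name__ == "__main__":
    for corpus in ("v036", "v037"):
        for category in ("TM", "PE", "DE"):
            print(corpus, category, "->", assign_fold(corpus, category))
```

Keeping the rule as a pure function of (corpus, category) makes it easy to see that no v037 entry and no v036 DE entry can reach TRAIN.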

## Production headline (v8 on v035 TEST)

| metric | v8 | v7 (prior production) | Δ |
|---|---|---|---|
| TEST recall at T=0.9006 | **86.6% [84.5, 88.6]** | 85.3% [83.1, 87.3] | +1.3pp |
| TEST FPR at same calibration | 5.0% [3.6, 6.8] | 4.6% [3.3, 6.3] | +0.4pp (within CI) |
| n_features | 638 (254 hand + 384 embed) | 638 | flat |
| TRAIN entries | 11,287 | 8,501 | +2,786 |
| TRAIN positive rate | 0.683 | 0.579 | folded TM/PE both legs |

v8 is the new production bundle
(`src/vaara/data/adversarial_classifier_v8.joblib`). v7 and v6 stay on
disk for cross-eval reproducibility. No regression on the
in-distribution TEST set. Recall lifts 1.3 pp at a 0.4 pp FPR cost that
sits inside the prior FPR confidence interval.

## Cross-model held-out result (v8 on v037 holdout)

| cut | n | recall at T=0.9006 |
|---|---|---|
| **overall** | 2,277 | **66.8% [64.9, 68.7]** |
| Llama-3.3 leg (new third family) | 887 | **85.8% [83.3, 87.9]** |
| Mixtral DE (carry-forward from v036) | 690 | 70.9% [67.4, 74.1] |
| Claude DE (carry-forward from v036) | 700 | **38.9% [35.3, 42.5]** |
| data_exfil overall | 1,690 | 57.3% [54.9, 59.6] |
| privilege_escalation overall | 300 | 97.0% [94.4, 98.4] |
| tool_misuse overall | 287 | 91.6% [87.9, 94.3] |
| data_exfil × llama33 | 300 | 69.0% [63.6, 74.0] |
| privilege_escalation × llama33 | 300 | 97.0% [94.4, 98.4] |
| tool_misuse × llama33 | 287 | 91.6% [87.9, 94.3] |
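
The bracketed intervals in the table above can be rechecked from the raw counts in `bench/v037_eval_v8_holdout.json`. The sketch below assumes the intervals are 95% Wilson score intervals with z = 1.96 (the README describes its headline interval as "95% Wilson"); the function name `wilson_interval` is illustrative and not taken from the repository.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # (label, true positives, n) copied from the eval JSON.
    cuts = [
        ("overall", 1522, 2277),
        ("llama33 leg", 761, 887),
        ("claude DE", 272, 700),
    ]
    for label, tp, n in cuts:
        lo, hi = wilson_interval(tp, n)
        print(f"{label}: {tp / n:.1%} [{lo:.1%}, {hi:.1%}]")
```

For the overall cut this reproduces the `recall_ci` bounds stored in the JSON to three decimal places.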

The v8 number on data_exfil × Claude lifts from 26.0% (v7) to **38.9%
(+12.9pp)** on the same 700 entries. The constraint diagnosed in v0.36
was training-corpus distribution, not feature space. Folding v036
TM/PE into TRAIN reweights the existing `dst__` features against actual
adversarial examples and the lift falls out.

## Third-family cross-model number

Llama-3.3-70B-Instruct was not present in TRAIN. The 85.8% overall
recall on its 887 entries indicates the v035 + v036 TM/PE TRAIN fold
generalises to a third attacker family at near-in-distribution recall.
Per-category cuts on this leg:

- privilege_escalation × llama33: 97.0%
- tool_misuse × llama33: 91.6%
- data_exfil × llama33: 69.0%

The data_exfil pattern from v0.36 holds. Across all three attacker
families now seen, DE is the hardest category. PE and TM generalise
cleanly; DE generalises unevenly.
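
The 85.8% leg-level figure is a pooled (micro-averaged) recall over the three per-category cuts, which differ slightly in size because 13 TM entries were dropped. A short sketch using the counts from `bench/v037_eval_v8_holdout.json` shows the pooling; the variable names are illustrative only.

```python
# (true positives, n) per category for the llama33 leg, from the eval JSON.
llama33 = {
    "data_exfil": (207, 300),
    "privilege_escalation": (291, 300),
    "tool_misuse": (263, 287),
}

total_tp = sum(tp for tp, _ in llama33.values())
total_n = sum(n for _, n in llama33.values())

# Micro average: pool the counts, then divide once.
micro_recall = total_tp / total_n

# Macro average: mean of per-category recalls, each category weighted equally.
macro_recall = sum(tp / n for tp, n in llama33.values()) / len(llama33)

if __name__ == "__main__":
    print(f"pooled: {total_tp}/{total_n} = {micro_recall:.4f}")
    print(f"macro:  {macro_recall:.4f}")
```

The two averages nearly coincide here because the categories are close to balanced; on the full holdout, where data_exfil accounts for 1,690 of 2,277 entries, the pooled number is dominated by DE.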

## Carry-forward DE numbers

v036 Mixtral DE: 70.9% (v8) vs 69.3% (v7) on the same 690 entries,
+1.6 pp and inside the v8 confidence interval, so effectively flat.
v036 Claude DE: 38.9% (v8) vs 26.0% (v7) on the same 700 entries,
**+12.9pp**. The v7 → v8 lift is concentrated in the closed-weight leg
that was failing hardest. Open-weight Mixtral DE was already in the
70% tier and stays there. The asymmetry is consistent with the v0.36
mechanism finding: destination signal is the axis, and folding the
related TM/PE distributions into TRAIN repositions the classifier on
that axis for the closed-weight leg.

## Ship gate

v0.37 ships under both a methodology gate and a sub-cell recall gate
because v8 is a production retrain:

| gate | result |
|---|---|
| v035 TEST recall does not regress | PASS, 85.3% → 86.6%, +1.3pp |
| v035 TEST FPR does not regress | PASS, 4.6% → 5.0%, within CI |
| Worst v0.36 sub-cell improves | PASS, DE × Claude 26.0% → 38.9% |
| Third attacker family covered with recall floor | PASS, llama33 overall 85.8% |
| Held-out gap stays published with mechanism | PASS |

Cross-model overall recall is 66.8%, below the 70% floor used as a
soft target in prior releases. That floor was set against the v035
TEST distribution, and cross-model overall is a harder denominator.
The 66.8% figure is a 7.6 pp lift on the comparable v036 number
(59.2% → 66.8%) with a third family added to the denominator.
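
The four numeric gates in the table can be expressed as simple predicates. This is a sketch under stated assumptions rather than the project's gate code: the `Metrics` class, the function `evaluate_gates`, and the gate keys are hypothetical, and the numeric floor for the third-family gate is not stated in this document, so the 0.70 used below is an assumption borrowed from the soft target mentioned above. The fifth gate (held-out gap published with mechanism) is editorial and is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    test_recall: float        # recall on v035 TEST
    test_fpr: float           # FPR on v035 TEST
    worst_cell_recall: float  # data_exfil x Claude recall


def evaluate_gates(
    new: Metrics,
    old: Metrics,
    old_fpr_ci_upper: float,
    third_family_recall: float,
    third_family_floor: float,
) -> dict[str, bool]:
    """Evaluate the numeric ship gates; True means PASS."""
    return {
        "test_recall_no_regression": new.test_recall >= old.test_recall,
        # "Within CI": new FPR does not exceed the prior FPR interval's upper bound.
        "test_fpr_within_prior_ci": new.test_fpr <= old_fpr_ci_upper,
        "worst_cell_improves": new.worst_cell_recall > old.worst_cell_recall,
        "third_family_above_floor": third_family_recall >= third_family_floor,
    }


# Values copied from the tables in this document.
v7 = Metrics(test_recall=0.853, test_fpr=0.046, worst_cell_recall=0.260)
v8 = Metrics(test_recall=0.866, test_fpr=0.050, worst_cell_recall=0.389)

gates = evaluate_gates(
    new=v8,
    old=v7,
    old_fpr_ci_upper=0.063,   # v7 FPR interval [3.3, 6.3]
    third_family_recall=0.858,
    third_family_floor=0.70,  # assumption: floor value not given in the source
)

if __name__ == "__main__":
    for name, passed in gates.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```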

## Generation provenance

Llama-3.3-70B generation ran on an AMD-backed MI300X DigitalOcean
SR-IOV droplet under `rocm/vllm:latest` serving
`RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` with the model's native
`compressed-tensors` FP8 quantization
(`--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92`).
Three parallel category generators, ~22 minutes wall clock for 900
entries at steady-state ~40 entries/min combined. Droplet poweroff
issued post-rsync. Schema validation pass dropped 13 of 300 raw TM
entries (4.3%) where the model emitted non-DENY `expected`. Final v037
counts: TM 287, PE 300, DE 300, total 887 valid.
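
The schema pass that dropped the 13 TM entries can be pictured as a filter over JSONL lines. The sketch below is an assumption-laden illustration, not `scripts/validate_v037.py`: only the `expected` field and the DENY value come from the text above; the exact string encoding `"DENY"`, the function name `split_valid`, and the sample rows are hypothetical.

```python
import json


def split_valid(lines):
    """Partition JSONL lines into (valid entries, dropped raw lines).

    An entry is kept only if it parses as a JSON object and its
    `expected` field equals "DENY" (assumed encoding).
    """
    valid, dropped = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            dropped.append(line)
            continue
        if isinstance(entry, dict) and entry.get("expected") == "DENY":
            valid.append(entry)
        else:
            dropped.append(line)
    return valid, dropped


if __name__ == "__main__":
    # Hypothetical sample rows for illustration only.
    sample = [
        '{"id": "TM-0001", "expected": "DENY"}',
        '{"id": "TM-0002", "expected": "ALLOW"}',
        "not json",
    ]
    kept, rejected = split_valid(sample)
    print(len(kept), "kept,", len(rejected), "dropped")
```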

The v037 droplet recipe matches v0.36 apart from the model swap and
one flag: `--quantization` had to be dropped because `compressed-tensors`
in the model config conflicts with an explicit `fp8` argument. vLLM
auto-detects the quantization scheme from the model config in that
case, and that path serves correctly. This is a model-specific
configuration note rather than a methodology change.

## Chain of custody

| anchor | path | pins |
|---|---|---|
| corpus manifest | `tests/adversarial/MANIFEST.sha256` | SHA-256 of every JSONL including v037 |
| v035 split (inherited) | `tests/adversarial/v035_split.json` | TRAIN/VAL/TEST for v8 calibration |
| v037 split | `tests/adversarial/v037_split.json` | v035 inherited + v036 TM/PE → train, v036 DE + v037 → holdout |
| production bundle | `src/vaara/data/adversarial_classifier_v8.joblib` | trained on 11,287 entries with dst features + embeddings |
| prior production | `src/vaara/data/adversarial_classifier_v7.joblib` | retained for cross-eval |
| Llama-3.3 generator | `scripts/generate_targeted_v037.py` | vLLM HTTP, FP8 dynamic on MI300X |
| droplet driver | `scripts/v037_droplet_run.sh` | idempotent, no destructive EXIT trap |
| watcher | `scripts/v037_local_watcher.sh` | 60s rsync poll, opt-in doctl auto-shutdown |
| split builder | `scripts/build_v037_split.py` | inherits v035, folds v036 TM/PE into train |
| holdout eval | `scripts/eval_v037_holdout.py` | three-leg breakdown (mixtral, claude, llama33) |
| v037 schema check | `scripts/validate_v037.py` | same shape as v0.36 validator |
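
The corpus-manifest anchor can also be checked without `sha256sum`. The sketch below assumes `MANIFEST.sha256` uses the standard `sha256sum` line format (`<hex digest>  <path>`) and that paths are relative to the manifest's own directory; neither detail is confirmed in this document, and `verify_manifest` is an illustrative name.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest: Path) -> list[str]:
    """Return the manifest paths that are missing or whose SHA-256 differs."""
    base = manifest.parent
    failures = []
    for raw in manifest.read_text().splitlines():
        if not raw.strip():
            continue
        # sha256sum writes "<hex>  <path>" (text) or "<hex> *<path>" (binary).
        digest, name = raw.split(maxsplit=1)
        name = name.lstrip("*")
        target = base / name
        if not target.is_file():
            failures.append(name)
            continue
        h = hashlib.sha256()
        with target.open("rb") as fh:
            # Hash in 1 MiB chunks so large JSONL files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != digest.lower():
            failures.append(name)
    return failures
```

An empty return value corresponds to every line of `sha256sum -c` reporting OK, for example `verify_manifest(Path("tests/adversarial/MANIFEST.sha256"))` when run from the repository root.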

## Reproduction recipe

```
cd tests/adversarial && sha256sum -c MANIFEST.sha256
.venv/bin/python scripts/validate_v037.py
.venv/bin/python scripts/build_v037_split.py
.venv/bin/python scripts/save_classifier_bundle.py \
    --version v0.37 --threshold 0.9006 --embeddings \
    --split-manifest tests/adversarial/v037_split.json \
    --train-fold train \
    --bundle-out src/vaara/data/adversarial_classifier_v8.joblib
.venv/bin/python scripts/eval_v037_holdout.py \
    --bundle src/vaara/data/adversarial_classifier_v8.joblib \
    --json-out bench/v037_eval_v8_holdout.json
```

Comment on line 137

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced code block.

At Line 137, use a language hint (e.g., bash) to satisfy markdownlint (MD040, fenced-code-language, reported by markdownlint-cli2 0.22.1) and improve readability.

Comment on lines +137 to +149

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix reproduction command paths to be internally consistent.

At Line 138, the recipe changes directory to tests/adversarial, but subsequent commands at Lines 139-148 call scripts/... and src/... as if run from repo root. This likely makes the copy-paste reproduction flow fail.

Proposed doc fix

```diff
-```
-cd tests/adversarial && sha256sum -c MANIFEST.sha256
-.venv/bin/python scripts/validate_v037.py
-.venv/bin/python scripts/build_v037_split.py
+```bash
+sha256sum -c tests/adversarial/MANIFEST.sha256
+.venv/bin/python scripts/validate_v037.py
+.venv/bin/python scripts/build_v037_split.py
 .venv/bin/python scripts/save_classifier_bundle.py \
     --version v0.37 --threshold 0.9006 --embeddings \
     --split-manifest tests/adversarial/v037_split.json \
     --train-fold train \
     --bundle-out src/vaara/data/adversarial_classifier_v8.joblib
 .venv/bin/python scripts/eval_v037_holdout.py \
     --bundle src/vaara/data/adversarial_classifier_v8.joblib \
     --json-out bench/v037_eval_v8_holdout.json
```

This is an auto-generated comment by CodeRabbit.

## Named limits

1. **Third family generation is 887 valid entries, not 4,000+ like
v0.36.** Wilson CI on a 300-entry sub-cell at p ~ 0.85 is ± 4 pp,
adequate for ship-gate decisions. Scaling the Llama-3.3 leg to v036
density is v0.38 scope, paired with public-benchmark evaluation.
2. **Open-weight families dominate the holdout.** Llama-3.3 (Meta)
and Mixtral (Mistral) are both open-weight models.
Closed-weight coverage in v0.37 is the carry-forward Claude DE
subset only. Adding GPT-4o-class or Gemini-class generation is v0.38
scope.
3. **No public-benchmark eval (PINT, BIPIA, INJECT) yet.** v0.38 scope.
4. **PAIR multi-attacker scale-up not performed.** v0.38 scope
(target: Wilson upper bound on ASR under 1%).
5. **FPR-bounded three-stage combiner per FCR paper (arxiv:2605.22004)
not implemented.** v0.39 scope.

## Cumulative position

v0.37 lifts the worst v0.36 sub-cell by 12.9 pp without giving up
in-distribution recall, and covers a third attacker family at 85.8%
overall. The data_exfil category remains the hardest cross-model
surface. That is the v0.38 + v0.39 line of work: public-benchmark
numbers, PAIR-at-scale, FPR-bounded combiner.
2 changes: 1 addition & 1 deletion clients/ts/package.json
@@ -1,6 +1,6 @@
{
"name": "@vaara/client",
-"version": "0.36.0",
+"version": "0.37.0",
"description": "TypeScript client for the Vaara HTTP API. Conformal risk scoring, hash-chained audit, policy reload, named detectors.",
"main": "dist/index.js",
"types": "dist/index.d.ts",