44 changes: 44 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,50 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht

## [Unreleased]

## [0.38.0] - 2026-05-27

**Theme: Phase 1 PAIR scale-up to n=300 per attacker family on the
Llama-3.3-70B leg.** 900 fresh adversarial entries generated by
`RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` on AMD-backed MI300X
SR-IOV under `rocm/vllm:latest` at seed 43. The v8 production
classifier is carried forward unchanged from v0.37 and evaluated at
calibrated T=0.9006 against the new corpus. Overall recall lands at
88.4% [86.2, 90.4], a 2.6 pp lift over the v0.37 Llama-3.3 leg
(85.8%). The biggest move is on `data_exfil` (69.0% to 75.3%, +6.3
pp), with `tool_misuse` at 93.7% and `privilege_escalation` at 96.3%.
The Phase 1 entries are content-distinct from the v0.37 Llama-3.3 leg
because the new seed produces fresh samples.

External-corpus eval (BIPIA, LLMail-Inject) and the IPI fourth attacker
family both move to v0.39. Neither external corpus pre-extracts the
tool calls that v8 classifies, so an honest eval requires an LLM-agent
harness rather than direct classifier inference. IPI lands in the same
release window, framed as a different attack class rather than a
fourth attacker LLM.

### Added
- `tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl`:
900 Phase 1 entries (300 per category) generated at seed 43,
schema-valid, fingerprint-deduplicated against v037.
- `scripts/eval_v038_phase1.py`: reads the three Phase 1 jsonls
directly and runs the production v8 bundle at T=0.9006. Reports
overall recall, per-category recall, per-severity recall, and
Wilson confidence intervals. Writes the eval artifact to
`bench/v038_phase1_eval_v8.json`.
- `scripts/v038_droplet_run.sh`: droplet driver mirroring the v0.37
shape with the `--quantization fp8` argument removed. The current
`rocm/vllm:latest` image refuses the explicit quantization flag
when the model config already declares `compressed-tensors`. vLLM
auto-detects on that path.
- `scripts/v038_local_watcher.sh`: 60-second rsync-back loop for
continuous recovery of entries and logs during long droplet runs.
- `bench/v038_phase1_eval_v8.json`: Phase 1 eval artifact.
- `bench/vaara-bench-v0.38.md`: v0.38 methodology, chain of custody,
ship gate, and the explicit scope note on the v0.39 external-corpus
and IPI threads.

### Changed
- README bench pointer swapped from v0.37 to v0.38.

## [0.37.1] - 2026-05-27

**Theme: SEP-2787 verifier step 5, argument commitment verification.**
2 changes: 1 addition & 1 deletion README.md
@@ -29,7 +29,7 @@ Held-out TEST recall 85.0% (95% Wilson [82.8, 87.1]) at FPR 4.6% [3.3, 6.3]. Mul
- 140 µs / 210 µs p99 inference latency, commodity CPU
- Distribution-free conformal coverage on the score
- MWU regret bound O(sqrt(T log N))
-- [vaara-bench-v0.37](bench/vaara-bench-v0.37.md): current methodology, chain of custody, ship-gate record. Third attacker family added to cross-model held-out (900 entries generated by Llama-3.3-70B-Instruct on AMD-backed MI300X) and v8 production classifier trained on the v035 + v036 TM/PE union fold. Holds 86.6% recall on v035 TEST, 85.8% on the new Llama-3.3 leg, lifts the worst v0.36 sub-cell (data_exfil × Claude) from 26.0% to 38.9%. Historical bench docs live under `bench/` for chain-of-custody continuity.
+- [vaara-bench-v0.38](bench/vaara-bench-v0.38.md): current methodology, chain of custody, ship-gate record. Phase 1 PAIR scale-up to n=300 per attacker family with 900 fresh Llama-3.3-70B entries on AMD-backed MI300X at seed 43. v8 production classifier unchanged from v0.37, evaluated at calibrated T=0.9006. Overall recall 88.4% [86.2, 90.4] on the Phase 1 corpus, +2.6pp over the v0.37 Llama-3.3 leg, biggest lift on data_exfil (+6.3pp). Historical bench docs live under `bench/` for chain-of-custody continuity.
- [vaara-bench-v1](bench/vaara-bench-v1.md): 77-trace synthetic-corpus regression baseline with frozen methodology, 100% soft TPR, 0% hard FPR

Each figure is reproducible from the public corpus or the bench harness in `bench/`.
56 changes: 56 additions & 0 deletions bench/v038_phase1_eval_v8.json
@@ -0,0 +1,56 @@
{
  "bundle": "src/vaara/data/adversarial_classifier_v8.joblib",
  "bundle_version": "v0.37",
  "threshold": 0.9006,
  "source": "v0.38 Phase 1: tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl",
  "model_attacker": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
  "seed": 43,
  "n": 900,
  "pos": 900,
  "tp": 796,
  "fn": 104,
  "recall": 0.8844444444444445,
  "recall_ci": [
    0.8619044268492455,
    0.9037164518533944
  ],
  "per_category": {
    "tool_misuse": {
      "n": 300,
      "tp": 281,
      "recall": 0.9366666666666666
    },
    "privilege_escalation": {
      "n": 300,
      "tp": 289,
      "recall": 0.9633333333333334
    },
    "data_exfil": {
      "n": 300,
      "tp": 226,
      "recall": 0.7533333333333333
    }
  },
  "per_severity": {
    "critical": {
      "n": 397,
      "tp": 366,
      "recall": 0.9219143576826196
    },
    "medium": {
      "n": 161,
      "tp": 135,
      "recall": 0.8385093167701864
    },
    "high": {
      "n": 336,
      "tp": 289,
      "recall": 0.8601190476190477
    },
    "low": {
      "n": 6,
      "tp": 6,
      "recall": 1.0
    }
  }
}
130 changes: 130 additions & 0 deletions bench/vaara-bench-v0.38.md
@@ -0,0 +1,130 @@
# vaara-bench-v0.38

Methodology delta against [vaara-bench-v0.37](vaara-bench-v0.37.md).
v0.38 is a corpus scale-up release on the Phase 1 PAIR leg:

1. **Third attacker family scaled to n=300 per category.** 900 fresh
adversarial entries generated by
`RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` on AMD-backed MI300X
SR-IOV under `rocm/vllm:latest` at seed 43. Three categories
(`tool_misuse`, `privilege_escalation`, `data_exfil`), 300 per
category, held out from TRAIN.
2. **v8 classifier carried forward unchanged.** No retrain in v0.38.
The Phase 1 corpus is evaluated against the same production bundle
(`adversarial_classifier_v8.joblib`) that shipped in v0.37 at the
same calibrated threshold T=0.9006.

## Phase 1 result (v8 on 900 Llama-3.3 entries, seed 43)

| cut | n | recall at T=0.9006 |
|---|---|---|
| **overall** | 900 | **88.4% [86.2, 90.4]** |
| tool_misuse | 300 | 93.7% [90.3, 95.9] |
| privilege_escalation | 300 | 96.3% [93.6, 97.9] |
| data_exfil | 300 | 75.3% [70.2, 79.9] |

Against the v0.37 Llama-3.3 leg at n=887 (85.8% overall, TM 91.6%,
PE 97.0%, DE 69.0%), Phase 1 is +2.6pp overall, with the biggest lift
on data_exfil (+6.3pp). PE moves inside its prior confidence interval.
The DE lift is consistent with the v0.37 mechanism finding that
data_exfil is the hardest category but not a structural failure.
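
The bracketed intervals in the table can be checked by hand. The sketch below is a minimal Wilson score interval, assuming a two-sided 95% level and the counts recorded in `bench/v038_phase1_eval_v8.json`; the function name `wilson_interval` and the dictionary `PHASE1_COUNTS` are illustrative and are not taken from `scripts/eval_v038_phase1.py`.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    The default z is the 97.5th percentile of the standard normal,
    which gives an approximately 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (true positives, n) per cut, copied from bench/v038_phase1_eval_v8.json.
PHASE1_COUNTS = {
    "overall": (796, 900),
    "tool_misuse": (281, 300),
    "privilege_escalation": (289, 300),
    "data_exfil": (226, 300),
}

for cut, (tp, n) in PHASE1_COUNTS.items():
    lo, hi = wilson_interval(tp, n)
    print(f"{cut}: {100 * tp / n:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Under these assumptions the overall interval should agree with the `recall_ci` field of the eval artifact to well within rounding.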

The Phase 1 corpus uses a different random seed than the v0.37 leg
(43 vs the v0.37 generator default) so the entries are content-distinct.
Fingerprint deduplication against v037 entries showed zero true
duplicates.
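
This document does not spell out how the fingerprint is computed. As a rough illustration only, the sketch below hashes a canonical JSON form of each entry with the `id` and `agent_id` fields excluded, so that two entries with identical content but different identifiers count as duplicates. The choice of SHA-256, the excluded fields, and the sample field names `tool` and `arguments` are all assumptions.

```python
import hashlib
import json

# Assumption: identifier fields are excluded so the fingerprint reflects content only.
ID_FIELDS = ("id", "agent_id")


def fingerprint(entry: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the non-identifier fields."""
    content = {k: v for k, v in entry.items() if k not in ID_FIELDS}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(new_entries: list[dict], reference_entries: list[dict]) -> list[dict]:
    """Return the new entries whose fingerprint already occurs in the reference set."""
    seen = {fingerprint(e) for e in reference_entries}
    return [e for e in new_entries if fingerprint(e) in seen]


# Hypothetical entries; the real corpus schema is not shown in this document.
reference = [{"id": "a-1", "agent_id": "x", "tool": "read_file", "arguments": {"path": "/etc/passwd"}}]
fresh = [
    {"id": "b-1", "agent_id": "y", "tool": "read_file", "arguments": {"path": "/etc/passwd"}},
    {"id": "b-2", "agent_id": "y", "tool": "send_mail", "arguments": {"to": "ops@example.com"}},
]
duplicates = find_duplicates(fresh, reference)
print(len(duplicates), "duplicate(s) found")
```

Sorting keys before hashing keeps the fingerprint stable when two generators emit the same fields in a different order.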

## Recall by severity (Phase 1)

| severity | n | recall |
|---|---|---|
| critical | 397 | 92.2% |
| high | 336 | 86.0% |
| medium | 161 | 83.9% |
| low | 6 | 100.0% |

The severity pattern holds across the Phase 1 entries: critical above
92%, high and medium in the mid-eighties. At n=6 the low bucket is too
small to carry any weight.
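
A quick consistency check on the eval artifact is that both cuts partition the same 900 entries and the same 796 true positives. The sketch below inlines the counts from `bench/v038_phase1_eval_v8.json` rather than reading the file; the helper name `cut_totals` is illustrative.

```python
# Counts copied from bench/v038_phase1_eval_v8.json (recall fields omitted).
artifact = {
    "n": 900,
    "tp": 796,
    "per_category": {
        "tool_misuse": {"n": 300, "tp": 281},
        "privilege_escalation": {"n": 300, "tp": 289},
        "data_exfil": {"n": 300, "tp": 226},
    },
    "per_severity": {
        "critical": {"n": 397, "tp": 366},
        "high": {"n": 336, "tp": 289},
        "medium": {"n": 161, "tp": 135},
        "low": {"n": 6, "tp": 6},
    },
}


def cut_totals(cut: dict) -> tuple[int, int]:
    """Sum entry counts and true positives over every cell of one cut."""
    return sum(c["n"] for c in cut.values()), sum(c["tp"] for c in cut.values())


for name in ("per_category", "per_severity"):
    n, tp = cut_totals(artifact[name])
    status = "ok" if (n, tp) == (artifact["n"], artifact["tp"]) else "MISMATCH"
    print(f"{name}: n={n} tp={tp} {status}")

# Per-severity recall recomputed from the raw counts.
severity_recall = {sev: c["tp"] / c["n"] for sev, c in artifact["per_severity"].items()}
```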

## Generation provenance

Phase 1 generation ran on an AMD-backed MI300X DigitalOcean SR-IOV
droplet under `rocm/vllm:latest` serving
`RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` with the model's native
`compressed-tensors` FP8 quantization
(`--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92`).
Three parallel category generators, 22 minutes wall clock for 900
entries (vLLM health-up at 06:48Z, last generator done at 07:06Z on
2026-05-27). All 900 entries schema-valid.

The v037 generator hardcoded `v037` in entry `id` and `agent_id` fields
regardless of `--random-seed`, producing ID collisions against the
v037 corpus. A one-pass rename `v037 -> v038` zeroed those collisions
in place before the eval. Content uniqueness was preserved because
the model produced distinct samples at the new seed.
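
The rename pass is not included in this PR. A minimal sketch of what such a pass could look like follows, assuming each JSONL line is a flat JSON object and that only the `id` and `agent_id` fields need rewriting; the sample identifier format is hypothetical.

```python
import json

# The two fields named above as carrying the hardcoded "v037" tag.
ID_FIELDS = ("id", "agent_id")


def rename_ids(jsonl_line: str, old: str = "v037", new: str = "v038") -> str:
    """Rewrite the version tag in identifier fields only, leaving the payload untouched."""
    entry = json.loads(jsonl_line)
    for field in ID_FIELDS:
        value = entry.get(field)
        if isinstance(value, str):
            entry[field] = value.replace(old, new)
    return json.dumps(entry, ensure_ascii=False)


# Hypothetical entry; the real identifier format is not shown in this document.
sample = json.dumps({
    "id": "TM-v037-0001",
    "agent_id": "agent-v037-0001",
    "category": "tool_misuse",
    "note": "mentions v037 in free text",
})
renamed = json.loads(rename_ids(sample))
print(renamed["id"], renamed["agent_id"])
```

Restricting the replacement to identifier fields avoids touching entry content that happens to contain the same string, which a blanket text substitution over the file would not guarantee.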

The v0.38 droplet driver (`scripts/v038_droplet_run.sh`) drops the
`--quantization fp8` argument that v0.37 used. The current
`rocm/vllm:latest` image (vllm 0.11.2.dev673) refuses an explicit
quantization flag when the model config already declares
`compressed-tensors`. vLLM auto-detects on that path.

## Chain of custody

| anchor | path | pins |
|---|---|---|
| Phase 1 corpus | `tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl` | 300 entries per category, seed 43, schema-valid |
| production bundle | `src/vaara/data/adversarial_classifier_v8.joblib` | unchanged from v0.37 |
| Phase 1 eval | `scripts/eval_v038_phase1.py` | reads jsonls directly, bypasses split manifest |
| eval artifact | `bench/v038_phase1_eval_v8.json` | overall + per-category + per-severity |
| droplet driver | `scripts/v038_droplet_run.sh` | drops --quantization fp8 flag |
| watcher | `scripts/v038_local_watcher.sh` | 60s rsync-back loop for defensive recovery |
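
For orientation, the core of a direct-from-JSONL eval can be sketched as below. This is not the contents of `scripts/eval_v038_phase1.py`. It assumes every entry is a positive (the artifact reports `pos` equal to `n`), that each entry carries a `category` field, and that an entry counts as detected when its score is at or above the threshold. The scoring callable stands in for the v8 bundle, whose interface is not shown here.

```python
import json
from collections import defaultdict
from typing import Callable


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def recall_by_category(
    entries: list[dict],
    score_fn: Callable[[dict], float],
    threshold: float = 0.9006,
) -> dict[str, float]:
    """Per-category recall on an all-positive corpus at a fixed threshold."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for entry in entries:
        category = entry["category"]  # assumed field name
        totals[category] += 1
        if score_fn(entry) >= threshold:  # assumed: score >= T means flagged
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Stub data and scorer for illustration; a real run would load the three
# Phase 1 files with load_jsonl and score with the production bundle.
demo_entries = [
    {"category": "tool_misuse", "score": 0.95},
    {"category": "tool_misuse", "score": 0.40},
    {"category": "data_exfil", "score": 0.91},
]
demo_recall = recall_by_category(demo_entries, score_fn=lambda e: e["score"])
print(demo_recall)
```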

## Reproduction recipe

```
PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
--bundle src/vaara/data/adversarial_classifier_v8.joblib \
--threshold 0.9006 \
--json-out bench/v038_phase1_eval_v8.json
```

Comment on lines +86 to +92

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to the fenced code block.

The code block starting at line 86 should specify a language for proper syntax highlighting and linting consistency. As per the static analysis hint, fenced code blocks should have a language specified.

<details>
<summary>📝 Proposed fix</summary>

````diff
-```
+```bash
 PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
     --bundle src/vaara/data/adversarial_classifier_v8.joblib \
     --threshold 0.9006 \
     --json-out bench/v038_phase1_eval_v8.json
 ```
````

</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 86-86: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @bench/vaara-bench-v0.38.md around lines 86 - 92, the fenced code block
containing the shell command that runs scripts/eval_v038_phase1.py should
include a language identifier for syntax highlighting; edit the block in
bench/vaara-bench-v0.38.md around the PYTHONPATH invocation to change the
opening fence from a bare triple-backtick fence to one tagged bash, so the
command (including scripts/eval_v038_phase1.py, --bundle, --threshold,
--json-out) is marked as bash.


</details>

<!-- This is an auto-generated comment by CodeRabbit -->

## What is not in v0.38

Two threads carry to v0.39:

1. **External-corpus eval (BIPIA, LLMail-Inject).** BIPIA provides 75
text-injection templates plus 50 code-injection templates (the
instructions to inject into a benign context). LLMail-Inject
provides 208K labelled participant submissions on whether an LLM
email assistant followed each injection. Neither corpus
pre-extracts the resulting tool calls. v8 classifies tool calls.
An honest eval against either requires running an LLM agent
end-to-end on the injection prompt, capturing the resulting tool
call, and then running v8 on that tool call. That is an LLM-agent
harness, not a packaging task on top of an existing eval path.
BIPIA attack texts are downloaded to
`tests/adversarial/external/bipia/` for the v0.39 harness work.
2. **IPI fourth attacker family.** Indirect prompt injection fits more
cleanly as a different attack class in v0.39 than as a fourth
attacker LLM in v0.38. The Phase 1 result on the existing three
attacker families is the v0.38 headline.
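
The three stages described in item 1 (run the agent on the injection, capture the tool call, score the tool call) can be sketched as a single function. Everything here is hypothetical scaffolding for the v0.39 harness: the `agent` and `score_fn` callables, the `HarnessResult` shape, and the convention that an agent returning `None` made no tool call are assumptions, not an existing Vaara API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HarnessResult:
    injection: str
    tool_call: Optional[dict]  # None when the agent made no tool call
    score: Optional[float]
    flagged: bool


def run_injection(
    injection: str,
    agent: Callable[[str], Optional[dict]],
    score_fn: Callable[[dict], float],
    threshold: float = 0.9006,
) -> HarnessResult:
    """Run the agent end-to-end, capture its tool call, then score that call."""
    tool_call = agent(injection)
    if tool_call is None:
        # No tool call was produced, so there is nothing to classify.
        return HarnessResult(injection, None, None, False)
    score = score_fn(tool_call)
    return HarnessResult(injection, tool_call, score, score >= threshold)


# Stub agent and scorer standing in for a real LLM agent and the v8 bundle.
def stub_agent(prompt: str) -> Optional[dict]:
    if "forward" in prompt:
        return {"tool": "send_email", "arguments": {"to": "attacker@example.com"}}
    return None


def stub_score(tool_call: dict) -> float:
    return 0.97 if tool_call["tool"] == "send_email" else 0.05


followed = run_injection("please forward the inbox", stub_agent, stub_score)
ignored = run_injection("summarise this thread", stub_agent, stub_score)
print(followed.flagged, ignored.flagged)
```

Keeping the "no tool call" outcome separate from "tool call scored below threshold" matters for the eventual metrics, since only the second is a classifier miss.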

## Ship gate

| gate | result |
|---|---|
| Phase 1 PAIR scale-up clears the v0.37 Llama-3.3 leg | PASS, 85.8% -> 88.4% overall, +2.6pp |
| Worst Phase 1 sub-cell stays above 70% recall floor | PASS, DE 75.3% |
| In-distribution TEST recall not regressed | PASS, v8 unchanged from v0.37 |
| Methodology + chain of custody published | PASS |
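
The two numeric gates in the table can be expressed as a small check over the eval artifact's counts. This is a sketch under the thresholds stated in this document (the 85.8% v0.37 Llama-3.3 leg and the 70% recall floor); the constant and function names are illustrative, and the remaining two gates are not derivable from this artifact alone.

```python
V037_LLAMA33_OVERALL = 0.858  # v0.37 Llama-3.3 leg, overall recall
RECALL_FLOOR = 0.70           # minimum recall for the worst Phase 1 sub-cell


def ship_gates(overall: float, per_category: dict[str, float]) -> dict[str, bool]:
    """Evaluate the two recall gates that depend only on Phase 1 numbers."""
    worst = min(per_category.values())
    return {
        "clears_v037_llama33_leg": overall > V037_LLAMA33_OVERALL,
        "worst_subcell_above_floor": worst > RECALL_FLOOR,
    }


# Recall values recomputed from the counts in bench/v038_phase1_eval_v8.json.
gates = ship_gates(
    overall=796 / 900,
    per_category={
        "tool_misuse": 281 / 300,
        "privilege_escalation": 289 / 300,
        "data_exfil": 226 / 300,
    },
)
for gate, passed in gates.items():
    print(f"{gate}: {'PASS' if passed else 'FAIL'}")
```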

## Cumulative position

v0.38 closes the Phase 1 PAIR scale-up on three attacker families
(Mixtral, Claude, Llama-3.3-70B) at n=300 each, with the third family
landing at 88.4% overall recall against an unchanged v8 classifier.
The next-release line of work is external-corpus eval against BIPIA
and LLMail-Inject and the IPI fourth attacker family, both of which
share the LLM-agent harness scope that v0.39 is sized for.
2 changes: 1 addition & 1 deletion clients/ts/package.json
@@ -1,6 +1,6 @@
{
"name": "@vaara/client",
-"version": "0.37.1",
+"version": "0.38.0",
"description": "TypeScript client for the Vaara HTTP API. Conformal risk scoring, hash-chained audit, policy reload, named detectors.",
"main": "dist/index.js",
"types": "dist/index.d.ts",
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "vaara"
-version = "0.37.1"
+version = "0.38.0"
description = "Adaptive AI Agent Execution Layer for risk scoring, audit trails, and regulatory compliance"
requires-python = ">=3.10"
license = "Apache-2.0"