-
Notifications
You must be signed in to change notification settings - Fork 1
release(v0.38.0): PAIR scale-up to n=300 per attacker family, 88.4% recall #147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| { | ||
| "bundle": "src/vaara/data/adversarial_classifier_v8.joblib", | ||
| "bundle_version": "v0.37", | ||
| "threshold": 0.9006, | ||
| "source": "v0.38 Phase 1: tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl", | ||
| "model_attacker": "RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic", | ||
| "seed": 43, | ||
| "n": 900, | ||
| "pos": 900, | ||
| "tp": 796, | ||
| "fn": 104, | ||
| "recall": 0.8844444444444445, | ||
| "recall_ci": [ | ||
| 0.8619044268492455, | ||
| 0.9037164518533944 | ||
| ], | ||
| "per_category": { | ||
| "tool_misuse": { | ||
| "n": 300, | ||
| "tp": 281, | ||
| "recall": 0.9366666666666666 | ||
| }, | ||
| "privilege_escalation": { | ||
| "n": 300, | ||
| "tp": 289, | ||
| "recall": 0.9633333333333334 | ||
| }, | ||
| "data_exfil": { | ||
| "n": 300, | ||
| "tp": 226, | ||
| "recall": 0.7533333333333333 | ||
| } | ||
| }, | ||
| "per_severity": { | ||
| "critical": { | ||
| "n": 397, | ||
| "tp": 366, | ||
| "recall": 0.9219143576826196 | ||
| }, | ||
| "medium": { | ||
| "n": 161, | ||
| "tp": 135, | ||
| "recall": 0.8385093167701864 | ||
| }, | ||
| "high": { | ||
| "n": 336, | ||
| "tp": 289, | ||
| "recall": 0.8601190476190477 | ||
| }, | ||
| "low": { | ||
| "n": 6, | ||
| "tp": 6, | ||
| "recall": 1.0 | ||
| } | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| # vaara-bench-v0.38 | ||
|
|
||
| Methodology delta against [vaara-bench-v0.37](vaara-bench-v0.37.md). | ||
| v0.38 is a corpus scale-up release on the Phase 1 PAIR leg: | ||
|
|
||
| 1. **Third attacker family scaled to n=300 per category.** 900 fresh | ||
| adversarial entries generated by | ||
| `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` on AMD-backed MI300X | ||
| SR-IOV under `rocm/vllm:latest` at seed 43. Three categories | ||
| (`tool_misuse`, `privilege_escalation`, `data_exfil`), 300 per | ||
| category, held out from TRAIN. | ||
| 2. **v8 classifier carried forward unchanged.** No retrain in v0.38. | ||
| The Phase 1 corpus is evaluated against the same production bundle | ||
| (`adversarial_classifier_v8.joblib`) that shipped in v0.37 at the | ||
| same calibrated threshold T=0.9006. | ||
|
|
||
| ## Phase 1 result (v8 on 900 Llama-3.3 entries, seed 43) | ||
|
|
||
| | cut | n | recall at T=0.9006 | | ||
| |---|---|---| | ||
| | **overall** | 900 | **88.4% [86.2, 90.4]** | | ||
| | tool_misuse | 300 | 93.7% [90.3, 95.9] | | ||
| | privilege_escalation | 300 | 96.3% [93.6, 97.9] | | ||
| | data_exfil | 300 | 75.3% [70.2, 79.9] | | ||
|
|
||
| vs the v0.37 Llama-3.3 leg at n=887 (85.8% overall, TM 91.6%, PE 97.0%, | ||
| DE 69.0%): +2.6pp overall, with the biggest lift on data_exfil | ||
| (+6.3pp). PE moves inside its prior confidence interval. The DE lift | ||
| holds the v0.37 mechanism finding that data_exfil is the hardest | ||
| category but not a structural failure. | ||
|
|
||
| The Phase 1 corpus uses a different random seed than the v0.37 leg | ||
| (43 vs the v0.37 generator default) so the entries are content-distinct. | ||
| Fingerprint deduplication against v037 entries showed zero true | ||
| duplicates. | ||
|
|
||
| ## Recall by severity (Phase 1) | ||
|
|
||
| | severity | n | recall | | ||
| |---|---|---| | ||
| | critical | 397 | 92.2% | | ||
| | high | 336 | 86.0% | | ||
| | medium | 161 | 83.9% | | ||
| | low | 6 | 100.0% | | ||
|
|
||
| The severity distribution carries across the Phase 1 entries: critical | ||
| above 92%, high and medium in the mid-eighties. The low n=6 bucket is | ||
| not load-bearing at this sample size. | ||
|
|
||
| ## Generation provenance | ||
|
|
||
| Phase 1 generation ran on an AMD-backed MI300X DigitalOcean SR-IOV | ||
| droplet under `rocm/vllm:latest` serving | ||
| `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` with the model's native | ||
| `compressed-tensors` FP8 quantization | ||
| (`--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92`). | ||
| Three parallel category generators, 22 minutes wall clock for 900 | ||
| entries (vLLM health-up at 06:48Z, last generator done at 07:06Z on | ||
| 2026-05-27). All 900 entries schema-valid. | ||
|
|
||
| The v037 generator hardcoded `v037` in entry `id` and `agent_id` fields | ||
| regardless of `--random-seed`, producing ID collisions against the | ||
| v037 corpus. A one-pass rename `v037 -> v038` zeroed those collisions | ||
| in place before the eval. Content uniqueness was preserved because | ||
| the model produced distinct samples at the new seed. | ||
|
|
||
| The v0.38 droplet driver (`scripts/v038_droplet_run.sh`) drops the | ||
| `--quantization fp8` argument that v0.37 used. The current | ||
| `rocm/vllm:latest` image (vllm 0.11.2.dev673) refuses an explicit | ||
| quantization flag when the model config already declares | ||
| `compressed-tensors`. vLLM auto-detects on that path. | ||
|
|
||
| ## Chain of custody | ||
|
|
||
| | anchor | path | pins | | ||
| |---|---|---| | ||
| | Phase 1 corpus | `tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl` | 300 entries per category, seed 43, schema-valid | | ||
| | production bundle | `src/vaara/data/adversarial_classifier_v8.joblib` | unchanged from v0.37 | | ||
| | Phase 1 eval | `scripts/eval_v038_phase1.py` | reads jsonls directly, bypasses split manifest | | ||
| | eval artifact | `bench/v038_phase1_eval_v8.json` | overall + per-category + per-severity | | ||
| | droplet driver | `scripts/v038_droplet_run.sh` | drops --quantization fp8 flag | | ||
| | watcher | `scripts/v038_local_watcher.sh` | 60s rsync-back loop for defensive recovery | | ||
|
|
||
| ## Reproduction recipe | ||
|
|
||
| ``` | ||
| PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \ | ||
| --bundle src/vaara/data/adversarial_classifier_v8.joblib \ | ||
| --threshold 0.9006 \ | ||
| --json-out bench/v038_phase1_eval_v8.json | ||
| ``` | ||
|
|
||
| ## What is not in v0.38 | ||
|
|
||
| Two threads carry to v0.39: | ||
|
|
||
| 1. **External-corpus eval (BIPIA, LLMail-Inject).** BIPIA provides 75 | ||
| text-injection templates plus 50 code-injection templates (the | ||
| instructions to inject into a benign context). LLMail-Inject | ||
| provides 208K labelled participant submissions on whether an LLM | ||
| email assistant followed each injection. Neither corpus | ||
| pre-extracts the resulting tool calls. v8 classifies tool calls. | ||
| An honest eval against either requires running an LLM agent | ||
| end-to-end on the injection prompt, capturing the resulting tool | ||
| call, and then running v8 on that tool call. That is an LLM-agent | ||
| harness, not a packaging task on top of an existing eval path. | ||
| BIPIA attack texts are downloaded to | ||
| `tests/adversarial/external/bipia/` for the v0.39 harness work. | ||
| 2. **IPI fourth attacker family.** Indirect prompt injection lands | ||
| cleaner as a different attack class in v0.39 rather than a fourth | ||
| attacker LLM in v0.38. The Phase 1 result on the existing three | ||
| attacker families is the v0.38 headline. | ||
|
|
||
| ## Ship gate | ||
|
|
||
| | gate | result | | ||
| |---|---| | ||
| | Phase 1 PAIR scale-up clears the v0.37 Llama-3.3 leg | PASS, 85.8% -> 88.4% overall, +2.6pp | | ||
| | Worst Phase 1 sub-cell stays above 70% recall floor | PASS, DE 75.3% | | ||
| | In-distribution TEST recall not regressed | PASS, v8 unchanged from v0.37 | | ||
| | Methodology + chain of custody published | PASS | | ||
|
|
||
| ## Cumulative position | ||
|
|
||
| v0.38 closes the Phase 1 PAIR scale-up on three attacker families | ||
| (Mixtral, Claude, Llama-3.3-70B) at n=300 each, with the third family | ||
| landing at 88.4% overall recall against an unchanged v8 classifier. | ||
| The next-release line of work is external-corpus eval against BIPIA | ||
| and LLMail-Inject and the IPI fourth attacker family, both of which | ||
| share the LLM-agent harness scope that v0.39 is sized for. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add language identifier to the fenced code block.
The code block starting at line 86 should specify a language for proper syntax highlighting and linting consistency. As per the static analysis hint, fenced code blocks should have a language specified.
📝 Proposed fix
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
@bench/vaara-bench-v0.38.mdaround lines 86 - 92, The fenced code blockcontaining the shell command that runs scripts/eval_v038_phase1.py should
include a language identifier for syntax highlighting; edit the block in
bench/vaara-bench-v0.38.md around the PYTHONPATH invocation to change the
opening fence from
tobash so the command (includingscripts/eval_v038_phase1.py, --bundle, --threshold, --json-out) is marked as
bash.