release(v0.37.0): third attacker family + v8 classifier closes worst v0.36 sub-cell #144
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| { | ||
| "bundle": "src/vaara/data/adversarial_classifier_v8.joblib", | ||
| "bundle_version": "v0.37", | ||
| "threshold": 0.9006, | ||
| "split_manifest": "tests/adversarial/v037_split.json", | ||
| "n": 2277, | ||
| "pos": 2277, | ||
| "tp": 1522, | ||
| "fn": 755, | ||
| "recall": 0.668423364075538, | ||
| "recall_ci": [ | ||
| 0.6488167555363148, | ||
| 0.687462624846786 | ||
| ], | ||
| "per_category": { | ||
| "data_exfil": { | ||
| "n": 1690, | ||
| "tp": 968, | ||
| "recall": 0.5727810650887574 | ||
| }, | ||
| "privilege_escalation": { | ||
| "n": 300, | ||
| "tp": 291, | ||
| "recall": 0.97 | ||
| }, | ||
| "tool_misuse": { | ||
| "n": 287, | ||
| "tp": 263, | ||
| "recall": 0.9163763066202091 | ||
| } | ||
| }, | ||
| "per_leg": { | ||
| "claude": { | ||
| "n": 700, | ||
| "tp": 272, | ||
| "recall": 0.38857142857142857 | ||
| }, | ||
| "mixtral": { | ||
| "n": 690, | ||
| "tp": 489, | ||
| "recall": 0.7086956521739131 | ||
| }, | ||
| "llama33": { | ||
| "n": 887, | ||
| "tp": 761, | ||
| "recall": 0.8579481397970687 | ||
| } | ||
| }, | ||
| "per_category_per_leg": { | ||
| "data_exfil__claude": { | ||
| "n": 700, | ||
| "tp": 272, | ||
| "recall": 0.38857142857142857 | ||
| }, | ||
| "data_exfil__mixtral": { | ||
| "n": 690, | ||
| "tp": 489, | ||
| "recall": 0.7086956521739131 | ||
| }, | ||
| "data_exfil__llama33": { | ||
| "n": 300, | ||
| "tp": 207, | ||
| "recall": 0.69 | ||
| }, | ||
| "privilege_escalation__llama33": { | ||
| "n": 300, | ||
| "tp": 291, | ||
| "recall": 0.97 | ||
| }, | ||
| "tool_misuse__llama33": { | ||
| "n": 287, | ||
| "tp": 263, | ||
| "recall": 0.9163763066202091 | ||
| } | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,174 @@ | ||
| # vaara-bench-v0.37 | ||
|
|
||
| Methodology delta against [vaara-bench-v0.36](vaara-bench-v0.36.md). | ||
| v0.37 is a corpus-augmentation release with two contributions: | ||
|
|
||
| 1. **Third attacker family added to held-out eval.** 900 adversarial | ||
| entries generated by `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` | ||
| on AMD-backed MI300X SR-IOV under `rocm/vllm:latest`. Three | ||
| categories (`tool_misuse`, `privilege_escalation`, `data_exfil`), | ||
| 300 per category, held out from TRAIN. | ||
| 2. **v8 retrain on union corpus.** v036 Mixtral TM/PE plus v036 Claude | ||
| TM/PE entries folded into TRAIN. The v036 DE subset stays held out | ||
| so the worst sub-cell from v0.36 (data_exfil × Claude) remains a | ||
| pure cross-model eval. The new v037 holdout combines that carried | ||
| v036 DE fold with the full v037 Llama-3.3 leg. | ||
|
|
||
| ## Production headline (v8 on v035 TEST) | ||
|
|
||
| | metric | v8 | v7 (prior production) | Δ | | ||
| |---|---|---|---| | ||
| | TEST recall at T=0.9006 | **86.6% [84.5, 88.6]** | 85.3% [83.1, 87.3] | +1.3pp | | ||
| | TEST FPR at same calibration | 5.0% [3.6, 6.8] | 4.6% [3.3, 6.3] | +0.4pp (within CI) | | ||
| | n_features | 638 (254 hand + 384 embed) | 638 | flat | | ||
| | TRAIN entries | 11,287 | 8,501 | +2,786 | | ||
| | TRAIN positive rate | 0.683 | 0.579 | folded TM/PE both legs | | ||
|
|
||
| v8 is the new production bundle | ||
| (`src/vaara/data/adversarial_classifier_v8.joblib`). v7 and v6 stay on | ||
| disk for cross-eval reproducibility. No regression on the | ||
| in-distribution TEST set. Recall lifts 1.3 pp at a 0.4 pp FPR cost that | ||
| sits inside the prior FPR confidence interval. | ||
|
|
||
| ## Cross-model held-out result (v8 on v037 holdout) | ||
|
|
||
| | cut | n | recall at T=0.9006 | | ||
| |---|---|---| | ||
| | **overall** | 2,277 | **66.8% [64.9, 68.7]** | | ||
| | Llama-3.3 leg (new third family) | 887 | **85.8% [83.3, 87.9]** | | ||
| | Mixtral DE (carry-forward from v036) | 690 | 70.9% [67.4, 74.1] | | ||
| | Claude DE (carry-forward from v036) | 700 | **38.9% [35.3, 42.5]** | | ||
| | data_exfil overall | 1,690 | 57.3% [54.9, 59.6] | | ||
| | privilege_escalation overall | 300 | 97.0% [94.4, 98.4] | | ||
| | tool_misuse overall | 287 | 91.6% [87.9, 94.3] | | ||
| | data_exfil × llama33 | 300 | 69.0% [63.6, 74.0] | | ||
| | privilege_escalation × llama33 | 300 | 97.0% [94.4, 98.4] | | ||
| | tool_misuse × llama33 | 287 | 91.6% [87.9, 94.3] | | ||
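The bracketed intervals in the table above appear consistent with a 95% Wilson score interval on `tp / n`. The sketch below recomputes two of them from the counts in the eval JSON. It assumes `z = 1.96`; the evaluation script may use a slightly different quantile, and `wilson_interval` is an illustrative name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts taken from the eval JSON: overall holdout and the Claude DE leg.
for label, tp, n in [("overall", 1522, 2277), ("claude DE", 272, 700)]:
    lo, hi = wilson_interval(tp, n)
    print(f"{label}: {tp / n:.1%} [{lo:.1%}, {hi:.1%}]")
```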
|
|
||
| Recall on data_exfil × Claude lifts from 26.0% (v7) to **38.9% | ||
| (v8, +12.9pp)** on the same 700 entries. The constraint diagnosed in v0.36 | ||
| was training-corpus distribution, not feature space. Folding v036 | ||
| TM/PE into TRAIN reweights the existing `dst__` features against actual | ||
| adversarial examples and the lift falls out. | ||
|
|
||
| ## Third-family cross-model number | ||
|
|
||
| Llama-3.3-70B-Instruct was not present in TRAIN. The 85.8% overall | ||
| recall on its 887 entries indicates the v035 + v036 TM/PE TRAIN fold | ||
| generalises to a third attacker family at near-in-distribution recall. | ||
| Per-category cuts on this leg: | ||
|
|
||
| - privilege_escalation × llama33: 97.0% | ||
| - tool_misuse × llama33: 91.6% | ||
| - data_exfil × llama33: 69.0% | ||
|
|
||
| The data_exfil pattern from v0.36 holds. Across all three attacker | ||
| families now seen, DE is the hardest category. PE and TM generalise | ||
| cleanly; DE generalises unevenly. | ||
|
|
||
| ## Carry-forward DE numbers | ||
|
|
||
| v036 Mixtral DE: 70.9% (v8) vs 69.3% (v7 on the same 690 entries), | ||
| flat. v036 Claude DE: 38.9% (v8) vs 26.0% (v7 on the same 700 | ||
| entries), **+12.9pp**. The v7 → v8 lift is concentrated in the | ||
| closed-weight leg that was failing hardest. Open-weight Mixtral DE was | ||
| already at 70%-tier and stays there. The asymmetry confirms the v0.36 | ||
| mechanism finding (destination signal is the axis, and folding the | ||
| related TM/PE distributions into TRAIN repositions the classifier on | ||
| that axis for the closed-weight leg). | ||
|
|
||
| ## Ship gate | ||
|
|
||
| v0.37 ships under both a methodology gate and a sub-cell recall gate | ||
| because v8 is a production retrain: | ||
|
|
||
| | gate | result | | ||
| |---|---| | ||
| | v035 TEST recall does not regress | PASS, 85.3% → 86.6%, +1.3pp | | ||
| | v035 TEST FPR does not regress | PASS, 4.6% → 5.0%, within CI | | ||
| | Worst v0.36 sub-cell improves | PASS, DE × Claude 26.0% → 38.9% | | ||
| | Third attacker family covered with recall floor | PASS, llama33 overall 85.8% | | ||
| | Held-out gap stays published with mechanism | PASS | | ||
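The first three gates in the table can be expressed as simple comparisons. This is a sketch under stated assumptions: the dictionary keys and `ship_gates` are hypothetical, and "within CI" is read here as "the new FPR point estimate falls inside the prior FPR interval". The third-family floor gate is omitted because the floor value is not stated in this note.

```python
def within(value, interval):
    lo, hi = interval
    return lo <= value <= hi


def ship_gates(prev, new):
    """Evaluate the numeric ship gates; returns gate name -> bool."""
    return {
        "test_recall_no_regress": new["test_recall"] >= prev["test_recall"],
        # Assumption: an FPR increase passes if it stays inside the prior CI.
        "test_fpr_no_regress": (
            new["test_fpr"] <= prev["test_fpr"]
            or within(new["test_fpr"], prev["test_fpr_ci"])
        ),
        "worst_subcell_improves": new["de_claude_recall"] > prev["de_claude_recall"],
    }


# Point estimates and interval copied from the tables in this note.
v7 = {"test_recall": 0.853, "test_fpr": 0.046,
      "test_fpr_ci": (0.033, 0.063), "de_claude_recall": 0.260}
v8 = {"test_recall": 0.866, "test_fpr": 0.050, "de_claude_recall": 0.389}

for gate, passed in ship_gates(v7, v8).items():
    print(f"{gate}: {'PASS' if passed else 'FAIL'}")
```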
|
|
||
| Cross-model overall recall is 66.8%, below the 70% floor used as a soft | ||
| target in prior releases, but that floor was set against the v035 TEST | ||
| distribution. Cross-model overall is a harder denominator, and 66.8% | ||
| is a 7.6 pp lift on the comparable v036 number (59.2% → 66.8%) with | ||
| a third family added to the denominator. | ||
|
|
||
| ## Generation provenance | ||
|
|
||
| Llama-3.3-70B generation ran on an AMD-backed MI300X DigitalOcean | ||
| SR-IOV droplet under `rocm/vllm:latest` serving | ||
| `RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic` with the model's native | ||
| `compressed-tensors` FP8 quantization | ||
| (`--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92`). | ||
| Three parallel category generators, ~22 minutes wall clock for 900 | ||
| entries at steady-state ~40 entries/min combined. Droplet poweroff | ||
| issued post-rsync. Schema validation pass dropped 13 of 300 raw TM | ||
| entries (4.3%) where the model emitted non-DENY `expected`. Final v037 | ||
| counts: TM 287, PE 300, DE 300, total 887 valid. | ||
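The drop rule described above (discard entries whose `expected` is not DENY) can be sketched as a JSONL filter. This assumes one JSON object per line with an `expected` field. The actual `scripts/validate_v037.py` may check more than this, and `filter_deny` is a hypothetical name.

```python
import json


def filter_deny(lines):
    """Keep entries whose `expected` is "DENY"; count the rest as dropped."""
    kept, dropped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL stream
        entry = json.loads(line)
        if entry.get("expected") == "DENY":
            kept.append(entry)
        else:
            dropped += 1
    return kept, dropped


# Illustrative records only; real corpus entries carry more fields.
sample = [
    '{"category": "tool_misuse", "expected": "DENY"}',
    '{"category": "tool_misuse", "expected": "ALLOW"}',
    '{"category": "tool_misuse"}',
]
kept, dropped = filter_deny(sample)
print(len(kept), dropped)
```

Applied to the 300 raw TM entries, a rule of this shape would account for the reported 287 kept and 13 dropped.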
|
|
||
| The v037 droplet recipe is identical to v0.36 modulo the model swap. | ||
| The `--quantization` flag had to be dropped because `compressed-tensors` | ||
| in the model config conflicts with an explicit `fp8` argument. vLLM | ||
| auto-detects the quantization scheme from the model config in that | ||
| case, and that path serves correctly. This is a model-specific | ||
| configuration note rather than a methodology change. | ||
|
|
||
| ## Chain of custody | ||
|
|
||
| | anchor | path | pins | | ||
| |---|---|---| | ||
| | corpus manifest | `tests/adversarial/MANIFEST.sha256` | SHA-256 of every JSONL including v037 | | ||
| | v035 split (inherited) | `tests/adversarial/v035_split.json` | TRAIN/VAL/TEST for v8 calibration | | ||
| | v037 split | `tests/adversarial/v037_split.json` | v035 inherited + v036 TM/PE → train, v036 DE + v037 → holdout | | ||
| | production bundle | `src/vaara/data/adversarial_classifier_v8.joblib` | trained on 11,287 entries with dst features + embeddings | | ||
| | prior production | `src/vaara/data/adversarial_classifier_v7.joblib` | retained for cross-eval | | ||
| | Llama-3.3 generator | `scripts/generate_targeted_v037.py` | vLLM HTTP, FP8 dynamic on MI300X | | ||
| | droplet driver | `scripts/v037_droplet_run.sh` | idempotent, no destructive EXIT trap | | ||
| | watcher | `scripts/v037_local_watcher.sh` | 60s rsync poll, opt-in doctl auto-shutdown | | ||
| | split builder | `scripts/build_v037_split.py` | inherits v035, folds v036 TM/PE into train | | ||
| | holdout eval | `scripts/eval_v037_holdout.py` | three-leg breakdown (mixtral, claude, llama33) | | ||
| | v035 schema check | `scripts/validate_v037.py` | same shape as v0.36 validator | | ||
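For environments without `sha256sum`, the manifest check can be approximated in Python. This is a minimal sketch, assuming the manifest uses the usual `sha256sum` line format (`<hex digest>  <path>`) and that paths are relative to the manifest's own directory. `verify_manifest` is an illustrative helper, not a script in this repository.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest_path):
    """Return the names of files whose SHA-256 does not match the manifest."""
    manifest = Path(manifest_path)
    base = manifest.parent  # assumption: entries are relative to this directory
    mismatches = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary-mode entries with '*'
        actual = hashlib.sha256((base / name).read_bytes()).hexdigest()
        if actual != digest.lower():
            mismatches.append(name)
    return mismatches
```

An empty return value corresponds to `sha256sum -c` reporting every file as OK.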
|
|
||
| ## Reproduction recipe | ||
|
|
||
| ``` | ||
| cd tests/adversarial && sha256sum -c MANIFEST.sha256 | ||
| .venv/bin/python scripts/validate_v037.py | ||
| .venv/bin/python scripts/build_v037_split.py | ||
| .venv/bin/python scripts/save_classifier_bundle.py \ | ||
| --version v0.37 --threshold 0.9006 --embeddings \ | ||
| --split-manifest tests/adversarial/v037_split.json \ | ||
| --train-fold train \ | ||
| --bundle-out src/vaara/data/adversarial_classifier_v8.joblib | ||
| .venv/bin/python scripts/eval_v037_holdout.py \ | ||
| --bundle src/vaara/data/adversarial_classifier_v8.joblib \ | ||
| --json-out bench/v037_eval_v8_holdout.json | ||
| ``` | ||
|
Review comment on lines +137 to +149:
Fix reproduction command paths to be internally consistent. At Line 138, the recipe changes directory to ...
Proposed doc fix:
-```
-cd tests/adversarial && sha256sum -c MANIFEST.sha256
-.venv/bin/python scripts/validate_v037.py
-.venv/bin/python scripts/build_v037_split.py
+```bash
+sha256sum -c tests/adversarial/MANIFEST.sha256
+.venv/bin/python scripts/validate_v037.py
+.venv/bin/python scripts/build_v037_split.py
.venv/bin/python scripts/save_classifier_bundle.py \
--version v0.37 --threshold 0.9006 --embeddings \
--split-manifest tests/adversarial/v037_split.json \
--train-fold train \
--bundle-out src/vaara/data/adversarial_classifier_v8.joblib
.venv/bin/python scripts/eval_v037_holdout.py \
--bundle src/vaara/data/adversarial_classifier_v8.joblib \
--json-out bench/v037_eval_v8_holdout.json
|
||
|
|
||
| ## Named limits | ||
|
|
||
| 1. **Third family generation is 887 valid entries, not 4,000+ like | ||
| v0.36.** Wilson CI on a 300-entry sub-cell at p ~ 0.85 is ± 4 pp, | ||
| adequate for ship-gate decisions. Scaling the Llama-3.3 leg to v036 | ||
| density is v0.38 scope, paired with public-benchmark evaluation. | ||
| 2. **Open-weight families dominate the v037 holdout.** Llama-3.3 | ||
| (Meta) and Mixtral (Mistral) are both open-weight models. | ||
| Closed-weight coverage in v0.37 is the carry-forward Claude DE | ||
| subset only. Adding GPT-4o-class or Gemini-class generation is v0.38 | ||
| scope. | ||
| 3. **No public-benchmark eval (PINT, BIPIA, INJECT) yet.** v0.38 scope. | ||
| 4. **PAIR multi-attacker scale-up not performed.** v0.38 scope (target | ||
| ASR Wilson upper under 1%). | ||
| 5. **FPR-bounded three-stage combiner per FCR paper (arxiv:2605.22004) | ||
| not implemented.** v0.39 scope. | ||
|
|
||
| ## Cumulative position | ||
|
|
||
| v0.37 lifts the worst v0.36 sub-cell by 12.9 pp without giving up | ||
| in-distribution recall, and covers a third attacker family at 85.8% | ||
| overall. The data_exfil category remains the hardest cross-model | ||
| surface. That is the v0.38 + v0.39 line of work: public-benchmark | ||
| numbers, PAIR-at-scale, FPR-bounded combiner. | ||
Review comment:
Add a language tag to the fenced code block. At Line 137, use a language hint (e.g., `bash`) to satisfy markdownlint and improve readability.
Tools: markdownlint-cli2 (0.22.1)
[warning] 137-137: Fenced code blocks should have a language specified (MD040, fenced-code-language)