release(v0.38.0): PAIR scale-up to n=300 per attacker family, 88.4% recall #147
…ecall

v0.38 ships the third PAIR attacker leg at production scale. Three families (Mixtral-8x7B, Claude Sonnet 4.6, Llama-3.3-70B-Instruct) at n=300 each, with the Phase 1 Llama-3.3 corpus generated fresh on AMD-backed MI300X SR-IOV under rocm/vllm:latest at seed 43. The v8 production classifier is carried forward unchanged from v0.37 and evaluated at calibrated T=0.9006 against the 900 Phase 1 entries.

Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the v0.37 Llama-3.3 leg (85.8%). The biggest category move is data_exfil (69.0% to 75.3%, +6.3 pp). tool_misuse holds at 93.7% and privilege_escalation at 96.3%.

External-corpus eval against BIPIA and LLMail-Inject and the IPI fourth attacker family both move to v0.39. Neither external corpus pre-extracts the tool calls that v8 classifies, so an honest eval requires running an LLM agent end-to-end on each injection prompt and capturing the resulting tool call. That is an LLM-agent harness, not a packaging task on top of an existing eval path. IPI fits the same v0.39 release window as a different attack class.

Bench doc: bench/vaara-bench-v0.38.md.
Eval artifact: bench/v038_phase1_eval_v8.json.
Reproduction harness: scripts/eval_v038_phase1.py.
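The overall interval can be sanity-checked with a Wilson score interval, which the walkthrough says the evaluation script uses. The sketch below is not the script's code: `wilson_ci` borrows the helper name that appears in the review, but the body is a standard textbook form with z = 1.96, and the true-positive counts (281, 289, 226) are back-computed from the rounded per-category recalls at n=300 rather than read from `bench/v038_phase1_eval_v8.json`.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


# ASSUMPTION: counts back-computed from the rounded per-category recalls
# (93.7%, 96.3%, 75.3% at n=300 each); they are not read from the eval artifact.
tp_by_category = {"tool_misuse": 281, "privilege_escalation": 289, "data_exfil": 226}
n_per_category = 300

tp_total = sum(tp_by_category.values())
n_total = n_per_category * len(tp_by_category)
lo, hi = wilson_ci(tp_total, n_total)
print(f"overall recall={tp_total / n_total:.1%} [{lo:.1%}, {hi:.1%}]")
for name, tp in tp_by_category.items():
    c_lo, c_hi = wilson_ci(tp, n_per_category)
    print(f"{name:22s} recall={tp / n_per_category:.1%} [{c_lo:.1%}, {c_hi:.1%}]")
```

Under those assumed counts the overall line comes out as 88.4% [86.2%, 90.4%], which agrees with the figure reported above.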
📝 Walkthrough
v0.38.0 release bumps package versions across Python and TypeScript manifests, updates CHANGELOG and README, introduces Phase 1 benchmark evaluation results with methodology documentation, adds an evaluation script that computes classifier recall with Wilson confidence intervals, and provides droplet-side and local-side orchestration scripts for running distributed evaluation infrastructure.
Changes
v0.38.0 Release and Evaluation Infrastructure
Sequence Diagram
sequenceDiagram
participant Operator
participant Droplet
participant Watcher as Local Watcher
participant vLLM
participant LocalFS
Operator->>Droplet: Run v038_droplet_run.sh
Droplet->>vLLM: Probe /v1/models
Droplet->>vLLM: Launch rocm/vllm:latest
Droplet->>vLLM: Wait for health
Droplet->>Droplet: Start TM/PE/DE generators
Operator->>Watcher: Run v038_local_watcher.sh
Watcher->>Droplet: Establish SSH connectivity
loop Polling interval
Watcher->>Droplet: rsync generator outputs
Droplet-->>LocalFS: Transfer JSONL artifacts
Watcher->>Droplet: Query vLLM /v1/models
Watcher->>LocalFS: Check line counts
Droplet->>Droplet: Generate entries
end
Watcher->>LocalFS: Final rsync
Watcher->>LocalFS: Log completion
Watcher->>Droplet: Optional auto-delete via doctl
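The "Check line counts" step in the loop can be pictured as counting JSONL lines per generator output and comparing them with the per-category target. The following is a minimal Python sketch under stated assumptions: the watcher itself is a shell script, the target of 300 lines per file is inferred from the n=300 per category in this release, and `count_lines`, `all_done`, and the temp-file demo are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

TARGET_PER_FILE = 300  # ASSUMPTION: one JSONL line per entry, 300 entries per category


def count_lines(path: Path) -> int:
    """Number of non-empty lines in a JSONL file; 0 if the file is not there yet."""
    if not path.exists():
        return 0
    with path.open("r", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def all_done(paths: list[Path], target: int = TARGET_PER_FILE) -> bool:
    """True once every generator output has reached the target line count."""
    return all(count_lines(p) >= target for p in paths)


# Demo with throwaway files standing in for the rsynced TM/PE/DE outputs.
with tempfile.TemporaryDirectory() as tmp:
    outputs = [Path(tmp) / f"{cat}-demo.jsonl" for cat in ("TM", "PE", "DE")]
    for i, out in enumerate(outputs):
        # TM and PE are complete; DE is still 10 entries short.
        lines = TARGET_PER_FILE if i < 2 else TARGET_PER_FILE - 10
        out.write_text("\n".join("{}" for _ in range(lines)) + "\n", encoding="utf-8")
    counts = {out.name: count_lines(out) for out in outputs}
    finished = all_done(outputs)

print(counts, "done" if finished else "still generating")
```

Treating a missing file as zero lines keeps the check usable before the first rsync has landed anything locally.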
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
scripts/eval_v038_phase1.py (1)
94-101: ⚡ Quick win
Per-category recall uses `tp/n` (all examples) vs overall recall `tp/(tp+fn)` (positives only), so it would be wrong if `ALLOW` examples ever appear.
In `scripts/eval_v038_phase1.py`, overall recall is computed from `y`, where `y==1` iff `expected` contains `"DENY"`/`"ESCALATE"` (`build_labels`). Per-category recall instead increments `n` for every entry (line 99) while `tp` only increments for `"DENY"`/`"ESCALATE"` (lines 100-101). If there are any `expected="ALLOW"` entries, the per-category denominator would be too large and recall would be under-reported.
Current v0.38 Phase 1 inputs used by this script (`tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl`) have `ALLOW=0` across all three files, so this mismatch doesn’t affect today’s run; the logic is still worth aligning to `y` for robustness.
🔧 Proposed fix to compute recall using positives-only (`y`)
 per_cat: dict[str, dict[str, int]] = {}
 per_sev: dict[str, dict[str, int]] = {}
-for e, pr in zip(entries, pred):
+for e, pr, y_true in zip(entries, pred, y):
     cat = e.get("category", "?")
     sev = e.get("severity", "?")
     for bucket, d in [(cat, per_cat), (sev, per_sev)]:
-        d.setdefault(bucket, {"n": 0, "tp": 0})
-        d[bucket]["n"] += 1
-        if pr == 1 and e.get("expected") in ("DENY", "ESCALATE"):
+        d.setdefault(bucket, {"n_pos": 0, "tp": 0})
+        if y_true == 1:
+            d[bucket]["n_pos"] += 1
+        if pr == 1 and y_true == 1:
             d[bucket]["tp"] += 1
Then update the render function and output to use `n_pos` instead of `n`:
 def render(label: str, m: dict[str, dict[str, int]]):
     print(f"\n--- recall by {label} ---")
     for k in sorted(m):
-        n, tp = m[k]["n"], m[k]["tp"]
-        r = tp / max(n, 1)
-        lo, hi = wilson_ci(tp, n)
-        print(f" {k:22s} n={n:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")
+        n_pos, tp = m[k]["n_pos"], m[k]["tp"]
+        r = tp / max(n_pos, 1)
+        lo, hi = wilson_ci(tp, n_pos)
+        print(f" {k:22s} n_pos={n_pos:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")
And in the JSON output:
-    "per_category": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_cat.items()},
-    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_sev.items()},
+    "per_category": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_cat.items()},
+    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_sev.items()},
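To make the denominator issue concrete, here is a small self-contained sketch with made-up entries (not taken from the v0.38 corpus) showing how counting every example deflates per-category recall once an `ALLOW` entry is present. The entries, predictions, and function names are hypothetical; only the `expected` / `category` fields and the DENY/ESCALATE convention come from the review.

```python
POSITIVE = ("DENY", "ESCALATE")

# Hypothetical toy data: three positives and one ALLOW in the same category.
entries = [
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ESCALATE"},
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ALLOW"},
]
pred = [1, 1, 0, 0]  # classifier flags the first two entries only


def recall_all_examples(entries, pred):
    """Denominator counts every entry (the behaviour the review flags)."""
    stats = {}
    for e, pr in zip(entries, pred):
        d = stats.setdefault(e["category"], {"n": 0, "tp": 0})
        d["n"] += 1
        if pr == 1 and e["expected"] in POSITIVE:
            d["tp"] += 1
    return {k: v["tp"] / max(v["n"], 1) for k, v in stats.items()}


def recall_positives_only(entries, pred):
    """Denominator counts only entries whose expected label is positive."""
    stats = {}
    for e, pr in zip(entries, pred):
        d = stats.setdefault(e["category"], {"n_pos": 0, "tp": 0})
        is_pos = e["expected"] in POSITIVE
        if is_pos:
            d["n_pos"] += 1
        if pr == 1 and is_pos:
            d["tp"] += 1
    return {k: v["tp"] / max(v["n_pos"], 1) for k, v in stats.items()}


mixed = recall_all_examples(entries, pred)
aligned = recall_positives_only(entries, pred)
print(mixed, aligned)
```

With no `ALLOW` entries the two functions return the same values, which is consistent with the observation that today's run is unaffected.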
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bench/vaara-bench-v0.38.md`:
- Around line 86-92: The fenced code block containing the shell command that
runs scripts/eval_v038_phase1.py should include a language identifier for syntax
highlighting; edit the block in bench/vaara-bench-v0.38.md around the PYTHONPATH
invocation to change the opening fence from ``` to ```bash so the command
(including scripts/eval_v038_phase1.py, --bundle, --threshold, --json-out) is
marked as bash.
In `@scripts/v038_droplet_run.sh`:
- Around line 47-63: The script currently runs the container detached (docker
run -d ...) so the shell redirection captures only the client output (container
ID) rather than the container stdout/stderr; fix by either removing -d so the
process stays in foreground and the existing redirection on the docker run line
writes the vLLM server logs to "${LOG_DIR}/vllm_llama33.log", or keep -d but
after starting the container named vllm-llama33 spawn a background logger that
tails container logs (docker logs -f vllm-llama33 redirected to
"${LOG_DIR}/vllm_llama33.log") so the real runtime logs are written for the
watcher to consume. Ensure the change targets the docker run invocation for the
rocm/vllm:latest vllm serve ... command and the LOG_DIR/vllm_llama33.log path.
In `@scripts/v038_local_watcher.sh`:
- Around line 75-81: The SSH calls that set vllm_model and fetch the remote log
can hang because they lack connection/timeouts; update both ssh invocations (the
one assigning vllm_model and the one running tail in the if-block) to include
SSH timeout options such as -o ConnectTimeout=5 and keepalive options like -o
ServerAliveInterval=15 -o ServerAliveCountMax=1 (or wrap the ssh call with a
short external timeout like timeout 10s) so a stalled connection fails quickly
instead of freezing the watcher loop; modify the ssh used in the vllm_model
assignment and the ssh used before tail -20 ${LOG_REMOTE}/vllm_llama33.log
accordingly.
---
Nitpick comments:
In `@scripts/eval_v038_phase1.py`:
- Around line 94-101: Per-category recall currently uses d[bucket]["n"] as total
examples but only increments tp for positives, so change the per-bucket
denominator to count positives only (align with build_labels / y): in the loop
over entries/pred, compute a boolean is_pos = e.get("expected") in
("DENY","ESCALATE") (or reuse the existing label array), call
d.setdefault(bucket, {"n_pos": 0, "tp": 0}), increment d[bucket]["n_pos"] only
when is_pos is True, and increment d[bucket]["tp"] when pr == 1 and is_pos is
True; then update any rendering/output that reads d[bucket]["n"] to use
d[bucket]["n_pos"] (and rename uses accordingly) so per-category recall = tp /
n_pos.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e4504d1c-f721-41ff-9ff6-e4b42d6e9795
⛔ Files ignored due to path filters (3)
tests/adversarial/generated/DE-v038-llama33-s43.jsonl is excluded by !**/generated/**
tests/adversarial/generated/PE-v038-llama33-s43.jsonl is excluded by !**/generated/**
tests/adversarial/generated/TM-v038-llama33-s43.jsonl is excluded by !**/generated/**
📒 Files selected for processing (10)
CHANGELOG.md
README.md
bench/v038_phase1_eval_v8.json
bench/vaara-bench-v0.38.md
clients/ts/package.json
pyproject.toml
scripts/eval_v038_phase1.py
scripts/v038_droplet_run.sh
scripts/v038_local_watcher.sh
src/vaara/__init__.py
```
PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
  --bundle src/vaara/data/adversarial_classifier_v8.joblib \
  --threshold 0.9006 \
  --json-out bench/v038_phase1_eval_v8.json
```
Add language identifier to the fenced code block.
The code block starting at line 86 should specify a language for proper syntax highlighting and linting consistency. As per the static analysis hint, fenced code blocks should have a language specified.
📝 Proposed fix
-```
+```bash
PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
--bundle src/vaara/data/adversarial_classifier_v8.joblib \
--threshold 0.9006 \
--json-out bench/v038_phase1_eval_v8.json</details>
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>
[warning] 86-86: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
</details>
</details>
docker run -d --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --shm-size 32G --ipc=host --network host \
  -e "HF_TOKEN=${HF_TOKEN:-}" \
  -v "${HF_CACHE}":/root/.cache/huggingface \
  --name vllm-llama33 \
  rocm/vllm:latest \
  vllm serve "${MODEL}" \
  --host 0.0.0.0 --port "${PORT}" \
  --max-model-len 8192 \
  --enforce-eager \
  --gpu-memory-utilization 0.92 \
  >"${LOG_DIR}/vllm_llama33.log" 2>&1
Detached mode is not writing vLLM runtime logs to vllm_llama33.log.
Line 62 captures only docker run client output (typically container ID), so Line 74 and the local watcher won’t get real server logs for debugging.
Suggested fix
- docker run -d --rm \
+ docker run -d --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--shm-size 32G --ipc=host --network host \
-e "HF_TOKEN=${HF_TOKEN:-}" \
-v "${HF_CACHE}":/root/.cache/huggingface \
--name vllm-llama33 \
rocm/vllm:latest \
vllm serve "${MODEL}" \
--host 0.0.0.0 --port "${PORT}" \
--max-model-len 8192 \
--enforce-eager \
- --gpu-memory-utilization 0.92 \
- >"${LOG_DIR}/vllm_llama33.log" 2>&1
+ --gpu-memory-utilization 0.92
+
+ # Stream container logs to file used by health/debug paths.
+ nohup docker logs -f vllm-llama33 >"${LOG_DIR}/vllm_llama33.log" 2>&1 &
vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
  "curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
  2>/dev/null || echo "SSH_FAIL")
if [[ "$vllm_model" != "${EXPECTED_MODEL}" ]] && [[ "$vllm_model" != "SSH_FAIL" ]]; then
  note "WARNING vllm_model='${vllm_model}' (expected ${EXPECTED_MODEL})"
  ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
    | tee -a "${WATCH_DIR}/progress.log" || true
Recurring SSH probes can hang the watcher loop without connection timeouts.
Line 75 and Line 80 SSH calls don’t set timeouts, so one stalled connection can freeze progress detection.
Suggested fix
+SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=2)
+
- vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
+ vllm_model=$(ssh "${SSH_OPTS[@]}" "${DROPLET}" \
"curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
2>/dev/null || echo "SSH_FAIL")
@@
- ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
+ ssh "${SSH_OPTS[@]}" "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
| tee -a "${WATCH_DIR}/progress.log" || true
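The same fail-fast idea applies to any probe run from a polling loop. As an illustration only (the watcher is a shell script, and the suggested fix relies on SSH options), a Python sketch can bound a probe with the `timeout` argument of `subprocess.run` and map a hang to a sentinel value. `run_probe` and the `PROBE_TIMEOUT` / `PROBE_FAIL` sentinels are hypothetical, and a sleeping child process stands in for a stalled SSH connection.

```python
import subprocess
import sys


def run_probe(cmd: list[str], timeout_s: float) -> str:
    """Run a probe command; return its stdout, or a sentinel if it hangs or fails."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this elapses
            check=True,
        )
    except subprocess.TimeoutExpired:
        return "PROBE_TIMEOUT"
    except subprocess.CalledProcessError:
        return "PROBE_FAIL"
    return result.stdout.strip()


# A child that sleeps stands in for a stalled connection.
stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
healthy = [sys.executable, "-c", "print('ok')"]

stalled_result = run_probe(stalled, timeout_s=0.5)
healthy_result = run_probe(healthy, timeout_s=10)
print(stalled_result, healthy_result)
```

Returning a sentinel instead of raising mirrors the `SSH_FAIL` fallback in the watcher snippet, so the loop can log the failure and move on to the next polling interval.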
…#148)

Three CodeRabbit findings on PR #147 addressed against main:

- scripts/v038_droplet_run.sh: docker run -d detaches the container, so the redirect on the docker run line captured only the client output (the container ID) rather than the server's runtime logs. Split the docker run from a background docker logs -f writing to vllm_llama33.log, so the health-check tail path and the watcher rsync both see real vLLM server output.
- scripts/v038_local_watcher.sh: SSH probes inside the watch loop ran without ConnectTimeout or ServerAliveInterval, which would freeze progress detection on a stalled connection. Lifted SSH_OPTS into a script-level array (BatchMode, ConnectTimeout=10, ServerAliveInterval=15, ServerAliveCountMax=2) and rewired the three ssh calls through it.
- bench/vaara-bench-v0.38.md: the bare fence on the reproduction recipe block is now tagged as bash for markdownlint MD040 and reader syntax highlighting.

The shipped v0.38 generation run already completed cleanly, so these do not affect the release artifacts. The fixes propagate into the canonical scripts so the v0.39 generation harness picks up the corrected shape.

Co-authored-by: vaaraio <267591518+vaaraio@users.noreply.github.com>
Summary
v0.38 ships the Phase 1 PAIR scale-up on the Llama-3.3-70B leg. 900
fresh adversarial entries (300 per category) generated on AMD-backed
MI300X SR-IOV under `rocm/vllm:latest` at seed 43, content-distinct
from the v0.37 Llama-3.3 leg. The v8 production classifier is carried
forward unchanged from v0.37 and evaluated at calibrated T=0.9006
against the new corpus.
Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the
v0.37 Llama-3.3 leg (85.8%). Per category:
`tool_misuse` 93.7%, `privilege_escalation` 96.3%, `data_exfil` 75.3%.
The biggest move is on `data_exfil` (+6.3 pp), the category v0.37
identified as the hardest cross-model surface.
What is not in this release
External-corpus eval (BIPIA, LLMail-Inject) and the IPI fourth
attacker family both move to v0.39. Neither external corpus
pre-extracts the tool calls that v8 classifies. An honest eval
against either requires running an LLM agent end-to-end on the
injection prompt and capturing the resulting tool call, which is an
LLM-agent harness rather than a packaging task on top of an existing
eval path. IPI fits the same release window as a different attack
class.
Test plan
scripts/eval_v038_phase1.py
PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py
pyproject.toml, clients/ts/package.json, src/vaara/__init__.py)
Bench
bench/vaara-bench-v0.38.md