release(v0.38.0): PAIR scale-up to n=300 per attacker family, 88.4% recall by vaaraio · Pull Request #147 · vaaraio/vaara

vaaraio · 2026-05-27T08:56:27Z

Summary

v0.38 ships the Phase 1 PAIR scale-up on the Llama-3.3-70B leg. The
corpus is 900 fresh adversarial entries (300 per category), generated on
AMD-backed MI300X SR-IOV under rocm/vllm:latest at seed 43 and
content-distinct from the v0.37 Llama-3.3 leg. The v8 production classifier
is carried forward unchanged from v0.37 and evaluated at calibrated
T=0.9006 against the new corpus.

Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the
v0.37 Llama-3.3 leg (85.8%). Per category: tool_misuse 93.7%,
privilege_escalation 96.3%, data_exfil 75.3%. The biggest move is
on data_exfil (+6.3 pp), the category v0.37 identified as the
hardest cross-model surface.
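
The bracketed interval can be sanity-checked with a Wilson score interval, the method the review walkthrough attributes to the eval script. This is a minimal sketch, assuming a 95% level (z = 1.96) and a true-positive count of 796. That count is inferred here as the only integer out of 900 that rounds to 88.4%; it is not read from the eval artifact. The `wilson_ci` below is an illustrative stand-in, not the function in `scripts/eval_v038_phase1.py`.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative sketch)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Assumed count: 796 of 900 is inferred from the reported 88.4%, not taken
# from bench/v038_phase1_eval_v8.json.
lo, hi = wilson_ci(796, 900)
print(f"recall={796 / 900:.1%} [{lo:.1%}, {hi:.1%}]")
```

Under these assumptions the printed bounds should agree with the reported [86.2, 90.4] to one decimal place.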

What is not in this release

External-corpus eval (BIPIA, LLMail-Inject) and the IPI fourth
attacker family both move to v0.39. Neither external corpus
pre-extracts the tool calls that v8 classifies. An honest eval
against either requires running an LLM agent end-to-end on the
injection prompt and capturing the resulting tool call, which is an
LLM-agent harness rather than a packaging task on top of an existing
eval path. IPI fits the same release window as a different attack
class.

Test plan

- ruff clean on the new scripts/eval_v038_phase1.py
- eval reproducible via PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py
- three version files all bumped to 0.38.0 (pyproject.toml, clients/ts/package.json, src/vaara/__init__.py)
- README bench pointer swapped to v0.38
- Phase 1 entries fingerprint-deduplicated against v037
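
Fingerprint deduplication of the kind named in the last test-plan item can be pictured as hashing a normalized form of each entry and dropping anything whose hash already appears in the earlier corpus. This is a minimal sketch, assuming entries are JSON objects and the fingerprint is a SHA-256 over canonical JSON of selected fields. The field names `tool_name` and `parameters` and both helper names are hypothetical; the repository's actual fingerprint scheme is not described in this PR.

```python
import hashlib
import json


def fingerprint(entry: dict, fields: tuple[str, ...]) -> str:
    """Hash the selected fields of an entry in a canonical JSON form."""
    subset = {k: entry.get(k) for k in fields}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_against(new_entries, old_entries, fields):
    """Keep new entries whose fingerprint is absent from old and not yet seen."""
    seen = {fingerprint(e, fields) for e in old_entries}
    kept = []
    for e in new_entries:
        fp = fingerprint(e, fields)
        if fp not in seen:
            seen.add(fp)
            kept.append(e)
    return kept


# Hypothetical field names and toy entries, for illustration only.
FIELDS = ("tool_name", "parameters")
old = [{"tool_name": "fs.read", "parameters": {"path": "/etc/passwd"}}]
new = [
    {"tool_name": "fs.read", "parameters": {"path": "/etc/passwd"}},  # already in old
    {"tool_name": "fs.read", "parameters": {"path": "/etc/shadow"}},
    {"tool_name": "fs.read", "parameters": {"path": "/etc/shadow"}},  # repeat within new
]
unique = dedupe_against(new, old, FIELDS)
print(len(unique))
```

Sorting keys before hashing keeps the fingerprint stable when two generators emit the same fields in a different order.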

Bench

bench/vaara-bench-v0.38.md

Summary by CodeRabbit

New Features
- Released version 0.38.0
- Integrated Phase 1 evaluation results with comprehensive metrics breakdown by category and severity level
Documentation
- Updated benchmark methodology reference documentation
- Published evaluation results with improved recall metrics (+2.6pp improvement)

…ecall

v0.38 ships the third PAIR attacker leg at production scale. Three families (Mixtral-8x7B, Claude Sonnet 4.6, Llama-3.3-70B-Instruct) at n=300 each, with the Phase 1 Llama-3.3 corpus generated fresh on AMD-backed MI300X SR-IOV under rocm/vllm:latest at seed 43. The v8 production classifier is carried forward unchanged from v0.37 and evaluated at calibrated T=0.9006 against the 900 Phase 1 entries.

Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the v0.37 Llama-3.3 leg (85.8%). The biggest category move is data_exfil (69.0% to 75.3%, +6.3 pp). tool_misuse holds at 93.7% and privilege_escalation at 96.3%.

External-corpus eval against BIPIA and LLMail-Inject and the IPI fourth attacker family both move to v0.39. Neither external corpus pre-extracts the tool calls that v8 classifies, so an honest eval requires running an LLM agent end-to-end on each injection prompt and capturing the resulting tool call. That is an LLM-agent harness, not a packaging task on top of an existing eval path. IPI fits the same v0.39 release window as a different attack class.

Bench doc: bench/vaara-bench-v0.38.md. Eval artifact: bench/v038_phase1_eval_v8.json. Reproduction harness: scripts/eval_v038_phase1.py.

coderabbitai · 2026-05-27T08:56:38Z

📝 Walkthrough

v0.38.0 release bumps package versions across Python and TypeScript manifests, updates CHANGELOG and README, introduces Phase 1 benchmark evaluation results with methodology documentation, adds an evaluation script that computes classifier recall with Wilson confidence intervals, and provides droplet-side and local-side orchestration scripts for running distributed evaluation infrastructure.

Changes

v0.38.0 Release and Evaluation Infrastructure

Layer / File(s)	Summary
Version and release metadata updates `src/vaara/__init__.py`, `pyproject.toml`, `clients/ts/package.json`, `CHANGELOG.md`, `README.md`	Package version incremented to 0.38.0 across Python module, project config, and TypeScript client. CHANGELOG documents the v0.38 release with new Phase 1 artifacts and scripts. README bench reference updated from v0.37 to v0.38 methodology document.
Benchmark methodology and evaluation results `bench/vaara-bench-v0.38.md`, `bench/v038_phase1_eval_v8.json`	v0.38 bench document describes Phase 1 PAIR scale-up to n=300 per attacker family on Llama-3.3-70B (seed 43), reports recall deltas vs v0.37 (n=887), includes provenance/chain-of-custody, reproduction recipe, exclusions for v0.39, and cumulative position statement. JSON results file contains bundle metadata, threshold settings, aggregate recall with confidence intervals, and breakdowns by category and severity.
Phase 1 evaluation script implementation `scripts/eval_v038_phase1.py`	Loads v8 classifier bundle, reads Phase 1 test entries from three category-specific JSONL files (tool misuse, privilege escalation, data exfil), builds feature vectors/labels using shared utilities, predicts and evaluates DENY|ESCALATE recall against configurable threshold, computes Wilson score confidence bounds, aggregates metrics by category and severity, outputs detailed JSON results. Implements `wilson_ci()` for binomial confidence intervals and CLI argument handling for bundle path, threshold, and output destination.
Droplet-side orchestration and model serving `scripts/v038_droplet_run.sh`	Idempotent two-phase workflow: (1) ensures rocm/vllm:latest container is running and serving expected model by probing /v1/models, removing stale containers and relaunching if needed, waiting for health with bounded timeout; (2) launches three category-specific generator processes concurrently via nohup with per-category PID files to skip already-running generators, writes JSONL outputs and logs to predefined directories.
Local monitoring, coordination, and cleanup `scripts/v038_local_watcher.sh`	Establishes SSH connectivity, continuously rsyncs generator JSONL outputs and logs from droplet to local directories, tracks per-prefix line-count growth to detect completion at N_PER_CAT entries per generator, polls remote vLLM /v1/models endpoint to verify expected model (warns and tails logs if mismatched), triggers final rsyncs and next-step logging upon completion, optionally auto-deletes droplet via doctl. Runs indefinitely with configurable polling interval; logs status to .v038_watch/progress.log and rsync.log.

Sequence Diagram

sequenceDiagram
  participant Operator
  participant Droplet
  participant Watcher as Local Watcher
  participant vLLM
  participant LocalFS

  Operator->>Droplet: Run v038_droplet_run.sh
  Droplet->>vLLM: Probe /v1/models
  Droplet->>vLLM: Launch rocm/vllm:latest
  Droplet->>vLLM: Wait for health
  Droplet->>Droplet: Start TM/PE/DE generators

  Operator->>Watcher: Run v038_local_watcher.sh
  Watcher->>Droplet: Establish SSH connectivity
  loop Polling interval
    Watcher->>Droplet: rsync generator outputs
    Droplet-->>LocalFS: Transfer JSONL artifacts
    Watcher->>Droplet: Query vLLM /v1/models
    Watcher->>LocalFS: Check line counts
    Droplet->>Droplet: Generate entries
  end

  Watcher->>LocalFS: Final rsync
  Watcher->>LocalFS: Log completion
  Watcher->>Droplet: Optional auto-delete via doctl

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

vaaraio/vaara#127: Release versioning PR that also bumps src/vaara/__init__.py __version__ and related package manifests as part of version lifecycle management.
vaaraio/vaara#144: Earlier evaluation tooling PR that implements the v8 classifier threshold calibration (T=0.9006) and Wilson score confidence interval computation that v0.38 evaluation script builds upon.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately and concisely summarizes the main change: a v0.38.0 release featuring a PAIR scale-up to n=300 per attacker family with 88.4% recall, which matches the core objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

scripts/eval_v038_phase1.py (1)

94-101: ⚡ Quick win

Per-category recall uses tp/n (all examples) vs overall recall tp/(tp+fn) (positives only), so it would be wrong if ALLOW examples ever appear.

In scripts/eval_v038_phase1.py, overall recall is computed from y where y==1 iff expected contains "DENY"/"ESCALATE" (build_labels). Per-category recall instead increments n for every entry (line 99) while tp only increments for "DENY"/"ESCALATE" (lines 100-101). If there are any expected="ALLOW" entries, the per-category denominator would be too large and recall would be under-reported.

Current v0.38 Phase 1 inputs used by this script (tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl) have ALLOW=0 across all three files, so this mismatch doesn’t affect today’s run; the logic is still worth aligning to y for robustness.

🔧 Proposed fix to compute recall using positives-only (`y`)

 per_cat: dict[str, dict[str, int]] = {}
 per_sev: dict[str, dict[str, int]] = {}
 
-for e, pr in zip(entries, pred):
+for e, pr, y_true in zip(entries, pred, y):
     cat = e.get("category", "?")
     sev = e.get("severity", "?")
     for bucket, d in [(cat, per_cat), (sev, per_sev)]:
-        d.setdefault(bucket, {"n": 0, "tp": 0})
-        d[bucket]["n"] += 1
-        if pr == 1 and e.get("expected") in ("DENY", "ESCALATE"):
+        d.setdefault(bucket, {"n_pos": 0, "tp": 0})
+        if y_true == 1:
+            d[bucket]["n_pos"] += 1
+        if pr == 1 and y_true == 1:
             d[bucket]["tp"] += 1

Then update the render function and output to use n_pos instead of n:

     def render(label: str, m: dict[str, dict[str, int]]):
         print(f"\n--- recall by {label} ---")
         for k in sorted(m):
-            n, tp = m[k]["n"], m[k]["tp"]
-            r = tp / max(n, 1)
-            lo, hi = wilson_ci(tp, n)
-            print(f"  {k:22s} n={n:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")
+            n_pos, tp = m[k]["n_pos"], m[k]["tp"]
+            r = tp / max(n_pos, 1)
+            lo, hi = wilson_ci(tp, n_pos)
+            print(f"  {k:22s} n_pos={n_pos:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")

And in the JSON output:

-    "per_category": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_cat.items()},
-    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_sev.items()},
+    "per_category": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_cat.items()},
+    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_sev.items()},
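
To see why the denominator matters, the sketch below computes recall for one bucket both ways on a toy set that includes an ALLOW entry. The entries, predictions, and the helper name `bucket_counts` are made up for illustration. The shipped Phase 1 files contain no ALLOW entries, so this does not reproduce any reported number.

```python
POSITIVE = ("DENY", "ESCALATE")


def bucket_counts(entries, pred, key):
    """Count all entries (n), positives (n_pos), and true positives (tp) per bucket."""
    counts = {}
    for e, pr in zip(entries, pred):
        is_pos = e.get("expected") in POSITIVE
        c = counts.setdefault(e.get(key, "?"), {"n": 0, "n_pos": 0, "tp": 0})
        c["n"] += 1
        c["n_pos"] += int(is_pos)
        c["tp"] += int(pr == 1 and is_pos)
    return counts


# Toy data: three positives (one missed) plus one ALLOW entry.
entries = [
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ESCALATE"},
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ALLOW"},
]
pred = [1, 1, 0, 0]

c = bucket_counts(entries, pred, "category")["data_exfil"]
print(f"tp/n = {c['tp'] / c['n']:.2f}, tp/n_pos = {c['tp'] / c['n_pos']:.2f}")
```

With the ALLOW entry counted, tp/n comes out lower than tp/n_pos even though the classifier made the same decisions, which is the under-reporting the comment describes.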

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval_v038_phase1.py` around lines 94 - 101, Per-category recall
currently uses d[bucket]["n"] as total examples but only increments tp for
positives, so change the per-bucket denominator to count positives only (align
with build_labels / y): in the loop over entries/pred, compute a boolean is_pos
= e.get("expected") in ("DENY","ESCALATE") (or reuse the existing label array),
call d.setdefault(bucket, {"n_pos": 0, "tp": 0}), increment d[bucket]["n_pos"]
only when is_pos is True, and increment d[bucket]["tp"] when pr == 1 and is_pos
is True; then update any rendering/output that reads d[bucket]["n"] to use
d[bucket]["n_pos"] (and rename uses accordingly) so per-category recall = tp /
n_pos.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bench/vaara-bench-v0.38.md`:
- Around line 86-92: The fenced code block containing the shell command that
runs scripts/eval_v038_phase1.py should include a language identifier for syntax
highlighting; edit the block in bench/vaara-bench-v0.38.md around the PYTHONPATH
invocation to change the opening fence from ``` to ```bash so the command
(including scripts/eval_v038_phase1.py, --bundle, --threshold, --json-out) is
marked as bash.

In `@scripts/v038_droplet_run.sh`:
- Around line 47-63: The script currently runs the container detached (docker
run -d ...) so the shell redirection captures only the client output (container
ID) rather than the container stdout/stderr; fix by either removing -d so the
process stays in foreground and the existing redirection on the docker run line
writes the vLLM server logs to "${LOG_DIR}/vllm_llama33.log", or keep -d but
after starting the container named vllm-llama33 spawn a background logger that
tails container logs (docker logs -f vllm-llama33 redirected to
"${LOG_DIR}/vllm_llama33.log") so the real runtime logs are written for the
watcher to consume. Ensure the change targets the docker run invocation for the
rocm/vllm:latest vllm serve ... command and the LOG_DIR/vllm_llama33.log path.

In `@scripts/v038_local_watcher.sh`:
- Around line 75-81: The SSH calls that set vllm_model and fetch the remote log
can hang because they lack connection/timeouts; update both ssh invocations (the
one assigning vllm_model and the one running tail in the if-block) to include
SSH timeout options such as -o ConnectTimeout=5 and keepalive options like -o
ServerAliveInterval=15 -o ServerAliveCountMax=1 (or wrap the ssh call with a
short external timeout like timeout 10s) so a stalled connection fails quickly
instead of freezing the watcher loop; modify the ssh used in the vllm_model
assignment and the ssh used before tail -20 ${LOG_REMOTE}/vllm_llama33.log
accordingly.

---

Nitpick comments:
In `@scripts/eval_v038_phase1.py`:
- Around line 94-101: Per-category recall currently uses d[bucket]["n"] as total
examples but only increments tp for positives, so change the per-bucket
denominator to count positives only (align with build_labels / y): in the loop
over entries/pred, compute a boolean is_pos = e.get("expected") in
("DENY","ESCALATE") (or reuse the existing label array), call
d.setdefault(bucket, {"n_pos": 0, "tp": 0}), increment d[bucket]["n_pos"] only
when is_pos is True, and increment d[bucket]["tp"] when pr == 1 and is_pos is
True; then update any rendering/output that reads d[bucket]["n"] to use
d[bucket]["n_pos"] (and rename uses accordingly) so per-category recall = tp /
n_pos.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e4504d1c-f721-41ff-9ff6-e4b42d6e9795

📥 Commits

Reviewing files that changed from the base of the PR and between b9e945b and d7a2744.

⛔ Files ignored due to path filters (3)

tests/adversarial/generated/DE-v038-llama33-s43.jsonl is excluded by !**/generated/**
tests/adversarial/generated/PE-v038-llama33-s43.jsonl is excluded by !**/generated/**
tests/adversarial/generated/TM-v038-llama33-s43.jsonl is excluded by !**/generated/**

📒 Files selected for processing (10)

CHANGELOG.md
README.md
bench/v038_phase1_eval_v8.json
bench/vaara-bench-v0.38.md
clients/ts/package.json
pyproject.toml
scripts/eval_v038_phase1.py
scripts/v038_droplet_run.sh
scripts/v038_local_watcher.sh
src/vaara/__init__.py

coderabbitai · 2026-05-27T09:05:11Z

+```
+PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
+    --bundle src/vaara/data/adversarial_classifier_v8.joblib \
+    --threshold 0.9006 \
+    --json-out bench/v038_phase1_eval_v8.json
+```
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to the fenced code block.

The code block starting at line 86 should specify a language for proper syntax highlighting and linting consistency. As per the static analysis hint, fenced code blocks should have a language specified.

📝 Proposed fix

-```
+```bash
 PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
     --bundle src/vaara/data/adversarial_classifier_v8.joblib \
     --threshold 0.9006 \
     --json-out bench/v038_phase1_eval_v8.json

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 86-86: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/vaara-bench-v0.38.md` around lines 86 - 92, The fenced code block
containing the shell command that runs scripts/eval_v038_phase1.py should
include a language identifier for syntax highlighting; edit the block in
bench/vaara-bench-v0.38.md around the PYTHONPATH invocation to change the
opening fence from ``` to ```bash so the command (including
scripts/eval_v038_phase1.py, --bundle, --threshold, --json-out) is marked as
bash.

coderabbitai · 2026-05-27T09:05:11Z

+  docker run -d --rm \
+    --device=/dev/kfd --device=/dev/dri \
+    --group-add video --group-add render \
+    --security-opt seccomp=unconfined \
+    --cap-add=SYS_PTRACE \
+    --shm-size 32G --ipc=host --network host \
+    -e "HF_TOKEN=${HF_TOKEN:-}" \
+    -v "${HF_CACHE}":/root/.cache/huggingface \
+    --name vllm-llama33 \
+    rocm/vllm:latest \
+    vllm serve "${MODEL}" \
+      --host 0.0.0.0 --port "${PORT}" \
+      --max-model-len 8192 \
+      --enforce-eager \
+      --gpu-memory-utilization 0.92 \
+      >"${LOG_DIR}/vllm_llama33.log" 2>&1
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Detached mode is not writing vLLM runtime logs to vllm_llama33.log.

Line 62 captures only docker run client output (typically container ID), so Line 74 and the local watcher won’t get real server logs for debugging.

Suggested fix

-  docker run -d --rm \
+  docker run -d --rm \
     --device=/dev/kfd --device=/dev/dri \
     --group-add video --group-add render \
     --security-opt seccomp=unconfined \
     --cap-add=SYS_PTRACE \
     --shm-size 32G --ipc=host --network host \
     -e "HF_TOKEN=${HF_TOKEN:-}" \
     -v "${HF_CACHE}":/root/.cache/huggingface \
     --name vllm-llama33 \
     rocm/vllm:latest \
     vllm serve "${MODEL}" \
       --host 0.0.0.0 --port "${PORT}" \
       --max-model-len 8192 \
       --enforce-eager \
-      --gpu-memory-utilization 0.92 \
-      >"${LOG_DIR}/vllm_llama33.log" 2>&1
+      --gpu-memory-utilization 0.92
+
+  # Stream container logs to file used by health/debug paths.
+  nohup docker logs -f vllm-llama33 >"${LOG_DIR}/vllm_llama33.log" 2>&1 &

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/v038_droplet_run.sh` around lines 47 - 63, The script currently runs the container detached (docker run -d ...) so the shell redirection captures only the client output (container ID) rather than the container stdout/stderr; fix by either removing -d so the process stays in foreground and the existing redirection on the docker run line writes the vLLM server logs to "${LOG_DIR}/vllm_llama33.log", or keep -d but after starting the container named vllm-llama33 spawn a background logger that tails container logs (docker logs -f vllm-llama33 redirected to "${LOG_DIR}/vllm_llama33.log") so the real runtime logs are written for the watcher to consume. Ensure the change targets the docker run invocation for the rocm/vllm:latest vllm serve ... command and the LOG_DIR/vllm_llama33.log path.

coderabbitai · 2026-05-27T09:05:11Z

+  vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
+    "curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
+    2>/dev/null || echo "SSH_FAIL")
+  if [[ "$vllm_model" != "${EXPECTED_MODEL}" ]] && [[ "$vllm_model" != "SSH_FAIL" ]]; then
+    note "WARNING vllm_model='${vllm_model}' (expected ${EXPECTED_MODEL})"
+    ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
+      | tee -a "${WATCH_DIR}/progress.log" || true


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recurring SSH probes can hang the watcher loop without connection timeouts.

Line 75 and Line 80 SSH calls don’t set timeouts, so one stalled connection can freeze progress detection.

Suggested fix

+SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=2)
+
-  vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
+  vllm_model=$(ssh "${SSH_OPTS[@]}" "${DROPLET}" \
     "curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
     2>/dev/null || echo "SSH_FAIL")
@@
-    ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
+    ssh "${SSH_OPTS[@]}" "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
       | tee -a "${WATCH_DIR}/progress.log" || true
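
For readers who drive the probe from Python instead of bash, the same idea can be sketched with two layers of protection: OpenSSH client options bound connection setup and dead-peer detection, and an outer wall-clock timeout bounds the whole call. Only the `-o` options come from the suggested fix. The function names, the reuse of the `SSH_FAIL` sentinel as a return value, and the sleeping child process that stands in for a stalled connection are all illustrative assumptions.

```python
import subprocess
import sys

# Options copied from the suggested fix above.
SSH_OPTS = [
    "-o", "BatchMode=yes",
    "-o", "ConnectTimeout=10",
    "-o", "ServerAliveInterval=15",
    "-o", "ServerAliveCountMax=2",
]


def ssh_argv(host: str, remote_cmd: str) -> list[str]:
    """Build an ssh command line that carries the timeout options."""
    return ["ssh", *SSH_OPTS, host, remote_cmd]


def run_bounded(argv: list[str], timeout_s: float) -> str:
    """Run a command; return stripped stdout, or 'SSH_FAIL' on error or timeout."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return "SSH_FAIL"
    return done.stdout.strip()


# Stand-in for a stalled connection: a child process that sleeps past the limit.
stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_bounded(stalled, timeout_s=0.5))
```

A real probe would pass `ssh_argv(host, remote_cmd)` to `run_bounded` with a timeout somewhat larger than `ConnectTimeout`, so a slow but healthy connection is not cut off by the outer limit.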

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/v038_local_watcher.sh` around lines 75 - 81, The SSH calls that set vllm_model and fetch the remote log can hang because they lack connection/timeouts; update both ssh invocations (the one assigning vllm_model and the one running tail in the if-block) to include SSH timeout options such as -o ConnectTimeout=5 and keepalive options like -o ServerAliveInterval=15 -o ServerAliveCountMax=1 (or wrap the ssh call with a short external timeout like timeout 10s) so a stalled connection fails quickly instead of freezing the watcher loop; modify the ssh used in the vllm_model assignment and the ssh used before tail -20 ${LOG_REMOTE}/vllm_llama33.log accordingly.

…#148)

Three CodeRabbit findings on PR #147 addressed against main:

- scripts/v038_droplet_run.sh: docker run -d sent runtime logs to the detached client process so the redirect captured only the container ID. Split the docker run from a background docker logs -f writing to vllm_llama33.log, so the health-check tail path and the watcher rsync both see real vLLM server output.
- scripts/v038_local_watcher.sh: SSH probes inside the watch loop ran without ConnectTimeout or ServerAliveInterval, which would freeze progress detection on a stalled connection. Lifted SSH_OPTS into a module-scope array (BatchMode, ConnectTimeout=10, ServerAliveInterval=15, ServerAliveCountMax=2) and rewired the three ssh calls through it.
- bench/vaara-bench-v0.38.md: bare fence on the reproduction recipe block tagged as bash for markdownlint MD040 and reader syntax highlighting.

The shipped v0.38 generation run already completed cleanly so these do not affect the release artifacts. The fixes propagate into the canonical scripts so the v0.39 generation harness picks up the corrected shape.

Co-authored-by: vaaraio <267591518+vaaraio@users.noreply.github.com>

coderabbitai Bot reviewed May 27, 2026

vaaraio merged commit 0266609 into main May 27, 2026
12 checks passed

vaaraio deleted the release/v0.38.0 branch May 27, 2026 09:07

vaaraio mentioned this pull request May 27, 2026

chore(v038): coderabbit fixes on droplet driver, watcher, bench fence #148

Merged

vaaraio merged 1 commit into main from release/v0.38.0
