release(v0.38.0): PAIR scale-up to n=300 per attacker family, 88.4% recall#147

Merged
vaaraio merged 1 commit into main from release/v0.38.0
May 27, 2026

Conversation

vaaraio (Owner) commented May 27, 2026

Summary

v0.38 ships the Phase 1 PAIR scale-up on the Llama-3.3-70B leg: 900
fresh adversarial entries (300 per category), generated on AMD-backed
MI300X SR-IOV under rocm/vllm:latest at seed 43 and content-distinct
from the v0.37 Llama-3.3 leg. The v8 production classifier is carried
forward unchanged from v0.37 and evaluated at calibrated T=0.9006
against the new corpus.
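
The content-distinctness claim rests on fingerprinting entries and dropping any that already appeared in v0.37. A minimal sketch of that idea follows. The hashing scheme, the ignored `id` key, and both function names are assumptions made for illustration; the actual fingerprint used in the repository is not shown in this PR.

```python
import hashlib
import json


def fingerprint(entry: dict, ignore_keys: tuple[str, ...] = ("id",)) -> str:
    """Hash an entry's content, skipping bookkeeping keys (hypothetical choice)."""
    content = {k: v for k, v in entry.items() if k not in ignore_keys}
    # sort_keys makes the serialization independent of key order.
    canonical = json.dumps(content, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_seen(new_entries: list[dict], old_entries: list[dict]) -> list[dict]:
    """Keep new entries whose fingerprint is absent from old_entries
    and from any earlier new entry."""
    seen = {fingerprint(e) for e in old_entries}
    kept = []
    for entry in new_entries:
        fp = fingerprint(entry)
        if fp not in seen:
            seen.add(fp)
            kept.append(entry)
    return kept
```

Exact hashing only catches byte-identical content after canonicalization; near-duplicates with small wording changes would need a fuzzier comparison.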

Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the
v0.37 Llama-3.3 leg (85.8%). Per category: tool_misuse 93.7%,
privilege_escalation 96.3%, data_exfil 75.3%. The biggest move is
on data_exfil (+6.3 pp), the category v0.37 identified as the
hardest cross-model surface.
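
The bracketed interval is a Wilson score interval. As a sanity check, here is a minimal sketch of that calculation. The function mirrors the name `wilson_ci` used by the eval script, but the body is a textbook implementation, not a copy of the repository code. The count 796 is inferred from the rounded per-category percentages (281 + 289 + 226 out of 300 each) and is an assumption, as is z = 1.96.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


# 796 true positives out of 900 is inferred from the reported percentages.
lo, hi = wilson_ci(796, 900)
print(f"recall={796 / 900:.1%} [{lo:.1%}, {hi:.1%}]")
```

Under those assumptions the printed interval agrees with the reported 88.4% [86.2, 90.4].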

What is not in this release

External-corpus eval (BIPIA, LLMail-Inject) and the IPI fourth
attacker family both move to v0.39. Neither external corpus
pre-extracts the tool calls that v8 classifies. An honest eval
against either requires running an LLM agent end-to-end on the
injection prompt and capturing the resulting tool call, which is an
LLM-agent harness rather than a packaging task on top of an existing
eval path. IPI, a different attack class, lands in the same v0.39
release window.
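
To make the harness requirement concrete, here is a rough sketch of the capture step. Everything in it is hypothetical: `ToolCall`, `run_agent`, and `capture_tool_calls` are illustrative names, and the record layout is not the v8 feature schema. It only shows why an agent has to sit between the injection prompt and the classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class ToolCall:
    """Illustrative record of one tool invocation emitted by an agent."""
    tool_name: str
    arguments: dict = field(default_factory=dict)


# Hypothetical agent interface: prompt in, first tool call (or None) out.
AgentFn = Callable[[str], Optional[ToolCall]]


def capture_tool_calls(prompts: Iterable[str], run_agent: AgentFn) -> list[dict]:
    """Run each injection prompt through the agent and keep the tool calls."""
    records = []
    for prompt in prompts:
        call = run_agent(prompt)
        if call is None:
            # No tool call was produced, so there is nothing to classify.
            continue
        records.append(
            {"prompt": prompt, "tool_name": call.tool_name, "arguments": call.arguments}
        )
    return records
```

Only the captured records, not the raw corpus prompts, would then be in a shape a tool-call classifier can score.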

Test plan

  • ruff clean on the new scripts/eval_v038_phase1.py
  • eval reproducible via PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py
  • three version files all bumped to 0.38.0 (pyproject.toml, clients/ts/package.json, src/vaara/__init__.py)
  • README bench pointer swapped to v0.38
  • Phase 1 entries fingerprint-deduplicated against v037
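
The version-bump item can be checked mechanically. The sketch below is one way to do it, assuming `pyproject.toml` carries a literal `version = "..."` line and `__init__.py` a literal `__version__ = "..."` assignment; the helper names are hypothetical and the regex scan is deliberately simple, not a full TOML parse.

```python
import json
import re


def read_versions(pyproject_text: str, package_json_text: str, init_text: str) -> dict[str, str]:
    """Extract the version string from each of the three manifests."""
    toml_match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    init_match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE
    )
    if toml_match is None or init_match is None:
        raise ValueError("version assignment not found")
    return {
        "pyproject.toml": toml_match.group(1),
        "clients/ts/package.json": json.loads(package_json_text)["version"],
        "src/vaara/__init__.py": init_match.group(1),
    }


def versions_agree(versions: dict[str, str], expected: str) -> bool:
    """True only if every manifest reports the expected version."""
    return all(v == expected for v in versions.values())
```

Passing file contents rather than paths keeps the check easy to unit-test; a caller would read the three files with `Path.read_text()` first.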

Bench

bench/vaara-bench-v0.38.md

Summary by CodeRabbit

  • New Features

    • Released version 0.38.0
    • Integrated Phase 1 evaluation results with comprehensive metrics breakdown by category and severity level
  • Documentation

    • Updated benchmark methodology reference documentation
    • Published evaluation results with improved recall metrics (+2.6pp improvement)

…ecall

v0.38 ships the third PAIR attacker leg at production scale. Three
families (Mixtral-8x7B, Claude Sonnet 4.6, Llama-3.3-70B-Instruct) at
n=300 each, with the Phase 1 Llama-3.3 corpus generated fresh on
AMD-backed MI300X SR-IOV under rocm/vllm:latest at seed 43. The v8
production classifier is carried forward unchanged from v0.37 and
evaluated at calibrated T=0.9006 against the 900 Phase 1 entries.
Overall recall lands at 88.4% [86.2, 90.4], a 2.6 pp lift over the
v0.37 Llama-3.3 leg (85.8%). The biggest category move is data_exfil
(69.0% to 75.3%, +6.3 pp). tool_misuse holds at 93.7% and
privilege_escalation at 96.3%.

External-corpus eval against BIPIA and LLMail-Inject and the IPI
fourth attacker family both move to v0.39. Neither external corpus
pre-extracts the tool calls that v8 classifies, so an honest eval
requires running an LLM agent end-to-end on each injection prompt
and capturing the resulting tool call. That is an LLM-agent harness,
not a packaging task on top of an existing eval path. IPI fits the
same v0.39 release window as a different attack class.

Bench doc: bench/vaara-bench-v0.38.md. Eval artifact:
bench/v038_phase1_eval_v8.json. Reproduction harness:
scripts/eval_v038_phase1.py.

coderabbitai Bot commented May 27, 2026

Walkthrough

v0.38.0 release bumps package versions across Python and TypeScript manifests, updates CHANGELOG and README, introduces Phase 1 benchmark evaluation results with methodology documentation, adds an evaluation script that computes classifier recall with Wilson confidence intervals, and provides droplet-side and local-side orchestration scripts for running distributed evaluation infrastructure.

Changes

v0.38.0 Release and Evaluation Infrastructure

| Layer | File(s) | Summary |
|---|---|---|
| Version and release metadata updates | src/vaara/__init__.py, pyproject.toml, clients/ts/package.json, CHANGELOG.md, README.md | Package version incremented to 0.38.0 across Python module, project config, and TypeScript client. CHANGELOG documents the v0.38 release with new Phase 1 artifacts and scripts. README bench reference updated from v0.37 to v0.38 methodology document. |
| Benchmark methodology and evaluation results | bench/vaara-bench-v0.38.md, bench/v038_phase1_eval_v8.json | v0.38 bench document describes Phase 1 PAIR scale-up to n=300 per attacker family on Llama-3.3-70B (seed 43), reports recall deltas vs v0.37 (n=887), includes provenance/chain-of-custody, reproduction recipe, exclusions for v0.39, and cumulative position statement. JSON results file contains bundle metadata, threshold settings, aggregate recall with confidence intervals, and breakdowns by category and severity. |
| Phase 1 evaluation script implementation | scripts/eval_v038_phase1.py | Loads v8 classifier bundle, reads Phase 1 test entries from three category-specific JSONL files (tool misuse, privilege escalation, data exfil), builds feature vectors/labels using shared utilities, predicts and evaluates DENY\|ESCALATE recall against configurable threshold, computes Wilson score confidence bounds, aggregates metrics by category and severity, outputs detailed JSON results. Implements wilson_ci() for binomial confidence intervals and CLI argument handling for bundle path, threshold, and output destination. |
| Droplet-side orchestration and model serving | scripts/v038_droplet_run.sh | Idempotent two-phase workflow: (1) ensures rocm/vllm:latest container is running and serving expected model by probing /v1/models, removing stale containers and relaunching if needed, waiting for health with bounded timeout; (2) launches three category-specific generator processes concurrently via nohup with per-category PID files to skip already-running generators, writes JSONL outputs and logs to predefined directories. |
| Local monitoring, coordination, and cleanup | scripts/v038_local_watcher.sh | Establishes SSH connectivity, continuously rsyncs generator JSONL outputs and logs from droplet to local directories, tracks per-prefix line-count growth to detect completion at N_PER_CAT entries per generator, polls remote vLLM /v1/models endpoint to verify expected model (warns and tails logs if mismatched), triggers final rsyncs and next-step logging upon completion, optionally auto-deletes droplet via doctl. Runs indefinitely with configurable polling interval; logs status to .v038_watch/progress.log and rsync.log. |
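
The "probe /v1/models, then wait for health with a bounded timeout" step can be sketched in Python as below. The shipped script does this in bash with curl and grep; this version is an illustration only. The function names, the default deadline, and the polling interval are assumptions. The response shape (`{"data": [{"id": ...}]}`) follows the OpenAI-compatible model listing that vLLM serves.

```python
import json
import time
import urllib.request
from typing import Callable


def served_model_ids(payload: dict) -> list[str]:
    """Pull model ids out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def fetch_models(base_url: str, timeout_s: float = 5.0) -> dict:
    """GET {base_url}/v1/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        return json.load(resp)


def wait_for_model(
    expected: str,
    fetch: Callable[[], dict],
    deadline_s: float = 600.0,   # hypothetical bound
    interval_s: float = 5.0,     # hypothetical polling interval
) -> bool:
    """Poll until the expected model is listed or the deadline passes."""
    start = time.monotonic()
    while True:
        try:
            if expected in served_model_ids(fetch()):
                return True
        except (OSError, ValueError):
            pass  # server not reachable yet, or body was not valid JSON
        if time.monotonic() - start >= deadline_s:
            return False
        time.sleep(interval_s)
```

A caller would pass something like `lambda: fetch_models("http://localhost:8000")`; injecting `fetch` keeps the polling loop testable without a live server.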

Sequence Diagram

sequenceDiagram
  participant Operator
  participant Droplet
  participant Watcher as Local Watcher
  participant vLLM
  participant LocalFS

  Operator->>Droplet: Run v038_droplet_run.sh
  Droplet->>vLLM: Probe /v1/models
  Droplet->>vLLM: Launch rocm/vllm:latest
  Droplet->>vLLM: Wait for health
  Droplet->>Droplet: Start TM/PE/DE generators

  Operator->>Watcher: Run v038_local_watcher.sh
  Watcher->>Droplet: Establish SSH connectivity
  loop Polling interval
    Watcher->>Droplet: rsync generator outputs
    Droplet-->>LocalFS: Transfer JSONL artifacts
    Watcher->>Droplet: Query vLLM /v1/models
    Watcher->>LocalFS: Check line counts
    Droplet->>Droplet: Generate entries
  end

  Watcher->>LocalFS: Final rsync
  Watcher->>LocalFS: Log completion
  Watcher->>Droplet: Optional auto-delete via doctl
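
The "check line counts" step in the loop above amounts to counting JSONL lines per generator prefix and comparing against N_PER_CAT. A minimal Python sketch follows; the shipped watcher is a shell script, and the function names and glob pattern here are assumptions based on the TM/PE/DE file naming seen in this PR.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Number of lines in a file, or 0 if it does not exist yet."""
    if not path.exists():
        return 0
    with path.open("rb") as fh:
        return sum(1 for _ in fh)


def generation_done(out_dir: Path, prefixes: tuple[str, ...], n_per_cat: int) -> dict[str, bool]:
    """Map each prefix to whether its JSONL files hold at least n_per_cat lines."""
    status = {}
    for prefix in prefixes:
        total = sum(count_lines(p) for p in sorted(out_dir.glob(f"{prefix}*.jsonl")))
        status[prefix] = total >= n_per_cat
    return status
```

Line counting treats a partially written final line as an entry, so a final rsync after the threshold is reached (as the watcher does) is still worthwhile.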

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • vaaraio/vaara#127: Release versioning PR that also bumps src/vaara/__init__.py __version__ and related package manifests as part of version lifecycle management.
  • vaaraio/vaara#144: Earlier evaluation tooling PR that implements the v8 classifier threshold calibration (T=0.9006) and Wilson score confidence interval computation that v0.38 evaluation script builds upon.

Poem

🐰 Hop along to v0.38,
Where droplets dance and generators wait,
With Llama-3.3 and seeds so fine,
Nine hundred entries, DENY lines align!
Watch it sync from afar, with watcher so keen,
The smoothest benchmark release we've seen! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and concisely summarizes the main change: a v0.38.0 release featuring a PAIR scale-up to n=300 per attacker family with 88.4% recall, which matches the core objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
scripts/eval_v038_phase1.py (1)

94-101: ⚡ Quick win

Per-category recall uses tp/n (all examples) vs overall recall tp/(tp+fn) (positives only), so it would be wrong if ALLOW examples ever appear.

In scripts/eval_v038_phase1.py, overall recall is computed from y where y==1 iff expected contains "DENY"/"ESCALATE" (build_labels). Per-category recall instead increments n for every entry (line 99) while tp only increments for "DENY"/"ESCALATE" (lines 100-101). If there are any expected="ALLOW" entries, the per-category denominator would be too large and recall would be under-reported.

Current v0.38 Phase 1 inputs used by this script (tests/adversarial/generated/{TM,PE,DE}-v038-llama33-s43.jsonl) have ALLOW=0 across all three files, so this mismatch doesn’t affect today’s run; the logic is still worth aligning to y for robustness.
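
To make the denominator mismatch concrete, here is a toy example. The four entries and the predictions are invented purely for illustration and are not taken from the Phase 1 corpus.

```python
# Hypothetical bucket: three positives (DENY/ESCALATE) and one ALLOW entry.
entries = [
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ESCALATE"},
    {"category": "data_exfil", "expected": "DENY"},
    {"category": "data_exfil", "expected": "ALLOW"},
]
pred = [1, 1, 0, 0]  # classifier flags the first two entries

POSITIVE = ("DENY", "ESCALATE")
y = [1 if e["expected"] in POSITIVE else 0 for e in entries]

tp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
n_all = len(entries)  # denominator used by the current per-category loop
n_pos = sum(y)        # denominator used by the overall recall

recall_all = tp / n_all  # 2 / 4
recall_pos = tp / n_pos  # 2 / 3

print(f"tp/n     = {recall_all:.3f}")
print(f"tp/n_pos = {recall_pos:.3f}")
```

With a single ALLOW entry in the bucket, dividing by all entries reports 0.500 where recall over positives is 0.667. With ALLOW=0, as in the current inputs, the two denominators coincide.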

🔧 Proposed fix to compute recall using positives-only (`y`)
 per_cat: dict[str, dict[str, int]] = {}
 per_sev: dict[str, dict[str, int]] = {}
 
-for e, pr in zip(entries, pred):
+for e, pr, y_true in zip(entries, pred, y):
     cat = e.get("category", "?")
     sev = e.get("severity", "?")
     for bucket, d in [(cat, per_cat), (sev, per_sev)]:
-        d.setdefault(bucket, {"n": 0, "tp": 0})
-        d[bucket]["n"] += 1
-        if pr == 1 and e.get("expected") in ("DENY", "ESCALATE"):
+        d.setdefault(bucket, {"n_pos": 0, "tp": 0})
+        if y_true == 1:
+            d[bucket]["n_pos"] += 1
+        if pr == 1 and y_true == 1:
             d[bucket]["tp"] += 1

Then update the render function and output to use n_pos instead of n:

     def render(label: str, m: dict[str, dict[str, int]]):
         print(f"\n--- recall by {label} ---")
         for k in sorted(m):
-            n, tp = m[k]["n"], m[k]["tp"]
-            r = tp / max(n, 1)
-            lo, hi = wilson_ci(tp, n)
-            print(f"  {k:22s} n={n:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")
+            n_pos, tp = m[k]["n_pos"], m[k]["tp"]
+            r = tp / max(n_pos, 1)
+            lo, hi = wilson_ci(tp, n_pos)
+            print(f"  {k:22s} n_pos={n_pos:4d} tp={tp:4d} recall={r:.1%} [{lo:.1%}, {hi:.1%}]")

And in the JSON output:

-    "per_category": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_cat.items()},
-    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n"], 1)} for k, v in per_sev.items()},
+    "per_category": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_cat.items()},
+    "per_severity": {k: {**v, "recall": v["tp"] / max(v["n_pos"], 1)} for k, v in per_sev.items()},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/eval_v038_phase1.py` around lines 94 - 101, Per-category recall
currently uses d[bucket]["n"] as total examples but only increments tp for
positives, so change the per-bucket denominator to count positives only (align
with build_labels / y): in the loop over entries/pred, compute a boolean is_pos
= e.get("expected") in ("DENY","ESCALATE") (or reuse the existing label array),
call d.setdefault(bucket, {"n_pos": 0, "tp": 0}), increment d[bucket]["n_pos"]
only when is_pos is True, and increment d[bucket]["tp"] when pr == 1 and is_pos
is True; then update any rendering/output that reads d[bucket]["n"] to use
d[bucket]["n_pos"] (and rename uses accordingly) so per-category recall = tp /
n_pos.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e4504d1c-f721-41ff-9ff6-e4b42d6e9795

📥 Commits

Reviewing files that changed from the base of the PR and between b9e945b and d7a2744.

⛔ Files ignored due to path filters (3)
  • tests/adversarial/generated/DE-v038-llama33-s43.jsonl is excluded by !**/generated/**
  • tests/adversarial/generated/PE-v038-llama33-s43.jsonl is excluded by !**/generated/**
  • tests/adversarial/generated/TM-v038-llama33-s43.jsonl is excluded by !**/generated/**
📒 Files selected for processing (10)
  • CHANGELOG.md
  • README.md
  • bench/v038_phase1_eval_v8.json
  • bench/vaara-bench-v0.38.md
  • clients/ts/package.json
  • pyproject.toml
  • scripts/eval_v038_phase1.py
  • scripts/v038_droplet_run.sh
  • scripts/v038_local_watcher.sh
  • src/vaara/__init__.py

Comment on lines +86 to +92
```
PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
--bundle src/vaara/data/adversarial_classifier_v8.joblib \
--threshold 0.9006 \
--json-out bench/v038_phase1_eval_v8.json
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to the fenced code block.

The code block starting at line 86 should specify a language for proper syntax highlighting and linting consistency. As per the static analysis hint, fenced code blocks should have a language specified.

📝 Proposed fix
-```
+```bash
 PYTHONPATH=src .venv/bin/python scripts/eval_v038_phase1.py \
     --bundle src/vaara/data/adversarial_classifier_v8.joblib \
     --threshold 0.9006 \
     --json-out bench/v038_phase1_eval_v8.json

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 86-86: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/vaara-bench-v0.38.md` around lines 86 - 92, The fenced code block
containing the shell command that runs scripts/eval_v038_phase1.py should
include a language identifier for syntax highlighting; edit the block in
bench/vaara-bench-v0.38.md around the PYTHONPATH invocation to change the
opening fence from ``` to ```bash so the command (including
scripts/eval_v038_phase1.py, --bundle, --threshold, --json-out) is marked as
bash.


Comment on lines +47 to +63
docker run -d --rm \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--shm-size 32G --ipc=host --network host \
-e "HF_TOKEN=${HF_TOKEN:-}" \
-v "${HF_CACHE}":/root/.cache/huggingface \
--name vllm-llama33 \
rocm/vllm:latest \
vllm serve "${MODEL}" \
--host 0.0.0.0 --port "${PORT}" \
--max-model-len 8192 \
--enforce-eager \
--gpu-memory-utilization 0.92 \
>"${LOG_DIR}/vllm_llama33.log" 2>&1

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Detached mode is not writing vLLM runtime logs to vllm_llama33.log.

Line 62 captures only docker run client output (typically container ID), so Line 74 and the local watcher won’t get real server logs for debugging.

Suggested fix
-  docker run -d --rm \
+  docker run -d --rm \
     --device=/dev/kfd --device=/dev/dri \
     --group-add video --group-add render \
     --security-opt seccomp=unconfined \
     --cap-add=SYS_PTRACE \
     --shm-size 32G --ipc=host --network host \
     -e "HF_TOKEN=${HF_TOKEN:-}" \
     -v "${HF_CACHE}":/root/.cache/huggingface \
     --name vllm-llama33 \
     rocm/vllm:latest \
     vllm serve "${MODEL}" \
       --host 0.0.0.0 --port "${PORT}" \
       --max-model-len 8192 \
       --enforce-eager \
-      --gpu-memory-utilization 0.92 \
-      >"${LOG_DIR}/vllm_llama33.log" 2>&1
+      --gpu-memory-utilization 0.92
+
+  # Stream container logs to file used by health/debug paths.
+  nohup docker logs -f vllm-llama33 >"${LOG_DIR}/vllm_llama33.log" 2>&1 &
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/v038_droplet_run.sh` around lines 47 - 63, The script currently runs
the container detached (docker run -d ...) so the shell redirection captures
only the client output (container ID) rather than the container stdout/stderr;
fix by either removing -d so the process stays in foreground and the existing
redirection on the docker run line writes the vLLM server logs to
"${LOG_DIR}/vllm_llama33.log", or keep -d but after starting the container named
vllm-llama33 spawn a background logger that tails container logs (docker logs -f
vllm-llama33 redirected to "${LOG_DIR}/vllm_llama33.log") so the real runtime
logs are written for the watcher to consume. Ensure the change targets the
docker run invocation for the rocm/vllm:latest vllm serve ... command and the
LOG_DIR/vllm_llama33.log path.

Comment on lines +75 to +81
vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
"curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
2>/dev/null || echo "SSH_FAIL")
if [[ "$vllm_model" != "${EXPECTED_MODEL}" ]] && [[ "$vllm_model" != "SSH_FAIL" ]]; then
note "WARNING vllm_model='${vllm_model}' (expected ${EXPECTED_MODEL})"
ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
| tee -a "${WATCH_DIR}/progress.log" || true

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recurring SSH probes can hang the watcher loop without connection timeouts.

Line 75 and Line 80 SSH calls don’t set timeouts, so one stalled connection can freeze progress detection.

Suggested fix
+SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=10 -o ServerAliveInterval=15 -o ServerAliveCountMax=2)
+
-  vllm_model=$(ssh -o BatchMode=yes "${DROPLET}" \
+  vllm_model=$(ssh "${SSH_OPTS[@]}" "${DROPLET}" \
     "curl -sf http://localhost:8000/v1/models 2>/dev/null | grep -oE '\"id\"[[:space:]]*:[[:space:]]*\"[^\"]*\"' | head -1 | grep -oE '\"[^\"]*\"\$' | tr -d '\"'" \
     2>/dev/null || echo "SSH_FAIL")
@@
-    ssh -o BatchMode=yes "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
+    ssh "${SSH_OPTS[@]}" "${DROPLET}" "tail -20 ${LOG_REMOTE}/vllm_llama33.log" 2>&1 \
       | tee -a "${WATCH_DIR}/progress.log" || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/v038_local_watcher.sh` around lines 75 - 81, The SSH calls that set
vllm_model and fetch the remote log can hang because they lack
connection/timeouts; update both ssh invocations (the one assigning vllm_model
and the one running tail in the if-block) to include SSH timeout options such as
-o ConnectTimeout=5 and keepalive options like -o ServerAliveInterval=15 -o
ServerAliveCountMax=1 (or wrap the ssh call with a short external timeout like
timeout 10s) so a stalled connection fails quickly instead of freezing the
watcher loop; modify the ssh used in the vllm_model assignment and the ssh used
before tail -20 ${LOG_REMOTE}/vllm_llama33.log accordingly.
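
The "short external timeout" alternative mentioned above has a direct analogue if the probe is driven from Python instead of bash. The sketch below is an assumption-laden illustration: `ssh_cmd` and `run_probe` are hypothetical helpers, and the option values are the ones from the suggested fix. It combines SSH-level timeouts with a hard wall-clock limit and maps any failure to the same `SSH_FAIL` sentinel the watcher uses.

```python
import subprocess


def ssh_cmd(host: str, remote: str) -> list[str]:
    """Build an ssh invocation with connection and keepalive timeouts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=10",
        "-o", "ServerAliveInterval=15",
        "-o", "ServerAliveCountMax=2",
        host,
        remote,
    ]


def run_probe(cmd: list[str], timeout_s: float = 10.0) -> str:
    """Run a probe command; return stripped stdout, or SSH_FAIL on any failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        # Timeout, non-zero exit, or missing binary all count as a failed probe.
        return "SSH_FAIL"
    return result.stdout.strip()
```

The wall-clock `timeout` covers cases the SSH options do not, such as a remote command that connects fine and then never returns.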

@vaaraio vaaraio merged commit 0266609 into main May 27, 2026
12 checks passed
@vaaraio vaaraio deleted the release/v0.38.0 branch May 27, 2026 09:07
vaaraio added a commit that referenced this pull request May 27, 2026
…#148)

Three CodeRabbit findings on PR #147 addressed against main:

- scripts/v038_droplet_run.sh: docker run -d detaches the container, so
  the shell redirect captured only the docker client output (the
  container ID), not the vLLM runtime logs. Split the docker run from a
  background docker logs -f writing to vllm_llama33.log, so the
  health-check tail path and the watcher rsync both see real vLLM
  server output.
- scripts/v038_local_watcher.sh: SSH probes inside the watch loop ran
  without ConnectTimeout or ServerAliveInterval, which could freeze
  progress detection on a stalled connection. Lifted SSH_OPTS into a
  script-level array (BatchMode, ConnectTimeout=10, ServerAliveInterval=15,
  ServerAliveCountMax=2) and rewired the three ssh calls through it.
- bench/vaara-bench-v0.38.md: bare fence on the reproduction recipe
  block tagged as bash for markdownlint MD040 and reader syntax
  highlighting.

The shipped v0.38 generation run already completed cleanly so these
do not affect the release artifacts. The fixes propagate into the
canonical scripts so the v0.39 generation harness picks up the
corrected shape.

Co-authored-by: vaaraio <267591518+vaaraio@users.noreply.github.com>