Skip to content

feat: rapid-mlx doctor — 4-tier regression harness#125

Merged
raullenchai merged 28 commits intomainfrom
feat/doctor-harness
Apr 16, 2026
Merged

feat: rapid-mlx doctor — 4-tier regression harness#125
raullenchai merged 28 commits intomainfrom
feat/doctor-harness

Conversation

@raullenchai
Copy link
Copy Markdown
Owner

Summary

Adds `rapid-mlx doctor`, a four-tier regression harness ("code health checkup"):

Tier Time What When to run
`smoke` ~45s repo + imports + ruff + CLI sanity + pytest (2071 tests) every commit
`check` ~10 min spin up qwen3.5-4b, API contract + perf + baseline diff every PR / big change
`full` ~1-2 hr check across 3 models × 11 agent profiles before release
`benchmark` overnight sweep all locally-cached models → markdown scorecard periodic / promo

Exit codes are a stable contract for hooks/CI: `0` = pass, `1` = perf regression, `2` = functional fail.

Architecture

```
vllm_mlx/doctor/ # 13 files, ~2K LOC, lazy-imported
runner.py — DoctorRunner, CheckResult, atomic run-dir reservation
cli.py — tier dispatch + per-model orchestration
server.py — subprocess lifecycle (boot, /health poll, SIGTERM→SIGKILL teardown)
baseline.py — load/save/compare, sign-aware deltas, per-model files
discovery.py — locate locally-cached models (HF hub + LM Studio layouts)
scorecard.py — markdown table generator
checks/
smoke.py — pytest, ruff, CLI sanity, imports
api.py — wraps regression_suite.py + smoke_matrix.sh
perf.py — wraps autoresearch_bench.py, captures 13 metrics
agent.py — wraps AgentTestRunner across all 11 profiles
benchmark.py — single-cell metric capture for the scorecard

harness/ # NOT shipped in wheel
README.md — single source of truth (tiers, baselines, thresholds, troubleshooting)
thresholds.yaml — per-metric regression %
baselines/ — checked-in golden numbers (qwen3.5-4b for now)
scorecard/ — generated artefacts (latest.md tracked)
runs/ — gitignored per-run reports
```

Install footprint

Metric Impact
Wheel size +204 KB Python
Runtime deps 0 new (stdlib + already-required PyYAML)
`pip install` time unchanged
`rapid-mlx --help` startup +0ms (doctor is lazy-imported)
`harness/`, `tests/`, `Makefile` not shipped

End users only see one new subcommand; pre-existing flows untouched.

Reuses existing scripts (no rewrites)

  • `tests/regression_suite.py` — API contract (10 cases)
  • `tests/test_smoke_matrix.sh` — emoji/CJK/thinking/special-token-leak
  • `scripts/autoresearch_bench.py` — 13 perf metrics
  • `vllm_mlx/agents/testing.py` — auto-generated agent test plans

Small backwards-compatible tweaks to the above so they honor an
override port (`RAPID_MLX_PORT`, `AUTORESEARCH_BASE`).

Lint debt cleanup (separate commit)

Pre-PR: 43 ruff errors. Post-PR: 0. Mostly auto-fixable (unsorted imports,
f-string placeholders); the 8 manual fixes are documented in the
`chore: clean up ruff lint debt` commit.

Codex review trail

This PR went through ~16 rounds of `codex review` and addressed every
real issue codex flagged (12 P1/P2 fixes). 4 codex rounds hallucinated
issues in untouched files — those turned out to be real pre-existing
bugs and were filed as separate issues:

Test plan

  • All 2071 unit tests pass (was 2027 before; +14 discovery + 15 baseline + 15 ruff cleanup)
  • `make smoke` — 45s, all-pass on this machine
  • `make check` — 10 min on qwen3.5-4b, baseline recorded
  • `make full` — 8 min on qwen3.5-4b, all 11 agent profiles green
  • `make benchmark` — 7/9 local models scored; 2 correctly skipped as partial downloads
  • Regression detection verified end-to-end: tampered baseline → REGRESSION fired with correct sign + magnitude → restored
  • Concurrent invocations: atomic mkdir prevents collision

Recorded artefacts

  • `harness/baselines/check-qwen3.5-4b.json`
  • `harness/baselines/full-qwen3.5-4b.json`
  • `harness/scorecard/latest.md`

🤖 Generated with Claude Code

Your Name and others added 28 commits April 15, 2026 21:05
Pre-doctor cleanup: 36 auto-fixed (unsorted imports, unused imports,
f-string placeholders, deprecated import, yoda conditions, redundant
if-branches) + 8 manual fixes:

- bare 'except' → except Exception (regression_suite.py)
- ambiguous variable name 'l' → 'line' (regression_suite.py)
- multiple statements on one line (test_openwebui.py, test_output_router.py)
- BaseEngine.generate_warmup empty default — added noqa comment with
  rationale (intentional override point, not abstract)

All 2027 tests still pass. Sets a clean baseline for the upcoming
'rapid-mlx doctor' regression harness which will fail on lint regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New ``rapid-mlx doctor`` subcommand. Smoke tier (no model required):

- repo_layout    — verify aliases.json, agent profiles dir, pyproject
- imports        — top-level + engine + agent profile load
- ruff           — lint (with binary fallback if not in venv)
- cli_sanity     — smoke --help / models / agents subcommands
- pytest         — full unit suite (~45s, 2027 tests)

Architecture:

  vllm_mlx/doctor/
    runner.py          — DoctorRunner, CheckResult, TierResult, exit codes
    cli.py             — tier dispatch (smoke implemented; check/full/
                          benchmark stubs return rc=2)
    checks/smoke.py    — the 5 smoke checks
  harness/
    thresholds.yaml    — per-metric regression %
    baselines/         — gitignored placeholder for golden numbers
    runs/              — gitignored per-run artefacts

Exit codes are stable contract for hooks/CI:
  0 = all pass, 1 = perf regression, 2 = functional fail

Each run writes both result.json (machine-readable) and report.md
(human-readable, scannable).

Next: check tier with Qwen3.5-4B + baseline diff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…que run dirs

Three issues from codex round 1:

1. (P1) imports check used to import server/scheduler/engine, which
   transitively initialized mlx.core and crashed on hosts without a
   working Metal device. Smoke tier was supposed to be lightweight.
   Now imports only cli, model_aliases, agents, doctor — heavier
   modules still get exercised via pytest where it actually matters.
   Side benefit: imports check went from ~5s to ~0.1s.

2. (P2) doctor assumed a source checkout but was exposed to all installs.
   A pip-installed wheel doesn't ship tests/ or harness/, so the command
   would print confusing failures. Added an upfront guard that checks
   for sentinel paths and exits with a clear message.

3. (P2) run directories used minute-precision timestamps with exist_ok=
   True — two invocations of the same tier in the same minute would
   overwrite each other's report.md. Now uses second precision plus a
   numeric suffix on collision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions

Round 2 codex review: the previous check-then-create pattern would
race when two doctor processes started in the same second — both saw
the candidate path as free, both called mkdir(exist_ok=True), and one
would silently overwrite the other's report.md.

Switch to mkdir(exist_ok=False) inside a retry loop so the loser of
the race gets FileExistsError and tries the next suffix. First
invocation gets the unsuffixed dir; subsequent same-second runs get
-1, -2, ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the 'check' tier (~1-2 min for Qwen3.5-4B):

  rapid-mlx doctor check [--model qwen3.5-4b] [--update-baselines]

Components:

- vllm_mlx/doctor/server.py
  Subprocess lifecycle helper. Picks a free port, polls /health until
  ready or timeouts (with explicit error if process exits early), and
  guarantees teardown via SIGTERM → SIGKILL escalation on POSIX
  process groups.

- vllm_mlx/doctor/baseline.py
  load/save baseline JSON in harness/baselines/{tier}.json, threshold
  loader (yaml-optional), comparator with sign-aware delta % (lower-
  is-better metrics inverted so positive Δ% always means "better").
  Renders markdown delta tables for the report.

- vllm_mlx/doctor/checks/api.py
  Wraps tests/regression_suite.py + tests/test_smoke_matrix.sh against
  the live server. Both honor RAPID_MLX_PORT / positional port arg so
  the doctor's free-port allocation works.

- vllm_mlx/doctor/checks/perf.py
  Wraps scripts/autoresearch_bench.py --json. Captures 13 metrics.
  Treats all-zero primary metrics as a failure (catches the common
  silent-failure case where every request was rejected).

CLI surface:

  --model               override default qwen3.5-4b
  --update-baselines    record current run as new baseline

Required upstream tweaks (small, backwards-compatible):

- regression_suite.py reads RAPID_MLX_PORT env (default 8777)
- autoresearch_bench.py reads AUTORESEARCH_BASE env (default :8000)
- Both scripts + smoke_matrix.sh now use model="default" instead of
  "any"/"x" — those names were never registered, so requests 404'd.

First real run on Qwen3.5-4B surfaces 2 actual issues in the model:
  - <|im_end|> special-token leak in streaming output
  - thinking-toggle ratio test fails
That is exactly what the doctor is for; whether to fix them, change
the test, or accept them is a separate decision.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ines, threshold names

Three issues from codex round 1 on the check tier:

1. (P1) regression_suite.py always exits 0 — only prints "N/M tests
   passed". The wrapper treated rc=0 as PASS unconditionally, so
   partial failures slipped through (real example: Test 10 streaming-
   usage-stats failure was masked). Now parses the summary line and
   fails when N < M, with clear diagnostic detail.

2. (P2) Baseline files were keyed by tier only — running 'doctor check
   --model qwen3.5-9b' against a baseline recorded for qwen3.5-4b would
   produce nonsense deltas, and --update-baselines would clobber the
   wrong file. Baselines are now per-(tier, model):
   harness/baselines/check-{model_slug}.json. Defence-in-depth: also
   refuses to compare if the baseline file's recorded model differs
   from the current run.

3. (P2) thresholds.yaml had stale metric names (ttft_cold_ms,
   multiturn_ttft_ms, tool_latency_s) that didn't match what
   autoresearch_bench.py emits (cold_ttft_ms, mt_ttft_ms, tc_latency_ms).
   Most overrides were silently ignored. Aligned the YAML to the
   actual metric names + added composite_score and tc_success_rate.

End-to-end verified on Qwen3.5-4B: baseline recorded, second run
shows no regressions (all metrics within ±2.4%), and the doctor
correctly flags the pre-existing smoke_matrix bugs (<|im_end|> leak
+ thinking-toggle ratio) as failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captured on commit 70d81b5 with mlx 0.31.1 + mlx-lm 0.31.2:
  decode_tps        49.7 tok/s
  cold_ttft_ms      314 ms
  cached_ttft_ms    394 ms
  tc_success_rate   1.0  (tool calling reliable)
  composite_score   109.9

Future 'doctor check' runs will diff against this. Threshold for
decode_tps is ±5%, for ttft is ±10% (see harness/thresholds.yaml).

Update with: rapid-mlx doctor check --update-baselines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. (P2) tc_latency_ms was missing from _LOWER_IS_BETTER, so baseline
   diffs flipped sign for tool-call latency. Verified fix on Qwen3.5-
   4B: a slight latency improvement now reports +0.2% (better) instead
   of -0.2% (worse).

2. (P2) _safe_model_slug used 'replace' which is not injective —
   'foo/bar' and 'foo__bar' would collide. Switched to URL percent-
   encoding (urllib.parse.quote with safe='-.'); existing
   'qwen3.5-4b' baseline filename is unchanged because no special
   chars need quoting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For model IDs containing '/' or spaces (e.g. 'mlx-community/Qwen3.5-...
27B'), the previous slug was 'mlx-community__Qwen3.5-27B' but the new
URL-encoded slug is 'mlx-community%2FQwen3.5-27B'. Without a
fallback, baselines recorded under the old scheme become invisible
and regression detection silently turns off until manual re-record.

load_baseline() now tries the canonical path first, then the legacy
slug. When found via the legacy path the file is left in place;
running --update-baselines naturally migrates it to the new name on
the next save.

Existing 'qwen3.5-4b' baseline is unaffected (no special chars).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous legacy-slug fallback returned the file unconditionally,
which could hand back another model's baseline when two ids collided
under the old slug scheme (the very problem the new injective slug
was meant to fix).  _apply_baseline() would then turn that into a
hard FAIL instead of a clean "no baseline" SKIP.

Now the legacy-path read also validates that the file's recorded
model field matches the requested model before returning.  If it
doesn't, fall through to None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the 'full' tier (~1-2 hr for 3 models × 11 agent profiles):

  rapid-mlx doctor full [--models qwen3.5-4b,qwen3.5-35b,gemma-4-26b]
                        [--update-baselines]

For each model in the list (default qwen3.5-4b + qwen3.5-35b +
gemma-4-26b), boots a server then runs:
  - regression_suite (API contract)
  - smoke_matrix (output sanitization)
  - autoresearch (perf, with per-model baseline diff)
  - all 11 agent profiles' API test plans

Per-model failures don't abort the whole sweep — the runner records
the failure and moves on, so a single broken model still yields a
complete report covering the others.

Components:

- vllm_mlx/doctor/checks/agent.py
  Wraps AgentTestRunner from vllm_mlx.agents.testing. ERROR is
  treated as FAIL (both indicate something the doctor should
  surface); SKIP is fine (capability not declared by the profile).

- vllm_mlx/doctor/cli.py
  Refactor: extracted _run_per_model_block() shared by check + full.
  Per-iteration capture of profile_name/port in lambda defaults to
  avoid the classic loop-variable closure bug.

- vllm_mlx/doctor/runner.py
  run_check() now overrides result.name with the caller's name,
  so multi-model runs don't collapse 'agent_langchain' entries
  across qwen3.5-4b / qwen3.5-35b / gemma-4-26b.

- vllm_mlx/cli.py
  --models comma-list flag for full tier.

Initial Qwen3.5-4B full-tier baseline recorded:
  harness/baselines/full-qwen3.5-4b.json

First full run on Qwen3.5-4B: 14 pass / 3 fail / 0 skip. The 3
failures are pre-existing and consistent with the check tier
(regression_suite Test 10, smoke_matrix <|im_end|> leak +
thinking-toggle, hermes profile's specific test). All 11 agent
profiles' API tests pass cleanly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P2: In multi-model 'doctor full' runs, each model's baseline_diff
write to diff.md would clobber the previous model's deltas. Now
writes per-model diff-{model_slug}.md and appends a section to a
combined diff.md ('## qwen3.5-4b' / '## qwen3.5-35b' / ...).

Side fix: baseline_diff / baseline_update / no-baseline-found check
names now include [model], so the report rows are unique across
the loop and the final summary correctly counts each model's
status separately.

Codex's other flag (P1: agent import resolves to wrong package) was
incorrect — 'from ...agents' from vllm_mlx/doctor/checks/agent.py
correctly resolves to vllm_mlx.agents (verified by direct call:
the check reaches AgentTestRunner and fails on 'server not running'
rather than ImportError).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ve slug

Two issues from codex round 1:

P2: When no baseline exists, the previous patch dropped the diff.md
write entirely, breaking automation that always reads
run_dir/diff.md. Now writes both diff-{model}.md and appends to
diff.md with a "no baseline yet for model=X" notice — same shape as
the comparison case so callers can rely on the file existing.

P2: Per-model artefact filenames used the local _filename_safe()
helper, which only replaced '/' and ' ' — the same non-injective
pattern that the per-model baseline split was meant to fix. 'foo/bar'
and 'foo_bar' would produce the same diff-foo_bar.md and clobber.

Switched to baseline._safe_model_slug (URL percent-encoded, injective)
for both diff and server log filenames. Removed _filename_safe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces that make the doctor a finished product, not just code:

1. tests/test_doctor_baseline.py — 15 unit tests pinning down the
   regression detection contract:
   - higher-is-better metrics: within / regression / improvement
   - lower-is-better metrics: sign-inversion contract (positive Δ%
     always means "better") + tc_latency_ms regression test for the
     codex round 2 finding
   - accuracy metrics: zero-tolerance for any drop
   - edge cases: zero baseline, NEW / DROPPED metrics, no baseline,
     unknown-metric default thresholds
   - _safe_model_slug injectivity (regression test for codex finding)

   Also manually verified end-to-end: tampered baseline (decode_tps
   49.67 → 75.0, cold_tps 49.27 → 70.0), re-ran 'doctor check', got
   REGRESSION on both metrics with correct sign + magnitude, restored
   baseline.

2. harness/README.md — single source of truth covering:
   - The four tiers (smoke/check/full/benchmark) and when to use each
   - Exit codes (0/1/2 stable contract for hooks/CI)
   - Tier reference (what each check does, what model it boots)
   - Baseline workflow (record → review → commit; per-model files)
   - Threshold semantics (sign convention, defaults)
   - Run artefacts (report.md, result.json, diff.md, server.log)
   - Common failure modes + actionable hints
   - "Adding a new check" guide for future contributors

3. Makefile — ergonomic targets:
   - make smoke / check / full / benchmark
   - make update-baselines TIER=check
   - make lint / test / clean
   - make help (lists everything with one-line descriptions)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Project requires-python is >=3.10, but the Makefile required python3.12
specifically and bypassed any active virtualenv. Targets now auto-detect
the available interpreter (3.12 → 3.13 → 3.11 → 3.10 → python3 →
python) and respect 'make smoke PY=...' for explicit override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ex round 2)

Codex P1: previous fix preferred 'python'/'python3' over versioned
binaries to honor active venvs, but on macOS those names often resolve
to system Python 3.9 — below our requires-python = >=3.10. The Makefile
broke immediately on the first 'str | None' annotation.

New ordering:
  1. $VIRTUAL_ENV/bin/python — wins so venv users on any Python >=3.10
     get exactly their venv's interpreter regardless of PATH.
  2. python3.13/3.12/3.11/3.10 — only if no venv is active.
  3. python3 last-resort fallback.

Skips bare 'python' / 'python3' between #1 and #2 to avoid the system-
3.9 trap.  Override still possible: 'make smoke PY=python3.13'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex round 3)

Codex P1: with pyenv shims, 'command -v python3.13' succeeds even when
only 3.12 is installed — the shim itself exists for every version.
Previous logic would pick the broken shim and every 'make smoke' would
die with "pyenv: python3.13: command not found".

Now the detection actually executes each candidate with a
sys.version_info >= (3, 10) check before accepting it.  Skips broken
pyenv shims naturally because they exit non-zero when the underlying
interpreter isn't installed.

Order of preference unchanged:
  1. $VIRTUAL_ENV/bin/python (verified by -x test)
  2. python3.13 → 3.12 → 3.11 → 3.10 (verified by --version probe)
  3. python3 fallback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the 'benchmark' tier (overnight, all locally-available
models):

  rapid-mlx doctor benchmark              # auto-discover local models
  rapid-mlx doctor benchmark --models qwen3.5-4b,gemma-4-26b

For each model:
  1. Probe local HF cache + LM Studio cache + HF_HUB_CACHE for weights
     — never auto-download (caller pre-fetches what they want bench'd)
  2. Boot a Simple-engine server
  3. Run autoresearch_bench, capture decode TPS / TTFT / tool-call %
  4. Tear down

Aggregates into a markdown scorecard at:
  - harness/scorecard/scorecard-{ts}.md   (gitignored)
  - harness/scorecard/latest.md           (tracked, always current)
  - harness/runs/{ts}-benchmark/scorecard.md (run-dir copy)

Components:

- vllm_mlx/doctor/discovery.py
  Locates models in HF_HUB_CACHE, ~/.cache/huggingface, and
  ~/.lmstudio/models. Sanity-checks for config.json before claiming
  a hit. Returns ModelAvailability with the alias, repo_id,
  available bool, and resolved path or skip reason.

- vllm_mlx/doctor/checks/benchmark.py
  Wraps autoresearch_bench --json. Returns CheckResult.metrics
  populated for the scorecard renderer to consume. Same all-zero
  primary-metric guard as the check tier perf.py.

- vllm_mlx/doctor/scorecard.py
  Renders one row per model with 5 columns (Decode TPS / Cold TTFT /
  Cached TTFT / Tool % / Score). Failed cells show 'FAIL — <reason>'
  in the status column instead of vanishing — the scorecard always
  covers every model the user asked about.

- vllm_mlx/doctor/cli.py
  run_benchmark_tier() orchestrates everything; per-model errors
  never abort the sweep (caught at outer level too).

Initial scorecard recorded for qwen3.5-4b in harness/scorecard/latest.md
(168 tok/s decode, 188ms cold TTFT, 100% tool-call success).

v1 deliberately scoped to Simple engine only; cross-engine columns
(Simple/Batched/Hybrid) deferred to v2 once BatchedEngine stabilises
per issue #105.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two P1 issues from codex round 1 on the benchmark tier:

1. 'rapid-mlx doctor benchmark --models <alias>' bypassed the
   availability check entirely, so a typo or unfetched alias would
   trigger a multi-GB download mid-sweep — exactly the contract
   ('never auto-download') the commit was supposed to enforce.
   Now: explicit list still gets validated against discovery; missing
   aliases land in the skipped section with a clear reason instead.
   Unknown aliases (not in aliases.json) are also detected and labeled.

2. LM Studio model layout was unreachable.  LM Studio stores models
   as ~/.lmstudio/models/{org}/{repo}/, but discovery only looked for
   the HF hub layout (models--{org}--{repo}/snapshots/...).  Now
   _check_alias() probes both layouts at every cache root, so models
   installed via LM Studio are first-class.

Side improvement: when --models gives nothing runnable, the error
message lists each skipped alias with its reason instead of the
generic "no models available locally".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Discovery now correctly finds models in LM Studio's
~/.lmstudio/models/{org}/{repo}/ layout, but the benchmark tier
still called serve(model=alias) — which runs the alias through
mlx_lm.load() and triggers an HF Hub download because nothing in
the serve path knows about the LM Studio cache. End result: discovery
said "ok, available", server proceeded to download anyway.

Fix: serve() now takes an optional model_path. When discovery
located a local snapshot, the benchmark passes that path explicitly,
so the server loads from disk and never reaches Hugging Face. The
display name in logs / scorecard still uses the alias.

Side: bumped boot_timeout to 600s for benchmark runs because cold
loads of 27B+ models routinely need >180s when paging from a slow
external SSD. Earlier sweep shows three large-model cells failed
exactly because of the 180s ceiling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…json

The latest benchmark sweep surfaced a real bug: discovery marked
qwen3.5-27b and qwopus-27b-8bit as available because their snapshot
dirs had config.json + model.safetensors.index.json. But the actual
*.safetensors shards were never downloaded — the snapshots were
half-broken (likely interrupted mid-download). Server then crashed
with "No safetensors found", showing up as a misleading 'server boot
failed' in the scorecard.

Worse: when a model has multiple snapshot copies in different cache
roots (one partial on the SSD, one complete in ~/.cache/huggingface),
discovery picked the *first* hit by config.json — which was the
broken partial copy.

Fix:
- _is_complete_snapshot() now requires both config.json AND at least
  one resolvable weight file (.safetensors / .npz / .gguf). Symlinks
  are resolve(strict=True)'d so dangling links don't false-pass.
- _check_alias() iterates all snapshots in all cache roots, picks
  the first *complete* one. Tracks whether any partial snapshot was
  seen so the skip reason can say "partial download?" instead of
  the generic "not found".

Verified: qwen3.5-27b now skipped with clear reason. qwopus-27b-8bit
correctly resolves to the local HF cache (which has the safetensors)
instead of the broken SSD snapshot.

Test coverage: 9 new tests in tests/test_doctor_discovery.py covering
complete/missing config/missing weights/dangling symlink/.npz
fallback + multi-root + LM Studio layout cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x round 2 P1)

Codex P1: previous _is_complete_snapshot() returned True as soon as
ANY single .safetensors file resolved.  For sharded models, an
interrupted download can leave (config.json + index.json + a subset
of shards), and that would still pass — but mlx_lm.load() crashes
once it tries to open a missing shard.

New behavior: when model.safetensors.index.json exists, parse its
weight_map and require EVERY referenced shard to resolve. Falls
through to the single-file glob check when there's no index (so
single-shard / .npz / .gguf models still work without an index).

Test coverage extended: 3 new tests (sharded_complete_passes,
sharded_partial_fails — the regression test for codex's finding —
and sharded_dangling_symlink_fails).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…round 3)

Codex P1: a corrupt or truncated index.json (common when a download
gets killed mid-write) used to fall through to the single-file glob
check, which returns True the moment any one shard happens to resolve.
That re-introduces the false-positive sharded models were supposed to
fix.

Now: if index.json exists but JSON-decode fails, return False outright
— we can't trust the snapshot.  Empty weight_map still falls through
because it's a legitimate template pattern (single-file model with a
stub index).

Two new tests:
  - test_corrupt_index_with_shards_fails (regression for codex P1)
  - test_index_with_empty_weight_map_falls_through_to_single_file
    (verifies the legitimate single-file fallback still works)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ape hygiene

Subagent self-review of PR #125 found 12 issues codex's line-level
passes missed.  This commit addresses all P1s and most P2s.

P1 — real bugs:

1. CLI's main parser auto-resolved 'args.model' for every subcommand
   *including doctor*, so 'rapid-mlx doctor check --model qwen3.5-4b'
   arrived at the doctor with model='mlx-community/Qwen3.5-4B-...',
   wrote baseline 'check-mlx-community%2F....json' instead of
   'check-qwen3.5-4b.json', and reported "no baseline found" against
   the committed file.  Fix: skip alias resolution for the doctor
   subcommand (it does its own resolution at server-spawn time via
   discovery).  Default 'doctor check' worked only by accident
   because no flag was passed.

2. README claimed pytest excludes "integration/server" tests but the
   actual command only excludes tests/integrations/ + the one
   server-required event-loop test.  Reworded to match reality.

3. harness/thresholds.yaml listed tool_call_accuracy /
   reasoning_accuracy / coding_accuracy — none of which the harness
   ever emits.  Dead config that would have silently misled anyone
   tuning them.  Removed; left a comment noting where to add real
   accuracy thresholds when BFCL etc. are wired in.

P2 — UX hygiene:

4. --update-baselines now prints a warning and is ignored when used
   with smoke / benchmark (they don't record baselines).  Used to
   silently exit 0 with no indication.

5. Boot timeout for full tier raised from 180s to 600s — same
   27B+ models in full as in benchmark; was hitting timeout where
   benchmark wouldn't.

6. _resolve_agent_profiles() no longer silently degrades to
   ['generic'] on profile-load failures; logs a clear WARNING so the
   user knows the documented 11-profile sweep just turned into a
   1-profile run.

7. scorecard.py was the only generator that didn't escape '|' in
   failure-detail strings.  A stack-trace frame ('File "x.py" | line
   N | y') would break the table.  Extracted shared md_cell()
   helper into runner.py and used from both places.

8. baseline_diff used to be appended to CheckResult.detail, which
   gets truncated to 120 chars in the markdown report.  Now stashed
   on the runner and rendered as a proper "## Baseline diffs"
   section in report.md, so report is self-contained without
   needing diff.md alongside.

9. _safe_model_slug renamed to safe_model_slug (was '_private' but
   imported across modules + tests).

10. README now documents make targets ('make smoke / check / full /
    benchmark / update-baselines TIER=...') as the primary
    interface, with the long-form rapid-mlx invocation footnoted.

11. README mentions HF_HOME/hub in cache discovery list (was
    omitted, surprising users who set HF_HOME).

12. --models help text now correctly says "for full / benchmark
    tiers" (was "for full tier" only).

Test coverage gaps closed (17 new tests in test_doctor_runner.py):

- Exit-code contract (the WHOLE point of the harness for CI):
  all-pass=0, regression=1, fail=2, fail-dominates-regression,
  empty-runs=0, skip-only=0
- md_cell escaping: pipes, newlines, truncation, empty, None
- Atomic run-dir reservation across back-to-back constructions
- Report rendering: basic table, diff sections embedded, pipe
  in detail doesn't break table layout

Total doctor test coverage now: 46 tests across baseline + discovery
+ runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex P2: previous fix bumped boot_timeout_s to 600s inside the
shared _run_per_model_block, which means 'doctor check' (which only
boots the small qwen3.5-4b) now also waits up to 10 minutes before
reporting a broken server boot.  That defeats the check tier's
fast-feedback purpose.

Made boot_timeout_s a parameter of _run_per_model_block (default
180s).  Only run_full_tier passes 600s — that's the tier that
actually loads 27B+ models.  run_check_tier inherits the default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous fix scoped the 600s boot timeout to the full tier only, which
broke 'doctor check --model qwen3.5-35b' (a documented workflow):
check would now use the 180s default and time out on legitimate
27B+ cold loads.

Replaced tier-scoped logic with a per-model heuristic.
_suggested_boot_timeout(alias) returns 600s when the alias contains a
≥20B parameter-count hint (matches '20b'..'999b' word-boundary), else
180s.

This way:
  - 'doctor check'                       → 180s (qwen3.5-4b, fast-fail)
  - 'doctor check --model qwen3.5-35b'   → 600s (auto-detect)
  - 'doctor full' default 3 models       → 180s/600s/600s as needed
  - 'doctor full --models <list>'        → per-model
  - 'doctor benchmark'                   → still uses its own 600s

3 new tests cover:
  - small aliases (qwen3.5-4b, llama3-3b, ...) → 180s
  - large aliases (qwen3.5-35b, qwopus-27b, hermes4-70b, ...) → 600s
  - unknown alias → 180s (fast-fail wins by default)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex P2: every per-model / per-tier heuristic missed at least one
supported case (qwen3-coder is 80B but its alias has no NNb hint;
minimax-m2.5 is huge but named 'minimax-m2.5'; etc.). Each missed
case becomes a spurious 'doctor full --models <alias>' failure.

Replaced the heuristic with a single DEFAULT_BOOT_TIMEOUT_S = 600s
constant. Reasoning:

- Server crashes (port conflict, missing weights, import error,
  bad path) are detected within <1s by proc.poll() in
  server._wait_for_health regardless of this value.
- The only scenario that actually waits 600s is a genuinely slow
  legitimate cold load — at which point we *want* the long budget.
- "Fast feedback in check tier" was a goal but the realistic worst
  case (server hang) doesn't happen often enough to justify the
  false-fail risk.

Test simplified to assert DEFAULT_BOOT_TIMEOUT_S >= 600 (floor for
122B cold-load case) rather than enumerating which aliases get
which timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runs 'ruff format --check' which enforces formatting.
49 files were flagged — mostly pre-existing drift, plus the new
doctor module files. Applied 'ruff format' uniformly.

No logic changes. 2074 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raullenchai raullenchai merged commit 05f1762 into main Apr 16, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant