feat: rapid-mlx doctor — 4-tier regression harness (#125)
Merged
raullenchai merged 28 commits into main on Apr 16, 2026
Conversation
Pre-doctor cleanup: 36 auto-fixed (unsorted imports, unused imports, f-string placeholders, deprecated import, Yoda conditions, redundant if-branches) + 8 manual fixes:
- bare 'except' → 'except Exception' (regression_suite.py)
- ambiguous variable name 'l' → 'line' (regression_suite.py)
- multiple statements on one line (test_openwebui.py, test_output_router.py)
- BaseEngine.generate_warmup empty default — added a noqa comment with rationale (intentional override point, not abstract)
All 2027 tests still pass. Sets a clean baseline for the upcoming 'rapid-mlx doctor' regression harness, which will fail on lint regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New `rapid-mlx doctor` subcommand. Smoke tier (no model required):
- repo_layout — verify aliases.json, agent profiles dir, pyproject
- imports — top-level + engine + agent profile load
- ruff — lint (with binary fallback if not in venv)
- cli_sanity — smoke --help / models / agents subcommands
- pytest — full unit suite (~45s, 2027 tests)
Architecture:
vllm_mlx/doctor/
runner.py — DoctorRunner, CheckResult, TierResult, exit codes
cli.py — tier dispatch (smoke implemented; check/full/benchmark stubs return rc=2)
checks/smoke.py — the 5 smoke checks
harness/
thresholds.yaml — per-metric regression %
baselines/ — gitignored placeholder for golden numbers
runs/ — gitignored per-run artefacts
Exit codes are a stable contract for hooks/CI:
0 = all pass, 1 = perf regression, 2 = functional fail
Each run writes both result.json (machine-readable) and report.md
(human-readable, scannable).
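The aggregation rule behind that contract can be sketched as follows. This is an illustrative sketch, not the actual DoctorRunner code; the function name and status strings are assumptions.

```python
# Illustrative sketch of the 0/1/2 exit-code contract; the real
# DoctorRunner API and status names may differ.
PASS, REGRESSION, FAIL = "PASS", "REGRESSION", "FAIL"

def exit_code(statuses):
    """0 = all pass, 1 = perf regression only, 2 = any functional fail.

    A functional failure dominates a perf regression, so CI hooks can
    branch on the worst outcome of the run.
    """
    if FAIL in statuses:
        return 2
    if REGRESSION in statuses:
        return 1
    return 0
```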
Next: check tier with Qwen3.5-4B + baseline diff.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…que run dirs
Three issues from codex round 1:
1. (P1) The imports check used to import server/scheduler/engine, which transitively initialized mlx.core and crashed on hosts without a working Metal device. The smoke tier was supposed to be lightweight. It now imports only cli, model_aliases, agents, and doctor — heavier modules still get exercised via pytest, where it actually matters. Side benefit: the imports check went from ~5s to ~0.1s.
2. (P2) doctor assumed a source checkout but was exposed to all installs. A pip-installed wheel doesn't ship tests/ or harness/, so the command would print confusing failures. Added an upfront guard that checks for sentinel paths and exits with a clear message.
3. (P2) Run directories used minute-precision timestamps with exist_ok=True — two invocations of the same tier in the same minute would overwrite each other's report.md. Now uses second precision plus a numeric suffix on collision.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions
Round 2 codex review: the previous check-then-create pattern would race when two doctor processes started in the same second — both saw the candidate path as free, both called mkdir(exist_ok=True), and one would silently overwrite the other's report.md.
Switched to mkdir(exist_ok=False) inside a retry loop so the loser of the race gets FileExistsError and tries the next suffix. The first invocation gets the unsuffixed dir; subsequent same-second runs get -1, -2, ...
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
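A minimal sketch of that atomic-reservation pattern, assuming a function name of my own invention (the real runner.py code is not shown in this PR page):

```python
from pathlib import Path

def reserve_run_dir(base: Path, stem: str) -> Path:
    """Atomically reserve a unique run directory under `base`.

    mkdir(exist_ok=False) is the atomic claim: the loser of a
    same-second race gets FileExistsError and retries with the next
    numeric suffix instead of silently sharing a directory.
    """
    base.mkdir(parents=True, exist_ok=True)
    for suffix in range(1000):
        candidate = base / (stem if suffix == 0 else f"{stem}-{suffix}")
        try:
            candidate.mkdir(exist_ok=False)  # atomic: fails if it exists
            return candidate
        except FileExistsError:
            continue  # another process claimed it; try the next suffix
    raise RuntimeError(f"could not reserve a run dir under {base}")
```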
Implements the 'check' tier (~1-2 min for Qwen3.5-4B):
rapid-mlx doctor check [--model qwen3.5-4b] [--update-baselines]
Components:
- vllm_mlx/doctor/server.py
Subprocess lifecycle helper. Picks a free port, polls /health until
ready or times out (with an explicit error if the process exits
early), and guarantees teardown via SIGTERM → SIGKILL escalation on
POSIX process groups.
- vllm_mlx/doctor/baseline.py
load/save baseline JSON in harness/baselines/{tier}.json, threshold
loader (yaml-optional), comparator with sign-aware delta % (lower-
is-better metrics inverted so positive Δ% always means "better").
Renders markdown delta tables for the report.
- vllm_mlx/doctor/checks/api.py
Wraps tests/regression_suite.py + tests/test_smoke_matrix.sh against
the live server. Both honor RAPID_MLX_PORT / positional port arg so
the doctor's free-port allocation works.
- vllm_mlx/doctor/checks/perf.py
Wraps scripts/autoresearch_bench.py --json. Captures 13 metrics.
Treats all-zero primary metrics as a failure (catches the common
silent-failure case where every request was rejected).
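The sign-aware comparison in baseline.py can be sketched like this. A simplified illustration under assumptions: the function name is invented, and the lower-is-better set uses metric names mentioned elsewhere in this PR (tc_latency_ms joins the set in a later commit).

```python
# Metric names taken from this PR's commit messages; the set is
# illustrative, not the module's actual _LOWER_IS_BETTER contents.
_LOWER_IS_BETTER = {"cold_ttft_ms", "mt_ttft_ms", "tc_latency_ms"}

def delta_pct(metric: str, baseline: float, current: float) -> float:
    """Signed delta % where positive always means 'better'.

    For lower-is-better metrics (latencies) the raw delta is inverted,
    so a latency drop reports as a positive improvement.
    """
    if baseline == 0:
        return 0.0  # zero baseline is handled as a special case upstream
    raw = (current - baseline) / baseline * 100.0
    return -raw if metric in _LOWER_IS_BETTER else raw
```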
CLI surface:
--model override default qwen3.5-4b
--update-baselines record current run as new baseline
Required upstream tweaks (small, backwards-compatible):
- regression_suite.py reads RAPID_MLX_PORT env (default 8777)
- autoresearch_bench.py reads AUTORESEARCH_BASE env (default :8000)
- Both scripts + smoke_matrix.sh now use model="default" instead of
"any"/"x" — those names were never registered, so requests 404'd.
First real run on Qwen3.5-4B surfaces 2 actual issues in the model:
- <|im_end|> special-token leak in streaming output
- thinking-toggle ratio test fails
That is exactly what the doctor is for; whether to fix them, change
the test, or accept them is a separate decision.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ines, threshold names
Three issues from codex round 1 on the check tier:
1. (P1) regression_suite.py always exits 0 — only prints "N/M tests
passed". The wrapper treated rc=0 as PASS unconditionally, so
partial failures slipped through (real example: Test 10 streaming-
usage-stats failure was masked). Now parses the summary line and
fails when N < M, with clear diagnostic detail.
2. (P2) Baseline files were keyed by tier only — running 'doctor check
--model qwen3.5-9b' against a baseline recorded for qwen3.5-4b would
produce nonsense deltas, and --update-baselines would clobber the
wrong file. Baselines are now per-(tier, model):
harness/baselines/check-{model_slug}.json. Defence-in-depth: also
refuses to compare if the baseline file's recorded model differs
from the current run.
3. (P2) thresholds.yaml had stale metric names (ttft_cold_ms,
multiturn_ttft_ms, tool_latency_s) that didn't match what
autoresearch_bench.py emits (cold_ttft_ms, mt_ttft_ms, tc_latency_ms).
Most overrides were silently ignored. Aligned the YAML to the
actual metric names + added composite_score and tc_success_rate.
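The summary-line guard from item 1 might look like the following sketch. The 'N/M tests passed' wording is taken from this commit message; the exact format emitted by regression_suite.py, and the function name, are assumptions.

```python
import re

def suite_passed(output: str) -> bool:
    """regression_suite.py always exits 0, so its return code is
    useless; trust the 'N/M tests passed' summary line instead."""
    m = re.search(r"(\d+)\s*/\s*(\d+)\s+tests passed", output)
    if m is None:
        return False  # no summary line at all is itself a failure
    passed, total = int(m.group(1)), int(m.group(2))
    return passed == total
```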
End-to-end verified on Qwen3.5-4B: baseline recorded, second run
shows no regressions (all metrics within ±2.4%), and the doctor
correctly flags the pre-existing smoke_matrix bugs (<|im_end|> leak
+ thinking-toggle ratio) as failures.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Captured on commit 70d81b5 with mlx 0.31.1 + mlx-lm 0.31.2:
decode_tps 49.7 tok/s
cold_ttft_ms 314 ms
cached_ttft_ms 394 ms
tc_success_rate 1.0 (tool calling reliable)
composite_score 109.9
Future 'doctor check' runs will diff against this. The threshold for decode_tps is ±5%, for TTFT ±10% (see harness/thresholds.yaml). Update with: rapid-mlx doctor check --update-baselines
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. (P2) tc_latency_ms was missing from _LOWER_IS_BETTER, so baseline diffs flipped sign for tool-call latency. Verified the fix on Qwen3.5-4B: a slight latency improvement now reports +0.2% (better) instead of -0.2% (worse).
2. (P2) _safe_model_slug used 'replace', which is not injective — 'foo/bar' and 'foo__bar' would collide. Switched to URL percent-encoding (urllib.parse.quote with safe='-.'); the existing 'qwen3.5-4b' baseline filename is unchanged because no special chars need quoting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
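The slug change is small enough to show in full. A sketch consistent with the description above (safe='-.' as stated; the unprefixed function name matches a later rename in this PR):

```python
from urllib.parse import quote

def safe_model_slug(model_id: str) -> str:
    """Injective filename slug: percent-encode everything except the
    always-safe characters plus '-' and '.', so distinct model IDs can
    never map to the same baseline filename."""
    return quote(model_id, safe="-.")
```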
For model IDs containing '/' or spaces (e.g. 'mlx-community/Qwen3.5-... 27B'), the previous slug was 'mlx-community__Qwen3.5-27B' but the new URL-encoded slug is 'mlx-community%2FQwen3.5-27B'. Without a fallback, baselines recorded under the old scheme become invisible, and regression detection silently turns off until a manual re-record.
load_baseline() now tries the canonical path first, then the legacy slug. When found via the legacy path, the file is left in place; running --update-baselines naturally migrates it to the new name on the next save. The existing 'qwen3.5-4b' baseline is unaffected (no special chars).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous legacy-slug fallback returned the file unconditionally, which could hand back another model's baseline when two IDs collided under the old slug scheme (the very problem the new injective slug was meant to fix). _apply_baseline() would then turn that into a hard FAIL instead of a clean "no baseline" SKIP.
Now the legacy-path read also validates that the file's recorded model field matches the requested model before returning. If it doesn't, fall through to None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
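The fallback-with-validation shape described above could look like this. A sketch only: the legacy slug is assumed to be a '/'→'__' replacement (per the 'foo/bar' / 'foo__bar' example earlier in this PR), and the baseline JSON is assumed to carry a 'model' field.

```python
import json
from pathlib import Path
from urllib.parse import quote

def load_baseline(baseline_dir: Path, tier: str, model: str):
    """Canonical (percent-encoded) slug first, then the legacy '__'
    slug. A legacy hit is only trusted if its recorded model matches —
    old-slug collisions fall through to None (a clean SKIP upstream)."""
    canonical = baseline_dir / f"{tier}-{quote(model, safe='-.')}.json"
    if canonical.exists():
        return json.loads(canonical.read_text())
    legacy = baseline_dir / f"{tier}-{model.replace('/', '__')}.json"
    if legacy.exists():
        data = json.loads(legacy.read_text())
        if data.get("model") == model:  # guard against old-slug collisions
            return data
    return None
```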
Implements the 'full' tier (~1-2 hr for 3 models × 11 agent profiles):
rapid-mlx doctor full [--models qwen3.5-4b,qwen3.5-35b,gemma-4-26b]
[--update-baselines]
For each model in the list (default qwen3.5-4b + qwen3.5-35b +
gemma-4-26b), boots a server then runs:
- regression_suite (API contract)
- smoke_matrix (output sanitization)
- autoresearch (perf, with per-model baseline diff)
- all 11 agent profiles' API test plans
Per-model failures don't abort the whole sweep — the runner records
the failure and moves on, so a single broken model still yields a
complete report covering the others.
Components:
- vllm_mlx/doctor/checks/agent.py
Wraps AgentTestRunner from vllm_mlx.agents.testing. ERROR is
treated as FAIL (both indicate something the doctor should
surface); SKIP is fine (capability not declared by the profile).
- vllm_mlx/doctor/cli.py
Refactor: extracted _run_per_model_block() shared by check + full.
Per-iteration capture of profile_name/port in lambda defaults to
avoid the classic loop-variable closure bug.
- vllm_mlx/doctor/runner.py
run_check() now overrides result.name with the caller's name,
so multi-model runs don't collapse 'agent_langchain' entries
across qwen3.5-4b / qwen3.5-35b / gemma-4-26b.
- vllm_mlx/cli.py
--models comma-list flag for full tier.
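The loop-variable closure bug mentioned for cli.py is the classic Python pitfall: lambdas created in a loop share the loop variables, so without the default-argument trick every callable would see only the final iteration's values. A minimal illustration (profile names and ports are made up):

```python
callbacks = []
for profile_name, port in [("langchain", 8101), ("crewai", 8102)]:
    # Buggy form would be: lambda: (profile_name, port) — all lambdas
    # close over the same variables and see the last iteration.
    # Default arguments are evaluated at definition time, freezing each
    # iteration's values into its own lambda.
    callbacks.append(lambda p=profile_name, prt=port: (p, prt))
```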
Initial Qwen3.5-4B full-tier baseline recorded:
harness/baselines/full-qwen3.5-4b.json
First full run on Qwen3.5-4B: 14 pass / 3 fail / 0 skip. The 3
failures are pre-existing and consistent with the check tier
(regression_suite Test 10, smoke_matrix <|im_end|> leak +
thinking-toggle, hermes profile's specific test). All 11 agent
profiles' API tests pass cleanly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P2: In multi-model 'doctor full' runs, each model's baseline_diff
write to diff.md would clobber the previous model's deltas. Now
writes per-model diff-{model_slug}.md and appends a section to a
combined diff.md ('## qwen3.5-4b' / '## qwen3.5-35b' / ...).
Side fix: baseline_diff / baseline_update / no-baseline-found check
names now include [model], so the report rows are unique across
the loop and the final summary correctly counts each model's
status separately.
Codex's other flag (P1: agent import resolves to wrong package) was
incorrect — 'from ...agents' from vllm_mlx/doctor/checks/agent.py
correctly resolves to vllm_mlx.agents (verified by direct call:
the check reaches AgentTestRunner and fails on 'server not running'
rather than ImportError).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ve slug
Two issues from codex round 1:
P2: When no baseline exists, the previous patch dropped the diff.md
write entirely, breaking automation that always reads
run_dir/diff.md. Now writes both diff-{model}.md and appends to
diff.md with a "no baseline yet for model=X" notice — same shape as
the comparison case so callers can rely on the file existing.
P2: Per-model artefact filenames used the local _filename_safe()
helper, which only replaced '/' and ' ' — the same non-injective
pattern that the per-model baseline split was meant to fix. 'foo/bar'
and 'foo_bar' would produce the same diff-foo_bar.md and clobber.
Switched to baseline._safe_model_slug (URL percent-encoded, injective)
for both diff and server log filenames. Removed _filename_safe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three pieces that make the doctor a finished product, not just code:
1. tests/test_doctor_baseline.py — 15 unit tests pinning down the
regression detection contract:
- higher-is-better metrics: within / regression / improvement
- lower-is-better metrics: sign-inversion contract (positive Δ%
always means "better") + tc_latency_ms regression test for the
codex round 2 finding
- accuracy metrics: zero-tolerance for any drop
- edge cases: zero baseline, NEW / DROPPED metrics, no baseline,
unknown-metric default thresholds
- _safe_model_slug injectivity (regression test for codex finding)
Also manually verified end-to-end: tampered baseline (decode_tps
49.67 → 75.0, cold_tps 49.27 → 70.0), re-ran 'doctor check', got
REGRESSION on both metrics with correct sign + magnitude, restored
baseline.
2. harness/README.md — single source of truth covering:
- The four tiers (smoke/check/full/benchmark) and when to use each
- Exit codes (0/1/2 stable contract for hooks/CI)
- Tier reference (what each check does, what model it boots)
- Baseline workflow (record → review → commit; per-model files)
- Threshold semantics (sign convention, defaults)
- Run artefacts (report.md, result.json, diff.md, server.log)
- Common failure modes + actionable hints
- "Adding a new check" guide for future contributors
3. Makefile — ergonomic targets:
- make smoke / check / full / benchmark
- make update-baselines TIER=check
- make lint / test / clean
- make help (lists everything with one-line descriptions)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Project requires-python is >=3.10, but the Makefile required python3.12 specifically and bypassed any active virtualenv. Targets now auto-detect the available interpreter (3.12 → 3.13 → 3.11 → 3.10 → python3 → python) and respect 'make smoke PY=...' for explicit override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ex round 2)
Codex P1: previous fix preferred 'python'/'python3' over versioned
binaries to honor active venvs, but on macOS those names often resolve
to system Python 3.9 — below our requires-python = >=3.10. The Makefile
broke immediately on the first 'str | None' annotation.
New ordering:
1. $VIRTUAL_ENV/bin/python — wins so venv users on any Python >=3.10
get exactly their venv's interpreter regardless of PATH.
2. python3.13/3.12/3.11/3.10 — only if no venv is active.
3. python3 last-resort fallback.
Skips bare 'python' / 'python3' between #1 and #2 to avoid the system-
3.9 trap. Override still possible: 'make smoke PY=python3.13'.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odex round 3)
Codex P1: with pyenv shims, 'command -v python3.13' succeeds even when only 3.12 is installed — the shim itself exists for every version. The previous logic would pick the broken shim, and every 'make smoke' would die with "pyenv: python3.13: command not found".
Now the detection actually executes each candidate with a sys.version_info >= (3, 10) check before accepting it. Broken pyenv shims are skipped naturally because they exit non-zero when the underlying interpreter isn't installed.
Order of preference unchanged:
1. $VIRTUAL_ENV/bin/python (verified by -x test)
2. python3.13 → 3.12 → 3.11 → 3.10 (verified by --version probe)
3. python3 fallback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
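The probe-by-execution idea, expressed in Python rather than Makefile shell for illustration (function names and the candidate list are assumptions; the real logic lives in the Makefile):

```python
import shutil
import subprocess

# Only accept a candidate if it actually runs and reports >= 3.10;
# `command -v` alone is fooled by pyenv shims for uninstalled versions.
PROBE = "import sys; raise SystemExit(0 if sys.version_info >= (3, 10) else 1)"

def usable(candidate: str) -> bool:
    path = shutil.which(candidate)
    if path is None:
        return False
    try:
        return subprocess.run([path, "-c", PROBE], capture_output=True).returncode == 0
    except OSError:
        return False  # broken shim / not executable

def pick_python(candidates=("python3.13", "python3.12", "python3.11",
                            "python3.10", "python3")):
    """First candidate that executes and satisfies the version floor."""
    return next((c for c in candidates if usable(c)), None)
```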
Implements the 'benchmark' tier (overnight, all locally-available
models):
rapid-mlx doctor benchmark # auto-discover local models
rapid-mlx doctor benchmark --models qwen3.5-4b,gemma-4-26b
For each model:
1. Probe local HF cache + LM Studio cache + HF_HUB_CACHE for weights
— never auto-download (caller pre-fetches what they want bench'd)
2. Boot a Simple-engine server
3. Run autoresearch_bench, capture decode TPS / TTFT / tool-call %
4. Tear down
Aggregates into a markdown scorecard at:
- harness/scorecard/scorecard-{ts}.md (gitignored)
- harness/scorecard/latest.md (tracked, always current)
- harness/runs/{ts}-benchmark/scorecard.md (run-dir copy)
Components:
- vllm_mlx/doctor/discovery.py
Locates models in HF_HUB_CACHE, ~/.cache/huggingface, and
~/.lmstudio/models. Sanity-checks for config.json before claiming
a hit. Returns ModelAvailability with the alias, repo_id,
available bool, and resolved path or skip reason.
- vllm_mlx/doctor/checks/benchmark.py
Wraps autoresearch_bench --json. Returns CheckResult.metrics
populated for the scorecard renderer to consume. Same all-zero
primary-metric guard as the check tier perf.py.
- vllm_mlx/doctor/scorecard.py
Renders one row per model with 5 columns (Decode TPS / Cold TTFT /
Cached TTFT / Tool % / Score). Failed cells show 'FAIL — <reason>'
in the status column instead of vanishing — the scorecard always
covers every model the user asked about.
- vllm_mlx/doctor/cli.py
run_benchmark_tier() orchestrates everything; per-model errors
never abort the sweep (caught at outer level too).
Initial scorecard recorded for qwen3.5-4b in harness/scorecard/latest.md
(168 tok/s decode, 188ms cold TTFT, 100% tool-call success).
v1 deliberately scoped to Simple engine only; cross-engine columns
(Simple/Batched/Hybrid) deferred to v2 once BatchedEngine stabilises
per issue #105.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two P1 issues from codex round 1 on the benchmark tier:
1. 'rapid-mlx doctor benchmark --models <alias>' bypassed the
availability check entirely, so a typo or unfetched alias would
trigger a multi-GB download mid-sweep — exactly the contract
('never auto-download') the commit was supposed to enforce.
Now: explicit list still gets validated against discovery; missing
aliases land in the skipped section with a clear reason instead.
Unknown aliases (not in aliases.json) are also detected and labeled.
2. LM Studio model layout was unreachable. LM Studio stores models
as ~/.lmstudio/models/{org}/{repo}/, but discovery only looked for
the HF hub layout (models--{org}--{repo}/snapshots/...). Now
_check_alias() probes both layouts at every cache root, so models
installed via LM Studio are first-class.
Side improvement: when --models gives nothing runnable, the error
message lists each skipped alias with its reason instead of the
generic "no models available locally".
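The dual-layout probing can be sketched as a small generator. A sketch only: the function name is invented, and the real _check_alias() does more (config.json sanity checks, skip reasons).

```python
from pathlib import Path

def candidate_snapshot_dirs(root: Path, org: str, repo: str):
    """Yield candidate model dirs at one cache root, probing both layouts:
      HF hub:    {root}/models--{org}--{repo}/snapshots/<revision>/
      LM Studio: {root}/{org}/{repo}/
    """
    hub = root / f"models--{org}--{repo}" / "snapshots"
    if hub.is_dir():
        yield from sorted(p for p in hub.iterdir() if p.is_dir())
    flat = root / org / repo
    if flat.is_dir():
        yield flat
```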
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Discovery now correctly finds models in LM Studio's
~/.lmstudio/models/{org}/{repo}/ layout, but the benchmark tier
still called serve(model=alias) — which runs the alias through
mlx_lm.load() and triggers an HF Hub download because nothing in
the serve path knows about the LM Studio cache. End result: discovery
said "ok, available", server proceeded to download anyway.
Fix: serve() now takes an optional model_path. When discovery
located a local snapshot, the benchmark passes that path explicitly,
so the server loads from disk and never reaches Hugging Face. The
display name in logs / scorecard still uses the alias.
Side: bumped boot_timeout to 600s for benchmark runs because cold
loads of 27B+ models routinely need >180s when paging from a slow
external SSD. Earlier sweep shows three large-model cells failed
exactly because of the 180s ceiling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…json
The latest benchmark sweep surfaced a real bug: discovery marked qwen3.5-27b and qwopus-27b-8bit as available because their snapshot dirs had config.json + model.safetensors.index.json. But the actual *.safetensors shards were never downloaded — the snapshots were half-broken (likely interrupted mid-download). The server then crashed with "No safetensors found", showing up as a misleading 'server boot failed' in the scorecard.
Worse: when a model has multiple snapshot copies in different cache roots (one partial on the SSD, one complete in ~/.cache/huggingface), discovery picked the *first* hit by config.json — which was the broken partial copy.
Fix:
- _is_complete_snapshot() now requires both config.json AND at least one resolvable weight file (.safetensors / .npz / .gguf). Symlinks are resolve(strict=True)'d so dangling links don't false-pass.
- _check_alias() iterates all snapshots in all cache roots and picks the first *complete* one. Tracks whether any partial snapshot was seen so the skip reason can say "partial download?" instead of the generic "not found".
Verified: qwen3.5-27b is now skipped with a clear reason. qwopus-27b-8bit correctly resolves to the local HF cache (which has the safetensors) instead of the broken SSD snapshot.
Test coverage: 9 new tests in tests/test_doctor_discovery.py covering complete / missing config / missing weights / dangling symlink / .npz fallback + multi-root + LM Studio layout cases.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…x round 2 P1)
Codex P1: the previous _is_complete_snapshot() returned True as soon as ANY single .safetensors file resolved. For sharded models, an interrupted download can leave (config.json + index.json + a subset of shards), and that would still pass — but mlx_lm.load() crashes once it tries to open a missing shard.
New behavior: when model.safetensors.index.json exists, parse its weight_map and require EVERY referenced shard to resolve. Falls through to the single-file glob check when there's no index (so single-shard / .npz / .gguf models still work without an index).
Test coverage extended: 3 new tests (sharded_complete_passes, sharded_partial_fails — the regression test for codex's finding — and sharded_dangling_symlink_fails).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…round 3)
Codex P1: a corrupt or truncated index.json (common when a download
gets killed mid-write) used to fall through to the single-file glob
check, which returns True the moment any one shard happens to resolve.
That re-introduces the false-positive sharded models were supposed to
fix.
Now: if index.json exists but JSON-decode fails, return False outright
— we can't trust the snapshot. Empty weight_map still falls through
because it's a legitimate template pattern (single-file model with a
stub index).
Two new tests:
- test_corrupt_index_with_shards_fails (regression for codex P1)
- test_index_with_empty_weight_map_falls_through_to_single_file
(verifies the legitimate single-file fallback still works)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
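Putting the rounds together, the completeness check described across these commits can be sketched as follows. An illustrative reconstruction, not the actual discovery.py code; file-name conventions follow the HF snapshot layout named above.

```python
import json
from pathlib import Path

def is_complete_snapshot(snap: Path) -> bool:
    """config.json plus every weight shard the index references must exist.

    A corrupt/truncated index means the snapshot can't be trusted at
    all; an empty weight_map is a legitimate stub and falls through to
    the single-file check.
    """
    if not (snap / "config.json").exists():
        return False
    index = snap / "model.safetensors.index.json"
    if index.exists():
        try:
            weight_map = json.loads(index.read_text()).get("weight_map", {})
        except (json.JSONDecodeError, OSError):
            return False  # corrupt index: refuse the snapshot outright
        if weight_map:
            # Path.exists() follows symlinks, so dangling links fail too.
            return all((snap / shard).exists() for shard in set(weight_map.values()))
    # No index (or stub index): any resolvable weight file is enough.
    return (any(snap.glob("*.safetensors"))
            or any(snap.glob("*.npz"))
            or any(snap.glob("*.gguf")))
```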
…ape hygiene
Subagent self-review of PR #125 found 12 issues codex's line-level passes missed. This commit addresses all P1s and most P2s.
P1 — real bugs:
1. CLI's main parser auto-resolved 'args.model' for every subcommand *including doctor*, so 'rapid-mlx doctor check --model qwen3.5-4b' arrived at the doctor with model='mlx-community/Qwen3.5-4B-...', wrote baseline 'check-mlx-community%2F....json' instead of 'check-qwen3.5-4b.json', and reported "no baseline found" against the committed file. Fix: skip alias resolution for the doctor subcommand (it does its own resolution at server-spawn time via discovery). Default 'doctor check' worked only by accident because no flag was passed.
2. README claimed pytest excludes "integration/server" tests, but the actual command only excludes tests/integrations/ + the one server-required event-loop test. Reworded to match reality.
3. harness/thresholds.yaml listed tool_call_accuracy / reasoning_accuracy / coding_accuracy — none of which the harness ever emits. Dead config that would have silently misled anyone tuning them. Removed; left a comment noting where to add real accuracy thresholds when BFCL etc. are wired in.
P2 — UX hygiene:
4. --update-baselines now prints a warning and is ignored when used with smoke / benchmark (they don't record baselines). Used to silently exit 0 with no indication.
5. Boot timeout for the full tier raised from 180s to 600s — the same 27B+ models run in full as in benchmark; full was hitting timeout where benchmark wouldn't.
6. _resolve_agent_profiles() no longer silently degrades to ['generic'] on profile-load failures; logs a clear WARNING so the user knows the documented 11-profile sweep just turned into a 1-profile run.
7. scorecard.py was the only generator that didn't escape '|' in failure-detail strings. A stack-trace frame ('File "x.py" | line N | y') would break the table. Extracted a shared md_cell() helper into runner.py and used it from both places.
8. baseline_diff used to be appended to CheckResult.detail, which gets truncated to 120 chars in the markdown report. Now stashed on the runner and rendered as a proper "## Baseline diffs" section in report.md, so the report is self-contained without needing diff.md alongside.
9. _safe_model_slug renamed to safe_model_slug (was '_private' but imported across modules + tests).
10. README now documents make targets ('make smoke / check / full / benchmark / update-baselines TIER=...') as the primary interface, with the long-form rapid-mlx invocation footnoted.
11. README mentions HF_HOME/hub in the cache discovery list (was omitted, surprising users who set HF_HOME).
12. --models help text now correctly says "for full / benchmark tiers" (was "for full tier" only).
Test coverage gaps closed (17 new tests in test_doctor_runner.py):
- Exit-code contract (the WHOLE point of the harness for CI): all-pass=0, regression=1, fail=2, fail-dominates-regression, empty-runs=0, skip-only=0
- md_cell escaping: pipes, newlines, truncation, empty, None
- Atomic run-dir reservation across back-to-back constructions
- Report rendering: basic table, diff sections embedded, pipe in detail doesn't break table layout
Total doctor test coverage now: 46 tests across baseline + discovery + runner.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
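The md_cell() helper mentioned for the pipe-escaping fix might look like this sketch. The escaping rules and 120-char truncation are taken from this commit message; the exact signature is an assumption.

```python
def md_cell(text, limit=120):
    """Make an arbitrary string safe inside a markdown table cell:
    escape pipes, collapse newlines, then truncate with an ellipsis."""
    if text is None:
        return ""
    s = str(text).replace("|", "\\|").replace("\n", " ")
    return s if len(s) <= limit else s[: limit - 1] + "…"
```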
Codex P2: the previous fix bumped boot_timeout_s to 600s inside the shared _run_per_model_block, which means 'doctor check' (which only boots the small qwen3.5-4b) now also waits up to 10 minutes before reporting a broken server boot. That defeats the check tier's fast-feedback purpose.
Made boot_timeout_s a parameter of _run_per_model_block (default 180s). Only run_full_tier passes 600s — that's the tier that actually loads 27B+ models. run_check_tier inherits the default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous fix scoped the 600s boot timeout to the full tier only, which broke 'doctor check --model qwen3.5-35b' (a documented workflow): check would now use the 180s default and time out on legitimate 27B+ cold loads.
Replaced the tier-scoped logic with a per-model heuristic. _suggested_boot_timeout(alias) returns 600s when the alias contains a ≥20B parameter-count hint (matches '20b'..'999b' at a word boundary), else 180s. This way:
- 'doctor check' → 180s (qwen3.5-4b, fast-fail)
- 'doctor check --model qwen3.5-35b' → 600s (auto-detect)
- 'doctor full' default 3 models → 180s/600s/600s as needed
- 'doctor full --models <list>' → per-model
- 'doctor benchmark' → still uses its own 600s
3 new tests cover:
- small aliases (qwen3.5-4b, llama3-3b, ...) → 180s
- large aliases (qwen3.5-35b, qwopus-27b, hermes4-70b, ...) → 600s
- unknown alias → 180s (fast-fail wins by default)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codex P2: every per-model / per-tier heuristic missed at least one supported case (qwen3-coder is 80B but its alias has no NNb hint; minimax-m2.5 is huge but named 'minimax-m2.5'; etc.). Each missed case becomes a spurious 'doctor full --models <alias>' failure.
Replaced the heuristic with a single DEFAULT_BOOT_TIMEOUT_S = 600s constant. Reasoning:
- Server crashes (port conflict, missing weights, import error, bad path) are detected within <1s by proc.poll() in server._wait_for_health regardless of this value.
- The only scenario that actually waits 600s is a genuinely slow legitimate cold load — at which point we *want* the long budget.
- "Fast feedback in check tier" was a goal, but the realistic worst case (server hang) doesn't happen often enough to justify the false-fail risk.
Test simplified to assert DEFAULT_BOOT_TIMEOUT_S >= 600 (floor for the 122B cold-load case) rather than enumerating which aliases get which timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runs 'ruff format --check' which enforces formatting. 49 files were flagged — mostly pre-existing drift, plus the new doctor module files. Applied 'ruff format' uniformly. No logic changes. 2074 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds `rapid-mlx doctor`, a four-tier regression harness ("code health checkup"):
Exit codes are a stable contract for hooks/CI: `0` = pass, `1` = perf regression, `2` = functional fail.
Architecture
```
vllm_mlx/doctor/ # 13 files, ~2K LOC, lazy-imported
runner.py — DoctorRunner, CheckResult, atomic run-dir reservation
cli.py — tier dispatch + per-model orchestration
server.py — subprocess lifecycle (boot, /health poll, SIGTERM→SIGKILL teardown)
baseline.py — load/save/compare, sign-aware deltas, per-model files
discovery.py — locate locally-cached models (HF hub + LM Studio layouts)
scorecard.py — markdown table generator
checks/
smoke.py — pytest, ruff, CLI sanity, imports
api.py — wraps regression_suite.py + smoke_matrix.sh
perf.py — wraps autoresearch_bench.py, captures 13 metrics
agent.py — wraps AgentTestRunner across all 11 profiles
benchmark.py — single-cell metric capture for the scorecard
harness/ # NOT shipped in wheel
README.md — single source of truth (tiers, baselines, thresholds, troubleshooting)
thresholds.yaml — per-metric regression %
baselines/ — checked-in golden numbers (qwen3.5-4b for now)
scorecard/ — generated artefacts (latest.md tracked)
runs/ — gitignored per-run reports
```
Install footprint
End users only see one new subcommand; pre-existing flows untouched.
Reuses existing scripts (no rewrites)
Small backwards-compatible tweaks to the above so they honor an
override port (`RAPID_MLX_PORT`, `AUTORESEARCH_BASE`).
Lint debt cleanup (separate commit)
Pre-PR: 43 ruff errors. Post-PR: 0. Mostly auto-fixable (unsorted imports,
f-string placeholders); the 8 manual fixes are documented in the
`chore: clean up ruff lint debt` commit.
Codex review trail
This PR went through ~16 rounds of `codex review` and addressed every
real issue codex flagged (12 P1/P2 fixes). 4 codex rounds flagged
issues in files this PR never touched — those turned out to be real
pre-existing bugs and were filed as separate issues:
Test plan
Recorded artefacts
🤖 Generated with Claude Code