feat: rapid-mlx doctor — 4-tier regression harness by raullenchai · Pull Request #125 · raullenchai/Rapid-MLX

raullenchai · 2026-04-16T14:27:23Z

Summary

Adds `rapid-mlx doctor`, a four-tier regression harness ("code health checkup"):

Tier	Time	What	When to run
`smoke`	~45s	repo + imports + ruff + CLI sanity + pytest (2071 tests)	every commit
`check`	~10 min	spin up qwen3.5-4b, API contract + perf + baseline diff	every PR / big change
`full`	~1-2 hr	check across 3 models × 11 agent profiles	before release
`benchmark`	overnight	sweep all locally-cached models → markdown scorecard	periodic / promo

Exit codes are a stable contract for hooks/CI: `0` = pass, `1` = perf regression, `2` = functional fail.

Architecture

```
vllm_mlx/doctor/ # 13 files, ~2K LOC, lazy-imported
runner.py — DoctorRunner, CheckResult, atomic run-dir reservation
cli.py — tier dispatch + per-model orchestration
server.py — subprocess lifecycle (boot, /health poll, SIGTERM→SIGKILL teardown)
baseline.py — load/save/compare, sign-aware deltas, per-model files
discovery.py — locate locally-cached models (HF hub + LM Studio layouts)
scorecard.py — markdown table generator
checks/
smoke.py — pytest, ruff, CLI sanity, imports
api.py — wraps regression_suite.py + smoke_matrix.sh
perf.py — wraps autoresearch_bench.py, captures 13 metrics
agent.py — wraps AgentTestRunner across all 11 profiles
benchmark.py — single-cell metric capture for the scorecard

harness/ # NOT shipped in wheel
README.md — single source of truth (tiers, baselines, thresholds, troubleshooting)
thresholds.yaml — per-metric regression %
baselines/ — checked-in golden numbers (qwen3.5-4b for now)
scorecard/ — generated artefacts (latest.md tracked)
runs/ — gitignored per-run reports
```

Install footprint

Metric	Impact
Wheel size	+204 KB Python
Runtime deps	0 new (stdlib + already-required PyYAML)
`pip install` time	unchanged
`rapid-mlx --help` startup	+0ms (doctor is lazy-imported)
`harness/`, `tests/`, `Makefile`	not shipped

End users only see one new subcommand; pre-existing flows untouched.

Reuses existing scripts (no rewrites)

`tests/regression_suite.py` — API contract (10 cases)
`tests/test_smoke_matrix.sh` — emoji/CJK/thinking/special-token-leak
`scripts/autoresearch_bench.py` — 13 perf metrics
`vllm_mlx/agents/testing.py` — auto-generated agent test plans

Small backwards-compatible tweaks to the above so they honor an
override port (`RAPID_MLX_PORT`, `AUTORESEARCH_BASE`).

Lint debt cleanup (separate commit)

Pre-PR: 43 ruff errors. Post-PR: 0. Mostly auto-fixable (unsorted imports,
f-string placeholders); the 8 manual fixes are documented in the
`chore: clean up ruff lint debt` commit.

Codex review trail

This PR went through ~16 rounds of `codex review` and addressed every
real issue codex flagged (12 P1/P2 fixes). 4 codex rounds hallucinated
issues in untouched files — those turned out to be real pre-existing
bugs and were filed as separate issues:

Wheel install missing agent profile YAMLs (package-data only ships aliases.json) #118 wheel missing agent profile YAMLs
pyproject.toml: 'all' extra self-references rapid-mlx, breaks editable installs #119 `pyproject.toml` `all` extra self-reference
Chat template fallback drops tools when applicator has no apply_chat_template #120 chat_template fallback drops `tools`
MTP injection misses nested text_config in _try_inject_mtp (only post_load checks it) #121 MTP misses nested `text_config`
chat_template: enable_thinking dropped during tool fallback retry #122 chat_template drops `enable_thinking` on tool retry
Gemma 4 text loader allocates full multimodal checkpoint before sanitize() #123 Gemma 4 OOM during multimodal shard load
vllm-mlx-chat console script published without gradio dependency #124 `vllm-mlx-chat` published without gradio dependency

Test plan

All 2071 unit tests pass (was 2027 before; +14 discovery + 15 baseline + 15 ruff cleanup)
`make smoke` — 45s, all-pass on this machine
`make check` — 10 min on qwen3.5-4b, baseline recorded
`make full` — 8 min on qwen3.5-4b, all 11 agent profiles green
`make benchmark` — 7/9 local models scored; 2 correctly skipped as partial downloads
Regression detection verified end-to-end: tampered baseline → REGRESSION fired with correct sign + magnitude → restored
Concurrent invocations: atomic mkdir prevents collision

Recorded artefacts

`harness/baselines/check-qwen3.5-4b.json`
`harness/baselines/full-qwen3.5-4b.json`
`harness/scorecard/latest.md`

🤖 Generated with Claude Code

Pre-doctor cleanup: 36 auto-fixed (unsorted imports, unused imports, f-string placeholders, deprecated import, yoda conditions, redundant if-branches) + 8 manual fixes: - bare 'except' → except Exception (regression_suite.py) - ambiguous variable name 'l' → 'line' (regression_suite.py) - multiple statements on one line (test_openwebui.py, test_output_router.py) - BaseEngine.generate_warmup empty default — added noqa comment with rationale (intentional override point, not abstract) All 2027 tests still pass. Sets a clean baseline for the upcoming 'rapid-mlx doctor' regression harness which will fail on lint regressions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New ``rapid-mlx doctor`` subcommand. Smoke tier (no model required): - repo_layout — verify aliases.json, agent profiles dir, pyproject - imports — top-level + engine + agent profile load - ruff — lint (with binary fallback if not in venv) - cli_sanity — smoke --help / models / agents subcommands - pytest — full unit suite (~45s, 2027 tests) Architecture: vllm_mlx/doctor/ runner.py — DoctorRunner, CheckResult, TierResult, exit codes cli.py — tier dispatch (smoke implemented; check/full/ benchmark stubs return rc=2) checks/smoke.py — the 5 smoke checks harness/ thresholds.yaml — per-metric regression % baselines/ — gitignored placeholder for golden numbers runs/ — gitignored per-run artefacts Exit codes are stable contract for hooks/CI: 0 = all pass, 1 = perf regression, 2 = functional fail Each run writes both result.json (machine-readable) and report.md (human-readable, scannable). Next: check tier with Qwen3.5-4B + baseline diff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…que run dirs Three issues from codex round 1: 1. (P1) imports check used to import server/scheduler/engine, which transitively initialized mlx.core and crashed on hosts without a working Metal device. Smoke tier was supposed to be lightweight. Now imports only cli, model_aliases, agents, doctor — heavier modules still get exercised via pytest where it actually matters. Side benefit: imports check went from ~5s to ~0.1s. 2. (P2) doctor assumed a source checkout but was exposed to all installs. A pip-installed wheel doesn't ship tests/ or harness/, so the command would print confusing failures. Added an upfront guard that checks for sentinel paths and exits with a clear message. 3. (P2) run directories used minute-precision timestamps with exist_ok= True — two invocations of the same tier in the same minute would overwrite each other's report.md. Now uses second precision plus a numeric suffix on collision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ions Round 2 codex review: the previous check-then-create pattern would race when two doctor processes started in the same second — both saw the candidate path as free, both called mkdir(exist_ok=True), and one would silently overwrite the other's report.md. Switch to mkdir(exist_ok=False) inside a retry loop so the loser of the race gets FileExistsError and tries the next suffix. First invocation gets the unsuffixed dir; subsequent same-second runs get -1, -2, ... Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the 'check' tier (~1-2 min for Qwen3.5-4B): rapid-mlx doctor check [--model qwen3.5-4b] [--update-baselines] Components: - vllm_mlx/doctor/server.py Subprocess lifecycle helper. Picks a free port, polls /health until ready or timeouts (with explicit error if process exits early), and guarantees teardown via SIGTERM → SIGKILL escalation on POSIX process groups. - vllm_mlx/doctor/baseline.py load/save baseline JSON in harness/baselines/{tier}.json, threshold loader (yaml-optional), comparator with sign-aware delta % (lower- is-better metrics inverted so positive Δ% always means "better"). Renders markdown delta tables for the report. - vllm_mlx/doctor/checks/api.py Wraps tests/regression_suite.py + tests/test_smoke_matrix.sh against the live server. Both honor RAPID_MLX_PORT / positional port arg so the doctor's free-port allocation works. - vllm_mlx/doctor/checks/perf.py Wraps scripts/autoresearch_bench.py --json. Captures 13 metrics. Treats all-zero primary metrics as a failure (catches the common silent-failure case where every request was rejected). CLI surface: --model override default qwen3.5-4b --update-baselines record current run as new baseline Required upstream tweaks (small, backwards-compatible): - regression_suite.py reads RAPID_MLX_PORT env (default 8777) - autoresearch_bench.py reads AUTORESEARCH_BASE env (default :8000) - Both scripts + smoke_matrix.sh now use model="default" instead of "any"/"x" — those names were never registered, so requests 404'd. First real run on Qwen3.5-4B surfaces 2 actual issues in the model: - <|im_end|> special-token leak in streaming output - thinking-toggle ratio test fails That is exactly what the doctor is for; whether to fix them, change the test, or accept them is a separate decision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ines, threshold names Three issues from codex round 1 on the check tier: 1. (P1) regression_suite.py always exits 0 — only prints "N/M tests passed". The wrapper treated rc=0 as PASS unconditionally, so partial failures slipped through (real example: Test 10 streaming- usage-stats failure was masked). Now parses the summary line and fails when N < M, with clear diagnostic detail. 2. (P2) Baseline files were keyed by tier only — running 'doctor check --model qwen3.5-9b' against a baseline recorded for qwen3.5-4b would produce nonsense deltas, and --update-baselines would clobber the wrong file. Baselines are now per-(tier, model): harness/baselines/check-{model_slug}.json. Defence-in-depth: also refuses to compare if the baseline file's recorded model differs from the current run. 3. (P2) thresholds.yaml had stale metric names (ttft_cold_ms, multiturn_ttft_ms, tool_latency_s) that didn't match what autoresearch_bench.py emits (cold_ttft_ms, mt_ttft_ms, tc_latency_ms). Most overrides were silently ignored. Aligned the YAML to the actual metric names + added composite_score and tc_success_rate. End-to-end verified on Qwen3.5-4B: baseline recorded, second run shows no regressions (all metrics within ±2.4%), and the doctor correctly flags the pre-existing smoke_matrix bugs (<|im_end|> leak + thinking-toggle ratio) as failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Captured on commit 70d81b5 with mlx 0.31.1 + mlx-lm 0.31.2: decode_tps 49.7 tok/s cold_ttft_ms 314 ms cached_ttft_ms 394 ms tc_success_rate 1.0 (tool calling reliable) composite_score 109.9 Future 'doctor check' runs will diff against this. Threshold for decode_tps is ±5%, for ttft is ±10% (see harness/thresholds.yaml). Update with: rapid-mlx doctor check --update-baselines Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. (P2) tc_latency_ms was missing from _LOWER_IS_BETTER, so baseline diffs flipped sign for tool-call latency. Verified fix on Qwen3.5- 4B: a slight latency improvement now reports +0.2% (better) instead of -0.2% (worse). 2. (P2) _safe_model_slug used 'replace' which is not injective — 'foo/bar' and 'foo__bar' would collide. Switched to URL percent- encoding (urllib.parse.quote with safe='-.'); existing 'qwen3.5-4b' baseline filename is unchanged because no special chars need quoting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

For model IDs containing '/' or spaces (e.g. 'mlx-community/Qwen3.5-... 27B'), the previous slug was 'mlx-community__Qwen3.5-27B' but the new URL-encoded slug is 'mlx-community%2FQwen3.5-27B'. Without a fallback, baselines recorded under the old scheme become invisible and regression detection silently turns off until manual re-record. load_baseline() now tries the canonical path first, then the legacy slug. When found via the legacy path the file is left in place; running --update-baselines naturally migrates it to the new name on the next save. Existing 'qwen3.5-4b' baseline is unaffected (no special chars). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The previous legacy-slug fallback returned the file unconditionally, which could hand back another model's baseline when two ids collided under the old slug scheme (the very problem the new injective slug was meant to fix). _apply_baseline() would then turn that into a hard FAIL instead of a clean "no baseline" SKIP. Now the legacy-path read also validates that the file's recorded model field matches the requested model before returning. If it doesn't, fall through to None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the 'full' tier (~1-2 hr for 3 models × 11 agent profiles): rapid-mlx doctor full [--models qwen3.5-4b,qwen3.5-35b,gemma-4-26b] [--update-baselines] For each model in the list (default qwen3.5-4b + qwen3.5-35b + gemma-4-26b), boots a server then runs: - regression_suite (API contract) - smoke_matrix (output sanitization) - autoresearch (perf, with per-model baseline diff) - all 11 agent profiles' API test plans Per-model failures don't abort the whole sweep — the runner records the failure and moves on, so a single broken model still yields a complete report covering the others. Components: - vllm_mlx/doctor/checks/agent.py Wraps AgentTestRunner from vllm_mlx.agents.testing. ERROR is treated as FAIL (both indicate something the doctor should surface); SKIP is fine (capability not declared by the profile). - vllm_mlx/doctor/cli.py Refactor: extracted _run_per_model_block() shared by check + full. Per-iteration capture of profile_name/port in lambda defaults to avoid the classic loop-variable closure bug. - vllm_mlx/doctor/runner.py run_check() now overrides result.name with the caller's name, so multi-model runs don't collapse 'agent_langchain' entries across qwen3.5-4b / qwen3.5-35b / gemma-4-26b. - vllm_mlx/cli.py --models comma-list flag for full tier. Initial Qwen3.5-4B full-tier baseline recorded: harness/baselines/full-qwen3.5-4b.json First full run on Qwen3.5-4B: 14 pass / 3 fail / 0 skip. The 3 failures are pre-existing and consistent with the check tier (regression_suite Test 10, smoke_matrix <|im_end|> leak + thinking-toggle, hermes profile's specific test). All 11 agent profiles' API tests pass cleanly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

P2: In multi-model 'doctor full' runs, each model's baseline_diff write to diff.md would clobber the previous model's deltas. Now writes per-model diff-{model_slug}.md and appends a section to a combined diff.md ('## qwen3.5-4b' / '## qwen3.5-35b' / ...). Side fix: baseline_diff / baseline_update / no-baseline-found check names now include [model], so the report rows are unique across the loop and the final summary correctly counts each model's status separately. Codex's other flag (P1: agent import resolves to wrong package) was incorrect — 'from ...agents' from vllm_mlx/doctor/checks/agent.py correctly resolves to vllm_mlx.agents (verified by direct call: the check reaches AgentTestRunner and fails on 'server not running' rather than ImportError). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ve slug Two issues from codex round 1: P2: When no baseline exists, the previous patch dropped the diff.md write entirely, breaking automation that always reads run_dir/diff.md. Now writes both diff-{model}.md and appends to diff.md with a "no baseline yet for model=X" notice — same shape as the comparison case so callers can rely on the file existing. P2: Per-model artefact filenames used the local _filename_safe() helper, which only replaced '/' and ' ' — the same non-injective pattern that the per-model baseline split was meant to fix. 'foo/bar' and 'foo_bar' would produce the same diff-foo_bar.md and clobber. Switched to baseline._safe_model_slug (URL percent-encoded, injective) for both diff and server log filenames. Removed _filename_safe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three pieces that make the doctor a finished product, not just code: 1. tests/test_doctor_baseline.py — 15 unit tests pinning down the regression detection contract: - higher-is-better metrics: within / regression / improvement - lower-is-better metrics: sign-inversion contract (positive Δ% always means "better") + tc_latency_ms regression test for the codex round 2 finding - accuracy metrics: zero-tolerance for any drop - edge cases: zero baseline, NEW / DROPPED metrics, no baseline, unknown-metric default thresholds - _safe_model_slug injectivity (regression test for codex finding) Also manually verified end-to-end: tampered baseline (decode_tps 49.67 → 75.0, cold_tps 49.27 → 70.0), re-ran 'doctor check', got REGRESSION on both metrics with correct sign + magnitude, restored baseline. 2. harness/README.md — single source of truth covering: - The four tiers (smoke/check/full/benchmark) and when to use each - Exit codes (0/1/2 stable contract for hooks/CI) - Tier reference (what each check does, what model it boots) - Baseline workflow (record → review → commit; per-model files) - Threshold semantics (sign convention, defaults) - Run artefacts (report.md, result.json, diff.md, server.log) - Common failure modes + actionable hints - "Adding a new check" guide for future contributors 3. Makefile — ergonomic targets: - make smoke / check / full / benchmark - make update-baselines TIER=check - make lint / test / clean - make help (lists everything with one-line descriptions) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Project requires-python is >=3.10, but the Makefile required python3.12 specifically and bypassed any active virtualenv. Targets now auto-detect the available interpreter (3.12 → 3.13 → 3.11 → 3.10 → python3 → python) and respect 'make smoke PY=...' for explicit override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ex round 2) Codex P1: previous fix preferred 'python'/'python3' over versioned binaries to honor active venvs, but on macOS those names often resolve to system Python 3.9 — below our requires-python = >=3.10. The Makefile broke immediately on the first 'str | None' annotation. New ordering: 1. $VIRTUAL_ENV/bin/python — wins so venv users on any Python >=3.10 get exactly their venv's interpreter regardless of PATH. 2. python3.13/3.12/3.11/3.10 — only if no venv is active. 3. python3 last-resort fallback. Skips bare 'python' / 'python3' between #1 and #2 to avoid the system- 3.9 trap. Override still possible: 'make smoke PY=python3.13'. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…odex round 3) Codex P1: with pyenv shims, 'command -v python3.13' succeeds even when only 3.12 is installed — the shim itself exists for every version. Previous logic would pick the broken shim and every 'make smoke' would die with "pyenv: python3.13: command not found". Now the detection actually executes each candidate with a sys.version_info >= (3, 10) check before accepting it. Skips broken pyenv shims naturally because they exit non-zero when the underlying interpreter isn't installed. Order of preference unchanged: 1. $VIRTUAL_ENV/bin/python (verified by -x test) 2. python3.13 → 3.12 → 3.11 → 3.10 (verified by --version probe) 3. python3 fallback Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the 'benchmark' tier (overnight, all locally-available models): rapid-mlx doctor benchmark # auto-discover local models rapid-mlx doctor benchmark --models qwen3.5-4b,gemma-4-26b For each model: 1. Probe local HF cache + LM Studio cache + HF_HUB_CACHE for weights — never auto-download (caller pre-fetches what they want bench'd) 2. Boot a Simple-engine server 3. Run autoresearch_bench, capture decode TPS / TTFT / tool-call % 4. Tear down Aggregates into a markdown scorecard at: - harness/scorecard/scorecard-{ts}.md (gitignored) - harness/scorecard/latest.md (tracked, always current) - harness/runs/{ts}-benchmark/scorecard.md (run-dir copy) Components: - vllm_mlx/doctor/discovery.py Locates models in HF_HUB_CACHE, ~/.cache/huggingface, and ~/.lmstudio/models. Sanity-checks for config.json before claiming a hit. Returns ModelAvailability with the alias, repo_id, available bool, and resolved path or skip reason. - vllm_mlx/doctor/checks/benchmark.py Wraps autoresearch_bench --json. Returns CheckResult.metrics populated for the scorecard renderer to consume. Same all-zero primary-metric guard as the check tier perf.py. - vllm_mlx/doctor/scorecard.py Renders one row per model with 5 columns (Decode TPS / Cold TTFT / Cached TTFT / Tool % / Score). Failed cells show 'FAIL — <reason>' in the status column instead of vanishing — the scorecard always covers every model the user asked about. - vllm_mlx/doctor/cli.py run_benchmark_tier() orchestrates everything; per-model errors never abort the sweep (caught at outer level too). Initial scorecard recorded for qwen3.5-4b in harness/scorecard/latest.md (168 tok/s decode, 188ms cold TTFT, 100% tool-call success). v1 deliberately scoped to Simple engine only; cross-engine columns (Simple/Batched/Hybrid) deferred to v2 once BatchedEngine stabilises per issue #105. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two P1 issues from codex round 1 on the benchmark tier: 1. 'rapid-mlx doctor benchmark --models <alias>' bypassed the availability check entirely, so a typo or unfetched alias would trigger a multi-GB download mid-sweep — exactly the contract ('never auto-download') the commit was supposed to enforce. Now: explicit list still gets validated against discovery; missing aliases land in the skipped section with a clear reason instead. Unknown aliases (not in aliases.json) are also detected and labeled. 2. LM Studio model layout was unreachable. LM Studio stores models as ~/.lmstudio/models/{org}/{repo}/, but discovery only looked for the HF hub layout (models--{org}--{repo}/snapshots/...). Now _check_alias() probes both layouts at every cache root, so models installed via LM Studio are first-class. Side improvement: when --models gives nothing runnable, the error message lists each skipped alias with its reason instead of the generic "no models available locally". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Discovery now correctly finds models in LM Studio's ~/.lmstudio/models/{org}/{repo}/ layout, but the benchmark tier still called serve(model=alias) — which runs the alias through mlx_lm.load() and triggers an HF Hub download because nothing in the serve path knows about the LM Studio cache. End result: discovery said "ok, available", server proceeded to download anyway. Fix: serve() now takes an optional model_path. When discovery located a local snapshot, the benchmark passes that path explicitly, so the server loads from disk and never reaches Hugging Face. The display name in logs / scorecard still uses the alias. Side: bumped boot_timeout to 600s for benchmark runs because cold loads of 27B+ models routinely need >180s when paging from a slow external SSD. Earlier sweep shows three large-model cells failed exactly because of the 180s ceiling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…json The latest benchmark sweep surfaced a real bug: discovery marked qwen3.5-27b and qwopus-27b-8bit as available because their snapshot dirs had config.json + model.safetensors.index.json. But the actual *.safetensors shards were never downloaded — the snapshots were half-broken (likely interrupted mid-download). Server then crashed with "No safetensors found", showing up as a misleading 'server boot failed' in the scorecard. Worse: when a model has multiple snapshot copies in different cache roots (one partial on the SSD, one complete in ~/.cache/huggingface), discovery picked the *first* hit by config.json — which was the broken partial copy. Fix: - _is_complete_snapshot() now requires both config.json AND at least one resolvable weight file (.safetensors / .npz / .gguf). Symlinks are resolve(strict=True)'d so dangling links don't false-pass. - _check_alias() iterates all snapshots in all cache roots, picks the first *complete* one. Tracks whether any partial snapshot was seen so the skip reason can say "partial download?" instead of the generic "not found". Verified: qwen3.5-27b now skipped with clear reason. qwopus-27b-8bit correctly resolves to the local HF cache (which has the safetensors) instead of the broken SSD snapshot. Test coverage: 9 new tests in tests/test_doctor_discovery.py covering complete/missing config/missing weights/dangling symlink/.npz fallback + multi-root + LM Studio layout cases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…x round 2 P1) Codex P1: previous _is_complete_snapshot() returned True as soon as ANY single .safetensors file resolved. For sharded models, an interrupted download can leave (config.json + index.json + a subset of shards), and that would still pass — but mlx_lm.load() crashes once it tries to open a missing shard. New behavior: when model.safetensors.index.json exists, parse its weight_map and require EVERY referenced shard to resolve. Falls through to the single-file glob check when there's no index (so single-shard / .npz / .gguf models still work without an index). Test coverage extended: 3 new tests (sharded_complete_passes, sharded_partial_fails — the regression test for codex's finding — and sharded_dangling_symlink_fails). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…round 3) Codex P1: a corrupt or truncated index.json (common when a download gets killed mid-write) used to fall through to the single-file glob check, which returns True the moment any one shard happens to resolve. That re-introduces the false-positive sharded models were supposed to fix. Now: if index.json exists but JSON-decode fails, return False outright — we can't trust the snapshot. Empty weight_map still falls through because it's a legitimate template pattern (single-file model with a stub index). Two new tests: - test_corrupt_index_with_shards_fails (regression for codex P1) - test_index_with_empty_weight_map_falls_through_to_single_file (verifies the legitimate single-file fallback still works) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ape hygiene Subagent self-review of PR #125 found 12 issues codex's line-level passes missed. This commit addresses all P1s and most P2s. P1 — real bugs: 1. CLI's main parser auto-resolved 'args.model' for every subcommand *including doctor*, so 'rapid-mlx doctor check --model qwen3.5-4b' arrived at the doctor with model='mlx-community/Qwen3.5-4B-...', wrote baseline 'check-mlx-community%2F....json' instead of 'check-qwen3.5-4b.json', and reported "no baseline found" against the committed file. Fix: skip alias resolution for the doctor subcommand (it does its own resolution at server-spawn time via discovery). Default 'doctor check' worked only by accident because no flag was passed. 2. README claimed pytest excludes "integration/server" tests but the actual command only excludes tests/integrations/ + the one server-required event-loop test. Reworded to match reality. 3. harness/thresholds.yaml listed tool_call_accuracy / reasoning_accuracy / coding_accuracy — none of which the harness ever emits. Dead config that would have silently misled anyone tuning them. Removed; left a comment noting where to add real accuracy thresholds when BFCL etc. are wired in. P2 — UX hygiene: 4. --update-baselines now prints a warning and is ignored when used with smoke / benchmark (they don't record baselines). Used to silently exit 0 with no indication. 5. Boot timeout for full tier raised from 180s to 600s — same 27B+ models in full as in benchmark; was hitting timeout where benchmark wouldn't. 6. _resolve_agent_profiles() no longer silently degrades to ['generic'] on profile-load failures; logs a clear WARNING so the user knows the documented 11-profile sweep just turned into a 1-profile run. 7. scorecard.py was the only generator that didn't escape '|' in failure-detail strings. A stack-trace frame ('File "x.py" | line N | y') would break the table. Extracted shared md_cell() helper into runner.py and used from both places. 8. baseline_diff used to be appended to CheckResult.detail, which gets truncated to 120 chars in the markdown report. Now stashed on the runner and rendered as a proper "## Baseline diffs" section in report.md, so report is self-contained without needing diff.md alongside. 9. _safe_model_slug renamed to safe_model_slug (was '_private' but imported across modules + tests). 10. README now documents make targets ('make smoke / check / full / benchmark / update-baselines TIER=...') as the primary interface, with the long-form rapid-mlx invocation footnoted. 11. README mentions HF_HOME/hub in cache discovery list (was omitted, surprising users who set HF_HOME). 12. --models help text now correctly says "for full / benchmark tiers" (was "for full tier" only). Test coverage gaps closed (17 new tests in test_doctor_runner.py): - Exit-code contract (the WHOLE point of the harness for CI): all-pass=0, regression=1, fail=2, fail-dominates-regression, empty-runs=0, skip-only=0 - md_cell escaping: pipes, newlines, truncation, empty, None - Atomic run-dir reservation across back-to-back constructions - Report rendering: basic table, diff sections embedded, pipe in detail doesn't break table layout Total doctor test coverage now: 46 tests across baseline + discovery + runner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Codex P2: previous fix bumped boot_timeout_s to 600s inside the shared _run_per_model_block, which means 'doctor check' (which only boots the small qwen3.5-4b) now also waits up to 10 minutes before reporting a broken server boot. That defeats the check tier's fast-feedback purpose. Made boot_timeout_s a parameter of _run_per_model_block (default 180s). Only run_full_tier passes 600s — that's the tier that actually loads 27B+ models. run_check_tier inherits the default. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previous fix scoped the 600s boot timeout to the full tier only, which broke 'doctor check --model qwen3.5-35b' (a documented workflow): check would now use the 180s default and time out on legitimate 27B+ cold loads. Replaced tier-scoped logic with a per-model heuristic. _suggested_boot_timeout(alias) returns 600s when the alias contains a ≥20B parameter-count hint (matches '20b'..'999b' word-boundary), else 180s. This way: - 'doctor check' → 180s (qwen3.5-4b, fast-fail) - 'doctor check --model qwen3.5-35b' → 600s (auto-detect) - 'doctor full' default 3 models → 180s/600s/600s as needed - 'doctor full --models <list>' → per-model - 'doctor benchmark' → still uses its own 600s 3 new tests cover: - small aliases (qwen3.5-4b, llama3-3b, ...) → 180s - large aliases (qwen3.5-35b, qwopus-27b, hermes4-70b, ...) → 600s - unknown alias → 180s (fast-fail wins by default) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Codex P2: every per-model / per-tier heuristic missed at least one supported case (qwen3-coder is 80B but its alias has no NNb hint; minimax-m2.5 is huge but named 'minimax-m2.5'; etc.). Each missed case becomes a spurious 'doctor full --models <alias>' failure. Replaced the heuristic with a single DEFAULT_BOOT_TIMEOUT_S = 600s constant. Reasoning: - Server crashes (port conflict, missing weights, import error, bad path) are detected within <1s by proc.poll() in server._wait_for_health regardless of this value. - The only scenario that actually waits 600s is a genuinely slow legitimate cold load — at which point we *want* the long budget. - "Fast feedback in check tier" was a goal but the realistic worst case (server hang) doesn't happen often enough to justify the false-fail risk. Test simplified to assert DEFAULT_BOOT_TIMEOUT_S >= 600 (floor for 122B cold-load case) rather than enumerating which aliases get which timeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CI runs 'ruff format --check' which enforces formatting. 49 files were flagged — mostly pre-existing drift, plus the new doctor module files. Applied 'ruff format' uniformly. No logic changes. 2074 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Your Name and others added 28 commits April 15, 2026 21:05

raullenchai merged commit 05f1762 into main Apr 16, 2026
7 checks passed

raullenchai mentioned this pull request Apr 16, 2026

fix: 7 codex-discovered issues (#118-#124) #126

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: rapid-mlx doctor — 4-tier regression harness#125

feat: rapid-mlx doctor — 4-tier regression harness#125
raullenchai merged 28 commits intomainfrom
feat/doctor-harness

raullenchai commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raullenchai commented Apr 16, 2026

Summary

Architecture

Install footprint

Reuses existing scripts (no rewrites)

Lint debt cleanup (separate commit)

Codex review trail

Test plan

Recorded artefacts

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant