[tests] Review tests for PR #5427 by danielhanchen · Pull Request #50 · shimmyshimmer/unsloth-staging-4

danielhanchen · 2026-05-16T13:41:49Z

Automated test files from review process

Closes unslothai#5401. Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`: 1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server. 2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line. 3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`. Fix: `studio/backend/core/inference/llama_cpp.py` * Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties. * Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap. `studio/backend/routes/inference.py` * Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back. * GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear". Verified end to end against a live `unsloth studio run` instance: | Scenario | Before | After | | --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ | | `/load` same (model, variant, settings) | 1 PID, `already_loaded` | unchanged | | `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192 | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` | | Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42` | | `/unload` + two parallel `/load` | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline | | `/load` with `llama_extra_args=[]` (explicit clear) | n/a | `loaded`, new server has no `--top-k` / `--seed` | | `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a | `loaded`, new server has the supplied flags | `pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.

Review feedback on PR unslothai#5427 from gemini-code-assist. The original short-circuit compared ``request.max_seq_length`` against ``llama_backend.context_length`` (the effective context). VRAM-fit logic can cap the running server below what the caller asked for, so this comparison incorrectly returns ``already_loaded`` when the user flips the slider from an explicit length (e.g. 8192) back to "Auto" (0): the explicit request was capped to, say, 4096, and the new "Auto" request reads ``backend.context_length == 4096`` and decides nothing changed. Track the originally requested ``n_ctx`` on the backend instead and compare against that. ``requested_n_ctx == 0`` means the last load asked for the model's native length; ``request.max_seq_length == 0`` matches it. Verified in the sandbox suite (now 90 tests): - ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit 8192, then Apply with ``max_seq_length=0`` falls through to a real reload and the new server runs at the native 40960. - ``test_auto_to_explicit_triggers_reload`` -- inverse direction. - ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with the same explicit value still short-circuits (no needless reload). - Existing scenarios (kv change, spec change, template change, extra args inherit, parallel-load stress, frontend Apply flow) unchanged. ``pytest studio/backend/tests`` still green on the same set of tests; the pre-existing ``test_help_output`` failure and ``test_studio_api`` fixture errors are unaffected.

Trim the verbose explanatory comments and docstrings introduced in f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still points at issue unslothai#5401; the multi-paragraph rationale belonged in the PR body, not the source. No behaviour change.

Resolves the conflict in `studio/backend/core/inference/llama_cpp.py` where bbd0ba0 / 63c6750 added a `mmproj_matches_model_family` guard to the mmproj attach block and this branch's f9cbec3 wrapped the body of `load_model` in `self._serial_load_lock`, re-indenting the same region by four spaces. Kept both changes: the mmproj attach now runs through the family-check guard inside the serial lock, and the `else: self._api_key = None` fallback for non-direct-stream mode is preserved. Verified post-merge: - `pytest studio/backend/tests --deselect test_studio_api.py`: 1056 passed, 46 skipped, 0 failed. - Sandbox unit + live integration suite re-run: 90 passed.

Tightens the inheritance and serial-load paths to close the six P2 findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` / `dd0b1d58`. 1. Re-check loaded state before killing queued loads. Two duplicate `/api/inference/load` requests both pass the route-level `is_loaded` gate before the first publishes `_healthy = True`. The second waits on `_serial_load_lock`, enters Phase 1, and tears down the just-spawned llama-server for a redundant full reload. Added `LlamaCppBackend._already_in_target_state(...)` and a short-circuit at the top of the serial-lock block: if the live server already satisfies the kwargs, return True without killing. 2. Don't inherit CLI overrides that shadow new first-class settings. `unsloth run -c 4096` is a permitted pass-through; the validator docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args` and appended after Studio's own flags, the inherited `-c 4096` silently won the last-wins parse against a new `max_seq_length=8192`. Added `strip_shadowing_flags` in `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`, `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the inherited list through it before validate + forward. 3. Restrict inherited llama args to the same GGUF model. `_extra_args` is deliberately preserved across `unload_model()` for the chat- settings Apply flow (`/unload` + `/load` with no `llama_extra_args` field). Now also track `_extra_args_source = (model_identifier, hf_variant)` so the route can refuse cross-model inheritance. `LlamaCppBackend.extra_args_source` exposes the tuple. 4. Persist extras only after a successful load. `_extra_args` was written at the top of `load_model` before Popen + health check, so a failed startup left bad args in place to poison the next UI retry. The write (along with `_requested_n_ctx`) is now deferred until after `_healthy = True`. 5. Ignore speculative diffs for vision loads. `load_model` silently gates speculative decoding on `not is_vision`, so the backend's `_speculative_type` stays `None` for vision models. The route's comparator now normalises the request's value to `"off"` when `llama_backend.is_vision` to avoid a no-op reload of a vision server every time the dropdown defaults to `default`. The `_already_in_target_state` helper applies the same rule. 6. Wait for the replacement server before short-circuiting. `_kill_process` did not clear `_healthy`; the new first-class settings (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`) are written under `_lock` BEFORE Popen + `_wait_for_health`. A duplicate `/load` arriving during the new server's warm-up window could short-circuit against the not-yet-healthy replacement and the caller would start inference against a server that was still loading. `_kill_process` now sets `_healthy = False` in its `finally` block so `is_loaded` returns False from the moment the old server is killed until the new one finishes warm-up. Tests: - Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was 90): new unit coverage for `strip_shadowing_flags` (12 cases), `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and cross-model behaviour, failed-load preserving prior extras, and the duplicate-load short-circuit at `load_model` level. New live integration cases verify shadow-strip via `pgrep` on the live llama-server cmdline, cross-model refusal, and PID stability across a duplicate-load race. All 136 pass. - `pytest studio/backend/tests --deselect test_studio_api.py`: 1079 passed, 46 skipped, identical to the pre-change count. The pre-existing `test_studio_api.py` fixture errors and the terminal-width-sensitive `test_help_output` are unaffected. - Ruff: clean on the three modified files.

for more information, see https://pre-commit.ci

Re-narrow llama_extra_args to None after validate_extra_args when the incoming request omitted the field, so the backend can distinguish "caller omitted, inherit prior load" from "caller explicitly cleared to []". Without this a queued duplicate /load reaches the backend as [] and fails _already_in_target_state's exact-equality check, killing the just-started llama-server. The pass-through validate call from the original "forward llama-server args from unsloth studio run / unsloth run" change is preserved as-is; only the post-pass narrowing is new. Cross-source loads now explicitly clear extras so a model switch can't accidentally inherit via the backend's "no opinion" semantics. Store the caller's hf_variant kwarg (None for local GGUF files) in _extra_args_source instead of the derived self._hf_variant (an extracted filename quant label like "Q4_K_M"). Same-source check in the route is now symmetric for HF and direct-file loads. Add gguf_path to _already_in_target_state and prefer on-disk path identity when both backend and caller have a path. This stops the duplicate-load guard from killing a healthy server on repeat local loads (where hf_variant is None on the caller side but extracted on the backend side). Split shadow-flag stripping into per-group toggles (context / cache / spec / template). The route now opts into stripping only the groups whose first-class field was actually set on the incoming request, so an inherited --chat-template-file survives an Apply that omits chat_template_override. _request_matches_loaded_settings detects shadowing extras on the inherit path and falls through to a real reload so the strip can run. Mark --spec-default, --jinja, --no-jinja as boolean inside the shadow stripper so the value-consuming heuristic no longer eats the following positional token.

danielhanchen · 2026-05-16T14:06:33Z

Fixes pushed to unslothai#5427.

danielhanchen and others added 30 commits May 14, 2026 23:18

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

4acd9d1

ci: retrigger after zoo drift + IPython fixes landed in main

8da1eed

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

b38c29f

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

935775a

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

3b88887

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

93c92aa

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

ef0895c

ci: retrigger Mac Studio UI CI after transient fetch flake

7616df2

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

df2ee0b

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

e1913d6

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

94a106a

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

b25e04b

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

9417bcc

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

72bf330

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

bad668a

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

e46bcca

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

d9e7528

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

ba7ebf4

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

a794a1a

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

da6fe31

Merge branch 'main' into fix-5401-studio-reload-serialise-and-extras

ea53c50

[pre-commit.ci] auto fixes from pre-commit.com hooks

04f0ac2

for more information, see https://pre-commit.ci

Scrub .github/workflows for staging push (matches staging base)

e0d0a33

Merge origin/main into fix-5401-studio-reload-serialise-and-extras

2c09bf1

danielhanchen added 2 commits May 16, 2026 13:56

Studio: cover GGUF reload inheritance and shadow-flag stripping

6d7ae0d

Studio: drop redundant issue refs from inheritance comments

2f4af84

danielhanchen closed this May 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tests] Review tests for PR #5427#50

[tests] Review tests for PR #5427#50
danielhanchen wants to merge 32 commits into
mainfrom
pr-5427-tests

danielhanchen commented May 16, 2026

Uh oh!

danielhanchen commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

danielhanchen commented May 16, 2026

Uh oh!

danielhanchen commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant