[tests] Review tests for PR #5427#50
Closed
danielhanchen wants to merge 32 commits into
Closed
Conversation
Closes unslothai#5401. Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`: 1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server. 2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line. 3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`. Fix: `studio/backend/core/inference/llama_cpp.py` * Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties. * Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap. `studio/backend/routes/inference.py` * Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back. * GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear". Verified end to end against a live `unsloth studio run` instance: | Scenario | Before | After | | --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ | | `/load` same (model, variant, settings) | 1 PID, `already_loaded` | unchanged | | `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192 | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` | | Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42` | | `/unload` + two parallel `/load` | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline | | `/load` with `llama_extra_args=[]` (explicit clear) | n/a | `loaded`, new server has no `--top-k` / `--seed` | | `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a | `loaded`, new server has the supplied flags | `pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.
Review feedback on PR unslothai#5427 from gemini-code-assist. The original short-circuit compared ``request.max_seq_length`` against ``llama_backend.context_length`` (the effective context). VRAM-fit logic can cap the running server below what the caller asked for, so this comparison incorrectly returns ``already_loaded`` when the user flips the slider from an explicit length (e.g. 8192) back to "Auto" (0): the explicit request was capped to, say, 4096, and the new "Auto" request reads ``backend.context_length == 4096`` and decides nothing changed. Track the originally requested ``n_ctx`` on the backend instead and compare against that. ``requested_n_ctx == 0`` means the last load asked for the model's native length; ``request.max_seq_length == 0`` matches it. Verified in the sandbox suite (now 90 tests): - ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit 8192, then Apply with ``max_seq_length=0`` falls through to a real reload and the new server runs at the native 40960. - ``test_auto_to_explicit_triggers_reload`` -- inverse direction. - ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with the same explicit value still short-circuits (no needless reload). - Existing scenarios (kv change, spec change, template change, extra args inherit, parallel-load stress, frontend Apply flow) unchanged. ``pytest studio/backend/tests`` still green on the same set of tests; the pre-existing ``test_help_output`` failure and ``test_studio_api`` fixture errors are unaffected.
Trim the verbose explanatory comments and docstrings introduced in f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still points at issue unslothai#5401; the multi-paragraph rationale belonged in the PR body, not the source. No behaviour change.
Resolves the conflict in `studio/backend/core/inference/llama_cpp.py` where bbd0ba0 / 63c6750 added a `mmproj_matches_model_family` guard to the mmproj attach block and this branch's f9cbec3 wrapped the body of `load_model` in `self._serial_load_lock`, re-indenting the same region by four spaces. Kept both changes: the mmproj attach now runs through the family-check guard inside the serial lock, and the `else: self._api_key = None` fallback for non-direct-stream mode is preserved. Verified post-merge: - `pytest studio/backend/tests --deselect test_studio_api.py`: 1056 passed, 46 skipped, 0 failed. - Sandbox unit + live integration suite re-run: 90 passed.
Tightens the inheritance and serial-load paths to close the six P2 findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` / `dd0b1d58`. 1. Re-check loaded state before killing queued loads. Two duplicate `/api/inference/load` requests both pass the route-level `is_loaded` gate before the first publishes `_healthy = True`. The second waits on `_serial_load_lock`, enters Phase 1, and tears down the just-spawned llama-server for a redundant full reload. Added `LlamaCppBackend._already_in_target_state(...)` and a short-circuit at the top of the serial-lock block: if the live server already satisfies the kwargs, return True without killing. 2. Don't inherit CLI overrides that shadow new first-class settings. `unsloth run -c 4096` is a permitted pass-through; the validator docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args` and appended after Studio's own flags, the inherited `-c 4096` silently won the last-wins parse against a new `max_seq_length=8192`. Added `strip_shadowing_flags` in `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`, `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the inherited list through it before validate + forward. 3. Restrict inherited llama args to the same GGUF model. `_extra_args` is deliberately preserved across `unload_model()` for the chat- settings Apply flow (`/unload` + `/load` with no `llama_extra_args` field). Now also track `_extra_args_source = (model_identifier, hf_variant)` so the route can refuse cross-model inheritance. `LlamaCppBackend.extra_args_source` exposes the tuple. 4. Persist extras only after a successful load. `_extra_args` was written at the top of `load_model` before Popen + health check, so a failed startup left bad args in place to poison the next UI retry. The write (along with `_requested_n_ctx`) is now deferred until after `_healthy = True`. 5. Ignore speculative diffs for vision loads. `load_model` silently gates speculative decoding on `not is_vision`, so the backend's `_speculative_type` stays `None` for vision models. The route's comparator now normalises the request's value to `"off"` when `llama_backend.is_vision` to avoid a no-op reload of a vision server every time the dropdown defaults to `default`. The `_already_in_target_state` helper applies the same rule. 6. Wait for the replacement server before short-circuiting. `_kill_process` did not clear `_healthy`; the new first-class settings (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`) are written under `_lock` BEFORE Popen + `_wait_for_health`. A duplicate `/load` arriving during the new server's warm-up window could short-circuit against the not-yet-healthy replacement and the caller would start inference against a server that was still loading. `_kill_process` now sets `_healthy = False` in its `finally` block so `is_loaded` returns False from the moment the old server is killed until the new one finishes warm-up. Tests: - Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was 90): new unit coverage for `strip_shadowing_flags` (12 cases), `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and cross-model behaviour, failed-load preserving prior extras, and the duplicate-load short-circuit at `load_model` level. New live integration cases verify shadow-strip via `pgrep` on the live llama-server cmdline, cross-model refusal, and PID stability across a duplicate-load race. All 136 pass. - `pytest studio/backend/tests --deselect test_studio_api.py`: 1079 passed, 46 skipped, identical to the pre-change count. The pre-existing `test_studio_api.py` fixture errors and the terminal-width-sensitive `test_help_output` are unaffected. - Ruff: clean on the three modified files.
for more information, see https://pre-commit.ci
Re-narrow llama_extra_args to None after validate_extra_args when the incoming request omitted the field, so the backend can distinguish "caller omitted, inherit prior load" from "caller explicitly cleared to []". Without this a queued duplicate /load reaches the backend as [] and fails _already_in_target_state's exact-equality check, killing the just-started llama-server. The pass-through validate call from the original "forward llama-server args from unsloth studio run / unsloth run" change is preserved as-is; only the post-pass narrowing is new. Cross-source loads now explicitly clear extras so a model switch can't accidentally inherit via the backend's "no opinion" semantics. Store the caller's hf_variant kwarg (None for local GGUF files) in _extra_args_source instead of the derived self._hf_variant (an extracted filename quant label like "Q4_K_M"). Same-source check in the route is now symmetric for HF and direct-file loads. Add gguf_path to _already_in_target_state and prefer on-disk path identity when both backend and caller have a path. This stops the duplicate-load guard from killing a healthy server on repeat local loads (where hf_variant is None on the caller side but extracted on the backend side). Split shadow-flag stripping into per-group toggles (context / cache / spec / template). The route now opts into stripping only the groups whose first-class field was actually set on the incoming request, so an inherited --chat-template-file survives an Apply that omits chat_template_override. _request_matches_loaded_settings detects shadowing extras on the inherit path and falls through to a real reload so the strip can run. Mark --spec-default, --jinja, --no-jinja as boolean inside the shadow stripper so the value-consuming heuristic no longer eats the following positional token.
Collaborator
Author
|
Fixes pushed to unslothai#5427. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated test files from review process