Skip to content

[tests] Review tests for PR #5427#50

Closed
danielhanchen wants to merge 32 commits into
mainfrom
pr-5427-tests
Closed

[tests] Review tests for PR #5427#50
danielhanchen wants to merge 32 commits into
mainfrom
pr-5427-tests

Conversation

@danielhanchen

Copy link
Copy Markdown
Collaborator

Automated test files from review process

danielhanchen and others added 30 commits May 14, 2026 23:18
Closes unslothai#5401.

Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`:

1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server.

2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line.

3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from unslothai#5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in unslothai#5161 / the model in unslothai#5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`.

Fix:

`studio/backend/core/inference/llama_cpp.py`

* Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from unslothai#5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties.
* Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap.

`studio/backend/routes/inference.py`

* Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back.
* GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear".

Verified end to end against a live `unsloth studio run` instance:

| Scenario                                                        | Before    | After                                                                    |
| --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ |
| `/load` same (model, variant, settings)                         | 1 PID, `already_loaded` | unchanged                                                                |
| `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192  | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` |
| Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42`                                          |
| `/unload` + two parallel `/load`                                | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline                  |
| `/load` with `llama_extra_args=[]` (explicit clear)             | n/a       | `loaded`, new server has no `--top-k` / `--seed`                          |
| `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a       | `loaded`, new server has the supplied flags                                |

`pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.
Review feedback on PR unslothai#5427 from gemini-code-assist.

The original short-circuit compared ``request.max_seq_length`` against
``llama_backend.context_length`` (the effective context). VRAM-fit
logic can cap the running server below what the caller asked for, so
this comparison incorrectly returns ``already_loaded`` when the user
flips the slider from an explicit length (e.g. 8192) back to "Auto"
(0): the explicit request was capped to, say, 4096, and the new "Auto"
request reads ``backend.context_length == 4096`` and decides nothing
changed.

Track the originally requested ``n_ctx`` on the backend instead and
compare against that. ``requested_n_ctx == 0`` means the last load
asked for the model's native length; ``request.max_seq_length == 0``
matches it.

Verified in the sandbox suite (now 90 tests):

- ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit
  8192, then Apply with ``max_seq_length=0`` falls through to a real
  reload and the new server runs at the native 40960.
- ``test_auto_to_explicit_triggers_reload`` -- inverse direction.
- ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with
  the same explicit value still short-circuits (no needless reload).
- Existing scenarios (kv change, spec change, template change, extra
  args inherit, parallel-load stress, frontend Apply flow) unchanged.

``pytest studio/backend/tests`` still green on the same set of tests;
the pre-existing ``test_help_output`` failure and ``test_studio_api``
fixture errors are unaffected.
Trim the verbose explanatory comments and docstrings introduced in
f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still
points at issue unslothai#5401; the multi-paragraph rationale belonged in the
PR body, not the source. No behaviour change.
Resolves the conflict in `studio/backend/core/inference/llama_cpp.py`
where bbd0ba0 / 63c6750 added a `mmproj_matches_model_family` guard
to the mmproj attach block and this branch's f9cbec3 wrapped the body
of `load_model` in `self._serial_load_lock`, re-indenting the same
region by four spaces.

Kept both changes: the mmproj attach now runs through the family-check
guard inside the serial lock, and the `else: self._api_key = None`
fallback for non-direct-stream mode is preserved.

Verified post-merge:

- `pytest studio/backend/tests --deselect test_studio_api.py`:
  1056 passed, 46 skipped, 0 failed.
- Sandbox unit + live integration suite re-run: 90 passed.
Tightens the inheritance and serial-load paths to close the six P2
findings raised by codex-connector on PR unslothai#5427 against `f9cbec3b` /
`dd0b1d58`.

1. Re-check loaded state before killing queued loads. Two duplicate
   `/api/inference/load` requests both pass the route-level
   `is_loaded` gate before the first publishes `_healthy = True`. The
   second waits on `_serial_load_lock`, enters Phase 1, and tears down
   the just-spawned llama-server for a redundant full reload. Added
   `LlamaCppBackend._already_in_target_state(...)` and a short-circuit
   at the top of the serial-lock block: if the live server already
   satisfies the kwargs, return True without killing.

2. Don't inherit CLI overrides that shadow new first-class settings.
   `unsloth run -c 4096` is a permitted pass-through; the validator
   docs explicitly call out `-c`/`--ctx-size`. Stored in `_extra_args`
   and appended after Studio's own flags, the inherited `-c 4096`
   silently won the last-wins parse against a new
   `max_seq_length=8192`. Added `strip_shadowing_flags` in
   `llama_server_args.py` (covers `-c`, `--cache-type-k/v`, `--spec-*`,
   `--chat-template*`, `--jinja`/`--no-jinja`) and the route runs the
   inherited list through it before validate + forward.

3. Restrict inherited llama args to the same GGUF model. `_extra_args`
   is deliberately preserved across `unload_model()` for the chat-
   settings Apply flow (`/unload` + `/load` with no `llama_extra_args`
   field). Now also track `_extra_args_source = (model_identifier,
   hf_variant)` so the route can refuse cross-model inheritance.
   `LlamaCppBackend.extra_args_source` exposes the tuple.

4. Persist extras only after a successful load. `_extra_args` was
   written at the top of `load_model` before Popen + health check, so
   a failed startup left bad args in place to poison the next UI
   retry. The write (along with `_requested_n_ctx`) is now deferred
   until after `_healthy = True`.

5. Ignore speculative diffs for vision loads. `load_model` silently
   gates speculative decoding on `not is_vision`, so the backend's
   `_speculative_type` stays `None` for vision models. The route's
   comparator now normalises the request's value to `"off"` when
   `llama_backend.is_vision` to avoid a no-op reload of a vision
   server every time the dropdown defaults to `default`. The
   `_already_in_target_state` helper applies the same rule.

6. Wait for the replacement server before short-circuiting. `_kill_process`
   did not clear `_healthy`; the new first-class settings
   (`_cache_type_kv`, `_speculative_type`, `_chat_template_override`)
   are written under `_lock` BEFORE Popen + `_wait_for_health`. A
   duplicate `/load` arriving during the new server's warm-up window
   could short-circuit against the not-yet-healthy replacement and the
   caller would start inference against a server that was still
   loading. `_kill_process` now sets `_healthy = False` in its
   `finally` block so `is_loaded` returns False from the moment the
   old server is killed until the new one finishes warm-up.

Tests:

- Sandbox suite under `./temp/sim_5401/` extended to 136 tests (was
  90): new unit coverage for `strip_shadowing_flags` (12 cases),
  `_kill_process` clears `_healthy`, `extra_args_source` lifecycle and
  cross-model behaviour, failed-load preserving prior extras, and the
  duplicate-load short-circuit at `load_model` level. New live
  integration cases verify shadow-strip via `pgrep` on the live
  llama-server cmdline, cross-model refusal, and PID stability across
  a duplicate-load race. All 136 pass.
- `pytest studio/backend/tests --deselect test_studio_api.py`:
  1079 passed, 46 skipped, identical to the pre-change count. The
  pre-existing `test_studio_api.py` fixture errors and the
  terminal-width-sensitive `test_help_output` are unaffected.
- Ruff: clean on the three modified files.
Re-narrow llama_extra_args to None after validate_extra_args when the
incoming request omitted the field, so the backend can distinguish
"caller omitted, inherit prior load" from "caller explicitly cleared
to []". Without this a queued duplicate /load reaches the backend as
[] and fails _already_in_target_state's exact-equality check, killing
the just-started llama-server. The pass-through validate call from
the original "forward llama-server args from unsloth studio run /
unsloth run" change is preserved as-is; only the post-pass narrowing
is new. Cross-source loads now explicitly clear extras so a model
switch can't accidentally inherit via the backend's "no opinion"
semantics.

Store the caller's hf_variant kwarg (None for local GGUF files) in
_extra_args_source instead of the derived self._hf_variant
(an extracted filename quant label like "Q4_K_M"). Same-source check
in the route is now symmetric for HF and direct-file loads.

Add gguf_path to _already_in_target_state and prefer on-disk path
identity when both backend and caller have a path. This stops the
duplicate-load guard from killing a healthy server on repeat local
loads (where hf_variant is None on the caller side but extracted on
the backend side).

Split shadow-flag stripping into per-group toggles (context / cache /
spec / template). The route now opts into stripping only the groups
whose first-class field was actually set on the incoming request, so
an inherited --chat-template-file survives an Apply that omits
chat_template_override. _request_matches_loaded_settings detects
shadowing extras on the inherit path and falls through to a real
reload so the strip can run.

Mark --spec-default, --jinja, --no-jinja as boolean inside the
shadow stripper so the value-consuming heuristic no longer eats the
following positional token.
@danielhanchen

Copy link
Copy Markdown
Collaborator Author

Fixes pushed to unslothai#5427.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant