Studio: serialise GGUF reload and inherit unsloth-run extra args#5427
Conversation
Closes #5401. Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`: 1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server. 2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line. 3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from #5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in #5161 / the model in #5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`. Fix: `studio/backend/core/inference/llama_cpp.py` * Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from #5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties. * Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap. `studio/backend/routes/inference.py` * Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back. * GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear". Verified end to end against a live `unsloth studio run` instance: | Scenario | Before | After | | --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ | | `/load` same (model, variant, settings) | 1 PID, `already_loaded` | unchanged | | `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192 | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` | | Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42` | | `/unload` + two parallel `/load` | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline | | `/load` with `llama_extra_args=[]` (explicit clear) | n/a | `loaded`, new server has no `--top-k` / `--seed` | | `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a | `loaded`, new server has the supplied flags | `pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9cbec3b60
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if extra_args is not None: | ||
| self._extra_args = list(extra_args) | ||
|
|
||
| self._cancel_event.clear() |
There was a problem hiding this comment.
Preserve unload cancellation for queued loads
When two /api/inference/load calls are serialized here and /api/inference/unload is posted while the first load is downloading, unload_model() only sets _cancel_event; after the first load exits, the second load acquires _serial_load_lock and immediately clears that cancellation before starting. In that concurrent-load + unload scenario the unload request does not cancel the queued in-flight load, so a llama-server can be spawned after the user has explicitly unloaded/cancelled loading.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces a serial lock for model loading in llama_cpp.py to prevent concurrent loads from causing OOM and implements a more robust 'already loaded' check in the inference routes by comparing all runtime settings. It also allows inheriting extra arguments across reloads. Feedback focuses on improving the context length comparison by tracking the originally requested n_ctx value, which ensures that switching between 'Auto' and explicit lengths correctly triggers a reload.
| # without the llama_extra_args field, since the chat UI does not | ||
| # round-trip those flags) can inherit the values the CLI / first | ||
| # load supplied. See issue #5401. | ||
| self._extra_args: Optional[List[str]] = None |
There was a problem hiding this comment.
To accurately detect when a user wants to change settings during a reload, we should track the original n_ctx value requested. This allows the route layer to distinguish between a request for the model's native length (0) and an explicit length that happens to match the current effective (possibly capped) context length.
| self._extra_args: Optional[List[str]] = None | |
| self._extra_args: Optional[List[str]] = None | |
| self._requested_n_ctx: int = 0 |
| them. The route layer uses this to inherit ``unsloth run``-style | ||
| flags across frontend Apply reloads — see issue #5401. | ||
| """ | ||
| return list(self._extra_args) if self._extra_args is not None else None |
There was a problem hiding this comment.
Expose the requested context length as a property to allow the route layer to perform accurate reload checks.
| return list(self._extra_args) if self._extra_args is not None else None | |
| return list(self._extra_args) if self._extra_args is not None else None | |
| @property | |
| def requested_n_ctx(self) -> int: | |
| """The original n_ctx value from the last load_model() call.""" | |
| return self._requested_n_ctx |
| if extra_args is not None: | ||
| self._extra_args = list(extra_args) |
| backend_ctx = llama_backend.context_length or 0 | ||
| if request.max_seq_length and request.max_seq_length != backend_ctx: | ||
| return False |
There was a problem hiding this comment.
The current check for max_seq_length has a bug: if request.max_seq_length is 0 (Auto), the if block is skipped entirely. This means if a model is currently running at an explicit length (e.g., 8192) and the user switches back to "Auto", the short-circuit will incorrectly return already_loaded and the server will remain at 8192 instead of resetting to the native length.
Comparing against the tracked requested_n_ctx (suggested in llama_cpp.py) fixes this by ensuring the intent of the request matches the intent of the current load.
| backend_ctx = llama_backend.context_length or 0 | |
| if request.max_seq_length and request.max_seq_length != backend_ctx: | |
| return False | |
| if request.max_seq_length != llama_backend.requested_n_ctx: | |
| return False |
Review feedback on PR #5427 from gemini-code-assist. The original short-circuit compared ``request.max_seq_length`` against ``llama_backend.context_length`` (the effective context). VRAM-fit logic can cap the running server below what the caller asked for, so this comparison incorrectly returns ``already_loaded`` when the user flips the slider from an explicit length (e.g. 8192) back to "Auto" (0): the explicit request was capped to, say, 4096, and the new "Auto" request reads ``backend.context_length == 4096`` and decides nothing changed. Track the originally requested ``n_ctx`` on the backend instead and compare against that. ``requested_n_ctx == 0`` means the last load asked for the model's native length; ``request.max_seq_length == 0`` matches it. Verified in the sandbox suite (now 90 tests): - ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit 8192, then Apply with ``max_seq_length=0`` falls through to a real reload and the new server runs at the native 40960. - ``test_auto_to_explicit_triggers_reload`` -- inverse direction. - ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with the same explicit value still short-circuits (no needless reload). - Existing scenarios (kv change, spec change, template change, extra args inherit, parallel-load stress, frontend Apply flow) unchanged. ``pytest studio/backend/tests`` still green on the same set of tests; the pre-existing ``test_help_output`` failure and ``test_studio_api`` fixture errors are unaffected.
|
Thanks both. Pushed gemini: requested_n_ctx tracking (addressed)Real correctness bug. The original comparator did backend_ctx = llama_backend.context_length or 0
if request.max_seq_length and request.max_seq_length != backend_ctx:
return Falsewhich compares against the effective context (post VRAM-fit cap) and silently skips when Fix tracks the originally requested
Sandbox added three new live-integration tests against
Plus four new unit tests on codex: unload cancellation for queued loads (P2, deferred)Real but semantically subtle: when load A is mid-Phase-2 and load B is queued behind the serial lock, an The reason I am not addressing this in PR #5427 is that the obvious fix (have load B abort if it sees The original #5401 OOM (two |
|
Thanks for the detailed update and for implementing the |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd0b1d58a7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # ── Phase 1: kill old process (under lock, fast) ────────── | ||
| with self._lock: | ||
| self._kill_process() |
There was a problem hiding this comment.
Re-check loaded state before killing queued loads
When two identical /api/inference/load requests overlap, both route-level already_loaded checks run before the first request has finished loading. The second request then waits on _serial_load_lock, enters here after the first server is healthy, and immediately calls _kill_process(), tearing down the just-loaded model and doing a redundant full reload; the first caller can receive status="loaded" while the server is already being killed by the queued duplicate. Re-check the current loaded model/settings after acquiring the serial lock, before killing the process.
Useful? React with 👍 / 👎.
Resolves the conflict in `studio/backend/core/inference/llama_cpp.py` where bbd0ba0 / 63c6750 added a `mmproj_matches_model_family` guard to the mmproj attach block and this branch's f9cbec3 wrapped the body of `load_model` in `self._serial_load_lock`, re-indenting the same region by four spaces. Kept both changes: the mmproj attach now runs through the family-check guard inside the serial lock, and the `else: self._api_key = None` fallback for non-direct-stream mode is preserved. Verified post-merge: - `pytest studio/backend/tests --deselect test_studio_api.py`: 1056 passed, 46 skipped, 0 failed. - Sandbox unit + live integration suite re-run: 90 passed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56d598045d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if request.llama_extra_args is None and llama_backend.extra_args: | ||
| inherited = list(llama_backend.extra_args) | ||
| try: | ||
| extra_llama_args = validate_extra_args(inherited) |
There was a problem hiding this comment.
Don't inherit CLI overrides when applying UI settings
When the previous load came from unsloth run with pass-through args that override a first-class setting (for example -c/--ctx-size, which the validator permits and llama_cpp appends after Studio's own flags), a later UI /load omits llama_extra_args and lands here. The route then reuses those old args while also passing the user's new max_seq_length/cache/spec settings, so the inherited CLI flag wins and the Apply change has no effect; meanwhile _requested_n_ctx is updated to the new request value, so subsequent already-loaded checks can incorrectly report a match against a server still running the old override.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3b88887763
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if request.llama_extra_args is None and llama_backend.extra_args: | ||
| inherited = list(llama_backend.extra_args) | ||
| try: | ||
| extra_llama_args = validate_extra_args(inherited) |
There was a problem hiding this comment.
Restrict inherited llama args to the same GGUF model
When a request omits llama_extra_args, this inherits the previous GGUF server's args without checking that they came from the same model being loaded. Since _extra_args is intentionally preserved across unloads, a user can load model A with model-specific pass-through flags such as an allowed --chat-template-file or RoPE option, switch/unload, and then load model B from the UI with the field omitted; model B will silently run with model A's flags, which can break startup or corrupt inference behavior. Consider only inheriting for the same model/variant or clearing the stored args on unrelated loads.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93c92aadbb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if extra_args is not None: | ||
| self._extra_args = list(extra_args) | ||
| self._requested_n_ctx = int(n_ctx) |
There was a problem hiding this comment.
Persist extra args only after a successful load
When a GGUF startup fails after being invoked with non-empty extra_args, these lines have already saved those args in _extra_args; because the route later inherits llama_backend.extra_args whenever the UI omits llama_extra_args, a normal retry from the chat UI will keep reusing the failed/invalid llama-server flags even though no model was successfully loaded with them. Store the args only after the health check succeeds, or roll them back on failure, so a failed load does not poison subsequent omitted-args loads.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b25e04b8f4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| req_spec = _normalise_settings_str(request.speculative_type) | ||
| backend_spec = _normalise_settings_str(llama_backend.speculative_type) | ||
| if (req_spec or "off") != (backend_spec or "off"): |
There was a problem hiding this comment.
Ignore speculative diffs for vision loads
When the loaded GGUF is vision/multimodal and the request carries a non-off speculative_type, this comparison forces a reload even though the backend ignores speculative decoding for vision models (load_model only applies it when not is_vision, so llama_backend.speculative_type stays None). In that scenario a no-op /api/inference/load for the same model/settings cannot take the new short-circuit and will unnecessarily tear down and reload the vision model; normalize requested speculation to off for vision before comparing.
Useful? React with 👍 / 👎.
|
This PR appears to address open issue(s). The duplicate detector matched the following open issues with HIGH confidence:
If this PR fixes any of them, consider adding |
danielhanchen
left a comment
There was a problem hiding this comment.
Review
Three backend bugs in Studio's llama.cpp reload path. The fixes are correct and well-reasoned. One issue in the new strip_shadowing_flags function needs to be addressed before merge; the rest are notes.
[Must fix] strip_shadowing_flags: boolean flags incorrectly treated as value-bearing
File: studio/backend/core/inference/llama_server_args.py
--spec-default, --jinja, and --no-jinja are in _SHADOWING_FLAGS and are boolean flags (they take no value argument). They are NOT in _DENYLIST — the module's own comment at line 36 explicitly lists --jinja and --spec-* as pass-through flags — so they can appear in a user's extra_args.
The parser's value-consumption branch:
if "=" not in tok and i + 1 < n and _flag_name(tokens[i + 1]) is None:
i += 2fires whenever the next token doesn't look like a flag. For these boolean flags, if a positional value happens to follow them, the parser advances by 2, silently discarding the next token.
Fix: recognise the boolean flags and skip only 1 token regardless of what follows:
_BOOLEAN_SHADOWING_FLAGS: frozenset[str] = frozenset({
"--spec-default",
"--jinja",
"--no-jinja",
})
def strip_shadowing_flags(args: Iterable[str]) -> list[str]:
tokens = [str(a) for a in (args or [])]
out: list[str] = []
i, n = 0, len(tokens)
while i < n:
tok = tokens[i]
flag = _flag_name(tok)
if flag is None or flag not in _SHADOWING_FLAGS:
out.append(tok)
i += 1
continue
# Packed flags (--flag=val) are self-contained. Boolean flags take
# no value token. Only bare value-bearing flags consume the next token.
if "=" in tok or flag in _BOOLEAN_SHADOWING_FLAGS:
i += 1
elif i + 1 < n and _flag_name(tokens[i + 1]) is None:
i += 2
else:
i += 1
return out[Minor] _already_in_target_state / _request_matches_loaded_settings duplication
Files: llama_cpp.py, routes/inference.py
Both functions independently compare the full settings set for the same semantic purpose. The PR cross-references them in a comment which is good, but the two can silently diverge if one is updated without the other. A future cache_type_kv normalisation change in one that misses the other would re-introduce #5401.
A _settings_fingerprint() -> tuple helper on LlamaCppBackend that both callers delegate to would eliminate the duplication. Not blocking.
Notes (no action required)
max_seq_length default is 0, not None. LoadRequest.max_seq_length has int = Field(0, ...) so comparing it directly to requested_n_ctx (also initialised to 0) is correct for the Auto case. The plain != without int-cast is safe.
Backend fixes look correct:
_serial_load_lockserialises concurrent/loadcalls end-to-end; at most one llama-server process alive at any time._healthy = Falsein_kill_processcloses the race window where a concurrent reader could seeis_loaded = Trueagainst a dead PID during warm-up.- Extra-args inheritance (None = inherit, [] = clear, list = override) is clearly documented and gated on same-model source to prevent cross-model flag bleed.
_request_matches_loaded_settingscorrectly comparesmax_seq_lengthagainstrequested_n_ctx(the requested value, not the VRAM-capped effective value) — right comparison for detecting slider flips.
|
Auto-review verdict: Approved Fixes a Studio GGUF reload regression where same-(model, variant) Apply silently dropped settings, where chat-settings Apply lost unsloth-run CLI extras, and where two llama-server processes could coexist during reload (#5401). The patch serialises load_model under a load-lock, broadens the already-loaded comparator across runtime settings, and inherits llama_extra_args with shadow-flag stripping; review iteration tightened the inheritance to handle local-GGUF identity, None vs [] semantics, and per-group strip toggles. Reason: All six accepted findings fixed; mergeback pushed cleanly; PR addresses three reproducible Studio reload bugs with no remaining real regressions |
…d-serialise-and-extras
Re-narrow llama_extra_args to None after validate_extra_args when the incoming request omitted the field, so the backend can distinguish "caller omitted, inherit prior load" from "caller explicitly cleared to []". Without this a queued duplicate /load reaches the backend as [] and fails _already_in_target_state's exact-equality check, killing the just-started llama-server. The pass-through validate call from the original "forward llama-server args from unsloth studio run / unsloth run" change is preserved as-is; only the post-pass narrowing is new. Cross-source loads now explicitly clear extras so a model switch can't accidentally inherit via the backend's "no opinion" semantics. Store the caller's hf_variant kwarg (None for local GGUF files) in _extra_args_source instead of the derived self._hf_variant (an extracted filename quant label like "Q4_K_M"). Same-source check in the route is now symmetric for HF and direct-file loads. Add gguf_path to _already_in_target_state and prefer on-disk path identity when both backend and caller have a path. This stops the duplicate-load guard from killing a healthy server on repeat local loads (where hf_variant is None on the caller side but extracted on the backend side). Split shadow-flag stripping into per-group toggles (context / cache / spec / template). The route now opts into stripping only the groups whose first-class field was actually set on the incoming request, so an inherited --chat-template-file survives an Apply that omits chat_template_override. _request_matches_loaded_settings detects shadowing extras on the inherit path and falls through to a real reload so the strip can run. Mark --spec-default, --jinja, --no-jinja as boolean inside the shadow stripper so the value-consuming heuristic no longer eats the following positional token.
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
21da1db to
cd14cae
Compare
|
Force-pushed The previous mergeback chain had picked up commits from a private staging fork rebase that ended up reverting legitimate main commits on the PR head:
The rewritten branch is a clean reset to
What the substance refinements add over
Diff against main is now 5 files / +1320 / -591: Verified:
|
codex-connector P2 on PR #5427 cd14cae: the inheritance gate at ``routes/inference.py:696`` compared the stored ``source[1]`` against ``request.gguf_variant``, but the HF branch loaded with ``hf_variant = config.gguf_variant`` (the *resolved* variant after ModelConfig auto-pick). When the caller omitted ``gguf_variant`` on a follow-up Apply, ``source[1] == "Q4_K_M"`` but ``(request.gguf_variant or "") == ""``, ``same_source`` returned False, and the chat-settings Apply silently dropped CLI pass-through flags for every auto-pick / local-file load. Fix both sides of the comparison to key off ``config.gguf_variant``: * The route compares ``source[1]`` to ``config.gguf_variant`` (the resolved label) rather than the request field. * The local-mode load_model call now passes ``hf_variant = config.gguf_variant`` so ``_extra_args_source`` stores the same string the route reads back. The HF branch already did this. Sandbox: added test_source_records_caller_variant_not_extracted_label to lock the storage key contract. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change.
|
Pushed The inheritance gate at Fix keys both sides off
Sandbox: added |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7d8a047dc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if backend_extra and strip_shadowing_flags(backend_extra) != backend_extra: | ||
| return False |
There was a problem hiding this comment.
Avoid forcing reload when no shadowed setting was requested
The already_loaded comparator now unconditionally rejects inherited llama_extra_args that contain any shadowing flag (-c, --cache-type-*, --spec-*, template flags) when llama_extra_args is omitted, but the reload path only strips those groups if the corresponding first-class field is present in model_fields_set (later at the inheritance step). For requests that omit those fields (e.g. minimal API /load calls with just model + variant), this creates a permanent mismatch loop: every call reloads, inherited args remain unchanged, and the next call reloads again. This regresses idempotent /load behavior and can repeatedly restart the server for no effective config change.
Useful? React with 👍 / 👎.
The validator's web-UI block named only ``--webui`` / ``--no-webui``, which is llama.cpp's pre-rename spelling. Current upstream (``tools/server/README.md``) uses ``--ui`` / ``--no-ui`` plus ``--ui-config``, ``--ui-config-file``, and ``--ui-mcp-proxy`` / ``--no-ui-mcp-proxy``. Without these in the denylist a user could ``unsloth run --ui`` and enable llama-server's built-in web UI on the port Studio's reverse proxy targets, breaking the UI surface. Keep the legacy ``--webui`` group so the validator still rejects old binaries that haven't been re-spelled. Cross-referenced against the README's full flag list; this was the only gap for the post-#5401 inheritance / shadow-strip work. Pass- through flags from every other README category (sampling, jinja, ctx, cache, threads, GPU, reasoning, grammar, chat-template-kwargs) already validate cleanly; sandbox suite exercises ~60 of them in the new ``test_08_llama_server_pass_through.py``. ``pytest studio/backend/tests --deselect test_studio_api.py``: 1100 passed, identical to pre-change.
|
Pushed While auditing the validator against llama.cpp's current The legacy Validation:
|
…load (unslothai#5693) * studio: settle GPU VRAM after killing llama-server before the next reload The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Michael Han <michaelhan2050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…load (unslothai#5693) * studio: settle GPU VRAM after killing llama-server before the next reload The NVIDIA driver reclaims a dead process's CUDA allocations asynchronously after the kernel reaps the PID -- typically tens to hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that window reads artificially low, which propagates into `_select_gpus` / `_fit_context_to_vram` and flips the layer-split toward `--fit on` with more CPU-offloaded layers than steady-state would have required. On a tight VRAM card the resulting mmap thrash + OOM matches the Apply-reload kill path that bare-shell launches with the same flags never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427). Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of `_get_gpu_free_memory` that returns as soon as two consecutive samples agree per-GPU within `max(256 MiB, 2% of larger sample)`, or `max_wait` (default 2 s) wall-clock elapses with probe time included in the bound. Records `_last_kill_monotonic` inside `_kill_process`'s `finally` block so the wait engages on both in-process `load_model -> _kill_process -> load` and the frontend chat-settings Apply path (`/unload` then `/load`). The call site runs OUTSIDE the broad `self._lock` so concurrent `/unload`, `/cancel`, `/status` are not blocked during the wait. Short-circuits at zero cost on cold start (no kill recorded), stale kill (older than 15 s, driver has already settled), CPU-only host (probe returns empty), and probe exceptions (nvidia-smi gone away). 11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover: cold-start zero cost, stale-kill skip, slow-probe deadline bound, GPU index-set change, per-GPU stability with one draining card, the 2 % adaptive tolerance, _kill_process timestamp recording on real kill vs no-op, and an `inspect.getsource` contract that pins the call site to outside the Phase 3 lock and uses `_last_kill_monotonic` so a future refactor can't silently regress any of these properties. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Michael Han <michaelhan2050@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Closes #5401.
Problem
Three related bugs in the GGUF reload path, all reproducible against
unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42and the chat-settings Apply button.Silent same-model settings drop.
POST /api/inference/load's already-loaded short-circuit (routes/inference.py:465-511onmain) compared onlymodel_identifierandhf_variant. A same-(model, variant) Apply that flippedcache_type_kv/speculative_type/chat_template_override/max_seq_length/llama_extra_argsreturnedstatus="already_loaded"and the new setting never reached llama-server. Verified:POST /api/inference/loadwithcache_type_kv=q8_0andmax_seq_length=8192returnedalready_loaded, while/api/inference/statusafterwards still reportedcache_type_kv=null context_length=40960and the original llama-server PID kept running with the original flags.unsloth runextra args lost on every reload. The chat-settings Apply path inuse-chat-model-runtime.ts:517-529does not round-tripllama_extra_args, so every/unload+/loadcycle dropped the CLI-supplied flags from the spawned llama-server command line. After Apply, the new server lacked--top-k 20 --seed 42.Two llama-server processes alive during the Phase 2 / Phase 3 window.
LlamaCppBackend.load_modelreleased_lockbetween Phase 1 (kill) and Phase 3 (spawn), so two concurrent loads each passed Phase 1 withself._process is None. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive_kill_process()added in Studio: kill in-flight llama-server before spawning a new one #5171 collapsed them to one survivor only after bothsubprocess.Popencalls had already landed. With the 0.6B repro a per-150 ms pgrep timeline showed two simultaneous llama-server PIDs for 3.3 s. For the 86 GB MoE in Apply button spawns new llama-server without killing previous one, causing OOM #5161 the same overlap is tens of seconds, which is what OOMs the host in the original [Bug] Model reload with 'unsloth run' doesn't unload current model causing memory spike and crash. #5401 report.The user's screenshot in #5401 (two stacked memory peaks, second annotated "loads same model again") matches (3) compounded with (1): the silent-settings short-circuit hides the first Apply, the user re-clicks, the second Apply triggers a real reload whose Phase 2 / Phase 3 window overlaps the in-flight first reload.
Fix
studio/backend/core/inference/llama_cpp.pyself._serial_load_lock = threading.Lock()is now held across the whole body ofload_model(Phase 1 kill + Phase 2 download + Phase 3 spawn + warm-up). Two concurrent/api/inference/loadrequests are now strictly sequential, so at most one llama-server process is alive at any moment. The fine-grained_lockand the Phase 3 defensive_kill_process()from Studio: kill in-flight llama-server before spawning a new one #5171 are kept as a second layer for any future caller that goes around the publicload_modelAPI./api/inference/unload,/api/inference/status, and/api/inference/load-progressare unaffected because they only touch the fine-grained_lockor read properties.self._extra_args: Optional[List[str]]plus anextra_argsproperty are written insideload_modelwhenever the caller supplies a non-Nonevalue.unload_model()deliberately does not reset it so the route layer can inherit the args across the frontend's/unload+/loadgap.studio/backend/routes/inference.py_request_matches_loaded_settings(request, llama_backend)comparesmax_seq_length,cache_type_kv,speculative_type,chat_template_override, andllama_extra_argsbetween the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returningalready_loaded. A missingllama_extra_argson the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back.llama_extra_argsfromllama_backend.extra_argswhen the request omits the field, re-validates throughvalidate_extra_args, and forwards the result toload_model(...). An explicit[]from the caller is still honoured as "clear".The bulk of the diff is the re-indent that hosts
load_model's existing body insidewith self._serial_load_lock:. The body was left insideLlamaCppBackend.load_model(not split into a helper) sotests/test_llama_cpp_no_context_shift.py'sinspect.getsource(LlamaCppBackend.load_model)contract still passes.Verified
End-to-end against a live
unsloth studio runon port 8889 with a Qwen3-0.6B-GGUFQ4_K_Mmodel and--top-k 20 --seed 42forwarded:/loadsame (model, variant, settings)already_loaded/loadsame model, variant, newcache_type_kv=q8_0ctx=8192already_loaded,/statusstill reports old settingsloaded,/statusreportscache_type_kv=q8_0 context_length=8192, new server runs-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42/unload+/load, new settings, nollama_extra_argsfield--top-k 20 --seed 42--top-k 20 --seed 42/unload+ two parallel/load/loadwithllama_extra_args=[](explicit clear)loaded, new server has no--top-k/--seed/loadwithllama_extra_args=["--top-k","30","--seed","7"](explicit override)loaded, new server has the supplied flagspytest studio/backend/testsis green except for one pre-existing terminal-width-sensitive assertion (test_studio_api.py::test_help_output) and the pre-existingtest_studio_api.pyfixture errors that fail on unmodifiedmaintoo. No new regressions.Test plan
pytest studio/backend/tests/test_llama_cpp_no_context_shift.py studio/backend/tests/test_llama_cpp_context_fit.py studio/backend/tests/test_llama_server_args.py(99 passed).pgrep llama-serverthat the simultaneous count never exceeds 1 across/unloadfollowed by two parallel/loadcalls.unsloth run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42, then changed KV cache + context via/api/inference/load, confirmed the new server has the new settings and still carries--top-k 20 --seed 42.