Skip to content

Studio: serialise GGUF reload and inherit unsloth-run extra args#5427

Merged
danielhanchen merged 37 commits into
mainfrom
fix-5401-studio-reload-serialise-and-extras
May 17, 2026
Merged

Studio: serialise GGUF reload and inherit unsloth-run extra args#5427
danielhanchen merged 37 commits into
mainfrom
fix-5401-studio-reload-serialise-and-extras

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Closes #5401.

Problem

Three related bugs in the GGUF reload path, all reproducible against unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42 and the chat-settings Apply button.

  1. Silent same-model settings drop. POST /api/inference/load's already-loaded short-circuit (routes/inference.py:465-511 on main) compared only model_identifier and hf_variant. A same-(model, variant) Apply that flipped cache_type_kv / speculative_type / chat_template_override / max_seq_length / llama_extra_args returned status="already_loaded" and the new setting never reached llama-server. Verified: POST /api/inference/load with cache_type_kv=q8_0 and max_seq_length=8192 returned already_loaded, while /api/inference/status afterwards still reported cache_type_kv=null context_length=40960 and the original llama-server PID kept running with the original flags.

  2. unsloth run extra args lost on every reload. The chat-settings Apply path in use-chat-model-runtime.ts:517-529 does not round-trip llama_extra_args, so every /unload + /load cycle dropped the CLI-supplied flags from the spawned llama-server command line. After Apply, the new server lacked --top-k 20 --seed 42.

  3. Two llama-server processes alive during the Phase 2 / Phase 3 window. LlamaCppBackend.load_model released _lock between Phase 1 (kill) and Phase 3 (spawn), so two concurrent loads each passed Phase 1 with self._process is None. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive _kill_process() added in Studio: kill in-flight llama-server before spawning a new one #5171 collapsed them to one survivor only after both subprocess.Popen calls had already landed. With the 0.6B repro a per-150 ms pgrep timeline showed two simultaneous llama-server PIDs for 3.3 s. For the 86 GB MoE in Apply button spawns new llama-server without killing previous one, causing OOM #5161 the same overlap is tens of seconds, which is what OOMs the host in the original [Bug] Model reload with 'unsloth run' doesn't unload current model causing memory spike and crash. #5401 report.

The user's screenshot in #5401 (two stacked memory peaks, second annotated "loads same model again") matches (3) compounded with (1): the silent-settings short-circuit hides the first Apply, the user re-clicks, the second Apply triggers a real reload whose Phase 2 / Phase 3 window overlaps the in-flight first reload.

Fix

studio/backend/core/inference/llama_cpp.py

  • self._serial_load_lock = threading.Lock() is now held across the whole body of load_model (Phase 1 kill + Phase 2 download + Phase 3 spawn + warm-up). Two concurrent /api/inference/load requests are now strictly sequential, so at most one llama-server process is alive at any moment. The fine-grained _lock and the Phase 3 defensive _kill_process() from Studio: kill in-flight llama-server before spawning a new one #5171 are kept as a second layer for any future caller that goes around the public load_model API. /api/inference/unload, /api/inference/status, and /api/inference/load-progress are unaffected because they only touch the fine-grained _lock or read properties.
  • self._extra_args: Optional[List[str]] plus an extra_args property are written inside load_model whenever the caller supplies a non-None value. unload_model() deliberately does not reset it so the route layer can inherit the args across the frontend's /unload + /load gap.

studio/backend/routes/inference.py

  • New _request_matches_loaded_settings(request, llama_backend) compares max_seq_length, cache_type_kv, speculative_type, chat_template_override, and llama_extra_args between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning already_loaded. A missing llama_extra_args on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back.
  • GGUF load branch inherits llama_extra_args from llama_backend.extra_args when the request omits the field, re-validates through validate_extra_args, and forwards the result to load_model(...). An explicit [] from the caller is still honoured as "clear".

The bulk of the diff is the re-indent that hosts load_model's existing body inside with self._serial_load_lock:. The body was left inside LlamaCppBackend.load_model (not split into a helper) so tests/test_llama_cpp_no_context_shift.py's inspect.getsource(LlamaCppBackend.load_model) contract still passes.

Verified

End-to-end against a live unsloth studio run on port 8889 with a Qwen3-0.6B-GGUF Q4_K_M model and --top-k 20 --seed 42 forwarded:

Scenario Before After
/load same (model, variant, settings) 1 PID, already_loaded Unchanged
/load same model, variant, new cache_type_kv=q8_0 ctx=8192 already_loaded, /status still reports old settings loaded, /status reports cache_type_kv=q8_0 context_length=8192, new server runs -c 8192 --cache-type-k q8_0 --top-k 20 --seed 42
Frontend Apply /unload + /load, new settings, no llama_extra_args field New server drops --top-k 20 --seed 42 Preserves --top-k 20 --seed 42
/unload + two parallel /load Two llama-server PIDs for 3.3 s Max simultaneous count = 1 across the full pgrep timeline
/load with llama_extra_args=[] (explicit clear) n/a loaded, new server has no --top-k / --seed
/load with llama_extra_args=["--top-k","30","--seed","7"] (explicit override) n/a loaded, new server has the supplied flags

pytest studio/backend/tests is green except for one pre-existing terminal-width-sensitive assertion (test_studio_api.py::test_help_output) and the pre-existing test_studio_api.py fixture errors that fail on unmodified main too. No new regressions.

Test plan

  • pytest studio/backend/tests/test_llama_cpp_no_context_shift.py studio/backend/tests/test_llama_cpp_context_fit.py studio/backend/tests/test_llama_server_args.py (99 passed).
  • Confirmed via per-150 ms pgrep llama-server that the simultaneous count never exceeds 1 across /unload followed by two parallel /load calls.
  • Verified unsloth run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42, then changed KV cache + context via /api/inference/load, confirmed the new server has the new settings and still carries --top-k 20 --seed 42.
  • Manual Apply-button click in chat settings against a multi-GB GGUF (reviewer's environment).
  • Confirm the chat UI does not regress on a same-model Apply for a setting that did not actually change (should remain a no-op via the broadened short-circuit).

Closes #5401.

Three related GGUF reload bugs reproduced against `unsloth studio run -m unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42`:

1. The `POST /api/inference/load` already-loaded short-circuit only compared `model_identifier` and `hf_variant`. A same-(model, variant) Apply that flipped `cache_type_kv` / `speculative_type` / `chat_template_override` / `max_seq_length` / `llama_extra_args` returned `status="already_loaded"` and the new setting silently never reached llama-server.

2. The frontend chat-settings Apply path POSTs `/unload` then `/load` without round-tripping `llama_extra_args`. Every reload after `unsloth run --some-flag X` quietly dropped `--some-flag X` from the spawned `llama-server` command line.

3. `LlamaCppBackend.load_model` released `_lock` between Phase 1 (kill) and Phase 3 (spawn) so two concurrent loads each passed Phase 1 with `self._process is None`. Both ran Phase 2 (download), both reached Phase 3, and the Phase 3 defensive `_kill_process()` from #5171 collapsed them to one survivor only after both `subprocess.Popen` calls had landed. For the 86 GB MoE in #5161 / the model in #5401 the overlap window was tens of seconds, long enough to OOM the host. With a 0.6B model the pgrep timeline showed two simultaneous PIDs for 3.3 s on `main`.

Fix:

`studio/backend/core/inference/llama_cpp.py`

* Add `self._serial_load_lock = threading.Lock()`. The whole body of `load_model` runs under this lock so two concurrent `/api/inference/load` requests are strictly sequential. The fine-grained `_lock` and the Phase 3 defensive `_kill_process()` from #5171 are kept as a second layer. `/unload`, `/status`, and `/load-progress` are unaffected because they only touch the fine-grained lock or read properties.
* Add `self._extra_args` plus an `extra_args` property, written inside `load_model` whenever the caller supplies a non-`None` value. `unload_model()` deliberately does not reset it so the route layer can inherit the args across the frontend's `/unload` + `/load` gap.

`studio/backend/routes/inference.py`

* Add `_request_matches_loaded_settings(request, llama_backend)` that compares `max_seq_length`, `cache_type_kv`, `speculative_type`, `chat_template_override`, and `llama_extra_args` between the incoming request and the live backend. Same-(model, variant) requests whose runtime settings differ now fall through to a real reload instead of returning `already_loaded`. A missing `llama_extra_args` field on the request is treated as "inherit current", so the short-circuit still fires when the only difference is the frontend not echoing the CLI flags back.
* GGUF load branch inherits `llama_extra_args` from `llama_backend.extra_args` when the request omits the field, re-validates through `validate_extra_args`, and forwards the result to `load_model(...)`. An explicit `[]` from the caller is still honoured as "clear".

Verified end to end against a live `unsloth studio run` instance:

| Scenario                                                        | Before    | After                                                                    |
| --------------------------------------------------------------- | --------- | ------------------------------------------------------------------------ |
| `/load` same (model, variant, settings)                         | 1 PID, `already_loaded` | unchanged                                                                |
| `/load` same model, variant, new `cache_type_kv=q8_0` ctx=8192  | `already_loaded`, settings dropped | `loaded`, `/status` reports the new settings, new server has `-c 8192 --cache-type-k q8_0 --top-k 20 --seed 42` |
| Frontend Apply `/unload` + `/load`, new settings, no `llama_extra_args` field | Drops `--top-k 20 --seed 42` | Preserves `--top-k 20 --seed 42`                                          |
| `/unload` + two parallel `/load`                                | Two PIDs for 3.3 s | Max simultaneous count = 1 across the full pgrep timeline                  |
| `/load` with `llama_extra_args=[]` (explicit clear)             | n/a       | `loaded`, new server has no `--top-k` / `--seed`                          |
| `/load` with `llama_extra_args=["--top-k","30","--seed","7"]` (override) | n/a       | `loaded`, new server has the supplied flags                                |

`pytest studio/backend/tests` is green except for one pre-existing terminal-width-sensitive assertion (`test_studio_api.py::test_help_output`) and the pre-existing `test_studio_api.py` fixture errors that fail on unmodified main too. No new regressions.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9cbec3b60

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if extra_args is not None:
self._extra_args = list(extra_args)

self._cancel_event.clear()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve unload cancellation for queued loads

When two /api/inference/load calls are serialized here and /api/inference/unload is posted while the first load is downloading, unload_model() only sets _cancel_event; after the first load exits, the second load acquires _serial_load_lock and immediately clears that cancellation before starting. In that concurrent-load + unload scenario the unload request does not cancel the queued in-flight load, so a llama-server can be spawned after the user has explicitly unloaded/cancelled loading.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a serial lock for model loading in llama_cpp.py to prevent concurrent loads from causing OOM and implements a more robust 'already loaded' check in the inference routes by comparing all runtime settings. It also allows inheriting extra arguments across reloads. Feedback focuses on improving the context length comparison by tracking the originally requested n_ctx value, which ensures that switching between 'Auto' and explicit lengths correctly triggers a reload.

# without the llama_extra_args field, since the chat UI does not
# round-trip those flags) can inherit the values the CLI / first
# load supplied. See issue #5401.
self._extra_args: Optional[List[str]] = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To accurately detect when a user wants to change settings during a reload, we should track the original n_ctx value requested. This allows the route layer to distinguish between a request for the model's native length (0) and an explicit length that happens to match the current effective (possibly capped) context length.

Suggested change
self._extra_args: Optional[List[str]] = None
self._extra_args: Optional[List[str]] = None
self._requested_n_ctx: int = 0

them. The route layer uses this to inherit ``unsloth run``-style
flags across frontend Apply reloads — see issue #5401.
"""
return list(self._extra_args) if self._extra_args is not None else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Expose the requested context length as a property to allow the route layer to perform accurate reload checks.

Suggested change
return list(self._extra_args) if self._extra_args is not None else None
return list(self._extra_args) if self._extra_args is not None else None
@property
def requested_n_ctx(self) -> int:
"""The original n_ctx value from the last load_model() call."""
return self._requested_n_ctx

Comment on lines +2034 to +2035
if extra_args is not None:
self._extra_args = list(extra_args)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Store the requested n_ctx at the start of the serialized load block.

Suggested change
if extra_args is not None:
self._extra_args = list(extra_args)
if extra_args is not None:
self._extra_args = list(extra_args)
self._requested_n_ctx = n_ctx

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +428 to +430
backend_ctx = llama_backend.context_length or 0
if request.max_seq_length and request.max_seq_length != backend_ctx:
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current check for max_seq_length has a bug: if request.max_seq_length is 0 (Auto), the if block is skipped entirely. This means if a model is currently running at an explicit length (e.g., 8192) and the user switches back to "Auto", the short-circuit will incorrectly return already_loaded and the server will remain at 8192 instead of resetting to the native length.

Comparing against the tracked requested_n_ctx (suggested in llama_cpp.py) fixes this by ensuring the intent of the request matches the intent of the current load.

Suggested change
backend_ctx = llama_backend.context_length or 0
if request.max_seq_length and request.max_seq_length != backend_ctx:
return False
if request.max_seq_length != llama_backend.requested_n_ctx:
return False

Review feedback on PR #5427 from gemini-code-assist.

The original short-circuit compared ``request.max_seq_length`` against
``llama_backend.context_length`` (the effective context). VRAM-fit
logic can cap the running server below what the caller asked for, so
this comparison incorrectly returns ``already_loaded`` when the user
flips the slider from an explicit length (e.g. 8192) back to "Auto"
(0): the explicit request was capped to, say, 4096, and the new "Auto"
request reads ``backend.context_length == 4096`` and decides nothing
changed.

Track the originally requested ``n_ctx`` on the backend instead and
compare against that. ``requested_n_ctx == 0`` means the last load
asked for the model's native length; ``request.max_seq_length == 0``
matches it.

Verified in the sandbox suite (now 90 tests):

- ``test_explicit_to_auto_triggers_reload`` -- loaded with explicit
  8192, then Apply with ``max_seq_length=0`` falls through to a real
  reload and the new server runs at the native 40960.
- ``test_auto_to_explicit_triggers_reload`` -- inverse direction.
- ``test_explicit_to_same_explicit_short_circuits`` -- re-Apply with
  the same explicit value still short-circuits (no needless reload).
- Existing scenarios (kv change, spec change, template change, extra
  args inherit, parallel-load stress, frontend Apply flow) unchanged.

``pytest studio/backend/tests`` still green on the same set of tests;
the pre-existing ``test_help_output`` failure and ``test_studio_api``
fixture errors are unaffected.
@danielhanchen

Copy link
Copy Markdown
Member Author

Thanks both. Pushed dd0b1d5 to address the gemini-code-assist feedback.

gemini: requested_n_ctx tracking (addressed)

Real correctness bug. The original comparator did

backend_ctx = llama_backend.context_length or 0
if request.max_seq_length and request.max_seq_length != backend_ctx:
    return False

which compares against the effective context (post VRAM-fit cap) and silently skips when request.max_seq_length == 0. A user who loaded with explicit 8192 (capped to 4096) and then flips back to "Auto" via the slider would have their Apply silently swallowed.

Fix tracks the originally requested n_ctx on LlamaCppBackend and compares against it:

  • LlamaCppBackend._requested_n_ctx: int = 0 field plus requested_n_ctx property.
  • load_model stores self._requested_n_ctx = int(n_ctx) in the same bookkeeping block as _extra_args.
  • Route comparator becomes if request.max_seq_length != llama_backend.requested_n_ctx: return False -- no if ... and guard, so a slider flip in either direction (explicit-to-Auto or Auto-to-explicit) falls through.

Sandbox added three new live-integration tests against unsloth studio run --model unsloth/Qwen3-0.6B-GGUF --gguf-variant Q4_K_M --top-k 20 --seed 42:

  • test_explicit_to_auto_triggers_reload -- load with max_seq_length=8192, re-Apply with max_seq_length=0, confirm status="loaded" and pgrep llama-server shows -c 40960 (the model's native length).
  • test_auto_to_explicit_triggers_reload -- inverse direction.
  • test_explicit_to_same_explicit_short_circuits -- re-Apply with the same explicit value remains a no-op.

Plus four new unit tests on LlamaCppBackend.requested_n_ctx (starts at 0, records explicit, records Auto, overwritten per load). Sandbox total is now 90 / 90.

codex: unload cancellation for queued loads (P2, deferred)

Real but semantically subtle: when load A is mid-Phase-2 and load B is queued behind the serial lock, an /unload between them sets _cancel_event, load A honours it and exits, then load B acquires the lock and clears _cancel_event before starting. So /unload is effectively "lost" for load B.

The reason I am not addressing this in PR #5427 is that the obvious fix (have load B abort if it sees _cancel_event set at lock-acquire time) breaks the chat-settings Apply pattern, which is exactly /unload then /load. In the Apply case the unload must NOT cancel the subsequent load. Distinguishing "unload that should also cancel queued loads" from "unload that precedes a fresh load" needs request-generation tracking on LlamaCppBackend (each load remembers the unload generation it was submitted at; aborts if that generation is older than the current backend generation). That is a separate, larger change.

The original #5401 OOM (two llama-server processes coexisting for tens of seconds during a Phase 2 download) is fixed by this PR regardless of the cancel-propagation question. I will file a follow-up for the cancel-propagation semantics.

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Thanks for the detailed update and for implementing the requested_n_ctx tracking and the serial load lock. These changes directly address the correctness and resource management issues identified. Your explanation regarding the deferred unload cancellation for queued loads is clear, and I agree that handling it via request-generation tracking is the correct approach for a follow-up PR. The added tests look comprehensive and well-structured for verifying these fixes.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd0b1d58a7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2068 to +2070
# ── Phase 1: kill old process (under lock, fast) ──────────
with self._lock:
self._kill_process()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-check loaded state before killing queued loads

When two identical /api/inference/load requests overlap, both route-level already_loaded checks run before the first request has finished loading. The second request then waits on _serial_load_lock, enters here after the first server is healthy, and immediately calls _kill_process(), tearing down the just-loaded model and doing a redundant full reload; the first caller can receive status="loaded" while the server is already being killed by the queued duplicate. Re-check the current loaded model/settings after acquiring the serial lock, before killing the process.

Useful? React with 👍 / 👎.

Trim the verbose explanatory comments and docstrings introduced in
f9cbec3 and dd0b1d5 down to one-line summaries. The "why" still
points at issue #5401; the multi-paragraph rationale belonged in the
PR body, not the source. No behaviour change.
Resolves the conflict in `studio/backend/core/inference/llama_cpp.py`
where bbd0ba0 / 63c6750 added a `mmproj_matches_model_family` guard
to the mmproj attach block and this branch's f9cbec3 wrapped the body
of `load_model` in `self._serial_load_lock`, re-indenting the same
region by four spaces.

Kept both changes: the mmproj attach now runs through the family-check
guard inside the serial lock, and the `else: self._api_key = None`
fallback for non-direct-stream mode is preserved.

Verified post-merge:

- `pytest studio/backend/tests --deselect test_studio_api.py`:
  1056 passed, 46 skipped, 0 failed.
- Sandbox unit + live integration suite re-run: 90 passed.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56d598045d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +658 to +661
if request.llama_extra_args is None and llama_backend.extra_args:
inherited = list(llama_backend.extra_args)
try:
extra_llama_args = validate_extra_args(inherited)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't inherit CLI overrides when applying UI settings

When the previous load came from unsloth run with pass-through args that override a first-class setting (for example -c/--ctx-size, which the validator permits and llama_cpp appends after Studio's own flags), a later UI /load omits llama_extra_args and lands here. The route then reuses those old args while also passing the user's new max_seq_length/cache/spec settings, so the inherited CLI flag wins and the Apply change has no effect; meanwhile _requested_n_ctx is updated to the new request value, so subsequent already-loaded checks can incorrectly report a match against a server still running the old override.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3b88887763

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +658 to +661
if request.llama_extra_args is None and llama_backend.extra_args:
inherited = list(llama_backend.extra_args)
try:
extra_llama_args = validate_extra_args(inherited)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict inherited llama args to the same GGUF model

When a request omits llama_extra_args, this inherits the previous GGUF server's args without checking that they came from the same model being loaded. Since _extra_args is intentionally preserved across unloads, a user can load model A with model-specific pass-through flags such as an allowed --chat-template-file or RoPE option, switch/unload, and then load model B from the UI with the field omitted; model B will silently run with model A's flags, which can break startup or corrupt inference behavior. Consider only inheriting for the same model/variant or clearing the stored args on unrelated loads.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93c92aadbb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2011 to +2013
if extra_args is not None:
self._extra_args = list(extra_args)
self._requested_n_ctx = int(n_ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Persist extra args only after a successful load

When a GGUF startup fails after being invoked with non-empty extra_args, these lines have already saved those args in _extra_args; because the route later inherits llama_backend.extra_args whenever the UI omits llama_extra_args, a normal retry from the chat UI will keep reusing the failed/invalid llama-server flags even though no model was successfully loaded with them. Store the args only after the health check succeeds, or roll them back on failure, so a failed load does not poison subsequent omitted-args loads.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b25e04b8f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +428 to +430
req_spec = _normalise_settings_str(request.speculative_type)
backend_spec = _normalise_settings_str(llama_backend.speculative_type)
if (req_spec or "off") != (backend_spec or "off"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore speculative diffs for vision loads

When the loaded GGUF is vision/multimodal and the request carries a non-off speculative_type, this comparison forces a reload even though the backend ignores speculative decoding for vision models (load_model only applies it when not is_vision, so llama_backend.speculative_type stays None). In that scenario a no-op /api/inference/load for the same model/settings cannot take the new short-circuit and will unnecessarily tear down and reload the vision model; normalize requested speculation to off for vision before comparing.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

This PR appears to address open issue(s). The duplicate detector matched the following open issues with HIGH confidence:

  • unslothai/unsloth#5238@okaybeydanol — PR fixes Studio GGUF same-model settings reload path, including context changes no longer being ignored during Apply-triggered reloads.
  • unslothai/unsloth#4971@xyehya — PR makes same-model GGUF reload honor runtime settings including requested context instead of treating the model as already loaded.
  • unslothai/unsloth#4857@xyehya — PR preserves and filters llama-server extra args such as fit-target across same-model reloads and settings changes.

If this PR fixes any of them, consider adding closes #N / resolves #N to the description so the issue auto-closes on merge. If the match is wrong, ignore this comment.

@danielhanchen danielhanchen added the auto-reviewing Auto-review in progress label May 16, 2026

@danielhanchen danielhanchen left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Three backend bugs in Studio's llama.cpp reload path. The fixes are correct and well-reasoned. One issue in the new strip_shadowing_flags function needs to be addressed before merge; the rest are notes.


[Must fix] strip_shadowing_flags: boolean flags incorrectly treated as value-bearing

File: studio/backend/core/inference/llama_server_args.py

--spec-default, --jinja, and --no-jinja are in _SHADOWING_FLAGS and are boolean flags (they take no value argument). They are NOT in _DENYLIST — the module's own comment at line 36 explicitly lists --jinja and --spec-* as pass-through flags — so they can appear in a user's extra_args.

The parser's value-consumption branch:

if "=" not in tok and i + 1 < n and _flag_name(tokens[i + 1]) is None:
    i += 2

fires whenever the next token doesn't look like a flag. For these boolean flags, if a positional value happens to follow them, the parser advances by 2, silently discarding the next token.

Fix: recognise the boolean flags and skip only 1 token regardless of what follows:

_BOOLEAN_SHADOWING_FLAGS: frozenset[str] = frozenset({
    "--spec-default",
    "--jinja",
    "--no-jinja",
})

def strip_shadowing_flags(args: Iterable[str]) -> list[str]:
    tokens = [str(a) for a in (args or [])]
    out: list[str] = []
    i, n = 0, len(tokens)
    while i < n:
        tok = tokens[i]
        flag = _flag_name(tok)
        if flag is None or flag not in _SHADOWING_FLAGS:
            out.append(tok)
            i += 1
            continue
        # Packed flags (--flag=val) are self-contained. Boolean flags take
        # no value token. Only bare value-bearing flags consume the next token.
        if "=" in tok or flag in _BOOLEAN_SHADOWING_FLAGS:
            i += 1
        elif i + 1 < n and _flag_name(tokens[i + 1]) is None:
            i += 2
        else:
            i += 1
    return out

[Minor] _already_in_target_state / _request_matches_loaded_settings duplication

Files: llama_cpp.py, routes/inference.py

Both functions independently compare the full settings set for the same semantic purpose. The PR cross-references them in a comment which is good, but the two can silently diverge if one is updated without the other. A future cache_type_kv normalisation change in one that misses the other would re-introduce #5401.

A _settings_fingerprint() -> tuple helper on LlamaCppBackend that both callers delegate to would eliminate the duplication. Not blocking.


Notes (no action required)

max_seq_length default is 0, not None. LoadRequest.max_seq_length has int = Field(0, ...) so comparing it directly to requested_n_ctx (also initialised to 0) is correct for the Auto case. The plain != without int-cast is safe.

Backend fixes look correct:

  • _serial_load_lock serialises concurrent /load calls end-to-end; at most one llama-server process alive at any time.
  • _healthy = False in _kill_process closes the race window where a concurrent reader could see is_loaded = True against a dead PID during warm-up.
  • Extra-args inheritance (None = inherit, [] = clear, list = override) is clearly documented and gated on same-model source to prevent cross-model flag bleed.
  • _request_matches_loaded_settings correctly compares max_seq_length against requested_n_ctx (the requested value, not the VRAM-capped effective value) — right comparison for detecting slider flips.

@danielhanchen danielhanchen added auto-approved Auto-review approved the PR and removed auto-reviewing Auto-review in progress labels May 16, 2026
@danielhanchen

Copy link
Copy Markdown
Member Author

Auto-review verdict: Approved

Fixes a Studio GGUF reload regression where same-(model, variant) Apply silently dropped settings, where chat-settings Apply lost unsloth-run CLI extras, and where two llama-server processes could coexist during reload (#5401). The patch serialises load_model under a load-lock, broadens the already-loaded comparator across runtime settings, and inherits llama_extra_args with shadow-flag stripping; review iteration tightened the inheritance to handle local-GGUF identity, None vs [] semantics, and per-group strip toggles.

Reason: All six accepted findings fixed; mergeback pushed cleanly; PR addresses three reproducible Studio reload bugs with no remaining real regressions

danielhanchen and others added 8 commits May 16, 2026 14:09
Re-narrow llama_extra_args to None after validate_extra_args when the
incoming request omitted the field, so the backend can distinguish
"caller omitted, inherit prior load" from "caller explicitly cleared
to []". Without this a queued duplicate /load reaches the backend as
[] and fails _already_in_target_state's exact-equality check, killing
the just-started llama-server. The pass-through validate call from
the original "forward llama-server args from unsloth studio run /
unsloth run" change is preserved as-is; only the post-pass narrowing
is new. Cross-source loads now explicitly clear extras so a model
switch can't accidentally inherit via the backend's "no opinion"
semantics.

Store the caller's hf_variant kwarg (None for local GGUF files) in
_extra_args_source instead of the derived self._hf_variant
(an extracted filename quant label like "Q4_K_M"). Same-source check
in the route is now symmetric for HF and direct-file loads.

Add gguf_path to _already_in_target_state and prefer on-disk path
identity when both backend and caller have a path. This stops the
duplicate-load guard from killing a healthy server on repeat local
loads (where hf_variant is None on the caller side but extracted on
the backend side).

Split shadow-flag stripping into per-group toggles (context / cache /
spec / template). The route now opts into stripping only the groups
whose first-class field was actually set on the incoming request, so
an inherited --chat-template-file survives an Apply that omits
chat_template_override. _request_matches_loaded_settings detects
shadowing extras on the inherit path and falls through to a real
reload so the strip can run.

Mark --spec-default, --jinja, --no-jinja as boolean inside the
shadow stripper so the value-consuming heuristic no longer eats the
following positional token.
@danielhanchen danielhanchen force-pushed the fix-5401-studio-reload-serialise-and-extras branch from 21da1db to cd14cae Compare May 16, 2026 14:23
@danielhanchen

Copy link
Copy Markdown
Member Author

Force-pushed cd14cae1 (from 21da1db1).

The previous mergeback chain had picked up commits from a private staging fork rebase that ended up reverting legitimate main commits on the PR head:

  • Scrub .github/workflows for staging push (matches staging base) deleted 22 workflow files and 11393 lines.
  • Sync .github/workflows with upstream author branch restored most of them.
  • Drop cross-main pickup unrelated to inheritance fix removed scripts/check_frontend_dep_removal.py and tests/studio/test_frontend_dep_removal.py (PR ci: deterministic check for studio/frontend dep removals #5478) and the matching block in .github/workflows/studio-frontend-ci.yml. Those files belong to main, not to the staging fork's base, so the diff against upstream main had ballooned to 11 files / +1757 / -3441 with PR ci: deterministic check for studio/frontend dep removals #5478 spuriously deleted.

The rewritten branch is a clean reset to b427a9d8 (the six-P2-followup commit) + a fresh git merge origin/main + cherry-picks of just the orchestrator's substance refinements:

  • 781c13b3 Studio: tighten GGUF reload inheritance and duplicate-load guard
  • cde7baa1 Studio: trim comments around GGUF reload inheritance
  • 9d763fa9 Studio: cover GGUF reload inheritance and shadow-flag stripping (test_gguf_reload_inheritance.py, test_llama_server_args.py)
  • f67514d1, 0e3a5f17 Studio: drop redundant issue refs from inheritance comments (split per file)
  • 3ce5cefc, cd14cae1 pre-commit auto fixes

What the substance refinements add over b427a9d8:

  • _already_in_target_state now compares gguf_path directly for local-file loads (the backend stores an extracted variant label while the caller passes hf_variant=None, which would otherwise fail the equality).
  • _extra_args_source records the caller's hf_variant rather than self._hf_variant so the route's same_source check stays symmetric across HF and direct-file loads.
  • _SHADOWING_FLAGS is split into _CONTEXT_FLAGS / _CACHE_FLAGS / _SPEC_FLAGS / _TEMPLATE_FLAGS plus a _BOOLEAN_SHADOWING_FLAGS set, and strip_shadowing_flags takes per-group strip_* kwargs.
  • The route uses request.model_fields_set to selectively strip only the groups whose first-class field the caller actually supplied, so an inherited --chat-template-file survives an Apply that omits chat_template_override.
  • The route narrows validate_extra_args(None) == [] back to None post-validation so the inheritance path can distinguish "caller omitted" from "caller explicit []".
  • Cross-model branch sets extra_llama_args = [] explicitly to avoid load_model's "no opinion" semantics.
  • The route comparator refuses to short-circuit on the inherit path if stored extras contain shadow flags (those would be stripped on the reload path, so the loaded server is not equivalent to what a fresh Apply with these first-class fields would produce).

Diff against main is now 5 files / +1320 / -591:

 studio/backend/core/inference/llama_cpp.py         | 1337 +++++++++++---------
 studio/backend/core/inference/llama_server_args.py |   92 ++
 studio/backend/routes/inference.py                 |  127 +-
 studio/backend/tests/test_gguf_reload_inheritance.py |  237 +++
 studio/backend/tests/test_llama_server_args.py     |  118 ++

Verified:

  • pytest studio/backend/tests --deselect test_studio_api.py: 1100 passed, 46 skipped (was 1079 before the refinements; +21 from the two new test files).
  • Sandbox suite under ./temp/sim_5401/: 136 passed against the live unsloth studio run.
  • Ruff clean on the five touched files.

codex-connector P2 on PR #5427 cd14cae: the inheritance gate at
``routes/inference.py:696`` compared the stored ``source[1]`` against
``request.gguf_variant``, but the HF branch loaded with
``hf_variant = config.gguf_variant`` (the *resolved* variant after
ModelConfig auto-pick). When the caller omitted ``gguf_variant`` on a
follow-up Apply, ``source[1] == "Q4_K_M"`` but
``(request.gguf_variant or "") == ""``, ``same_source`` returned False,
and the chat-settings Apply silently dropped CLI pass-through flags
for every auto-pick / local-file load.

Fix both sides of the comparison to key off ``config.gguf_variant``:

* The route compares ``source[1]`` to ``config.gguf_variant`` (the
  resolved label) rather than the request field.
* The local-mode load_model call now passes
  ``hf_variant = config.gguf_variant`` so ``_extra_args_source``
  stores the same string the route reads back. The HF branch already
  did this.

Sandbox: added test_source_records_caller_variant_not_extracted_label
to lock the storage key contract.
``pytest studio/backend/tests --deselect test_studio_api.py``:
1100 passed, identical to pre-change.
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 74b45c0b to address chatgpt-codex-connector's P2 on cd14cae1 (Match inherited args source against resolved GGUF variant).

The inheritance gate at routes/inference.py:696 compared the stored source[1] against request.gguf_variant. The HF branch loads with hf_variant = config.gguf_variant (the resolved variant after ModelConfig's auto-pick). So on a follow-up Apply that omitted gguf_variant, source[1] == "Q4_K_M" but (request.gguf_variant or "") == ""same_source returned False and every Apply for an HF auto-pick or local-file load silently dropped the CLI flags.

Fix keys both sides off config.gguf_variant:

  • Route compares source[1] against config.gguf_variant (the resolved label).
  • Local-mode load_model call now also passes hf_variant = config.gguf_variant so storage and lookup use the same string.

Sandbox: added test_source_records_caller_variant_not_extracted_label. Upstream pytest studio/backend/tests --deselect test_studio_api.py: 1100 passed, identical to pre-change.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7d8a047dc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +458 to +459
if backend_extra and strip_shadowing_flags(backend_extra) != backend_extra:
return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid forcing reload when no shadowed setting was requested

The already_loaded comparator now unconditionally rejects inherited llama_extra_args that contain any shadowing flag (-c, --cache-type-*, --spec-*, template flags) when llama_extra_args is omitted, but the reload path only strips those groups if the corresponding first-class field is present in model_fields_set (later at the inheritance step). For requests that omit those fields (e.g. minimal API /load calls with just model + variant), this creates a permanent mismatch loop: every call reloads, inherited args remain unchanged, and the next call reloads again. This regresses idempotent /load behavior and can repeatedly restart the server for no effective config change.

Useful? React with 👍 / 👎.

The validator's web-UI block named only ``--webui`` / ``--no-webui``,
which is llama.cpp's pre-rename spelling. Current upstream
(``tools/server/README.md``) uses ``--ui`` / ``--no-ui`` plus
``--ui-config``, ``--ui-config-file``, and ``--ui-mcp-proxy`` /
``--no-ui-mcp-proxy``. Without these in the denylist a user could
``unsloth run --ui`` and enable llama-server's built-in web UI on
the port Studio's reverse proxy targets, breaking the UI surface.

Keep the legacy ``--webui`` group so the validator still rejects
old binaries that haven't been re-spelled.

Cross-referenced against the README's full flag list; this was the
only gap for the post-#5401 inheritance / shadow-strip work. Pass-
through flags from every other README category (sampling, jinja,
ctx, cache, threads, GPU, reasoning, grammar, chat-template-kwargs)
already validate cleanly; sandbox suite exercises ~60 of them in
the new ``test_08_llama_server_pass_through.py``.

``pytest studio/backend/tests --deselect test_studio_api.py``:
1100 passed, identical to pre-change.
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 0da6e6eb: extend the llama_server_args.py denylist with the upstream --ui family.

While auditing the validator against llama.cpp's current tools/server/README.md for the post-#5401 work, I noticed the web-UI block named only --webui / --no-webui. Current upstream renamed these to --ui / --no-ui and added --ui-config, --ui-config-file, --ui-mcp-proxy / --no-ui-mcp-proxy. Without the new names in the denylist a user could unsloth studio run --ui and enable llama-server's built-in web UI on Studio's reverse-proxy port -- breaking the UI surface clients see.

The legacy --webui group is kept so the validator still rejects pre-rename binaries.

Validation:

  • New sandbox file test_08_llama_server_pass_through.py cross-references every README category. 161 cases pass: ~60 denied (incl. all the new --ui variants) and ~60 pass-through (sampling, jinja, ctx, cache, threads, GPU, reasoning, grammar, --chat-template-kwargs '{"enable_thinking": false}').
  • Live: a launched unsloth studio run -m unsloth/Qwen3-0.6B-GGUF:Q4_K_M --chat-template-kwargs '{"enable_thinking": false}' was confirmed to place the user's JSON value AFTER Studio's auto-set one on the llama-server cmdline (last-wins parse honours the user).
  • pytest studio/backend/tests --deselect test_studio_api.py: 1100 passed, identical to pre-change.
  • Ruff: clean.

@danielhanchen danielhanchen merged commit ab56e4a into main May 17, 2026
32 checks passed
@danielhanchen danielhanchen deleted the fix-5401-studio-reload-serialise-and-extras branch May 17, 2026 11:16
Imagineer99 pushed a commit to Imagineer99/unsloth that referenced this pull request May 22, 2026
…load (unslothai#5693)

* studio: settle GPU VRAM after killing llama-server before the next reload

The NVIDIA driver reclaims a dead process's CUDA allocations
asynchronously after the kernel reaps the PID -- typically tens to
hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that
window reads artificially low, which propagates into `_select_gpus`
/ `_fit_context_to_vram` and flips the layer-split toward `--fit on`
with more CPU-offloaded layers than steady-state would have required.
On a tight VRAM card the resulting mmap thrash + OOM matches the
Apply-reload kill path that bare-shell launches with the same flags
never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427).

Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of
`_get_gpu_free_memory` that returns as soon as two consecutive
samples agree per-GPU within `max(256 MiB, 2% of larger sample)`,
or `max_wait` (default 2 s) wall-clock elapses with probe time
included in the bound. Records `_last_kill_monotonic` inside
`_kill_process`'s `finally` block so the wait engages on both
in-process `load_model -> _kill_process -> load` and the frontend
chat-settings Apply path (`/unload` then `/load`). The call site
runs OUTSIDE the broad `self._lock` so concurrent `/unload`,
`/cancel`, `/status` are not blocked during the wait.

Short-circuits at zero cost on cold start (no kill recorded), stale
kill (older than 15 s, driver has already settled), CPU-only host
(probe returns empty), and probe exceptions (nvidia-smi gone away).

11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover:
cold-start zero cost, stale-kill skip, slow-probe deadline bound,
GPU index-set change, per-GPU stability with one draining card, the
2 % adaptive tolerance, _kill_process timestamp recording on real
kill vs no-op, and an `inspect.getsource` contract that pins the
call site to outside the Phase 3 lock and uses `_last_kill_monotonic`
so a future refactor can't silently regress any of these properties.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Michael Han <michaelhan2050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…load (unslothai#5693)

* studio: settle GPU VRAM after killing llama-server before the next reload

The NVIDIA driver reclaims a dead process's CUDA allocations
asynchronously after the kernel reaps the PID -- typically tens to
hundreds of milliseconds. Sampling `_get_gpu_free_memory` in that
window reads artificially low, which propagates into `_select_gpus`
/ `_fit_context_to_vram` and flips the layer-split toward `--fit on`
with more CPU-offloaded layers than steady-state would have required.
On a tight VRAM card the resulting mmap thrash + OOM matches the
Apply-reload kill path that bare-shell launches with the same flags
never hit (continues the lineage of unslothai#5161 / unslothai#5401 / unslothai#5427).

Adds `LlamaCppBackend._wait_for_vram_settle`: bounded poll of
`_get_gpu_free_memory` that returns as soon as two consecutive
samples agree per-GPU within `max(256 MiB, 2% of larger sample)`,
or `max_wait` (default 2 s) wall-clock elapses with probe time
included in the bound. Records `_last_kill_monotonic` inside
`_kill_process`'s `finally` block so the wait engages on both
in-process `load_model -> _kill_process -> load` and the frontend
chat-settings Apply path (`/unload` then `/load`). The call site
runs OUTSIDE the broad `self._lock` so concurrent `/unload`,
`/cancel`, `/status` are not blocked during the wait.

Short-circuits at zero cost on cold start (no kill recorded), stale
kill (older than 15 s, driver has already settled), CPU-only host
(probe returns empty), and probe exceptions (nvidia-smi gone away).

11 new unit tests in `test_llama_cpp_wait_for_vram_settle.py` cover:
cold-start zero cost, stale-kill skip, slow-probe deadline bound,
GPU index-set change, per-GPU stability with one draining card, the
2 % adaptive tolerance, _kill_process timestamp recording on real
kill vs no-op, and an `inspect.getsource` contract that pins the
call site to outside the Phase 3 lock and uses `_last_kill_monotonic`
so a future refactor can't silently regress any of these properties.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Michael Han <michaelhan2050@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-addresses-issue Pre-flight: appears to address an open issue auto-approved Auto-review approved the PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Model reload with 'unsloth run' doesn't unload current model causing memory spike and crash.

1 participant