Skip to content

Fix local GGUF settings reload falling through to transformers path#5332

Closed
BlackBox-Labs wants to merge 7 commits into
unslothai:mainfrom
BlackBox-Labs:fix/local-gguf-reload-settings
Closed

Fix local GGUF settings reload falling through to transformers path#5332
BlackBox-Labs wants to merge 7 commits into
unslothai:mainfrom
BlackBox-Labs:fix/local-gguf-reload-settings

Conversation

@BlackBox-Labs

Copy link
Copy Markdown

Fixes #5238

Summary

When changing inference settings (context length, KV cache dtype, etc.) on a local GGUF model and clicking Apply, the reload would fall through to the HuggingFace/transformers path instead of the llama-server GGUF path, causing a "No config file found" error.

Root Cause

The already-loaded check at routes/inference.py:465 only handles GGUF models with a gguf_variant (HF repos). Local GGUF files loaded by file path have no gguf_variant, so they skipped the GGUF check, entered the else block, and since the orchestrator had no active transformers model, fell through to the transformers path.

Fix

Added a local GGUF already-loaded handler that detects when a local GGUF is already loaded via llama-server and kills the existing process before the GGUF reload path runs, ensuring the reload uses the correct backend with updated settings.

Testing

  • Load a local GGUF model → works
  • Change inference settings & Apply → model reloads with new settings, no error
  • Load a remote (HF repo) GGUF → unchanged path (already worked)
  • Load a transformers model → unchanged path

When changing inference settings (context length, KV cache dtype, etc.)
on a local GGUF model and clicking Apply, the reload was falling through
to the HuggingFace/transformers path instead of the llama-server GGUF
path, causing a 'No config file found' error because GGUF files don't
have a config.json.

The root cause was that the already-loaded check only handles GGUF models
with a gguf_variant (HF repos). Local GGUF files loaded by path have no
variant, so they skipped the GGUF check and fell through to transformers.

Fix: add a local GGUF already-loaded handler that kills the existing
llama-server process before the GGUF reload path runs, ensuring the
reload uses the correct backend with updated settings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a844758c9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to unload an existing local GGUF model before reloading it with updated settings, preventing the system from incorrectly falling back to the transformers path. The feedback suggests refining the unload condition to specifically target the same model identifier, which would avoid redundant unloads when switching between different model types.

Comment thread studio/backend/routes/inference.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8e2e8f97c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +579 to +583
not request.gguf_variant
and llama_backend.is_loaded
and llama_backend.model_identifier
and llama_backend.model_identifier.lower() == model_identifier.lower()
):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Defer local GGUF unload until validations pass

This preemptive llama_backend.unload_model() runs before request validation, so a bad reload request can terminate a healthy serving model and then return an error. For example, when the same local GGUF is loaded and the caller sends gpu_ids, this branch unloads first, then the GGUF path rejects the request with 400 (gpu_ids is not supported for GGUF models yet), leaving inference with no model loaded. Please move this unload until after request/config validation that can fail.

Useful? React with 👍 / 👎.

@BlackBox-Labs

Copy link
Copy Markdown
Author

Update: Found a second path that blocks the reload

Why we missed it first time:
The initial fix only handled the case where load_model() is called while llama_backend.is_loaded is still True (llama-server alive). But the frontend sends a separate POST /api/inference/unload before the load request, which kills llama-server. After that:

  • llama_backend.is_loaded = False → first fix doesn't fire
  • Orchestrator's backend.active_model_name still holds the .gguf path
  • The else block at route line ~282 sees the matching path and returns "already_loaded" — never reaches the GGUF reload path

Attempted fix: Added a .gguf exclusion to the else block's already_loaded condition so local GGUF models fall through to the proper GGUF path instead.

Not 100% sure this is the cleanest approach. Alternatives considered:

  • Clearing backend.active_model_name during GGUF unload
  • Adding a separate check earlier in load_model() for local GGUF files

Would appreciate review — is the .gguf suffix check reasonable, or is there a better pattern the project prefers for distinguishing local GGUF from Unsloth models in the orchestrator path?

@rolandtannous rolandtannous self-assigned this May 14, 2026
@danielhanchen

Copy link
Copy Markdown
Member

Hi @BlackBox-Labs, thanks for the careful debugging here. Your root-cause writeup on the original failure was correct, and the second comment about the /unload + /load ordering caught a real subtlety.

After tracing this against current main, the bug from #5238 has already been fixed structurally by #5427 (merged a few hours after you opened this), so I am going to close this one. A quick summary of what landed on main:

  • _request_matches_loaded_settings() at studio/backend/routes/inference.py:425-463 now compares max_seq_length, cache_type_kv, speculative_type, chat_template_override, and llama_extra_args. The if request.gguf_variant: branch at :537-547 calls it, so a same-model Apply with new settings triggers a real reload instead of returning already_loaded.
  • ModelConfig.from_identifier(...) at :648 classifies a local .gguf path as GGUF by suffix, so the request routes through the GGUF branch (:664) and never falls into the transformers loader that produced the "No config file found" error.
  • _serial_load_lock in LlamaCppBackend.load_model serialises concurrent loads, removing the overlapping llama-server window.
  • New extras inheritance at :688-743 keeps --top-k, --seed, and friends alive across the frontend's /unload + /load Apply cycle.

The reason for closing rather than rebasing: with #5427 in place, the explicit llama_backend.unload_model() in this PR would now (a) clear _extra_args / _extra_args_source before the inheritance path can read them, so CLI flags get lost on every Apply, and (b) fire before the request validations at :665-669 (gpu_ids on GGUF) and :769-772 (mmproj companion check), so an invalid request would leave the user with their llama-server already killed. The codex review comments on :577 and :588 were both flagging this. The .endswith(".gguf") guard on the orchestrator else-branch is also unreachable on current main, since backend.active_model_name is only ever set from a non-GGUF config.identifier after a worker subprocess load.

If you hit any residual reload weirdness on the latest main, please do open a fresh issue with the steps and a backend log. Thanks again for the work tracking this down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants