Fix local GGUF settings reload falling through to transformers path by BlackBox-Labs · Pull Request #5332 · unslothai/unsloth

BlackBox-Labs · 2026-05-08T06:12:04Z

Summary

When changing inference settings (context length, KV cache dtype, etc.) on a local GGUF model and clicking Apply, the reload would fall through to the HuggingFace/transformers path instead of the llama-server GGUF path, causing a "No config file found" error.

Root Cause

The already-loaded check at routes/inference.py:465 only handles GGUF models with a gguf_variant (HF repos). Local GGUF files loaded by file path have no gguf_variant, so they skipped the GGUF check, entered the else block, and since the orchestrator had no active transformers model, fell through to the transformers path.

Fix

Added a local GGUF already-loaded handler that detects when a local GGUF is already loaded via llama-server and kills the existing process before the GGUF reload path runs, ensuring the reload uses the correct backend with updated settings.

Testing

Load a local GGUF model → works
Change inference settings & Apply → model reloads with new settings, no error
Load a remote (HF repo) GGUF → unchanged path (already worked)
Load a transformers model → unchanged path

When changing inference settings (context length, KV cache dtype, etc.) on a local GGUF model and clicking Apply, the reload was falling through to the HuggingFace/transformers path instead of the llama-server GGUF path, causing a 'No config file found' error because GGUF files don't have a config.json. The root cause was that the already-loaded check only handles GGUF models with a gguf_variant (HF repos). Local GGUF files loaded by path have no variant, so they skipped the GGUF check and fell through to transformers. Fix: add a local GGUF already-loaded handler that kills the existing llama-server process before the GGUF reload path runs, ensuring the reload uses the correct backend with updated settings.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a844758c9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

gemini-code-assist

Code Review

This pull request introduces a mechanism to unload an existing local GGUF model before reloading it with updated settings, preventing the system from incorrectly falling back to the transformers path. The feedback suggests refining the unload condition to specifically target the same model identifier, which would avoid redundant unloads when switching between different model types.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8e2e8f97c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-08T09:15:51Z

+            not request.gguf_variant
+            and llama_backend.is_loaded
+            and llama_backend.model_identifier
+            and llama_backend.model_identifier.lower() == model_identifier.lower()
+        ):


Defer local GGUF unload until validations pass

This preemptive llama_backend.unload_model() runs before request validation, so a bad reload request can terminate a healthy serving model and then return an error. For example, when the same local GGUF is loaded and the caller sends gpu_ids, this branch unloads first, then the GGUF path rejects the request with 400 (gpu_ids is not supported for GGUF models yet), leaving inference with no model loaded. Please move this unload until after request/config validation that can fail.

Useful? React with 👍 / 👎.

BlackBox-Labs · 2026-05-08T11:24:49Z

Update: Found a second path that blocks the reload

Why we missed it first time:
The initial fix only handled the case where load_model() is called while llama_backend.is_loaded is still True (llama-server alive). But the frontend sends a separate POST /api/inference/unload before the load request, which kills llama-server. After that:

llama_backend.is_loaded = False → first fix doesn't fire
Orchestrator's backend.active_model_name still holds the .gguf path
The else block at route line ~282 sees the matching path and returns "already_loaded" — never reaches the GGUF reload path

Attempted fix: Added a .gguf exclusion to the else block's already_loaded condition so local GGUF models fall through to the proper GGUF path instead.

Not 100% sure this is the cleanest approach. Alternatives considered:

Clearing backend.active_model_name during GGUF unload
Adding a separate check earlier in load_model() for local GGUF files

Would appreciate review — is the .gguf suffix check reasonable, or is there a better pattern the project prefers for distinguishing local GGUF from Unsloth models in the orchestrator path?

danielhanchen · 2026-05-18T07:08:16Z

Hi @BlackBox-Labs, thanks for the careful debugging here. Your root-cause writeup on the original failure was correct, and the second comment about the /unload + /load ordering caught a real subtlety.

After tracing this against current main, the bug from #5238 has already been fixed structurally by #5427 (merged a few hours after you opened this), so I am going to close this one. A quick summary of what landed on main:

_request_matches_loaded_settings() at studio/backend/routes/inference.py:425-463 now compares max_seq_length, cache_type_kv, speculative_type, chat_template_override, and llama_extra_args. The if request.gguf_variant: branch at :537-547 calls it, so a same-model Apply with new settings triggers a real reload instead of returning already_loaded.
ModelConfig.from_identifier(...) at :648 classifies a local .gguf path as GGUF by suffix, so the request routes through the GGUF branch (:664) and never falls into the transformers loader that produced the "No config file found" error.
_serial_load_lock in LlamaCppBackend.load_model serialises concurrent loads, removing the overlapping llama-server window.
New extras inheritance at :688-743 keeps --top-k, --seed, and friends alive across the frontend's /unload + /load Apply cycle.

The reason for closing rather than rebasing: with #5427 in place, the explicit llama_backend.unload_model() in this PR would now (a) clear _extra_args / _extra_args_source before the inheritance path can read them, so CLI flags get lost on every Apply, and (b) fire before the request validations at :665-669 (gpu_ids on GGUF) and :769-772 (mmproj companion check), so an invalid request would leave the user with their llama-server already killed. The codex review comments on :577 and :588 were both flagging this. The .endswith(".gguf") guard on the orchestrator else-branch is also unreachable on current main, since backend.active_model_name is only ever set from a non-GGUF config.identifier after a worker subprocess load.

If you hit any residual reload weirdness on the latest main, please do open a fresh issue with the steps and a backend log. Thanks again for the work tracking this down.

BlackBox-Labs requested a review from rolandtannous as a code owner May 8, 2026 06:12

BlackBox-Labs mentioned this pull request May 8, 2026

[Bug] Changing context length and clicking Apply reloads with a missing config error, but the selected context length is applied after manually reloading the model #5238

Closed

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread studio/backend/routes/inference.py Outdated

gemini-code-assist Bot reviewed May 8, 2026

View reviewed changes

Comment thread studio/backend/routes/inference.py Outdated

BlackBox-Labs and others added 5 commits May 8, 2026 12:04

fix: narrow local GGUF reload condition with model_identifier match

a02dfd4

[pre-commit.ci] auto fixes from pre-commit.com hooks

fcaadd9

for more information, see https://pre-commit.ci

fix: skip already_loaded for local GGUF in orchestrator else block

7c0923a

[pre-commit.ci] auto fixes from pre-commit.com hooks

977f406

for more information, see https://pre-commit.ci

Merge branch 'main' into fix/local-gguf-reload-settings

d8e2e8f

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Merge branch 'main' into fix/local-gguf-reload-settings

58f8aa9

rolandtannous self-assigned this May 14, 2026

danielhanchen closed this May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix local GGUF settings reload falling through to transformers path#5332

Fix local GGUF settings reload falling through to transformers path#5332
BlackBox-Labs wants to merge 7 commits into
unslothai:mainfrom
BlackBox-Labs:fix/local-gguf-reload-settings

BlackBox-Labs commented May 8, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 8, 2026

Uh oh!

BlackBox-Labs commented May 8, 2026

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

BlackBox-Labs commented May 8, 2026

Summary

Root Cause

Fix

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

BlackBox-Labs commented May 8, 2026

Update: Found a second path that blocks the reload

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants