Fix local GGUF settings reload falling through to transformers path #5332
BlackBox-Labs wants to merge 7 commits into
When changing inference settings (context length, KV cache dtype, etc.) on a local GGUF model and clicking Apply, the reload was falling through to the HuggingFace/transformers path instead of the llama-server GGUF path, causing a 'No config file found' error because GGUF files don't have a config.json. The root cause was that the already-loaded check only handles GGUF models with a gguf_variant (HF repos). Local GGUF files loaded by path have no variant, so they skipped the GGUF check and fell through to transformers. Fix: add a local GGUF already-loaded handler that kills the existing llama-server process before the GGUF reload path runs, ensuring the reload uses the correct backend with updated settings.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a844758c9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Code Review
This pull request introduces a mechanism to unload an existing local GGUF model before reloading it with updated settings, preventing the system from incorrectly falling back to the transformers path. The feedback suggests refining the unload condition to specifically target the same model identifier, which would avoid redundant unloads when switching between different model types.
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8e2e8f97c
    not request.gguf_variant
    and llama_backend.is_loaded
    and llama_backend.model_identifier
    and llama_backend.model_identifier.lower() == model_identifier.lower()
):
Defer local GGUF unload until validations pass
This preemptive llama_backend.unload_model() runs before request validation, so a bad reload request can terminate a healthy serving model and then return an error. For example, when the same local GGUF is loaded and the caller sends gpu_ids, this branch unloads first, then the GGUF path rejects the request with 400 (gpu_ids is not supported for GGUF models yet), leaving inference with no model loaded. Please move this unload until after request/config validation that can fail.
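The ordering the review asks for can be sketched as follows. This is a minimal illustration, not the project's code: `FakeBackend`, `ValidationError`, `validate_gguf_request`, and `reload_local_gguf` are hypothetical names, and only the `gpu_ids` rejection is taken from the comment above. The point is that every check that can fail runs before the healthy model is torn down.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class ValidationError(Exception):
    """Hypothetical stand-in for the HTTP 400 described in the review."""


@dataclass
class FakeBackend:
    """Hypothetical stand-in for llama_backend; records what happened to it."""

    is_loaded: bool = True
    events: List[str] = field(default_factory=list)

    def unload_model(self) -> None:
        self.is_loaded = False
        self.events.append("unload")

    def load_model(self, path: str) -> None:
        self.is_loaded = True
        self.events.append(f"load:{path}")


def validate_gguf_request(gpu_ids: Optional[List[int]]) -> None:
    # The one rejection the review mentions; real validation may cover more.
    if gpu_ids:
        raise ValidationError("gpu_ids is not supported for GGUF models yet")


def reload_local_gguf(
    backend: FakeBackend, path: str, gpu_ids: Optional[List[int]] = None
) -> None:
    # 1. Validate first: if this raises, the current model is still serving.
    validate_gguf_request(gpu_ids)
    # 2. Only now stop the running model and start it with the new settings.
    backend.unload_model()
    backend.load_model(path)
```

With this ordering, a rejected request leaves `backend.is_loaded` unchanged and no unload is recorded, while a valid request produces an unload followed by a load.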
Update: Found a second path that blocks the reload
Why we missed it first time:
Attempted fix: Added a
Not 100% sure this is the cleanest approach. Alternatives considered:
Would appreciate review: is the
Hi @BlackBox-Labs, thanks for the careful debugging here. Your root-cause writeup on the original failure was correct, and the second comment about the
After tracing this against current main, the bug from #5238 has already been fixed structurally by #5427 (merged a few hours after you opened this), so I am going to close this one. A quick summary of what landed on main:
The reason for closing rather than rebasing: with
If you hit any residual reload weirdness on the latest main, please do open a fresh issue with the steps and a backend log. Thanks again for the work tracking this down.
Fixes #5238
Summary
When changing inference settings (context length, KV cache dtype, etc.) on a local GGUF model and clicking Apply, the reload would fall through to the HuggingFace/transformers path instead of the llama-server GGUF path, causing a "No config file found" error.
Root Cause
The already-loaded check at routes/inference.py:465 only handles GGUF models with a gguf_variant (HF repos). Local GGUF files loaded by file path have no gguf_variant, so they skipped the GGUF check, entered the else block, and since the orchestrator had no active transformers model, fell through to the transformers path.
Fix
Added a local GGUF already-loaded handler that detects when a local GGUF is already loaded via llama-server and kills the existing process before the GGUF reload path runs, ensuring the reload uses the correct backend with updated settings.
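The detection step can be sketched as a small predicate. This is an assumption-laden illustration rather than the handler in the PR: the function name `is_same_local_gguf_reload` and its parameters are hypothetical, chosen to mirror the fields the reviewed condition reads (`gguf_variant`, `is_loaded`, `model_identifier`).

```python
from typing import Optional


def is_same_local_gguf_reload(
    gguf_variant: Optional[str],
    backend_loaded: bool,
    loaded_identifier: Optional[str],
    requested_identifier: str,
) -> bool:
    """Return True only when the request reloads the local GGUF already served."""
    # HF-repo GGUFs carry a variant and are covered by the existing check.
    if gguf_variant:
        return False
    # Nothing to stop if llama-server is not serving a model.
    if not backend_loaded or not loaded_identifier:
        return False
    # Case-insensitive match, so only the same model triggers an unload.
    return loaded_identifier.lower() == requested_identifier.lower()
```

Restricting the match to the same identifier follows the review feedback: switching to a different model does not trigger a redundant unload here.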
Testing