studio: load cached GGUF models when fully offline#5505
Conversation
When huggingface.co is unreachable, GGUF model loads fail in three distinct
places even though the bits are already in ~/.cache/huggingface/hub. Each
failure has a different surface symptom:
1. list_gguf_variants() raises straight through HTTPException(500), so the
variant dropdown shows 'Failed to list GGUF variants'.
2. detect_gguf_model_remote() silently returns None after retries fail. The
caller then treats a GGUF-only repo as non-GGUF and routes it through the
transformers/MLX path. On Apple Silicon this surfaces as 'Unsloth currently
only works on NVIDIA, AMD and Intel GPUs.'
3. _download_gguf() loses list_repo_files() to the network and falls back to a
filename heuristic ('{repo}-{variant}.gguf'). When the repo name does not
echo the filenames (e.g. repo 'Qwen3.6-27B-MTP-GGUF' contains a file
'Qwen3.6-27B-UD-Q4_K_XL.gguf' with no MTP), hf_hub_download cannot find
that invented filename in the cache and aborts.
Fix in three layers:
- list_gguf_variants / detect_gguf_model_remote: honor HF_HUB_OFFLINE and
fall back to scanning the local HF cache snapshot when the API throws.
detect_gguf_model_remote still keeps its retry loop for transient flakes;
the cache fallback only kicks in after every attempt fails.
- _download_gguf: when list_repo_files() fails, look up variant -> real
filename inside the cached snapshot before resorting to the heuristic.
- llama_cpp.load_model / inference worker startup: when DNS for
huggingface.co fails (2s probe), set HF_HUB_OFFLINE=1 for the process so
every hf_hub_download call below resolves from cache instantly instead of
spending ~25s on five exponential retries.
Online behavior is unchanged: the API is tried first and only used to fail
over. The cache scan is a strict subset of what list_local_gguf_variants
already does today for local paths.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 659cea1a20
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for p in snap.rglob("*.gguf") | ||
| if "mmproj" not in p.name.lower() | ||
| and boundary.search(p.name.lower()) |
There was a problem hiding this comment.
Match cache variant against relative path, not basename
The offline cache fallback filters candidates with boundary.search(p.name.lower()), which ignores directory components. That breaks repos where the quant appears only in a subdirectory (the codebase already documents layouts like BF16/...gguf): those files are skipped, gguf_filename stays unset, and the code falls back to the synthetic {repo}-{variant}.gguf name that is often absent from cache. In a fully offline load this turns a valid cached model into a download failure.
Useful? React with 👍 / 👎.
| os.environ["HF_HUB_OFFLINE"] = "1" | ||
| os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") |
There was a problem hiding this comment.
Do not persist HF_HUB_OFFLINE on a single DNS miss
When startup DNS resolution fails once, this permanently sets HF_HUB_OFFLINE=1 for the entire worker lifetime. A transient resolver hiccup at process start therefore forces all subsequent model loads into offline mode (including when connectivity is later healthy), preventing normal Hub fetches until the worker is restarted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request adds offline support for GGUF model detection and resolution by implementing local Hugging Face cache lookups and automatic reachability detection for huggingface.co. Key changes include adding fallback mechanisms in llama_cpp.py and model_config.py to serve model metadata from the local cache when the network is unavailable. Feedback suggests centralizing duplicated cache lookup logic, utilizing the is_offline_mode utility from huggingface_hub, and ensuring consistent logging when offline mode is triggered.
| except Exception: | ||
| os.environ["HF_HUB_OFFLINE"] = "1" | ||
| os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") |
There was a problem hiding this comment.
For consistency with the change in llama_cpp.py, it would be beneficial to log a warning when offline mode is enabled. This helps in debugging by making it clear why the application is operating in offline mode. In accordance with repository guidelines, ensure the warning message is dynamically generated to include the specific configuration values it refers to (e.g., the source or value of the offline setting) to ensure accuracy and avoid confusion.
References
- User-facing warning messages should be dynamically generated to include the specific configuration values they refer to, rather than using hardcoded examples, to ensure accuracy and avoid confusion.
| def _list_gguf_variants_from_hf_cache( | ||
| repo_id: str, | ||
| ) -> Optional[tuple[list[GgufVariantInfo], bool]]: | ||
| """Variants from the local HF cache snapshot, or None if not cached.""" | ||
| try: | ||
| from huggingface_hub import constants as hf_constants | ||
| except Exception: | ||
| return None | ||
|
|
||
| cache_dir = Path(hf_constants.HF_HUB_CACHE) | ||
| if not cache_dir.is_dir(): | ||
| return None | ||
|
|
||
| target = f"models--{repo_id.replace('/', '--')}".lower() | ||
| repo_dir: Optional[Path] = None | ||
| for entry in cache_dir.iterdir(): | ||
| if entry.is_dir() and entry.name.lower() == target: | ||
| repo_dir = entry | ||
| break | ||
| if repo_dir is None: | ||
| return None | ||
|
|
||
| snapshots = repo_dir / "snapshots" | ||
| if not snapshots.is_dir(): | ||
| return None | ||
|
|
||
| snap_dirs = [s for s in snapshots.iterdir() if s.is_dir()] | ||
| if not snap_dirs: | ||
| return None | ||
| snap_dirs.sort(key = lambda s: s.stat().st_mtime, reverse = True) | ||
| for snap in snap_dirs: | ||
| variants, has_vision = list_local_gguf_variants(str(snap)) | ||
| if variants or has_vision: | ||
| return variants, has_vision | ||
| return None |
There was a problem hiding this comment.
The logic for finding a repository directory in the Hugging Face cache is duplicated in multiple locations (e.g., _detect_gguf_from_hf_cache and _download_gguf). This logic is also inefficient as it iterates over the entire cache directory. Consider creating a single helper function to find the repository directory to improve maintainability and performance, as per repository guidelines to centralize recurring logic and avoid redundant iterations.
References
- Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
| offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in ( | ||
| "1", | ||
| "true", | ||
| "yes", |
There was a problem hiding this comment.
This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.
from huggingface_hub.utils import is_offline_mode
offline = is_offline_mode()References
- Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
| offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in ( | ||
| "1", | ||
| "true", | ||
| "yes", |
There was a problem hiding this comment.
This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.
from huggingface_hub.utils import is_offline_mode
offline = is_offline_mode()References
- Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
Fixes from the review pass on #5505: * ruff F823 (lint CI red): the late `import os` at the bottom of LlamaCppBackend.load_model made `os` a function-local name, so my new `os.environ` reference at the top of the same method was a use-before-bind. Surfaces at runtime as 'cannot access local variable os where it is not associated with a value' and is why the Mac/Windows Studio API jobs were failing too. The env-var mutation has been moved into a module-level contextmanager, so load_model no longer touches `os` directly. * Codex P1: cache variant match now uses the relative path, not the basename. Layouts like `BF16/foo.gguf` (variant token only in parent dir) were silently skipped, falling through to the bogus `{repo}-{variant}.gguf` heuristic and failing offline loads of models stored under quant-named subdirs. * Codex P1: HF_HUB_OFFLINE no longer persists past one model load. llama_cpp.load_model now uses a contextmanager that probes DNS, sets HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE only when DNS is dead, and pops them in finally (preserving any prior user setting of TRANSFORMERS_OFFLINE). Pre-existing user-set HF_HUB_OFFLINE is respected as a no-op. worker.py keeps the startup probe because the orchestrator spawns a fresh worker per load -- comment updated to make that lifecycle explicit, and a warning is now logged. * Gemini: cache-dir lookup centralized in `_iter_hf_cache_snapshots`. Three near-identical copies (in list/detect helpers and the llama_cpp offline scan) now go through one helper. * Gemini: `huggingface_hub.utils.is_offline_mode` does not exist in 1.x (verified locally); `huggingface_hub.constants.HF_HUB_OFFLINE` is snapshot-at-import-time and does not reflect runtime mutations. Manual env-var parsing kept. * socket probe now saves and restores the prior default timeout instead of unconditionally setting None on exit, so it composes with caller code that already configured a timeout. * worker.py probe now logs a warning when offline mode is auto-enabled so debugging the case isn't blind.
|
Pushed 6476b9b addressing the review feedback. Verdicts inline: Lint (ruff F823) — fixed. This was a real bug, not just a style nit. Codex P1 #1 (match cache variant against relative path) — fixed. Real bug. The cache scan now matches against Codex P1 #2 (do not persist HF_HUB_OFFLINE) — fixed. Real concern for the long-lived FastAPI process. Gemini (centralize cache-dir lookup) — fixed. Extracted Gemini (use Gemini (log warning when worker enables offline) — fixed. Drive-by improvements the review pass surfaced:
Test coverage: the local suite that exercises the patched code is now 119 cases (102 from the original verification + 17 net new for the review fixes). Notable additions:
Pre-commit formatting ( |
danielhanchen
left a comment
There was a problem hiding this comment.
I pulled the branch down, set up an isolated harness with the same pinned huggingface_hub==0.36.2, ran 41 simulated scenarios against the three changed call paths (fake HF cache with both flat and snapshot-symlink layouts, monkey-patched hf_model_info / list_repo_files, monkey-patched socket.gethostbyname). Full suite passes; results back both bot P1 findings empirically. Sim lives at workspace_32/temp/pr5505_sim/ if you want it.
The PR is solving real bugs: I reproduced all three failure modes against the pre-PR code (case 06 raises offline, case 15 builds the synthetic Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf that misses the cache). Online behavior is unchanged versus main (case 27). Cross-platform-wise the new code is pure stdlib + pathlib.Path + huggingface_hub.constants.HF_HUB_CACHE, no sys.platform branches added; the realistic-layout suite covers HF cache symlinks (snapshots/<sha>/<name> into blobs/) and dangling symlinks from partial downloads.
P1
1. _download_gguf offline cache filter regresses online/offline parity.
studio/backend/core/inference/llama_cpp.py:1814 matches on boundary.search(p.name.lower()) (basename), while the online block at :1748-1757 matches on the full path from list_repo_files (which can return rfilenames like BF16/foo.gguf). The codebase already supports the subdir-only layout: model_config.py:1020, :1213, :1258, :1443-1446, :1490-1491, :2293-2300. Simulation case 16 / E4 reproduces this end to end against a cache containing only BF16/foo.gguf: PR resolves to the synthetic gpt-oss-20b-BF16-BF16.gguf (which is not on disk) and the load fails; case 17 / E4 shows the one-character fix passing.
Suggested change (one line):
# studio/backend/core/inference/llama_cpp.py around line 1814
matches = sorted(
p.relative_to(snap).as_posix()
for p in snap.rglob("*.gguf")
if "mmproj" not in p.name.lower()
and boundary.search(p.relative_to(snap).as_posix().lower())
)The basename-only failure is latent today (every in-repo example I found still has the quant token in the basename) but it diverges from the online filter right above and silently turns valid cached repos into download failures the moment any future unsloth GGUF ships a subdir-only layout. Cheap to fix now.
2. LlamaCppBackend.load_model permanently mutates HF_HUB_OFFLINE on one transient DNS miss.
llama_cpp.py:2052-2067 sets os.environ["HF_HUB_OFFLINE"] = "1" and os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") in the except branch and never restores them. LlamaCppBackend is a module-level singleton (routes/inference.py:433 -> _llama_cpp_backend = LlamaCppBackend()), reused across every /load for the lifetime of the FastAPI server. After a single DNS hiccup at first load the env var is sticky forever: when connectivity recovers the next uncached-repo load short-circuits the probe (if hf_repo and "HF_HUB_OFFLINE" not in os.environ), hf_hub_download keeps refusing the Hub, and the user sees stale-cache errors with no recovery short of restarting the server. grep "del os.environ\['HF_HUB_OFFLINE'\]\\|os.environ.pop.*HF_HUB_OFFLINE" studio/ returns zero hits.
Simulation case 19 reproduces the leak: DNS fails -> env stuck at 1 even after DNS recovers. Case 20 / 21 / 22 show the scoped variant restores the env, preserves a user-set HF_HUB_OFFLINE=1, and is a no-op when DNS succeeds.
worker.py:651-663 does the same thing but isn't a real bug there because the worker is one-shot per load: orchestrator.py:606-608 documents "Always spawns a fresh subprocess for each model load" and _CTX.Process(...) at :181-192 confirms it. Each load gets a fresh probe and a fresh process env. The block in worker.py can stay as-is; only llama_cpp.py:2052-2067 needs to change.
Suggested change (scope the mutation):
# studio/backend/core/inference/llama_cpp.py around line 2052
self._cancel_event.clear()
_offline_keys: list[str] = []
if hf_repo and "HF_HUB_OFFLINE" not in os.environ:
import socket as _socket
try:
_socket.setdefaulttimeout(2.0)
_socket.gethostbyname("huggingface.co")
except Exception:
os.environ["HF_HUB_OFFLINE"] = "1"
_offline_keys.append("HF_HUB_OFFLINE")
if "TRANSFORMERS_OFFLINE" not in os.environ:
os.environ["TRANSFORMERS_OFFLINE"] = "1"
_offline_keys.append("TRANSFORMERS_OFFLINE")
logger.warning(
"huggingface.co unreachable, enabling HF_HUB_OFFLINE for this load."
)
finally:
_socket.setdefaulttimeout(None)
try:
# ... existing kill / launch / hf_hub_download path ...
finally:
for _k in _offline_keys:
os.environ.pop(_k, None)_offline_keys ensures we only pop the variables this code added; a user who explicitly set HF_HUB_OFFLINE=1 before launching Studio keeps that value across the load. Alternative shape if you'd rather not touch os.environ at all: thread local_files_only=True into the specific hf_hub_download / get_paths_info calls reached after the probe fails. No local_files_only= usage anywhere in studio/ today (zero grep hits), so this would be the first instance; the scoped try/finally is the smaller diff.
P2
3. Centralise the cache repo-dir lookup. Three sites (_list_gguf_variants_from_hf_cache, _detect_gguf_from_hf_cache, and the new inline block at llama_cpp.py:1797-1840) hand-roll the same cache_dir / f"models--{repo_id.replace('/', '--')}".lower() resolution plus snapshot iteration sorted by mtime. One _find_repo_cache_dir(repo_id) -> Optional[Path] (or _iter_repo_snapshots(repo_id)) helper in model_config.py keeps the three sites in sync, and the fix in P1 #1 lives in exactly one place.
4. worker.py:651-663 should log when auto-offline mode kicks in, matching the logger.warning(...) already in llama_cpp.py:2061. Pure observability nit.
5. Drop the huggingface_hub.utils.is_offline_mode suggestion from the bot's medium comments. That function does not exist in pinned huggingface_hub==0.36.2. I verified with python -c "from huggingface_hub.utils import is_offline_mode" -> ImportError, and again with from huggingface_hub import is_offline_mode -> ImportError. The closest modern API is huggingface_hub.constants.HF_HUB_OFFLINE (module-level bool computed from env at import), but the current manual env check is also fine and is more flexible because it accepts "true" / "yes" (simulated in cases E11 / E12 / E13). Keep what you have.
Nits
-
The cache fallback in
list_gguf_variants/detect_gguf_model_remotecatches bareExceptionand serves cache on every HF API error. The retry loop indetect_gguf_model_remotealready early-returns onRepositoryNotFoundError/GatedRepoError/RevisionNotFoundError/EntryNotFoundError; consider mirroring that for the outerexcept Exceptioncache fallback so a stale cache cannot quietly paper over a deliberate access-control or revision failure. Not blocking, the synthetic-name fallback after both API and cache miss still raises eventually. -
Concurrency edge:
socket.setdefaulttimeout(2.0)is process-global. If two/loadrequests land in the same FastAPI worker concurrently, the second thread's probe can see the first thread's reset toNonemid-flight (or vice versa). Simulation case E6 doesn't repro the worst case but observes the race window. Low priority; only matters once you have parallel-load tenants. -
The PR-description bullet "Drop a redundant
import osinside the disk-space try block" is fine but unrelated to offline; easy to split into its own commit if you keep the offline-fix PR tight.
Verification
If you want to reproduce the proofs locally without retraining anything:
cd workspace_32/temp/pr5505_sim
uv venv .venv && source .venv/bin/activate
uv pip install "huggingface_hub==0.36.2" structlog pytest
python run_sim.py # 28 cases, flat fake cache
python run_sim_extras.py # 13 cases, realistic snapshot-symlink cache + concurrency
Cases that reproduce the two P1s in isolation:
| Case | What it shows |
|---|---|
| 06 / 15 | BEFORE-PR offline path raises / builds bogus filename (the bug the PR fixes) |
| 14 | PR fixes the Qwen3.6-27B-MTP-GGUF case |
| 16 / E4 | P1 #1 reproduced: PR's basename filter misses BF16/foo.gguf |
| 17 / E4 | One-line fix (relative-path filter) resolves the subdir layout |
| 19 | P1 #2 reproduced: HF_HUB_OFFLINE stays 1 after DNS recovery |
| 20 / 21 / 22 | Scoped try/finally restores env, preserves user-set value, no-op on success |
| 27 | Online happy path unchanged versus main |
| E1-E5 | Realistic HF cache layout (snapshot symlinks into blobs/, dangling links) |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6476b9b9bc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| os.environ["HF_HUB_OFFLINE"] = "1" | ||
| os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") |
There was a problem hiding this comment.
Avoid forcing offline mode after one DNS lookup failure
A transient failure in _socket.gethostbyname("huggingface.co") now unconditionally sets HF_HUB_OFFLINE=1 for the entire worker process, so the same load attempt will refuse all Hub HTTP calls even if connectivity is actually available moments later. This regresses online reliability for partially cached models: instead of retrying/download behavior, the load fails immediately in offline-only mode until a new subprocess is spawned. Fresh evidence in this revision is that run_inference_process sets the env once at startup while the spawned worker is kept alive through the loaded handshake in orchestrator.load_model, so every Hub call in that attempt inherits offline mode.
Useful? React with 👍 / 👎.
Lock in the offline fallback path from #5505 so future refactors can't silently regress either bug. 26 tests, 0.55 s, no network/GPU/subprocess. Covers: * _iter_hf_cache_snapshots: missing cache, missing repo, missing snapshots/, newest-mtime ordering, case-insensitive repo match. * _list_gguf_variants_from_hf_cache and the list_gguf_variants online/offline-env/API-exception/reraise paths. * _detect_gguf_from_hf_cache and detect_gguf_model_remote 3x-fail fallback. Pre-existing RepositoryNotFoundError early-return preserved. * Codex P1 #1 regression: BF16/foo.gguf (quant only in subdir name) must resolve via _detect_gguf_from_hf_cache, which now matches the snapshot-relative path rather than the basename. * _probe_dns_dead: returns True/False, restores prior socket timeout. * Codex P1 #2 regression: _hf_offline_if_dns_dead sets env only inside the block, restores on exit (including on exception), re-probes DNS on the next call so a transient hiccup cannot lock the long-lived LlamaCppBackend singleton offline. Honors a user-set HF_HUB_OFFLINE as a no-op. Preserves a user-set TRANSFORMERS_OFFLINE across exit. Follows the existing studio backend test stub pattern (loggers / structlog / httpx stubs + backend dir on sys.path).
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c27d1da1b6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ) -> Optional[tuple[list[GgufVariantInfo], bool]]: | ||
| """Variants from the local HF cache snapshot, or None if not cached.""" | ||
| for snap in _iter_hf_cache_snapshots(repo_id): | ||
| variants, has_vision = list_local_gguf_variants(str(snap)) |
There was a problem hiding this comment.
Derive offline variants from relative paths
The new cache fallback calls list_local_gguf_variants(str(snap)), but that helper computes quant labels from each file basename (f.name) rather than the snapshot-relative path. In repos that store quants in directories (for example BF16/model.gguf and Q4_K_M/model.gguf), both files collapse to the same pseudo-quant (model), so the offline variant list can lose/distort variants and drive incorrect default variant selection. This regression is specific to the new offline HF-cache path because the online path receives full repo filenames.
Useful? React with 👍 / 👎.
…abel Two follow-up fixes from the review pass on #5505: * _download_mmproj() now mirrors _download_gguf()'s offline path: when list_repo_files() fails, scan the local HF cache snapshot for any GGUF whose basename starts with mmproj-. Without this, offline vision GGUF loads succeed at the main weight (the existing PR fix) but the mmproj returns None and llama-server starts without vision support. Same _iter_hf_cache_snapshots helper, F16 preference and fallback to the first match are preserved. * _extract_quant_label() now considers parent directory segments when the basename has no quant token. Layouts like BF16/foo.gguf are already documented in this file and are returned by the new snapshot-relative-path filter in _download_gguf; before this fix their variant label collapsed to "foo" (the last hyphen segment of the basename). Regex is the same; the search just walks parent segments innermost-first if the basename misses. Tests (studio/backend/tests/test_offline_gguf_cache_fallback.py): * TestExtractQuantLabelSubdir: basename quant unchanged, quant-only- in-parent, UD- prefix in parent, deeper nesting picks the innermost matching segment. * TestDownloadMmprojOfflineCacheFallback: cache fallback returns the mmproj when list_repo_files fails, F16 preference holds when both variants are in cache, no-mmproj cache returns None. * httpx stub now prefers the real package when installed (the CI install list already includes it) and falls back to the stub only when httpx is genuinely missing. Newer huggingface_hub imports HTTPError/Response/Request at module load, so the previous fixed-set stub broke when those names were added upstream. 26 existing cases plus 7 new = 33 pass in 0.74s.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6c17f750c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # API failed; fall back to local snapshot if fully downloaded. | ||
| cached = _list_gguf_variants_from_hf_cache(repo_id) | ||
| if cached is not None: |
There was a problem hiding this comment.
Keep permanent Hub errors from using stale cache
When the Hub API raises a permanent error while the process is not explicitly offline (for example RepositoryNotFoundError, GatedRepoError, or RevisionNotFoundError for a deleted/private repo or bad token), this catch-all now still returns any stale local snapshot. That makes the variant list look valid and can let the UI proceed with cached data even though the repo access/name is definitively invalid; detect_gguf_model_remote already preserves these errors by not consulting cache for the same exception names.
Useful? React with 👍 / 👎.
| rel_files = [ | ||
| f.relative_to(snap).as_posix() | ||
| for f in _iter_gguf_files(snap, recursive = True) | ||
| ] |
There was a problem hiding this comment.
Exclude mmproj files from cached model detection
In the offline cache path this list includes vision projector GGUFs, so _pick_best_gguf can return mmproj-*-F16.gguf as the main model when the cache is partial or when the actual model filename lacks a preferred quant token. That routes the repo to llama.cpp with the projector file as -m, instead of failing or selecting a non-mmproj model; the other GGUF listing paths explicitly skip mmproj files for this reason.
Useful? React with 👍 / 👎.
Resolves conflict in studio/backend/core/inference/llama_cpp.py: the offline DNS scoping wrapper around _download_gguf/_download_mmproj now sits inside the new _serial_load_lock + _already_in_target_state early-return added on main.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: feaeec5a90
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| prev = socket.getdefaulttimeout() | ||
| socket.setdefaulttimeout(timeout) | ||
| try: |
There was a problem hiding this comment.
Avoid mutating global socket timeout in DNS probe
_probe_dns_dead sets socket.setdefaulttimeout(timeout) before calling gethostbyname, which changes the default timeout process-wide for the duration of the probe. During a load_model call this can affect unrelated concurrent requests in the same backend process: any path that opens sockets without an explicit timeout will inherit 2.0s and may fail spuriously. This creates cross-request, timing-dependent network failures that are hard to diagnose.
Useful? React with 👍 / 👎.
Four review findings tightened, with regression tests: - list_local_gguf_variants subdir collapse (P1 codex 10:08): pass the snapshot-relative path to _extract_quant_label so BF16/foo.gguf and Q4_K_M/foo.gguf produce distinct labels instead of folding to the same basename pseudo-quant. - list_gguf_variants cache fallback (P2 codex 12:10): surface RepositoryNotFoundError / GatedRepoError / RevisionNotFoundError / EntryNotFoundError to the caller instead of masking with stale cache, matching detect_gguf_model_remote. - _detect_gguf_from_hf_cache mmproj (P2 codex 12:10): exclude mmproj files from the candidate list so a partial cache with only a vision projector cannot route the projector as the main model. - _probe_dns_dead global timeout (P2 codex 13:06): run the gethostbyname on a daemon thread with join timeout so concurrent sockets in the same interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Same shape applied in worker.py's startup probe.
unslothai#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.
Same fix as unslothai#5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.
Two layered fixes for the Windows GGUF smoke CI Tool calling Tests flake that exit-22'd on a single httpx.ReadError during llama-server warmup. The 'windows-latest -> windows-2025-vs2026' image rollout is hitting main with the identical symptom. A. _wait_for_health: catch httpx.ReadError, RemoteProtocolError, WriteError alongside ConnectError and TimeoutException. A TCP RST mid-read while llama-server is still binding the port (WinError 10054) is a 'still warming up' signal, not fatal. The existing _process.poll() check still wins for real crashes. B. _drain_stdout + spawn: tee llama-server stdout/stderr to a per-launch log file at ~/.unsloth/studio/logs/llama-server/ <port>.log. Any future subprocess crash leaves a forensic trace on disk even when Studio's traceback only captures the symptom (ReadError) and not the cause. Best-effort: a logging-side OSError never blocks the load. Regression coverage: TestWaitForHealthRetriesOnReadError pins the retry behaviour for the three new exception types and verifies that a real process exit still short-circuits the loop.
for more information, see https://pre-commit.ci
Composite fix for the Tool calling Tests flake that exit-22'd on a single httpx.ReadError during llama-server warm-up. The windows-latest -> windows-2025-vs2026 runner image rollout has been hitting main with the identical symptom. - All three jobs (openai-anthropic, tool-calling, json-images) now retry POST /api/inference/load up to 3 times with 10s backoff and preserve the response body for post-mortem. One transient 500 no longer fails the whole job. - A new "Collect llama-server logs" step copies the per-launch llama-server stdout teed by Studio under ~/.unsloth/studio/logs/ llama-server/ into the workspace, and the upload-artifact step now includes logs/llama-server/*.log so any future subprocess crash leaves a forensic trace.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3f9ce6394
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| HTTP=$(curl -s -o /tmp/load.json -w '%{http_code}' \ | ||
| -X POST "http://127.0.0.1:${STUDIO_PORT}/api/inference/load" \ | ||
| -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \ | ||
| --max-time 600 \ | ||
| -d "{\"model_path\":\"$GGUF_REPO\",\"gguf_variant\":\"$GGUF_VARIANT\",\"is_lora\":false,\"max_seq_length\":2048}") |
There was a problem hiding this comment.
Handle curl failures inside the retry loop
In this workflow shell: bash is executed with -e/pipefail (GitHub documents shell: bash as bash --noprofile --norc -eo pipefail {0}), so if curl hits the transient TCP reset/timeout this retry was added for, the non-zero exit from the command substitution aborts the step before attempt 2 or 3 runs. This makes the new retry only cover HTTP responses like 500, not the connection-level failures described in the comment; the same pattern is repeated in the other load retry blocks.
Useful? React with 👍 / 👎.
#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.
Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.
#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.
Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.
Resolve conflict in studio/backend/core/inference/llama_cpp.py: _wait_for_health's except clause. Main's unslothai#5505 landed essentially the same fix this branch shipped in 56537b6 -- catching httpx.ReadError and RemoteProtocolError to keep a transient TCP RST from masking a real subprocess crash -- and added httpx.WriteError on top. Took main's version (strict superset) and dropped this branch's duplicate. Extend tests/test_llama_cpp_wait_for_health.py: - Add WriteError to the stub-completion list so the file imports cleanly against either real httpx or a sibling test's stub. - Add test_write_error_also_swallowed alongside the existing ReadError and RemoteProtocolError cases. All other files in this merge auto-resolved cleanly (install_llama_prebuilt.py, workflows, etc.). Local verification: - pytest test_llama_cpp_wait_for_health.py + windows_nvidia_path + windows_gpu_detection_mock + llama_server_args -- 116 passed. - pytest test_import_fixes_drift.py -- 18 passed. - pytest test_offline_gguf_cache_fallback.py (main's unslothai#5505 addition) -- 44 passed.
…#5512) * studio: extend offline DNS auto-detect to inference parent + training #5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed. * tests: prefer real httpx over stub in offline-test files The studio test stub convention only included the 6 httpx exception names that existed callers needed. Newer huggingface_hub (1.15+) imports HTTPError, Response, Request, HTTPStatusError, AsyncClient, and more at module import time. When httpx is truly absent the stub chase becomes a treadmill. Use the real package when installed (the CI install list already includes httpx, so this is the production environment). Fall back to the stub only when httpx is genuinely missing. No code under test changes. * studio: detect cached LoRA adapters offline; tighten test Two follow-ups from the review pass on #5512: * ModelConfig.from_identifier no longer skips the remote LoRA-detect hf_model_info call when _env_offline() is true. huggingface_hub short-circuits the call via OfflineModeIsEnabled in ~0ms when HF_HUB_OFFLINE is set, so the original 25s concern was moot once routes/inference.py wrapped the call in _hf_offline_if_dns_dead. Skipping the API meant users with a cached LoRA adapter (adapter_config.json on disk) got is_lora=False and the load failed. After the API call (which raises fast offline) a new cache-fallback walks the HF cache snapshot for adapter_config.json via the existing _iter_hf_cache_snapshots helper. * test_hf_model_info_not_called_when_offline replaced. The old test raised AssertionError inside production code that catches Exception, so it passed even if the call happened. New tests use MagicMock and assert call_count >= 1, plus a fixture that stages a fake HF cache with adapter_config.json to verify the offline cache detection. Test count goes from 7 to 8 in test_offline_inference_parent.py. Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s. * Fix/adjust offline training DNS probe per PR #5505 review Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Trim verbose code comments per review feedback Shorten the longer explanatory comments added by this PR while keeping the WHY of each non-obvious branch: - trainer.py: collapse the 5-line proactive gated-check comment. - training/worker.py: trim the offline auto-detect preamble and the "logger isn't configured" note. - routes/inference.py: shorten the DNS-probe wrap rationale. - transformers_version.py: collapse the two urllib short-circuit notes. - model_config.py: shorten the LoRA detect + cache-fallback notes. - tests/test_offline_inference_parent.py: tighter module docstring, trim class docstrings, drop multi-line explainer comments inside the tests; behaviour and coverage unchanged (9/9 tests still pass). --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The orchestrator's iter-1 refactor of llama_cpp.py inadvertently removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by unslothai#5505 between when this branch forked and the orchestrator's merge), which caused tests/test_offline_gguf_cache_fallback.py to fail collection across Python 3.10 / 3.11 / 3.12 / 3.13: ImportError: cannot import name '_hf_offline_if_dns_dead' from 'core.inference.llama_cpp' The original intent of this PR for llama_cpp.py was only to delegate the existing _parse_tool_calls_from_text staticmethod to the shared core/inference/tool_call_parser.py, so this commit: 1. Restores studio/backend/core/inference/llama_cpp.py to origin/main verbatim. 2. Re-adds the single import of parse_tool_calls_from_text from the shared parser module. 3. Re-applies the staticmethod-body swap to call the shared parser. Net delta vs main is now small (the shared parser pulls the body out; the DNS-offline helpers and every other GGUF feature stay exactly as main has them). Test pass count after the fix (all on Linux Python 3.11): * 41 safetensors tool-loop tests * 44 offline GGUF cache fallback tests (the previously failing file) * 217 other related tool / inference / anthropic tests = 302 total
Summary
Three different "offline" failures in the GGUF load path, all because we ask huggingface.co for things we already have on disk. Each one surfaces a different error to the user despite the same underlying problem (DNS or HF Hub unreachable).
The fix is the same shape everywhere: try the network, fall back to scanning the local HF cache snapshot if the network is dead.
Failures observed
list_gguf_variants()instudio/backend/utils/models/model_config.pyraises straight throughHTTPException(500), UI showsFailed to list GGUF variants: ... Failed to resolve 'huggingface.co'.detect_gguf_model_remote()retries three times then returnsNone. The caller treats the GGUF-only repo as non-GGUF and routes it through the transformers/Unsloth path. On Apple Silicon the worker subprocess fails to import unsloth and the UI showsFailed to import ML libraries: Unsloth currently only works on NVIDIA, AMD and Intel GPUs._download_gguf()instudio/backend/core/inference/llama_cpp.pyloseslist_repo_files()to the network and falls back tof"{repo_name}-{hf_variant}.gguf". When the file inside the repo does not echo the repo name (e.g.unsloth/Qwen3.6-27B-MTP-GGUFcontainsQwen3.6-27B-UD-Q4_K_XL.ggufwith noMTP),hf_hub_downloadcannot find that invented filename in the cache and raisesFailed to download GGUF file ... we cannot find the requested files in the local cache.All three repros tested with the model fully present in
~/.cache/huggingface/hub, internet off.Changes
studio/backend/utils/models/model_config.py_list_gguf_variants_from_hf_cache(repo_id)that walksmodels--<repo>/snapshots/<hash>/and reuseslist_local_gguf_variants()._detect_gguf_from_hf_cache(repo_id)that returns the best GGUF filename from the same snapshot via_pick_best_gguf().list_gguf_variants(): honorHF_HUB_OFFLINE/TRANSFORMERS_OFFLINEup front; on any unhandled HF exception, fall back to the cache scan before re-raising.detect_gguf_model_remote(): same treatment, layered on top of the existing 3-attempt retry loop. The retries still run for transient flakes; the cache scan only kicks in after every attempt fails.studio/backend/core/inference/llama_cpp.py_download_gguf(): whenlist_repo_files()fails, resolvehf_variantto a real filename by scanning the local HF cache snapshot with the same word-boundary regex the online path uses. Only if the cache scan also turns up nothing does the legacy{repo}-{variant}.ggufheuristic run.load_model(): ifhf_repois set andHF_HUB_OFFLINEis not already in env, do a 2-second DNS probe ofhuggingface.co. If it fails, setHF_HUB_OFFLINE=1(andTRANSFORMERS_OFFLINE=1) for the process so the downstreamhf_hub_downloadcalls hit the cache instantly instead of spending ~25 s on five exponential retries before failing over.import osinside the disk-space try block (module-levelosis already imported).studio/backend/core/inference/worker.pyVerification (offline, model fully cached)
With
HF_HUB_OFFLINE=1set explicitly, the same three calls also complete in ~0 ms with zero network traffic.Online behavior
Unchanged. The HF API is called first; the cache scan is reached only on exception. The retry loop in
detect_gguf_model_remote()keeps its existing semantics (including the early-return on permanent errors likeRepositoryNotFoundError).Test plan
unsloth/Qwen3.5-4B-GGUFwith WiFi off and the model fully cached.unsloth/Qwen3.5-4B-GGUFoffline goes through llama-server (not the Unsloth worker) on macOS.unsloth/Qwen3.6-27B-MTP-GGUFoffline resolves to the real cached filename and starts llama-server.