studio: load cached GGUF models when fully offline by shimmyshimmer · Pull Request #5505 · unslothai/unsloth

shimmyshimmer · 2026-05-17T09:04:40Z

Summary

Three different "offline" failures in the GGUF load path, all because we ask huggingface.co for things we already have on disk. Each one surfaces a different error to the user despite the same underlying problem (DNS or HF Hub unreachable).

The fix is the same shape everywhere: try the network, fall back to scanning the local HF cache snapshot if the network is dead.

Failures observed

Variant dropdown empty. list_gguf_variants() in studio/backend/utils/models/model_config.py raises straight through HTTPException(500), UI shows Failed to list GGUF variants: ... Failed to resolve 'huggingface.co'.
Wrong backend on macOS. detect_gguf_model_remote() retries three times then returns None. The caller treats the GGUF-only repo as non-GGUF and routes it through the transformers/Unsloth path. On Apple Silicon the worker subprocess fails to import unsloth and the UI shows Failed to import ML libraries: Unsloth currently only works on NVIDIA, AMD and Intel GPUs.
Bogus filename in download. _download_gguf() in studio/backend/core/inference/llama_cpp.py loses list_repo_files() to the network and falls back to f"{repo_name}-{hf_variant}.gguf". When the file inside the repo does not echo the repo name (e.g. unsloth/Qwen3.6-27B-MTP-GGUF contains Qwen3.6-27B-UD-Q4_K_XL.gguf with no MTP), hf_hub_download cannot find that invented filename in the cache and raises Failed to download GGUF file ... we cannot find the requested files in the local cache.

All three repros tested with the model fully present in ~/.cache/huggingface/hub, internet off.

Changes

studio/backend/utils/models/model_config.py

Add _list_gguf_variants_from_hf_cache(repo_id) that walks models--<repo>/snapshots/<hash>/ and reuses list_local_gguf_variants().
Add _detect_gguf_from_hf_cache(repo_id) that returns the best GGUF filename from the same snapshot via _pick_best_gguf().
list_gguf_variants(): honor HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE up front; on any unhandled HF exception, fall back to the cache scan before re-raising.
detect_gguf_model_remote(): same treatment, layered on top of the existing 3-attempt retry loop. The retries still run for transient flakes; the cache scan only kicks in after every attempt fails.

studio/backend/core/inference/llama_cpp.py

_download_gguf(): when list_repo_files() fails, resolve hf_variant to a real filename by scanning the local HF cache snapshot with the same word-boundary regex the online path uses. Only if the cache scan also turns up nothing does the legacy {repo}-{variant}.gguf heuristic run.
load_model(): if hf_repo is set and HF_HUB_OFFLINE is not already in env, do a 2-second DNS probe of huggingface.co. If it fails, set HF_HUB_OFFLINE=1 (and TRANSFORMERS_OFFLINE=1) for the process so the downstream hf_hub_download calls hit the cache instantly instead of spending ~25 s on five exponential retries before failing over.
Drop a redundant import os inside the disk-space try block (module-level os is already imported).

studio/backend/core/inference/worker.py

Same DNS auto-detect at worker startup so non-GGUF subprocess loads inherit the offline hint too.

Verification (offline, model fully cached)

detect_gguf_model_remote('unsloth/Qwen3.5-4B-GGUF')
  -> Qwen3.5-4B-UD-Q4_K_XL.gguf                     (0.06 s)

list_gguf_variants('unsloth/Qwen3.5-4B-GGUF')
  -> 2 variants, vision=True                        (0.02 s)

_download_gguf(repo='unsloth/Qwen3.6-27B-MTP-GGUF', variant='UD-Q4_K_XL')
  -> Resolved variant UD-Q4_K_XL -> Qwen3.6-27B-UD-Q4_K_XL.gguf from local HF cache
  -> /Users/.../snapshots/.../Qwen3.6-27B-UD-Q4_K_XL.gguf

With HF_HUB_OFFLINE=1 set explicitly, the same three calls also complete in ~0 ms with zero network traffic.

Online behavior

Unchanged. The HF API is called first; the cache scan is reached only on exception. The retry loop in detect_gguf_model_remote() keeps its existing semantics (including the early-return on permanent errors like RepositoryNotFoundError).

Test plan

Variant dropdown lists Q4_K_XL / IQ2_XXS for unsloth/Qwen3.5-4B-GGUF with WiFi off and the model fully cached.
Loading unsloth/Qwen3.5-4B-GGUF offline goes through llama-server (not the Unsloth worker) on macOS.
Loading unsloth/Qwen3.6-27B-MTP-GGUF offline resolves to the real cached filename and starts llama-server.
Online flow for the same three repos still hits the HF API (no log warnings about cache fallback).

When huggingface.co is unreachable, GGUF model loads fail in three distinct places even though the bits are already in ~/.cache/huggingface/hub. Each failure has a different surface symptom: 1. list_gguf_variants() raises straight through HTTPException(500), so the variant dropdown shows 'Failed to list GGUF variants'. 2. detect_gguf_model_remote() silently returns None after retries fail. The caller then treats a GGUF-only repo as non-GGUF and routes it through the transformers/MLX path. On Apple Silicon this surfaces as 'Unsloth currently only works on NVIDIA, AMD and Intel GPUs.' 3. _download_gguf() loses list_repo_files() to the network and falls back to a filename heuristic ('{repo}-{variant}.gguf'). When the repo name does not echo the filenames (e.g. repo 'Qwen3.6-27B-MTP-GGUF' contains a file 'Qwen3.6-27B-UD-Q4_K_XL.gguf' with no MTP), hf_hub_download cannot find that invented filename in the cache and aborts. Fix in three layers: - list_gguf_variants / detect_gguf_model_remote: honor HF_HUB_OFFLINE and fall back to scanning the local HF cache snapshot when the API throws. detect_gguf_model_remote still keeps its retry loop for transient flakes; the cache fallback only kicks in after every attempt fails. - _download_gguf: when list_repo_files() fails, look up variant -> real filename inside the cached snapshot before resorting to the heuristic. - llama_cpp.load_model / inference worker startup: when DNS for huggingface.co fails (2s probe), set HF_HUB_OFFLINE=1 for the process so every hf_hub_download call below resolves from cache instantly instead of spending ~25s on five exponential retries. Online behavior is unchanged: the API is tried first and only used to fail over. The cache scan is a strict subset of what list_local_gguf_variants already does today for local paths.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 659cea1a20

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T09:08:50Z

+                                for p in snap.rglob("*.gguf")
+                                if "mmproj" not in p.name.lower()
+                                and boundary.search(p.name.lower())


Match cache variant against relative path, not basename

The offline cache fallback filters candidates with boundary.search(p.name.lower()), which ignores directory components. That breaks repos where the quant appears only in a subdirectory (the codebase already documents layouts like BF16/...gguf): those files are skipped, gguf_filename stays unset, and the code falls back to the synthetic {repo}-{variant}.gguf name that is often absent from cache. In a fully offline load this turns a valid cached model into a download failure.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-17T09:08:50Z

+            os.environ["HF_HUB_OFFLINE"] = "1"
+            os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")


Do not persist HF_HUB_OFFLINE on a single DNS miss

When startup DNS resolution fails once, this permanently sets HF_HUB_OFFLINE=1 for the entire worker lifetime. A transient resolver hiccup at process start therefore forces all subsequent model loads into offline mode (including when connectivity is later healthy), preventing normal Hub fetches until the worker is restarted.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request adds offline support for GGUF model detection and resolution by implementing local Hugging Face cache lookups and automatic reachability detection for huggingface.co. Key changes include adding fallback mechanisms in llama_cpp.py and model_config.py to serve model metadata from the local cache when the network is unavailable. Feedback suggests centralizing duplicated cache lookup logic, utilizing the is_offline_mode utility from huggingface_hub, and ensuring consistent logging when offline mode is triggered.

gemini-code-assist · 2026-05-17T09:09:50Z

+        except Exception:
+            os.environ["HF_HUB_OFFLINE"] = "1"
+            os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")


For consistency with the change in llama_cpp.py, it would be beneficial to log a warning when offline mode is enabled. This helps in debugging by making it clear why the application is operating in offline mode. In accordance with repository guidelines, ensure the warning message is dynamically generated to include the specific configuration values it refers to (e.g., the source or value of the offline setting) to ensure accuracy and avoid confusion.

References

User-facing warning messages should be dynamically generated to include the specific configuration values they refer to, rather than using hardcoded examples, to ensure accuracy and avoid confusion.

gemini-code-assist · 2026-05-17T09:09:50Z

+def _list_gguf_variants_from_hf_cache(
+    repo_id: str,
+) -> Optional[tuple[list[GgufVariantInfo], bool]]:
+    """Variants from the local HF cache snapshot, or None if not cached."""
+    try:
+        from huggingface_hub import constants as hf_constants
+    except Exception:
+        return None
+
+    cache_dir = Path(hf_constants.HF_HUB_CACHE)
+    if not cache_dir.is_dir():
+        return None
+
+    target = f"models--{repo_id.replace('/', '--')}".lower()
+    repo_dir: Optional[Path] = None
+    for entry in cache_dir.iterdir():
+        if entry.is_dir() and entry.name.lower() == target:
+            repo_dir = entry
+            break
+    if repo_dir is None:
+        return None
+
+    snapshots = repo_dir / "snapshots"
+    if not snapshots.is_dir():
+        return None
+
+    snap_dirs = [s for s in snapshots.iterdir() if s.is_dir()]
+    if not snap_dirs:
+        return None
+    snap_dirs.sort(key = lambda s: s.stat().st_mtime, reverse = True)
+    for snap in snap_dirs:
+        variants, has_vision = list_local_gguf_variants(str(snap))
+        if variants or has_vision:
+            return variants, has_vision
+    return None


The logic for finding a repository directory in the Hugging Face cache is duplicated in multiple locations (e.g., _detect_gguf_from_hf_cache and _download_gguf). This logic is also inefficient as it iterates over the entire cache directory. Consider creating a single helper function to find the repository directory to improve maintainability and performance, as per repository guidelines to centralize recurring logic and avoid redundant iterations.

References

Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.

To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

gemini-code-assist · 2026-05-17T09:09:50Z

+    offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in (
+        "1",
+        "true",
+        "yes",


This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.

from huggingface_hub.utils import is_offline_mode offline = is_offline_mode()

References

Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.

gemini-code-assist · 2026-05-17T09:09:50Z

+    offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in (
+        "1",
+        "true",
+        "yes",


This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.

from huggingface_hub.utils import is_offline_mode offline = is_offline_mode()

References

Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.

Fixes from the review pass on #5505: * ruff F823 (lint CI red): the late `import os` at the bottom of LlamaCppBackend.load_model made `os` a function-local name, so my new `os.environ` reference at the top of the same method was a use-before-bind. Surfaces at runtime as 'cannot access local variable os where it is not associated with a value' and is why the Mac/Windows Studio API jobs were failing too. The env-var mutation has been moved into a module-level contextmanager, so load_model no longer touches `os` directly. * Codex P1: cache variant match now uses the relative path, not the basename. Layouts like `BF16/foo.gguf` (variant token only in parent dir) were silently skipped, falling through to the bogus `{repo}-{variant}.gguf` heuristic and failing offline loads of models stored under quant-named subdirs. * Codex P1: HF_HUB_OFFLINE no longer persists past one model load. llama_cpp.load_model now uses a contextmanager that probes DNS, sets HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE only when DNS is dead, and pops them in finally (preserving any prior user setting of TRANSFORMERS_OFFLINE). Pre-existing user-set HF_HUB_OFFLINE is respected as a no-op. worker.py keeps the startup probe because the orchestrator spawns a fresh worker per load -- comment updated to make that lifecycle explicit, and a warning is now logged. * Gemini: cache-dir lookup centralized in `_iter_hf_cache_snapshots`. Three near-identical copies (in list/detect helpers and the llama_cpp offline scan) now go through one helper. * Gemini: `huggingface_hub.utils.is_offline_mode` does not exist in 1.x (verified locally); `huggingface_hub.constants.HF_HUB_OFFLINE` is snapshot-at-import-time and does not reflect runtime mutations. Manual env-var parsing kept. * socket probe now saves and restores the prior default timeout instead of unconditionally setting None on exit, so it composes with caller code that already configured a timeout. * worker.py probe now logs a warning when offline mode is auto-enabled so debugging the case isn't blind.

shimmyshimmer · 2026-05-17T09:41:24Z

Pushed 6476b9b addressing the review feedback. Verdicts inline:

Lint (ruff F823) — fixed. This was a real bug, not just a style nit. load_model had a late import os further down its body which made os a function-local name; my new os.environ reference at the top of the same method was therefore a use-before-bind. At runtime this surfaces as cannot access local variable 'os' where it is not associated with a value, and is the same error that caused the Mac/Windows Studio API jobs to fail with /api/inference/load -> 500. The env-var mutation has been moved into a module-level contextmanager (_hf_offline_if_dns_dead), so load_model no longer touches os directly.

Codex P1 #1 (match cache variant against relative path) — fixed. Real bug. The cache scan now matches against relative_to(snap).as_posix().lower() rather than p.name.lower(), so snapshot layouts where the quant lives in a parent dir (e.g. BF16/foo.gguf) are correctly found. Added regression tests.

Codex P1 #2 (do not persist HF_HUB_OFFLINE) — fixed. Real concern for the long-lived FastAPI process. llama_cpp.load_model now wraps the download phase in a _hf_offline_if_dns_dead() contextmanager that pops HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE in finally, so a transient DNS hiccup at the start of one load can't quarantine future loads. The contextmanager respects any pre-existing user setting (no-op if HF_HUB_OFFLINE is already in env) and preserves a user-set TRANSFORMERS_OFFLINE. For worker.py the scope is naturally one model load (orchestrator spawns a fresh subprocess per load), so the env mutation there is intentionally per-process; comment updated to make that lifecycle explicit.

Gemini (centralize cache-dir lookup) — fixed. Extracted _iter_hf_cache_snapshots(repo_id) in model_config.py. _list_gguf_variants_from_hf_cache, _detect_gguf_from_hf_cache, and the offline scan in llama_cpp._download_gguf all route through it now.

Gemini (use is_offline_mode from huggingface_hub) — declined, function does not exist. Verified locally on huggingface_hub 1.15.0: huggingface_hub.utils.is_offline_mode is not a public symbol. huggingface_hub.constants.HF_HUB_OFFLINE exists as a bool but is snapshot-at-import-time and does not reflect runtime env mutations, so it isn't usable for our case where the env var can be flipped mid-process. Kept the explicit env-var parsing, which also handles TRANSFORMERS_OFFLINE as a fallback.

Gemini (log warning when worker enables offline) — fixed. worker.py now emits huggingface.co unreachable; HF_HUB_OFFLINE=1 set for this worker. when the probe trips.

Drive-by improvements the review pass surfaced:

DNS probe now saves and restores the prior socket.getdefaulttimeout() instead of unconditionally setting None on exit, so it composes with caller code that may have set a timeout.
Cache iteration in _iter_hf_cache_snapshots now guards iterdir() against OSError (permission errors on shared caches).

Test coverage: the local suite that exercises the patched code is now 119 cases (102 from the original verification + 17 net new for the review fixes). Notable additions:

test_relative_path_match_with_subdir_variant: regression for the basename-only bug Codex found.
test_hf_offline_cm_sets_and_clears_on_dns_failure: verifies the env var is gone on context exit.
test_hf_offline_cm_cleans_up_even_on_exception: same, on raised exception inside the body.
test_hf_offline_cm_preserves_pre_existing_transformers_offline: cleanup does not blow away a user-set value.
test_dns_probe_restores_prior_socket_timeout: probe doesn't clobber a caller's prior timeout.

Pre-commit formatting (scripts/run_ruff_format.py + kwarg spacing) applied locally, so the pre-commit.ci bot shouldn't need to add a follow-up commit this time.

danielhanchen

I pulled the branch down, set up an isolated harness with the same pinned huggingface_hub==0.36.2, ran 41 simulated scenarios against the three changed call paths (fake HF cache with both flat and snapshot-symlink layouts, monkey-patched hf_model_info / list_repo_files, monkey-patched socket.gethostbyname). Full suite passes; results back both bot P1 findings empirically. Sim lives at workspace_32/temp/pr5505_sim/ if you want it.

The PR is solving real bugs: I reproduced all three failure modes against the pre-PR code (case 06 raises offline, case 15 builds the synthetic Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf that misses the cache). Online behavior is unchanged versus main (case 27). Cross-platform-wise the new code is pure stdlib + pathlib.Path + huggingface_hub.constants.HF_HUB_CACHE, no sys.platform branches added; the realistic-layout suite covers HF cache symlinks (snapshots/<sha>/<name> into blobs/) and dangling symlinks from partial downloads.

P1

1. _download_gguf offline cache filter regresses online/offline parity.
studio/backend/core/inference/llama_cpp.py:1814 matches on boundary.search(p.name.lower()) (basename), while the online block at :1748-1757 matches on the full path from list_repo_files (which can return rfilenames like BF16/foo.gguf). The codebase already supports the subdir-only layout: model_config.py:1020, :1213, :1258, :1443-1446, :1490-1491, :2293-2300. Simulation case 16 / E4 reproduces this end to end against a cache containing only BF16/foo.gguf: PR resolves to the synthetic gpt-oss-20b-BF16-BF16.gguf (which is not on disk) and the load fails; case 17 / E4 shows the one-character fix passing.

Suggested change (one line):

# studio/backend/core/inference/llama_cpp.py around line 1814
matches = sorted(
    p.relative_to(snap).as_posix()
    for p in snap.rglob("*.gguf")
    if "mmproj" not in p.name.lower()
    and boundary.search(p.relative_to(snap).as_posix().lower())
)

The basename-only failure is latent today (every in-repo example I found still has the quant token in the basename) but it diverges from the online filter right above and silently turns valid cached repos into download failures the moment any future unsloth GGUF ships a subdir-only layout. Cheap to fix now.

2. LlamaCppBackend.load_model permanently mutates HF_HUB_OFFLINE on one transient DNS miss.
llama_cpp.py:2052-2067 sets os.environ["HF_HUB_OFFLINE"] = "1" and os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") in the except branch and never restores them. LlamaCppBackend is a module-level singleton (routes/inference.py:433 -> _llama_cpp_backend = LlamaCppBackend()), reused across every /load for the lifetime of the FastAPI server. After a single DNS hiccup at first load the env var is sticky forever: when connectivity recovers the next uncached-repo load short-circuits the probe (if hf_repo and "HF_HUB_OFFLINE" not in os.environ), hf_hub_download keeps refusing the Hub, and the user sees stale-cache errors with no recovery short of restarting the server. grep "del os.environ\['HF_HUB_OFFLINE'\]\\|os.environ.pop.*HF_HUB_OFFLINE" studio/ returns zero hits.

Simulation case 19 reproduces the leak: DNS fails -> env stuck at 1 even after DNS recovers. Case 20 / 21 / 22 show the scoped variant restores the env, preserves a user-set HF_HUB_OFFLINE=1, and is a no-op when DNS succeeds.

worker.py:651-663 does the same thing but isn't a real bug there because the worker is one-shot per load: orchestrator.py:606-608 documents "Always spawns a fresh subprocess for each model load" and _CTX.Process(...) at :181-192 confirms it. Each load gets a fresh probe and a fresh process env. The block in worker.py can stay as-is; only llama_cpp.py:2052-2067 needs to change.

Suggested change (scope the mutation):

# studio/backend/core/inference/llama_cpp.py around line 2052
self._cancel_event.clear()

_offline_keys: list[str] = []
if hf_repo and "HF_HUB_OFFLINE" not in os.environ:
    import socket as _socket
    try:
        _socket.setdefaulttimeout(2.0)
        _socket.gethostbyname("huggingface.co")
    except Exception:
        os.environ["HF_HUB_OFFLINE"] = "1"
        _offline_keys.append("HF_HUB_OFFLINE")
        if "TRANSFORMERS_OFFLINE" not in os.environ:
            os.environ["TRANSFORMERS_OFFLINE"] = "1"
            _offline_keys.append("TRANSFORMERS_OFFLINE")
        logger.warning(
            "huggingface.co unreachable, enabling HF_HUB_OFFLINE for this load."
        )
    finally:
        _socket.setdefaulttimeout(None)
try:
    # ... existing kill / launch / hf_hub_download path ...
finally:
    for _k in _offline_keys:
        os.environ.pop(_k, None)

_offline_keys ensures we only pop the variables this code added; a user who explicitly set HF_HUB_OFFLINE=1 before launching Studio keeps that value across the load. Alternative shape if you'd rather not touch os.environ at all: thread local_files_only=True into the specific hf_hub_download / get_paths_info calls reached after the probe fails. No local_files_only= usage anywhere in studio/ today (zero grep hits), so this would be the first instance; the scoped try/finally is the smaller diff.

P2

3. Centralise the cache repo-dir lookup. Three sites (_list_gguf_variants_from_hf_cache, _detect_gguf_from_hf_cache, and the new inline block at llama_cpp.py:1797-1840) hand-roll the same cache_dir / f"models--{repo_id.replace('/', '--')}".lower() resolution plus snapshot iteration sorted by mtime. One _find_repo_cache_dir(repo_id) -> Optional[Path] (or _iter_repo_snapshots(repo_id)) helper in model_config.py keeps the three sites in sync, and the fix in P1 #1 lives in exactly one place.

4. worker.py:651-663 should log when auto-offline mode kicks in, matching the logger.warning(...) already in llama_cpp.py:2061. Pure observability nit.

5. Drop the huggingface_hub.utils.is_offline_mode suggestion from the bot's medium comments. That function does not exist in pinned huggingface_hub==0.36.2. I verified with python -c "from huggingface_hub.utils import is_offline_mode" -> ImportError, and again with from huggingface_hub import is_offline_mode -> ImportError. The closest modern API is huggingface_hub.constants.HF_HUB_OFFLINE (module-level bool computed from env at import), but the current manual env check is also fine and is more flexible because it accepts "true" / "yes" (simulated in cases E11 / E12 / E13). Keep what you have.

Nits

The cache fallback in list_gguf_variants / detect_gguf_model_remote catches bare Exception and serves cache on every HF API error. The retry loop in detect_gguf_model_remote already early-returns on RepositoryNotFoundError / GatedRepoError / RevisionNotFoundError / EntryNotFoundError; consider mirroring that for the outer except Exception cache fallback so a stale cache cannot quietly paper over a deliberate access-control or revision failure. Not blocking, the synthetic-name fallback after both API and cache miss still raises eventually.
Concurrency edge: socket.setdefaulttimeout(2.0) is process-global. If two /load requests land in the same FastAPI worker concurrently, the second thread's probe can see the first thread's reset to None mid-flight (or vice versa). Simulation case E6 doesn't repro the worst case but observes the race window. Low priority; only matters once you have parallel-load tenants.
The PR-description bullet "Drop a redundant import os inside the disk-space try block" is fine but unrelated to offline; easy to split into its own commit if you keep the offline-fix PR tight.

Verification

If you want to reproduce the proofs locally without retraining anything:

cd workspace_32/temp/pr5505_sim
uv venv .venv && source .venv/bin/activate
uv pip install "huggingface_hub==0.36.2" structlog pytest
python run_sim.py        # 28 cases, flat fake cache
python run_sim_extras.py # 13 cases, realistic snapshot-symlink cache + concurrency

Cases that reproduce the two P1s in isolation:

Case	What it shows
06 / 15	BEFORE-PR offline path raises / builds bogus filename (the bug the PR fixes)
14	PR fixes the `Qwen3.6-27B-MTP-GGUF` case
16 / E4	P1 #1 reproduced: PR's basename filter misses `BF16/foo.gguf`
17 / E4	One-line fix (relative-path filter) resolves the subdir layout
19	P1 #2 reproduced: `HF_HUB_OFFLINE` stays `1` after DNS recovery
20 / 21 / 22	Scoped try/finally restores env, preserves user-set value, no-op on success
27	Online happy path unchanged versus main
E1-E5	Realistic HF cache layout (snapshot symlinks into blobs/, dangling links)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6476b9b9bc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T09:45:36Z

+            os.environ["HF_HUB_OFFLINE"] = "1"
+            os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")


Avoid forcing offline mode after one DNS lookup failure

A transient failure in _socket.gethostbyname("huggingface.co") now unconditionally sets HF_HUB_OFFLINE=1 for the entire worker process, so the same load attempt will refuse all Hub HTTP calls even if connectivity is actually available moments later. This regresses online reliability for partially cached models: instead of retrying/download behavior, the load fails immediately in offline-only mode until a new subprocess is spawned. Fresh evidence in this revision is that run_inference_process sets the env once at startup while the spawned worker is kept alive through the loaded handshake in orchestrator.load_model, so every Hub call in that attempt inherits offline mode.

Useful? React with 👍 / 👎.

Lock in the offline fallback path from #5505 so future refactors can't silently regress either bug. 26 tests, 0.55 s, no network/GPU/subprocess. Covers: * _iter_hf_cache_snapshots: missing cache, missing repo, missing snapshots/, newest-mtime ordering, case-insensitive repo match. * _list_gguf_variants_from_hf_cache and the list_gguf_variants online/offline-env/API-exception/reraise paths. * _detect_gguf_from_hf_cache and detect_gguf_model_remote 3x-fail fallback. Pre-existing RepositoryNotFoundError early-return preserved. * Codex P1 #1 regression: BF16/foo.gguf (quant only in subdir name) must resolve via _detect_gguf_from_hf_cache, which now matches the snapshot-relative path rather than the basename. * _probe_dns_dead: returns True/False, restores prior socket timeout. * Codex P1 #2 regression: _hf_offline_if_dns_dead sets env only inside the block, restores on exit (including on exception), re-probes DNS on the next call so a transient hiccup cannot lock the long-lived LlamaCppBackend singleton offline. Honors a user-set HF_HUB_OFFLINE as a no-op. Preserves a user-set TRANSFORMERS_OFFLINE across exit. Follows the existing studio backend test stub pattern (loggers / structlog / httpx stubs + backend dir on sys.path).

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c27d1da1b6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T10:08:24Z

+) -> Optional[tuple[list[GgufVariantInfo], bool]]:
+    """Variants from the local HF cache snapshot, or None if not cached."""
+    for snap in _iter_hf_cache_snapshots(repo_id):
+        variants, has_vision = list_local_gguf_variants(str(snap))


Derive offline variants from relative paths

The new cache fallback calls list_local_gguf_variants(str(snap)), but that helper computes quant labels from each file basename (f.name) rather than the snapshot-relative path. In repos that store quants in directories (for example BF16/model.gguf and Q4_K_M/model.gguf), both files collapse to the same pseudo-quant (model), so the offline variant list can lose/distort variants and drive incorrect default variant selection. This regression is specific to the new offline HF-cache path because the online path receives full repo filenames.

Useful? React with 👍 / 👎.

…abel Two follow-up fixes from the review pass on #5505: * _download_mmproj() now mirrors _download_gguf()'s offline path: when list_repo_files() fails, scan the local HF cache snapshot for any GGUF whose basename starts with mmproj-. Without this, offline vision GGUF loads succeed at the main weight (the existing PR fix) but the mmproj returns None and llama-server starts without vision support. Same _iter_hf_cache_snapshots helper, F16 preference and fallback to the first match are preserved. * _extract_quant_label() now considers parent directory segments when the basename has no quant token. Layouts like BF16/foo.gguf are already documented in this file and are returned by the new snapshot-relative-path filter in _download_gguf; before this fix their variant label collapsed to "foo" (the last hyphen segment of the basename). Regex is the same; the search just walks parent segments innermost-first if the basename misses. Tests (studio/backend/tests/test_offline_gguf_cache_fallback.py): * TestExtractQuantLabelSubdir: basename quant unchanged, quant-only- in-parent, UD- prefix in parent, deeper nesting picks the innermost matching segment. * TestDownloadMmprojOfflineCacheFallback: cache fallback returns the mmproj when list_repo_files fails, F16 preference holds when both variants are in cache, no-mmproj cache returns None. * httpx stub now prefers the real package when installed (the CI install list already includes it) and falls back to the stub only when httpx is genuinely missing. Newer huggingface_hub imports HTTPError/Response/Request at module load, so the previous fixed-set stub broke when those names were added upstream. 26 existing cases plus 7 new = 33 pass in 0.74s.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6c17f750c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T12:10:52Z

+        # API failed; fall back to local snapshot if fully downloaded.
+        cached = _list_gguf_variants_from_hf_cache(repo_id)
+        if cached is not None:


Keep permanent Hub errors from using stale cache

When the Hub API raises a permanent error while the process is not explicitly offline (for example RepositoryNotFoundError, GatedRepoError, or RevisionNotFoundError for a deleted/private repo or bad token), this catch-all now still returns any stale local snapshot. That makes the variant list look valid and can let the UI proceed with cached data even though the repo access/name is definitively invalid; detect_gguf_model_remote already preserves these errors by not consulting cache for the same exception names.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-17T12:10:52Z

+        rel_files = [
+            f.relative_to(snap).as_posix()
+            for f in _iter_gguf_files(snap, recursive = True)
+        ]


Exclude mmproj files from cached model detection

In the offline cache path this list includes vision projector GGUFs, so _pick_best_gguf can return mmproj-*-F16.gguf as the main model when the cache is partial or when the actual model filename lacks a preferred quant token. That routes the repo to llama.cpp with the projector file as -m, instead of failing or selecting a non-mmproj model; the other GGUF listing paths explicitly skip mmproj files for this reason.

Useful? React with 👍 / 👎.

Resolves conflict in studio/backend/core/inference/llama_cpp.py: the offline DNS scoping wrapper around _download_gguf/_download_mmproj now sits inside the new _serial_load_lock + _already_in_target_state early-return added on main.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: feaeec5a90

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T13:06:50Z

+    prev = socket.getdefaulttimeout()
+    socket.setdefaulttimeout(timeout)
+    try:


Avoid mutating global socket timeout in DNS probe

_probe_dns_dead sets socket.setdefaulttimeout(timeout) before calling gethostbyname, which changes the default timeout process-wide for the duration of the probe. During a load_model call this can affect unrelated concurrent requests in the same backend process: any path that opens sockets without an explicit timeout will inherit 2.0s and may fail spuriously. This creates cross-request, timing-dependent network failures that are hard to diagnose.

Useful? React with 👍 / 👎.

Four review findings tightened, with regression tests: - list_local_gguf_variants subdir collapse (P1 codex 10:08): pass the snapshot-relative path to _extract_quant_label so BF16/foo.gguf and Q4_K_M/foo.gguf produce distinct labels instead of folding to the same basename pseudo-quant. - list_gguf_variants cache fallback (P2 codex 12:10): surface RepositoryNotFoundError / GatedRepoError / RevisionNotFoundError / EntryNotFoundError to the caller instead of masking with stale cache, matching detect_gguf_model_remote. - _detect_gguf_from_hf_cache mmproj (P2 codex 12:10): exclude mmproj files from the candidate list so a partial cache with only a vision projector cannot route the projector as the main model. - _probe_dns_dead global timeout (P2 codex 13:06): run the gethostbyname on a daemon thread with join timeout so concurrent sockets in the same interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Same shape applied in worker.py's startup probe.

unslothai#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.

Same fix as unslothai#5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.

Two layered fixes for the Windows GGUF smoke CI Tool calling Tests flake that exit-22'd on a single httpx.ReadError during llama-server warmup. The 'windows-latest -> windows-2025-vs2026' image rollout is hitting main with the identical symptom. A. _wait_for_health: catch httpx.ReadError, RemoteProtocolError, WriteError alongside ConnectError and TimeoutException. A TCP RST mid-read while llama-server is still binding the port (WinError 10054) is a 'still warming up' signal, not fatal. The existing _process.poll() check still wins for real crashes. B. _drain_stdout + spawn: tee llama-server stdout/stderr to a per-launch log file at ~/.unsloth/studio/logs/llama-server/ <port>.log. Any future subprocess crash leaves a forensic trace on disk even when Studio's traceback only captures the symptom (ReadError) and not the cause. Best-effort: a logging-side OSError never blocks the load. Regression coverage: TestWaitForHealthRetriesOnReadError pins the retry behaviour for the three new exception types and verifies that a real process exit still short-circuits the loop.

for more information, see https://pre-commit.ci

Composite fix for the Tool calling Tests flake that exit-22'd on a single httpx.ReadError during llama-server warm-up. The windows-latest -> windows-2025-vs2026 runner image rollout has been hitting main with the identical symptom. - All three jobs (openai-anthropic, tool-calling, json-images) now retry POST /api/inference/load up to 3 times with 10s backoff and preserve the response body for post-mortem. One transient 500 no longer fails the whole job. - A new "Collect llama-server logs" step copies the per-launch llama-server stdout teed by Studio under ~/.unsloth/studio/logs/ llama-server/ into the workspace, and the upload-artifact step now includes logs/llama-server/*.log so any future subprocess crash leaves a forensic trace.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3f9ce6394

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T02:32:47Z

+            HTTP=$(curl -s -o /tmp/load.json -w '%{http_code}' \
+              -X POST "http://127.0.0.1:${STUDIO_PORT}/api/inference/load" \
+              -H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
+              --max-time 600 \
+              -d "{\"model_path\":\"$GGUF_REPO\",\"gguf_variant\":\"$GGUF_VARIANT\",\"is_lora\":false,\"max_seq_length\":2048}")


Handle curl failures inside the retry loop

In this workflow shell: bash is executed with -e/pipefail (GitHub documents shell: bash as bash --noprofile --norc -eo pipefail {0}), so if curl hits the transient TCP reset/timeout this retry was added for, the non-zero exit from the command substitution aborts the step before attempt 2 or 3 runs. This makes the new retry only cover HTTP responses like 500, not the connection-level failures described in the comment; the same pattern is repeated in the other load retry blocks.

Useful? React with 👍 / 👎.

#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.

Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.

#5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed.

Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this.

Resolve conflict in studio/backend/core/inference/llama_cpp.py: _wait_for_health's except clause. Main's unslothai#5505 landed essentially the same fix this branch shipped in 56537b6 -- catching httpx.ReadError and RemoteProtocolError to keep a transient TCP RST from masking a real subprocess crash -- and added httpx.WriteError on top. Took main's version (strict superset) and dropped this branch's duplicate. Extend tests/test_llama_cpp_wait_for_health.py: - Add WriteError to the stub-completion list so the file imports cleanly against either real httpx or a sibling test's stub. - Add test_write_error_also_swallowed alongside the existing ReadError and RemoteProtocolError cases. All other files in this merge auto-resolved cleanly (install_llama_prebuilt.py, workflows, etc.). Local verification: - pytest test_llama_cpp_wait_for_health.py + windows_nvidia_path + windows_gpu_detection_mock + llama_server_args -- 116 passed. - pytest test_import_fixes_drift.py -- 18 passed. - pytest test_offline_gguf_cache_fallback.py (main's unslothai#5505 addition) -- 44 passed.

…#5512) * studio: extend offline DNS auto-detect to inference parent + training #5505 fixed the GGUF/llama-server load path. Studio still has two adjacent code paths that burn ~30-60s of soft-failed timeouts before the worker subprocess starts when DNS to huggingface.co is dead and the model is already in the local HF cache. Inference parent process (routes/inference.py:load_model): * ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead so the LoRA-detect hf_model_info call and the urllib config probes in utils/transformers_version.py short-circuit when DNS is dead. * utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/ TRANSFORMERS_OFFLINE check used by list_gguf_variants and detect_gguf_model_remote into a shared _env_offline() helper, then reused it to gate the LoRA-detect hf_model_info call. * utils/transformers_version.py: _check_tokenizer_config_needs_v5 and _check_config_needs_550 now early-return False when offline instead of issuing a 10s urllib.urlopen against huggingface.co/raw/main. Training worker (core/training/worker.py:run_training_process): * Add the same 2s DNS probe used by core/inference/worker.py at the top of the training subprocess. On failure, set HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of the subprocess imports torch/transformers/unsloth, so every from_pretrained, snapshot_download, and load_dataset call below resolves from cache. Scope is per-subprocess; the orchestrator always spawns a fresh worker per training run. Training trainer (core/training/trainer.py:load_model): * Skip the proactive hf_model_info gated-repo probe when _env_offline() is true. The API is unreachable anyway, and a gated model that is already cached is exactly the scenario the user is trying to train against. from_pretrained surfaces the real error if access is actually denied. Tests (tests/test_offline_inference_parent.py, 7 new cases): * _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE. * transformers_version urllib short-circuit when offline. * LoRA detect hf_model_info skip when offline. Existing tests/test_offline_gguf_cache_fallback.py still passes (26 cases) because the inline env check was extracted, not changed. * tests: prefer real httpx over stub in offline-test files The studio test stub convention only included the 6 httpx exception names that existed callers needed. Newer huggingface_hub (1.15+) imports HTTPError, Response, Request, HTTPStatusError, AsyncClient, and more at module import time. When httpx is truly absent the stub chase becomes a treadmill. Use the real package when installed (the CI install list already includes httpx, so this is the production environment). Fall back to the stub only when httpx is genuinely missing. No code under test changes. * studio: detect cached LoRA adapters offline; tighten test Two follow-ups from the review pass on #5512: * ModelConfig.from_identifier no longer skips the remote LoRA-detect hf_model_info call when _env_offline() is true. huggingface_hub short-circuits the call via OfflineModeIsEnabled in ~0ms when HF_HUB_OFFLINE is set, so the original 25s concern was moot once routes/inference.py wrapped the call in _hf_offline_if_dns_dead. Skipping the API meant users with a cached LoRA adapter (adapter_config.json on disk) got is_lora=False and the load failed. After the API call (which raises fast offline) a new cache-fallback walks the HF cache snapshot for adapter_config.json via the existing _iter_hf_cache_snapshots helper. * test_hf_model_info_not_called_when_offline replaced. The old test raised AssertionError inside production code that catches Exception, so it passed even if the call happened. New tests use MagicMock and assert call_count >= 1, plus a fixture that stages a fake HF cache with adapter_config.json to verify the offline cache detection. Test count goes from 7 to 8 in test_offline_inference_parent.py. Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s. * Fix/adjust offline training DNS probe per PR #5505 review Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a daemon thread with join timeout so concurrent sockets in the parent interpreter never inherit a process-wide socket.setdefaulttimeout mutation. Adds a static-pin regression test that the inference parent file does not regress on this. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Trim verbose code comments per review feedback Shorten the longer explanatory comments added by this PR while keeping the WHY of each non-obvious branch: - trainer.py: collapse the 5-line proactive gated-check comment. - training/worker.py: trim the offline auto-detect preamble and the "logger isn't configured" note. - routes/inference.py: shorten the DNS-probe wrap rationale. - transformers_version.py: collapse the two urllib short-circuit notes. - model_config.py: shorten the LoRA detect + cache-fallback notes. - tests/test_offline_inference_parent.py: tighter module docstring, trim class docstrings, drop multi-line explainer comments inside the tests; behaviour and coverage unchanged (9/9 tests still pass). --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

The orchestrator's iter-1 refactor of llama_cpp.py inadvertently removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by unslothai#5505 between when this branch forked and the orchestrator's merge), which caused tests/test_offline_gguf_cache_fallback.py to fail collection across Python 3.10 / 3.11 / 3.12 / 3.13: ImportError: cannot import name '_hf_offline_if_dns_dead' from 'core.inference.llama_cpp' The original intent of this PR for llama_cpp.py was only to delegate the existing _parse_tool_calls_from_text staticmethod to the shared core/inference/tool_call_parser.py, so this commit: 1. Restores studio/backend/core/inference/llama_cpp.py to origin/main verbatim. 2. Re-adds the single import of parse_tool_calls_from_text from the shared parser module. 3. Re-applies the staticmethod-body swap to call the shared parser. Net delta vs main is now small (the shared parser pulls the body out; the DNS-offline helpers and every other GGUF feature stay exactly as main has them). Test pass count after the fix (all on Linux Python 3.11): * 41 safetensors tool-loop tests * 44 offline GGUF cache fallback tests (the previously failing file) * 217 other related tool / inference / anthropic tests = 302 total

shimmyshimmer requested a review from rolandtannous as a code owner May 17, 2026 09:04

pre-commit-ci Bot and others added 2 commits May 17, 2026 09:04

[pre-commit.ci] auto fixes from pre-commit.com hooks

659cea1

for more information, see https://pre-commit.ci

studio: tighten inline comments on offline GGUF fallback

5e9eb46

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

danielhanchen reviewed May 17, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

danielhanchen and others added 2 commits May 17, 2026 10:03

[pre-commit.ci] auto fixes from pre-commit.com hooks

c27d1da

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

This was referenced May 17, 2026

studio: extend offline DNS auto-detect to inference parent + training #5512

Merged

__init__: include HF_DATASETS_OFFLINE in the offline env cross-sync unslothai/unsloth-zoo#664

Merged

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

danielhanchen and others added 2 commits May 17, 2026 13:01

Merge branch 'main' into offline-gguf-cache-fallback

65e2171

Resolves conflict in studio/backend/core/inference/llama_cpp.py: the offline DNS scoping wrapper around _download_gguf/_download_mmproj now sits inside the new _serial_load_lock + _already_in_target_state early-return added on main.

[pre-commit.ci] auto fixes from pre-commit.com hooks

feaeec5

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

danielhanchen added 2 commits May 17, 2026 16:34

Merge branch 'main' into offline-gguf-cache-fallback

9af262e

danielhanchen and others added 3 commits May 18, 2026 02:25

[pre-commit.ci] auto fixes from pre-commit.com hooks

9480806

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

danielhanchen merged commit 3ff6204 into main May 18, 2026
32 checks passed

danielhanchen deleted the offline-gguf-cache-fallback branch May 18, 2026 04:25

		os.environ["HF_HUB_OFFLINE"] = "1"
		os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")

Uh oh!

Conversation

shimmyshimmer commented May 17, 2026

Summary

Failures observed

Changes

Verification (offline, model fully cached)

Online behavior

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

shimmyshimmer commented May 17, 2026

Uh oh!

danielhanchen left a comment

Choose a reason for hiding this comment

P1

P2

Nits

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!