Skip to content

studio: load cached GGUF models when fully offline#5505

Merged
danielhanchen merged 14 commits into
mainfrom
offline-gguf-cache-fallback
May 18, 2026
Merged

studio: load cached GGUF models when fully offline#5505
danielhanchen merged 14 commits into
mainfrom
offline-gguf-cache-fallback

Conversation

@shimmyshimmer
Copy link
Copy Markdown
Member

Summary

Three different "offline" failures in the GGUF load path, all because we ask huggingface.co for things we already have on disk. Each one surfaces a different error to the user despite the same underlying problem (DNS or HF Hub unreachable).

The fix is the same shape everywhere: try the network, fall back to scanning the local HF cache snapshot if the network is dead.

Failures observed

  1. Variant dropdown empty. list_gguf_variants() in studio/backend/utils/models/model_config.py raises straight through HTTPException(500), UI shows Failed to list GGUF variants: ... Failed to resolve 'huggingface.co'.
  2. Wrong backend on macOS. detect_gguf_model_remote() retries three times then returns None. The caller treats the GGUF-only repo as non-GGUF and routes it through the transformers/Unsloth path. On Apple Silicon the worker subprocess fails to import unsloth and the UI shows Failed to import ML libraries: Unsloth currently only works on NVIDIA, AMD and Intel GPUs.
  3. Bogus filename in download. _download_gguf() in studio/backend/core/inference/llama_cpp.py loses list_repo_files() to the network and falls back to f"{repo_name}-{hf_variant}.gguf". When the file inside the repo does not echo the repo name (e.g. unsloth/Qwen3.6-27B-MTP-GGUF contains Qwen3.6-27B-UD-Q4_K_XL.gguf with no MTP), hf_hub_download cannot find that invented filename in the cache and raises Failed to download GGUF file ... we cannot find the requested files in the local cache.

All three repros tested with the model fully present in ~/.cache/huggingface/hub, internet off.

Changes

studio/backend/utils/models/model_config.py

  • Add _list_gguf_variants_from_hf_cache(repo_id) that walks models--<repo>/snapshots/<hash>/ and reuses list_local_gguf_variants().
  • Add _detect_gguf_from_hf_cache(repo_id) that returns the best GGUF filename from the same snapshot via _pick_best_gguf().
  • list_gguf_variants(): honor HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE up front; on any unhandled HF exception, fall back to the cache scan before re-raising.
  • detect_gguf_model_remote(): same treatment, layered on top of the existing 3-attempt retry loop. The retries still run for transient flakes; the cache scan only kicks in after every attempt fails.

studio/backend/core/inference/llama_cpp.py

  • _download_gguf(): when list_repo_files() fails, resolve hf_variant to a real filename by scanning the local HF cache snapshot with the same word-boundary regex the online path uses. Only if the cache scan also turns up nothing does the legacy {repo}-{variant}.gguf heuristic run.
  • load_model(): if hf_repo is set and HF_HUB_OFFLINE is not already in env, do a 2-second DNS probe of huggingface.co. If it fails, set HF_HUB_OFFLINE=1 (and TRANSFORMERS_OFFLINE=1) for the process so the downstream hf_hub_download calls hit the cache instantly instead of spending ~25 s on five exponential retries before failing over.
  • Drop a redundant import os inside the disk-space try block (module-level os is already imported).

studio/backend/core/inference/worker.py

  • Same DNS auto-detect at worker startup so non-GGUF subprocess loads inherit the offline hint too.

Verification (offline, model fully cached)

detect_gguf_model_remote('unsloth/Qwen3.5-4B-GGUF')
  -> Qwen3.5-4B-UD-Q4_K_XL.gguf                     (0.06 s)

list_gguf_variants('unsloth/Qwen3.5-4B-GGUF')
  -> 2 variants, vision=True                        (0.02 s)

_download_gguf(repo='unsloth/Qwen3.6-27B-MTP-GGUF', variant='UD-Q4_K_XL')
  -> Resolved variant UD-Q4_K_XL -> Qwen3.6-27B-UD-Q4_K_XL.gguf from local HF cache
  -> /Users/.../snapshots/.../Qwen3.6-27B-UD-Q4_K_XL.gguf

With HF_HUB_OFFLINE=1 set explicitly, the same three calls also complete in ~0 ms with zero network traffic.

Online behavior

Unchanged. The HF API is called first; the cache scan is reached only on exception. The retry loop in detect_gguf_model_remote() keeps its existing semantics (including the early-return on permanent errors like RepositoryNotFoundError).

Test plan

  • Variant dropdown lists Q4_K_XL / IQ2_XXS for unsloth/Qwen3.5-4B-GGUF with WiFi off and the model fully cached.
  • Loading unsloth/Qwen3.5-4B-GGUF offline goes through llama-server (not the Unsloth worker) on macOS.
  • Loading unsloth/Qwen3.6-27B-MTP-GGUF offline resolves to the real cached filename and starts llama-server.
  • Online flow for the same three repos still hits the HF API (no log warnings about cache fallback).

When huggingface.co is unreachable, GGUF model loads fail in three distinct
places even though the bits are already in ~/.cache/huggingface/hub. Each
failure has a different surface symptom:

1. list_gguf_variants() raises straight through HTTPException(500), so the
   variant dropdown shows 'Failed to list GGUF variants'.

2. detect_gguf_model_remote() silently returns None after retries fail. The
   caller then treats a GGUF-only repo as non-GGUF and routes it through the
   transformers/MLX path. On Apple Silicon this surfaces as 'Unsloth currently
   only works on NVIDIA, AMD and Intel GPUs.'

3. _download_gguf() loses list_repo_files() to the network and falls back to a
   filename heuristic ('{repo}-{variant}.gguf'). When the repo name does not
   echo the filenames (e.g. repo 'Qwen3.6-27B-MTP-GGUF' contains a file
   'Qwen3.6-27B-UD-Q4_K_XL.gguf' with no MTP), hf_hub_download cannot find
   that invented filename in the cache and aborts.

Fix in three layers:

- list_gguf_variants / detect_gguf_model_remote: honor HF_HUB_OFFLINE and
  fall back to scanning the local HF cache snapshot when the API throws.
  detect_gguf_model_remote still keeps its retry loop for transient flakes;
  the cache fallback only kicks in after every attempt fails.

- _download_gguf: when list_repo_files() fails, look up variant -> real
  filename inside the cached snapshot before resorting to the heuristic.

- llama_cpp.load_model / inference worker startup: when DNS for
  huggingface.co fails (2s probe), set HF_HUB_OFFLINE=1 for the process so
  every hf_hub_download call below resolves from cache instantly instead of
  spending ~25s on five exponential retries.

Online behavior is unchanged: the API is tried first and only used to fail
over. The cache scan is a strict subset of what list_local_gguf_variants
already does today for local paths.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 659cea1a20

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1814 to +1816
for p in snap.rglob("*.gguf")
if "mmproj" not in p.name.lower()
and boundary.search(p.name.lower())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match cache variant against relative path, not basename

The offline cache fallback filters candidates with boundary.search(p.name.lower()), which ignores directory components. That breaks repos where the quant appears only in a subdirectory (the codebase already documents layouts like BF16/...gguf): those files are skipped, gguf_filename stays unset, and the code falls back to the synthetic {repo}-{variant}.gguf name that is often absent from cache. In a fully offline load this turns a valid cached model into a download failure.

Useful? React with 👍 / 👎.

Comment on lines +661 to +662
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not persist HF_HUB_OFFLINE on a single DNS miss

When startup DNS resolution fails once, this permanently sets HF_HUB_OFFLINE=1 for the entire worker lifetime. A transient resolver hiccup at process start therefore forces all subsequent model loads into offline mode (including when connectivity is later healthy), preventing normal Hub fetches until the worker is restarted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds offline support for GGUF model detection and resolution by implementing local Hugging Face cache lookups and automatic reachability detection for huggingface.co. Key changes include adding fallback mechanisms in llama_cpp.py and model_config.py to serve model metadata from the local cache when the network is unavailable. Feedback suggests centralizing duplicated cache lookup logic, utilizing the is_offline_mode utility from huggingface_hub, and ensuring consistent logging when offline mode is triggered.

Comment thread studio/backend/core/inference/worker.py Outdated
Comment on lines +659 to +661
except Exception:
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the change in llama_cpp.py, it would be beneficial to log a warning when offline mode is enabled. This helps in debugging by making it clear why the application is operating in offline mode. In accordance with repository guidelines, ensure the warning message is dynamically generated to include the specific configuration values it refers to (e.g., the source or value of the offline setting) to ensure accuracy and avoid confusion.

References
  1. User-facing warning messages should be dynamically generated to include the specific configuration values they refer to, rather than using hardcoded examples, to ensure accuracy and avoid confusion.

Comment on lines +1286 to +1320
def _list_gguf_variants_from_hf_cache(
repo_id: str,
) -> Optional[tuple[list[GgufVariantInfo], bool]]:
"""Variants from the local HF cache snapshot, or None if not cached."""
try:
from huggingface_hub import constants as hf_constants
except Exception:
return None

cache_dir = Path(hf_constants.HF_HUB_CACHE)
if not cache_dir.is_dir():
return None

target = f"models--{repo_id.replace('/', '--')}".lower()
repo_dir: Optional[Path] = None
for entry in cache_dir.iterdir():
if entry.is_dir() and entry.name.lower() == target:
repo_dir = entry
break
if repo_dir is None:
return None

snapshots = repo_dir / "snapshots"
if not snapshots.is_dir():
return None

snap_dirs = [s for s in snapshots.iterdir() if s.is_dir()]
if not snap_dirs:
return None
snap_dirs.sort(key = lambda s: s.stat().st_mtime, reverse = True)
for snap in snap_dirs:
variants, has_vision = list_local_gguf_variants(str(snap))
if variants or has_vision:
return variants, has_vision
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for finding a repository directory in the Hugging Face cache is duplicated in multiple locations (e.g., _detect_gguf_from_hf_cache and _download_gguf). This logic is also inefficient as it iterates over the entire cache directory. Consider creating a single helper function to find the repository directory to improve maintainability and performance, as per repository guidelines to centralize recurring logic and avoid redundant iterations.

References
  1. Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.
  2. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

Comment on lines +1339 to +1342
offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in (
"1",
"true",
"yes",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.

    from huggingface_hub.utils import is_offline_mode

    offline = is_offline_mode()
References
  1. Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.

Comment on lines +1561 to +1564
offline = os.environ.get("HF_HUB_OFFLINE", "").lower() in (
"1",
"true",
"yes",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check for offline mode can be simplified by using the is_offline_mode utility from huggingface_hub. This makes the code more concise and robust. Additionally, as this check is recurring, consider centralizing it into a single helper function to ensure consistency and simplify maintenance.

    from huggingface_hub.utils import is_offline_mode

    offline = is_offline_mode()
References
  1. Centralize recurring or complex logical checks into a single helper function and reuse it across the codebase to ensure consistency and simplify maintenance.

Fixes from the review pass on #5505:

* ruff F823 (lint CI red): the late `import os` at the bottom of
  LlamaCppBackend.load_model made `os` a function-local name, so my
  new `os.environ` reference at the top of the same method was a
  use-before-bind. Surfaces at runtime as
  'cannot access local variable os where it is not associated with a value'
  and is why the Mac/Windows Studio API jobs were failing too. The
  env-var mutation has been moved into a module-level contextmanager,
  so load_model no longer touches `os` directly.

* Codex P1: cache variant match now uses the relative path, not the
  basename. Layouts like `BF16/foo.gguf` (variant token only in
  parent dir) were silently skipped, falling through to the bogus
  `{repo}-{variant}.gguf` heuristic and failing offline loads of
  models stored under quant-named subdirs.

* Codex P1: HF_HUB_OFFLINE no longer persists past one model load.
  llama_cpp.load_model now uses a contextmanager that probes DNS,
  sets HF_HUB_OFFLINE/TRANSFORMERS_OFFLINE only when DNS is dead,
  and pops them in finally (preserving any prior user setting of
  TRANSFORMERS_OFFLINE). Pre-existing user-set HF_HUB_OFFLINE is
  respected as a no-op. worker.py keeps the startup probe because the
  orchestrator spawns a fresh worker per load -- comment updated to
  make that lifecycle explicit, and a warning is now logged.

* Gemini: cache-dir lookup centralized in `_iter_hf_cache_snapshots`.
  Three near-identical copies (in list/detect helpers and the
  llama_cpp offline scan) now go through one helper.

* Gemini: `huggingface_hub.utils.is_offline_mode` does not exist in
  1.x (verified locally); `huggingface_hub.constants.HF_HUB_OFFLINE`
  is snapshot-at-import-time and does not reflect runtime mutations.
  Manual env-var parsing kept.

* socket probe now saves and restores the prior default timeout
  instead of unconditionally setting None on exit, so it composes
  with caller code that already configured a timeout.

* worker.py probe now logs a warning when offline mode is auto-enabled
  so debugging the case isn't blind.
@shimmyshimmer
Copy link
Copy Markdown
Member Author

Pushed 6476b9b addressing the review feedback. Verdicts inline:

Lint (ruff F823) — fixed. This was a real bug, not just a style nit. load_model had a late import os further down its body which made os a function-local name; my new os.environ reference at the top of the same method was therefore a use-before-bind. At runtime this surfaces as cannot access local variable 'os' where it is not associated with a value, and is the same error that caused the Mac/Windows Studio API jobs to fail with /api/inference/load -> 500. The env-var mutation has been moved into a module-level contextmanager (_hf_offline_if_dns_dead), so load_model no longer touches os directly.

Codex P1 #1 (match cache variant against relative path) — fixed. Real bug. The cache scan now matches against relative_to(snap).as_posix().lower() rather than p.name.lower(), so snapshot layouts where the quant lives in a parent dir (e.g. BF16/foo.gguf) are correctly found. Added regression tests.

Codex P1 #2 (do not persist HF_HUB_OFFLINE) — fixed. Real concern for the long-lived FastAPI process. llama_cpp.load_model now wraps the download phase in a _hf_offline_if_dns_dead() contextmanager that pops HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE in finally, so a transient DNS hiccup at the start of one load can't quarantine future loads. The contextmanager respects any pre-existing user setting (no-op if HF_HUB_OFFLINE is already in env) and preserves a user-set TRANSFORMERS_OFFLINE. For worker.py the scope is naturally one model load (orchestrator spawns a fresh subprocess per load), so the env mutation there is intentionally per-process; comment updated to make that lifecycle explicit.

Gemini (centralize cache-dir lookup) — fixed. Extracted _iter_hf_cache_snapshots(repo_id) in model_config.py. _list_gguf_variants_from_hf_cache, _detect_gguf_from_hf_cache, and the offline scan in llama_cpp._download_gguf all route through it now.

Gemini (use is_offline_mode from huggingface_hub) — declined, function does not exist. Verified locally on huggingface_hub 1.15.0: huggingface_hub.utils.is_offline_mode is not a public symbol. huggingface_hub.constants.HF_HUB_OFFLINE exists as a bool but is snapshot-at-import-time and does not reflect runtime env mutations, so it isn't usable for our case where the env var can be flipped mid-process. Kept the explicit env-var parsing, which also handles TRANSFORMERS_OFFLINE as a fallback.

Gemini (log warning when worker enables offline) — fixed. worker.py now emits huggingface.co unreachable; HF_HUB_OFFLINE=1 set for this worker. when the probe trips.

Drive-by improvements the review pass surfaced:

  • DNS probe now saves and restores the prior socket.getdefaulttimeout() instead of unconditionally setting None on exit, so it composes with caller code that may have set a timeout.
  • Cache iteration in _iter_hf_cache_snapshots now guards iterdir() against OSError (permission errors on shared caches).

Test coverage: the local suite that exercises the patched code is now 119 cases (102 from the original verification + 17 net new for the review fixes). Notable additions:

  • test_relative_path_match_with_subdir_variant: regression for the basename-only bug Codex found.
  • test_hf_offline_cm_sets_and_clears_on_dns_failure: verifies the env var is gone on context exit.
  • test_hf_offline_cm_cleans_up_even_on_exception: same, on raised exception inside the body.
  • test_hf_offline_cm_preserves_pre_existing_transformers_offline: cleanup does not blow away a user-set value.
  • test_dns_probe_restores_prior_socket_timeout: probe doesn't clobber a caller's prior timeout.

Pre-commit formatting (scripts/run_ruff_format.py + kwarg spacing) applied locally, so the pre-commit.ci bot shouldn't need to add a follow-up commit this time.

Copy link
Copy Markdown
Member

@danielhanchen danielhanchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled the branch down, set up an isolated harness with the same pinned huggingface_hub==0.36.2, ran 41 simulated scenarios against the three changed call paths (fake HF cache with both flat and snapshot-symlink layouts, monkey-patched hf_model_info / list_repo_files, monkey-patched socket.gethostbyname). Full suite passes; results back both bot P1 findings empirically. Sim lives at workspace_32/temp/pr5505_sim/ if you want it.

The PR is solving real bugs: I reproduced all three failure modes against the pre-PR code (case 06 raises offline, case 15 builds the synthetic Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf that misses the cache). Online behavior is unchanged versus main (case 27). Cross-platform-wise the new code is pure stdlib + pathlib.Path + huggingface_hub.constants.HF_HUB_CACHE, no sys.platform branches added; the realistic-layout suite covers HF cache symlinks (snapshots/<sha>/<name> into blobs/) and dangling symlinks from partial downloads.

P1

1. _download_gguf offline cache filter regresses online/offline parity.
studio/backend/core/inference/llama_cpp.py:1814 matches on boundary.search(p.name.lower()) (basename), while the online block at :1748-1757 matches on the full path from list_repo_files (which can return rfilenames like BF16/foo.gguf). The codebase already supports the subdir-only layout: model_config.py:1020, :1213, :1258, :1443-1446, :1490-1491, :2293-2300. Simulation case 16 / E4 reproduces this end to end against a cache containing only BF16/foo.gguf: PR resolves to the synthetic gpt-oss-20b-BF16-BF16.gguf (which is not on disk) and the load fails; case 17 / E4 shows the one-character fix passing.

Suggested change (one line):

# studio/backend/core/inference/llama_cpp.py around line 1814
matches = sorted(
    p.relative_to(snap).as_posix()
    for p in snap.rglob("*.gguf")
    if "mmproj" not in p.name.lower()
    and boundary.search(p.relative_to(snap).as_posix().lower())
)

The basename-only failure is latent today (every in-repo example I found still has the quant token in the basename) but it diverges from the online filter right above and silently turns valid cached repos into download failures the moment any future unsloth GGUF ships a subdir-only layout. Cheap to fix now.

2. LlamaCppBackend.load_model permanently mutates HF_HUB_OFFLINE on one transient DNS miss.
llama_cpp.py:2052-2067 sets os.environ["HF_HUB_OFFLINE"] = "1" and os.environ.setdefault("TRANSFORMERS_OFFLINE", "1") in the except branch and never restores them. LlamaCppBackend is a module-level singleton (routes/inference.py:433 -> _llama_cpp_backend = LlamaCppBackend()), reused across every /load for the lifetime of the FastAPI server. After a single DNS hiccup at first load the env var is sticky forever: when connectivity recovers the next uncached-repo load short-circuits the probe (if hf_repo and "HF_HUB_OFFLINE" not in os.environ), hf_hub_download keeps refusing the Hub, and the user sees stale-cache errors with no recovery short of restarting the server. grep "del os.environ\['HF_HUB_OFFLINE'\]\\|os.environ.pop.*HF_HUB_OFFLINE" studio/ returns zero hits.

Simulation case 19 reproduces the leak: DNS fails -> env stuck at 1 even after DNS recovers. Case 20 / 21 / 22 show the scoped variant restores the env, preserves a user-set HF_HUB_OFFLINE=1, and is a no-op when DNS succeeds.

worker.py:651-663 does the same thing but isn't a real bug there because the worker is one-shot per load: orchestrator.py:606-608 documents "Always spawns a fresh subprocess for each model load" and _CTX.Process(...) at :181-192 confirms it. Each load gets a fresh probe and a fresh process env. The block in worker.py can stay as-is; only llama_cpp.py:2052-2067 needs to change.

Suggested change (scope the mutation):

# studio/backend/core/inference/llama_cpp.py around line 2052
self._cancel_event.clear()

_offline_keys: list[str] = []
if hf_repo and "HF_HUB_OFFLINE" not in os.environ:
    import socket as _socket
    try:
        _socket.setdefaulttimeout(2.0)
        _socket.gethostbyname("huggingface.co")
    except Exception:
        os.environ["HF_HUB_OFFLINE"] = "1"
        _offline_keys.append("HF_HUB_OFFLINE")
        if "TRANSFORMERS_OFFLINE" not in os.environ:
            os.environ["TRANSFORMERS_OFFLINE"] = "1"
            _offline_keys.append("TRANSFORMERS_OFFLINE")
        logger.warning(
            "huggingface.co unreachable, enabling HF_HUB_OFFLINE for this load."
        )
    finally:
        _socket.setdefaulttimeout(None)
try:
    # ... existing kill / launch / hf_hub_download path ...
finally:
    for _k in _offline_keys:
        os.environ.pop(_k, None)

_offline_keys ensures we only pop the variables this code added; a user who explicitly set HF_HUB_OFFLINE=1 before launching Studio keeps that value across the load. Alternative shape if you'd rather not touch os.environ at all: thread local_files_only=True into the specific hf_hub_download / get_paths_info calls reached after the probe fails. No local_files_only= usage anywhere in studio/ today (zero grep hits), so this would be the first instance; the scoped try/finally is the smaller diff.

P2

3. Centralise the cache repo-dir lookup. Three sites (_list_gguf_variants_from_hf_cache, _detect_gguf_from_hf_cache, and the new inline block at llama_cpp.py:1797-1840) hand-roll the same cache_dir / f"models--{repo_id.replace('/', '--')}".lower() resolution plus snapshot iteration sorted by mtime. One _find_repo_cache_dir(repo_id) -> Optional[Path] (or _iter_repo_snapshots(repo_id)) helper in model_config.py keeps the three sites in sync, and the fix in P1 #1 lives in exactly one place.

4. worker.py:651-663 should log when auto-offline mode kicks in, matching the logger.warning(...) already in llama_cpp.py:2061. Pure observability nit.

5. Drop the huggingface_hub.utils.is_offline_mode suggestion from the bot's medium comments. That function does not exist in pinned huggingface_hub==0.36.2. I verified with python -c "from huggingface_hub.utils import is_offline_mode" -> ImportError, and again with from huggingface_hub import is_offline_mode -> ImportError. The closest modern API is huggingface_hub.constants.HF_HUB_OFFLINE (module-level bool computed from env at import), but the current manual env check is also fine and is more flexible because it accepts "true" / "yes" (simulated in cases E11 / E12 / E13). Keep what you have.

Nits

  1. The cache fallback in list_gguf_variants / detect_gguf_model_remote catches bare Exception and serves cache on every HF API error. The retry loop in detect_gguf_model_remote already early-returns on RepositoryNotFoundError / GatedRepoError / RevisionNotFoundError / EntryNotFoundError; consider mirroring that for the outer except Exception cache fallback so a stale cache cannot quietly paper over a deliberate access-control or revision failure. Not blocking, the synthetic-name fallback after both API and cache miss still raises eventually.

  2. Concurrency edge: socket.setdefaulttimeout(2.0) is process-global. If two /load requests land in the same FastAPI worker concurrently, the second thread's probe can see the first thread's reset to None mid-flight (or vice versa). Simulation case E6 doesn't repro the worst case but observes the race window. Low priority; only matters once you have parallel-load tenants.

  3. The PR-description bullet "Drop a redundant import os inside the disk-space try block" is fine but unrelated to offline; easy to split into its own commit if you keep the offline-fix PR tight.

Verification

If you want to reproduce the proofs locally without retraining anything:

cd workspace_32/temp/pr5505_sim
uv venv .venv && source .venv/bin/activate
uv pip install "huggingface_hub==0.36.2" structlog pytest
python run_sim.py        # 28 cases, flat fake cache
python run_sim_extras.py # 13 cases, realistic snapshot-symlink cache + concurrency

Cases that reproduce the two P1s in isolation:

Case What it shows
06 / 15 BEFORE-PR offline path raises / builds bogus filename (the bug the PR fixes)
14 PR fixes the Qwen3.6-27B-MTP-GGUF case
16 / E4 P1 #1 reproduced: PR's basename filter misses BF16/foo.gguf
17 / E4 One-line fix (relative-path filter) resolves the subdir layout
19 P1 #2 reproduced: HF_HUB_OFFLINE stays 1 after DNS recovery
20 / 21 / 22 Scoped try/finally restores env, preserves user-set value, no-op on success
27 Online happy path unchanged versus main
E1-E5 Realistic HF cache layout (snapshot symlinks into blobs/, dangling links)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6476b9b9bc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +664 to +665
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid forcing offline mode after one DNS lookup failure

A transient failure in _socket.gethostbyname("huggingface.co") now unconditionally sets HF_HUB_OFFLINE=1 for the entire worker process, so the same load attempt will refuse all Hub HTTP calls even if connectivity is actually available moments later. This regresses online reliability for partially cached models: instead of retrying/download behavior, the load fails immediately in offline-only mode until a new subprocess is spawned. Fresh evidence in this revision is that run_inference_process sets the env once at startup while the spawned worker is kept alive through the loaded handshake in orchestrator.load_model, so every Hub call in that attempt inherits offline mode.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 17, 2026 10:03
Lock in the offline fallback path from #5505 so future refactors can't
silently regress either bug. 26 tests, 0.55 s, no network/GPU/subprocess.

Covers:

* _iter_hf_cache_snapshots: missing cache, missing repo, missing
  snapshots/, newest-mtime ordering, case-insensitive repo match.
* _list_gguf_variants_from_hf_cache and the list_gguf_variants
  online/offline-env/API-exception/reraise paths.
* _detect_gguf_from_hf_cache and detect_gguf_model_remote 3x-fail
  fallback. Pre-existing RepositoryNotFoundError early-return preserved.
* Codex P1 #1 regression: BF16/foo.gguf (quant only in subdir name)
  must resolve via _detect_gguf_from_hf_cache, which now matches the
  snapshot-relative path rather than the basename.
* _probe_dns_dead: returns True/False, restores prior socket timeout.
* Codex P1 #2 regression: _hf_offline_if_dns_dead sets env only inside
  the block, restores on exit (including on exception), re-probes DNS
  on the next call so a transient hiccup cannot lock the long-lived
  LlamaCppBackend singleton offline. Honors a user-set HF_HUB_OFFLINE
  as a no-op. Preserves a user-set TRANSFORMERS_OFFLINE across exit.

Follows the existing studio backend test stub pattern (loggers /
structlog / httpx stubs + backend dir on sys.path).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c27d1da1b6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

) -> Optional[tuple[list[GgufVariantInfo], bool]]:
"""Variants from the local HF cache snapshot, or None if not cached."""
for snap in _iter_hf_cache_snapshots(repo_id):
variants, has_vision = list_local_gguf_variants(str(snap))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Derive offline variants from relative paths

The new cache fallback calls list_local_gguf_variants(str(snap)), but that helper computes quant labels from each file basename (f.name) rather than the snapshot-relative path. In repos that store quants in directories (for example BF16/model.gguf and Q4_K_M/model.gguf), both files collapse to the same pseudo-quant (model), so the offline variant list can lose/distort variants and drive incorrect default variant selection. This regression is specific to the new offline HF-cache path because the online path receives full repo filenames.

Useful? React with 👍 / 👎.

…abel

Two follow-up fixes from the review pass on #5505:

* _download_mmproj() now mirrors _download_gguf()'s offline path:
  when list_repo_files() fails, scan the local HF cache snapshot for
  any GGUF whose basename starts with mmproj-. Without this, offline
  vision GGUF loads succeed at the main weight (the existing PR fix)
  but the mmproj returns None and llama-server starts without vision
  support. Same _iter_hf_cache_snapshots helper, F16 preference and
  fallback to the first match are preserved.

* _extract_quant_label() now considers parent directory segments when
  the basename has no quant token. Layouts like BF16/foo.gguf are
  already documented in this file and are returned by the new
  snapshot-relative-path filter in _download_gguf; before this fix
  their variant label collapsed to "foo" (the last hyphen segment of
  the basename). Regex is the same; the search just walks parent
  segments innermost-first if the basename misses.

Tests (studio/backend/tests/test_offline_gguf_cache_fallback.py):

* TestExtractQuantLabelSubdir: basename quant unchanged, quant-only-
  in-parent, UD- prefix in parent, deeper nesting picks the
  innermost matching segment.
* TestDownloadMmprojOfflineCacheFallback: cache fallback returns the
  mmproj when list_repo_files fails, F16 preference holds when both
  variants are in cache, no-mmproj cache returns None.
* httpx stub now prefers the real package when installed (the CI
  install list already includes it) and falls back to the stub only
  when httpx is genuinely missing. Newer huggingface_hub imports
  HTTPError/Response/Request at module load, so the previous
  fixed-set stub broke when those names were added upstream.

26 existing cases plus 7 new = 33 pass in 0.74s.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6c17f750c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1373 to +1375
# API failed; fall back to local snapshot if fully downloaded.
cached = _list_gguf_variants_from_hf_cache(repo_id)
if cached is not None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep permanent Hub errors from using stale cache

When the Hub API raises a permanent error while the process is not explicitly offline (for example RepositoryNotFoundError, GatedRepoError, or RevisionNotFoundError for a deleted/private repo or bad token), this catch-all now still returns any stale local snapshot. That makes the variant list look valid and can let the UI proceed with cached data even though the repo access/name is definitively invalid; detect_gguf_model_remote already preserves these errors by not consulting cache for the same exception names.

Useful? React with 👍 / 👎.

Comment on lines +1526 to +1529
rel_files = [
f.relative_to(snap).as_posix()
for f in _iter_gguf_files(snap, recursive = True)
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude mmproj files from cached model detection

In the offline cache path this list includes vision projector GGUFs, so _pick_best_gguf can return mmproj-*-F16.gguf as the main model when the cache is partial or when the actual model filename lacks a preferred quant token. That routes the repo to llama.cpp with the projector file as -m, instead of failing or selecting a non-mmproj model; the other GGUF listing paths explicitly skip mmproj files for this reason.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 17, 2026 13:01
Resolves conflict in studio/backend/core/inference/llama_cpp.py: the
offline DNS scoping wrapper around _download_gguf/_download_mmproj
now sits inside the new _serial_load_lock + _already_in_target_state
early-return added on main.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: feaeec5a90

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +106 to +108
prev = socket.getdefaulttimeout()
socket.setdefaulttimeout(timeout)
try:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid mutating global socket timeout in DNS probe

_probe_dns_dead sets socket.setdefaulttimeout(timeout) before calling gethostbyname, which changes the default timeout process-wide for the duration of the probe. During a load_model call this can affect unrelated concurrent requests in the same backend process: any path that opens sockets without an explicit timeout will inherit 2.0s and may fail spuriously. This creates cross-request, timing-dependent network failures that are hard to diagnose.

Useful? React with 👍 / 👎.

Four review findings tightened, with regression tests:

- list_local_gguf_variants subdir collapse (P1 codex 10:08): pass the
  snapshot-relative path to _extract_quant_label so BF16/foo.gguf and
  Q4_K_M/foo.gguf produce distinct labels instead of folding to the same
  basename pseudo-quant.
- list_gguf_variants cache fallback (P2 codex 12:10): surface
  RepositoryNotFoundError / GatedRepoError / RevisionNotFoundError /
  EntryNotFoundError to the caller instead of masking with stale cache,
  matching detect_gguf_model_remote.
- _detect_gguf_from_hf_cache mmproj (P2 codex 12:10): exclude mmproj
  files from the candidate list so a partial cache with only a vision
  projector cannot route the projector as the main model.
- _probe_dns_dead global timeout (P2 codex 13:06): run the gethostbyname
  on a daemon thread with join timeout so concurrent sockets in the same
  interpreter never inherit a process-wide socket.setdefaulttimeout
  mutation. Same shape applied in worker.py's startup probe.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
unslothai#5505 fixed the GGUF/llama-server load path. Studio still has two
adjacent code paths that burn ~30-60s of soft-failed timeouts before
the worker subprocess starts when DNS to huggingface.co is dead and
the model is already in the local HF cache.

Inference parent process (routes/inference.py:load_model):

* ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead
  so the LoRA-detect hf_model_info call and the urllib config probes
  in utils/transformers_version.py short-circuit when DNS is dead.
* utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/
  TRANSFORMERS_OFFLINE check used by list_gguf_variants and
  detect_gguf_model_remote into a shared _env_offline() helper, then
  reused it to gate the LoRA-detect hf_model_info call.
* utils/transformers_version.py: _check_tokenizer_config_needs_v5 and
  _check_config_needs_550 now early-return False when offline instead
  of issuing a 10s urllib.urlopen against huggingface.co/raw/main.

Training worker (core/training/worker.py:run_training_process):

* Add the same 2s DNS probe used by core/inference/worker.py at the
  top of the training subprocess. On failure, set HF_HUB_OFFLINE,
  TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of
  the subprocess imports torch/transformers/unsloth, so every
  from_pretrained, snapshot_download, and load_dataset call below
  resolves from cache. Scope is per-subprocess; the orchestrator
  always spawns a fresh worker per training run.

Training trainer (core/training/trainer.py:load_model):

* Skip the proactive hf_model_info gated-repo probe when _env_offline()
  is true. The API is unreachable anyway, and a gated model that is
  already cached is exactly the scenario the user is trying to train
  against. from_pretrained surfaces the real error if access is
  actually denied.

Tests (tests/test_offline_inference_parent.py, 7 new cases):

* _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and
  TRANSFORMERS_OFFLINE.
* transformers_version urllib short-circuit when offline.
* LoRA detect hf_model_info skip when offline.

Existing tests/test_offline_gguf_cache_fallback.py still passes
(26 cases) because the inline env check was extracted, not changed.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
Same fix as unslothai#5505's _probe_dns_dead refactor: run gethostbyname on a
daemon thread with join timeout so concurrent sockets in the parent
interpreter never inherit a process-wide socket.setdefaulttimeout
mutation. Adds a static-pin regression test that the inference parent
file does not regress on this.
danielhanchen and others added 3 commits May 18, 2026 02:25
Two layered fixes for the Windows GGUF smoke CI Tool calling Tests
flake that exit-22'd on a single httpx.ReadError during llama-server
warmup. The 'windows-latest -> windows-2025-vs2026' image rollout is
hitting main with the identical symptom.

A. _wait_for_health: catch httpx.ReadError, RemoteProtocolError,
   WriteError alongside ConnectError and TimeoutException. A TCP RST
   mid-read while llama-server is still binding the port (WinError
   10054) is a 'still warming up' signal, not fatal. The existing
   _process.poll() check still wins for real crashes.

B. _drain_stdout + spawn: tee llama-server stdout/stderr to a
   per-launch log file at ~/.unsloth/studio/logs/llama-server/
   <port>.log. Any future subprocess crash leaves a forensic trace
   on disk even when Studio's traceback only captures the symptom
   (ReadError) and not the cause. Best-effort: a logging-side OSError
   never blocks the load.

Regression coverage: TestWaitForHealthRetriesOnReadError pins the
retry behaviour for the three new exception types and verifies that a
real process exit still short-circuits the loop.
Composite fix for the Tool calling Tests flake that exit-22'd on a
single httpx.ReadError during llama-server warm-up. The
windows-latest -> windows-2025-vs2026 runner image rollout has been
hitting main with the identical symptom.

- All three jobs (openai-anthropic, tool-calling, json-images) now
  retry POST /api/inference/load up to 3 times with 10s backoff and
  preserve the response body for post-mortem. One transient 500 no
  longer fails the whole job.
- A new "Collect llama-server logs" step copies the per-launch
  llama-server stdout teed by Studio under ~/.unsloth/studio/logs/
  llama-server/ into the workspace, and the upload-artifact step
  now includes logs/llama-server/*.log so any future subprocess
  crash leaves a forensic trace.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3f9ce6394

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +269 to +273
HTTP=$(curl -s -o /tmp/load.json -w '%{http_code}' \
-X POST "http://127.0.0.1:${STUDIO_PORT}/api/inference/load" \
-H "Authorization: Bearer $TOKEN" -H 'content-type: application/json' \
--max-time 600 \
-d "{\"model_path\":\"$GGUF_REPO\",\"gguf_variant\":\"$GGUF_VARIANT\",\"is_lora\":false,\"max_seq_length\":2048}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle curl failures inside the retry loop

In this workflow shell: bash is executed with -e/pipefail (GitHub documents shell: bash as bash --noprofile --norc -eo pipefail {0}), so if curl hits the transient TCP reset/timeout this retry was added for, the non-zero exit from the command substitution aborts the step before attempt 2 or 3 runs. This makes the new retry only cover HTTP responses like 500, not the connection-level failures described in the comment; the same pattern is repeated in the other load retry blocks.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen merged commit 3ff6204 into main May 18, 2026
32 checks passed
@danielhanchen danielhanchen deleted the offline-gguf-cache-fallback branch May 18, 2026 04:25
danielhanchen added a commit that referenced this pull request May 18, 2026
#5505 fixed the GGUF/llama-server load path. Studio still has two
adjacent code paths that burn ~30-60s of soft-failed timeouts before
the worker subprocess starts when DNS to huggingface.co is dead and
the model is already in the local HF cache.

Inference parent process (routes/inference.py:load_model):

* ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead
  so the LoRA-detect hf_model_info call and the urllib config probes
  in utils/transformers_version.py short-circuit when DNS is dead.
* utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/
  TRANSFORMERS_OFFLINE check used by list_gguf_variants and
  detect_gguf_model_remote into a shared _env_offline() helper, then
  reused it to gate the LoRA-detect hf_model_info call.
* utils/transformers_version.py: _check_tokenizer_config_needs_v5 and
  _check_config_needs_550 now early-return False when offline instead
  of issuing a 10s urllib.urlopen against huggingface.co/raw/main.

Training worker (core/training/worker.py:run_training_process):

* Add the same 2s DNS probe used by core/inference/worker.py at the
  top of the training subprocess. On failure, set HF_HUB_OFFLINE,
  TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of
  the subprocess imports torch/transformers/unsloth, so every
  from_pretrained, snapshot_download, and load_dataset call below
  resolves from cache. Scope is per-subprocess; the orchestrator
  always spawns a fresh worker per training run.

Training trainer (core/training/trainer.py:load_model):

* Skip the proactive hf_model_info gated-repo probe when _env_offline()
  is true. The API is unreachable anyway, and a gated model that is
  already cached is exactly the scenario the user is trying to train
  against. from_pretrained surfaces the real error if access is
  actually denied.

Tests (tests/test_offline_inference_parent.py, 7 new cases):

* _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and
  TRANSFORMERS_OFFLINE.
* transformers_version urllib short-circuit when offline.
* LoRA detect hf_model_info skip when offline.

Existing tests/test_offline_gguf_cache_fallback.py still passes
(26 cases) because the inline env check was extracted, not changed.
danielhanchen added a commit that referenced this pull request May 18, 2026
Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a
daemon thread with join timeout so concurrent sockets in the parent
interpreter never inherit a process-wide socket.setdefaulttimeout
mutation. Adds a static-pin regression test that the inference parent
file does not regress on this.
danielhanchen added a commit that referenced this pull request May 18, 2026
#5505 fixed the GGUF/llama-server load path. Studio still has two
adjacent code paths that burn ~30-60s of soft-failed timeouts before
the worker subprocess starts when DNS to huggingface.co is dead and
the model is already in the local HF cache.

Inference parent process (routes/inference.py:load_model):

* ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead
  so the LoRA-detect hf_model_info call and the urllib config probes
  in utils/transformers_version.py short-circuit when DNS is dead.
* utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/
  TRANSFORMERS_OFFLINE check used by list_gguf_variants and
  detect_gguf_model_remote into a shared _env_offline() helper, then
  reused it to gate the LoRA-detect hf_model_info call.
* utils/transformers_version.py: _check_tokenizer_config_needs_v5 and
  _check_config_needs_550 now early-return False when offline instead
  of issuing a 10s urllib.urlopen against huggingface.co/raw/main.

Training worker (core/training/worker.py:run_training_process):

* Add the same 2s DNS probe used by core/inference/worker.py at the
  top of the training subprocess. On failure, set HF_HUB_OFFLINE,
  TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of
  the subprocess imports torch/transformers/unsloth, so every
  from_pretrained, snapshot_download, and load_dataset call below
  resolves from cache. Scope is per-subprocess; the orchestrator
  always spawns a fresh worker per training run.

Training trainer (core/training/trainer.py:load_model):

* Skip the proactive hf_model_info gated-repo probe when _env_offline()
  is true. The API is unreachable anyway, and a gated model that is
  already cached is exactly the scenario the user is trying to train
  against. from_pretrained surfaces the real error if access is
  actually denied.

Tests (tests/test_offline_inference_parent.py, 7 new cases):

* _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and
  TRANSFORMERS_OFFLINE.
* transformers_version urllib short-circuit when offline.
* LoRA detect hf_model_info skip when offline.

Existing tests/test_offline_gguf_cache_fallback.py still passes
(26 cases) because the inline env check was extracted, not changed.
danielhanchen added a commit that referenced this pull request May 18, 2026
Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a
daemon thread with join timeout so concurrent sockets in the parent
interpreter never inherit a process-wide socket.setdefaulttimeout
mutation. Adds a static-pin regression test that the inference parent
file does not regress on this.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
Resolve conflict in studio/backend/core/inference/llama_cpp.py:
_wait_for_health's except clause. Main's unslothai#5505 landed essentially the
same fix this branch shipped in 56537b6 -- catching httpx.ReadError
and RemoteProtocolError to keep a transient TCP RST from masking a
real subprocess crash -- and added httpx.WriteError on top. Took main's
version (strict superset) and dropped this branch's duplicate.

Extend tests/test_llama_cpp_wait_for_health.py:
- Add WriteError to the stub-completion list so the file imports
  cleanly against either real httpx or a sibling test's stub.
- Add test_write_error_also_swallowed alongside the existing ReadError
  and RemoteProtocolError cases.

All other files in this merge auto-resolved cleanly (install_llama_prebuilt.py,
workflows, etc.).

Local verification:
- pytest test_llama_cpp_wait_for_health.py + windows_nvidia_path +
  windows_gpu_detection_mock + llama_server_args -- 116 passed.
- pytest test_import_fixes_drift.py -- 18 passed.
- pytest test_offline_gguf_cache_fallback.py (main's unslothai#5505 addition) -- 44 passed.
danielhanchen added a commit that referenced this pull request May 18, 2026
…#5512)

* studio: extend offline DNS auto-detect to inference parent + training

#5505 fixed the GGUF/llama-server load path. Studio still has two
adjacent code paths that burn ~30-60s of soft-failed timeouts before
the worker subprocess starts when DNS to huggingface.co is dead and
the model is already in the local HF cache.

Inference parent process (routes/inference.py:load_model):

* ModelConfig.from_identifier now runs inside _hf_offline_if_dns_dead
  so the LoRA-detect hf_model_info call and the urllib config probes
  in utils/transformers_version.py short-circuit when DNS is dead.
* utils/models/model_config.py: extracted the inline HF_HUB_OFFLINE/
  TRANSFORMERS_OFFLINE check used by list_gguf_variants and
  detect_gguf_model_remote into a shared _env_offline() helper, then
  reused it to gate the LoRA-detect hf_model_info call.
* utils/transformers_version.py: _check_tokenizer_config_needs_v5 and
  _check_config_needs_550 now early-return False when offline instead
  of issuing a 10s urllib.urlopen against huggingface.co/raw/main.

Training worker (core/training/worker.py:run_training_process):

* Add the same 2s DNS probe used by core/inference/worker.py at the
  top of the training subprocess. On failure, set HF_HUB_OFFLINE,
  TRANSFORMERS_OFFLINE, and HF_DATASETS_OFFLINE before the rest of
  the subprocess imports torch/transformers/unsloth, so every
  from_pretrained, snapshot_download, and load_dataset call below
  resolves from cache. Scope is per-subprocess; the orchestrator
  always spawns a fresh worker per training run.

Training trainer (core/training/trainer.py:load_model):

* Skip the proactive hf_model_info gated-repo probe when _env_offline()
  is true. The API is unreachable anyway, and a gated model that is
  already cached is exactly the scenario the user is trying to train
  against. from_pretrained surfaces the real error if access is
  actually denied.

Tests (tests/test_offline_inference_parent.py, 7 new cases):

* _env_offline truthy/falsy parsing across HF_HUB_OFFLINE and
  TRANSFORMERS_OFFLINE.
* transformers_version urllib short-circuit when offline.
* LoRA detect hf_model_info skip when offline.

Existing tests/test_offline_gguf_cache_fallback.py still passes
(26 cases) because the inline env check was extracted, not changed.

* tests: prefer real httpx over stub in offline-test files

The studio test stub convention only included the 6 httpx exception
names that existed callers needed. Newer huggingface_hub (1.15+)
imports HTTPError, Response, Request, HTTPStatusError, AsyncClient,
and more at module import time. When httpx is truly absent the stub
chase becomes a treadmill.

Use the real package when installed (the CI install list already
includes httpx, so this is the production environment). Fall back to
the stub only when httpx is genuinely missing.

No code under test changes.

* studio: detect cached LoRA adapters offline; tighten test

Two follow-ups from the review pass on #5512:

* ModelConfig.from_identifier no longer skips the remote LoRA-detect
  hf_model_info call when _env_offline() is true. huggingface_hub
  short-circuits the call via OfflineModeIsEnabled in ~0ms when
  HF_HUB_OFFLINE is set, so the original 25s concern was moot once
  routes/inference.py wrapped the call in _hf_offline_if_dns_dead.
  Skipping the API meant users with a cached LoRA adapter
  (adapter_config.json on disk) got is_lora=False and the load
  failed. After the API call (which raises fast offline) a new
  cache-fallback walks the HF cache snapshot for adapter_config.json
  via the existing _iter_hf_cache_snapshots helper.

* test_hf_model_info_not_called_when_offline replaced. The old test
  raised AssertionError inside production code that catches Exception,
  so it passed even if the call happened. New tests use MagicMock and
  assert call_count >= 1, plus a fixture that stages a fake HF cache
  with adapter_config.json to verify the offline cache detection.

Test count goes from 7 to 8 in test_offline_inference_parent.py.
Combined with test_offline_gguf_cache_fallback.py: 34 pass in 9.75s.

* Fix/adjust offline training DNS probe per PR #5505 review

Same fix as #5505's _probe_dns_dead refactor: run gethostbyname on a
daemon thread with join timeout so concurrent sockets in the parent
interpreter never inherit a process-wide socket.setdefaulttimeout
mutation. Adds a static-pin regression test that the inference parent
file does not regress on this.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Trim verbose code comments per review feedback

Shorten the longer explanatory comments added by this PR while keeping
the WHY of each non-obvious branch:

- trainer.py: collapse the 5-line proactive gated-check comment.
- training/worker.py: trim the offline auto-detect preamble and the
  "logger isn't configured" note.
- routes/inference.py: shorten the DNS-probe wrap rationale.
- transformers_version.py: collapse the two urllib short-circuit notes.
- model_config.py: shorten the LoRA detect + cache-fallback notes.
- tests/test_offline_inference_parent.py: tighter module docstring,
  trim class docstrings, drop multi-line explainer comments inside the
  tests; behaviour and coverage unchanged (9/9 tests still pass).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
The orchestrator's iter-1 refactor of llama_cpp.py inadvertently
removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by
unslothai#5505 between when this branch forked and the orchestrator's merge),
which caused tests/test_offline_gguf_cache_fallback.py to fail
collection across Python 3.10 / 3.11 / 3.12 / 3.13:

    ImportError: cannot import name '_hf_offline_if_dns_dead' from
    'core.inference.llama_cpp'

The original intent of this PR for llama_cpp.py was only to delegate
the existing _parse_tool_calls_from_text staticmethod to the shared
core/inference/tool_call_parser.py, so this commit:

1. Restores studio/backend/core/inference/llama_cpp.py to
   origin/main verbatim.
2. Re-adds the single import of parse_tool_calls_from_text from the
   shared parser module.
3. Re-applies the staticmethod-body swap to call the shared parser.

Net delta vs main is now small (the shared parser pulls the body
out; the DNS-offline helpers and every other GGUF feature stay
exactly as main has them).

Test pass count after the fix (all on Linux Python 3.11):

* 41 safetensors tool-loop tests
* 44 offline GGUF cache fallback tests (the previously failing file)
* 217 other related tool / inference / anthropic tests = 302 total
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants