Skip to content

fix(studio): custom folder scan fails to find GGUF variants when pointing directly at a model directory#5

Closed
danielhanchen wants to merge 5 commits into
mainfrom
pr-4860-head
Closed

fix(studio): custom folder scan fails to find GGUF variants when pointing directly at a model directory#5
danielhanchen wants to merge 5 commits into
mainfrom
pr-4860-head

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Staging mirror of unslothai#4860

Original PR: unslothai#4860
Author: @JYYYYYT

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

When adding a custom scan folder that points directly at a model directory (e.g. /path/to/gemma-4-e2b-it-gguf/), the model list shows individual .gguf files as separate entries instead of recognizing the directory as a single model. Clicking any entry shows "No GGUF variants found".

Fixes:

  1. _scan_models_dir -- detect self-as-model (requires both config file + weight files)
  2. _scan_lmstudio_dir -- skip model directories (not publisher structure)
  3. list_local_gguf_variants -- file path fallback to parent directory

JYYYYYT and others added 5 commits April 5, 2026 15:06
…ights

  When a custom scan folder points directly at a model directory (e.g.
  gemma-4-e2b-it-gguf/ containing config.json and .gguf files),
  _scan_models_dir previously skipped the directory itself and listed
  individual .gguf files as standalone models. The gguf-variants endpoint
  then received file paths instead of directory paths, causing
  list_local_gguf_variants to return an empty list ("No GGUF variants
  found").

  Three fixes:
  1. _scan_models_dir: detect when the scanned directory itself is a model
     (has BOTH a config file AND weight files) and return it as a single
     entry. Both conditions are required to avoid false positives on bare
     .gguf collections or config-only directories.
  2. _scan_lmstudio_dir: early-return when the directory has config files
     (not a publisher structure), and skip child directories that are
     model directories rather than treating them as publisher folders.
  3. list_local_gguf_variants: fall back to the parent directory when a
     .gguf file path is passed instead of a directory.
…com/JYYYYYT/unsloth into fix/local-folder-scan-gguf-variants

# Conflicts:
#	studio/backend/routes/models.py
lm_dir / "adapter_config.json"
).exists():
return []
except OSError:
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances model discovery by allowing directories containing both configuration and weight files to be identified as models. It also updates the LM Studio directory scanner to avoid redundant processing and modifies GGUF variant lookup functions to support standalone file paths by falling back to their parent directories. Review feedback suggests refactoring the duplicated model detection logic into a shared helper function, optimizing directory iterations to avoid redundant scans, and maintaining a consistent list of supported weight file extensions across the codebase.

Comment on lines +155 to +163
_has_config = (models_dir / "config.json").exists() or (
models_dir / "adapter_config.json"
).exists()
_has_weights = any(
f.suffix.lower() in (".gguf", ".safetensors", ".bin")
for f in models_dir.iterdir()
if f.is_file()
)
_is_self_model = _has_config and _has_weights
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for detecting if a directory is a model (checking for config.json or adapter_config.json and weight files) is repeated multiple times in this file (lines 155, 190, 283, 314). Consider refactoring this into a helper function to improve maintainability and ensure consistency across different scan methods.

Copy link
Copy Markdown

@JYYYYYT JYYYYYT Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for detecting if a directory is a model (checking for config.json or adapter_config.json and weight files) is repeated multiple times in this file (lines 155, 190, 283, 314). Consider refactoring this into a helper function to improve maintainability and ensure consistency across different scan methods.

Good suggestion. The model detection logic does appear in multiple places with intentionally different criteria (e.g. _is_self_model requires config AND weights, while Phase 1 uses a broader OR check). Refactoring into a shared helper would require careful parameterization to preserve these distinctions. I think this is better suited as a follow-up PR to keep this fix focused.

Comment on lines +158 to +162
_has_weights = any(
f.suffix.lower() in (".gguf", ".safetensors", ".bin")
for f in models_dir.iterdir()
if f.is_file()
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Using models_dir.iterdir() here and then iterating over the directory again at line 183 is slightly inefficient. While not critical for most model directories, you could consider listing the children once if performance becomes an issue, or using glob patterns which are used elsewhere in the file for similar checks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

Using models_dir.iterdir() here and then iterating over the directory again at line 183 is slightly inefficient. While not critical for most model directories, you could consider listing the children once if performance becomes an issue, or using glob patterns which are used elsewhere in the file for similar checks.

The iterdir() at line 162 only runs when the directory itself is a model (early return path). When it's not, the second iterdir() at line 183 runs instead. They don't both execute for the same directory, so there's no redundant scan.

models_dir / "adapter_config.json"
).exists()
_has_weights = any(
f.suffix.lower() in (".gguf", ".safetensors", ".bin")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The list of weight extensions here (.gguf, .safetensors, .bin) is slightly inconsistent with other parts of the codebase (e.g., line 634 includes .pt and .pth). While these three are the most common for this application, ensuring a single source of truth for supported weight formats would be beneficial.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low

The list of weight extensions here (.gguf, .safetensors, .bin) is slightly inconsistent with other parts of the codebase (e.g., line 634 includes .pt and .pth). While these three are the most common for this application, ensuring a single source of truth for supported weight formats would be beneficial.

The extensions here (.gguf, .safetensors, .bin) are intentionally scoped to the formats relevant to local model detection. .pt and .pth at line 634 serve a different purpose (training checkpoint detection). A unified constant could help but is out of scope for this PR.

danielhanchen added a commit that referenced this pull request May 25, 2026
…guard for PR unslothai#5754

Round 9 reviewer flagged a pile of handoff asymmetries: every
GPU-owning lifecycle change (training, export, chat, images) needed
its own bespoke unload sequence and they had drifted out of sync.
Some skipped llama-server is_active; some missed safetensors
loading_models; export and training did not check is_export_active.

Backend handoff (P1)
  * routes/inference.py: new _release_chat_for / _release_export_for
    helpers. Both treat llama-server as held when is_loaded OR
    is_active, safetensors as held when active_model_name OR
    loading_models is non-empty, and export as held when
    current_checkpoint OR is_export_active. Both helpers run their
    unloads in worker threads so async routes do not block the
    event loop.
  * routes/training.py: replaces its bespoke inline llama / safe /
    export unload sequence with await _release_chat_for / _release_
    export_for.
  * routes/export.py: same swap for the chat unload chain (export
    still does NOT call _release_export_for on itself).
  * routes/inference.py GGUF + standard chat-load paths: now use
    _release_export_for to drop a settled export, and the standard
    path's llama unload now also handles is_active=True (round 9
    review #8).

Backend reject-on-active export (P1 #5)
  * routes/inference.py: new _raise_if_export_active. Symmetric
    with _raise_if_training_active: a long-running export is
    refused with HTTP 409 instead of being silently killed when
    /images/load or /load arrives. Diffusion / images load and
    both chat-load paths call it.
  * core/inference/diffusion.py _release_other_gpu_owners_for_
    diffusion: no longer tears down an in-flight export job. Only
    drops a SETTLED export checkpoint (current_checkpoint
    populated, is_export_active False). Round 9 review #5 -- the
    previous behavior could terminate an in-flight export and
    leave a partial output artifact.

Token leak via logger.exception (P1 #6)
  * core/inference/diffusion.py: load-failure logging now uses
    logger.error(..., exc_msg) with the already-scrubbed string
    and exc_info=False. logger.exception() with the raw Exception
    would expose any hf_... token that diffusers / huggingface_hub
    embedded in the message or traceback locals, defeating the
    earlier in-flight scrub.

Dependency pinning (P1 #11)
  * pyproject.toml: huggingfacenotorch optional extra now pins
    diffusers>=0.37.0. Previously the floor was only set in
    studio/backend/requirements/no-torch-runtime.txt, so a normal
    pip install would resolve diffusers 0.36.0 (no
    Flux2KleinPipeline) and the default curated FLUX.2 klein
    Images model would fail at runtime.

Cache-delete exact match (P1 #14)
  * routes/models.py /delete-cached: llama.cpp and safetensors
    guards now match on exact repo-id (case-insensitive) instead
    of prefix. Diffusion guard already does this; the chat guards
    were the remaining surface where loading org/model-v2
    blocked deleting org/model.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)``
now runs for every load mode, including the GGUF-with-auto-base
path. Round 19 only preflighted the full repo or an explicit
``base_repo``, so an auto-picked companion that turned out to be
gated / private / missing still unloaded the user's chat model
before ``from_pretrained`` failed. ``effective_base`` is the same
value that feeds every downstream allocation, so preflighting it
unconditionally catches all three modes.

P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the
``gguf`` package at construction time) is now built up front,
inside the same try block that surfaces "Re-run Studio setup".
Previously the missing-dependency exception fired AFTER
``_release_other_gpu_owners_for_diffusion`` and
``_release_chat_backend_for_diffusion`` had already taken the
chat / export models down. The downstream from_single_file call
reuses the same ``quant_config`` reference.

P1 #4: ``studio/backend/requirements/studio.txt`` now lists
``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in
the extras files, so fresh standard Studio installs failed on
/images/load with the round 20 P1 #2 dependency error message.

P1 #5: ``LoadRequest``, ``UnloadRequest``, and
``ValidateModelRequest`` now apply the same control-character +
embedded-HF-token validators that ``DiffusionLoadRequest``
already had. /api/inference/load, /api/inference/validate, and
/api/inference/unload used to accept newline / tab / control
characters in ``model_path`` (log-line smuggling) and URL-form
``https://hf_xxxxx@huggingface.co/...`` (credential leak through
structured log sinks).

P2 #6: ``_collapse_local`` in the diffusion load-error scrubber
now resolves relative candidates and adds the absolute form to
the substring set. A relative ``exports/my-flux`` used to leak
``/mnt/disks/.../exports/my-flux/...`` via downstream library
errors because the scrubber only matched the original literal.
Replacement is longest-first so a leaf-only context survives.

All 85 diffusion-relevant + 35 related model-validation tests
pass locally.

(P1 #3 cross-workload GPU handoff lock is deferred: deserves a
focused design pass across /images/load, /chat/load (both
branches), /training/start, and /export/load to pick a lock
boundary that does not deadlock against the backend load locks
or stall the SSE log stream.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and
``ValidateModelRequest._no_embedded_hf_tokens`` now cover
``gguf_variant`` in addition to ``model_path``. A caller could
pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into
structured log sinks via the GGUF resolver path; the matching
``DiffusionLoadRequest`` validator already covered every string
field, so this restores parity.

P1 #3: ``/api/inference/unload`` now also matches the llama
``loading_model_identifier`` when picking the GGUF branch. A
pending GGUF download (``is_active`` still False,
``loading_model_identifier`` populated) used to fall through to
the safetensors branch and respond ``status="unloaded"`` while
llama-server kept downloading.

P1 #4 + #5: the final safetensors-handoff sweeps (route-level
``_release_safetensors_chat_for`` and backend
``_release_chat_backend_for_diffusion``) now check ``active_model_name``
and ``loading_models`` WITHOUT the initial ``owned_names`` filter.
A concurrent ``/load`` that landed AFTER the snapshot was
previously ignored, so a chat model that began loading during the
unload window let training / export / GGUF chat / diffusion start
anyway and race the new chat for VRAM.

P2 #6: added ``_preflight_diffusers_subfolder_config`` and
invoked it for GGUF loads with a transformer class
(``effective_base``, ``"transformer"``). A custom base companion
that had ``model_index.json`` but lacked
``transformer/config.json`` previously passed the round 19
preflight, unloaded chat, then failed inside
``from_single_file``.

P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string
dict KEYS. Pydantic ``string_type`` errors surface ``input``
verbatim, and a malformed payload like
``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak
the token through the 422 response body.

All 85 diffusion-relevant + 35 model-validation tests pass
locally. Existing fakes for ``hf_hub_download`` updated to
accept the new ``subfolder=`` kwarg the round 21 preflight uses.

(P1 #3 cross-workload GPU handoff lock from round 20 is still
deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee,
which closes the most common race without the deadlock risk of
holding a process-wide lock across the entire load.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2 + #6: extended the chat / diffusion / training
identifier hardening to every export-side request model.
ExportCommonOptions (parent of ExportMergedModelRequest /
ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies
_no_control_chars and _reject_embedded_hf_token to repo_id and
base_model_id; ExportGGUFRequest gets the same on its repo_id
plus a control-char check on quantization_method; and
LoadCheckpointRequest validates checkpoint_path. Previously
"/api/export/*" accepted newline-smuggled identifiers and
URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor``
now use a shared ``_gpu_workload_busy_for_helper`` that gates on
diffusion (round 22 already), training, AND export. The round 22
guard only checked diffusion, so the dataset helper / advisor
could still load llama-server on top of an active training run
or a resident export checkpoint. Each step fails closed
(unverifiable status counts as busy) so the user's primary
workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also
applies the identifier hardening to repo_id; the publish path
previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to
routes/models.py and applied it to the path / query parameter
endpoints that flow into logger.info(...) calls --
``/config/{model_name}``, ``/check-vision/{model_name}``,
``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped
the validator's ValueError to HTTP 422 so the client sees the
same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and
``Diffusion load failed for %s`` log lines route ``repo_id`` /
``effective_base`` through ``_display_repo_id`` (collapses
absolute local paths to the leaf, still scrubs HF tokens)
instead of plain ``_redact_hf_tokens``. The error path was
already collapsed in the user-facing 400 / RuntimeError, but
the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass
locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_preflight_full_diffusers_repo(effective_base, hf_token)``
now runs for every load mode, including the GGUF-with-auto-base
path. Round 19 only preflighted the full repo or an explicit
``base_repo``, so an auto-picked companion that turned out to be
gated / private / missing still unloaded the user's chat model
before ``from_pretrained`` failed. ``effective_base`` is the same
value that feeds every downstream allocation, so preflighting it
unconditionally catches all three modes.

P1 #2: ``diffusers.GGUFQuantizationConfig`` (which imports the
``gguf`` package at construction time) is now built up front,
inside the same try block that surfaces "Re-run Studio setup".
Previously the missing-dependency exception fired AFTER
``_release_other_gpu_owners_for_diffusion`` and
``_release_chat_backend_for_diffusion`` had already taken the
chat / export models down. The downstream from_single_file call
reuses the same ``quant_config`` reference.

P1 #4: ``studio/backend/requirements/studio.txt`` now lists
``diffusers>=0.37.0`` and ``gguf>=0.10.0``. These were only in
the extras files, so fresh standard Studio installs failed on
/images/load with the round 20 P1 #2 dependency error message.

P1 #5: ``LoadRequest``, ``UnloadRequest``, and
``ValidateModelRequest`` now apply the same control-character +
embedded-HF-token validators that ``DiffusionLoadRequest``
already had. /api/inference/load, /api/inference/validate, and
/api/inference/unload used to accept newline / tab / control
characters in ``model_path`` (log-line smuggling) and URL-form
``https://hf_xxxxx@huggingface.co/...`` (credential leak through
structured log sinks).

P2 #6: ``_collapse_local`` in the diffusion load-error scrubber
now resolves relative candidates and adds the absolute form to
the substring set. A relative ``exports/my-flux`` used to leak
``/mnt/disks/.../exports/my-flux/...`` via downstream library
errors because the scrubber only matched the original literal.
Replacement is longest-first so a leaf-only context survives.

All 85 diffusion-relevant + 35 related model-validation tests
pass locally.

(P1 #3 cross-workload GPU handoff lock is deferred: deserves a
focused design pass across /images/load, /chat/load (both
branches), /training/start, and /export/load to pick a lock
boundary that does not deadlock against the backend load locks
or stall the SSE log stream.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and
``ValidateModelRequest._no_embedded_hf_tokens`` now cover
``gguf_variant`` in addition to ``model_path``. A caller could
pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into
structured log sinks via the GGUF resolver path; the matching
``DiffusionLoadRequest`` validator already covered every string
field, so this restores parity.

P1 #3: ``/api/inference/unload`` now also matches the llama
``loading_model_identifier`` when picking the GGUF branch. A
pending GGUF download (``is_active`` still False,
``loading_model_identifier`` populated) used to fall through to
the safetensors branch and respond ``status="unloaded"`` while
llama-server kept downloading.

P1 #4 + #5: the final safetensors-handoff sweeps (route-level
``_release_safetensors_chat_for`` and backend
``_release_chat_backend_for_diffusion``) now check ``active_model_name``
and ``loading_models`` WITHOUT the initial ``owned_names`` filter.
A concurrent ``/load`` that landed AFTER the snapshot was
previously ignored, so a chat model that began loading during the
unload window let training / export / GGUF chat / diffusion start
anyway and race the new chat for VRAM.

P2 #6: added ``_preflight_diffusers_subfolder_config`` and
invoked it for GGUF loads with a transformer class
(``effective_base``, ``"transformer"``). A custom base companion
that had ``model_index.json`` but lacked
``transformer/config.json`` previously passed the round 19
preflight, unloaded chat, then failed inside
``from_single_file``.

P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string
dict KEYS. Pydantic ``string_type`` errors surface ``input``
verbatim, and a malformed payload like
``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak
the token through the 422 response body.

All 85 diffusion-relevant + 35 model-validation tests pass
locally. Existing fakes for ``hf_hub_download`` updated to
accept the new ``subfolder=`` kwarg the round 21 preflight uses.

(P1 #3 cross-workload GPU handoff lock from round 20 is still
deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee,
which closes the most common race without the deadlock risk of
holding a process-wide lock across the entire load.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2 + #6: extended the chat / diffusion / training
identifier hardening to every export-side request model.
ExportCommonOptions (parent of ExportMergedModelRequest /
ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies
_no_control_chars and _reject_embedded_hf_token to repo_id and
base_model_id; ExportGGUFRequest gets the same on its repo_id
plus a control-char check on quantization_method; and
LoadCheckpointRequest validates checkpoint_path. Previously
"/api/export/*" accepted newline-smuggled identifiers and
URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor``
now use a shared ``_gpu_workload_busy_for_helper`` that gates on
diffusion (round 22 already), training, AND export. The round 22
guard only checked diffusion, so the dataset helper / advisor
could still load llama-server on top of an active training run
or a resident export checkpoint. Each step fails closed
(unverifiable status counts as busy) so the user's primary
workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also
applies the identifier hardening to repo_id; the publish path
previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to
routes/models.py and applied it to the path / query parameter
endpoints that flow into logger.info(...) calls --
``/config/{model_name}``, ``/check-vision/{model_name}``,
``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped
the validator's ValueError to HTTP 422 so the client sees the
same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and
``Diffusion load failed for %s`` log lines route ``repo_id`` /
``effective_base`` through ``_display_repo_id`` (collapses
absolute local paths to the leaf, still scrubs HF tokens)
instead of plain ``_redact_hf_tokens``. The error path was
already collapsed in the user-facing 400 / RuntimeError, but
the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass
locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
…hai#5754)

Round 25 P1 #5 bumped studio.txt huggingface-hub to >=1.3.0,<2.0 but

single-env/constraints.txt still pinned ==0.36.2, which made fresh

Studio Update CI / Mac Studio Update CI / Mac Studio UI CI fail at

the studio-deps install step with ResolutionImpossible. Bump the

constraint pin to ==1.8.0 to match the setup.sh / setup.ps1 t5

sub-env pins and satisfy the new studio.txt floor.
danielhanchen added a commit that referenced this pull request May 25, 2026
Twelve P1 findings from round 26 reviewer aggregate, plus the CI
revert of round 25 P1 #5 to a less invasive location.

1. requirements/studio.txt + requirements/single-env/constraints.txt:
   revert the round 25 huggingface-hub bump (broke Studio Update CI,
   Mac Studio Update CI, Mac Studio UI CI, Studio UI CI all with
   ResolutionImpossible against transformers==4.57.6 which requires
   hub<1.0). Standard install path stays on the well-tested 4.57.6 +
   0.36.2 + trl 0.23.1 trio.

2. requirements/no-torch-runtime.txt + pyproject.toml
   [huggingfacenotorch]: bump huggingface_hub floor from >=0.34.0 to
   >=1.3.0,<2.0 -- this is where the actual transformers 5.x +
   hub 0.36.2 broken combo can land because the file installs
   --no-deps. transformers 5.x calls hub.is_offline_mode which only
   exists in hub 1.x.

3. utils/datasets/llm_assist.py: revert round 25 P1 #4 (helper/advisor
   sharing the global llama backend) which introduced three
   regressions: a chat-evict load race after the busy precheck, a
   finally-block that could unload a user chat model, and an
   identifier mismatch the delete guard could not canonicalize. Go
   back to PRIVATE LlamaCppBackend instances and expose the active
   helper/advisor repos through a new thread-safe registry
   (helper_advisor_owns_repo / _register_helper_advisor_repo /
   _unregister_helper_advisor_repo) so DELETE /api/models/delete-cached
   can still block the rmtree.

4. routes/models.py delete_cached_model: check the new helper/advisor
   registry up front and 409 if a helper/advisor still owns the
   target repo. Closes round 26 P1 #13 and #14 (helper/advisor
   identifiers were prefixed and would never equal the raw repo id).

5. routes/models.py get_lora_base_model: validate lora_path with
   _validate_logged_identifier before it is reflected in 404 detail
   and error logs (round 26 P1 #12).

6. routes/inference.py /unload: round 21 P1 #3 added a "or not
   is_loaded" fallback that let an unload of owner/B cancel a pending
   llama load of owner/A. Replace it with a narrow
   llama_is_starting_without_identifier branch that only fires when
   llama-server is mid-startup with neither identifier set (round 26
   P1 #5).

7. routes/inference.py /unload: poll loading_model_identifier for up
   to 5 s after asyncio.to_thread(unload_model) so a legitimate
   pending-load cancel does not 503 because the load thread has not
   yet observed _cancel_event in its finally (round 26 P2 #15).

8. models/training.py TrainingStartRequest: extend identifier
   hardening to hf_dataset, subset, train_split, eval_split. Round 22
   only guarded model_name (round 26 P1 #10).

9. models/data_recipe.py SeedInspectRequest: add _no_control_chars +
   _reject_embedded_hf_token field_validators on dataset_name (round
   26 P1 #11).

Tests: 105 targeted (diffusion + cached_gguf + llama_cpp_cache +
inference_model_validation + models_get_model_config) and 1768
broader backend tests pass locally. Pre-existing
test_desktop_auth.py, test_studio_api.py, and
test_training_worker_flash_attn.py failures reproduce on HEAD
without these changes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Five actionable findings from round 29 reviewer aggregate, plus an
origin/main merge that absorbs the chat_templates.py fix landed in
PR unslothai#5763. Skipped #4 / #5 (studio.txt + constraints.txt hub bump)
because CI evidence from round 26 contradicts that suggestion; the
real broken combo only happens via the --no-deps no-torch path
which is already bumped in no-torch-runtime.txt + pyproject.toml.

1. core/inference/diffusion.py: round 28 reordered
   _release_chat_backend_for_diffusion BEFORE
   _release_other_gpu_owners_for_diffusion to surface the helper /
   advisor busy check early, but that meant the chat unload inside
   _release_chat_backend_for_diffusion now fired before the
   training / export conflict check in the second helper. A direct
   backend caller (tests, scripts) or a route-precheck race with a
   newly-started training run would then unload the user's chat and
   then 409 with nothing loaded. Split the helper busy check into
   _raise_if_helper_advisor_busy_for_diffusion (cheap, no side
   effects), keep _release_chat_backend_for_diffusion as the
   actual chat unload with an opt-out flag, and reorder load_model
   to: (a) helper check, (b) training / export check + idle export
   shutdown, (c) chat unload. All raises now fire BEFORE any
   destructive unload.

2. Merge origin/main: absorbs af6504f (PR unslothai#5763
   chat_templates.py find() guards + the new
   tests/python/test_construct_chat_template_validation.py
   regression test). Removes the 101-line stale-rebase silent
   revert that round 29 reviewer 5 and 8 flagged.

3. frontend/src/features/images/images-page.tsx: supportsNegativePrompt
   now also honours customFamily when no model is loaded yet, so a
   Custom HF repo with family flux.2 / flux.2-klein correctly hides
   the negative prompt field instead of silently sending it.

4. routes/inference.py /images/generate: report the ACTUAL PNG
   width / height from PIL Image.size instead of echoing back the
   requested payload values. FLUX-family pipelines round to
   vae_scale_factor * 2, so a request for 520x520 lands as 512x512
   internally; metadata now matches the bytes on the wire.

Tests: 98 targeted (diffusion + cached_gguf + inference_validation)
and frontend npm run typecheck pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
Four actionable findings from round 30. Skipped P1 #1 / #2 / #3
(huggingface-hub bump in studio.txt / single-env / colab-new) because
the live B200 Studio that successfully generated FLUX.2 klein images
runs the exact combo the reviewer flags as broken:
    huggingface_hub 0.36.2 + transformers 4.57.6 + diffusers 0.37.1
    Flux2KleinPipeline: True (imports cleanly)
The is_offline_mode ImportError only fires with transformers 5.x, and
the standard install path pins transformers==4.57.6 via constraints.
The round 26 fix bumped no-torch-runtime.txt + pyproject huggingfacenotorch
where the --no-deps install path can land on transformers 5.x; that
remains the correct surface.

1. core/inference/diffusion.py: preflight transformers + accelerate
   via importlib.util.find_spec BEFORE any destructive GPU-owner
   unload. Diffusers can expose stub pipeline classes when
   transformers / accelerate are missing, so the load used to drop
   chat first and fail later inside from_pretrained. find_spec
   keeps existing tests that stub these modules passing because no
   real module is executed (round 30 P1 #11).

2. models/export.py ExportGGUFRequest.quantization_method: extend
   the embedded HF token validator to this field too. Round 23
   added the control-char guard but not the token guard; the value
   is forwarded into worker command lines and reflected in error /
   success text (round 30 P1 #5).

3. models/data_recipe.py SeedInspectUploadRequest: add
   _no_control_chars + _reject_embedded_hf_token field_validators
   to filename and to each entry of file_names. Mirrors the sibling
   SeedInspectRequest.dataset_name hardening (round 30 P1 #6).

4. frontend/src/features/images/images-page.tsx: defer the initial
   refreshStatus() call via queueMicrotask so the synchronous
   setRefreshingStatus(true) inside it does not trip the
   react-hooks/set-state-in-effect lint on mount (round 30 P2 #12).

Deferred (need larger surgery / out of scope for this round):
   P1 #4 native_path_lease for diffusion local-path loads
   P1 #7-#10 helper/advisor + public-start window mutual lock symmetry

Tests: 98 targeted (diffusion + cached_gguf + inference_validation)
pass locally; frontend npm run typecheck passes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Two universal-consensus round-31 reviewer findings.

Concurrency: /images/load was leaking the public-load pending
counter on any pre-finally HTTPException (round 31 P1 #1, 11/12
votes). _raise_if_helper_advisor_busy("diffusion") published the
counter, then _resolve_diffusion_repo_for_request ran outside the
clearing try/finally. A request like repo_id="/tmp/model" with no
native_path_lease returned 400 and left public_load_pending() true
until process restart, permanently blocking AI Assist. Fix mirrors
the training / export pattern: track diffusion_load_window_published
in an outer try, publish the flag right after the helper-busy
check succeeds, and clear in an outer finally that only fires when
the flag is set. This also closes round 31 P1 #6: a second
request's failure can no longer decrement a still-active first
request's counter, because the second request has not yet flipped
its own publish flag.

Security: _looks_like_local_diffusion_path missed cwd-relative
directories (round 31 P1 #2, 8/12 votes). DiffusionBackend.
load_model accepts repo_id="exports/my-flux" as a local directory
via Path(repo_id).expanduser().is_dir(), but the detector only
flagged values starting with /, ~, ./, ../, backslash, or
absolute. Tightened the detector to also reject:
  * weight-file suffixes (.gguf / .safetensors / .bin / .pt / .pth)
  * non-2-segment values (`owner`, `a/b/c`, `owner/`, `/repo`, `//`)
  * 2-segment values whose parts are `.` or `..`
  * 2-segment values that actually resolve to an existing local
    path under backend CWD (last-resort exists() probe).
The existence probe is a minor side-channel for an already-
authenticated caller, accepted in exchange for closing the silent
bypass of the new lease boundary. Valid Hub ids like
unsloth/FLUX.2-klein-base-4B-GGUF, microsoft/Phi-3.5-mini-instruct
still pass through unchanged.

Skipped (consistent with prior rounds):
  * R31 P1 #3 (Tauri / native lease enum missing
    `load-diffusion-model` op): architectural surface; defer until
    the Images page actually surfaces a local-path picker.
  * R31 P1 #4-#5, #8: studio.txt / constraints.txt / pyproject hub
    pins. Live B200 install path with huggingface_hub==0.36.2,
    transformers==4.57.6, diffusers==0.37.1 imports
    Flux2KleinPipeline cleanly. The is_offline_mode import error
    only triggers when transformers 5.x is paired with hub 0.x,
    which the constraints pin prevents.
  * R31 P1 #7 (find_spec vs real import): a full transformers
    import at module load breaks tests that stub huggingface_hub;
    find_spec is the existing tradeoff.

98 targeted backend tests pass (test_diffusion_routes,
test_diffusion_backend, test_inference_model_validation,
test_models_get_model_config_case_resolution, test_data_recipe_seed,
test_training_raw_support, test_export_log_cursor).
danielhanchen added a commit that referenced this pull request May 25, 2026
Two round-33 reviewer findings: hub-floor consistency and the
multipart upload filename validator gap.

Dependencies: reverted the round-26 huggingface_hub>=1.3.0 floor
in no-torch-runtime.txt and pyproject.toml (round 33 P1 #1-#5,
4/12 vote consensus). studio.txt forces huggingface_hub==0.36.2
to match the transformers==4.57.6 pin in extras-no-deps.txt, so
the 1.3.0 floor was internally inconsistent. Reviewers
reproduced the resolver conflict on a fresh install.

Empirical justification (re-verified on the live B200 host before
the revert): huggingface_hub 0.36.2 + transformers 4.57.6 +
diffusers 0.37.1 imports Flux2KleinPipeline cleanly and runs
end-to-end image generation. transformers 4.57.6 carries its own
transformers.utils.hub.is_offline_mode and does not actually need
huggingface_hub.is_offline_mode at import time. The original bump
was guarding against the (never-realised) transformers 5.x path,
which extras-no-deps explicitly pins away.

Validation: multipart /seed/upload-unstructured-file now applies
the same _no_control_chars and _reject_embedded_hf_token checks
to file.filename that SeedInspectUploadRequest.filename already
applies in the JSON variant (round 33 P1 #7). The filename is
reflected back to the client, persisted in the per-file meta
JSON, and echoed by error responses, so the JSON-side hardening
must not be asymmetric with the multipart path.

Skipped (consistent with prior rounds):
  * Find_spec vs full import (R33 P1 #6): preserves test
    compatibility with the huggingface_hub stub fixture.
  * React hooks set-state-in-effect lint (R33 P1 #8): codebase
    has 146 pre-existing violations of the same rule;
    studio-frontend-ci does not gate on lint.
  * Direct DiffusionBackend.load_model bypass (R33 P1 #9): the
    route is the only production entry point, and the backend
    helper now publishes its own diffusion-backend pending tag
    (round 32 P1 #3). Direct-caller hardening would require
    duplicating the lease check into load_model itself, which
    is out of scope for the route-layer security boundary.
  * One-segment Hub IDs (R33 P2 #10): strict 2-segment Hub id
    check is intentional; one-segment names are not valid Hub
    ids.
  * Cwd-relative shadow of Hub IDs (R33 P2 #11): documented
    side-channel tradeoff accepted in round 31 commit msg.

97 targeted backend tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants