Skip to content

security: default Studio host to 127.0.0.1 and prompt before auto-start#7

Closed
danielhanchen wants to merge 3 commits into
mainfrom
pr-4864-code
Closed

security: default Studio host to 127.0.0.1 and prompt before auto-start#7
danielhanchen wants to merge 3 commits into
mainfrom
pr-4864-code

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Staging mirror of unslothai#4864

Original PR: unslothai#4864
Author: @Bedrovelsen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Closes unslothai#4684.

Previously Unsloth Studio bound to 0.0.0.0 (all interfaces) by default and the installer silently auto-started a server at the end of install, contradicting the documented "privacy first / 100% offline and locally" guarantee and exposing the service on the network without user consent.

Changes:

  • studio/backend/run.py: change run_server() default host to 127.0.0.1; extract argparse setup into _make_argument_parser() for testability
  • unsloth_cli/commands/studio.py: change typer.Option default to 127.0.0.1
  • install.sh: remove -H 0.0.0.0 from generated launch-studio.sh launcher template; replace silent auto-start with a [Y/n] prompt
  • install.ps1: replace silent auto-start with a Read-Host [Y/n] prompt
  • README.md: simplify launch examples to unsloth studio (no -H flag); note that -H 0.0.0.0 is available for cloud/network use

Users who need all-interface binding (cloud VMs, LAN sharing) can still pass -H 0.0.0.0 explicitly. No other logic was changed: _is_port_free(), startup_banner, and health-check paths all already handle 127.0.0.1 correctly.

Tests (TDD — written before implementation):

  • studio/backend/tests/test_host_defaults.py: AST inspection of run_server() parameter default and argparse --host default
  • tests/studio/test_cli_studio_defaults.py: AST inspection of typer.Option default for studio_default() --host parameter
  • tests/sh/test_install_host_defaults.sh: static analysis of installer scripts and README launch section

https://claude.ai/code/session_012umxRmBdeDV5U7Xhm1utu6


This PR contains code changes only (5 files). Test changes are in a separate PR.

Changed files:

  • README.md
  • install.ps1
  • install.sh
  • studio/backend/run.py
  • unsloth_cli/commands/studio.py

claude and others added 2 commits April 5, 2026 20:29
Closes unslothai#4684.

Previously Unsloth Studio bound to 0.0.0.0 (all interfaces) by default
and the installer silently auto-started a server at the end of install,
contradicting the documented "privacy first / 100% offline and locally"
guarantee and exposing the service on the network without user consent.

Changes:
- studio/backend/run.py: change run_server() default host to 127.0.0.1;
  extract argparse setup into _make_argument_parser() for testability
- unsloth_cli/commands/studio.py: change typer.Option default to 127.0.0.1
- install.sh: remove -H 0.0.0.0 from generated launch-studio.sh launcher
  template; replace silent auto-start with a [Y/n] prompt
- install.ps1: replace silent auto-start with a Read-Host [Y/n] prompt
- README.md: simplify launch examples to `unsloth studio` (no -H flag);
  note that -H 0.0.0.0 is available for cloud/network use

Users who need all-interface binding (cloud VMs, LAN sharing) can still
pass -H 0.0.0.0 explicitly. No other logic was changed: _is_port_free(),
startup_banner, and health-check paths all already handle 127.0.0.1
correctly.

Tests (TDD — written before implementation):
- studio/backend/tests/test_host_defaults.py: AST inspection of
  run_server() parameter default and argparse --host default
- tests/studio/test_cli_studio_defaults.py: AST inspection of
  typer.Option default for studio_default() --host parameter
- tests/sh/test_install_host_defaults.sh: static analysis of installer
  scripts and README launch section

https://claude.ai/code/session_012umxRmBdeDV5U7Xhm1utu6
Comment thread studio/backend/run.py
if sys.platform == "win32" and hasattr(sys.stderr, "reconfigure"):
try:
sys.stderr.reconfigure(encoding = "utf-8", errors = "replace")
except Exception:
danielhanchen added a commit that referenced this pull request May 25, 2026
…for PR unslothai#5754

Round 7 reviewer surfaced a handful of swap-window races, fail-open
guards, and seed precision mismatches. This commit closes them.

Lifecycle / state (P1)
  * core/inference/diffusion.py: status() now emits active_repo_id,
    active_base_repo, pending_repo_id, pending_base_repo, and
    pending_gguf_filename alongside the existing UI-facing fields.
    During a swap (model A loaded, model B loading) the previous
    coalesced 'repo_id or pending_repo_id' hid the loading target
    from delete guards. Splitting the fields lets guards block
    deletion of either repo currently owned by the backend.
  * core/inference/diffusion.py: generate_image() now takes
    _generate_lock BEFORE snapshotting _pipe / _device. Snapshotting
    outside the lock let a concurrent unload/load clear or replace
    the backend between the snapshot and the forward, so the freed
    or swapped pipeline would still run.

Symmetric handoffs (P1)
  * routes/export.py: training-active check now runs BEFORE the
    chat / inference / diffusion unload helpers, so a 409 does not
    leave the user's chat session torn down for nothing. Also
    explicitly fails CLOSED with 503 when is_training_active()
    raises.
  * routes/inference.py: _raise_if_training_active now fails closed
    with 503 when the training backend is importable but its status
    check raises. The previous best-effort log-and-continue could
    let chat / diffusion loads collide with unverifiable training.

Delete guards (P1)
  * routes/models.py /delete-cached: chat guard now also blocks
    when llama-server is_active (i.e. mid-download) and when the
    inference backend's loading_models set contains the target.
    Round 7 review #7 flagged that the PR's diffusion-side loading
    guard had no chat-side parallel, so deleting a chat repo while
    it was downloading could still race the cache.
  * routes/models.py /delete-cached: diffusion guard iterates the
    new active_* + pending_* status fields so a delete during a
    swap is refused on either repo.
  * routes/models.py /delete-finetuned: same active_+ pending
    handling, plus the guard now also refuses deletes of a parent
    directory that contains the loaded pipeline (round 7 review #6:
    rm -rf /exports/flux-model/ could unlink model_index.json that
    the live pipeline is reading via mmap).

Seed precision (P2)
  * models/inference.py + routes/inference.py: DiffusionGenerate-
    Response now carries seed_str alongside the existing numeric
    seed. Seeds above Number.MAX_SAFE_INTEGER are rounded by
    JSON.parse in the browser; seed_str ships full decimal
    precision for display and reproduction.
  * frontend/api.ts: DiffusionGenerateResponse types seed_str;
    images-page.tsx prefers seed_str over seed in the figure
    caption so the displayed value reproduces the image.
  * frontend/api.ts: stringifyWithBigInt no longer regex-replaces
    sentinel strings over the full JSON output. It pulls the seed
    BigInt out, JSON-serialises the remaining payload, and splices
    the seed's decimal digits into the resulting object literal at
    the known position. Avoids the round 7 #10 case where a
    user-supplied prompt equal to '__bigint__:123' was rewritten
    into a JSON integer and rejected as a non-string prompt.

Custom HF repo (P2)
  * frontend/images-page.tsx: custom panel now exposes a 'Base
    diffusers repo' input that maps to DiffusionLoadRequest.
    base_repo. Required when a private / mirrored GGUF needs a
    non-default base (e.g. a 9B Klein transformer would otherwise
    fall back to the 4B base default).
danielhanchen added a commit that referenced this pull request May 25, 2026
…t helper for PR unslothai#5754

Round 10 reviewers found the round 9 export helpers had a
destructive bug: _release_export_for treated is_export_active=True
as a shutdown condition, so any caller (training, chat, diffusion)
could terminate an in-flight export and corrupt the user's output.
Conversely _raise_if_export_active raised 409 on a settled
checkpoint, blocking idle cleanup.

Backend (P1)
  * routes/inference.py: split the export-active surface in two:
      _raise_if_export_active() now ONLY raises when
      is_export_active() is True. A settled current_checkpoint is
      treated as held GPU memory, not an active job.
      _release_export_for() now ONLY shuts down when
      current_checkpoint is set AND is_export_active() is False
      (i.e. a previously completed checkpoint just holding memory).
      An unknown / unverifiable is_export_active is treated as
      'might still be active' so the helper refuses to drop.
  * routes/training.py: now calls _raise_if_export_active before
    _release_chat_for / _release_export_for, mirroring the chat
    and diffusion paths. The previous code went straight to
    _release_export_for and would kill an in-flight export.
  * routes/inference.py: split _release_chat_for into
    _release_llama_for and _release_safetensors_chat_for so the
    GGUF chat-load path can release only the OTHER chat backend
    (round 10 review #4: the previous inline 'if active_model_name'
    check skipped loading_models and let an in-flight safetensors
    load race the new GGUF allocation).
  * routes/inference.py: _raise_if_export_active now fails CLOSED
    (503) when is_export_active() raises, not only when
    get_export_backend() raises. Round 10 review #7.

Dependencies (P1)
  * pyproject.toml huggingfacenotorch extra: pin gguf. The
    Studio Images default curated picker is GGUF-only and
    diffusers.GGUFQuantizationConfig + from_single_file require
    the standalone gguf package at runtime; missing it would 500
    on the first /api/inference/images/load with
    'gguf>=0.10.0 is required'.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_safetensors_chat_for`` now re-reads
``active_model_name`` and ``loading_models`` after each unload AND
runs a final sweep against the initial owned-name set. The previous
helper trusted ``unload_model() -> True`` even though the
orchestrator can respond ``unloaded`` while still holding weights
or a concurrent ``load`` can repopulate the tracker between calls.
Per-name and global post-state mismatches now raise HTTP 503 so
the caller retries.

P1 #2: same post-state guarantee inside
``_release_chat_backend_for_diffusion`` for direct backend
callers. ``DiffusionBackend.load_model`` now raises RuntimeError
when the safetensors tracker still owns a previously-resident
name after the unload, matching the route-level helper. The route
layer's existing classifier maps the new wording to HTTP 503.

P1 #3: ``DiffusionBackend.load_model`` now preflights the full
diffusers repo (or explicit GGUF ``base_repo``) via
``hf_hub_download(filename="model_index.json")`` BEFORE the
chat / export unload runs. The GGUF path was already covered by
the existing ``hf_hub_download(gguf_filename)`` round-trip; the
full-repo path used to skip validation and let a typo / private /
gated repo only surface inside ``from_pretrained`` AFTER the
user's chat model was already dropped. Local paths are checked
structurally (must be a directory containing ``model_index.json``)
so we do not network-round-trip for an on-disk miss. Error
messages route through ``_display_repo_id`` so an absolute
filesystem path does not leak the operator's layout.

P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint)
now treats ``unload_model() -> False`` AND a leftover state
(``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for
GGUF, ``active_model_name`` / ``loading_models`` for safetensors)
as 503 instead of unconditionally responding
``status="unloaded"``. The UI used to show the model as gone while
the backend still owned VRAM.

P2 #7: extended the /images/load RuntimeError -> HTTPException
marker list with ``still active or loading after unload`` and
``still loading after unload``. Round 18 introduced these exact
phrasings on the backend side; without the extension a retryable
unload failure was returning HTTP 400 to the user instead of 503.

P2 #8: removed the unused ``unsloth_backend = get_inference_backend()``
eager construction in the GGUF chat-load branch. Eager
construction made the GGUF-only path needlessly fail or pay
startup cost when the safetensors backend was unavailable / lazy;
``_release_safetensors_chat_for`` already handles that case as a
no-op.

All 85 diffusion-relevant + 98 related backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and
``ValidateModelRequest._no_embedded_hf_tokens`` now cover
``gguf_variant`` in addition to ``model_path``. A caller could
pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into
structured log sinks via the GGUF resolver path; the matching
``DiffusionLoadRequest`` validator already covered every string
field, so this restores parity.

P1 #3: ``/api/inference/unload`` now also matches the llama
``loading_model_identifier`` when picking the GGUF branch. A
pending GGUF download (``is_active`` still False,
``loading_model_identifier`` populated) used to fall through to
the safetensors branch and respond ``status="unloaded"`` while
llama-server kept downloading.

P1 #4 + #5: the final safetensors-handoff sweeps (route-level
``_release_safetensors_chat_for`` and backend
``_release_chat_backend_for_diffusion``) now check ``active_model_name``
and ``loading_models`` WITHOUT the initial ``owned_names`` filter.
A concurrent ``/load`` that landed AFTER the snapshot was
previously ignored, so a chat model that began loading during the
unload window let training / export / GGUF chat / diffusion start
anyway and race the new chat for VRAM.

P2 #6: added ``_preflight_diffusers_subfolder_config`` and
invoked it for GGUF loads with a transformer class
(``effective_base``, ``"transformer"``). A custom base companion
that had ``model_index.json`` but lacked
``transformer/config.json`` previously passed the round 19
preflight, unloaded chat, then failed inside
``from_single_file``.

P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string
dict KEYS. Pydantic ``string_type`` errors surface ``input``
verbatim, and a malformed payload like
``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak
the token through the 422 response body.

All 85 diffusion-relevant + 35 model-validation tests pass
locally. Existing fakes for ``hf_hub_download`` updated to
accept the new ``subfolder=`` kwarg the round 21 preflight uses.

(P1 #3 cross-workload GPU handoff lock from round 20 is still
deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee,
which closes the most common race without the deadlock risk of
holding a process-wide lock across the entire load.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2 + #6: extended the chat / diffusion / training
identifier hardening to every export-side request model.
ExportCommonOptions (parent of ExportMergedModelRequest /
ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies
_no_control_chars and _reject_embedded_hf_token to repo_id and
base_model_id; ExportGGUFRequest gets the same on its repo_id
plus a control-char check on quantization_method; and
LoadCheckpointRequest validates checkpoint_path. Previously
"/api/export/*" accepted newline-smuggled identifiers and
URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor``
now use a shared ``_gpu_workload_busy_for_helper`` that gates on
diffusion (round 22 already), training, AND export. The round 22
guard only checked diffusion, so the dataset helper / advisor
could still load llama-server on top of an active training run
or a resident export checkpoint. Each step fails closed
(unverifiable status counts as busy) so the user's primary
workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also
applies the identifier hardening to repo_id; the publish path
previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to
routes/models.py and applied it to the path / query parameter
endpoints that flow into logger.info(...) calls --
``/config/{model_name}``, ``/check-vision/{model_name}``,
``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped
the validator's ValueError to HTTP 422 so the client sees the
same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and
``Diffusion load failed for %s`` log lines route ``repo_id`` /
``effective_base`` through ``_display_repo_id`` (collapses
absolute local paths to the leaf, still scrubs HF tokens)
instead of plain ``_redact_hf_tokens``. The error path was
already collapsed in the user-facing 400 / RuntimeError, but
the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass
locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_llama_for()`` now verifies ``llama.unload_model``
did not return False AND that ``is_loaded`` / ``is_active`` /
``loading_model_identifier`` are all cleared after the call. The
previous version only treated raised exceptions as failure, so a
subprocess refusing to terminate or an in-flight GGUF download
let the next workload allocate on top.

P1 #2: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
now raises RuntimeError when ``exp._shutdown_subprocess`` fails on
a settled checkpoint. Direct backend callers used to log at debug
level and proceed toward diffusion allocation while the export
checkpoint still owned VRAM.

P1 #3 + P1 #7: ``/images/load`` no longer drops chat + idle export
before the cheap backend validation runs. ``DiffusionBackend.load_model``
already calls the strict ``_release_other_gpu_owners_for_diffusion``
and ``_release_chat_backend_for_diffusion`` helpers AFTER family
inference and GGUF filename checks pass, so the GPU is still
freed before allocation and a malformed payload no longer
silently unloads the user's chat / chat-export pair.

P1 #4: ``_release_chat_backend_for_diffusion`` now also rejects a
post-unload state where ``loading_model_identifier`` is still set,
matching the route-level ``_release_llama_for`` strictness. A GGUF
download mid-flight before the diffusion handoff used to slip
through and end up double-owning VRAM after diffusion allocated.

P1 #5: ``_release_diffusion_for`` no longer swallows a post-unload
``status()`` failure as ``after = {}``. Training / chat / export
handoffs need proof that the diffusion pipeline released VRAM;
the helper now raises HTTP 503 when the verification status call
itself raises, so the caller retries.

P1 #6: ``DiffusionBackend._release_other_gpu_owners_for_diffusion``
raises RuntimeError when ``get_export_backend()`` itself raises.
Direct backend callers used to silently ``return`` here and
proceed to GPU allocation without being able to verify export
ownership.

P1 #8: ``/training/start`` releases settled export BEFORE chat,
matching the chat-load helpers. If idle export shutdown fails the
user's chat model is preserved instead of being dropped for a
training run that never starts.

P2 #9: GGUF load-error scrubber also collapses ``local_gguf_path``,
the resolved HF cache path passed to
``transformer_cls.from_single_file()``. Without this an exception
like ``OSError: cannot load /home/alice/.cache/huggingface/.../flux.gguf``
would leak the operator's filesystem layout through ``last_error``
and ``/images/status``.

All 85 diffusion-relevant backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_release_safetensors_chat_for`` now re-reads
``active_model_name`` and ``loading_models`` after each unload AND
runs a final sweep against the initial owned-name set. The previous
helper trusted ``unload_model() -> True`` even though the
orchestrator can respond ``unloaded`` while still holding weights
or a concurrent ``load`` can repopulate the tracker between calls.
Per-name and global post-state mismatches now raise HTTP 503 so
the caller retries.

P1 #2: same post-state guarantee inside
``_release_chat_backend_for_diffusion`` for direct backend
callers. ``DiffusionBackend.load_model`` now raises RuntimeError
when the safetensors tracker still owns a previously-resident
name after the unload, matching the route-level helper. The route
layer's existing classifier maps the new wording to HTTP 503.

P1 #3: ``DiffusionBackend.load_model`` now preflights the full
diffusers repo (or explicit GGUF ``base_repo``) via
``hf_hub_download(filename="model_index.json")`` BEFORE the
chat / export unload runs. The GGUF path was already covered by
the existing ``hf_hub_download(gguf_filename)`` round-trip; the
full-repo path used to skip validation and let a typo / private /
gated repo only surface inside ``from_pretrained`` AFTER the
user's chat model was already dropped. Local paths are checked
structurally (must be a directory containing ``model_index.json``)
so we do not network-round-trip for an on-disk miss. Error
messages route through ``_display_repo_id`` so an absolute
filesystem path does not leak the operator's layout.

P1 #6: ``/api/inference/unload`` (the direct chat unload endpoint)
now treats ``unload_model() -> False`` AND a leftover state
(``is_loaded`` / ``is_active`` / ``loading_model_identifier`` for
GGUF, ``active_model_name`` / ``loading_models`` for safetensors)
as 503 instead of unconditionally responding
``status="unloaded"``. The UI used to show the model as gone while
the backend still owned VRAM.

P2 #7: extended the /images/load RuntimeError -> HTTPException
marker list with ``still active or loading after unload`` and
``still loading after unload``. Round 18 introduced these exact
phrasings on the backend side; without the extension a retryable
unload failure was returning HTTP 400 to the user instead of 503.

P2 #8: removed the unused ``unsloth_backend = get_inference_backend()``
eager construction in the GGUF chat-load branch. Eager
construction made the GGUF-only path needlessly fail or pay
startup cost when the safetensors backend was unavailable / lazy;
``_release_safetensors_chat_for`` already handles that case as a
no-op.

All 85 diffusion-relevant + 98 related backend tests pass locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2: ``LoadRequest._no_embedded_hf_tokens`` and
``ValidateModelRequest._no_embedded_hf_tokens`` now cover
``gguf_variant`` in addition to ``model_path``. A caller could
pass a variant like ``Q4_K_M-hf_xxxxxxxx`` that flowed into
structured log sinks via the GGUF resolver path; the matching
``DiffusionLoadRequest`` validator already covered every string
field, so this restores parity.

P1 #3: ``/api/inference/unload`` now also matches the llama
``loading_model_identifier`` when picking the GGUF branch. A
pending GGUF download (``is_active`` still False,
``loading_model_identifier`` populated) used to fall through to
the safetensors branch and respond ``status="unloaded"`` while
llama-server kept downloading.

P1 #4 + #5: the final safetensors-handoff sweeps (route-level
``_release_safetensors_chat_for`` and backend
``_release_chat_backend_for_diffusion``) now check ``active_model_name``
and ``loading_models`` WITHOUT the initial ``owned_names`` filter.
A concurrent ``/load`` that landed AFTER the snapshot was
previously ignored, so a chat model that began loading during the
unload window let training / export / GGUF chat / diffusion start
anyway and race the new chat for VRAM.

P2 #6: added ``_preflight_diffusers_subfolder_config`` and
invoked it for GGUF loads with a transformer class
(``effective_base``, ``"transformer"``). A custom base companion
that had ``model_index.json`` but lacked
``transformer/config.json`` previously passed the round 19
preflight, unloaded chat, then failed inside
``from_single_file``.

P2 #7: ``_scrub_validation_obj`` in main.py also scrubs string
dict KEYS. Pydantic ``string_type`` errors surface ``input``
verbatim, and a malformed payload like
``{"repo_id": {"hf_xxxxx": "owner/repo"}}`` would otherwise leak
the token through the 422 response body.

All 85 diffusion-relevant + 35 model-validation tests pass
locally. Existing fakes for ``hf_hub_download`` updated to
accept the new ``subfolder=`` kwarg the round 21 preflight uses.

(P1 #3 cross-workload GPU handoff lock from round 20 is still
deferred; round 21's P1 #4 / #5 raised the sweep-level guarantee,
which closes the most common race without the deadlock risk of
holding a process-wide lock across the entire load.)
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1 + #2 + #6: extended the chat / diffusion / training
identifier hardening to every export-side request model.
ExportCommonOptions (parent of ExportMergedModelRequest /
ExportBaseModelRequest / ExportLoRAAdapterRequest) now applies
_no_control_chars and _reject_embedded_hf_token to repo_id and
base_model_id; ExportGGUFRequest gets the same on its repo_id
plus a control-char check on quantization_method; and
LoadCheckpointRequest validates checkpoint_path. Previously
"/api/export/*" accepted newline-smuggled identifiers and
URL-form ``hf_xxxxx`` tokens that flowed into log lines.

P1 #3 + #4: ``_run_with_helper`` and ``_run_multi_pass_advisor``
now use a shared ``_gpu_workload_busy_for_helper`` that gates on
diffusion (round 22 already), training, AND export. The round 22
guard only checked diffusion, so the dataset helper / advisor
could still load llama-server on top of an active training run
or a resident export checkpoint. Each step fails closed
(unverifiable status counts as busy) so the user's primary
workload is preserved.

P1 #5: PublishDatasetRequest in models/data_recipe.py also
applies the identifier hardening to repo_id; the publish path
previously accepted control characters and URL-form tokens.

P1 #7-10: added _validate_logged_identifier helper to
routes/models.py and applied it to the path / query parameter
endpoints that flow into logger.info(...) calls --
``/config/{model_name}``, ``/check-vision/{model_name}``,
``/check-embedding/{model_name}``, ``/gguf-variants``. Mapped
the validator's ValueError to HTTP 422 so the client sees the
same shape as a Pydantic validation failure.

P2 #11 + #12: ``Loading diffusion model %s`` and
``Diffusion load failed for %s`` log lines route ``repo_id`` /
``effective_base`` through ``_display_repo_id`` (collapses
absolute local paths to the leaf, still scrubs HF tokens)
instead of plain ``_redact_hf_tokens``. The error path was
already collapsed in the user-facing 400 / RuntimeError, but
the structured-log lines kept the full path.

All 97 diffusion + training-validation + related tests pass
locally.
danielhanchen added a commit that referenced this pull request May 25, 2026
P1 #1: ``_gpu_workload_busy_for_helper`` in
``utils/datasets/llm_assist.py`` now also gates on the GGUF chat
backend (llama-server) AND the safetensors chat backend. Round 23
extended it to training + export but missed Chat, so a helper /
advisor GGUF could still race a loaded chat model for VRAM.
Both checks fail closed when status is unverifiable.

P1 #2 / #3 / #4 / #5: re-ordered the route-level GPU-handoff
unloads so the diffusion release runs BEFORE the chat releases.
A wedged diffusion unload used to fire AFTER chat was already
gone, so the user lost both on a single failure. Drop chat last
so an earlier failure preserves it. Applied to
``/training/start`` (training.py), ``/export/load`` (export.py),
``/chat/load`` GGUF branch and ``/chat/load`` safetensors branch
(routes/inference.py).

P1 #7 + P2 #13: ``/delete-finetuned`` body now hardens
``model_path`` and ``gguf_variant`` via the shared
``_validate_logged_identifier`` helper, so control characters
and URL-form HF tokens can no longer log-line-smuggle.

P1 #8 + #10: ``/delete-cached`` body hardens ``repo_id`` and
``variant`` the same way.

P1 #9: ``/download-progress`` ``repo_id`` query parameter is
also hardened; the value flows into log lines deep inside
``_get_repo_size_cached`` on lookup failure.

P1 #11: ``CheckFormatRequest.dataset_name`` and
``AiAssistMappingRequest.{dataset_name, model_name}`` in
``models/datasets.py`` now apply the same control-char +
embedded-HF-token validators, matching every other public
request-body model.

All 115 diffusion + training-validation + cached_gguf + export
+ inference model-validation tests pass locally.

(P1 #6 native-path-lease enforcement for diffusion local paths
and P1 #12 React Compiler frontend lint deferred -- both need
focused design / frontend touchups separate from this batch.)
danielhanchen added a commit that referenced this pull request May 25, 2026
Four actionable findings from round 30. Skipped P1 #1 / #2 / #3
(huggingface-hub bump in studio.txt / single-env / colab-new) because
the live B200 Studio that successfully generated FLUX.2 klein images
runs the exact combo the reviewer flags as broken:
    huggingface_hub 0.36.2 + transformers 4.57.6 + diffusers 0.37.1
    Flux2KleinPipeline: True (imports cleanly)
The is_offline_mode ImportError only fires with transformers 5.x, and
the standard install path pins transformers==4.57.6 via constraints.
The round 26 fix bumped no-torch-runtime.txt + pyproject huggingfacenotorch
where the --no-deps install path can land on transformers 5.x; that
remains the correct surface.

1. core/inference/diffusion.py: preflight transformers + accelerate
   via importlib.util.find_spec BEFORE any destructive GPU-owner
   unload. Diffusers can expose stub pipeline classes when
   transformers / accelerate are missing, so the load used to drop
   chat first and fail later inside from_pretrained. find_spec
   keeps existing tests that stub these modules passing because no
   real module is executed (round 30 P1 #11).

2. models/export.py ExportGGUFRequest.quantization_method: extend
   the embedded HF token validator to this field too. Round 23
   added the control-char guard but not the token guard; the value
   is forwarded into worker command lines and reflected in error /
   success text (round 30 P1 #5).

3. models/data_recipe.py SeedInspectUploadRequest: add
   _no_control_chars + _reject_embedded_hf_token field_validators
   to filename and to each entry of file_names. Mirrors the sibling
   SeedInspectRequest.dataset_name hardening (round 30 P1 #6).

4. frontend/src/features/images/images-page.tsx: defer the initial
   refreshStatus() call via queueMicrotask so the synchronous
   setRefreshingStatus(true) inside it does not trip the
   react-hooks/set-state-in-effect lint on mount (round 30 P2 #12).

Deferred (need larger surgery / out of scope for this round):
   P1 #4 native_path_lease for diffusion local-path loads
   P1 #7-#10 helper/advisor + public-start window mutual lock symmetry

Tests: 98 targeted (diffusion + cached_gguf + inference_validation)
pass locally; frontend npm run typecheck passes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Addresses remaining round-30 reviewer findings against PR unslothai#5754
(diffusion image generation in Unsloth Studio). The studio.txt /
constraints.txt / colab-new hub-bump items (round 30 #1-#3) are
intentionally skipped: the live B200 Studio install path with
huggingface_hub==0.36.2, transformers==4.57.6 and diffusers==0.37.1
imports Flux2KleinPipeline cleanly and runs end-to-end image
generation (see staging CI green on bec81b8 plus round 28-30
local validation suites). The is_offline_mode ImportError the
reviewer cites only triggers with transformers 5.x against
huggingface_hub 0.x; the constraints pin holds transformers at 4.x
so the combo never materialises on the standard install path.

Concurrency: close the helper / advisor GPU-start race in all four
public load paths (round 30 P1 #7-#10).
  * Add a _PUBLIC_LOAD_PENDING_COUNT counter in
    utils/datasets/llm_assist.py, published under
    _HELPER_ADVISOR_START_LOCK by _raise_if_helper_advisor_busy and
    cleared by a paired _clear_public_load_window in
    routes/inference.py. A concurrent helper / advisor start now
    sees public_load_pending() inside _gpu_workload_busy_for_helper
    and refuses VRAM until the public load attempt finishes,
    closing the window between the busy snapshot and the public
    load flipping its public ownership flags (is_loaded,
    current_checkpoint, is_training_active, etc.).
  * Wire the paired clear into all five call sites (GGUF chat,
    safetensors chat, diffusion image load, training start, export
    load-checkpoint). The chat path tracks the published tag in a
    local so the finally clears the same counter on either branch
    or on early HTTPException.

Security: gate /api/inference/images/load against arbitrary
local-path probes (round 30 P1 #4). Mirror the chat
/api/inference/load native_path_lease boundary so an authenticated
session cannot use repo_id or base_repo as a directory probe.
  * Add native_path_lease + base_repo_native_path_lease to
    DiffusionLoadRequest (optional; Hub ids skip the lease).
  * Add _looks_like_local_diffusion_path + a
    _resolve_diffusion_repo_for_request helper that requires a
    verified directory-typed native path grant for any value that
    starts with /, ~, ./, ../, contains a backslash, or expands to
    an absolute path. The detector deliberately avoids Path.exists
    so the route does not side-channel filesystem layout via
    differential error messages.

Frontend: split the Images page status fetch from the spinner
toggle (round 30 P2 #12). The mount effect and the is_loading
auto-poll now call a setState-free fetchAndUpdateStatus; the
user-driven Refresh button still calls refreshStatus to flip the
spinner. Cleaner separation than the queueMicrotask shim from the
prior commit; the eslint react-hooks/set-state-in-effect rule is
not in the studio-frontend-ci typecheck gate, and the codebase
already has hundreds of pre-existing violations of the same rule.

98 targeted backend tests pass (test_diffusion_routes,
test_diffusion_backend, test_inference_model_validation,
test_models_get_model_config_case_resolution, test_data_recipe_seed,
test_training_raw_support, test_export_log_cursor). Frontend
typecheck passes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Addresses remaining round-30 reviewer findings against PR unslothai#5754
(diffusion image generation in Unsloth Studio). The studio.txt /
constraints.txt / colab-new hub-bump items (round 30 #1-#3) are
intentionally skipped: the live B200 Studio install path with
huggingface_hub==0.36.2, transformers==4.57.6 and diffusers==0.37.1
imports Flux2KleinPipeline cleanly and runs end-to-end image
generation (see staging CI green on bec81b8 plus round 28-30
local validation suites). The is_offline_mode ImportError the
reviewer cites only triggers with transformers 5.x against
huggingface_hub 0.x; the constraints pin holds transformers at 4.x
so the combo never materialises on the standard install path.

Concurrency: close the helper / advisor GPU-start race in all four
public load paths (round 30 P1 #7-#10).
  * Add a _PUBLIC_LOAD_PENDING_COUNT counter in
    utils/datasets/llm_assist.py, published under
    _HELPER_ADVISOR_START_LOCK by _raise_if_helper_advisor_busy and
    cleared by a paired _clear_public_load_window in
    routes/inference.py. A concurrent helper / advisor start now
    sees public_load_pending() inside _gpu_workload_busy_for_helper
    and refuses VRAM until the public load attempt finishes,
    closing the window between the busy snapshot and the public
    load flipping its public ownership flags (is_loaded,
    current_checkpoint, is_training_active, etc.).
  * Wire the paired clear into all five call sites (GGUF chat,
    safetensors chat, diffusion image load, training start, export
    load-checkpoint). The chat path tracks the published tag in a
    local so the finally clears the same counter on either branch
    or on early HTTPException.

Security: gate /api/inference/images/load against arbitrary
local-path probes (round 30 P1 #4). Mirror the chat
/api/inference/load native_path_lease boundary so an authenticated
session cannot use repo_id or base_repo as a directory probe.
  * Add native_path_lease + base_repo_native_path_lease to
    DiffusionLoadRequest (optional; Hub ids skip the lease).
  * Add _looks_like_local_diffusion_path + a
    _resolve_diffusion_repo_for_request helper that requires a
    verified directory-typed native path grant for any value that
    starts with /, ~, ./, ../, contains a backslash, or expands to
    an absolute path. The detector deliberately avoids Path.exists
    so the route does not side-channel filesystem layout via
    differential error messages.

Frontend: split the Images page status fetch from the spinner
toggle (round 30 P2 #12). The mount effect and the is_loading
auto-poll now call a setState-free fetchAndUpdateStatus; the
user-driven Refresh button still calls refreshStatus to flip the
spinner. Cleaner separation than the queueMicrotask shim from the
prior commit; the eslint react-hooks/set-state-in-effect rule is
not in the studio-frontend-ci typecheck gate, and the codebase
already has hundreds of pre-existing violations of the same rule.

98 targeted backend tests pass (test_diffusion_routes,
test_diffusion_backend, test_inference_model_validation,
test_models_get_model_config_case_resolution, test_data_recipe_seed,
test_training_raw_support, test_export_log_cursor). Frontend
typecheck passes.
danielhanchen added a commit that referenced this pull request May 25, 2026
Two universal-consensus round-31 reviewer findings.

Concurrency: /images/load was leaking the public-load pending
counter on any pre-finally HTTPException (round 31 P1 #1, 11/12
votes). _raise_if_helper_advisor_busy("diffusion") published the
counter, then _resolve_diffusion_repo_for_request ran outside the
clearing try/finally. A request like repo_id="/tmp/model" with no
native_path_lease returned 400 and left public_load_pending() true
until process restart, permanently blocking AI Assist. Fix mirrors
the training / export pattern: track diffusion_load_window_published
in an outer try, publish the flag right after the helper-busy
check succeeds, and clear in an outer finally that only fires when
the flag is set. This also closes round 31 P1 #6: a second
request's failure can no longer decrement a still-active first
request's counter, because the second request has not yet flipped
its own publish flag.

Security: _looks_like_local_diffusion_path missed cwd-relative
directories (round 31 P1 #2, 8/12 votes). DiffusionBackend.
load_model accepts repo_id="exports/my-flux" as a local directory
via Path(repo_id).expanduser().is_dir(), but the detector only
flagged values starting with /, ~, ./, ../, backslash, or
absolute. Tightened the detector to also reject:
  * weight-file suffixes (.gguf / .safetensors / .bin / .pt / .pth)
  * non-2-segment values (`owner`, `a/b/c`, `owner/`, `/repo`, `//`)
  * 2-segment values whose parts are `.` or `..`
  * 2-segment values that actually resolve to an existing local
    path under backend CWD (last-resort exists() probe).
The existence probe is a minor side-channel for an already-
authenticated caller, accepted in exchange for closing the silent
bypass of the new lease boundary. Valid Hub ids like
unsloth/FLUX.2-klein-base-4B-GGUF, microsoft/Phi-3.5-mini-instruct
still pass through unchanged.

Skipped (consistent with prior rounds):
  * R31 P1 #3 (Tauri / native lease enum missing
    `load-diffusion-model` op): architectural surface; defer until
    the Images page actually surfaces a local-path picker.
  * R31 P1 #4-#5, #8: studio.txt / constraints.txt / pyproject hub
    pins. Live B200 install path with huggingface_hub==0.36.2,
    transformers==4.57.6, diffusers==0.37.1 imports
    Flux2KleinPipeline cleanly. The is_offline_mode import error
    only triggers when transformers 5.x is paired with hub 0.x,
    which the constraints pin prevents.
  * R31 P1 #7 (find_spec vs real import): a full transformers
    import at module load breaks tests that stub huggingface_hub;
    find_spec is the existing tradeoff.

98 targeted backend tests pass (test_diffusion_routes,
test_diffusion_backend, test_inference_model_validation,
test_models_get_model_config_case_resolution, test_data_recipe_seed,
test_training_raw_support, test_export_log_cursor).
danielhanchen added a commit that referenced this pull request May 25, 2026
Two round-33 reviewer findings: hub-floor consistency and the
multipart upload filename validator gap.

Dependencies: reverted the round-26 huggingface_hub>=1.3.0 floor
in no-torch-runtime.txt and pyproject.toml (round 33 P1 #1-#5,
4/12 vote consensus). studio.txt forces huggingface_hub==0.36.2
to match the transformers==4.57.6 pin in extras-no-deps.txt, so
the 1.3.0 floor was internally inconsistent. Reviewers
reproduced the resolver conflict on a fresh install.

Empirical justification (re-verified on the live B200 host before
the revert): huggingface_hub 0.36.2 + transformers 4.57.6 +
diffusers 0.37.1 imports Flux2KleinPipeline cleanly and runs
end-to-end image generation. transformers 4.57.6 carries its own
transformers.utils.hub.is_offline_mode and does not actually need
huggingface_hub.is_offline_mode at import time. The original bump
was guarding against the (never-realised) transformers 5.x path,
which extras-no-deps explicitly pins away.

Validation: multipart /seed/upload-unstructured-file now applies
the same _no_control_chars and _reject_embedded_hf_token checks
to file.filename that SeedInspectUploadRequest.filename already
applies in the JSON variant (round 33 P1 #7). The filename is
reflected back to the client, persisted in the per-file meta
JSON, and echoed by error responses, so the JSON-side hardening
must not be asymmetric with the multipart path.

Skipped (consistent with prior rounds):
  * Find_spec vs full import (R33 P1 #6): preserves test
    compatibility with the huggingface_hub stub fixture.
  * React hooks set-state-in-effect lint (R33 P1 #8): codebase
    has 146 pre-existing violations of the same rule;
    studio-frontend-ci does not gate on lint.
  * Direct DiffusionBackend.load_model bypass (R33 P1 #9): the
    route is the only production entry point, and the backend
    helper now publishes its own diffusion-backend pending tag
    (round 32 P1 #3). Direct-caller hardening would require
    duplicating the lease check into load_model itself, which
    is out of scope for the route-layer security boundary.
  * One-segment Hub IDs (R33 P2 #10): strict 2-segment Hub id
    check is intentional; one-segment names are not valid Hub
    ids.
  * Cwd-relative shadow of Hub IDs (R33 P2 #11): documented
    side-channel tradeoff accepted in round 31 commit msg.

97 targeted backend tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Change default host to 127.0.0.1 instead of 0.0.0.0 / Remove installation auto start instance without notice

2 participants