Skip to content

[tests] [studio] Fix VLM detection for transformers v5#5

Closed
danielhanchen wants to merge 3 commits into
mainfrom
pr-4868-tests
Closed

[tests] [studio] Fix VLM detection for transformers v5#5
danielhanchen wants to merge 3 commits into
mainfrom
pr-4868-tests

Conversation

@danielhanchen
Copy link
Copy Markdown
Owner

@danielhanchen danielhanchen commented Apr 6, 2026

Staging mirror of unslothai#4868

Original PR: unslothai#4868
Author: @Datta0

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Fixes: unslothai#4859


This PR contains test changes only (1 file). Code changes are in a separate PR (#4).

Test files:

  • studio/backend/tests/test_transformers_version.py

danielhanchen added a commit that referenced this pull request May 6, 2026
… export) (unslothai#5265)

* Add Apple Silicon MLX routing

Rewrite __init__.py: detect MLX on macOS arm64 before any torch imports
Extract original GPU init to _gpu_init.py (unchanged)
MLX path imports FastMLXModel from unsloth_zoo, skips all GPU code
GPU path unchanged: from ._gpu_init import *

* Add Apple Silicon MLX routing

- Rewrite __init__.py: detect MLX on macOS arm64 before any torch imports
- Extract original GPU init to _gpu_init.py (unchanged)
- MLX path imports FastMLXModel from unsloth_zoo, skips all GPU code
- GPU path unchanged: from ._gpu_init import *

* mlx with studio

* mlx with studio

* updating temporary install.sh

* updating temporary install.sh

* adding t_v5 path

* adding t_v5 path

* fixing vision training

* fixing vision training

* adding chat

* adding chat

* minor

* minor

* Adding export and fixing training issues, inference with lora adaptors

* Adding export and fixing training issues, inference with lora adaptors

* fix: MLX worker pass load_in_4bit, override is_vlm based on dataset, streaming for VLM

* fix: MLX worker pass load_in_4bit, override is_vlm based on dataset, streaming for VLM

* Merge mlx-apple-silicon into main

* update install.sh to point to main branch

* update install.sh to point to main branch

* fix: export returns 3 values (success, message, output_path) matching upstream worker

* fix: export returns 3 values (success, message, output_path) matching upstream worker

* fix(mlx): show training-process peak memory in Studio UI, not system-wide

Studio UI was showing ~95 GB during MLX training because get_gpu_utilization
read "In use system memory" from IORegistry's AGXAccelerator — system-wide
GPU memory across all processes (training + backend + browser + Display).

Now the trainer's mx.get_peak_memory value is forwarded through the
progress event and surfaced via /api/train/hardware while training is
active. Falls back to the system-wide reading when training is not running.

* fix(mlx): show training-process peak memory in Studio UI, not system-wide

Studio UI was showing ~95 GB during MLX training because get_gpu_utilization
read "In use system memory" from IORegistry's AGXAccelerator — system-wide
GPU memory across all processes (training + backend + browser + Display).

Now the trainer's mx.get_peak_memory() value is forwarded through the
progress event and surfaced via /api/train/hardware while training is
active. Falls back to the system-wide reading when training is not running.

* fix(mlx): make is_bfloat16_supported detect M1/M2 (no native bf16)

M1 and M2 chips emulate bf16 in software on the GPU, causing 40-70%
slower prefill compared to native fp16. M3+ have native bf16 (macOS
Sonoma+ MPSGraph). Replaces the always-True stub with chip-aware
detection via mx.device_info.

* fix(mlx): make is_bfloat16_supported() detect M1/M2 (no native bf16)

M1 and M2 chips emulate bf16 in software on the GPU, causing 40-70%
slower prefill compared to native fp16. M3+ have native bf16 (macOS
Sonoma+ MPSGraph). Replaces the always-True stub with chip-aware
detection via mx.device_info().

* feat(mlx): wire training_type="Full Finetuning" through MLX worker

Compute use_lora from the UI's training_type before loading the model,
pass full_finetuning=not use_lora to FastMLXModel.from_pretrained, and
let the existing 'if use_lora' branch skip get_peft_model. Matches the
GPU worker's flow.

* feat(mlx): wire training_type="Full Finetuning" through MLX worker

Compute use_lora from the UI's training_type before loading the model,
pass full_finetuning=not use_lora to FastMLXModel.from_pretrained, and
let the existing 'if use_lora' branch skip get_peft_model. Matches the
GPU worker's flow.

* fix(mlx): pass save_method='merged_16bit' from Studio's export page

Previously the MLX path called save_pretrained_merged with no
save_method, which fell through to a no-op that didn't actually fuse
LoRA into the base. Now Studio's "Merged Model" export properly
fuses LoRA + dequantizes any 4-bit base to bf16, matching the GPU
behavior for the same UI option.

* fix(mlx): pass save_method='merged_16bit' from Studio's export page

Previously the MLX path called save_pretrained_merged() with no
save_method, which fell through to a no-op that didn't actually fuse
LoRA into the base. Now Studio's "Merged Model" export properly
fuses LoRA + dequantizes any 4-bit base to bf16, matching the GPU
behavior for the same UI option.

* fix(studio): pass private to MLX push, return 3-tuples consistently

MLX push_to_hub branch now forwards private=private (matches GPU)
Existing 2-tuple early-returns ('repo_id+token required', 'PEFT model
needed') were tripping the route's 3-tuple unpack. Added a None
output_path so the unpack always succeeds.

* fix(studio): pass private to MLX push, return 3-tuples consistently

- MLX push_to_hub branch now forwards private=private (matches GPU)
- Existing 2-tuple early-returns ('repo_id+token required', 'PEFT model
  needed') were tripping the route's 3-tuple unpack. Added a None
  output_path so the unpack always succeeds.

* studio wirings

* studio wirings

* Merge pull request #5 from Manan17/feat/quant_config

studio wirings

* fix(mlx): wire train_on_completions for VLM via per-template lookup

Mirror the GPU worker: stop excluding VLMs and stop hardcoding
template detection. Look up the model in MODEL_TO_TEMPLATE_MAPPER and
fetch the per-template instruction/response markers from
TEMPLATE_TO_RESPONSES_MAPPER. The frontend already force-disables
train_on_completions for vision+image and audio cases, so backend
just trusts the flag.

* fix(mlx): wire train_on_completions for VLM via per-template lookup

Mirror the GPU worker: stop excluding VLMs and stop hardcoding
template detection. Look up the model in MODEL_TO_TEMPLATE_MAPPER and
fetch the per-template instruction/response markers from
TEMPLATE_TO_RESPONSES_MAPPER. The frontend already force-disables
train_on_completions for vision+image and audio cases, so backend
just trusts the flag.

* wire in lora rslora, init lora weights, random_state

* wire in lora rslora, init lora weights, random_state

* loftq studio error message fix

* loftq studio error message fix

* handle unknown optim and lr scheduler

* handle unknown optim and lr scheduler

* Merge pull request #6 from Manan17/update/peftkwargs

Update/peftkwargs

* feat(mlx): pass finetune_language/attention/mlp/vision flags to FastMLXModel

Studio's four UI checkboxes now actually flow through to MLX get_peft_model
(which was just updated in unsloth-zoo to honor them). Also drops the
incorrect train_projector wiring that tied projector LoRA to the
attn/mlp flags — those are language-side toggles, not projector toggles.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* feat(mlx): pass finetune_language/attention/mlp/vision flags to FastMLXModel

Studio's four UI checkboxes now actually flow through to MLX get_peft_model
(which was just updated in unsloth-zoo to honor them). Also drops the
incorrect train_projector wiring that tied projector LoRA to the
attn/mlp flags — those are language-side toggles, not projector toggles.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* feat(mlx,ux): auto-imply finetune_language_layers when user picks attn/mlp

UI guardrail. The four checkboxes (vision/language/attention/MLP) carry
"scope × module-type" semantics that aren't obvious — picking just
"Attention modules" + "MLP modules" without "Language layers" naturally
reads as "fine-tune attn/mlp" but our backend reads it as "fine-tune
attn/mlp modules in *no* tower" → empty target_modules → zero
trainable params → crash inside value_and_grad.

If user selected attn or mlp module types but no layer scope, default
to language scope. Power users can still explicitly choose
language=False, vision=True if they want vision-only fine-tuning of
attn/mlp.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* feat(mlx,ux): auto-imply finetune_language_layers when user picks attn/mlp

UI guardrail. The four checkboxes (vision/language/attention/MLP) carry
"scope × module-type" semantics that aren't obvious — picking just
"Attention modules" + "MLP modules" without "Language layers" naturally
reads as "fine-tune attn/mlp" but our backend reads it as "fine-tune
attn/mlp modules in *no* tower" → empty target_modules → zero
trainable params → crash inside value_and_grad.

If user selected attn or mlp module types but no layer scope, default
to language scope. Power users can still explicitly choose
language=False, vision=True if they want vision-only fine-tuning of
attn/mlp.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* fix(mlx): wire top_k, repetition_penalty, and VLM top_p through to mlx-lm/mlx-vlm

Inference UI sliders for top_k and repetition_penalty had no effect on
MLX, and VLM top_p was also silently dropped. Plus a latent pre-existing
bug: mlx_vlm.generate_step expects temperature= (long form), but we
were passing temp= which silently fell into **kwargs — every VLM chat
was effectively greedy regardless of the temperature slider.

Text path (_generate_text):
make_sampler now receives top_k in addition to temp/top_p
make_logits_processors built and forwarded when repetition_penalty is
non-trivial (skip when 0.0/1.0 to avoid pointless overhead)

VLM path (_generate_vlm):
Pass top_p, top_k, repetition_penalty as kwargs (mlx_vlm.stream_generate
forwards them to generate_step's sampler/logits_processor builders)
Rename temp= → temperature= so it's actually consumed

Verified end-to-end with a smoke test on Qwen2.5-0.5B-Instruct (text) and
Qwen2.5-VL-3B-Instruct (VLM): each of {greedy, top_p=0.5, top_k=10,
rep_pen=1.5} now produces a distinct output, proving the parameters
reach the sampler.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* fix(mlx): wire top_k, repetition_penalty, and VLM top_p through to mlx-lm/mlx-vlm

Inference UI sliders for top_k and repetition_penalty had no effect on
MLX, and VLM top_p was also silently dropped. Plus a latent pre-existing
bug: mlx_vlm.generate_step expects temperature= (long form), but we
were passing temp= which silently fell into **kwargs — every VLM chat
was effectively greedy regardless of the temperature slider.

Text path (_generate_text):
- make_sampler now receives top_k in addition to temp/top_p
- make_logits_processors built and forwarded when repetition_penalty is
  non-trivial (skip when 0.0/1.0 to avoid pointless overhead)

VLM path (_generate_vlm):
- Pass top_p, top_k, repetition_penalty as kwargs (mlx_vlm.stream_generate
  forwards them to generate_step's sampler/logits_processor builders)
- Rename temp= → temperature= so it's actually consumed

Verified end-to-end with a smoke test on Qwen2.5-0.5B-Instruct (text) and
Qwen2.5-VL-3B-Instruct (VLM): each of {greedy, top_p=0.5, top_k=10,
rep_pen=1.5} now produces a distinct output, proving the parameters
reach the sampler.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* feat(mlx): map format_type to MLX save_method, reuse local save dir for hub push

export_merged_model: format_type="4-bit (FP4)" → save_method="merged_4bit"
(was hardcoded merged_16bit, ignoring the UI choice).
Both export_merged_model and export_base_model now pass save_directory=
to push_to_hub_merged so it reuses the just-written local folder
instead of re-saving under a relative "username/model" directory.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* feat(mlx): map format_type to MLX save_method, reuse local save dir for hub push

- export_merged_model: format_type="4-bit (FP4)" → save_method="merged_4bit"
  (was hardcoded merged_16bit, ignoring the UI choice).
- Both export_merged_model and export_base_model now pass save_directory=
  to push_to_hub_merged so it reuses the just-written local folder
  instead of re-saving under a relative "username/model" directory.

Co-Authored-By: Manan17 <shahmanan170602@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restore install

* restore install

* fix(mlx): restore FastVisionModel as a distinct class

unsloth/__init__.py was assigning `FastVisionModel = FastLanguageModel`
right after defining `class FastVisionModel(FastLanguageModel)` with a
`for_training` static method. The alias erased the class binding, so
the documented `FastVisionModel.for_training(model)` call from upstream
Unsloth's VLM notebooks raised `AttributeError` on MLX.

Remove the offending alias. `FastVisionModel` is now a real subclass of
`FastLanguageModel` again — inherits `from_pretrained` /
`get_peft_model` / `for_inference`, exposes `for_training` as a no-op
pass-through (no-op because MLX doesn't have a train/eval mode flag;
the call exists purely for GPU/MLX notebook parity).

Verified end-to-end: Qwen3-VL-2B + LaTeX_OCR LoRA + vision LoRA via
FastVisionModel.from_pretrained → get_peft_model → for_training →
MLXTrainer.train runs 10 steps cleanly (loss 1.10 → 0.12, no NaNs,
peak 5.89 GB).

Studio's path (FastLanguageModel.from_pretrained for any repo,
auto-detect VLM in the loader) is unaffected. Tier-1 review finding #8.

* fix(mlx): restore FastVisionModel as a distinct class

unsloth/__init__.py was assigning `FastVisionModel = FastLanguageModel`
right after defining `class FastVisionModel(FastLanguageModel)` with a
`for_training` static method. The alias erased the class binding, so
the documented `FastVisionModel.for_training(model)` call from upstream
Unsloth's VLM notebooks raised `AttributeError` on MLX.

Remove the offending alias. `FastVisionModel` is now a real subclass of
`FastLanguageModel` again — inherits `from_pretrained` /
`get_peft_model` / `for_inference`, exposes `for_training` as a no-op
pass-through (no-op because MLX doesn't have a train/eval mode flag;
the call exists purely for GPU/MLX notebook parity).

Verified end-to-end: Qwen3-VL-2B + LaTeX_OCR LoRA + vision LoRA via
FastVisionModel.from_pretrained → get_peft_model → for_training →
MLXTrainer.train() runs 10 steps cleanly (loss 1.10 → 0.12, no NaNs,
peak 5.89 GB).

Studio's path (FastLanguageModel.from_pretrained for any repo,
auto-detect VLM in the loader) is unaffected. Tier-1 review finding #8.

* Studio: harden MLX training and export, restore GPU init guards

Studio export
Restore Tuple[bool, str, Optional[str]] contract on export_merged_model,
export_base_model, export_gguf, and export_lora_adapter, populating
output_path on successful local saves so routes/worker/CLI/frontend
details.output_path is non-empty again.
Lift the GPU save_method assignment out of the local-save branch so
Hub-only merged exports (save_directory='', push_to_hub=True) no longer
hit UnboundLocalError on the push branch.
For MLX merged and base hub-only export, stage to a tempfile.TemporaryDirectory
before push_to_hub_merged instead of passing save_directory=''.
Source _IS_MLX from unsloth instead of recomputing the platform check
(single source of truth, also enforces mlx-package availability).

Studio MLX training/inference
Pass token=hf_token into FastMLXModel.from_pretrained for gated/private
models, matching the inference path.
Strip hf_token and wandb_token from wandb.init(config=...) so secrets
do not leak into the W&B run config.
Replace load_from_disk(local_datasets[0]) with the existing
UnslothTrainer._resolve_local_files / _loader_for_files helpers so
uploaded JSON/JSONL/CSV/Parquet files train through the normal datasets
loader (load_from_disk still used for HF save_to_disk directories).
Make the dataset slice helper inclusive at the end and treat 0 as a real
index instead of "unset", matching the GPU and embedding paths.
Add a status_message -> message alias inside _send so the existing parent
pump (training.py) renders MLX status updates instead of blanks.
Forward min_p through generate_chat_response into _generate_text /
_generate_vlm and into make_sampler / vlm_kwargs so the sampling control
is no longer a no-op on MLX.
Wrap unsloth_zoo.mlx_loader / mlx_trainer imports with a clearer
ImportError pointing users at install.sh for Apple Silicon.
Exit the MLX stop-polling thread on EOFError/OSError instead of
busy-looping when the queue/pipe is permanently closed (one-line
why-safe rationale inline).

Studio frontend
ParamsSection subscribes to platform deviceType via the Zustand hook so
the gradient checkpointing dropdown re-renders after the async device
fetch completes.

Studio hardware
get_gpu_utilization MLX branch now reads _read_apple_gpu_stats once and
derives VRAM totals from psutil, removing the second ioreg subprocess
per utilization poll.

Unsloth core
Restore the os.geteuid == 0 guard around the CUDA ldconfig recovery
that was lost when GPU initialization moved into _gpu_init.py, plus the
non-root manual-fix warning branch. Non-root CUDA users no longer shell
out to ldconfig at import time.
Load dataprep/raw_text via importlib so the MLX import path no longer
pulls torch in through dataprep/__init__.py -> synthetic.py.
FastVisionModel.from_pretrained overrides the inherited delegator only
to inject text_only=False; this is an extension, not a duplication, and
is needed so VLM checkpoint loads keep the vision tower.
Wrap the MLX-branch unsloth_zoo import with a clearer ImportError.

* Studio: regression tests for MLX training/export and GPU init ldconfig guard

tests/python/test_gpu_init_ldconfig_guard.py asserts the geteuid root
check still wraps the ldconfig recovery and the non-root branch warns
bnb users; AST + source-text inspection so the test runs without torch.
tests/studio/test_export_output_path_contract.py covers the
Tuple[bool, str, Optional[str]] return contract on every export method,
the output_path assignment after successful local save, the Hub-only
GPU save_method binding fix, the MLX hub-only TemporaryDirectory
staging, and the single-source `_IS_MLX` import from unsloth.
tests/studio/test_mlx_training_worker_behaviors.py covers token
forwarding to FastMLXModel.from_pretrained, wandb config secret
stripping, file-aware local dataset loading, status_message ->
message aliasing, inclusive slice semantics, EOFError/OSError stop
thread exit, and the friendly mlx_loader / mlx_trainer ImportError.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix(mlx): cap inference memory + release wired on unload + tame worker pre-pin

Three memory-hardening fixes for Studio's MLX path:

1. Inference applies the same Metal caps as the trainer.
   load_model previously only called set_wired_limit(100% of recommended)
   with no upper memory_limit, leaving large VLM checkpoints unbounded
   during the loader allocation. Add _configure_memory_limits() that sets
   memory_limit to 85% of recommended and wired_limit to min(recommended,
   memory_limit) — matching MLXTrainer's defaults so behavior is the same
   whether the user trains or just runs inference.

2. unload_model releases pinned memory back to the OS — but only when
   the cache is empty. Without this, pinned wired bytes stayed allocated
   to MLX after the model was gone, starving other apps. The release is
   guarded on `not self.models` so unloading one of several cached
   models doesn't un-pin weights still in use.

3. Worker pre-cap is conservative instead of aggressive.
   The previous pre-pin set_wired_limit(100% of recommended) competed
   with MLXTrainer's later more conservative cap. Replace with the same
   85%-memory / min(rec, memory) pair that the trainer applies later
   (idempotent re-apply). Bounds the model load + LoRA setup window
   without over-pinning.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/studio: regression tests for the _IS_MLX dispatch gate

Two gates drive every MLX-vs-CUDA dispatch decision in Studio:

  1. unsloth._IS_MLX in unsloth/__init__.py — evaluated once at import
     time, read by Studio worker code to choose the GPU vs MLX trainer
     and inference paths. Defined as
        Darwin AND arm64 AND find_spec("mlx") is not None.

  2. utils.hardware.detect_hardware() — runtime probe with priority
     CUDA > XPU > MLX > CPU. The MLX branch is reached only when both
     CUDA and XPU are unavailable and the host is Apple Silicon and
     mlx is importable.

Neither gate had a direct test. Adds tests/studio/test_is_mlx_dispatch_gate.py
with six tests:

  test_is_mlx_gate_uses_three_required_predicates
      AST-walks unsloth/__init__.py and asserts the _IS_MLX assignment
      is a BoolOp(And) of platform.system()=="Darwin",
      platform.machine()=="arm64", and find_spec("mlx") is not None.
      Catches accidental rewrites that drop a predicate.

  test_is_mlx_gate_true_on_apple_silicon_with_mlx_present
      Spoofs platform to Darwin/arm64, injects a fake mlx module so
      find_spec returns a real ModuleSpec, re-evaluates the gate
      expression. Verifies it flips True under the exact conditions
      Studio expects.

  test_is_mlx_gate_false_when_mlx_missing
      Spoofs Apple Silicon but with mlx absent. Verifies the gate stays
      False (so a Mac without mlx installed does not pretend to have
      MLX support).

  test_is_mlx_gate_false_on_non_apple_silicon
      Canary on the actual Linux+CUDA / AMD / Intel test host: the gate
      must remain False regardless of whether mlx happens to be
      importable. Protects existing GPU users from accidental MLX
      hijack when MLX support evolves.

  test_detect_hardware_picks_mlx_when_only_apple_silicon_available
      Forces torch.cuda and torch.xpu off, spoofs Apple Silicon, injects
      fake mlx and mlx.core. detect_hardware() must return DeviceType.MLX.

  test_detect_hardware_picks_cuda_on_real_host
      Canary: on a real CUDA host detect_hardware() must return
      DeviceType.CUDA. Protects against the MLX branch shadowing CUDA
      dispatch on NVIDIA / AMD ROCm hosts.

Uses the same monkeypatch.setitem(sys.modules, ...) fake-mlx pattern as
the existing test_mlx_inference_backend.py — no new test infrastructure,
no real mlx install required.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add AGPL-3.0 SPDX header to Studio MLX regression tests

Four Studio MLX test files shipped without an SPDX-License-Identifier:

  studio/backend/tests/test_mlx_training_worker_config.py
  tests/studio/test_mlx_training_worker_behaviors.py
  tests/studio/test_export_output_path_contract.py
  tests/studio/test_is_mlx_dispatch_gate.py

They sit in or alongside studio/backend/, which is governed by
studio/LICENSE.AGPL-3.0, and exercise AGPL Studio code. Add the same
"# SPDX-License-Identifier: AGPL-3.0-only" header that's already on
test_mlx_inference_backend.py so the license declaration matches
the code under test rather than defaulting to the repo-root
Apache-2.0.

* Wrap MLX submodule imports with friendly install hint

The _IS_MLX block at the top of unsloth/__init__.py already catches the
missing-package case with a friendly install hint, but the follow-up
"from unsloth_zoo.mlx_trainer import ..." and "from unsloth_zoo.mlx_loader import ..."
lines run unguarded. An Apple Silicon user who has unsloth-zoo installed
but on an older version (e.g. the current PyPI release, before the MLX
modules ship) sees a raw ImportError on the submodule rather than the
hint that points at install.sh.

Wrap the two submodule imports in the same try/except shape so the
friendly install message fires whether the package is missing entirely
or just predates the MLX submodules. No-op once both packages release
together; smooths the transitional window where unsloth/main has merged
but unsloth-zoo on PyPI has not.

---------

Co-authored-by: DoubleMathew <mmathew23@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Can't train Qwen3.5 or Gemma4 on multimodal datasets in Unsloth Studio

2 participants