[tests] [studio] Fix VLM detection for transformers v5 by danielhanchen · Pull Request #5 · danielhanchen/unsloth-staging-2

danielhanchen · 2026-04-06T06:59:52Z

Staging mirror of unslothai#4868

Original PR: unslothai#4868
Author: @Datta0

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Fixes: unslothai#4859

This PR contains test changes only (1 file). Code changes are in a separate PR (#4).

Test files:

studio/backend/tests/test_transformers_version.py

for more information, see https://pre-commit.ci

… export) (unslothai#5265) * Add Apple Silicon MLX routing Rewrite __init__.py: detect MLX on macOS arm64 before any torch imports Extract original GPU init to _gpu_init.py (unchanged) MLX path imports FastMLXModel from unsloth_zoo, skips all GPU code GPU path unchanged: from ._gpu_init import * * Add Apple Silicon MLX routing - Rewrite __init__.py: detect MLX on macOS arm64 before any torch imports - Extract original GPU init to _gpu_init.py (unchanged) - MLX path imports FastMLXModel from unsloth_zoo, skips all GPU code - GPU path unchanged: from ._gpu_init import * * mlx with studio * mlx with studio * updating temporary install.sh * updating temporary install.sh * adding t_v5 path * adding t_v5 path * fixing vision training * fixing vision training * adding chat * adding chat * minor * minor * Adding export and fixing training issues, inference with lora adaptors * Adding export and fixing training issues, inference with lora adaptors * fix: MLX worker pass load_in_4bit, override is_vlm based on dataset, streaming for VLM * fix: MLX worker pass load_in_4bit, override is_vlm based on dataset, streaming for VLM * Merge mlx-apple-silicon into main * update install.sh to point to main branch * update install.sh to point to main branch * fix: export returns 3 values (success, message, output_path) matching upstream worker * fix: export returns 3 values (success, message, output_path) matching upstream worker * fix(mlx): show training-process peak memory in Studio UI, not system-wide Studio UI was showing ~95 GB during MLX training because get_gpu_utilization read "In use system memory" from IORegistry's AGXAccelerator — system-wide GPU memory across all processes (training + backend + browser + Display). Now the trainer's mx.get_peak_memory value is forwarded through the progress event and surfaced via /api/train/hardware while training is active. Falls back to the system-wide reading when training is not running. * fix(mlx): show training-process peak memory in Studio UI, not system-wide Studio UI was showing ~95 GB during MLX training because get_gpu_utilization read "In use system memory" from IORegistry's AGXAccelerator — system-wide GPU memory across all processes (training + backend + browser + Display). Now the trainer's mx.get_peak_memory() value is forwarded through the progress event and surfaced via /api/train/hardware while training is active. Falls back to the system-wide reading when training is not running. * fix(mlx): make is_bfloat16_supported detect M1/M2 (no native bf16) M1 and M2 chips emulate bf16 in software on the GPU, causing 40-70% slower prefill compared to native fp16. M3+ have native bf16 (macOS Sonoma+ MPSGraph). Replaces the always-True stub with chip-aware detection via mx.device_info. * fix(mlx): make is_bfloat16_supported() detect M1/M2 (no native bf16) M1 and M2 chips emulate bf16 in software on the GPU, causing 40-70% slower prefill compared to native fp16. M3+ have native bf16 (macOS Sonoma+ MPSGraph). Replaces the always-True stub with chip-aware detection via mx.device_info(). * feat(mlx): wire training_type="Full Finetuning" through MLX worker Compute use_lora from the UI's training_type before loading the model, pass full_finetuning=not use_lora to FastMLXModel.from_pretrained, and let the existing 'if use_lora' branch skip get_peft_model. Matches the GPU worker's flow. * feat(mlx): wire training_type="Full Finetuning" through MLX worker Compute use_lora from the UI's training_type before loading the model, pass full_finetuning=not use_lora to FastMLXModel.from_pretrained, and let the existing 'if use_lora' branch skip get_peft_model. Matches the GPU worker's flow. * fix(mlx): pass save_method='merged_16bit' from Studio's export page Previously the MLX path called save_pretrained_merged with no save_method, which fell through to a no-op that didn't actually fuse LoRA into the base. Now Studio's "Merged Model" export properly fuses LoRA + dequantizes any 4-bit base to bf16, matching the GPU behavior for the same UI option. * fix(mlx): pass save_method='merged_16bit' from Studio's export page Previously the MLX path called save_pretrained_merged() with no save_method, which fell through to a no-op that didn't actually fuse LoRA into the base. Now Studio's "Merged Model" export properly fuses LoRA + dequantizes any 4-bit base to bf16, matching the GPU behavior for the same UI option. * fix(studio): pass private to MLX push, return 3-tuples consistently MLX push_to_hub branch now forwards private=private (matches GPU) Existing 2-tuple early-returns ('repo_id+token required', 'PEFT model needed') were tripping the route's 3-tuple unpack. Added a None output_path so the unpack always succeeds. * fix(studio): pass private to MLX push, return 3-tuples consistently - MLX push_to_hub branch now forwards private=private (matches GPU) - Existing 2-tuple early-returns ('repo_id+token required', 'PEFT model needed') were tripping the route's 3-tuple unpack. Added a None output_path so the unpack always succeeds. * studio wirings * studio wirings * Merge pull request #5 from Manan17/feat/quant_config studio wirings * fix(mlx): wire train_on_completions for VLM via per-template lookup Mirror the GPU worker: stop excluding VLMs and stop hardcoding template detection. Look up the model in MODEL_TO_TEMPLATE_MAPPER and fetch the per-template instruction/response markers from TEMPLATE_TO_RESPONSES_MAPPER. The frontend already force-disables train_on_completions for vision+image and audio cases, so backend just trusts the flag. * fix(mlx): wire train_on_completions for VLM via per-template lookup Mirror the GPU worker: stop excluding VLMs and stop hardcoding template detection. Look up the model in MODEL_TO_TEMPLATE_MAPPER and fetch the per-template instruction/response markers from TEMPLATE_TO_RESPONSES_MAPPER. The frontend already force-disables train_on_completions for vision+image and audio cases, so backend just trusts the flag. * wire in lora rslora, init lora weights, random_state * wire in lora rslora, init lora weights, random_state * loftq studio error message fix * loftq studio error message fix * handle unknown optim and lr scheduler * handle unknown optim and lr scheduler * Merge pull request #6 from Manan17/update/peftkwargs Update/peftkwargs * feat(mlx): pass finetune_language/attention/mlp/vision flags to FastMLXModel Studio's four UI checkboxes now actually flow through to MLX get_peft_model (which was just updated in unsloth-zoo to honor them). Also drops the incorrect train_projector wiring that tied projector LoRA to the attn/mlp flags — those are language-side toggles, not projector toggles. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * feat(mlx): pass finetune_language/attention/mlp/vision flags to FastMLXModel Studio's four UI checkboxes now actually flow through to MLX get_peft_model (which was just updated in unsloth-zoo to honor them). Also drops the incorrect train_projector wiring that tied projector LoRA to the attn/mlp flags — those are language-side toggles, not projector toggles. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * feat(mlx,ux): auto-imply finetune_language_layers when user picks attn/mlp UI guardrail. The four checkboxes (vision/language/attention/MLP) carry "scope × module-type" semantics that aren't obvious — picking just "Attention modules" + "MLP modules" without "Language layers" naturally reads as "fine-tune attn/mlp" but our backend reads it as "fine-tune attn/mlp modules in *no* tower" → empty target_modules → zero trainable params → crash inside value_and_grad. If user selected attn or mlp module types but no layer scope, default to language scope. Power users can still explicitly choose language=False, vision=True if they want vision-only fine-tuning of attn/mlp. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * feat(mlx,ux): auto-imply finetune_language_layers when user picks attn/mlp UI guardrail. The four checkboxes (vision/language/attention/MLP) carry "scope × module-type" semantics that aren't obvious — picking just "Attention modules" + "MLP modules" without "Language layers" naturally reads as "fine-tune attn/mlp" but our backend reads it as "fine-tune attn/mlp modules in *no* tower" → empty target_modules → zero trainable params → crash inside value_and_grad. If user selected attn or mlp module types but no layer scope, default to language scope. Power users can still explicitly choose language=False, vision=True if they want vision-only fine-tuning of attn/mlp. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * fix(mlx): wire top_k, repetition_penalty, and VLM top_p through to mlx-lm/mlx-vlm Inference UI sliders for top_k and repetition_penalty had no effect on MLX, and VLM top_p was also silently dropped. Plus a latent pre-existing bug: mlx_vlm.generate_step expects temperature= (long form), but we were passing temp= which silently fell into **kwargs — every VLM chat was effectively greedy regardless of the temperature slider. Text path (_generate_text): make_sampler now receives top_k in addition to temp/top_p make_logits_processors built and forwarded when repetition_penalty is non-trivial (skip when 0.0/1.0 to avoid pointless overhead) VLM path (_generate_vlm): Pass top_p, top_k, repetition_penalty as kwargs (mlx_vlm.stream_generate forwards them to generate_step's sampler/logits_processor builders) Rename temp= → temperature= so it's actually consumed Verified end-to-end with a smoke test on Qwen2.5-0.5B-Instruct (text) and Qwen2.5-VL-3B-Instruct (VLM): each of {greedy, top_p=0.5, top_k=10, rep_pen=1.5} now produces a distinct output, proving the parameters reach the sampler. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * fix(mlx): wire top_k, repetition_penalty, and VLM top_p through to mlx-lm/mlx-vlm Inference UI sliders for top_k and repetition_penalty had no effect on MLX, and VLM top_p was also silently dropped. Plus a latent pre-existing bug: mlx_vlm.generate_step expects temperature= (long form), but we were passing temp= which silently fell into **kwargs — every VLM chat was effectively greedy regardless of the temperature slider. Text path (_generate_text): - make_sampler now receives top_k in addition to temp/top_p - make_logits_processors built and forwarded when repetition_penalty is non-trivial (skip when 0.0/1.0 to avoid pointless overhead) VLM path (_generate_vlm): - Pass top_p, top_k, repetition_penalty as kwargs (mlx_vlm.stream_generate forwards them to generate_step's sampler/logits_processor builders) - Rename temp= → temperature= so it's actually consumed Verified end-to-end with a smoke test on Qwen2.5-0.5B-Instruct (text) and Qwen2.5-VL-3B-Instruct (VLM): each of {greedy, top_p=0.5, top_k=10, rep_pen=1.5} now produces a distinct output, proving the parameters reach the sampler. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * feat(mlx): map format_type to MLX save_method, reuse local save dir for hub push export_merged_model: format_type="4-bit (FP4)" → save_method="merged_4bit" (was hardcoded merged_16bit, ignoring the UI choice). Both export_merged_model and export_base_model now pass save_directory= to push_to_hub_merged so it reuses the just-written local folder instead of re-saving under a relative "username/model" directory. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * feat(mlx): map format_type to MLX save_method, reuse local save dir for hub push - export_merged_model: format_type="4-bit (FP4)" → save_method="merged_4bit" (was hardcoded merged_16bit, ignoring the UI choice). - Both export_merged_model and export_base_model now pass save_directory= to push_to_hub_merged so it reuses the just-written local folder instead of re-saving under a relative "username/model" directory. Co-Authored-By: Manan17 <shahmanan170602@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * restore install * restore install * fix(mlx): restore FastVisionModel as a distinct class unsloth/__init__.py was assigning `FastVisionModel = FastLanguageModel` right after defining `class FastVisionModel(FastLanguageModel)` with a `for_training` static method. The alias erased the class binding, so the documented `FastVisionModel.for_training(model)` call from upstream Unsloth's VLM notebooks raised `AttributeError` on MLX. Remove the offending alias. `FastVisionModel` is now a real subclass of `FastLanguageModel` again — inherits `from_pretrained` / `get_peft_model` / `for_inference`, exposes `for_training` as a no-op pass-through (no-op because MLX doesn't have a train/eval mode flag; the call exists purely for GPU/MLX notebook parity). Verified end-to-end: Qwen3-VL-2B + LaTeX_OCR LoRA + vision LoRA via FastVisionModel.from_pretrained → get_peft_model → for_training → MLXTrainer.train runs 10 steps cleanly (loss 1.10 → 0.12, no NaNs, peak 5.89 GB). Studio's path (FastLanguageModel.from_pretrained for any repo, auto-detect VLM in the loader) is unaffected. Tier-1 review finding #8. * fix(mlx): restore FastVisionModel as a distinct class unsloth/__init__.py was assigning `FastVisionModel = FastLanguageModel` right after defining `class FastVisionModel(FastLanguageModel)` with a `for_training` static method. The alias erased the class binding, so the documented `FastVisionModel.for_training(model)` call from upstream Unsloth's VLM notebooks raised `AttributeError` on MLX. Remove the offending alias. `FastVisionModel` is now a real subclass of `FastLanguageModel` again — inherits `from_pretrained` / `get_peft_model` / `for_inference`, exposes `for_training` as a no-op pass-through (no-op because MLX doesn't have a train/eval mode flag; the call exists purely for GPU/MLX notebook parity). Verified end-to-end: Qwen3-VL-2B + LaTeX_OCR LoRA + vision LoRA via FastVisionModel.from_pretrained → get_peft_model → for_training → MLXTrainer.train() runs 10 steps cleanly (loss 1.10 → 0.12, no NaNs, peak 5.89 GB). Studio's path (FastLanguageModel.from_pretrained for any repo, auto-detect VLM in the loader) is unaffected. Tier-1 review finding #8. * Studio: harden MLX training and export, restore GPU init guards Studio export Restore Tuple[bool, str, Optional[str]] contract on export_merged_model, export_base_model, export_gguf, and export_lora_adapter, populating output_path on successful local saves so routes/worker/CLI/frontend details.output_path is non-empty again. Lift the GPU save_method assignment out of the local-save branch so Hub-only merged exports (save_directory='', push_to_hub=True) no longer hit UnboundLocalError on the push branch. For MLX merged and base hub-only export, stage to a tempfile.TemporaryDirectory before push_to_hub_merged instead of passing save_directory=''. Source _IS_MLX from unsloth instead of recomputing the platform check (single source of truth, also enforces mlx-package availability). Studio MLX training/inference Pass token=hf_token into FastMLXModel.from_pretrained for gated/private models, matching the inference path. Strip hf_token and wandb_token from wandb.init(config=...) so secrets do not leak into the W&B run config. Replace load_from_disk(local_datasets[0]) with the existing UnslothTrainer._resolve_local_files / _loader_for_files helpers so uploaded JSON/JSONL/CSV/Parquet files train through the normal datasets loader (load_from_disk still used for HF save_to_disk directories). Make the dataset slice helper inclusive at the end and treat 0 as a real index instead of "unset", matching the GPU and embedding paths. Add a status_message -> message alias inside _send so the existing parent pump (training.py) renders MLX status updates instead of blanks. Forward min_p through generate_chat_response into _generate_text / _generate_vlm and into make_sampler / vlm_kwargs so the sampling control is no longer a no-op on MLX. Wrap unsloth_zoo.mlx_loader / mlx_trainer imports with a clearer ImportError pointing users at install.sh for Apple Silicon. Exit the MLX stop-polling thread on EOFError/OSError instead of busy-looping when the queue/pipe is permanently closed (one-line why-safe rationale inline). Studio frontend ParamsSection subscribes to platform deviceType via the Zustand hook so the gradient checkpointing dropdown re-renders after the async device fetch completes. Studio hardware get_gpu_utilization MLX branch now reads _read_apple_gpu_stats once and derives VRAM totals from psutil, removing the second ioreg subprocess per utilization poll. Unsloth core Restore the os.geteuid == 0 guard around the CUDA ldconfig recovery that was lost when GPU initialization moved into _gpu_init.py, plus the non-root manual-fix warning branch. Non-root CUDA users no longer shell out to ldconfig at import time. Load dataprep/raw_text via importlib so the MLX import path no longer pulls torch in through dataprep/__init__.py -> synthetic.py. FastVisionModel.from_pretrained overrides the inherited delegator only to inject text_only=False; this is an extension, not a duplication, and is needed so VLM checkpoint loads keep the vision tower. Wrap the MLX-branch unsloth_zoo import with a clearer ImportError. * Studio: regression tests for MLX training/export and GPU init ldconfig guard tests/python/test_gpu_init_ldconfig_guard.py asserts the geteuid root check still wraps the ldconfig recovery and the non-root branch warns bnb users; AST + source-text inspection so the test runs without torch. tests/studio/test_export_output_path_contract.py covers the Tuple[bool, str, Optional[str]] return contract on every export method, the output_path assignment after successful local save, the Hub-only GPU save_method binding fix, the MLX hub-only TemporaryDirectory staging, and the single-source `_IS_MLX` import from unsloth. tests/studio/test_mlx_training_worker_behaviors.py covers token forwarding to FastMLXModel.from_pretrained, wandb config secret stripping, file-aware local dataset loading, status_message -> message aliasing, inclusive slice semantics, EOFError/OSError stop thread exit, and the friendly mlx_loader / mlx_trainer ImportError. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix(mlx): cap inference memory + release wired on unload + tame worker pre-pin Three memory-hardening fixes for Studio's MLX path: 1. Inference applies the same Metal caps as the trainer. load_model previously only called set_wired_limit(100% of recommended) with no upper memory_limit, leaving large VLM checkpoints unbounded during the loader allocation. Add _configure_memory_limits() that sets memory_limit to 85% of recommended and wired_limit to min(recommended, memory_limit) — matching MLXTrainer's defaults so behavior is the same whether the user trains or just runs inference. 2. unload_model releases pinned memory back to the OS — but only when the cache is empty. Without this, pinned wired bytes stayed allocated to MLX after the model was gone, starving other apps. The release is guarded on `not self.models` so unloading one of several cached models doesn't un-pin weights still in use. 3. Worker pre-cap is conservative instead of aggressive. The previous pre-pin set_wired_limit(100% of recommended) competed with MLXTrainer's later more conservative cap. Replace with the same 85%-memory / min(rec, memory) pair that the trainer applies later (idempotent re-apply). Bounds the model load + LoRA setup window without over-pinning. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: regression tests for the _IS_MLX dispatch gate Two gates drive every MLX-vs-CUDA dispatch decision in Studio: 1. unsloth._IS_MLX in unsloth/__init__.py — evaluated once at import time, read by Studio worker code to choose the GPU vs MLX trainer and inference paths. Defined as Darwin AND arm64 AND find_spec("mlx") is not None. 2. utils.hardware.detect_hardware() — runtime probe with priority CUDA > XPU > MLX > CPU. The MLX branch is reached only when both CUDA and XPU are unavailable and the host is Apple Silicon and mlx is importable. Neither gate had a direct test. Adds tests/studio/test_is_mlx_dispatch_gate.py with six tests: test_is_mlx_gate_uses_three_required_predicates AST-walks unsloth/__init__.py and asserts the _IS_MLX assignment is a BoolOp(And) of platform.system()=="Darwin", platform.machine()=="arm64", and find_spec("mlx") is not None. Catches accidental rewrites that drop a predicate. test_is_mlx_gate_true_on_apple_silicon_with_mlx_present Spoofs platform to Darwin/arm64, injects a fake mlx module so find_spec returns a real ModuleSpec, re-evaluates the gate expression. Verifies it flips True under the exact conditions Studio expects. test_is_mlx_gate_false_when_mlx_missing Spoofs Apple Silicon but with mlx absent. Verifies the gate stays False (so a Mac without mlx installed does not pretend to have MLX support). test_is_mlx_gate_false_on_non_apple_silicon Canary on the actual Linux+CUDA / AMD / Intel test host: the gate must remain False regardless of whether mlx happens to be importable. Protects existing GPU users from accidental MLX hijack when MLX support evolves. test_detect_hardware_picks_mlx_when_only_apple_silicon_available Forces torch.cuda and torch.xpu off, spoofs Apple Silicon, injects fake mlx and mlx.core. detect_hardware() must return DeviceType.MLX. test_detect_hardware_picks_cuda_on_real_host Canary: on a real CUDA host detect_hardware() must return DeviceType.CUDA. Protects against the MLX branch shadowing CUDA dispatch on NVIDIA / AMD ROCm hosts. Uses the same monkeypatch.setitem(sys.modules, ...) fake-mlx pattern as the existing test_mlx_inference_backend.py — no new test infrastructure, no real mlx install required. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add AGPL-3.0 SPDX header to Studio MLX regression tests Four Studio MLX test files shipped without an SPDX-License-Identifier: studio/backend/tests/test_mlx_training_worker_config.py tests/studio/test_mlx_training_worker_behaviors.py tests/studio/test_export_output_path_contract.py tests/studio/test_is_mlx_dispatch_gate.py They sit in or alongside studio/backend/, which is governed by studio/LICENSE.AGPL-3.0, and exercise AGPL Studio code. Add the same "# SPDX-License-Identifier: AGPL-3.0-only" header that's already on test_mlx_inference_backend.py so the license declaration matches the code under test rather than defaulting to the repo-root Apache-2.0. * Wrap MLX submodule imports with friendly install hint The _IS_MLX block at the top of unsloth/__init__.py already catches the missing-package case with a friendly install hint, but the follow-up "from unsloth_zoo.mlx_trainer import ..." and "from unsloth_zoo.mlx_loader import ..." lines run unguarded. An Apple Silicon user who has unsloth-zoo installed but on an older version (e.g. the current PyPI release, before the MLX modules ship) sees a raw ImportError on the submodule rather than the hint that points at install.sh. Wrap the two submodule imports in the same try/except shape so the friendly install message fires whether the package is missing entirely or just predates the MLX submodules. No-op once both packages release together; smooths the transitional window where unsloth/main has merged but unsloth-zoo on PyPI has not. --------- Co-authored-by: DoubleMathew <mmathew23@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>

Datta0 and others added 3 commits April 6, 2026 05:53

Fix VLM detection for transformers v5

248ca24

[pre-commit.ci] auto fixes from pre-commit.com hooks

fbfa817

for more information, see https://pre-commit.ci

Split: keep only 1 file(s)

784da37

danielhanchen mentioned this pull request Apr 6, 2026

[studio] Fix VLM detection for transformers v5 #4

Closed

danielhanchen closed this Apr 6, 2026

danielhanchen mentioned this pull request May 27, 2026

PR #697 Studio probe on macos-14 (do not merge) #157

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tests] [studio] Fix VLM detection for transformers v5#5

[tests] [studio] Fix VLM detection for transformers v5#5
danielhanchen wants to merge 3 commits into
mainfrom
pr-4868-tests

danielhanchen commented Apr 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

danielhanchen commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Staging mirror of unslothai#4868

Original description

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danielhanchen commented Apr 6, 2026 •

edited

Loading