MLX training support for Studio on Apple Silicon by mmathew23 · Pull Request #5340 · unslothai/unsloth

mmathew23 · 2026-05-08T20:52:16Z

This PR completes the Studio MLX training path for Apple Silicon by wiring Studio into the MLX backend, stabilizing compiled training, and aligning MLX defaults with Unsloth’s existing Torch behavior where appropriate.

Companion PR: requires the matching unsloth-zoo MLX backend changes: unslothai/unsloth-zoo#634

What changed

Adds Studio MLX training support for text and vision models on Apple Silicon.
Routes Studio MLX training/inference through unsloth_zoo.mlx.loader and unsloth_zoo.mlx.trainer.
Passes MLX training controls from Studio into the backend, including:
- max_grad_norm
- max_grad_value
Sets MLX Studio defaults to:
- max_grad_norm = 0.0
- max_grad_value = 3.0
If both clipping modes are enabled, Studio reports that max_grad_norm is ignored in favor of max_grad_value.
Reports MLX grad norm back through Studio progress/history.
Keeps Apple Silicon MLX detection isolated from the normal CUDA/ROCm/Intel Torch path.
Avoids importing Torch GPU-specific modules in the MLX path where Studio only needs tokenizer/chat-template handling.

Why

Studio’s MLX path needs to run on Apple Silicon without entering Unsloth’s Torch GPU initialization path. The previous wiring could either miss MLX-only backend modules or pull in CUDA/Torch-specific code too early.

The MLX backend also needed clipping and grad norm behavior that is practical for Apple Silicon memory limits. Global norm clipping is expensive for gradient accumulation, while value clipping keeps the compiled low-memory path available, so the Studio default is now value clipping.

Behavior

CUDA, ROCm, and Intel Torch paths should continue to use the existing Unsloth behavior.
Apple Silicon with MLX installed uses the MLX path.
Apple Silicon without MLX falls back to the existing non-training/chat-only Studio behavior.
MLX text training compiles by default where supported.
MLX full finetuning uses fp32 weights by default to better match Unsloth Torch full finetuning behavior.
LoRA and QLoRA remain supported through the MLX backend.

Validation

Ran MLX CCE/runtime tests in unsloth-zoo.
Verified Qwen3 0.6B full finetuning with compiled CCE.
Verified Qwen3 4B full finetuning memory/throughput behavior.
Verified LoRA and QLoRA training paths with gradient accumulation.
Verified grad norm reporting works for full finetuning, LoRA, and QLoRA.
Verified max_grad_value keeps the compiled low-memory path, while global max_grad_norm is guarded for the expensive path.

…he torch gpu imports

for more information, see https://pre-commit.ci

gemini-code-assist

Code Review

This pull request refactors MLX support by updating import paths to the new unsloth_zoo.mlx structure and decoupling MLX training and inference from Torch-heavy modules. It introduces specialized local dataset resolution for MLX, adds support for gradient clipping parameters (max_grad_norm and max_grad_value), and integrates MLX-specific tokenizer patches. Review feedback identifies potential issues with directory validation and mixed file type handling in the new dataset resolution logic. Additionally, there are concerns regarding the global application of MLX-specific gradient clipping defaults, which may inadvertently affect non-MLX training runs and limit future UI configurability.

gemini-code-assist · 2026-05-08T20:55:09Z

+            parquet_dir = (
+                file_path_obj / "parquet-files"
+                if (file_path_obj / "parquet-files").exists()
+                else file_path_obj
+            )


The check for parquet-files should use .is_dir() instead of .exists(). If a file named parquet-files exists in the directory, the subsequent .glob() call will fail or return no results. Additionally, consider moving this entire dataset resolution logic to a shared utility file (e.g., utils/paths.py) to avoid duplication with UnslothTrainer while still keeping it accessible to the MLX path without importing Torch.

parquet_dir = file_path_obj / "parquet-files" if not parquet_dir.is_dir(): parquet_dir = file_path_obj

gemini-code-assist · 2026-05-08T20:55:09Z

+            candidates: list[Path] = []
+            for ext in (".json", ".jsonl", ".csv", ".parquet"):
+                candidates.extend(sorted(file_path_obj.glob(f"*{ext}")))
+            if candidates:
+                all_files.extend(str(c) for c in candidates)
+                continue
+
+            raise ValueError(f"No supported data files in directory: {file_path_obj}")


The current logic collects files of all supported extensions into a single list. If a directory contains mixed file types (e.g., both .json and .csv), load_dataset will fail because it is called with a single loader type determined by the first file's extension. The loop should break after finding the first extension that matches any files to ensure consistency within the directory.

Suggested change

candidates: list[Path] = []

for ext in (".json", ".jsonl", ".csv", ".parquet"):

candidates.extend(sorted(file_path_obj.glob(f"*{ext}")))

if candidates:

all_files.extend(str(c) for c in candidates)

continue

raise ValueError(f"No supported data files in directory: {file_path_obj}")

for ext in (".json", ".jsonl", ".csv", ".parquet"):

matches = sorted(file_path_obj.glob(f"*{ext}"))

if matches:

all_files.extend(str(m) for m in matches)

break

else:

raise ValueError(f"No supported data files in directory: {file_path_obj}")

continue

gemini-code-assist · 2026-05-08T20:55:09Z

+    max_grad_norm: float = Field(
+        0.0,
+        ge = 0,
+        description = "Global gradient norm clipping threshold. Set 0 to disable.",
+    )
+    max_grad_value: float = Field(
+        3.0,
+        ge = 0,
+        description = "Elementwise gradient value clipping threshold. Set 0 to disable.",
+    )


The default values for max_grad_norm (0.0) and max_grad_value (3.0) appear to be optimized for MLX on Apple Silicon but are now applied as global defaults for the entire Training API. This changes the default behavior for non-MLX (Torch) training runs, which typically default to max_grad_norm=1.0 and no elementwise clipping. Consider keeping these defaults platform-neutral and applying MLX-specific defaults within the MLX worker logic (which already has fallback logic for these keys).

gemini-code-assist · 2026-05-08T20:55:09Z

+    max_grad_norm: 0.0,
+    max_grad_value: 3.0,


Hardcoding max_grad_norm and max_grad_value here applies these MLX-specific defaults to all training platforms (including CUDA/ROCm). This overrides the standard Unsloth/Torch defaults and prevents users from configuring these values if UI controls are added in the future. These values should ideally be omitted to let backend defaults take over, or made platform-aware.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09fcc55f23

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-08T20:56:15Z

-    and platform.machine() == "arm64"
-    and importlib.util.find_spec("mlx") is not None
-)
+_IS_MLX = _is_mlx_available()


Restore platform-based MLX detection before GPU init

Using _is_mlx_available() as the sole gate means MLX mode is never entered when unsloth_zoo.mlx.runtime is missing or from an older unsloth-zoo build, because the ImportError path returns False and falls through to _gpu_init. On Apple Silicon hosts with MLX installed but stale/missing MLX submodules, this routes users into the CUDA/ROCm/XPU initialization path instead of the intended MLX compatibility/upgrade errors, effectively breaking import/training until package versions are perfectly aligned. The MLX gate should still be based on platform/MLX presence, then validate unsloth_zoo.mlx.* inside the MLX branch.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9e84165d2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-09T02:48:07Z

+    max_grad_value: float = Field(
+        1.0,
+        ge = 0,


Preserve norm clipping when value clipping is not provided

max_grad_value now defaults to 1.0, and _run_mlx_training explicitly disables max_grad_norm whenever both are > 0. This means API callers that set only max_grad_norm (or older clients unaware of the new field) will silently lose norm clipping because max_grad_value is implicitly enabled by default. In practice, norm clipping is unusable unless callers also know to send max_grad_value=0, which is a behavior regression for partial/legacy payloads.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dcbb50c27f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T23:06:02Z

+    max_grad_norm = float(config.get("max_grad_norm", 0.0) or 0.0)
+    max_grad_value = float(config.get("max_grad_value", 1.0) or 0.0)


Forward gradient clipping settings to the worker config

When users or API clients set the new max_grad_norm/max_grad_value fields, they are accepted in routes/training.py but never copied into the subprocess config in TrainingBackend.start_training (checked studio/backend/core/training/training.py lines 140-197). As a result every MLX run reaches this code with both keys absent, so the requested clipping settings are ignored and the hard-coded defaults here are always used; please add these fields to the config passed to run_training_process.

Useful? React with 👍 / 👎.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d9389d40c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-13T17:11:55Z

+    return _is_mlx_available()
+
+
+_IS_MLX = is_mlx_available()


Honor UNSLOTH_FORCE_GPU_PATH in MLX detection

unsloth/__init__.py now defines UNSLOTH_FORCE_GPU_PATH=1 as an explicit override to bypass MLX dispatch, but unsloth/device_type.py computes _IS_MLX unconditionally from is_mlx_available(). On Apple Silicon with MLX installed, modules that read unsloth.device_type will still classify the runtime as mlx even when the force-GPU switch is set, creating conflicting runtime behavior between the package entry gate and downstream device-dependent code.

Useful? React with 👍 / 👎.

Simplifies the MLX grad-clipping plumbing now that we are standardising on elementwise value clipping at [-5, 5] for the compiled MLX path and norm clipping disabled. The MLX worker no longer reads max_grad_norm / max_grad_value from the request; both are pinned in one place. Frontend stops sending the field at all, and the TypeScript request type drops it to match. Non-MLX (CUDA/AMD/Intel) is untouched and continues to pick up HF TrainingArguments' default max_grad_norm = 1.0.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07f4150466

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T11:36:36Z

+    # MLX: value-clip grads to [-5, 5]; norm clipping disabled for compile-friendliness.
+    max_grad_norm = 0.0
+    max_grad_value = 5.0  # TODO: expose MLX grad-clip in Studio UI for power users


Honor configured MLX gradient clipping controls

Even though this commit adds max_grad_norm/max_grad_value to the API payload and subprocess config, the MLX worker overwrites them here with fixed values before constructing MLXTrainingConfig. In API-driven runs that set max_grad_norm or disable value clipping with max_grad_value=0, those requested settings are silently ignored and every run uses norm clipping disabled plus value clipping at 5.0. This is fresh evidence beyond the earlier config-forwarding comment because the fields are now forwarded, but these hard-coded assignments still discard them.

Useful? React with 👍 / 👎.

…#5424) * studio: drop unused max_grad_value schema + route plumbing The MLX worker hardcodes max_grad_value to 5.0 after PR unslothai#5340. The schema field, frontend payload type, route forwarder, and start_training kwarg threading were all left in place as a transitional buffer for old clients. The field is now genuinely unused everywhere except inside the MLX worker, so the schema, route forwarder, and config-build entries can go. Pydantic still tolerates older clients that send max_grad_value because TrainingStartRequest's model_config defaults to extra=ignore. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…wins unsloth_zoo PR unslothai#5340 added per-element gradient clipping to MLXTrainer and defaulted ``MLXTrainingConfig.max_grad_value = 5.0``. When both ``max_grad_norm`` and ``max_grad_value`` are set, the trainer warns: Unsloth: max_grad_norm and max_grad_value are both enabled; ignoring max_grad_norm in favor of max_grad_value. and silently drops the test's ``max_grad_norm=1.0``. +-5.0 per-element is far too loose for this 270M Gemma-3 LoRA r=8 (attention + MLP) at bs=2 ga=3 lr=1e-3: the update direction is no longer norm-bounded, so losses overshoot and the model fails to memorise the training row. Reproduced on a CUDA mirror (scripts/cuda_mlx_mirror_sim.py): norm_1 (max_grad_norm=1.0, no clip): losses 7.64 -> 0.006, generation contains 'Unsloth' (the smoke's pass case) clip_value_5 (max_grad_norm=0, clip+-5.0): losses 7.29 -> 8.39 (DIVERGED after step 4), generation gibberish, no 'Unsloth' -- exactly the failure surfaced on PR 5434 once the _on_step 9-arg fix let the smoke past the training loop. Pin ``max_grad_value=0.0`` so the smoke uses the same ``max_grad_norm= 1.0`` clipping it was designed against. Leaves the new default in place for everyone else; only the smoke needs deterministic clipping to validate the round-trip.

#5434) * studio: install flash-linear-attention and tilelang for Qwen3.5 family Studio currently only installs causal-conv1d for qwen3.5 / qwen3.6 / qwen3-next models. Without flash-linear-attention installed alongside it, transformers' Qwen3.5 fast-path gate stays False and the model falls back to a pure-PyTorch loop for the GatedDeltaNet layers. In a 60-step run on unsloth/Qwen3.5-2B on B200, this fallback costs ~2.35x vs the full fast path. On top of that, FLA dispatches its hottest GDN kernels through a TileLang backend when tilelang is importable. Adding tilelang plus a pinned apache-tvm-ffi gives another ~26% on the same workload (4.73 s/step to 3.50 s/step) and is what users have been getting indirectly when they install mamba-ssm (mamba-ssm transitively pulls tilelang and pins apache-tvm-ffi<=0.1.9, which is the last working version on sm_100; 0.1.10 and 0.1.11 crash Triton with misaligned address). Changes: * _ensure_flash_linear_attention: pure-Python PyPI install gated on the same model match set as _ensure_causal_conv1d_fast_path. * _ensure_tilelang_backend: installs apache-tvm-ffi==0.1.9 and tilelang==0.1.8 in one pip resolve so the tvm-ffi pin wins over tilelang's >=0.1.2 constraint. Gated on the Qwen3.5 family only; SSM models (Nemotron-H, Falcon-H1, Granite-H, LFM2) do not use FLA's GDN dispatch. * UNSLOTH_STUDIO_SKIP_TILELANG_INSTALL=1 escape hatch matching the flash-attn pattern. * Orchestration block reordered: causal-conv1d -> fla -> mamba-ssm -> tilelang -> flash-attn (long context). * 7 new tests covering the new helpers, including SSM-model skip, skip-env, full Qwen3 family name variants, and graceful pip install failure. Combined Qwen3.5-2B-Vision step time on B200 in our bench goes from 5.0 s/step (current Studio: causal-conv1d only) to 3.5 s/step (causal-conv1d + fla + tilelang), a 1.43x speedup with no notebook or user code changes required. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: accept new grad_norm arg in MLX smoke _on_step callback The MLX trainer's step callback now passes a ninth positional argument (grad_norm) per unsloth_zoo/mlx/trainer.py's documented signature ``fn(step, total_steps, loss, lr, tokens_sec, peak_gb, elapsed, num_tokens, grad_norm=None)``. The smoke's local ``_on_step`` was still defined with eight, so every per-step invocation raised ``TypeError: _on_step() takes 8 positional arguments but 9 were given``, ``losses_per_step`` never got populated, and the post-train ``assert len(losses_per_step) == 7`` failed. Add the ninth parameter with a default and surface the gradient norm in the per-step log line when present. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci: retrigger after zoo drift + IPython fixes landed in main * tests/studio: pin max_grad_value=0 in MLX smoke so max_grad_norm=1.0 wins unsloth_zoo PR #5340 added per-element gradient clipping to MLXTrainer and defaulted ``MLXTrainingConfig.max_grad_value = 5.0``. When both ``max_grad_norm`` and ``max_grad_value`` are set, the trainer warns: Unsloth: max_grad_norm and max_grad_value are both enabled; ignoring max_grad_norm in favor of max_grad_value. and silently drops the test's ``max_grad_norm=1.0``. +-5.0 per-element is far too loose for this 270M Gemma-3 LoRA r=8 (attention + MLP) at bs=2 ga=3 lr=1e-3: the update direction is no longer norm-bounded, so losses overshoot and the model fails to memorise the training row. Reproduced on a CUDA mirror (scripts/cuda_mlx_mirror_sim.py): norm_1 (max_grad_norm=1.0, no clip): losses 7.64 -> 0.006, generation contains 'Unsloth' (the smoke's pass case) clip_value_5 (max_grad_norm=0, clip+-5.0): losses 7.29 -> 8.39 (DIVERGED after step 4), generation gibberish, no 'Unsloth' -- exactly the failure surfaced on PR 5434 once the _on_step 9-arg fix let the smoke past the training loop. Pin ``max_grad_value=0.0`` so the smoke uses the same ``max_grad_norm= 1.0`` clipping it was designed against. Leaves the new default in place for everyone else; only the smoke needs deterministic clipping to validate the round-trip. * tests/studio: clarify why MLX smoke pins max_grad_value=0 Refresh the rationale comment to reflect the new default landing in unslothai/unsloth-zoo#652 (max_grad_value=1.0, not 5.0). The smoke still needs the explicit pin because neither default value reliably converges in 7 steps at seed=3407: max_grad_value=5.0 -- diverges after step 4 (loss 7.3 -> 8.4) max_grad_value=1.0 -- stalls (loss ~3.2 plateau across seeds) max_grad_value=0.5/0.25/0.1 -- noisier still max_grad_norm=1.0 -- cleanly drops loss to <0.01, emits "Unsloth!" Mention both the historical 5.0 default and the new 1.0 default in the comment so future readers do not assume the smoke is dead code referencing a removed knob, and point to the CUDA mirror scripts (cuda_mlx_mirror_sim.py + cuda_mlx_clip1_vs_norm1.py) for the empirical evidence. No behaviour change; comment-only refresh. * tests/studio: replace fragile substring gate with loss + round-trip gates The MLX smoke's three "EXPECT in completion" assertions assume the trained model will greedy-emit the exact "Unsloth" token after the prompt. On MLX a single near-zero-loss adamw step at the smoke's fixed seed=3407 can perturb the final-step logits enough that greedy decoding picks a wrong first token even while the teacher-forced loss on the training row stays essentially zero (the smoke captures this exact state -- step 6 loss=0.049, step 7 grad=36.7, step 7 loss=0.17; completion goes from "Unsloth!" to "5 lbs!"). Reproduced extensively on CUDA via scripts/cuda_mlx_step7_*.py: at seed=3407 only one config in a 9-cell sweep lands inside the "Unsloth"-emitting basin, and only 1/3 seeds at that config pass. This is a property of the assertion, not of save/reload correctness. Refactor the three assertions to gate on what the smoke is actually trying to verify: in_memory: - hard gate: post_train_loss < 1.0 (training memorised the row). - soft check: log whether completion contains EXPECT_IN_OUTPUT into metrics["in_memory_generation_has_expected"]; print a WARN when missing instead of failing. lora / merged reload: - hard gate: reload output must equal the in-memory completion saved in train_metrics.json. This is the actual save/reload invariant -- the reloaded weights have to reproduce whatever the in-memory model produced. Falls back to the original gibberish gate if train_metrics.json is unavailable. gguf reload: - hard gate: llama.cpp produced usable, non-empty output after the prompt (>=4 chars). llama.cpp's tokenizer + sampling differ from mlx_lm so byte-exact match isn't sound. Log gguf_has_expected for visibility. Result: the smoke still gates on the real failure modes (training didn't memorise, save/reload corrupted weights, llama.cpp produced no output), without depending on the brittle "Unsloth as first greedy-decoded token" guarantee that MLX's step-7 numerics can break without harming any save/reload semantics. Cross-version constraint: no transformers / trl API touched. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: gate MLX reload on training-row loss, not greedy text The strict reload assertion (out == in_mem_out) failed on macOS: in-memory completion was '5 lbs!' and the reloaded completion was '_________________________'. Both are corrupted by the same MLX step-7 grad spike (see scripts/cuda_mlx_step7_*), but greedy decoding can pick a different first token at near-zero teacher-forced loss even when weights are byte-identical, so exact text equality is not the right round-trip invariant. Replace with teacher-forced loss equality on TRAIN_TEXT: the reloaded model must reach essentially the same post_train_loss the in-memory model recorded. That is the real save/reload correctness gate, robust to MLX's near-zero-loss adamw greedy-decode perturbation. Falls back to a non-empty-body check when train_metrics.json is missing. CUDA mirror at this seed converges cleanly to ~0.006 loss; on MLX post_train_loss < 1.0 still holds via the existing memorisation gate. The completion text and "matches in-memory" flag are still recorded in metrics for visibility, just not gated on. * ci: retrigger Backend CI after transient pwsh-startup timeout * ci: retrigger MLX dispatch after pytorch CDN DNS flake * studio: harden FLA + tilelang installers per reviewer feedback Addresses bot review on #5434: * Narrow `_ensure_flash_linear_attention` from `_model_wants_causal_conv1d` (which also matches Nemotron-H / Falcon-H1 / Granite-H / LFM2) to `_model_wants_tilelang` (Qwen3.5 / Qwen3.6 / Qwen3-Next only). True SSM families take the mamba_ssm path and never call FLA's GDN kernels, so installing FLA there is wasted bandwidth. * Pin both `flash-linear-attention==0.5.0` and `fla-core==0.5.0` and install with `--no-deps`. Otherwise pip resolves fla-core's declared `torch>=2.7.0` requirement and may silently upgrade the Studio venv's torch on environments running torch 2.4/2.5/2.6. * Skip both installs on Python <3.10 (FLA, fla-core, and tilelang all declare `Requires-Python: >=3.10`). On older interpreters the pip install would fail every launch and leave the worker on the slow torch fallback while still claiming to have set up the fast path. * Skip tilelang install on non-Linux platforms. `tilelang==0.1.8` only publishes Linux x86_64 / aarch64 and macOS arm64 wheels. Falling back to its 93MB sdist on a Studio worker is undesirable. * Detect an existing `apache-tvm-ffi` 0.1.10 / 0.1.11 install and force a reinstall to 0.1.9 with `--force-reinstall --no-deps`. Previously the import-only probe returned early and left the broken version in place, which crashes Triton on sm_100. * Add a 600s timeout to the tilelang and FLA subprocess.run calls, matching the existing flash-attn install pattern, so a network hang cannot block the training subprocess indefinitely. * 13 new / updated tests covering all six guards plus the pinned-spec, timeout, and force-reinstall code paths. Total: 21 passing tests (8 original + 13 new / updated). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: address reviewer.py P1/P2 findings on FLA + tilelang installers Twelve-reviewer aggregated review on this PR flagged several real correctness bugs in the first hardening pass. Fixes: P1: * Add UNSLOTH_STUDIO_SKIP_FLA_INSTALL escape hatch for symmetry with UNSLOTH_STUDIO_SKIP_TILELANG_INSTALL and the existing UNSLOTH_STUDIO_SKIP_FLASHATTN_INSTALL. * Install einops alongside fla-core. `--no-deps` was suppressing fla-core's only non-torch runtime dep, so on a clean venv `import fla.modules` raised ModuleNotFoundError even though pip exited 0. * Drop --no-deps from the tilelang force-reinstall path. tilelang needs z3-solver, ml-dtypes, cloudpickle, etc. at runtime; --force-reinstall --no-deps left libz3.so missing and `import tilelang` raised OSError on the next training subprocess. * Skip FLA install when installed torch is below 2.7.0 (fla-core declares torch>=2.7.0). Otherwise users on Studio's supported torch 2.4/2.5/2.6 stacks get an incompatible FLA installed silently. P2: * Replace bare `except ImportError` probes with helpers that catch `Exception` so a broken native package (OSError on missing .so, RuntimeError in __init__, ...) does not kill the worker before the fallback path can run. * Tighten the tilelang platform guard from "any linux" to "linux + machine in {x86_64, aarch64, ...}" so ppc64le / s390x / armv7 do not fall through and download the 93 MB tilelang sdist. * Add --only-binary=:all: to the tilelang install command. The comment already said we never want the sdist; now the pip invocation enforces it. * Verify both FLA and tilelang are importable after pip exits 0; if not, report and continue on the fallback path. 6 new tests bring the suite to 27 passing (was 21). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: pin packaging + triton with FLA --no-deps install An end-to-end install simulation in a fresh venv caught a real regression: `fla/utils.py` does `from packaging import version` and `import triton` at module load, but fla-core's METADATA only declares einops + torch. With `--no-deps` the worker would land FLA in any runtime that lacks packaging (e.g. minimal torch builds) and the post-install import probe would fall back to the torch GDN loop silently. Add `packaging` and `triton` to `_FLA_RUNTIME_DEPS` so the install spec list always carries them. Tests updated to assert both are now in the install command. * studio: hook transformers' fast-path gates for just-in-time FLA + causal-conv1d install The substring-based detection in this PR (`_model_wants_tilelang` / `_model_wants_causal_conv1d`) is brittle: it depends on what the user typed for the model name, not on what the architecture actually needs. Users typing custom model paths, future Qwen3.7 / non-Qwen GDN architectures, and any model whose author renamed it would silently fall back to the torch loop. The correct signal is the one transformers itself uses to gate the fast path. `transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py` does at module import time: if is_causal_conv1d_available(): from causal_conv1d import causal_conv1d_fn, causal_conv1d_update if is_flash_linear_attention_available(): from fla.modules import FusedRMSNormGated from fla.ops.gated_delta_rule import ( chunk_gated_delta_rule, fused_recurrent_gated_delta_rule, ) Wrap both gates so the first call (always at modeling import, before any forward pass) installs the matching kernel synchronously and delegates to the original function. Any model whose architecture queries those gates auto-triggers the install; models that never query them (Llama, Gemma, dense Qwen, ...) never pay the cost. Mechanics: - Split `_ensure_flash_linear_attention` and `_ensure_tilelang_backend` into `_unconditional` variants (no substring gate, retains python / torch / platform / skip-env guards) plus thin substring wrappers used by the legacy fallback path. - New `_install_fast_path_hooks(event_queue)` patches both gates on `transformers.utils.import_utils` AND sweeps `sys.modules` so any modeling file that already did `from ... import is_X` sees the wrapper (the local binding survives a module-level reassignment). - Wrappers clear the original's `lru_cache` before delegating, install on False, re-check, and short-circuit on subsequent calls. - Set `UNSLOTH_STUDIO_SKIP_FAST_PATH_HOOKS=1` to fall back to the substring path. Verified end-to-end against `transformers.models.qwen3_5_moe`: PRE_STATE fla=False tilelang=False causal_conv1d=False HOOK_INSTALLED Hook fired for is_causal_conv1d_available; installing kernel... Installing prebuilt causal-conv1d wheel... Hook fired for is_flash_linear_attention_available; installing kernel... Installing flash-linear-attention==0.5.0 (with fla-core==0.5.0) for the fast path... Installed flash-linear-attention for the FLA fast path Installing TileLang backend (apache-tvm-ffi==0.1.9, tilelang==0.1.8)... Installed TileLang backend for FLA fast path MODELING_IMPORT_OK FAST_PATH_SYMBOLS {"chunk_gated_delta_rule": true, "fused_recurrent_gated_delta_rule": true, "FusedRMSNormGated": true, "causal_conv1d_fn": true, "causal_conv1d_update": true} POST_STATE fla=True tilelang=True causal_conv1d=True Adds 9 new tests covering: install-on-False, skip-on-True, idempotency, install-failure handling, env-disable, lru_cache clear, sys.modules rebind, missing-transformers fallback, substring fallback. Total test count is now 36 (was 27). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: address reviewer.py n=12 findings on the FLA hook path Eight issues reproduced by parallel reviewers against 6ce495a; all fixed and covered by regression tests. 45 pytest cases pass (was 36); end-to-end Qwen3.5_MoE modeling-import drill still loads all five fast-path symbols. P1 fixes: 1. TileLang loses the Qwen-family guard on the normal FLA hook path (10/12 reviewers, reproduced with allenai/OLMo-Hybrid-1B). The hook unconditionally installed tilelang for any FLA-using model. - Threaded `model_name` through `_install_fast_path_hooks(event_queue, model_name)`. - `_fla_install` now gates tilelang on `_model_wants_tilelang(model_name)` AND a successful FLA install. 2. TileLang repair `--force-reinstall` (without `--no-deps`) could replace `torch==2.12.0+cu130` with `torch==2.12.0`. Split repair into TWO steps: step 1: `--force-reinstall --no-deps apache-tvm-ffi==0.1.9` step 2: regular install of tilelang + apache-tvm-ffi Step 1 surgically downgrades the broken package; step 2 resolves missing transitive deps (z3-solver, ml-dtypes) without --force-reinstall, so it never replaces torch. 3. Hook could return True after the installer's deep import probe failed: when pip exits 0 but `import fla.modules` raises, the old wrapper re-called `original()` (transformers' metadata check) and trusted it. Refactored: - `_ensure_flash_linear_attention_unconditional(...) -> bool` - `_ensure_tilelang_backend_unconditional(...) -> bool` The wrapper now uses the installer's bool directly. 4. SSM models (Nemotron-H, Falcon-H1, Granite-H) use `lazy_load_kernel("causal-conv1d")` and never call `is_causal_conv1d_available()`, so the hook never fires for them. The orchestrator now always runs `_ensure_causal_conv1d_fast_path` outside the hook-mode if/else. P2 fixes: 5. `_rebind_in_already_imported_modules` invoked transformers' lazy module `__getattr__` (hundreds of "Accessing X from .models..." warnings, ~3.4s overhead). Switched to `module.__dict__.get(...)` which only sees real module-level bindings. 6. TileLang installed even when FLA was skipped (Torch <2.7) or failed (timeout, post-install probe failed). Now gated on the installer's bool return. 7. TileLang repair was skipped when FLA was already True but tilelang missing or apache-tvm-ffi on the broken list. Added an optional `post_available_fn` to the wrapper; the FLA hook's `_fla_post_available` runs `_ensure_tilelang_backend_unconditional` when (model wants tilelang) AND (tilelang missing OR tvm-ffi broken). 8. `_flash_linear_attention_importable()` only checks deep import, not version. Added `_flash_linear_attention_current()` that compares against the pinned `flash-linear-attention==0.5.0` / `fla-core==0.5.0`; older versions trigger `--force-reinstall --no-deps` so torch stays untouched. Helpers extracted to keep the surface tight: - `_pip_install_cmd(*args)` builds `uv pip install` or `python -m pip install` depending on uv availability. - `_run_pip(cmd, event_queue, label)` runs a pip command with timeout / failure handling and a status emission. Regression tests added: - test_hook_does_not_install_tilelang_for_non_qwen_fla_model - test_hook_does_install_tilelang_for_qwen35 - test_tilelang_repair_does_not_touch_torch_cuda_stack - test_hook_trusts_installer_bool_not_metadata - test_rebind_does_not_trigger_module_getattr - test_hook_skips_tilelang_when_fla_install_is_skipped - test_hook_runs_tilelang_repair_when_fla_already_true - test_fla_installer_force_reinstalls_when_older_version_present - test_run_training_process_eagerly_installs_causal_conv1d_in_normal_mode Existing tests updated for the new `_install_fast_path_hooks` signature and the two-step tilelang repair flow. End-to-end re-verified against transformers.models.qwen3_5_moe: PRE_STATE fla=False, hook fires for both gates, FLA + tilelang + causal-conv1d install, all 5 fast-path symbols non-None. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: fix double-install of tilelang on the FLA hook install path Backend CI surfaced a test-isolation bug introduced by the post_available_fn mechanism for finding #7. The wrapper ran `post_available_fn` in BOTH paths (install ran AND gate already True), but `_fla_install` already chains tilelang on the install path, so the post-available step then called tilelang install AGAIN. This was masked locally because tilelang was installed in the workspace venv (post_available short-circuited on `_tilelang_importable()` returning True). CI starts with no tilelang, so the second call actually fired and the mock recorded two calls. Fix: only run `post_available_fn` when the install path did NOT run. That preserves the finding #7 semantics (tilelang repair when FLA already True but tilelang missing or tvm-ffi broken) without duplicating the chained install on the gate-was-False path. Also tightened `test_hook_skips_install_when_gate_already_true` to monkeypatch `_tilelang_importable=True` and `_installed_tvm_ffi_version=0.1.9` so it stays a pure "no install at all" test regardless of the venv's actual state. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci: retrigger Mac Studio GGUF after transient HF DNS resolve flake * studio: skip tilelang on HIP / ROCm torch (Strix Halo crash report) h34v3nzc0dex tested PR 5434 on Strix Halo (gfx1151, ROCm 7.13, torch 2.11.0+rocm7.13.0) and hit a hard regression: File ".../fla/ops/common/backends/tilelang/__init__.py", line 92, in chunk_bwd_dqkwg File ".../tilelang/jit/kernel.py", line 137, in __init__ File ".../tilelang/tileop/gemm/__init__.py", line 143, in _select_gemm_instruction tvm.error.InternalError: Check failed: (0) is false: Unsupported target for gemm: hip -keys=hip,gpu -mcpu=gfx1151 ... `tilelang==0.1.8` ships no HIP GEMM instruction; `_select_gemm_instruction` raises at lower-time, not import-time. So: - pip install succeeds - `import tilelang` succeeds - `TileLangBackend.is_available()` returns True - FLA's dispatcher picks TileLang for `chunk_bwd_dqkwg` - training subprocess dies at first GDN backward, no graceful fallback The PR's existing platform gate (`_tilelang_platform_supported`) checked only `sys.platform == "linux"` and `platform.machine()`, both of which look identical on a ROCm box. Fix has two layers: 1. INSTALL GATE: new `_torch_has_hip()` helper checks `torch.version.hip is not None`. `_tilelang_platform_supported` now returns False on HIP torch, so the install never fires. 2. RUNTIME GATE: even with the install skipped, a user could have tilelang already present (e.g. venv carried over from a CUDA box). `_install_fast_path_hooks` now calls `os.environ.setdefault("FLA_TILELANG", "0")` when HIP is detected, which is the env-var FLA's `TileLangBackend` already honors. Users who know they have a HIP-aware tilelang fork can override by setting `FLA_TILELANG=1` explicitly. This costs nothing on CUDA (the gate is a no-op when `torch.version.hip is None`), and removes the crash for AMD users. The benchmark numbers in the PR description (1.43x on B200 sm_100) are not affected. The other halves of the PR are confirmed working on gfx1151 by the same report: - `flash-linear-attention 0.5.0` runs at production scale (B=1 T=8192 H=16 K=128 V=128 and others) with no patches. - `causal-conv1d` runs at the shapes the fast-path gate cares about. (A separate Ubuntu 24.04 `--gcc-install-dir` build workaround is needed for the source-build path; that mirrors bbf004c's llama.cpp fix and is out of scope here.) Tests added: - test_tilelang_platform_unsupported_on_hip_torch - test_tilelang_install_skipped_on_hip_torch - test_install_fast_path_hooks_sets_fla_tilelang_zero_on_hip - test_install_fast_path_hooks_respects_user_fla_tilelang_override - test_install_fast_path_hooks_does_not_set_fla_tilelang_on_cuda Total 50 passing (was 45). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci: retrigger Windows Studio UI after transient Playwright tab-lookup flake * studio: auto-discover FLA-using model types from installed transformers Drop the hand-maintained `_TILELANG_MODEL_SUBSTRINGS` tuple (qwen3.5 / qwen3_5 / qwen3.6 / qwen3_6 / qwen3-next / qwen3_next) and derive the allowlist by scanning the installed `transformers/models/*/modeling_*.py` for `from fla.` imports. A model "wants tilelang" iff its modeling file imports an FLA op, which is the same signal `is_flash_linear_attention_available()` is the runtime test for. The scan happens once per worker subprocess and is cached for the process lifetime; an empty result (eg transformers not importable) means "no tilelang pre-install" -- the FLA runtime hook still drives the install via the gate when the loaded model actually probes it. Verified against the live installed transformers, the auto-derived set is {qwen3_5, qwen3_5_moe, qwen3_next}, with `_model_wants_tilelang` matching the HF Hub names `unsloth/Qwen3.5-2B`, `Qwen/Qwen3.5-MoE-A3B`, `mlx-community/qwen3-next-80b`, and correctly rejecting Llama, Mistral, Nemotron-H, Falcon-H1, etc. Future GDN models (Qwen3.7, OLMo-Hybrid-FA, ...) are picked up automatically once they ship in transformers; no further worker edits needed. Also trim docstrings / comments through the FLA / tilelang / HIP / hook block: constants get 1-line trailing comments, function docstrings collapse to 1-3 lines, and the fast-path-hooks banner shrinks from a 27-line block to 4 lines. The file drops from 2847 to 2630 lines without losing the load-bearing WHY notes (--no-deps protects torch; `__dict__.get` avoids lazy-module __getattr__; two-step tvm-ffi repair keeps torch off the dep graph; HIP setdefault disables FLA's TileLang dispatch even with tilelang already installed). 7 new tests (50 -> 57 total): discovery returns only FLA-using model_types; discovery cache reuse; missing transformers handled; OSError on a modeling file is non-fatal; `_model_wants_tilelang` matches real HF repo names across separator variants; empty discovery -> always False; normalization across `-`, `.`, `/`, space. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test: hermetize the non-allowlist hook test against transformers 5.4.0+ transformers 5.4.0 added `olmo_hybrid` as an FLA-using model_type, so the auto-discovered allowlist now includes it -- and the test's prior choice of `allenai/OLMo-Hybrid-1B` as a "non-Qwen FLA-only" example became an allowlist member. CI on Python 3.11 / 3.13 caught this. Swap to a guaranteed-not-in-allowlist fake model_name AND patch _discover_fla_model_types to a known {qwen3_5, qwen3_5_moe, qwen3_next} set so the test stays valid as upstream transformers adds new FLA-using architectures. Renames the test to reflect the actual semantic under test: "outside-allowlist -> no tilelang". * ci: retrigger Windows Studio API after llama.cpp prebuilt staging WinError 5 flake * tests: move MLX smoke gate changes to dedicated PR #5537 The seven MLX smoke commits in this PR's history (_on_step grad_norm, max_grad_value pin, loss + round-trip gates) are unrelated to the FLA / tilelang work. They now live in #5537 so this PR's diff is limited to the studio worker installer changes. Net effect on tests/studio/run_real_mlx_smoke.py vs main: zero. * studio: friendlier install banners (drop hook / gate-name jargon) User-visible status text now reads: Installing flash-linear-attention==<ver> for faster training... Installing TileLang==<ver> for faster training... Installing causal-conv1d for faster training... Installing flash-attn for faster training... Removed the transient "Hook fired for is_flash_linear_attention_available; installing kernel..." banner — the install banner that immediately follows already tells the user what is happening, in plain English. The internal logger.info messages (server-side log) still carry the gate names + "Hook fired ..." for debugging. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…ad_norm) (#5537) * tests/studio: accept new grad_norm arg in MLX smoke _on_step callback The MLX trainer's step callback now passes a ninth positional argument (grad_norm) per unsloth_zoo/mlx/trainer.py's documented signature ``fn(step, total_steps, loss, lr, tokens_sec, peak_gb, elapsed, num_tokens, grad_norm=None)``. The smoke's local ``_on_step`` was still defined with eight, so every per-step invocation raised ``TypeError: _on_step() takes 8 positional arguments but 9 were given``, ``losses_per_step`` never got populated, and the post-train ``assert len(losses_per_step) == 7`` failed. Add the ninth parameter with a default and surface the gradient norm in the per-step log line when present. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: pin max_grad_value=0 in MLX smoke so max_grad_norm=1.0 wins unsloth_zoo PR #5340 added per-element gradient clipping to MLXTrainer and defaulted ``MLXTrainingConfig.max_grad_value = 5.0``. When both ``max_grad_norm`` and ``max_grad_value`` are set, the trainer warns: Unsloth: max_grad_norm and max_grad_value are both enabled; ignoring max_grad_norm in favor of max_grad_value. and silently drops the test's ``max_grad_norm=1.0``. +-5.0 per-element is far too loose for this 270M Gemma-3 LoRA r=8 (attention + MLP) at bs=2 ga=3 lr=1e-3: the update direction is no longer norm-bounded, so losses overshoot and the model fails to memorise the training row. Reproduced on a CUDA mirror (scripts/cuda_mlx_mirror_sim.py): norm_1 (max_grad_norm=1.0, no clip): losses 7.64 -> 0.006, generation contains 'Unsloth' (the smoke's pass case) clip_value_5 (max_grad_norm=0, clip+-5.0): losses 7.29 -> 8.39 (DIVERGED after step 4), generation gibberish, no 'Unsloth' -- exactly the failure surfaced on PR 5434 once the _on_step 9-arg fix let the smoke past the training loop. Pin ``max_grad_value=0.0`` so the smoke uses the same ``max_grad_norm= 1.0`` clipping it was designed against. Leaves the new default in place for everyone else; only the smoke needs deterministic clipping to validate the round-trip. * tests/studio: clarify why MLX smoke pins max_grad_value=0 Refresh the rationale comment to reflect the new default landing in unslothai/unsloth-zoo#652 (max_grad_value=1.0, not 5.0). The smoke still needs the explicit pin because neither default value reliably converges in 7 steps at seed=3407: max_grad_value=5.0 -- diverges after step 4 (loss 7.3 -> 8.4) max_grad_value=1.0 -- stalls (loss ~3.2 plateau across seeds) max_grad_value=0.5/0.25/0.1 -- noisier still max_grad_norm=1.0 -- cleanly drops loss to <0.01, emits "Unsloth!" Mention both the historical 5.0 default and the new 1.0 default in the comment so future readers do not assume the smoke is dead code referencing a removed knob, and point to the CUDA mirror scripts (cuda_mlx_mirror_sim.py + cuda_mlx_clip1_vs_norm1.py) for the empirical evidence. No behaviour change; comment-only refresh. * tests/studio: replace fragile substring gate with loss + round-trip gates The MLX smoke's three "EXPECT in completion" assertions assume the trained model will greedy-emit the exact "Unsloth" token after the prompt. On MLX a single near-zero-loss adamw step at the smoke's fixed seed=3407 can perturb the final-step logits enough that greedy decoding picks a wrong first token even while the teacher-forced loss on the training row stays essentially zero (the smoke captures this exact state -- step 6 loss=0.049, step 7 grad=36.7, step 7 loss=0.17; completion goes from "Unsloth!" to "5 lbs!"). Reproduced extensively on CUDA via scripts/cuda_mlx_step7_*.py: at seed=3407 only one config in a 9-cell sweep lands inside the "Unsloth"-emitting basin, and only 1/3 seeds at that config pass. This is a property of the assertion, not of save/reload correctness. Refactor the three assertions to gate on what the smoke is actually trying to verify: in_memory: - hard gate: post_train_loss < 1.0 (training memorised the row). - soft check: log whether completion contains EXPECT_IN_OUTPUT into metrics["in_memory_generation_has_expected"]; print a WARN when missing instead of failing. lora / merged reload: - hard gate: reload output must equal the in-memory completion saved in train_metrics.json. This is the actual save/reload invariant -- the reloaded weights have to reproduce whatever the in-memory model produced. Falls back to the original gibberish gate if train_metrics.json is unavailable. gguf reload: - hard gate: llama.cpp produced usable, non-empty output after the prompt (>=4 chars). llama.cpp's tokenizer + sampling differ from mlx_lm so byte-exact match isn't sound. Log gguf_has_expected for visibility. Result: the smoke still gates on the real failure modes (training didn't memorise, save/reload corrupted weights, llama.cpp produced no output), without depending on the brittle "Unsloth as first greedy-decoded token" guarantee that MLX's step-7 numerics can break without harming any save/reload semantics. Cross-version constraint: no transformers / trl API touched. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: gate MLX reload on training-row loss, not greedy text The strict reload assertion (out == in_mem_out) failed on macOS: in-memory completion was '5 lbs!' and the reloaded completion was '_________________________'. Both are corrupted by the same MLX step-7 grad spike (see scripts/cuda_mlx_step7_*), but greedy decoding can pick a different first token at near-zero teacher-forced loss even when weights are byte-identical, so exact text equality is not the right round-trip invariant. Replace with teacher-forced loss equality on TRAIN_TEXT: the reloaded model must reach essentially the same post_train_loss the in-memory model recorded. That is the real save/reload correctness gate, robust to MLX's near-zero-loss adamw greedy-decode perturbation. Falls back to a non-empty-body check when train_metrics.json is missing. CUDA mirror at this seed converges cleanly to ~0.006 loss; on MLX post_train_loss < 1.0 still holds via the existing memorisation gate. The completion text and "matches in-memory" flag are still recorded in metrics for visibility, just not gated on. * tests/studio: align MLX smoke with elementwise-clip + 30-step gates Two corrections to the earlier f93e918 / e05d6c7 direction: 1. max_grad_value=0.0, max_grad_norm=1.0 picked the memory-heavy norm clip. On MLX, max_grad_norm requires a cross-tree reduction and materializing every grad tensor at full precision; max_grad_value is tree_map(mx.clip) per leaf with no reduction. MLXTrainingConfig defaults to max_grad_value=1.0 for exactly this reason. Flip the smoke to max_grad_norm=0.0, max_grad_value=1.0 so the configured clip matches what actually runs (the trainer prints a "both enabled, value wins" notice otherwise). 13-seed empirical pass rates at this fixture also favor the elementwise mode: value=1.0 62%, norm=1.0 46%, value=5.0 33%, value=0.5 77%. Cheaper default = higher pass rate, no tradeoff. (See PR #5498 / staging-2#119 rounds A-AT.) 2. max_steps=7 was below the convergence horizon at every clip tested. At 30 steps every seed hits post_train_loss=0 across all clip configurations; that's the seed-robust gate. Bump max_steps 7 -> 30, tighten the memorisation gate from post_loss < 1.0 to post_loss < 0.1. 3. Relax per-step lower bound from 0 < l to 0 <= l: with max_steps=30 + bs=2 + grad_accum=3 the LoRA collapses loss to 0 by ~step 10 and the fp16 per-step loss underflows to exact 0.0 from then on. That's the success signal, not a bug. Keeps the e7ec2f5 EXPECT_IN_OUTPUT demotion-to-warning and the e734764 reload teacher-forced-loss round-trip invariant -- those are the right gates regardless of the clip / steps choice. * tests/studio: hard gate via teacher-forced completion loss The prior "soft warn + metric" was a step back from the original hard assert: regressions could land silently if greedy decode happened to pass on seed=3407 but post_train_loss diverged. A true hard gate is needed. Greedy decode is empirically fragile -- a 47-round, 13-seed sweep on this fixture (see danielhanchen#119) showed contains-Unsloth lands in 46-77% across MLX clip configs even when post_train_loss is zero, because fp16 noise on the first generated token after PROMPT perturbs the argmax. Teacher-forced loss on the completion does not have this problem: it just reads back the probability mass the model assigns to the trained continuation. In every config where post_train_loss < 0.1, the completion loss is essentially zero. Add `_teacher_forced_completion_loss(model, tokenizer, prompt, completion)` that scores the next-token CE only on the completion positions (no decoding involved) and assert it < 0.5. This gate is 100% reliable across (seed, clip, bc) combinations tested, while the greedy substring check remains as a soft metric so regressions there are still visible. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

mmathew23 and others added 6 commits May 7, 2026 12:43

mlx fixes

797ae4c

Merge remote-tracking branch 'origin/main'

aaa2fab

Fix studio integration, local dataset files, chat templates without t…

051b202

…he torch gpu imports

pass grad norm in mlx worker

da2b371

fix(studio): pass MLX grad clipping settings

e5bf387

[pre-commit.ci] auto fixes from pre-commit.com hooks

09fcc55

for more information, see https://pre-commit.ci

mmathew23 requested review from Datta0, Manan17, danielhanchen and rolandtannous as code owners May 8, 2026 20:52

gemini-code-assist Bot reviewed May 8, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

mlx: update grad value

f9e8416

chatgpt-codex-connector Bot reviewed May 9, 2026

View reviewed changes

Merge branch 'main' into fix/mlx

dcbb50c

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

mmathew23 and others added 6 commits May 13, 2026 07:29

fix(mlx): address ci and clipping review

9dded6a

fix backward compatibility and CI tests

085b4f4

unsloth local is mlx function

83e577c

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0447f6

for more information, see https://pre-commit.ci

dont reference runtime

d51181f

[pre-commit.ci] auto fixes from pre-commit.com hooks

3d9389d

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

danielhanchen added 2 commits May 14, 2026 04:28

Merge branch 'main' into fix/mlx

55a3a00

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

danielhanchen merged commit a932294 into unslothai:main May 14, 2026
24 of 30 checks passed

danielhanchen mentioned this pull request May 14, 2026

studio: drop unused max_grad_value schema + route plumbing #5424

Merged

3 tasks

		max_grad_norm = float(config.get("max_grad_norm", 0.0) or 0.0)
		max_grad_value = float(config.get("max_grad_value", 1.0) or 0.0)

Uh oh!

Conversation

mmathew23 commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mmathew23 commented May 8, 2026 •

edited

Loading