Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656) by danielhanchen · Pull Request #87 · unslothai/unsloth-staging-1

danielhanchen · 2026-05-24T12:31:26Z

Summary

Throwaway PR to validate unslothai/unsloth#5656 and unslothai/unsloth-zoo#684 (both head mmathew23:explore/mlx) on real Windows / macOS-14 / Linux runners without burning credits on the main repos.

Three targeted workflows gated to push events on this branch with path filters; existing staging-fork workflows wiped to stay under the 5-concurrent-Windows-runner cap.

mlx-compiler-linux.yml (ubuntu-latest, CPU + CUDA spoof): matrix main vs mmathew23:explore/mlx for unsloth_zoo.compiler.cross_entropy_replacement_2. Confirms the dedicated UNSLOTH_RETURN_LOGITS=1 elif from zoo PR TypeError: argument of type 'NoneType' is not iterable when merging weights to 16bit and pushing to hub unsloth#666 (f45c31e5) is missing on the PR head, so the regex template fallback path now does a double lm_head matmul. AST rewriter default path is unaffected.
mlx-smoke-macos.yml (macos-14 Apple Silicon): real MLX 30-step smoke (tests/studio/run_real_mlx_smoke.py train) from PR MLX Training updates unsloth#5656 against the combined PR heads. Asserts len(losses_per_step) == config.max_steps == 30.
install-smoke-windows.yml (windows-latest): pydantic schema round-trip for TrainingStartRequest.max_grad_value / cast_norm_output_to_input_dtype plus the two CI-friendly studio backend tests touched by PR MLX Training updates unsloth#5656.

Test plan

mlx-compiler-linux.yml zoo=main leg PASS, zoo=pr-684 leg PASS (job encodes "expected fail on PR head" as success when reality matches expectation).
mlx-smoke-macos.yml: 30 logged training steps; eval contains "Unsloth".
install-smoke-windows.yml: pydantic + studio backend tests green.

* Add a simple --version flag * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Small code clean-up, less ugly * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Slightly better function names. And use again None --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>

Three targeted workflows + one regression test, gated to push events on this staging branch only with paths filters. mlx-compiler-linux.yml (ubuntu-latest, CPU + CUDA spoof) - Matrix: unsloth-zoo @ main (expect PASS) vs mmathew23:explore/mlx (expect FAIL, the regression). The check measures whether the dedicated UNSLOTH_RETURN_LOGITS=1 elif branch in cross_entropy_replacement_2 (added by zoo PR unslothai#666, f45c31e) is present in the patched forward across Llama / Mistral / Qwen3 / Gemma2 family heads. PR unslothai#684 commit 5895b20c removed the elif and ca086522 restored only the NOT_RETURN_LOGITS guard, so the template fallback path now executes self.lm_head twice when UNSLOTH_RETURN_LOGITS=1. Job exit encodes expected outcome so the matrix is green when reality matches expectation. mlx-smoke-macos.yml (macos-14 Apple Silicon) - Real MLX 30-step smoke from unsloth PR unslothai#5656's refreshed tests/studio/run_real_mlx_smoke.py against the combined PR heads (zoo @ explore/mlx, unsloth @ explore/mlx). Asserts len(losses_per_step) == config.max_steps (30, set by unslothai#5537). install-smoke-windows.yml (windows-latest) - Install smoke + pydantic schema check that TrainingStartRequest.max_grad_value (Optional[float]) and cast_norm_output_to_input_dtype (bool, default True) round-trip. Runs the two CI-friendly studio backend tests (test_mlx_training_worker_config, test_training_raw_support). Removed release-desktop.yml + stale.yml from the staging fork so push events on this branch trigger only the three new workflows. Branch is throwaway; do not merge upstream.

- mlx-compiler-linux: drop the diagnostic step that imported unsloth_zoo.compiler directly without the CUDA spoof. Only the test file applies the spoof, so the diagnostic crashed with NotImplementedError on the CPU runner. - mlx-smoke-macos: run_real_mlx_smoke.py train requires --workdir (was added at some point after the original PR landed). Pass a GITHUB_WORKSPACE-relative path. - install-smoke-windows: TrainingStartRequest also requires training_type and format_type. Pass valid Literal values, and while we are here, assert the JSON schema shape too so the test is not just a runtime smoke.

- Linux: pip install 'triton' so `import unsloth_zoo.compiler` succeeds. The module does `import triton` at top level; Triton has Linux x86_64 wheels even without a GPU. - macOS: set HF_HUB_ENABLE_HF_TRANSFER=0 in the job env. The runner enables hf_transfer by default but ships without the wheel, so mlx_lm's snapshot_download crashes inside huggingface_hub. - Windows: pip install 'datasets'. The studio backend test module imports `from datasets import Dataset` at collection time.

Linux compiler-py test: re-implement as a TEXT scan over the installed unsloth_zoo/compiler.py (no `import unsloth_zoo`). This sidesteps the triton CPU wheel <-> torch 2.6.0+cpu API mismatch (triton.compiler.compiler.AttrsDescriptor moved) and the GPU-only device_type @cache, neither of which is relevant for what we are actually measuring: the literal cross_entropy_replacement_2 template string. Workflow now installs only pytest + unsloth-zoo source (no torch, no transformers, no triton). Windows: also pip install structlog, httpx, aiosqlite. The studio backend test modules transitively import core.training which unconditionally does `import structlog` at top level. macOS smoke went green on the previous push - leave it alone.

+    double-`self.lm_head` matmul `else:` when UNSLOTH_RETURN_LOGITS=1.
+    """
+    path = _locate_compiler_py()
+    src = open(path, encoding="utf-8").read()


test_training_raw_support.py transitively imports the full studio backend (core.training.training -> matplotlib, etc.). Adding every transitive dep to the Windows install smoke is whack-a-mole and defeats the smoke's purpose. test_mlx_training_worker_config.py already covers PR unslothai#5656's wiring (model_random_state / lora_random_state fallback, max_grad_value None preservation, dataset_order=torch_randperm) via source-text assertions on worker.py. The test stubs out structlog/loggers/utils itself, so it works with just stdlib. Drop the broader test from the Windows job.

The PR unslothai#684 and PR unslothai#5656 heads were just updated with maintainer fixes (restored compiler.py UNSLOTH_RETURN_LOGITS elif, GPT-2 ln_* matching, Qwen3-VL flag wiring, default-branch reseed; plus seed present-but-None fix). Bump the three workflow files (comment-only) so the paths filters re-fire and we get a fresh signal on all three runners against the updated PR heads.

Now that the fix landed on mmathew23:explore/mlx (c7a0956d), the dedicated UNSLOTH_RETURN_LOGITS=1 elif is back in compiler.py on both zoo=main and zoo=pr-684. Drop the "expected to FAIL on PR head" encoding and just require pytest exit 0 on both. Anyone who reintroduces the regression on either branch will trip the check.

Round 2 of reviewer-driven fixes landed on the PR heads: zoo PR unslothai#684: 0753b115 - merged origin/main (restores unslothai#690 / unslothai#691 gpt-oss eager attn) - cleaned up norm cast monkey patch in train() finally - raise on streaming+dataset_order text combo - VLM baseline CE full-sequence forward parity with CCE - scheduler test now matches HF linear-no-warmup behavior unsloth PR unslothai#5656: bff5b44 (unchanged since last run) Re-fire all three workflows so we get a fresh signal.

…84 / PR unslothai#5656 1a02643)

After PR unslothai#684 head 1d0c11ed added per-epoch reseed to _create_labeled_batches for dataset_order='torch_randperm', the existing macOS smoke only exercised a single-epoch 30-step run, so the multi-epoch path was static-review-only. Add an inline Python step that calls _create_labeled_batches with num_epochs=3 against the already-installed real MLX, asserts the materialized batch count is num_epochs * batches_per_epoch, recovers the actual row order from each epoch's batches, and verifies it matches _torch_randperm_order( len(ds), SEED + epoch_idx) for every epoch. Also asserts epoch 0 and epoch 1 orders differ, so a silent regression where the reseed stops firing fails loudly here. Reuses the install state from the smoke step above instead of opening a separate workflow.

…g on real MLX Two more round-2/3 reviewer findings were static-review-only on the CUDA host. Append inline runtime assertions that reuse the macOS-14 install state from the 30-step training step above: (1) SGD/Muon/Lion HF parity decay (PR unslothai#684 6374bad6). Build an MLXTrainer with weight_decay=0.05 for each of sgd/muon/lion/adamw and assert (a) trainer._manual_weight_decay == 0.05 and (b) the underlying MLX optimizer's internal weight_decay is 0 so the manual decoupled term owns the update. Also probe the bias / norm / GPT-2 ln_* exclusion filter directly. (2) VLM train_on_completions prompt masking (PR unslothai#684 23751c84). Synthetic targets with an assistant marker token; verify _mask_prompt_tokens zeros out positions before the marker and leaves positions from the marker onward intact. Also assert assistant_token_id=0 is a no-op so train_on_completions=False paths are not silently mutated.

…real MLX Round-2 reviewer pass flagged that the pre-fix create_ordered_batches filled a short tail batch with samples from the next epoch inside the same micro-batch (e.g. ds=5, batch_size=3, num_epochs=2 yielded [[0,1,2],[3,4,0],[1,2,3],[4]] with the [3,4,0] cross-boundary batch). The fix slices indices[order_pos:order_pos+batch_size] from the current epoch, emits a partial batch when the tail is short, and advances to epoch+1 at the next iteration. Append an inline assertion that calls create_ordered_batches with num_epochs=2, batch_size=3, len(ds)=5, dataset_order='sequential' and verifies (a) total batch count == 4 ([0,1,2]+[3,4] x 2), (b) no batch mixes a sample from epoch 0's tail with a sample from epoch 1's head. Reuses the install state from the 30-step training step above.

Verifies the proportional per-leaf L2 clip introduced in unsloth-zoo f545a00 executes correctly on real MLX hardware: mode dispatch returns the expected (max_grad_norm, max_grad_value, max_grad_leaf_norm, mode) tuples; per-leaf rescale caps a 3.0-norm leaf to ~1.0 while leaving a 0.5-norm leaf alone; gradient direction is preserved per leaf.

Trains gemma-3-270m to memorise the same text the existing MLX smoke trains on, but with max_grad_leaf_norm=1.0 (the new proportional per-leaf clip default) instead of max_grad_value=1.0 (the prior elementwise clip). Per-step loss + grad-norm is dumped to a JSON artifact so the B200 CUDA reference run can be overlaid for parity comparison.

Runs three back-to-back gemma-3-270m memorisation trainings on macos-14, varying only the clip mode (elementwise / leaf_norm / global_norm). Captures per-step peak GPU memory so the OP's claim that max_grad_norm pays a real memory cost on MLX (cross-tree reduction + fp32 promotion of all leaves) can be quantified against the cheaper per-leaf modes. Dumps the comparison to a JSON artifact.

melroy89 and others added 3 commits May 18, 2026 03:46

Add .gemini/config.yaml (staging repo hygiene)

9f47625

github-code-quality Bot found potential problems May 24, 2026

View reviewed changes

Comment thread tests/staging_validation/test_compiler_return_logits.py Fixed

Comment thread tests/staging_validation/test_compiler_return_logits.py Fixed

shimmyshimmer added 3 commits May 24, 2026 12:36

github-code-quality Bot found potential problems May 24, 2026

View reviewed changes

Comment thread tests/staging_validation/test_compiler_return_logits.py

double-`self.lm_head` matmul `else:` when UNSLOTH_RETURN_LOGITS=1.

"""

path = _locate_compiler_py()

src = open(path, encoding="utf-8").read()

shimmyshimmer and others added 16 commits May 24, 2026 13:03

Re-trigger staging CI on PR head 4cb6ca66 / 56e32b7

a5e857c

Re-trigger staging CI on PR head 6374bad6 (HF parity decay)

1c7d63b

Re-trigger staging CI on PR head d16ef240 (round-3 P1/P2 fixes)

4d7998e

Re-trigger staging CI on round-3 final heads (PR unslothai#684 23751c…

d889bb3

…84 / PR unslothai#5656 1a02643)

Re-trigger MLX smoke for PR head 8c278327 (mask NameError fix)

542ec57

Re-trigger staging CI on PR head 1d0c11ed (round-4 final)

3cf6ec9

danielhanchen force-pushed the main branch from 9f47625 to b9dd7cf Compare June 7, 2026 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87

Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87
danielhanchen wants to merge 22 commits into
mainfrom
mlx-matthew-validation

danielhanchen commented May 24, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danielhanchen commented May 24, 2026

Summary

Test plan

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants