Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87
Open
danielhanchen wants to merge 22 commits into
Open
Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87danielhanchen wants to merge 22 commits into
danielhanchen wants to merge 22 commits into
Conversation
* Add a simple --version flag * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Small code clean-up, less ugly * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Slightly better function names. And use again None --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Three targeted workflows + one regression test, gated to push events on this staging branch only with paths filters. mlx-compiler-linux.yml (ubuntu-latest, CPU + CUDA spoof) - Matrix: unsloth-zoo @ main (expect PASS) vs mmathew23:explore/mlx (expect FAIL, the regression). The check measures whether the dedicated UNSLOTH_RETURN_LOGITS=1 elif branch in cross_entropy_replacement_2 (added by zoo PR unslothai#666, f45c31e) is present in the patched forward across Llama / Mistral / Qwen3 / Gemma2 family heads. PR unslothai#684 commit 5895b20c removed the elif and ca086522 restored only the NOT_RETURN_LOGITS guard, so the template fallback path now executes self.lm_head twice when UNSLOTH_RETURN_LOGITS=1. Job exit encodes expected outcome so the matrix is green when reality matches expectation. mlx-smoke-macos.yml (macos-14 Apple Silicon) - Real MLX 30-step smoke from unsloth PR unslothai#5656's refreshed tests/studio/run_real_mlx_smoke.py against the combined PR heads (zoo @ explore/mlx, unsloth @ explore/mlx). Asserts len(losses_per_step) == config.max_steps (30, set by unslothai#5537). install-smoke-windows.yml (windows-latest) - Install smoke + pydantic schema check that TrainingStartRequest.max_grad_value (Optional[float]) and cast_norm_output_to_input_dtype (bool, default True) round-trip. Runs the two CI-friendly studio backend tests (test_mlx_training_worker_config, test_training_raw_support). Removed release-desktop.yml + stale.yml from the staging fork so push events on this branch trigger only the three new workflows. Branch is throwaway; do not merge upstream.
- mlx-compiler-linux: drop the diagnostic step that imported unsloth_zoo.compiler directly without the CUDA spoof. Only the test file applies the spoof, so the diagnostic crashed with NotImplementedError on the CPU runner. - mlx-smoke-macos: run_real_mlx_smoke.py train requires --workdir (was added at some point after the original PR landed). Pass a GITHUB_WORKSPACE-relative path. - install-smoke-windows: TrainingStartRequest also requires training_type and format_type. Pass valid Literal values, and while we are here, assert the JSON schema shape too so the test is not just a runtime smoke.
- Linux: pip install 'triton' so `import unsloth_zoo.compiler` succeeds. The module does `import triton` at top level; Triton has Linux x86_64 wheels even without a GPU. - macOS: set HF_HUB_ENABLE_HF_TRANSFER=0 in the job env. The runner enables hf_transfer by default but ships without the wheel, so mlx_lm's snapshot_download crashes inside huggingface_hub. - Windows: pip install 'datasets'. The studio backend test module imports `from datasets import Dataset` at collection time.
Linux compiler-py test: re-implement as a TEXT scan over the installed unsloth_zoo/compiler.py (no `import unsloth_zoo`). This sidesteps the triton CPU wheel <-> torch 2.6.0+cpu API mismatch (triton.compiler.compiler.AttrsDescriptor moved) and the GPU-only device_type @cache, neither of which is relevant for what we are actually measuring: the literal cross_entropy_replacement_2 template string. Workflow now installs only pytest + unsloth-zoo source (no torch, no transformers, no triton). Windows: also pip install structlog, httpx, aiosqlite. The studio backend test modules transitively import core.training which unconditionally does `import structlog` at top level. macOS smoke went green on the previous push - leave it alone.
| double-`self.lm_head` matmul `else:` when UNSLOTH_RETURN_LOGITS=1. | ||
| """ | ||
| path = _locate_compiler_py() | ||
| src = open(path, encoding="utf-8").read() |
test_training_raw_support.py transitively imports the full studio backend (core.training.training -> matplotlib, etc.). Adding every transitive dep to the Windows install smoke is whack-a-mole and defeats the smoke's purpose. test_mlx_training_worker_config.py already covers PR unslothai#5656's wiring (model_random_state / lora_random_state fallback, max_grad_value None preservation, dataset_order=torch_randperm) via source-text assertions on worker.py. The test stubs out structlog/loggers/utils itself, so it works with just stdlib. Drop the broader test from the Windows job.
The PR unslothai#684 and PR unslothai#5656 heads were just updated with maintainer fixes (restored compiler.py UNSLOTH_RETURN_LOGITS elif, GPT-2 ln_* matching, Qwen3-VL flag wiring, default-branch reseed; plus seed present-but-None fix). Bump the three workflow files (comment-only) so the paths filters re-fire and we get a fresh signal on all three runners against the updated PR heads.
Now that the fix landed on mmathew23:explore/mlx (c7a0956d), the dedicated UNSLOTH_RETURN_LOGITS=1 elif is back in compiler.py on both zoo=main and zoo=pr-684. Drop the "expected to FAIL on PR head" encoding and just require pytest exit 0 on both. Anyone who reintroduces the regression on either branch will trip the check.
Round 2 of reviewer-driven fixes landed on the PR heads: zoo PR unslothai#684: 0753b115 - merged origin/main (restores unslothai#690 / unslothai#691 gpt-oss eager attn) - cleaned up norm cast monkey patch in train() finally - raise on streaming+dataset_order text combo - VLM baseline CE full-sequence forward parity with CCE - scheduler test now matches HF linear-no-warmup behavior unsloth PR unslothai#5656: bff5b44 (unchanged since last run) Re-fire all three workflows so we get a fresh signal.
After PR unslothai#684 head 1d0c11ed added per-epoch reseed to _create_labeled_batches for dataset_order='torch_randperm', the existing macOS smoke only exercised a single-epoch 30-step run, so the multi-epoch path was static-review-only. Add an inline Python step that calls _create_labeled_batches with num_epochs=3 against the already-installed real MLX, asserts the materialized batch count is num_epochs * batches_per_epoch, recovers the actual row order from each epoch's batches, and verifies it matches _torch_randperm_order( len(ds), SEED + epoch_idx) for every epoch. Also asserts epoch 0 and epoch 1 orders differ, so a silent regression where the reseed stops firing fails loudly here. Reuses the install state from the smoke step above instead of opening a separate workflow.
…g on real MLX Two more round-2/3 reviewer findings were static-review-only on the CUDA host. Append inline runtime assertions that reuse the macOS-14 install state from the 30-step training step above: (1) SGD/Muon/Lion HF parity decay (PR unslothai#684 6374bad6). Build an MLXTrainer with weight_decay=0.05 for each of sgd/muon/lion/adamw and assert (a) trainer._manual_weight_decay == 0.05 and (b) the underlying MLX optimizer's internal weight_decay is 0 so the manual decoupled term owns the update. Also probe the bias / norm / GPT-2 ln_* exclusion filter directly. (2) VLM train_on_completions prompt masking (PR unslothai#684 23751c84). Synthetic targets with an assistant marker token; verify _mask_prompt_tokens zeros out positions before the marker and leaves positions from the marker onward intact. Also assert assistant_token_id=0 is a no-op so train_on_completions=False paths are not silently mutated.
…real MLX Round-2 reviewer pass flagged that the pre-fix create_ordered_batches filled a short tail batch with samples from the next epoch inside the same micro-batch (e.g. ds=5, batch_size=3, num_epochs=2 yielded [[0,1,2],[3,4,0],[1,2,3],[4]] with the [3,4,0] cross-boundary batch). The fix slices indices[order_pos:order_pos+batch_size] from the current epoch, emits a partial batch when the tail is short, and advances to epoch+1 at the next iteration. Append an inline assertion that calls create_ordered_batches with num_epochs=2, batch_size=3, len(ds)=5, dataset_order='sequential' and verifies (a) total batch count == 4 ([0,1,2]+[3,4] x 2), (b) no batch mixes a sample from epoch 0's tail with a sample from epoch 1's head. Reuses the install state from the 30-step training step above.
Verifies the proportional per-leaf L2 clip introduced in unsloth-zoo f545a00 executes correctly on real MLX hardware: mode dispatch returns the expected (max_grad_norm, max_grad_value, max_grad_leaf_norm, mode) tuples; per-leaf rescale caps a 3.0-norm leaf to ~1.0 while leaving a 0.5-norm leaf alone; gradient direction is preserved per leaf.
Trains gemma-3-270m to memorise the same text the existing MLX smoke trains on, but with max_grad_leaf_norm=1.0 (the new proportional per-leaf clip default) instead of max_grad_value=1.0 (the prior elementwise clip). Per-step loss + grad-norm is dumped to a JSON artifact so the B200 CUDA reference run can be overlaid for parity comparison.
Runs three back-to-back gemma-3-270m memorisation trainings on macos-14, varying only the clip mode (elementwise / leaf_norm / global_norm). Captures per-step peak GPU memory so the OP's claim that max_grad_norm pays a real memory cost on MLX (cross-tree reduction + fp32 promotion of all leaves) can be quantified against the cheaper per-leaf modes. Dumps the comparison to a JSON artifact.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Throwaway PR to validate
unslothai/unsloth#5656andunslothai/unsloth-zoo#684(both headmmathew23:explore/mlx) on real Windows / macOS-14 / Linux runners without burning credits on the main repos.Three targeted workflows gated to push events on this branch with path filters; existing staging-fork workflows wiped to stay under the 5-concurrent-Windows-runner cap.
mlx-compiler-linux.yml(ubuntu-latest, CPU + CUDA spoof): matrixmainvsmmathew23:explore/mlxforunsloth_zoo.compiler.cross_entropy_replacement_2. Confirms the dedicatedUNSLOTH_RETURN_LOGITS=1elif from zoo PR TypeError: argument of type 'NoneType' is not iterable when merging weights to 16bit and pushing to hub unsloth#666 (f45c31e5) is missing on the PR head, so the regex template fallback path now does a doublelm_headmatmul. AST rewriter default path is unaffected.mlx-smoke-macos.yml(macos-14 Apple Silicon): real MLX 30-step smoke (tests/studio/run_real_mlx_smoke.py train) from PR MLX Training updates unsloth#5656 against the combined PR heads. Assertslen(losses_per_step) == config.max_steps == 30.install-smoke-windows.yml(windows-latest): pydantic schema round-trip forTrainingStartRequest.max_grad_value/cast_norm_output_to_input_dtypeplus the two CI-friendly studio backend tests touched by PR MLX Training updates unsloth#5656.Test plan
mlx-compiler-linux.ymlzoo=main leg PASS, zoo=pr-684 leg PASS (job encodes "expected fail on PR head" as success when reality matches expectation).mlx-smoke-macos.yml: 30 logged training steps; eval contains "Unsloth".install-smoke-windows.yml: pydantic + studio backend tests green.