Skip to content

Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87

Open
danielhanchen wants to merge 22 commits into
mainfrom
mlx-matthew-validation
Open

Stage: real-runner validation for MLX Matthew PRs (zoo#684 + unsloth#5656)#87
danielhanchen wants to merge 22 commits into
mainfrom
mlx-matthew-validation

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Throwaway PR to validate unslothai/unsloth#5656 and unslothai/unsloth-zoo#684 (both head mmathew23:explore/mlx) on real Windows / macOS-14 / Linux runners without burning credits on the main repos.

Three targeted workflows gated to push events on this branch with path filters; existing staging-fork workflows wiped to stay under the 5-concurrent-Windows-runner cap.

  • mlx-compiler-linux.yml (ubuntu-latest, CPU + CUDA spoof): matrix main vs mmathew23:explore/mlx for unsloth_zoo.compiler.cross_entropy_replacement_2. Confirms the dedicated UNSLOTH_RETURN_LOGITS=1 elif from zoo PR TypeError: argument of type 'NoneType' is not iterable when merging weights to 16bit and pushing to hub unsloth#666 (f45c31e5) is missing on the PR head, so the regex template fallback path now does a double lm_head matmul. AST rewriter default path is unaffected.
  • mlx-smoke-macos.yml (macos-14 Apple Silicon): real MLX 30-step smoke (tests/studio/run_real_mlx_smoke.py train) from PR MLX Training updates unsloth#5656 against the combined PR heads. Asserts len(losses_per_step) == config.max_steps == 30.
  • install-smoke-windows.yml (windows-latest): pydantic schema round-trip for TrainingStartRequest.max_grad_value / cast_norm_output_to_input_dtype plus the two CI-friendly studio backend tests touched by PR MLX Training updates unsloth#5656.

Test plan

  • mlx-compiler-linux.yml zoo=main leg PASS, zoo=pr-684 leg PASS (job encodes "expected fail on PR head" as success when reality matches expectation).
  • mlx-smoke-macos.yml: 30 logged training steps; eval contains "Unsloth".
  • install-smoke-windows.yml: pydantic + studio backend tests green.

melroy89 and others added 3 commits May 18, 2026 03:46
* Add a simple --version flag

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Small code clean-up, less ugly

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Slightly better function names. And use again None

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Three targeted workflows + one regression test, gated to push events
on this staging branch only with paths filters.

mlx-compiler-linux.yml (ubuntu-latest, CPU + CUDA spoof)
- Matrix: unsloth-zoo @ main (expect PASS) vs mmathew23:explore/mlx
  (expect FAIL, the regression). The check measures whether the
  dedicated UNSLOTH_RETURN_LOGITS=1 elif branch in
  cross_entropy_replacement_2 (added by zoo PR unslothai#666, f45c31e) is
  present in the patched forward across Llama / Mistral / Qwen3 /
  Gemma2 family heads. PR unslothai#684 commit 5895b20c removed the elif and
  ca086522 restored only the NOT_RETURN_LOGITS guard, so the template
  fallback path now executes self.lm_head twice when
  UNSLOTH_RETURN_LOGITS=1. Job exit encodes expected outcome so the
  matrix is green when reality matches expectation.

mlx-smoke-macos.yml (macos-14 Apple Silicon)
- Real MLX 30-step smoke from unsloth PR unslothai#5656's refreshed
  tests/studio/run_real_mlx_smoke.py against the combined PR heads
  (zoo @ explore/mlx, unsloth @ explore/mlx). Asserts
  len(losses_per_step) == config.max_steps (30, set by unslothai#5537).

install-smoke-windows.yml (windows-latest)
- Install smoke + pydantic schema check that
  TrainingStartRequest.max_grad_value (Optional[float]) and
  cast_norm_output_to_input_dtype (bool, default True) round-trip.
  Runs the two CI-friendly studio backend tests
  (test_mlx_training_worker_config, test_training_raw_support).

Removed release-desktop.yml + stale.yml from the staging fork so
push events on this branch trigger only the three new workflows.

Branch is throwaway; do not merge upstream.
Comment thread tests/staging_validation/test_compiler_return_logits.py Fixed
Comment thread tests/staging_validation/test_compiler_return_logits.py Fixed
- mlx-compiler-linux: drop the diagnostic step that imported
  unsloth_zoo.compiler directly without the CUDA spoof. Only the
  test file applies the spoof, so the diagnostic crashed with
  NotImplementedError on the CPU runner.

- mlx-smoke-macos: run_real_mlx_smoke.py train requires --workdir
  (was added at some point after the original PR landed). Pass a
  GITHUB_WORKSPACE-relative path.

- install-smoke-windows: TrainingStartRequest also requires
  training_type and format_type. Pass valid Literal values, and
  while we are here, assert the JSON schema shape too so the test
  is not just a runtime smoke.
- Linux: pip install 'triton' so `import unsloth_zoo.compiler` succeeds.
  The module does `import triton` at top level; Triton has Linux
  x86_64 wheels even without a GPU.

- macOS: set HF_HUB_ENABLE_HF_TRANSFER=0 in the job env. The runner
  enables hf_transfer by default but ships without the wheel, so
  mlx_lm's snapshot_download crashes inside huggingface_hub.

- Windows: pip install 'datasets'. The studio backend test module
  imports `from datasets import Dataset` at collection time.
Linux compiler-py test: re-implement as a TEXT scan over the installed
unsloth_zoo/compiler.py (no `import unsloth_zoo`). This sidesteps the
triton CPU wheel <-> torch 2.6.0+cpu API mismatch
(triton.compiler.compiler.AttrsDescriptor moved) and the GPU-only
device_type @cache, neither of which is relevant for what we are
actually measuring: the literal cross_entropy_replacement_2 template
string. Workflow now installs only pytest + unsloth-zoo source (no
torch, no transformers, no triton).

Windows: also pip install structlog, httpx, aiosqlite. The studio
backend test modules transitively import core.training which
unconditionally does `import structlog` at top level.

macOS smoke went green on the previous push - leave it alone.
double-`self.lm_head` matmul `else:` when UNSLOTH_RETURN_LOGITS=1.
"""
path = _locate_compiler_py()
src = open(path, encoding="utf-8").read()
shimmyshimmer and others added 16 commits May 24, 2026 13:03
test_training_raw_support.py transitively imports the full studio
backend (core.training.training -> matplotlib, etc.). Adding every
transitive dep to the Windows install smoke is whack-a-mole and
defeats the smoke's purpose.

test_mlx_training_worker_config.py already covers PR unslothai#5656's wiring
(model_random_state / lora_random_state fallback, max_grad_value
None preservation, dataset_order=torch_randperm) via source-text
assertions on worker.py. The test stubs out structlog/loggers/utils
itself, so it works with just stdlib.

Drop the broader test from the Windows job.
The PR unslothai#684 and PR unslothai#5656 heads were just updated with maintainer
fixes (restored compiler.py UNSLOTH_RETURN_LOGITS elif, GPT-2 ln_*
matching, Qwen3-VL flag wiring, default-branch reseed; plus seed
present-but-None fix). Bump the three workflow files (comment-only)
so the paths filters re-fire and we get a fresh signal on all three
runners against the updated PR heads.
Now that the fix landed on mmathew23:explore/mlx (c7a0956d), the
dedicated UNSLOTH_RETURN_LOGITS=1 elif is back in compiler.py on
both zoo=main and zoo=pr-684. Drop the "expected to FAIL on PR
head" encoding and just require pytest exit 0 on both. Anyone
who reintroduces the regression on either branch will trip the
check.
Round 2 of reviewer-driven fixes landed on the PR heads:
  zoo PR unslothai#684: 0753b115
    - merged origin/main (restores unslothai#690 / unslothai#691 gpt-oss eager attn)
    - cleaned up norm cast monkey patch in train() finally
    - raise on streaming+dataset_order text combo
    - VLM baseline CE full-sequence forward parity with CCE
    - scheduler test now matches HF linear-no-warmup behavior
  unsloth PR unslothai#5656: bff5b44 (unchanged since last run)

Re-fire all three workflows so we get a fresh signal.
After PR unslothai#684 head 1d0c11ed added per-epoch reseed to
_create_labeled_batches for dataset_order='torch_randperm', the
existing macOS smoke only exercised a single-epoch 30-step run, so
the multi-epoch path was static-review-only. Add an inline Python step
that calls _create_labeled_batches with num_epochs=3 against the
already-installed real MLX, asserts the materialized batch count is
num_epochs * batches_per_epoch, recovers the actual row order from
each epoch's batches, and verifies it matches _torch_randperm_order(
len(ds), SEED + epoch_idx) for every epoch. Also asserts epoch 0 and
epoch 1 orders differ, so a silent regression where the reseed stops
firing fails loudly here. Reuses the install state from the smoke
step above instead of opening a separate workflow.
…g on real MLX

Two more round-2/3 reviewer findings were static-review-only on the
CUDA host. Append inline runtime assertions that reuse the macOS-14
install state from the 30-step training step above:

(1) SGD/Muon/Lion HF parity decay (PR unslothai#684 6374bad6). Build an
    MLXTrainer with weight_decay=0.05 for each of sgd/muon/lion/adamw
    and assert (a) trainer._manual_weight_decay == 0.05 and
    (b) the underlying MLX optimizer's internal weight_decay is 0 so
    the manual decoupled term owns the update. Also probe the
    bias / norm / GPT-2 ln_* exclusion filter directly.

(2) VLM train_on_completions prompt masking (PR unslothai#684 23751c84).
    Synthetic targets with an assistant marker token; verify
    _mask_prompt_tokens zeros out positions before the marker and
    leaves positions from the marker onward intact. Also assert
    assistant_token_id=0 is a no-op so train_on_completions=False
    paths are not silently mutated.
…real MLX

Round-2 reviewer pass flagged that the pre-fix create_ordered_batches
filled a short tail batch with samples from the next epoch inside the
same micro-batch (e.g. ds=5, batch_size=3, num_epochs=2 yielded
[[0,1,2],[3,4,0],[1,2,3],[4]] with the [3,4,0] cross-boundary batch).
The fix slices indices[order_pos:order_pos+batch_size] from the
current epoch, emits a partial batch when the tail is short, and
advances to epoch+1 at the next iteration.

Append an inline assertion that calls create_ordered_batches with
num_epochs=2, batch_size=3, len(ds)=5, dataset_order='sequential' and
verifies (a) total batch count == 4 ([0,1,2]+[3,4] x 2), (b) no batch
mixes a sample from epoch 0's tail with a sample from epoch 1's head.
Reuses the install state from the 30-step training step above.
Verifies the proportional per-leaf L2 clip introduced in unsloth-zoo
f545a00 executes correctly on real MLX hardware: mode dispatch returns
the expected (max_grad_norm, max_grad_value, max_grad_leaf_norm, mode)
tuples; per-leaf rescale caps a 3.0-norm leaf to ~1.0 while leaving a
0.5-norm leaf alone; gradient direction is preserved per leaf.
Trains gemma-3-270m to memorise the same text the existing MLX smoke
trains on, but with max_grad_leaf_norm=1.0 (the new proportional
per-leaf clip default) instead of max_grad_value=1.0 (the prior
elementwise clip). Per-step loss + grad-norm is dumped to a JSON
artifact so the B200 CUDA reference run can be overlaid for parity
comparison.
Runs three back-to-back gemma-3-270m memorisation trainings on
macos-14, varying only the clip mode (elementwise / leaf_norm /
global_norm). Captures per-step peak GPU memory so the OP's claim
that max_grad_norm pays a real memory cost on MLX (cross-tree
reduction + fp32 promotion of all leaves) can be quantified against
the cheaper per-leaf modes. Dumps the comparison to a JSON artifact.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants