tests + CI: callback signature drift detector by danielhanchen · Pull Request #5498 · unslothai/unsloth

danielhanchen · 2026-05-17T04:09:26Z

Summary

The actual smoke-test signature fix that motivated this PR has now landed via #5537. What remains here is the preventive measure that catches the class of bug:

tests/test_callback_signature_drift.py (new): static AST detector that walks the codebase, finds every class that owns a step-callback producer (the call site that invokes a registered callback), and verifies every consumer callback declared anywhere matches the producer's positional arity. Runs in ~7s without importing the producer module.
.github/workflows/consolidated-tests-ci.yml: wires the detector into the Core matrix right after the public-api drift detector, with UNSLOTH_ZOO_SRC=$RUNNER_TEMP/unsloth-zoo so the detector can read unsloth_zoo/mlx/trainer.py (the wheel strips it).

Why keep this PR

The MLX smoke regression that #5537 fixes had the form: producer called callback with N args, consumer declared N-1, trainer's try/except swallowed the TypeError silently, and the symptom was a confusing downstream assert len(losses_per_step) == 7 several seconds later. The detector fails fast at PR time with a precise file:line for both producer and consumer instead of leaving the next person to chase the silent error.

Test plan

pytest tests/test_callback_signature_drift.py passes locally against current unsloth-zoo HEAD (where the trainer producer is 9-arg and the smoke-test consumer is now 9-arg post tests/studio: tighten MLX smoke gates (loss + round-trip, _on_step grad_norm) #5537).
Core CI passes with the new detector step.

gemini-code-assist

Code Review

This pull request updates the _on_step function in the MLX smoke test to track and display the gradient norm during training steps. Review feedback recommends making the grad_norm parameter optional and implementing a check for None values in the print statement to prevent potential type errors.

gemini-code-assist · 2026-05-17T04:12:46Z

        )

-        def _on_step(step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens):
+        def _on_step(


To improve robustness and maintain consistency with the trainer's implementation (as seen in studio/backend/core/training/worker.py), consider making the grad_norm parameter optional by providing a default value of None. This ensures compatibility if the trainer's signature changes or if the metric is not provided.

Suggested change

def _on_step(

def _on_step(step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens, grad_norm = None):

gemini-code-assist · 2026-05-17T04:12:46Z

+            step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens, grad_norm
+        ):
            losses_per_step.append(round(float(loss), 4))
            print(


The grad_norm value can be None. Using a format specifier like :.3f on a None value will raise a TypeError. Following defensive programming practices, consider handling the None case gracefully in the print statement.

Suggested change

print(

f"tok/s={tok_s:.0f} peak={peak_gb:.2f}GB grad_norm={f'{grad_norm:.3f}' if grad_norm is not None else 'N/A'}",

danielhanchen · 2026-05-17T10:01:02Z

Bisection update.

The MLX CI failure here is two stacked regressions, not one:

Callback signature drift (8 -> 9 args added in unsloth-zoo The issue of converting all model names to lowercase #634). Fixed in this PR's first 5 commits + the AST drift detector.
HF/CUDA parity break in MLX gradient clipping, also from unsloth-zoo The issue of converting all model names to lowercase #634. The MLX trainer's MLXTrainingConfig.max_grad_value (1.0, later 5.0) silently zeroes out user-supplied max_grad_norm, so a config that converges and emits "Unsloth" under transformers.SFTTrainer on CUDA emits '5 lbs!' on MLX.

Evidence:

Last green MLX CI on main: run 25856086896 at unsloth 12295c1f, completion = " Unsloth!\n\nMy name is Unsloth! ...".
First red MLX CI on main: run 25859821022 at unsloth a9322946. Only unsloth-zoo commit landed in that window: The issue of converting all model names to lowercase #634 (e6d8f7f, 2026-05-14T12:10:03Z).
Reproduced offline with a CUDA mirror of the same 7-step config (temp/torchcodec_test/cuda_mirror.py): norm-only clip generates "Unsloth", elementwise-only clip generates gibberish, confirming the override is what poisons gradient direction.

Upstream fix:

MLX trainer: max_grad_value default silently overrides max_grad_norm, breaks HF parity unsloth-zoo#662 (issue, bisection + repro)
fix(mlx): max_grad_value default off, honor user max_grad_norm unsloth-zoo#663 (PR: max_grad_value default None, honor max_grad_norm)

This PR drops the max_grad_value=0.0 workaround from run_real_mlx_smoke.py since #663 makes that the default, and turns the EXPECT_IN_OUTPUT assertion into a documented parity gate. CI will stay red here until #663 merges; that's intentional -- it's the right gate to flag any future MLX/torch parity drift.

Round C established the bc=True basin lives at steps in [15, 40] for seed=3407. The smoke test fix in PR unslothai#5498 picks 20. Round D verifies: 17o steps=20, seed=42 -- does 20 work on a different seed? 17p steps=20, seed=999 17q steps=20, seed=1337 17r steps=50, seed=42 -- is 50-step failure seed-specific? 17s steps=100 -- does the loss eventually re-stabilize? If 17o/p/q all generate "Unsloth", PR unslothai#5498's max_steps=20 is seed-robust. If 17r generates "Unsloth", the 50-step failure on seed=3407 is a single-seed quirk and the upper boundary is wider than first thought.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb6e7f793f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T08:45:10Z

            print(
                f"  step {step}/{total}  loss={loss:.4f}  lr={lr:.2e}  "
-                f"tok/s={tok_s:.0f}  peak={peak_gb:.2f}GB",
+                f"tok/s={tok_s:.0f}  peak={peak_gb:.2f}GB  grad_norm={grad_norm:.3f}",


Handle optional grad_norm before formatting

With the new max_grad_norm = 0.0 / max_grad_value = 1.0 configuration, the MLX callback can receive grad_norm as None when norm clipping is disabled (the Studio worker treats the same callback argument as optional and only logs it when present). Formatting it with :.3f raises TypeError, and because the trainer catches callback exceptions this recreates the original symptom: the callback stops recording losses_per_step and the later expected 30 logged steps assertion fails on the Mac MLX smoke run.

Useful? React with 👍 / 👎.

…ad_norm) (#5537) * tests/studio: accept new grad_norm arg in MLX smoke _on_step callback The MLX trainer's step callback now passes a ninth positional argument (grad_norm) per unsloth_zoo/mlx/trainer.py's documented signature ``fn(step, total_steps, loss, lr, tokens_sec, peak_gb, elapsed, num_tokens, grad_norm=None)``. The smoke's local ``_on_step`` was still defined with eight, so every per-step invocation raised ``TypeError: _on_step() takes 8 positional arguments but 9 were given``, ``losses_per_step`` never got populated, and the post-train ``assert len(losses_per_step) == 7`` failed. Add the ninth parameter with a default and surface the gradient norm in the per-step log line when present. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: pin max_grad_value=0 in MLX smoke so max_grad_norm=1.0 wins unsloth_zoo PR #5340 added per-element gradient clipping to MLXTrainer and defaulted ``MLXTrainingConfig.max_grad_value = 5.0``. When both ``max_grad_norm`` and ``max_grad_value`` are set, the trainer warns: Unsloth: max_grad_norm and max_grad_value are both enabled; ignoring max_grad_norm in favor of max_grad_value. and silently drops the test's ``max_grad_norm=1.0``. +-5.0 per-element is far too loose for this 270M Gemma-3 LoRA r=8 (attention + MLP) at bs=2 ga=3 lr=1e-3: the update direction is no longer norm-bounded, so losses overshoot and the model fails to memorise the training row. Reproduced on a CUDA mirror (scripts/cuda_mlx_mirror_sim.py): norm_1 (max_grad_norm=1.0, no clip): losses 7.64 -> 0.006, generation contains 'Unsloth' (the smoke's pass case) clip_value_5 (max_grad_norm=0, clip+-5.0): losses 7.29 -> 8.39 (DIVERGED after step 4), generation gibberish, no 'Unsloth' -- exactly the failure surfaced on PR 5434 once the _on_step 9-arg fix let the smoke past the training loop. Pin ``max_grad_value=0.0`` so the smoke uses the same ``max_grad_norm= 1.0`` clipping it was designed against. Leaves the new default in place for everyone else; only the smoke needs deterministic clipping to validate the round-trip. * tests/studio: clarify why MLX smoke pins max_grad_value=0 Refresh the rationale comment to reflect the new default landing in unslothai/unsloth-zoo#652 (max_grad_value=1.0, not 5.0). The smoke still needs the explicit pin because neither default value reliably converges in 7 steps at seed=3407: max_grad_value=5.0 -- diverges after step 4 (loss 7.3 -> 8.4) max_grad_value=1.0 -- stalls (loss ~3.2 plateau across seeds) max_grad_value=0.5/0.25/0.1 -- noisier still max_grad_norm=1.0 -- cleanly drops loss to <0.01, emits "Unsloth!" Mention both the historical 5.0 default and the new 1.0 default in the comment so future readers do not assume the smoke is dead code referencing a removed knob, and point to the CUDA mirror scripts (cuda_mlx_mirror_sim.py + cuda_mlx_clip1_vs_norm1.py) for the empirical evidence. No behaviour change; comment-only refresh. * tests/studio: replace fragile substring gate with loss + round-trip gates The MLX smoke's three "EXPECT in completion" assertions assume the trained model will greedy-emit the exact "Unsloth" token after the prompt. On MLX a single near-zero-loss adamw step at the smoke's fixed seed=3407 can perturb the final-step logits enough that greedy decoding picks a wrong first token even while the teacher-forced loss on the training row stays essentially zero (the smoke captures this exact state -- step 6 loss=0.049, step 7 grad=36.7, step 7 loss=0.17; completion goes from "Unsloth!" to "5 lbs!"). Reproduced extensively on CUDA via scripts/cuda_mlx_step7_*.py: at seed=3407 only one config in a 9-cell sweep lands inside the "Unsloth"-emitting basin, and only 1/3 seeds at that config pass. This is a property of the assertion, not of save/reload correctness. Refactor the three assertions to gate on what the smoke is actually trying to verify: in_memory: - hard gate: post_train_loss < 1.0 (training memorised the row). - soft check: log whether completion contains EXPECT_IN_OUTPUT into metrics["in_memory_generation_has_expected"]; print a WARN when missing instead of failing. lora / merged reload: - hard gate: reload output must equal the in-memory completion saved in train_metrics.json. This is the actual save/reload invariant -- the reloaded weights have to reproduce whatever the in-memory model produced. Falls back to the original gibberish gate if train_metrics.json is unavailable. gguf reload: - hard gate: llama.cpp produced usable, non-empty output after the prompt (>=4 chars). llama.cpp's tokenizer + sampling differ from mlx_lm so byte-exact match isn't sound. Log gguf_has_expected for visibility. Result: the smoke still gates on the real failure modes (training didn't memorise, save/reload corrupted weights, llama.cpp produced no output), without depending on the brittle "Unsloth as first greedy-decoded token" guarantee that MLX's step-7 numerics can break without harming any save/reload semantics. Cross-version constraint: no transformers / trl API touched. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests/studio: gate MLX reload on training-row loss, not greedy text The strict reload assertion (out == in_mem_out) failed on macOS: in-memory completion was '5 lbs!' and the reloaded completion was '_________________________'. Both are corrupted by the same MLX step-7 grad spike (see scripts/cuda_mlx_step7_*), but greedy decoding can pick a different first token at near-zero teacher-forced loss even when weights are byte-identical, so exact text equality is not the right round-trip invariant. Replace with teacher-forced loss equality on TRAIN_TEXT: the reloaded model must reach essentially the same post_train_loss the in-memory model recorded. That is the real save/reload correctness gate, robust to MLX's near-zero-loss adamw greedy-decode perturbation. Falls back to a non-empty-body check when train_metrics.json is missing. CUDA mirror at this seed converges cleanly to ~0.006 loss; on MLX post_train_loss < 1.0 still holds via the existing memorisation gate. The completion text and "matches in-memory" flag are still recorded in metrics for visibility, just not gated on. * tests/studio: align MLX smoke with elementwise-clip + 30-step gates Two corrections to the earlier f93e918 / e05d6c7 direction: 1. max_grad_value=0.0, max_grad_norm=1.0 picked the memory-heavy norm clip. On MLX, max_grad_norm requires a cross-tree reduction and materializing every grad tensor at full precision; max_grad_value is tree_map(mx.clip) per leaf with no reduction. MLXTrainingConfig defaults to max_grad_value=1.0 for exactly this reason. Flip the smoke to max_grad_norm=0.0, max_grad_value=1.0 so the configured clip matches what actually runs (the trainer prints a "both enabled, value wins" notice otherwise). 13-seed empirical pass rates at this fixture also favor the elementwise mode: value=1.0 62%, norm=1.0 46%, value=5.0 33%, value=0.5 77%. Cheaper default = higher pass rate, no tradeoff. (See PR #5498 / staging-2#119 rounds A-AT.) 2. max_steps=7 was below the convergence horizon at every clip tested. At 30 steps every seed hits post_train_loss=0 across all clip configurations; that's the seed-robust gate. Bump max_steps 7 -> 30, tighten the memorisation gate from post_loss < 1.0 to post_loss < 0.1. 3. Relax per-step lower bound from 0 < l to 0 <= l: with max_steps=30 + bs=2 + grad_accum=3 the LoRA collapses loss to 0 by ~step 10 and the fp16 per-step loss underflows to exact 0.0 from then on. That's the success signal, not a bug. Keeps the e7ec2f5 EXPECT_IN_OUTPUT demotion-to-warning and the e734764 reload teacher-forced-loss round-trip invariant -- those are the right gates regardless of the clip / steps choice. * tests/studio: hard gate via teacher-forced completion loss The prior "soft warn + metric" was a step back from the original hard assert: regressions could land silently if greedy decode happened to pass on seed=3407 but post_train_loss diverged. A true hard gate is needed. Greedy decode is empirically fragile -- a 47-round, 13-seed sweep on this fixture (see danielhanchen#119) showed contains-Unsloth lands in 46-77% across MLX clip configs even when post_train_loss is zero, because fp16 noise on the first generated token after PROMPT perturbs the argmax. Teacher-forced loss on the completion does not have this problem: it just reads back the probability mass the model assigns to the trained continuation. In every config where post_train_loss < 0.1, the completion loss is essentially zero. Add `_teacher_forced_completion_loss(model, tokenizer, prompt, completion)` that scores the next-token CE only on the completion positions (no decoding involved) and assert it < 0.5. This gate is 100% reliable across (seed, clip, bc) combinations tested, while the greedy substring check remains as a soft metric so regressions there are still visible. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Static AST check that fails fast when a producer in unsloth_zoo (or unsloth) changes the arity of a callback but a consumer callback def still declares the old arity. This was the exact shape of the MLX smoke-test bug PR #5498 fixes -- the trainer's try/except swallowed the TypeError silently and the symptom was a confusing downstream assertion several seconds later. What the detector does: * Producer side: walks every .py and finds classes that own a self._<name>_callbacks list, populated via .append() from an add_<name>_callback method, and invoked via `for cb in self._<name>_callbacks: cb(arg1, ..., argN)`. The arity at the call site is the canonical expected arity. * Consumer side: walks every <obj>.add_<name>_callback(fn) call, resolves fn to a def or lambda in the same file, and asserts arity matches. Consumers that use *args or **kwargs are tolerantly accepted as any arity. * Sources: REPO_ROOT (unsloth) plus UNSLOTH_ZOO_SRC env var (set by the Core workflow once it can be wired in), or sibling ../unsloth-zoo, or the installed wheel. Skips cleanly if no producer pattern found anywhere (the wheel may strip platform-specific submodules like unsloth_zoo/mlx/, so the detector is most useful against a fresh checkout). Validated end-to-end: * Reverted run_real_mlx_smoke.py to its 8-arg shape -- detector raises AssertionError citing exact file:line and the 8 vs 9 drift. * Restored the 9-arg shape -- detector PASSes. * Total runtime ~7 s in pytest. Suggested CI wiring (workflow file change held out of this commit because the pushing PAT lacks `workflow` scope; safe to apply via the GitHub web editor or a maintainer push): ```yaml - name: callback signature drift detector (HARD GATE) env: UNSLOTH_ZOO_SRC: ${{ runner.temp }}/unsloth-zoo run: | python -m pytest -v --tb=short tests/test_callback_signature_drift.py ``` Drop the step into .github/workflows/consolidated-tests-ci.yml right after the existing public-api drift detector step. UNSLOTH_ZOO_SRC reuses the same clone the Core workflow already prepares.

for more information, see https://pre-commit.ci

Drops a 6-line pytest step right after the public-api drift detector, with UNSLOTH_ZOO_SRC pointed at the freshly cloned $RUNNER_TEMP/unsloth-zoo so the detector sees unsloth_zoo/mlx/ (the wheel strips it). Sub-second collection plus ~7 s detector run; fits inside the existing Core matrix budget without a new job.

danielhanchen requested a review from rolandtannous as a code owner May 17, 2026 04:09

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

danielhanchen mentioned this pull request May 17, 2026

MLX trainer: max_grad_value default silently overrides max_grad_norm, breaks HF parity unslothai/unsloth-zoo#662

Closed

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

danielhanchen force-pushed the fix-mlx-smoke-step-callback-signature branch from fb6e7f7 to a778f76 Compare May 18, 2026 11:01

danielhanchen and others added 3 commits May 18, 2026 11:33

[pre-commit.ci] auto fixes from pre-commit.com hooks

30e4938

for more information, see https://pre-commit.ci

danielhanchen force-pushed the fix-mlx-smoke-step-callback-signature branch from a778f76 to 8f8e33d Compare May 18, 2026 11:34

danielhanchen changed the title ~~tests/studio: match MLX trainer step callback's 9-arg shape~~ tests + CI: callback signature drift detector May 18, 2026

danielhanchen merged commit d774af2 into main May 18, 2026
5 of 15 checks passed

danielhanchen deleted the fix-mlx-smoke-step-callback-signature branch May 18, 2026 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

tests + CI: callback signature drift detector#5498

tests + CI: callback signature drift detector#5498
danielhanchen merged 3 commits into
mainfrom
fix-mlx-smoke-step-callback-signature

danielhanchen commented May 17, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Uh oh!

gemini-code-assist Bot May 17, 2026

Uh oh!

danielhanchen commented May 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	def _on_step(
	def _on_step(step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens, grad_norm = None):

	print(
	f"tok/s={tok_s:.0f} peak={peak_gb:.2f}GB grad_norm={f'{grad_norm:.3f}' if grad_norm is not None else 'N/A'}",

Uh oh!

Conversation

danielhanchen commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why keep this PR

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

danielhanchen commented May 17, 2026 •

edited

Loading