Skip to content

tests + CI: callback signature drift detector#5498

Merged
danielhanchen merged 3 commits into
mainfrom
fix-mlx-smoke-step-callback-signature
May 18, 2026
Merged

tests + CI: callback signature drift detector#5498
danielhanchen merged 3 commits into
mainfrom
fix-mlx-smoke-step-callback-signature

Conversation

@danielhanchen

@danielhanchen danielhanchen commented May 17, 2026

Copy link
Copy Markdown
Member

Summary

The actual smoke-test signature fix that motivated this PR has now landed via #5537. What remains here is the preventive measure that catches the class of bug:

  • tests/test_callback_signature_drift.py (new): static AST detector that walks the codebase, finds every class that owns a step-callback producer (the call site that invokes a registered callback), and verifies every consumer callback declared anywhere matches the producer's positional arity. Runs in ~7s without importing the producer module.
  • .github/workflows/consolidated-tests-ci.yml: wires the detector into the Core matrix right after the public-api drift detector, with UNSLOTH_ZOO_SRC=$RUNNER_TEMP/unsloth-zoo so the detector can read unsloth_zoo/mlx/trainer.py (the wheel strips it).

Why keep this PR

The MLX smoke regression that #5537 fixes had the form: producer called callback with N args, consumer declared N-1, trainer's try/except swallowed the TypeError silently, and the symptom was a confusing downstream assert len(losses_per_step) == 7 several seconds later. The detector fails fast at PR time with a precise file:line for both producer and consumer instead of leaving the next person to chase the silent error.

Test plan

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the _on_step function in the MLX smoke test to track and display the gradient norm during training steps. Review feedback recommends making the grad_norm parameter optional and implementing a check for None values in the print statement to prevent potential type errors.

)

def _on_step(step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens):
def _on_step(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve robustness and maintain consistency with the trainer's implementation (as seen in studio/backend/core/training/worker.py), consider making the grad_norm parameter optional by providing a default value of None. This ensures compatibility if the trainer's signature changes or if the metric is not provided.

Suggested change
def _on_step(
def _on_step(step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens, grad_norm = None):

step, total, loss, lr, tok_s, peak_gb, elapsed, num_tokens, grad_norm
):
losses_per_step.append(round(float(loss), 4))
print(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The grad_norm value can be None. Using a format specifier like :.3f on a None value will raise a TypeError. Following defensive programming practices, consider handling the None case gracefully in the print statement.

Suggested change
print(
f"tok/s={tok_s:.0f} peak={peak_gb:.2f}GB grad_norm={f'{grad_norm:.3f}' if grad_norm is not None else 'N/A'}",

@danielhanchen

Copy link
Copy Markdown
Member Author

Bisection update.

The MLX CI failure here is two stacked regressions, not one:

  1. Callback signature drift (8 -> 9 args added in unsloth-zoo The issue of converting all model names to lowercase #634). Fixed in this PR's first 5 commits + the AST drift detector.
  2. HF/CUDA parity break in MLX gradient clipping, also from unsloth-zoo The issue of converting all model names to lowercase #634. The MLX trainer's MLXTrainingConfig.max_grad_value (1.0, later 5.0) silently zeroes out user-supplied max_grad_norm, so a config that converges and emits "Unsloth" under transformers.SFTTrainer on CUDA emits '5 lbs!' on MLX.

Evidence:

  • Last green MLX CI on main: run 25856086896 at unsloth 12295c1f, completion = " Unsloth!\n\nMy name is Unsloth! ...".
  • First red MLX CI on main: run 25859821022 at unsloth a9322946. Only unsloth-zoo commit landed in that window: The issue of converting all model names to lowercase #634 (e6d8f7f, 2026-05-14T12:10:03Z).
  • Reproduced offline with a CUDA mirror of the same 7-step config (temp/torchcodec_test/cuda_mirror.py): norm-only clip generates "Unsloth", elementwise-only clip generates gibberish, confirming the override is what poisons gradient direction.

Upstream fix:

This PR drops the max_grad_value=0.0 workaround from run_real_mlx_smoke.py since #663 makes that the default, and turns the EXPECT_IN_OUTPUT assertion into a documented parity gate. CI will stay red here until #663 merges; that's intentional -- it's the right gate to flag any future MLX/torch parity drift.

danielhanchen added a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 17, 2026
Round C established the bc=True basin lives at steps in [15, 40] for
seed=3407. The smoke test fix in PR unslothai#5498 picks 20.

Round D verifies:
  17o  steps=20, seed=42    -- does 20 work on a different seed?
  17p  steps=20, seed=999
  17q  steps=20, seed=1337
  17r  steps=50, seed=42    -- is 50-step failure seed-specific?
  17s  steps=100             -- does the loss eventually re-stabilize?

If 17o/p/q all generate "Unsloth", PR unslothai#5498's max_steps=20 is
seed-robust. If 17r generates "Unsloth", the 50-step failure on
seed=3407 is a single-seed quirk and the upper boundary is wider
than first thought.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb6e7f793f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/studio/run_real_mlx_smoke.py Outdated
print(
f" step {step}/{total} loss={loss:.4f} lr={lr:.2e} "
f"tok/s={tok_s:.0f} peak={peak_gb:.2f}GB",
f"tok/s={tok_s:.0f} peak={peak_gb:.2f}GB grad_norm={grad_norm:.3f}",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle optional grad_norm before formatting

With the new max_grad_norm = 0.0 / max_grad_value = 1.0 configuration, the MLX callback can receive grad_norm as None when norm clipping is disabled (the Studio worker treats the same callback argument as optional and only logs it when present). Formatting it with :.3f raises TypeError, and because the trainer catches callback exceptions this recreates the original symptom: the callback stops recording losses_per_step and the later expected 30 logged steps assertion fails on the Mac MLX smoke run.

Useful? React with 👍 / 👎.

@danielhanchen danielhanchen force-pushed the fix-mlx-smoke-step-callback-signature branch from fb6e7f7 to a778f76 Compare May 18, 2026 11:01
danielhanchen added a commit that referenced this pull request May 18, 2026
…ad_norm) (#5537)

* tests/studio: accept new grad_norm arg in MLX smoke _on_step callback

The MLX trainer's step callback now passes a ninth positional argument
(grad_norm) per unsloth_zoo/mlx/trainer.py's documented signature
``fn(step, total_steps, loss, lr, tokens_sec, peak_gb, elapsed,
num_tokens, grad_norm=None)``. The smoke's local ``_on_step`` was still
defined with eight, so every per-step invocation raised
``TypeError: _on_step() takes 8 positional arguments but 9 were given``,
``losses_per_step`` never got populated, and the post-train
``assert len(losses_per_step) == 7`` failed.

Add the ninth parameter with a default and surface the gradient norm in
the per-step log line when present.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/studio: pin max_grad_value=0 in MLX smoke so max_grad_norm=1.0 wins

unsloth_zoo PR #5340 added per-element gradient clipping to MLXTrainer
and defaulted ``MLXTrainingConfig.max_grad_value = 5.0``. When both
``max_grad_norm`` and ``max_grad_value`` are set, the trainer warns:

  Unsloth: max_grad_norm and max_grad_value are both enabled;
  ignoring max_grad_norm in favor of max_grad_value.

and silently drops the test's ``max_grad_norm=1.0``. +-5.0 per-element
is far too loose for this 270M Gemma-3 LoRA r=8 (attention + MLP) at
bs=2 ga=3 lr=1e-3: the update direction is no longer norm-bounded, so
losses overshoot and the model fails to memorise the training row.

Reproduced on a CUDA mirror (scripts/cuda_mlx_mirror_sim.py):

  norm_1       (max_grad_norm=1.0, no clip): losses 7.64 -> 0.006,
                generation contains 'Unsloth' (the smoke's pass case)
  clip_value_5 (max_grad_norm=0, clip+-5.0): losses 7.29 -> 8.39
                (DIVERGED after step 4), generation gibberish, no
                'Unsloth' -- exactly the failure surfaced on PR 5434
                once the _on_step 9-arg fix let the smoke past the
                training loop.

Pin ``max_grad_value=0.0`` so the smoke uses the same ``max_grad_norm=
1.0`` clipping it was designed against. Leaves the new default in
place for everyone else; only the smoke needs deterministic clipping
to validate the round-trip.

* tests/studio: clarify why MLX smoke pins max_grad_value=0

Refresh the rationale comment to reflect the new default landing in
unslothai/unsloth-zoo#652 (max_grad_value=1.0, not 5.0). The smoke
still needs the explicit pin because neither default value reliably
converges in 7 steps at seed=3407:

  max_grad_value=5.0 -- diverges after step 4 (loss 7.3 -> 8.4)
  max_grad_value=1.0 -- stalls (loss ~3.2 plateau across seeds)
  max_grad_value=0.5/0.25/0.1 -- noisier still
  max_grad_norm=1.0  -- cleanly drops loss to <0.01, emits "Unsloth!"

Mention both the historical 5.0 default and the new 1.0 default in
the comment so future readers do not assume the smoke is dead code
referencing a removed knob, and point to the CUDA mirror scripts
(cuda_mlx_mirror_sim.py + cuda_mlx_clip1_vs_norm1.py) for the
empirical evidence.

No behaviour change; comment-only refresh.

* tests/studio: replace fragile substring gate with loss + round-trip gates

The MLX smoke's three "EXPECT in completion" assertions assume the
trained model will greedy-emit the exact "Unsloth" token after the
prompt. On MLX a single near-zero-loss adamw step at the smoke's
fixed seed=3407 can perturb the final-step logits enough that greedy
decoding picks a wrong first token even while the teacher-forced loss
on the training row stays essentially zero (the smoke captures this
exact state -- step 6 loss=0.049, step 7 grad=36.7, step 7 loss=0.17;
completion goes from "Unsloth!" to "5 lbs!"). Reproduced extensively
on CUDA via scripts/cuda_mlx_step7_*.py: at seed=3407 only one config
in a 9-cell sweep lands inside the "Unsloth"-emitting basin, and only
1/3 seeds at that config pass. This is a property of the assertion,
not of save/reload correctness.

Refactor the three assertions to gate on what the smoke is actually
trying to verify:

  in_memory:
    - hard gate: post_train_loss < 1.0 (training memorised the row).
    - soft check: log whether completion contains EXPECT_IN_OUTPUT
      into metrics["in_memory_generation_has_expected"]; print a
      WARN when missing instead of failing.

  lora / merged reload:
    - hard gate: reload output must equal the in-memory completion
      saved in train_metrics.json. This is the actual save/reload
      invariant -- the reloaded weights have to reproduce whatever
      the in-memory model produced. Falls back to the original
      gibberish gate if train_metrics.json is unavailable.

  gguf reload:
    - hard gate: llama.cpp produced usable, non-empty output after
      the prompt (>=4 chars). llama.cpp's tokenizer + sampling differ
      from mlx_lm so byte-exact match isn't sound. Log
      gguf_has_expected for visibility.

Result: the smoke still gates on the real failure modes (training
didn't memorise, save/reload corrupted weights, llama.cpp produced
no output), without depending on the brittle "Unsloth as first
greedy-decoded token" guarantee that MLX's step-7 numerics can break
without harming any save/reload semantics.

Cross-version constraint: no transformers / trl API touched.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/studio: gate MLX reload on training-row loss, not greedy text

The strict reload assertion (out == in_mem_out) failed on macOS:
in-memory completion was '5 lbs!' and the reloaded completion was
'_________________________'. Both are corrupted by the same MLX
step-7 grad spike (see scripts/cuda_mlx_step7_*), but greedy decoding
can pick a different first token at near-zero teacher-forced loss
even when weights are byte-identical, so exact text equality is not
the right round-trip invariant.

Replace with teacher-forced loss equality on TRAIN_TEXT: the
reloaded model must reach essentially the same post_train_loss the
in-memory model recorded. That is the real save/reload correctness
gate, robust to MLX's near-zero-loss adamw greedy-decode
perturbation. Falls back to a non-empty-body check when
train_metrics.json is missing.

CUDA mirror at this seed converges cleanly to ~0.006 loss; on MLX
post_train_loss < 1.0 still holds via the existing memorisation
gate. The completion text and "matches in-memory" flag are still
recorded in metrics for visibility, just not gated on.

* tests/studio: align MLX smoke with elementwise-clip + 30-step gates

Two corrections to the earlier f93e918 / e05d6c7 direction:

1. max_grad_value=0.0, max_grad_norm=1.0 picked the memory-heavy
   norm clip. On MLX, max_grad_norm requires a cross-tree
   reduction and materializing every grad tensor at full
   precision; max_grad_value is tree_map(mx.clip) per leaf with
   no reduction. MLXTrainingConfig defaults to max_grad_value=1.0
   for exactly this reason. Flip the smoke to
   max_grad_norm=0.0, max_grad_value=1.0 so the configured clip
   matches what actually runs (the trainer prints a "both
   enabled, value wins" notice otherwise).

   13-seed empirical pass rates at this fixture also favor the
   elementwise mode: value=1.0 62%, norm=1.0 46%, value=5.0 33%,
   value=0.5 77%. Cheaper default = higher pass rate, no
   tradeoff. (See PR #5498 / staging-2#119 rounds A-AT.)

2. max_steps=7 was below the convergence horizon at every clip
   tested. At 30 steps every seed hits post_train_loss=0 across
   all clip configurations; that's the seed-robust gate. Bump
   max_steps 7 -> 30, tighten the memorisation gate from
   post_loss < 1.0 to post_loss < 0.1.

3. Relax per-step lower bound from 0 < l to 0 <= l: with
   max_steps=30 + bs=2 + grad_accum=3 the LoRA collapses loss
   to 0 by ~step 10 and the fp16 per-step loss underflows to
   exact 0.0 from then on. That's the success signal, not a bug.

Keeps the e7ec2f5 EXPECT_IN_OUTPUT demotion-to-warning and the
e734764 reload teacher-forced-loss round-trip invariant -- those
are the right gates regardless of the clip / steps choice.

* tests/studio: hard gate via teacher-forced completion loss

The prior "soft warn + metric" was a step back from the original
hard assert: regressions could land silently if greedy decode
happened to pass on seed=3407 but post_train_loss diverged.
A true hard gate is needed.

Greedy decode is empirically fragile -- a 47-round, 13-seed sweep
on this fixture (see danielhanchen#119) showed
contains-Unsloth lands in 46-77% across MLX clip configs even
when post_train_loss is zero, because fp16 noise on the first
generated token after PROMPT perturbs the argmax. Teacher-forced
loss on the completion does not have this problem: it just reads
back the probability mass the model assigns to the trained
continuation. In every config where post_train_loss < 0.1, the
completion loss is essentially zero.

Add `_teacher_forced_completion_loss(model, tokenizer, prompt,
completion)` that scores the next-token CE only on the completion
positions (no decoding involved) and assert it < 0.5. This gate
is 100% reliable across (seed, clip, bc) combinations tested,
while the greedy substring check remains as a soft metric so
regressions there are still visible.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
danielhanchen and others added 3 commits May 18, 2026 11:33
Static AST check that fails fast when a producer in unsloth_zoo (or
unsloth) changes the arity of a callback but a consumer callback def
still declares the old arity. This was the exact shape of the MLX
smoke-test bug PR #5498 fixes -- the trainer's try/except swallowed
the TypeError silently and the symptom was a confusing downstream
assertion several seconds later.

What the detector does:
  * Producer side: walks every .py and finds classes that own a
    self._<name>_callbacks list, populated via .append() from an
    add_<name>_callback method, and invoked via
    `for cb in self._<name>_callbacks: cb(arg1, ..., argN)`. The
    arity at the call site is the canonical expected arity.
  * Consumer side: walks every <obj>.add_<name>_callback(fn) call,
    resolves fn to a def or lambda in the same file, and asserts
    arity matches. Consumers that use *args or **kwargs are
    tolerantly accepted as any arity.
  * Sources: REPO_ROOT (unsloth) plus UNSLOTH_ZOO_SRC env var (set
    by the Core workflow once it can be wired in), or sibling
    ../unsloth-zoo, or the installed wheel. Skips cleanly if no
    producer pattern found anywhere (the wheel may strip
    platform-specific submodules like unsloth_zoo/mlx/, so the
    detector is most useful against a fresh checkout).

Validated end-to-end:
  * Reverted run_real_mlx_smoke.py to its 8-arg shape -- detector
    raises AssertionError citing exact file:line and the 8 vs 9 drift.
  * Restored the 9-arg shape -- detector PASSes.
  * Total runtime ~7 s in pytest.

Suggested CI wiring (workflow file change held out of this commit
because the pushing PAT lacks `workflow` scope; safe to apply via
the GitHub web editor or a maintainer push):

```yaml
- name: callback signature drift detector (HARD GATE)
  env:
    UNSLOTH_ZOO_SRC: ${{ runner.temp }}/unsloth-zoo
  run: |
    python -m pytest -v --tb=short tests/test_callback_signature_drift.py
```

Drop the step into .github/workflows/consolidated-tests-ci.yml right
after the existing public-api drift detector step. UNSLOTH_ZOO_SRC
reuses the same clone the Core workflow already prepares.
Drops a 6-line pytest step right after the public-api drift detector,
with UNSLOTH_ZOO_SRC pointed at the freshly cloned $RUNNER_TEMP/unsloth-zoo
so the detector sees unsloth_zoo/mlx/ (the wheel strips it).

Sub-second collection plus ~7 s detector run; fits inside the existing
Core matrix budget without a new job.
@danielhanchen danielhanchen force-pushed the fix-mlx-smoke-step-callback-signature branch from a778f76 to 8f8e33d Compare May 18, 2026 11:34
@danielhanchen danielhanchen changed the title tests/studio: match MLX trainer step callback's 9-arg shape tests + CI: callback signature drift detector May 18, 2026
@danielhanchen danielhanchen merged commit d774af2 into main May 18, 2026
5 of 15 checks passed
@danielhanchen danielhanchen deleted the fix-mlx-smoke-step-callback-signature branch May 18, 2026 11:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant