tests: CPU regression detectors for the MoE merge / save path (#5410) by danielhanchen · Pull Request #655 · unslothai/unsloth-zoo

danielhanchen · 2026-05-16T07:13:42Z

Summary

Re-submission of #649 against main. The original PR was stacked on top of fix-5410-moe-merge-layout (the #647 branch). When #647 got squash-merged into main, GitHub auto-closed #649 and marked it merged, but the squash only included #647's commits, so the four new test files never reached main. This PR re-applies the same content cleanly off the current main.

What is missing on `main` today

$ git ls-tree origin/main tests/ | grep -E "moe_merge_e2e|paramwrapper_layout_drift|transformers_moe_structure"
NOT_PRESENT

The unsloth-zoo CI matrix runs no MoE-merge regression tests in any drift cell. This PR fixes that.

What this PR adds (identical to closed #649)

| Change | What regresses if missing |
| --- | --- |
| Add `tests/test_unsloth_zoo_lora_merge.py` to the drift-matrix pytest invocation in `consolidated-tests-ci.yml` | The 22 helper-level merge tests (already present on `main` after #647) only run in the default cell. Without this line, transformers-5 / peft-0.19 combinations never exercise the merge math. |
| New `tests/test_peft_paramwrapper_layout_drift.py` | A future PEFT release that flips the 3D-Parameter LoRA layout a third time goes silent. |
| New `tests/test_transformers_moe_structure_drift.py` | A transformers refactor that flips Qwen3MoE / Mixtral between fused 3D and ModuleList goes silent. |
| New `tests/test_moe_merge_e2e_cpu.py` | Any internal regression that weakens the loud-fail guard or the resolver walk goes silent. Fires at `1e-4` fp32 tolerance, much stronger than the bf16 / `1e-1` tolerance of the helper-level tests. |
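
The layout-drift and loud-fail rows above both come down to classifying a LoRA shape pair and refusing to guess when it matches nothing known. A minimal sketch of that idea follows. The function name and the exact shape conventions for "standard" and "swapped" are assumptions made for illustration; they are not taken from PEFT or from `saving_utils.py`.

```python
def classify_lora_layout(a_shape, b_shape, num_experts, in_features, out_features):
    """Return 'standard', 'swapped', or 'unrecognised' for a fused-expert LoRA pair.

    Assumed conventions (illustrative only):
      standard: lora_A is (E*r, in),  lora_B is (out, E*r)
      swapped:  lora_A is (E*r, out), lora_B is (in,  E*r)
    """
    if len(a_shape) != 2 or len(b_shape) != 2:
        return "unrecognised"
    total_rank = a_shape[0]
    # The stacked rank must split evenly across experts and agree between A and B.
    if total_rank == 0 or total_rank % num_experts != 0 or b_shape[1] != total_rank:
        return "unrecognised"
    if a_shape[1] == in_features and b_shape[0] == out_features:
        return "standard"
    if a_shape[1] == out_features and b_shape[0] == in_features:
        return "swapped"
    # Anything else should be counted as a fallback, not merged silently.
    return "unrecognised"
```

With 4 experts of rank 4, `in_features=12` and `out_features=8`, a pair shaped `(16, 19)` / `(12, 16)` matches neither convention and is reported as unrecognised. When `in_features == out_features` the two conventions cannot be told apart by shape alone, and this sketch reports "standard".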

Verification

Local against unsloth-zoo:main HEAD (7b90fec):

$ pytest tests/test_unsloth_zoo_lora_merge.py \
         tests/test_peft_paramwrapper_layout_drift.py \
         tests/test_transformers_moe_structure_drift.py \
         tests/test_moe_merge_e2e_cpu.py -q
33 passed in 0.28s

Three isolated uv venv matrix points run today against merged main:

| venv | peft | transformers | pytest | sim cases |
| --- | --- | --- | --- | --- |
| `peft018_tfm550` | 0.18.1 | 5.5.0 | 22 / 22 PASS | 52 / 52 PASS |
| `peft019_tfm550` | 0.19.1 | 5.5.0 | 22 / 22 PASS | 52 / 52 PASS |
| `peft019_tfm4576` | 0.19.1 | 4.57.6 | 22 / 22 PASS | 52 / 52 PASS |

Confirmed the three new test classes fail with ImportError (_MOE_MERGE_STATE, _detect_moe_lora_layout, _resolve_num_experts_from_lora_stats missing) on pre-#647 saving_utils.py, proving the detectors trigger if a future PR weakens or removes the guards.
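
The ImportError behaviour described above works because the tests import the guard symbols by name. A rough sketch of the same canary, written as an explicit check, is shown below. The three symbol names come from the PR text; the helper names and the import path `unsloth_zoo.saving_utils` are assumptions, and the PR's own tests may be structured differently.

```python
import importlib

# Guard names quoted in the PR description.
PINNED_SYMBOLS = (
    "_MOE_MERGE_STATE",
    "_detect_moe_lora_layout",
    "_resolve_num_experts_from_lora_stats",
)


def missing_symbols(module_name, names):
    """Return the entries of `names` that `module_name` does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def check_saving_utils_guards():
    # "unsloth_zoo.saving_utils" is an assumed import path for saving_utils.py.
    missing = missing_symbols("unsloth_zoo.saving_utils", PINNED_SYMBOLS)
    assert not missing, f"MoE merge guards removed or renamed: {missing}"
```

Reporting the missing names in the assertion message makes the failure easier to read than a bare ImportError, at the cost of one extra helper.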

Cost

Pure CPU, no model download, no GPU. ~33 extra tests, well under 2 seconds per matrix cell.

Related

Fix landed at #647, "saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410)" (faee224)
Unsloth-side canary tracking the same guards at unsloth#5433, "tests: pinned-symbol canary for unsloth-zoo save_pretrained_merged guards (#5410)"
This re-submission supersedes #649, "tests: CPU regression detectors for the MoE merge / save path (#5410)" (orphaned by the squash)

Re-submitting unsloth-zoo#649 against main. The original PR was stacked on fix-5410-moe-merge-layout (the #647 branch) and got marked merged when #647 squash-merged, but the squash did not include the four new files, so they never reached main.

Adds four CPU-only signals so the upstream drift matrix catches the class of bug that produced unslothai/unsloth#5410 before it ships:

1. tests/test_unsloth_zoo_lora_merge.py is now in the drift-matrix pytest invocation in consolidated-tests-ci.yml. 22 helper-level merge tests already cover PEFT 0.18 swapped + PEFT 0.19+ standard layouts, the fallback counter, and the resolver walk; they were only ever exercised in the default cell. Now they fire across all three matrix combos.
2. tests/test_peft_paramwrapper_layout_drift.py builds a 4-line nn.Module with a fused 3D nn.Parameter, attaches a real peft LoRA via target_parameters, and asserts the lora_A / lora_B shapes match either the swapped or standard signature. A third layout convention from a future PEFT release fires this test on the next matrix run.
3. tests/test_transformers_moe_structure_drift.py imports each tracked MoE block (Qwen3MoeSparseMoeBlock, MixtralSparseMoeBlock) from the installed transformers, instantiates a tiny config, and pins whether experts is fused 3D nn.Parameter or per-expert nn.ModuleList. The same kind of refactor that flipped Qwen3MoE between 4.57.x and 5.x will fire this test on the next refactor.
4. tests/test_moe_merge_e2e_cpu.py drives the per-expert merge helpers in the same order the inner shard loop does, for a synthetic 2-layer / 4-expert state dict, in both PEFT layouts. Asserts all 3 * E * L tensors receive the analytic delta, the _MOE_MERGE_STATE counter records each apply, fallbacks land on unrecognised shapes, the resolver walks the outer ParamWrapper chain, and a self-cycle in module.base_layer terminates.

All four files are pure CPU, sub-second, no model download. Total matrix overhead: <2s per cell, three cells, 33 tests.
Local: 33 / 33 passed in 0.28s. Confirmed the three new test classes fail with ImportError on pre-#647 saving_utils, proving the detectors trigger if a future PR weakens or removes the #5410 guards.
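
The resolver behaviour mentioned in the commit message (walking the wrapper chain through `base_layer` and terminating on a self-cycle) can be sketched without torch. This is only an illustration under stated assumptions: `walk_base_layers`, `resolve_num_experts` and the `max_depth` cap are hypothetical names, and the real resolver in `saving_utils.py` may look different.

```python
from types import SimpleNamespace


def walk_base_layers(module, max_depth=64):
    """Yield module, module.base_layer, ... stopping on a cycle or at max_depth."""
    seen = set()
    current = module
    depth = 0
    while current is not None and id(current) not in seen and depth < max_depth:
        seen.add(id(current))
        yield current
        current = getattr(current, "base_layer", None)
        depth += 1


def resolve_num_experts(module, default=None):
    """Return the first positive integer `num_experts` found along the chain."""
    for layer in walk_base_layers(module):
        value = getattr(layer, "num_experts", None)
        if isinstance(value, int) and value > 0:
            return value
    return default


# Stand-ins for a wrapper chain and for a module whose base_layer points at itself.
wrapped = SimpleNamespace(base_layer=SimpleNamespace(base_layer=SimpleNamespace(num_experts=4)))
looped = SimpleNamespace()
looped.base_layer = looped
```

Tracking visited objects by `id` keeps the walk finite even when a wrapper refers back to itself, which is the failure mode the self-cycle test is meant to pin.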

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 675cc82c0b


chatgpt-codex-connector · 2026-05-16T07:15:53Z

+        assert gu.dim() == 3 and gu.shape[0] == cfg.num_experts, (
+            f"fused gate_up_proj shape {tuple(gu.shape)} on {block_cls_name}; expected (E, 2I, H)."
+        )
+    elif kind == "module_list":
+        assert len(experts) == cfg.num_experts, (


Compare against Mixtral's actual expert count

When this parametrized test reaches Mixtral, _tiny_config_kwargs() sets num_experts=2 but never sets Mixtral's num_local_experts, which is the field MixtralSparseMoeBlock uses to build its experts (defaulting to 8). As a result the newly hard-gated workflow fails for the Mixtral case with gate_up_proj.shape[0] == 8 or len(experts) == 8 while cfg.num_experts == 2; use the architecture-specific expert-count attribute/fallback instead of assuming cfg.num_experts.
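
One way to follow this suggestion is to read the expert count through an architecture-aware fallback instead of assuming `cfg.num_experts`. The sketch below is an assumption-laden illustration: `expected_num_experts` is a hypothetical helper and the configs are plain stand-in objects. The fix that landed in this PR took the other route and pinned `num_local_experts` in the tiny config.

```python
from types import SimpleNamespace


def expected_num_experts(cfg):
    """Prefer `num_local_experts` (Mixtral-style configs), else `num_experts`."""
    for attr in ("num_local_experts", "num_experts"):
        value = getattr(cfg, attr, None)
        if isinstance(value, int) and value > 0:
            return value
    raise AttributeError("config exposes neither num_local_experts nor num_experts")


# Stand-in configs: Mixtral keeps its default of 8 local experts even when a
# stray num_experts=2 kwarg is passed; the Qwen3MoE-like config only has num_experts.
mixtral_like = SimpleNamespace(num_experts=2, num_local_experts=8)
qwen_like = SimpleNamespace(num_experts=2)
```

Checking `num_local_experts` first matters because a stray `num_experts` attribute can coexist with it, as described in the comment above.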


gemini-code-assist

Code Review

This pull request introduces three new test files to validate Mixture-of-Experts (MoE) merging logic and detect potential layout or structural drifts in the PEFT and Transformers libraries. The feedback focuses on improving code readability and maintainability by adhering to PEP 8 guidelines, specifically recommending against the use of compound statements and one-liner conditionals. Additionally, a suggestion was made to simplify function logic by removing a redundant return statement.

gemini-code-assist · 2026-05-16T07:16:57Z

+def _analytic_gate_up_delta(A, B, alpha, expert_idx, num_experts, role, layout, I, H):
+    r = A.shape[0] // num_experts
+    s, e = expert_idx * r, (expert_idx + 1) * r
+    a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64)


Avoid using semicolons to place multiple statements on a single line. Following PEP 8 guidelines by splitting these into separate lines improves code readability and maintainability.

a = A[s:e].to(torch.float64)
b = B[:, s:e].to(torch.float64)

References

Compound statements (multiple statements on the same line) are generally discouraged in PEP 8.

gemini-code-assist · 2026-05-16T07:16:57Z

+def _analytic_down_delta(A, B, alpha, expert_idx, num_experts, layout):
+    r = A.shape[0] // num_experts
+    s, e = expert_idx * r, (expert_idx + 1) * r
+    a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64)


Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.

a = A[s:e].to(torch.float64)
b = B[:, s:e].to(torch.float64)

References

Compound statements (multiple statements on the same line) are generally discouraged in PEP 8.
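
For context on what the two helpers in these snippets compute: the visible lines slice one expert's rows out of `A` and the matching columns out of `B`. A torch-free sketch of the resulting per-expert delta is given below. It assumes the delta is `scale * B[:, s:e] @ A[s:e]` with `scale = alpha / r`; the scaling and the handling of `role` and `layout` are not shown in the diff, so both are assumptions here.

```python
def expert_delta(A, B, alpha, expert_idx, num_experts):
    """Per-expert LoRA delta on nested lists: scale * B[:, s:e] @ A[s:e].

    A has E*r rows and B has E*r columns. Python floats are float64,
    which mirrors the float64 cast in the snippets above.
    """
    total_rank = len(A)
    if total_rank % num_experts != 0:
        raise ValueError("stacked rank is not divisible by num_experts")
    r = total_rank // num_experts
    s, e = expert_idx * r, (expert_idx + 1) * r
    a = A[s:e]                    # (r, in): this expert's rows of A
    b = [row[s:e] for row in B]   # (out, r): this expert's columns of B
    scale = alpha / r             # assumed scaling, not taken from the PR
    in_features = len(a[0])
    return [
        [scale * sum(b_row[k] * a[k][j] for k in range(r)) for j in range(in_features)]
        for b_row in b
    ]
```

A reference computed this way can be compared element-wise against the merged weights, which is the role the analytic delta plays in the end-to-end test.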

gemini-code-assist · 2026-05-16T07:16:57Z

+
+            for out, ref in ((gate_out, gate_ref), (up_out, up_ref), (down_out, down_ref)):
+                err = (out.cpu() - ref.cpu()).abs().max().item()
+                if err > max_err: max_err = err


Using the max() function is more idiomatic and readable than a conditional assignment for updating a maximum value. Additionally, compound one-liner if statements are discouraged by PEP 8.

Suggested change

-                if err > max_err: max_err = err
+                max_err = max(max_err, err)

References

Compound statements (multiple statements on the same line) are generally discouraged in PEP 8.

gemini-code-assist · 2026-05-16T07:16:57Z

+    num_experts, rank_per, intermediate, hidden = 4, 4, 8, 12
+    TR = num_experts * rank_per
+    W = torch.randn(intermediate, hidden)
+    A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR)


Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.

Suggested change

-    A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR)
+    A = torch.randn(TR, hidden + 7)
+    B = torch.randn(hidden, TR)

References

Compound statements (multiple statements on the same line) are generally discouraged in PEP 8.

gemini-code-assist · 2026-05-16T07:16:57Z

+        return f"other:{type(experts).__name__}"
+    return f"other:{type(experts).__name__}"


The return statement at line 54 is redundant because it is identical to the fallback return at line 55. Removing the nested return simplifies the function structure.

Suggested change

-        return f"other:{type(experts).__name__}"
-    return f"other:{type(experts).__name__}"
+    return f"other:{type(experts).__name__}"


chatgpt-codex-connector Bot reviewed May 16, 2026

gemini-code-assist Bot reviewed May 16, 2026

github-code-quality Bot found potential problems May 16, 2026

Comment thread tests/test_transformers_moe_structure_drift.py

from __future__ import annotations

import pytest
import torch

tests: fix Mixtral drift canary by pinning num_local_experts

b7462ce

MixtralConfig accepts num_experts as a stray kwarg but MixtralSparseMoeBlock builds experts from num_local_experts (default 8). Without setting it, the module_list assert hit len=8 vs num_experts=2.

danielhanchen merged commit adc76ae into main May 17, 2026
15 checks passed

Brishen pushed a commit to Brishen/unsloth-zoo that referenced this pull request May 19, 2026

tests: CPU regression detectors for the MoE merge / save path (#5410) (unslothai#655)

60716db

danielhanchen merged 2 commits into main from ci-pr5410-detectors-redo
