tests: CPU regression detectors for the MoE merge / save path (#5410)#655
Conversation
Re-submitting unsloth-zoo#649 against main. The original PR was stacked on fix-5410-moe-merge-layout (the #647 branch) and got marked merged when #647 squash-merged, but the squash did not include the four new files, so they never reached main. Adds four CPU-only signals so the upstream drift matrix catches the class of bug that produced unslothai/unsloth#5410 before it ships: 1. tests/test_unsloth_zoo_lora_merge.py is now in the drift-matrix pytest invocation in consolidated-tests-ci.yml. 22 helper-level merge tests already cover PEFT 0.18 swapped + PEFT 0.19+ standard layouts, the fallback counter, and the resolver walk; they were only ever exercised in the default cell. Now they fire across all three matrix combos. 2. tests/test_peft_paramwrapper_layout_drift.py builds a 4-line nn.Module with a fused 3D nn.Parameter, attaches a real peft LoRA via target_parameters, and asserts the lora_A / lora_B shapes match either the swapped or standard signature. A third layout convention from a future PEFT release fires this test on the next matrix run. 3. tests/test_transformers_moe_structure_drift.py imports each tracked MoE block (Qwen3MoeSparseMoeBlock, MixtralSparseMoeBlock) from the installed transformers, instantiates a tiny config, and pins whether experts is fused 3D nn.Parameter or per-expert nn.ModuleList. The same kind of refactor that flipped Qwen3MoE between 4.57.x and 5.x will fire this test on the next refactor. 4. tests/test_moe_merge_e2e_cpu.py drives the per-expert merge helpers in the same order the inner shard loop does, for a synthetic 2-layer / 4-expert state dict, in both PEFT layouts. Asserts all 3 * E * L tensors receive the analytic delta, the _MOE_MERGE_STATE counter records each apply, fallbacks land on unrecognised shapes, the resolver walks the outer ParamWrapper chain, and a self-cycle in module.base_layer terminates. All four files are pure CPU, sub-second, no model download. Total matrix overhead: <2s per cell, three cells, 33 tests. Local: 33 / 33 passed in 0.28s. Confirmed the three new test classes fail with ImportError on pre-#647 saving_utils, proving the detectors trigger if a future PR weakens or removes the #5410 guards.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 675cc82c0b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| assert gu.dim() == 3 and gu.shape[0] == cfg.num_experts, ( | ||
| f"fused gate_up_proj shape {tuple(gu.shape)} on {block_cls_name}; expected (E, 2I, H)." | ||
| ) | ||
| elif kind == "module_list": | ||
| assert len(experts) == cfg.num_experts, ( |
There was a problem hiding this comment.
Compare against Mixtral's actual expert count
When this parametrized test reaches Mixtral, _tiny_config_kwargs() sets num_experts=2 but never sets Mixtral's num_local_experts, which is the field MixtralSparseMoeBlock uses to build its experts (defaulting to 8). As a result the newly hard-gated workflow fails for the Mixtral case with gate_up_proj.shape[0] == 8 or len(experts) == 8 while cfg.num_experts == 2; use the architecture-specific expert-count attribute/fallback instead of assuming cfg.num_experts.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces three new test files to validate Mixture-of-Experts (MoE) merging logic and detect potential layout or structural drifts in the PEFT and Transformers libraries. The feedback focuses on improving code readability and maintainability by adhering to PEP 8 guidelines, specifically recommending against the use of compound statements and one-liner conditionals. Additionally, a suggestion was made to simplify function logic by removing a redundant return statement.
| def _analytic_gate_up_delta(A, B, alpha, expert_idx, num_experts, role, layout, I, H): | ||
| r = A.shape[0] // num_experts | ||
| s, e = expert_idx * r, (expert_idx + 1) * r | ||
| a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64) |
There was a problem hiding this comment.
Avoid using semicolons to place multiple statements on a single line. Following PEP 8 guidelines by splitting these into separate lines improves code readability and maintainability.
a = A[s:e].to(torch.float64)
b = B[:, s:e].to(torch.float64)References
- Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)
| def _analytic_down_delta(A, B, alpha, expert_idx, num_experts, layout): | ||
| r = A.shape[0] // num_experts | ||
| s, e = expert_idx * r, (expert_idx + 1) * r | ||
| a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64) |
There was a problem hiding this comment.
Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.
a = A[s:e].to(torch.float64)
b = B[:, s:e].to(torch.float64)References
- Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)
|
|
||
| for out, ref in ((gate_out, gate_ref), (up_out, up_ref), (down_out, down_ref)): | ||
| err = (out.cpu() - ref.cpu()).abs().max().item() | ||
| if err > max_err: max_err = err |
There was a problem hiding this comment.
Using the max() function is more idiomatic and readable than a conditional assignment for updating a maximum value. Additionally, compound one-liner if statements are discouraged by PEP 8.
| if err > max_err: max_err = err | |
| max_err = max(max_err, err) |
References
- Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)
| num_experts, rank_per, intermediate, hidden = 4, 4, 8, 12 | ||
| TR = num_experts * rank_per | ||
| W = torch.randn(intermediate, hidden) | ||
| A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR) |
There was a problem hiding this comment.
Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.
| A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR) | |
| A = torch.randn(TR, hidden + 7) | |
| B = torch.randn(hidden, TR) |
References
- Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)
| return f"other:{type(experts).__name__}" | ||
| return f"other:{type(experts).__name__}" |
There was a problem hiding this comment.
The return statement at line 54 is redundant because it is identical to the fallback return at line 55. Removing the nested return simplifies the function structure.
| return f"other:{type(experts).__name__}" | |
| return f"other:{type(experts).__name__}" | |
| return f"other:{type(experts).__name__}" |
| from __future__ import annotations | ||
|
|
||
| import pytest | ||
| import torch |
MixtralConfig accepts num_experts as a stray kwarg but MixtralSparseMoeBlock builds experts from num_local_experts (default 8). Without setting it, the module_list assert hit len=8 vs num_experts=2.
Summary
Re-submission of #649 against
main. The original PR was stacked on top offix-5410-moe-merge-layout(the #647 branch). When #647 got squash-merged intomain, GitHub auto-closed #649 and marked it merged, but the squash only included #647's commits, so the four new test files never reachedmain. This PR re-applies the same content cleanly off the currentmain.What is missing on
maintodayThe
unsloth-zooCI matrix runs no MoE-merge regression tests in any drift cell. This PR fixes that.What this PR adds (identical to closed #649)
tests/test_unsloth_zoo_lora_merge.pyto the drift-matrix pytest invocation inconsolidated-tests-ci.ymlmainafter #647) only run in the default cell. Without this line, transformers-5 / peft-0.19 combinations never exercise the merge math.tests/test_peft_paramwrapper_layout_drift.pytests/test_transformers_moe_structure_drift.pytests/test_moe_merge_e2e_cpu.py1e-4fp32 tolerance, much stronger than the bf16 /1e-1tolerance of the helper-level tests.Verification
Local against
unsloth-zoo:mainHEAD (7b90fec):Three isolated
uv venvmatrix points run today against mergedmain:peft018_tfm550peft019_tfm550peft019_tfm4576Confirmed the three new test classes fail with
ImportError(_MOE_MERGE_STATE,_detect_moe_lora_layout,_resolve_num_experts_from_lora_statsmissing) on pre-#647saving_utils.py, proving the detectors trigger if a future PR weakens or removes the guards.Cost
Pure CPU, no model download, no GPU. ~33 extra tests, well under 2 seconds per matrix cell.
Related
faee224)