Skip to content

tests: CPU regression detectors for the MoE merge / save path (#5410)#655

Merged
danielhanchen merged 2 commits into
mainfrom
ci-pr5410-detectors-redo
May 17, 2026
Merged

tests: CPU regression detectors for the MoE merge / save path (#5410)#655
danielhanchen merged 2 commits into
mainfrom
ci-pr5410-detectors-redo

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Summary

Re-submission of #649 against main. The original PR was stacked on top of fix-5410-moe-merge-layout (the #647 branch). When #647 got squash-merged into main, GitHub auto-closed #649 and marked it merged, but the squash only included #647's commits, so the four new test files never reached main. This PR re-applies the same content cleanly off the current main.

What is missing on main today

$ git ls-tree origin/main tests/ | grep -E "moe_merge_e2e|paramwrapper_layout_drift|transformers_moe_structure"
NOT_PRESENT

The unsloth-zoo CI matrix runs no MoE-merge regression tests in any drift cell. This PR fixes that.

What this PR adds (identical to closed #649)

change what regresses if missing
Add tests/test_unsloth_zoo_lora_merge.py to the drift-matrix pytest invocation in consolidated-tests-ci.yml The 22 helper-level merge tests (already present on main after #647) only run in the default cell. Without this line, transformers-5 / peft-0.19 combinations never exercise the merge math.
New tests/test_peft_paramwrapper_layout_drift.py A future PEFT release that flips the 3D-Parameter LoRA layout a third time goes silent.
New tests/test_transformers_moe_structure_drift.py A transformers refactor that flips Qwen3MoE / Mixtral between fused 3D and ModuleList goes silent.
New tests/test_moe_merge_e2e_cpu.py Any internal regression that weakens the loud-fail guard or the resolver walk goes silent. Fires at 1e-4 fp32 tolerance, much stronger than the bf16 / 1e-1 tolerance of the helper-level tests.

Verification

Local against unsloth-zoo:main HEAD (7b90fec):

$ pytest tests/test_unsloth_zoo_lora_merge.py \
         tests/test_peft_paramwrapper_layout_drift.py \
         tests/test_transformers_moe_structure_drift.py \
         tests/test_moe_merge_e2e_cpu.py -q
33 passed in 0.28s

Three isolated uv venv matrix points run today against merged main:

venv peft transformers pytest sim cases
peft018_tfm550 0.18.1 5.5.0 22 / 22 PASS 52 / 52 PASS
peft019_tfm550 0.19.1 5.5.0 22 / 22 PASS 52 / 52 PASS
peft019_tfm4576 0.19.1 4.57.6 22 / 22 PASS 52 / 52 PASS

Confirmed the three new test classes fail with ImportError (_MOE_MERGE_STATE, _detect_moe_lora_layout, _resolve_num_experts_from_lora_stats missing) on pre-#647 saving_utils.py, proving the detectors trigger if a future PR weakens or removes the guards.

Cost

Pure CPU, no model download, no GPU. ~33 extra tests, well under 2 seconds per matrix cell.

Related

Re-submitting unsloth-zoo#649 against main. The original PR was stacked
on fix-5410-moe-merge-layout (the #647 branch) and got marked merged
when #647 squash-merged, but the squash did not include the four new
files, so they never reached main.

Adds four CPU-only signals so the upstream drift matrix catches the
class of bug that produced unslothai/unsloth#5410 before it ships:

1. tests/test_unsloth_zoo_lora_merge.py is now in the drift-matrix
   pytest invocation in consolidated-tests-ci.yml. 22 helper-level
   merge tests already cover PEFT 0.18 swapped + PEFT 0.19+ standard
   layouts, the fallback counter, and the resolver walk; they were
   only ever exercised in the default cell. Now they fire across all
   three matrix combos.

2. tests/test_peft_paramwrapper_layout_drift.py builds a 4-line
   nn.Module with a fused 3D nn.Parameter, attaches a real peft LoRA
   via target_parameters, and asserts the lora_A / lora_B shapes match
   either the swapped or standard signature. A third layout convention
   from a future PEFT release fires this test on the next matrix run.

3. tests/test_transformers_moe_structure_drift.py imports each tracked
   MoE block (Qwen3MoeSparseMoeBlock, MixtralSparseMoeBlock) from the
   installed transformers, instantiates a tiny config, and pins
   whether experts is fused 3D nn.Parameter or per-expert nn.ModuleList.
   The same kind of refactor that flipped Qwen3MoE between 4.57.x and
   5.x will fire this test on the next refactor.

4. tests/test_moe_merge_e2e_cpu.py drives the per-expert merge helpers
   in the same order the inner shard loop does, for a synthetic
   2-layer / 4-expert state dict, in both PEFT layouts. Asserts all
   3 * E * L tensors receive the analytic delta, the _MOE_MERGE_STATE
   counter records each apply, fallbacks land on unrecognised shapes,
   the resolver walks the outer ParamWrapper chain, and a self-cycle
   in module.base_layer terminates.

All four files are pure CPU, sub-second, no model download. Total
matrix overhead: <2s per cell, three cells, 33 tests.

Local: 33 / 33 passed in 0.28s. Confirmed the three new test classes
fail with ImportError on pre-#647 saving_utils, proving the detectors
trigger if a future PR weakens or removes the #5410 guards.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 675cc82c0b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +108 to +112
assert gu.dim() == 3 and gu.shape[0] == cfg.num_experts, (
f"fused gate_up_proj shape {tuple(gu.shape)} on {block_cls_name}; expected (E, 2I, H)."
)
elif kind == "module_list":
assert len(experts) == cfg.num_experts, (
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Compare against Mixtral's actual expert count

When this parametrized test reaches Mixtral, _tiny_config_kwargs() sets num_experts=2 but never sets Mixtral's num_local_experts, which is the field MixtralSparseMoeBlock uses to build its experts (defaulting to 8). As a result the newly hard-gated workflow fails for the Mixtral case with gate_up_proj.shape[0] == 8 or len(experts) == 8 while cfg.num_experts == 2; use the architecture-specific expert-count attribute/fallback instead of assuming cfg.num_experts.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces three new test files to validate Mixture-of-Experts (MoE) merging logic and detect potential layout or structural drifts in the PEFT and Transformers libraries. The feedback focuses on improving code readability and maintainability by adhering to PEP 8 guidelines, specifically recommending against the use of compound statements and one-liner conditionals. Additionally, a suggestion was made to simplify function logic by removing a redundant return statement.

def _analytic_gate_up_delta(A, B, alpha, expert_idx, num_experts, role, layout, I, H):
r = A.shape[0] // num_experts
s, e = expert_idx * r, (expert_idx + 1) * r
a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using semicolons to place multiple statements on a single line. Following PEP 8 guidelines by splitting these into separate lines improves code readability and maintainability.

    a = A[s:e].to(torch.float64)
    b = B[:, s:e].to(torch.float64)
References
  1. Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)

def _analytic_down_delta(A, B, alpha, expert_idx, num_experts, layout):
r = A.shape[0] // num_experts
s, e = expert_idx * r, (expert_idx + 1) * r
a = A[s:e].to(torch.float64); b = B[:, s:e].to(torch.float64)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.

    a = A[s:e].to(torch.float64)
    b = B[:, s:e].to(torch.float64)
References
  1. Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)


for out, ref in ((gate_out, gate_ref), (up_out, up_ref), (down_out, down_ref)):
err = (out.cpu() - ref.cpu()).abs().max().item()
if err > max_err: max_err = err
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the max() function is more idiomatic and readable than a conditional assignment for updating a maximum value. Additionally, compound one-liner if statements are discouraged by PEP 8.

Suggested change
if err > max_err: max_err = err
max_err = max(max_err, err)
References
  1. Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)

num_experts, rank_per, intermediate, hidden = 4, 4, 8, 12
TR = num_experts * rank_per
W = torch.randn(intermediate, hidden)
A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using semicolons to place multiple statements on a single line, as per PEP 8 guidelines.

Suggested change
A = torch.randn(TR, hidden + 7); B = torch.randn(hidden, TR)
A = torch.randn(TR, hidden + 7)
B = torch.randn(hidden, TR)
References
  1. Compound statements (multiple statements on the same line) are generally discouraged in PEP 8. (link)

Comment on lines +54 to +55
return f"other:{type(experts).__name__}"
return f"other:{type(experts).__name__}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The return statement at line 54 is redundant because it is identical to the fallback return at line 55. Removing the nested return simplifies the function structure.

Suggested change
return f"other:{type(experts).__name__}"
return f"other:{type(experts).__name__}"
return f"other:{type(experts).__name__}"

from __future__ import annotations

import pytest
import torch
MixtralConfig accepts num_experts as a stray kwarg but
MixtralSparseMoeBlock builds experts from num_local_experts
(default 8). Without setting it, the module_list assert hit
len=8 vs num_experts=2.
@danielhanchen danielhanchen merged commit adc76ae into main May 17, 2026
15 checks passed
Brishen pushed a commit to Brishen/unsloth-zoo that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant