saving: layout-aware MoE LoRA merge + loud-fail on fallback (#5410) by danielhanchen · Pull Request #647 · unslothai/unsloth-zoo

danielhanchen · 2026-05-15T02:47:21Z

Summary

save_pretrained_merged(..., save_method="merged_16bit") silently drops the entire MoE expert LoRA delta on Qwen3-MoE / Qwen3.5-MoE-style models when running on peft >= 0.19.1 + transformers >= 5.0 (reported in unslothai/unsloth#5410). Four issues stack up:

- The per-expert helpers in unsloth_zoo/saving_utils.py hardcode the PEFT 0.18 "swapped" layout (lora_A: (E*r, 2I), lora_B: (H, E*r) for fused gate_up_proj; lora_A: (E*r, H), lora_B: (I, E*r) for fused down_proj). PEFT 0.19+ swaps in/out features for non-transposed 3D parameters (huggingface/peft#2521, "MNT: Pin GitHub action hashes for security") and produces the opposite shapes.
- On layout mismatch an addmm shape error was swallowed by a bare try / except Exception: return W (and the dim-heuristic in the fused helpers fell through to return W), so the merge wrote the unmodified base weight and reported success.
- On the dense _merge_and_overwrite_lora flow the per-expert merge loop's num_experts came from the shard-local key scan, which can be a non-divisor of total_rank whenever experts are split across multiple safetensor shards (16 / 17 in some shards of the 128-expert Qwen3-30B-A3B layout).
- The merged dir was missing generation_config.json, so chat-tuned models reloaded with default eos / sampling and ran past EOS.
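The shard-local num_experts problem can be illustrated with plain arithmetic. This is only a sketch: `per_expert_rank` is a hypothetical helper, not a function from saving_utils.py. The values 128 experts and r=32 are taken from the end-to-end run in this PR, and 16 / 17 are the shard-local counts it reports.

```python
# Illustrative numbers: 128 experts and r=32 come from the end-to-end run in
# this PR; 16 and 17 are the shard-local expert counts it reports.
TOTAL_EXPERTS = 128
RANK = 32
total_rank = TOTAL_EXPERTS * RANK  # first dim of the stacked lora_A, i.e. E * r


def per_expert_rank(total_rank, num_experts):
    """Hypothetical helper: per-expert rank, or None if num_experts does not divide total_rank."""
    if num_experts <= 0 or total_rank % num_experts != 0:
        return None
    return total_rank // num_experts


for seen in (TOTAL_EXPERTS, 16, 17):
    print(seen, per_expert_rank(total_rank, seen))
```

With these numbers a count of 17 does not divide 4096 at all, while a count of 16 divides it but implies a per-expert rank of 256 instead of 32. A divisibility check alone would therefore not catch every wrong shard-local count, which is consistent with the PR reading num_experts from the wrapped module instead.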

Fix

- _detect_moe_lora_layout(lora_A, lora_B, num_experts, out_dim, in_dim) classifies the layout by shape against the per-expert on-disk weight. No version sniffing, so it works on transformers 4.57.x / 5.x and peft 0.18.x / 0.19.x.
- _merge_moe_gate_or_up_expert and _merge_moe_down_proj_expert branch on the detected layout. The PEFT 0.18 "swapped" path is byte-identical to the previous behaviour.
- _resolve_num_experts_from_lora_stats(lora_stats, fallback) walks module -> base_layer -> ... to read the authoritative num_experts off the wrapped MoE module (Qwen3MoeExperts and similar). _merge_and_overwrite_lora calls it for every MoE LoRA in converted_lora_weights and overrides the shard-local value in moe_num_experts before the per-expert loop runs.
- _MOE_MERGE_STATE tracks (attempted, applied, fallback, first_error). Helpers record fallbacks with role / expert / shapes / reason on any unrecognised layout or exception. After the shard loop merge_and_overwrite_lora raises RuntimeError if any fallback fired, so a partially merged checkpoint can no longer be silently written. On success it prints applied/attempted.
- The merged_16bit branch also calls model.generation_config.save_pretrained(save_directory) (best-effort, matching the same pattern as fix_tokenizer_config_json).
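Shape-based layout classification can be sketched as follows. This is not the PR's _detect_moe_lora_layout: `detect_layout` is a hypothetical stand-in that works on shape tuples, and the way it handles the ambiguous out_dim == in_dim case (it reports "standard" first) is an arbitrary choice of this sketch, not something the PR states.

```python
from typing import Optional, Tuple

Shape = Tuple[int, int]


def detect_layout(a_shape: Shape, b_shape: Shape, num_experts: int,
                  out_dim: int, in_dim: int) -> Optional[str]:
    """Classify stacked MoE LoRA shapes against a per-expert (out_dim, in_dim) weight.

    Hypothetical sketch; returns "standard", "swapped", or None when the
    shapes fit neither convention.
    """
    total_rank = a_shape[0]
    if num_experts <= 0 or total_rank % num_experts != 0:
        return None  # num_experts must divide E * r
    if b_shape[1] != total_rank:
        return None  # lora_B's second dim must also be E * r
    # "standard" (PEFT 0.19+): lora_A is (E*r, in_dim), lora_B is (out_dim, E*r)
    if a_shape[1] == in_dim and b_shape[0] == out_dim:
        return "standard"
    # "swapped" (PEFT 0.18): lora_A is (E*r, out_dim), lora_B is (in_dim, E*r)
    if a_shape[1] == out_dim and b_shape[0] == in_dim:
        return "swapped"
    return None


# down_proj with E=4, r=2, H=8, I=6: the per-expert weight is (H, I) = (8, 6).
print(detect_layout((8, 6), (8, 8), 4, out_dim=8, in_dim=6))  # PEFT 0.19+ shapes
print(detect_layout((8, 8), (6, 8), 4, out_dim=8, in_dim=6))  # PEFT 0.18 shapes
```

Returning None rather than guessing is what lets a caller record a fallback and fail loudly instead of writing the base weight.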

Tests

tests/test_unsloth_zoo_lora_merge.py now covers:

- The existing 16 cases (PEFT 0.18 swapped layout, per-expert + fused + dense) still pass byte-for-byte.
- 6 new cases:
  - PEFT 0.19+ "standard" layout for _merge_moe_gate_expert, _merge_moe_up_expert, _merge_moe_down_proj_expert.
  - _detect_moe_lora_layout for swapped, standard, mismatched shapes, and non-divisor num_experts.
  - The fallback counter increments and first_error populates on unrecognised shapes.
  - _resolve_num_experts_from_lora_stats walks the base_layer chain (covers the inner ParamWrapper case for mlp.experts.down_proj where the outer ParamWrapper has module = None).

End-to-end verification

Full Qwen3-30B-A3B (128 experts x 48 layers, fused 3D in memory, per-expert 2D on disk): load -> attach LoRA (r=32, alpha=64, target_modules=[q,k,v,o,gate,up,down]_proj, lora_dropout=0) -> 5 SFT steps -> save_pretrained_merged(merged_16bit) -> reload -> compare merged-reload logits to the trained in-memory model on a fixed eval batch.

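| transformers | peft | trl | merged tensors | trained vs merged KL | samples | merged vs base KL | base vs trained KL |
|--------------|--------|--------|----------------|----------------------|---------|-------------------|--------------------|
| 5.5.0 | 0.19.1 | 0.25.1 | 18432 / 18432 | 1.6e-5 | 3 / 3 | 17.81 | 17.85 |
| 5.5.0 | 0.18.1 | 0.25.1 | 18432 / 18432 | 1.3e-5 | 3 / 3 | 18.91 | 19.05 |
| 4.57.6 | 0.19.1 | 0.25.1 | dense path | 5.5e-5 | 3 / 3 | 15.93 | 16.02 |
| 5.5.0 | 0.19.1 | 1.4.0 | 18432 / 18432 | 2.1e-4 | 3 / 3 | 15.41 | 15.31 |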

Reading: KL = O(1e-4) between merged-reload and trained model is bf16 noise; samples = 3 / 3 means all 3 greedy generations on held-out prompts match exactly; merged vs base KL approx base vs trained KL confirms the full training delta is baked into the saved merged dir.
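For readers who want to reproduce this kind of comparison, a KL divergence between two logit vectors can be computed as in the sketch below. The PR does not say which direction of KL it reports or how values are averaged over tokens, so the direction KL(P || Q) and the single-position scope here are assumptions; `log_softmax` and `kl_divergence` are illustrative helpers in pure Python.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits: Sequence[float], q_logits: Sequence[float]) -> float:
    """KL(P || Q) for two logit vectors over the same vocabulary (assumed direction)."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


trained = [2.0, 0.5, -1.0]
merged = [2.001, 0.499, -1.0]   # tiny perturbation, standing in for rounding noise
base = [0.0, 1.5, 0.3]          # a clearly different distribution

print(kl_divergence(trained, merged))
print(kl_divergence(trained, base))
```

In a real check the per-position values would be averaged over every token of the fixed eval batch.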

Before this patch the first row was KL = 1.86, samples = 1 / 3, 0 / 18432 expert LoRA deltas applied.

Notes:

transformers 4.57.6 has Qwen3MoeSparseMoeBlock.experts = nn.ModuleList(Qwen3MoeMLP) per expert (no fused 3D parameter), so the MoE merge helpers do not fire and every per-expert Linear takes the standard dense _merge_lora path. The MoE helpers are unreachable on transformers < 5; the patch only affects the path that produces the bug.
The trl 1.4 row uses padding_free=False in the reproducer (TRL 1.x raises when padding_free=True is combined with a finite max_length without packing). Unrelated to this patch.
Dense Qwen3-0.6B-Base sanity check on transformers 5.5 + peft 0.19 + trl 0.25 separately: trained vs merged KL = 1.1e-4, top-1 agreement = 0.978, samples 4 / 4. The dense _merge_lora path is untouched.

Test plan

- pytest unsloth-zoo/tests/test_unsloth_zoo_lora_merge.py -> 22 passed
- Full Qwen3-30B-A3B end-to-end on all four version combinations in the table
- Confirm generation_config.json is written into save_directory for merged_16bit
- Confirm _MOE_MERGE_STATE raises a clear RuntimeError when the layout is unrecognised (synthetic test + a deliberately mangled shape fixture)

Related

- Fixes unsloth#5410 ("[Bug] Merged model produces garbage output after save_pretrained_merged").
- Likely fixes unsloth#4832 ("[Feature Request] Support DevStral Small 2 on Transformers v5"): same author, same "garbage after save_pretrained_merged reload" symptom on DevStral Small 2.
- Same class of failure recurs at unsloth#2428 ("Qwen3 Fine-tuning now in Unsloth!", canonical), #3454, #3428, #3547, #1519, #1832.
- Complements #629 ("Refactor and consolidate moe lora extractors", Datta0's training-time MoE extractor fix). The training-time grouped-mm path and the save-time per-expert path are separate codepaths; #629 fixed training, this fixes save.

`save_pretrained_merged(save_method="merged_16bit")` silently dropped the entire MoE expert LoRA delta on Qwen3-MoE / Qwen3.5-MoE-style models with peft >= 0.19.1. The per-expert helpers in `saving_utils.py` hardcoded the PEFT 0.18 "swapped" tensor layout (`lora_A: (E*r, 2I)`, `lora_B: (H, E*r)` for gate_up_proj; `lora_A: (E*r, H)`, `lora_B: (I, E*r)` for down_proj), while PEFT 0.19+ swaps in/out features for non-transposed 3D parameters and produces `lora_A: (E*r, H)`, `lora_B: (2I, E*r)` and `lora_A: (E*r, I)`, `lora_B: (H, E*r)`. The layout mismatch hit a bare `except Exception: return W` and the dim-heuristic fallthrough in the fused helpers, so the merge silently wrote unmodified base weights and reported success. The `num_experts` value used by the per-expert loop was also taken from the shard-local key scan, which can be a non-divisor of `total_rank` whenever experts are split across multiple safetensor shards (16/17 of 128 on Qwen3-30B-A3B). Finally the merged dir was missing `generation_config.json`, so chat-tuned models reloaded with default eos / sampling and ran past EOS.

Changes:

- `_detect_moe_lora_layout(lora_A, lora_B, num_experts, out_dim, in_dim)` classifies the layout by shape against the per-expert disk weight, so no version sniffing is required. Works on transformers 4.57.x / 5.x and peft 0.18.x / 0.19.x.
- `_merge_moe_gate_or_up_expert` and `_merge_moe_down_proj_expert` branch on the detected layout. The "swapped" path is byte-identical to the previous behaviour.
- `_resolve_num_experts_from_lora_stats` walks `module -> base_layer -> ...` to read the authoritative `num_experts` off the wrapped MoE module (`Qwen3MoeExperts` etc). `_merge_and_overwrite_lora` uses it to override `moe_num_experts[prefix]` after the converted-key build, so the per-expert loop never trips on a shard-local count.
- `_MOE_MERGE_STATE` tracks `(attempted, applied, fallback, first_error)`. Each helper records a fallback with role / expert / shapes / reason on any unrecognised layout or exception. After the shard loop `merge_and_overwrite_lora` raises `RuntimeError` if any fallback fired, so partially-merged checkpoints can no longer be silently written. On success it prints `applied/attempted`.
- The `merged_16bit` branch also calls `model.generation_config.save_pretrained(save_directory)` (best-effort, matching `fix_tokenizer_config_json`).

Tests:

- Existing 16 per-expert / fused / dense merge tests in `test_unsloth_zoo_lora_merge.py` still pass byte-for-byte (PEFT 0.18 swapped layout is the default branch).
- 6 new tests:
  * standard layout for `_merge_moe_gate_expert`, `_merge_moe_up_expert`, `_merge_moe_down_proj_expert`,
  * layout classifier for both conventions and the unknown cases,
  * fallback counter increments and `first_error` populates on unrecognised shapes,
  * `_resolve_num_experts_from_lora_stats` walks the `base_layer` chain.

End-to-end verification on Qwen3-30B-A3B (128 experts x 48 layers, fused 3D in memory, per-expert 2D on disk), full SFT + save + reload + logit compare:

| transformers | peft | trl | merged tensors | trained vs merged KL | samples |
|--------------|--------|--------|----------------|----------------------|---------|
| 5.5.0 | 0.19.1 | 0.25.1 | 18432 / 18432 | 1.6e-5 | 3 / 3 |
| 5.5.0 | 0.18.1 | 0.25.1 | 18432 / 18432 | 1.3e-5 | 3 / 3 |
| 4.57.6 | 0.19.1 | 0.25.1 | dense path | 5.5e-5 | 3 / 3 |
| 5.5.0 | 0.19.1 | 1.4.0 | 18432 / 18432 | 2.1e-4 | 3 / 3 |

Before the patch the first row was KL=1.86, samples=1/3, and 0/18432 expert LoRA deltas were applied.

transformers 4.57.6 has `experts = nn.ModuleList(Qwen3MoeMLP)` (no fused 3D parameter) so the MoE merge helpers do not fire and every per-expert Linear takes the standard dense `_merge_lora` path. The MoE helpers are unreachable on transformers <5; the patch only affects the path that produces the bug.

Fixes unslothai/unsloth#5410. Likely also resolves unslothai/unsloth#4832 (same author, same "garbage after save_pretrained_merged reload" symptom on DevStral Small 2).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a17707b12

chatgpt-codex-connector · 2026-05-15T02:49:09Z

+    if _MOE_MERGE_STATE["fallback"] > 0:
+        err = _MOE_MERGE_STATE.get("first_error") or {}
+        raise RuntimeError(


Check MoE fallbacks before uploading merged shards

When push_to_hub=True and a MoE layout fallback is recorded, this new RuntimeError is reached only after the upload step at lines 2442-2448 (and the low-disk path uploads shards inside the merge loop), so the function can still publish a checkpoint with base weights written through before reporting failure. That defeats the loud-fail guard for the exact corrupt-result case it is meant to prevent; run this fallback check before any upload/delete, or raise as soon as a shard records a fallback.
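The ordering this review asks for can be sketched as below. Everything here is hypothetical: `MergeTracker` only mirrors the (attempted, applied, fallback, first_error) fields named in the PR description, and `save_shards`, `merge_one` and `upload` are illustrative names rather than functions from saving_utils.py.

```python
class MergeTracker:
    """Hypothetical counterpart to the (attempted, applied, fallback, first_error) state."""

    def __init__(self):
        self.attempted = 0
        self.applied = 0
        self.fallback = 0
        self.first_error = None

    def record_applied(self):
        self.attempted += 1
        self.applied += 1

    def record_fallback(self, **details):
        self.attempted += 1
        self.fallback += 1
        if self.first_error is None:  # keep only the first failure
            self.first_error = details

    def raise_if_fallback(self):
        if self.fallback > 0:
            raise RuntimeError(
                f"MoE LoRA merge fell back {self.fallback} time(s) out of "
                f"{self.attempted}; first error: {self.first_error}"
            )


def save_shards(shards, merge_one, upload, tracker):
    """Merge every shard, verify, and only then upload."""
    merged = [merge_one(shard, tracker) for shard in shards]
    tracker.raise_if_fallback()  # guard runs before anything is published
    for item in merged:
        upload(item)
    return merged
```

This sketch keeps every merged shard around until the check passes. A low-disk path that uploads inside the merge loop could not do that and would instead need the other option named in the review: raising as soon as a shard records a fallback.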

gemini-code-assist

Code Review

This pull request enhances MoE LoRA merging by supporting both "swapped" and "standard" PEFT layouts, introducing robust error tracking for expert merges, and ensuring authoritative expert counts are resolved across shards. It also adds logic to persist generation_config.json and includes comprehensive tests for these improvements. Feedback highlights that the fallback logic for calculating num_experts in _merge_moe_gate_or_up_expert and _merge_moe_down_proj_expert is flawed because it attempts to access a non-existent rank attribute on the LoraStats object, which would result in incorrect slicing for LoRA ranks greater than 1.

gemini-code-assist · 2026-05-15T02:49:14Z

-            num_experts = total_rank // max(1, getattr(lora_stats, "rank", 0) or 1)
-        if num_experts <= 0 or total_rank % num_experts != 0:
-            return up_W
+            num_experts = lora_stats.lora_A.shape[0] // max(1, getattr(lora_stats, "rank", 0) or 1)


The LoraStats dataclass (defined at line 278) does not have a rank attribute. Consequently, getattr(lora_stats, "rank", 0) will always return 0, and the fallback logic will default to rank=1. This will result in an incorrect num_experts calculation (setting it to total_rank) when the actual LoRA rank is greater than 1, which in turn leads to incorrect slicing of the LoRA matrices during the merge. Since _resolve_num_experts_from_lora_stats is intended to be the authoritative source, this fallback should be more robust or the rank should be explicitly stored in LoraStats during the statistics collection phase.

gemini-code-assist · 2026-05-15T02:49:15Z

        if num_experts is None or num_experts <= 0:
-            num_experts = total_rank // max(1, getattr(lora_stats, "rank", 0) or 1)
-        if num_experts <= 0 or total_rank % num_experts != 0:
+            num_experts = lora_stats.lora_A.shape[0] // max(1, getattr(lora_stats, "rank", 0) or 1)


Similar to the issue in _merge_moe_gate_or_up_expert, getattr(lora_stats, "rank", 0) will always return 0 because the LoraStats dataclass lacks a rank field. This makes the fallback logic for num_experts incorrect for any LoRA with rank > 1. Consider using a more reliable heuristic to determine the rank if it's not explicitly available, or ensure it is captured in the LoraStats object.
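The behaviour both review comments describe is easy to reproduce in isolation. The sketch below uses a hypothetical dataclass, `StatsWithoutRank`, as a stand-in for a stats record with no rank field; it is not the real LoraStats, and `fallback_num_experts` only copies the shape of the reviewed expression.

```python
from dataclasses import dataclass


@dataclass
class StatsWithoutRank:
    """Hypothetical stand-in: a stats record that has no `rank` field."""
    total_rank: int


def fallback_num_experts(stats) -> int:
    # Same expression shape as the reviewed fallback line:
    # a missing attribute yields 0, `0 or 1` yields 1, so the divisor is 1.
    return stats.total_rank // max(1, getattr(stats, "rank", 0) or 1)


stats = StatsWithoutRank(total_rank=128 * 32)
print(fallback_num_experts(stats))  # 4096, not the true expert count of 128
```

With 128 experts at rank 32 the fallback reports 4096 experts, which is the incorrect slicing the reviewer warns about for any rank greater than 1.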

The base_layer walk in _resolve_num_experts_from_lora_stats was an unbounded `while module is not None` loop. PEFT's ParamWrapper does not self-reference in practice, but a self-referential or cyclic `base_layer` chain would hang the merge. Bound the walk to 16 hops, dedupe via an id() set, and swallow exceptions on getattr / getattr-of-attrs so a hostile module that raises on attribute access cannot abort the merge. Confirmed by a synthetic suite (52 cases) across three isolated venvs: peft 0.18.1 + transformers 5.5.0, peft 0.19.1 + transformers 5.5.0, peft 0.19.1 + transformers 4.57.6. All 22 existing merge tests still pass byte-for-byte in each.

danielhanchen · 2026-05-15T03:03:27Z

Sandbox simulation report against this branch (HEAD 97bb267):

Three isolated uv venv matrix points, each running the existing 16 merge tests + the 6 new layout / detection / fallback tests + a 52-case synthetic sim suite covering layout detection edges, _resolve_num_experts edges (including self-referential and 2-cycle base_layer chains, and a module whose __getattribute__ raises), per-expert math in both layouts for all 4 expert indices, ambiguous-dim 2I == H / H == I cases, state-machine reset and first_error stickiness, dense _merge_lora regression, generation_config.save_pretrained best-effort, cross-platform path handling, and a mini E2E mock merge:

| venv | peft | transformers | pytest tests | sim cases |
|------|------|--------------|--------------|-----------|
| `peft018_tfm550` | 0.18.1 | 5.5.0 | 22 / 22 PASS | 52 / 52 PASS |
| `peft019_tfm550` | 0.19.1 | 5.5.0 | 22 / 22 PASS | 52 / 52 PASS |
| `peft019_tfm4576` | 0.19.1 | 4.57.6 | 22 / 22 PASS | 52 / 52 PASS |

The simulation also caught one latent risk before the merge could land: the _resolve_num_experts_from_lora_stats walk was an unbounded while module is not None. PEFT does not self-reference in practice but a hostile / cyclic base_layer chain would hang the merge. The follow-up commit 97bb267 bounds the walk to 16 hops, dedupes via an id() set, and swallows exceptions on attribute access so a module that raises on getattr cannot abort the merge. The 22 + 52 results above are after that fix.
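A bounded, cycle-safe walk of this kind might look like the sketch below. It is an approximation under stated assumptions, not the code from commit 97bb267: `resolve_num_experts` is a hypothetical name, it starts from a module rather than a lora_stats object, and it treats the first positive integer `num_experts` it finds as authoritative.

```python
def resolve_num_experts(module, fallback=None, max_hops=16):
    """Follow module.base_layer links and return the first positive int num_experts found."""
    seen = set()
    current = module
    for _ in range(max_hops):          # hard bound on the walk
        if current is None or id(current) in seen:
            break                      # end of chain, or a cycle
        seen.add(id(current))
        try:
            value = getattr(current, "num_experts", None)
        except Exception:              # attribute access itself may raise
            value = None
        if isinstance(value, int) and value > 0:
            return value
        try:
            current = getattr(current, "base_layer", None)
        except Exception:
            break
    return fallback


class Wrapper:
    """Illustrative wrapper object with an optional base_layer and num_experts."""

    def __init__(self, base_layer=None, num_experts=None):
        self.base_layer = base_layer
        if num_experts is not None:
            self.num_experts = num_experts


chain = Wrapper(Wrapper(Wrapper(num_experts=128)))
print(resolve_num_experts(chain))
```

The `id()` set handles cycles of any length within the bound, and the hop limit covers the case where every object in a very long chain is distinct.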

End-to-end on Qwen3-30B-A3B across the four (transformers, peft, trl) combinations in the PR description remains unchanged: 18432 / 18432 expert tensors merged, trained vs merged KL between 1.3e-5 and 2.1e-4, greedy samples 3 / 3.

…ards (#5410)

unsloth#5410 was a class of silent-write bug in the save_pretrained_merged path that the existing CI matrix could not detect because the merge-helper tests were not wired through the upstream-drift suite. The full fix lives in unslothai/unsloth-zoo#647 (layout-aware MoE merge helpers, authoritative num_experts resolver, loud-fail counter, generation_config.json save). This PR adds the unsloth-side canary that watches for the four guards staying in place in unsloth-zoo so a future refactor cannot silently regress them.

tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py fetches unsloth_zoo/saving_utils.py + tests/test_unsloth_zoo_lora_merge.py from unslothai/unsloth-zoo:main and asserts:

- _MOE_MERGE_STATE / _reset_moe_merge_state / _record_moe_merge_fallback are still defined and a `raise RuntimeError(...MoE...)` still fires when fallback > 0.
- _detect_moe_lora_layout exists and both "swapped" / "standard" branch labels are reachable in the source.
- _resolve_num_experts_from_lora_stats is present AND its base_layer walk is bounded by `for _ in range(N):` (a cyclic ParamWrapper chain must not hang the merge).
- merge_and_overwrite_lora still calls model.generation_config.save_pretrained(...).
- tests/test_unsloth_zoo_lora_merge.py keeps the six PEFT 0.19+ standard-layout regression tests added in #647.
- Local unsloth/save.py still names save_pretrained_merged and routes through merge_and_overwrite_lora (i.e. the entry point still reaches the upstream fix).

While #647 is still open, the four symbol tests SKIP cleanly with a message naming #647. When #647 merges into unsloth-zoo main, the same tests automatically become hard gates and catch any future regression. The sixth test (local entry-point grep) passes today.

CPU-only static fetch, ~0.1s. Wired into the existing peft-pinned-symbols job in .github/workflows/version-compat-ci.yml so it runs on every PR that touches unsloth/** and on the daily schedule.

Local run: 1 passed, 5 skipped (expected; #647 open).

…ards (#5410) (#5433)

* tests: pinned-symbol canary for unsloth-zoo save_pretrained_merged guards (#5410)

unsloth#5410 was a class of silent-write bug in the save_pretrained_merged path that the existing CI matrix could not detect because the merge-helper tests were not wired through the upstream-drift suite. The full fix lives in unslothai/unsloth-zoo#647 (layout-aware MoE merge helpers, authoritative num_experts resolver, loud-fail counter, generation_config.json save). This PR adds the unsloth-side canary that watches for the four guards staying in place in unsloth-zoo so a future refactor cannot silently regress them.

tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py fetches unsloth_zoo/saving_utils.py + tests/test_unsloth_zoo_lora_merge.py from unslothai/unsloth-zoo:main and asserts:

- _MOE_MERGE_STATE / _reset_moe_merge_state / _record_moe_merge_fallback are still defined and a `raise RuntimeError(...MoE...)` still fires when fallback > 0.
- _detect_moe_lora_layout exists and both "swapped" / "standard" branch labels are reachable in the source.
- _resolve_num_experts_from_lora_stats is present AND its base_layer walk is bounded by `for _ in range(N):` (a cyclic ParamWrapper chain must not hang the merge).
- merge_and_overwrite_lora still calls model.generation_config.save_pretrained(...).
- tests/test_unsloth_zoo_lora_merge.py keeps the six PEFT 0.19+ standard-layout regression tests added in #647.
- Local unsloth/save.py still names save_pretrained_merged and routes through merge_and_overwrite_lora (i.e. the entry point still reaches the upstream fix).

While #647 is still open, the four symbol tests SKIP cleanly with a message naming #647. When #647 merges into unsloth-zoo main, the same tests automatically become hard gates and catch any future regression. The sixth test (local entry-point grep) passes today.

CPU-only static fetch, ~0.1s. Wired into the existing peft-pinned-symbols job in .github/workflows/version-compat-ci.yml so it runs on every PR that touches unsloth/** and on the daily schedule.

Local run: 1 passed, 5 skipped (expected; #647 open).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests/version_compat: relax MoE/generation_config regex to fit zoo#647

zoo#647 landed two layout changes that broke the pinned-symbol canary's exact-string regex matches but kept the underlying guarantees intact:

- The post-loop MoE LoRA fallback `raise RuntimeError(...)` wraps the "MoE" wording onto a second line; the old `[^\n]*` did not cross newlines. Switch to `.*?` + re.DOTALL.
- The generation_config save now binds the attr to a local var `gen_cfg = getattr(model, "generation_config", ...)` and calls `gen_cfg.save_pretrained(save_directory)`, so a literal `generation_config.save_pretrained(` substring no longer matches. Anchor on the conceptual operation: a `generation_config` mention followed (within a small char window) by a `.save_pretrained(` call. That is what the canary actually cares about.

Verified locally: pytest tests/version_compat/test_unsloth_zoo_save_merged_pinned_symbols.py -> 2 passed (4 deselected)

---------

Co-authored-by: Daniel Han-Chen <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
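The difference between the two regex styles mentioned in the last commit can be shown on a small example. The patterns and the `SOURCE` snippet below are illustrative assumptions; the canary's actual regexes and the exact wording of the error message in saving_utils.py are not reproduced here.

```python
import re

# Hypothetical source snippet: the "MoE" wording sits on the line after
# `raise RuntimeError(`, as the commit message describes.
SOURCE = '''
    if state["fallback"] > 0:
        raise RuntimeError(
            "MoE LoRA merge fell back to base weights"
        )
'''

# `[^\n]*` cannot cross a line break, so it misses the wrapped message.
single_line = re.compile(r"raise RuntimeError\([^\n]*MoE")
# `.*?` with re.DOTALL lets `.` match newlines, so the wrapped message is found.
multi_line = re.compile(r"raise RuntimeError\(.*?MoE", re.DOTALL)

print(bool(single_line.search(SOURCE)))  # False
print(bool(multi_line.search(SOURCE)))   # True
```

The non-greedy `.*?` keeps the match as short as possible, which limits how far the pattern can wander past the `raise` statement it is meant to anchor on.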

Brings PR unslothai#527's moeFix branch up to current main (HEAD 57bbdc0).

Conflict resolution:

- unsloth_zoo/temporary_patches/qwen3_moe.py: took main's version (PR unslothai#574 already added staticmethod() + a refactored extractor that uses extract_moe_lora_weights_for_grouped_mm; my local layout-aware rewrite is now redundant).
- unsloth_zoo/temporary_patches/qwen3_vl_moe.py: took main's version (same reason; PR unslothai#574 already wrapped the extractor with staticmethod).

Auto-merged cleanly:

- unsloth_zoo/temporary_patches/glm4_moe.py: keeps the new patch_glm4_moe_standard registration alongside main's helper-based refactor of patch_glm4_moe.
- unsloth_zoo/saving_utils.py: PR unslothai#647 (saving fixes) is already in main as faee224, so my three saving cherry-picks are subsumed.

Stashed-then-dropped local shims:

- The local _active_merge_device backport in saving_utils.py and the _unsloth_get_mm_token_id / _unsloth_fix_mm_token_type_ids compat shims in rl_replacements.py were never committed; both symbols now exist in main, so the shims were dropped.

Verified post-merge:

- python -c "import unsloth; import unsloth_zoo.temporary_patches.{glm4_moe,moe_bnb_transformers,qwen3_moe,misc}" succeeds
- patch_glm4_moe_standard, _ParamShapeProxy, patch_peft_param_wrapper_4bit_expert_shape, patch_peft_param_wrapper_merge_4bit are all reachable.

Aligns the bitsandbytes 4-bit MoE support with the FP8 MoE support landed on PR unslothai#548 so both quantization kinds share a single harness:

- Rename moe_bnb_transformers.py -> moe_utils_bnb4bit.py (matches the moe_utils_fp8.py file name PR unslothai#548 uses).
- Add forward_moe_backend_bnb4bit(self, ...) dispatcher with the same shape as forward_moe_backend_fp8: dequantize gate_up_proj/down_proj, hand off to the regular grouped_mm / triton / native backend via a temporary weight swap (_call_with_temporary_moe_weights).
- Add _moe_uses_bnb4bit_expert_weights detection helper.
- moe_utils.forward_moe_backend now tries bnb4bit dispatch the same way PR unslothai#548 wires fp8 dispatch (try-import + early return on hit). The two branches are independent and stack trivially.
- Drop patch_transformers_grouped_linear_4bit. The lower-level _grouped_linear / _batched_linear / batched_mm_experts_forward wrapping is no longer needed for any of the per-arch MoE classes that Unsloth already patches (qwen3*, glm4_moe lite + standard, deepseek_v3, gpt_oss): they all route through forward_moe_backend which now handles bnb4bit. Arches whose experts class is NOT patched per-class (e.g. transformers-default Gemma4MoE) will need a per-class patch instead of the generic interception.

Verified on the user's moe_train_infer_grad_check.py harness across 4 tiny MoE archs x {bf16, bnb4bit} on GPU 7. 4bit and bf16 trajectories match per arch (within stochastic noise), confirming the dequant path is numerically equivalent:

- qwen3_moe: 16bit 12.01->7.39 | 4bit 11.96->7.37
- qwen3_5_moe: 16bit 11.09->0.47 | 4bit 11.11->0.43
- glm4_moe: 16bit 11.62->6.69 | 4bit 11.66->6.66
- deepseek_v3_moe: 16bit (33% acc) | 4bit 10.36->0.05

Save (PR unslothai#647 path) and reload both succeed on every cell. The qwen3_5_moe 4bit reload-accuracy gap (100% post-train -> 0% post-reload) is a pre-existing bnb4bit save/reload roundtrip issue, not introduced by this consolidation.

danielhanchen requested a review from rolandtannous as a code owner May 15, 2026 02:47

This was referenced May 15, 2026

[Bug] Merged model produces garbage output after save_pretrained_merged unslothai/unsloth#5410

Closed

[Feature Request] Support DevStral Small 2 on Transformers v5 unslothai/unsloth#4832

Closed

chatgpt-codex-connector Bot reviewed May 15, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 15, 2026

View reviewed changes

This was referenced May 15, 2026

tests: CPU regression detectors for the MoE merge / save path (#5410) #649

Merged

tests: pinned-symbol canary for unsloth-zoo save_pretrained_merged guards (#5410) unslothai/unsloth#5433

Merged

saving: trim verbose comments per maintainer style

6b0f75e

Tighten the docstrings and inline comments added by the layout-aware MoE merge work so the diff is closer to the surrounding house style (see chore #640). No behaviour change; 22 / 22 merge tests still pass.

danielhanchen force-pushed the fix-5410-moe-merge-layout branch from b0112e5 to 6b0f75e Compare May 15, 2026 07:17

danielhanchen merged commit faee224 into main May 15, 2026
12 of 15 checks passed

This was referenced May 15, 2026

tests: follow MoE merge wrapper delegation in drift detector #653

Merged

tests: CPU regression detectors for the MoE merge / save path (#5410) #655

Merged

danielhanchen merged 3 commits into main from fix-5410-moe-merge-layout
