moe quant patch for merge miss match by ved1beta · Pull Request #3483 · axolotl-ai-cloud/axolotl

ved1beta · 2026-03-10T03:57:39Z

Description

Training with quantize_moe_experts=true + two lora_target_parameters on the same expert module (e.g. mlp.experts.gate_up_proj and mlp.experts.down_proj) produces a size mismatch when merging the adapter back.

patch_peft_target_parameters_matching() (moe_quant.py:234)
The existing PEFT patch now wraps the original_inject call inside _sorted_named_params_ctx(), so both training and merge paths always process parameters in the same alphabetical order → same nesting → consistent adapter keys.
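
A minimal sketch of the idea described above, not the actual `_sorted_named_params_ctx()` from `moe_quant.py`: a context manager temporarily swaps a class's `named_parameters` for a version that yields names alphabetically, then restores the original on exit. `ToyModule`, `sorted_named_params`, and `param_order` are hypothetical stand-ins so the example runs without PyTorch or PEFT.

```python
from contextlib import contextmanager


class ToyModule:
    """Hypothetical stand-in for an nn.Module; yields params in insertion order."""

    def __init__(self, params):
        self._params = dict(params)

    def named_parameters(self):
        yield from self._params.items()


@contextmanager
def sorted_named_params(module_cls):
    """Temporarily make module_cls.named_parameters yield names alphabetically."""
    original = module_cls.named_parameters

    def _sorted(self, *args, **kwargs):
        yield from sorted(original(self, *args, **kwargs), key=lambda kv: kv[0])

    module_cls.named_parameters = _sorted
    try:
        yield
    finally:
        # Always restore the original method, even if the wrapped call raises.
        module_cls.named_parameters = original


def param_order(module):
    return [name for name, _ in module.named_parameters()]


# Same parameters, registered in different orders on two code paths.
train_side = ToyModule({"gate_up_proj": 1, "down_proj": 2})
merge_side = ToyModule({"down_proj": 2, "gate_up_proj": 1})

with sorted_named_params(ToyModule):
    train_order = param_order(train_side)
    merge_order = param_order(merge_side)
```

Inside the `with` block both sides report the same order, so any wrapping that follows iteration order would nest identically. Outside it the original insertion order is back.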

Motivation and Context

Issue reported on Discord: https://discord.com/channels/1104757954588196865/1111279858136383509/1480085723733561344

How has this been tested?

Includes test_adapter_save_load_roundtrip_no_size_mismatch.

AI Usage Disclaimer

Claude wrote the tests.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved parameter wrapper handling consistency between training and model merging operations
- Enhanced MOE expert quantization patch to handle additional configuration scenarios correctly
- Fixed non-deterministic parameter ordering in certain workflows
Tests
- Added comprehensive tests validating parameter wrapper behavior across different configurations

coderabbitai · 2026-03-10T03:57:59Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 27422268-e16d-40fe-a415-5d85693b9331

Walkthrough

The changes enhance MOE quantization and PEFT target parameter matching by introducing deterministic parameter sorting, conditional patching logic, and comprehensive tests to ensure consistent ParamWrapper nesting behavior during training and merging operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Patch Manager Logic `src/axolotl/loaders/patch_manager.py` | Enhanced `_apply_moe_expert_quantization_patch` with conditional imports and guards against non-quantization configurations. Added logic to patch PEFT parameter targeting when `lora_target_parameters` is present, regardless of quantization setting. |
| MOE Quantization Monkeypatch `src/axolotl/monkeypatch/moe_quant.py` | Introduced `_sorted_named_params_ctx()` context manager to ensure deterministic parameter ordering. Extended `patch_peft_target_parameters_matching()` with enhanced documentation and updated `_patched_inject_parameters` to use the new context manager for consistent ParamWrapper nesting. |
| Test Suite `tests/utils/schemas/validation/test_moe_quant.py` | Added comprehensive test class `TestConsistentParamWrapperNesting` with multiple test methods validating consistent ParamWrapper nesting between training and merge paths, including adapter save/load roundtrip and patch idempotency verification. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

under review, scheduled_release

Suggested reviewers

winglian
NanoCode012

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.31% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'moe quant patch for merge miss match' is partially related to the changeset. It references the main area being modified (moe quant patch) and alludes to a merge mismatch issue, but uses vague phrasing ('miss match' appears to be a misspelling of 'mismatch') and lacks specificity about what is being fixed. | Consider revising the title to be more specific and clear, such as 'Fix ParamWrapper nesting inconsistency between training and merge paths for MOE quantization' or 'Ensure consistent parameter ordering in MOE quantization patches for adapter merge'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/utils/schemas/validation/test_moe_quant.py`:
- Around line 279-281: Mypy complains about the dynamic attribute
_axolotl_patched on the function patch_peft_target_parameters_matching; fix it
by adding a mypy ignore for undefined attributes on the assignment/clear sites:
when setting patch_peft_target_parameters_matching._axolotl_patched = True and
when clearing it (patch_peft_target_parameters_matching._axolotl_patched =
False) add a trailing comment "# type: ignore[attr-defined]". Apply the same
ignore at every place this dynamic attribute is assigned (the earlier set and
the finally/cleanup clear) so mypy stops reporting the attribute-defined error.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d27ce30a-15f3-4581-8dab-11034b582189

📥 Commits

Reviewing files that changed from the base of the PR and between cf4d550 and a7fa611.

📒 Files selected for processing (3)

src/axolotl/loaders/patch_manager.py
src/axolotl/monkeypatch/moe_quant.py
tests/utils/schemas/validation/test_moe_quant.py

coderabbitai · 2026-03-10T04:02:44Z

+                finally:
+                    BaseTuner._inject_parameters = original
+                    patch_peft_target_parameters_matching._axolotl_patched = False


⚠️ Potential issue | 🟡 Minor

Fix mypy type error for dynamic function attribute.

The pipeline is failing due to mypy not recognizing the dynamically-set _axolotl_patched attribute on the function. This pattern is used in both line 156 and line 281.

🔧 Proposed fix using type: ignore comment

  finally:
      BaseTuner._inject_parameters = original
-     patch_peft_target_parameters_matching._axolotl_patched = False
+     patch_peft_target_parameters_matching._axolotl_patched = False  # type: ignore[attr-defined]

Apply the same fix at line 156:

  finally:
      BaseTuner._inject_parameters = original
-     patch_peft_target_parameters_matching._axolotl_patched = False
+     patch_peft_target_parameters_matching._axolotl_patched = False  # type: ignore[attr-defined]

And at line 423:

  finally:
      BaseTuner._inject_parameters = original_inject
-     patch_peft_target_parameters_matching._axolotl_patched = False
+     patch_peft_target_parameters_matching._axolotl_patched = False  # type: ignore[attr-defined]


🧰 Tools

🪛 GitHub Actions: lint

[error] 281-281: mypy error: 'Callable[[], Any]' has no attribute '_axolotl_patched' (attr-defined).
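
For context, a hedged sketch of the pattern mypy is flagging: a patch function that stores an "already applied" flag on itself so repeated calls are no-ops, plus a test-style cleanup that restores the original method. The names `BaseTunerStub`, `apply_patch`, and `_patched` are hypothetical and not taken from axolotl or PEFT. The sketch uses `getattr`/`setattr` with a string name, which is one alternative to the `# type: ignore[attr-defined]` comment suggested in this review; whether that reads better is a style choice, and some linters discourage `setattr` with a constant name.

```python
class BaseTunerStub:
    """Hypothetical stand-in for the class being monkeypatched."""

    def _inject_parameters(self):
        return "original"


def apply_patch():
    """Patch BaseTunerStub._inject_parameters once; later calls are no-ops."""
    # getattr/setattr with a string name are not checked for attr-defined
    # on function objects, so no "# type: ignore" comment is needed here.
    if getattr(apply_patch, "_patched", False):
        return False

    original = BaseTunerStub._inject_parameters

    def _patched_inject(self):
        return "patched:" + original(self)

    BaseTunerStub._inject_parameters = _patched_inject
    setattr(apply_patch, "_patched", True)
    return True


# Test-style usage: apply, then restore in `finally` so other tests are unaffected.
saved = BaseTunerStub._inject_parameters
try:
    first = apply_patch()
    second = apply_patch()  # idempotent: does nothing the second time
    result = BaseTunerStub()._inject_parameters()
finally:
    BaseTunerStub._inject_parameters = saved
    setattr(apply_patch, "_patched", False)
```

Clearing the flag in `finally` matters as much as restoring the method. If only the method is restored, the next call to the patch function would see the stale flag and skip patching.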


codecov · 2026-03-10T04:13:17Z

Codecov Report

❌ Patch coverage is 75.00000% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/monkeypatch/moe_quant.py | 79.06% | 9 Missing ⚠️ |
| src/axolotl/loaders/patch_manager.py | 40.00% | 3 Missing ⚠️ |


NanoCode012 · 2026-03-10T11:35:47Z

                    replace_parameter_8bit(mod, pname)
                _moe_load_state["count"] += 1

-                # Release the bf16 tensor so CUDA memory is freed immediately.


Let's keep this comment. It documents the change that reduces VRAM usage, which is worth knowing if we refactor in the future.

NanoCode012 · 2026-03-10T11:36:21Z

+        """Patch transformers weight loading to quantize MoE expert params on-the-fly.
+
+        Also patches PEFT's _inject_parameters whenever lora_target_parameters is set
+        (even without quantize_moe_experts) to ensure consistent ParamWrapper nesting
+        order between training and merge, preventing adapter key mismatches.
+        """


Let's simplify the comments here.

NanoCode012 · 2026-03-10T11:36:49Z


    @torch.no_grad()
    def forward(self, quantized_param: torch.Tensor) -> torch.Tensor:
-        # Flatten 3D+ to 2D for BnB's dequant, then reshape back.


Let's keep this. Instead of a comment, it could go in the function docstring.

NanoCode012 · 2026-03-10T11:37:04Z

        module, param_name, Bnb8bitParametrization(row_stats), unsafe=True
    )

-    # Cache dequantized values during forward to avoid redundant dequantization.


NanoCode012 · 2026-03-10T11:37:49Z

-    # Sequential loading ensures only ONE bf16 expert tensor is on-GPU at a time.
+    # Force sequential tensor loading so we can quantize-and-free one expert at a time.
+    # Without this, transformers pre-fetches all bf16 expert tensors to GPU simultaneously.
    os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"


Unrelated to this PR, but a user reported that this is also useful to enable for QLoRA in general.

NanoCode012 · 2026-03-10T11:38:07Z

+    """Fix PEFT's _inject_parameters for suffix matching and portable adapter ordering.
+
+    1. Expands short suffix targets (e.g. "mlp.experts.gate_up_proj") to full module
+       paths so the parametrized branch can match them.
+
+    2. Makes the parametrized branch iterate module.parametrizations in insertion order
+       instead of PEFT's sorted(target_names), matching the standard branch. This ensures
+       adapters saved during training load correctly with vanilla PEFT, vLLM, and other
+       tools without requiring this patch.
+    """
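
Point 1 of the docstring above can be sketched as follows. This is an illustrative assumption about what "expanding short suffix targets to full module paths" might look like, not the code in `moe_quant.py` or PEFT. `expand_suffix_targets` and the parameter paths are hypothetical.

```python
def expand_suffix_targets(param_names, targets):
    """Map each short target to the full parameter paths that end with it.

    A match must fall on a dot boundary, so "experts.down_proj" does not
    match "shared_experts.down_proj".
    """
    expanded = {}
    for target in targets:
        expanded[target] = [
            name
            for name in param_names  # preserves the order the names arrive in
            if name == target or name.endswith("." + target)
        ]
    return expanded


# Hypothetical parameter paths for a small MoE model.
names = [
    "model.layers.1.mlp.experts.gate_up_proj",
    "model.layers.1.mlp.experts.down_proj",
    "model.layers.1.mlp.shared_experts.down_proj",
    "model.layers.2.mlp.experts.gate_up_proj",
]
targets = ["mlp.experts.gate_up_proj", "mlp.experts.down_proj"]
full = expand_suffix_targets(names, targets)
```

Because the comprehension walks `param_names` in the order given rather than sorting, the expanded lists keep the model's own ordering, which is the property point 2 of the docstring relies on.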


NanoCode012 · 2026-03-10T11:39:58Z

+    from peft.utils.integrations import init_empty_weights
+    from peft.utils.other import _get_submodules

    def _patched_inject_parameters(


Could you manually review the changes to this function in case they introduce edge-case issues?

The code I provided for this was generated and I did not verify it.

The concept would be: copy the upstream PEFT function and just remove the sorted path, so it reuses the target_modules insertion-order flow.

NanoCode012

Could we also add an e2e test that trains an adapter (a few steps), then attempts to merge and checks that it doesn't fail?

zerofata · 2026-03-10T21:45:40Z

Just tried this and still got an error.

root@7e53ee18d5fa:/workspace/axolotl# python3 -m axolotl.cli.merge_lora sft-writing.yml \
    --lora_model_dir="./GLM-Air-v4-SFT-1-writing" \
    --gpu_memory_limit=0
[2026-03-10 21:39:56,579] [WARNING] [torchao] Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu126 for torchao version 0.16.0             Please see https://github.com/pytorch/ao/issues/2919 for more info
[2026-03-10 21:39:58,983] [INFO] [axolotl.integrations.base] Attempting to load plugin: axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
[2026-03-10 21:39:59,972] [INFO] [axolotl.integrations.base] Plugin loaded successfully: axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
[2026-03-10 21:40:00,023] [INFO] [axolotl.utils.schemas.validation] explicitly setting `eval_sample_packing` to match `sample_packing`
[2026-03-10 21:40:00,023] [WARNING] [axolotl.utils.schemas.validation] sample_packing without flash, sdp, xformers, sage, or flex attention does not handle cross sample decontamination.
[2026-03-10 21:40:00,023] [INFO] [axolotl.utils.schemas.validation] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
[2026-03-10 21:40:00,167] [INFO] [axolotl.cli.config] config:
{
  "activation_offloading": false,
  "adapter": "qlora",
  "axolotl_config_path": "sft-writing.yml",
  "base_model": "ApocalypseParty/GLM-Air-v4-SFT-1-merged",
  "base_model_config": "ApocalypseParty/GLM-Air-v4-SFT-1-merged",
  "batch_size": 8,
  "bf16": true,
  "capabilities": {
    "bf16": true,
    "compute_capability": "sm_90",
    "fp8": true,
    "n_gpu": 1,
    "n_node": 1
  },
  "chat_template": "jinja",
  "chat_template_jinja": "./glm_air.jinja",
  "context_parallel_size": 1,
  "cut_cross_entropy": true,
  "dataloader_num_workers": 1,
  "dataloader_pin_memory": true,
  "dataloader_prefetch_factor": 256,
  "dataset_num_proc": 48,
  "dataset_prepared_path": "last_run_prepared",
  "datasets": [
    {
      "chat_template": "tokenizer_default",
      "message_property_mappings": {
        "content": "content",
        "role": "role"
      },
      "path": "./data/dataset_writing.jsonl",
      "trust_remote_code": false,
      "type": "chat_template"
    }
  ],
  "ddp": false,
  "device": "cuda:0",
  "device_map": "auto",
  "dion_rank_fraction": 1.0,
  "dion_rank_multiple_of": 1,
  "eaft_alpha": 1.0,
  "eaft_k": 20,
  "env_capabilities": {
    "torch_version": "2.9.0"
  },
  "eot_tokens": [
    "<|user|>",
    "<|endoftext|>"
  ],
  "eval_batch_size": 2,
  "eval_causal_lm_metrics": [
    "sacrebleu",
    "comet",
    "ter",
    "chrf"
  ],
  "eval_max_new_tokens": 128,
  "eval_sample_packing": true,
  "eval_table_size": 0,
  "experimental_skip_move_to_device": true,
  "flash_attention": false,
  "fp16": false,
  "generate_samples": false,
  "generation_do_sample": true,
  "generation_max_new_tokens": 50,
  "generation_prompt_ratio": 0.5,
  "generation_temperature": 0.7,
  "gpu_memory_limit": 0,
  "gradient_accumulation_steps": 4,
  "gradient_checkpointing": false,
  "include_tkps": true,
  "learning_rate": 9e-06,
  "lisa_layers_attribute": "model.layers",
  "load_best_model_at_end": false,
  "load_in_4bit": false,
  "load_in_8bit": false,
  "local_rank": 0,
  "logging_steps": 1,
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "lora_mlp_kernel": false,
  "lora_model_dir": "./GLM-Air-v4-SFT-1-writing",
  "lora_o_kernel": false,
  "lora_qkv_kernel": false,
  "lora_r": 16,
  "lora_target_modules": [
    "q_proj",
    "v_proj",
    "k_proj",
    "o_proj"
  ],
  "lora_target_parameters": [
    "mlp.experts.gate_up_proj",
    "mlp.experts.down_proj"
  ],
  "loraplus_lr_embedding": 1e-06,
  "lr_scheduler": "cosine",
  "mean_resizing_embeddings": false,
  "merge_lora": true,
  "micro_batch_size": 2,
  "model_config_type": "glm4_moe",
  "num_epochs": 8.0,
  "num_generation_samples": 3,
  "optimizer": "adamw_torch_8bit",
  "otel_metrics_host": "localhost",
  "otel_metrics_port": 8000,
  "output_dir": "./GLM-Air-v4-SFT-1-writing",
  "pad_to_sequence_len": true,
  "plugins": [
    "axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin"
  ],
  "pretrain_multipack_attn": true,
  "profiler_steps_start": 0,
  "qlora_sharded_model_loading": false,
  "quantize_moe_experts": false,
  "ray_num_workers": 1,
  "resources_per_worker": {
    "GPU": 1
  },
  "sample_packing": true,
  "sample_packing_bin_size": 200,
  "sample_packing_group_size": 100000,
  "save_only_model": false,
  "save_safetensors": true,
  "save_steps": 0.125,
  "saves_per_epoch": 1,
  "sequence_len": 4096,
  "shuffle_before_merging_datasets": false,
  "shuffle_merged_datasets": true,
  "skip_prepare_dataset": false,
  "streaming_multipack_buffer_size": 10000,
  "strict": false,
  "tensor_parallel_size": 1,
  "tf32": false,
  "tiled_mlp_use_original_mlp": true,
  "tokenizer_config": "ApocalypseParty/GLM-Air-v4-SFT-1-merged",
  "tokenizer_save_jinja_files": true,
  "torch_dtype": "torch.bfloat16",
  "train_on_inputs": false,
  "trl": {
    "log_completions": false,
    "mask_truncated_completions": false,
    "ref_model_mixup_alpha": 0.9,
    "ref_model_sync_steps": 64,
    "scale_rewards": true,
    "sync_ref_model": false,
    "use_vllm": false,
    "vllm_server_host": "0.0.0.0",
    "vllm_server_port": 8000
  },
  "use_otel_metrics": false,
  "use_ray": false,
  "use_wandb": true,
  "val_set_size": 0.0,
  "vllm": {
    "device": "auto",
    "dtype": "auto",
    "gpu_memory_utilization": 0.9,
    "host": "0.0.0.0",
    "port": 8000
  },
  "wandb_name": "GLM-Air-v4-SFT-1-writing",
  "wandb_project": "GLM-Air-v4-SFT",
  "warmup_ratio": 0.1,
  "weight_decay": 0.0,
  "world_size": 1
}
[2026-03-10 21:40:00,169] [INFO] [axolotl.cli.utils.load] loading tokenizer... ApocalypseParty/GLM-Air-v4-SFT-1-merged
[2026-03-10 21:40:02,057] [INFO] [axolotl.cli.utils.load] loading model...
[2026-03-10 21:40:02,109] [INFO] [axolotl.loaders.patch_manager] Applying multipack dataloader patch for sample packing...
[2026-03-10 21:40:02,119] [WARNING] [py.warnings] /usr/local/lib/python3.11/dist-packages/torch/__init__.py:1551: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  return _C._get_float32_matmul_precision()

[2026-03-10 21:40:02,129] [INFO] [axolotl.integrations.cut_cross_entropy] Applying Cut Cross Entropy to model type: glm4_moe
[2026-03-10 21:40:02,137] [INFO] [axolotl.monkeypatch.moe_quant] Patched PEFT _inject_parameters for parametrized module suffix matching
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 735/735 [00:32<00:00, 22.97it/s]
[2026-03-10 21:40:39,021] [INFO] [axolotl.loaders.model] Converting modules to torch.bfloat16
[2026-03-10 21:40:39,888] [WARNING] [py.warnings] /usr/local/lib/python3.11/dist-packages/peft/tuners/tuners_utils.py:212: UserWarning: Unsupported layer type '<class 'transformers.models.glm4_moe.modeling_glm4_moe.Glm4MoeNaiveMoe'>' encountered, proceed at your own risk.
  warnings.warn(f"Unsupported layer type '{type(module)}' encountered, proceed at your own risk.", UserWarning)

[2026-03-10 21:40:53,100] [ERROR] [axolotl.telemetry.errors] Error captured in telemetry. Run ID: 509605c1-bbd5-44a7-a7d7-d5b396bb4d7c
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 94, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 90, in do_cli
    do_merge_lora(cfg=parsed_cfg)
  File "/workspace/axolotl/src/axolotl/telemetry/errors.py", line 127, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 26, in do_merge_lora
    model, tokenizer, processor = load_model_and_tokenizer(cfg=cfg)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/utils/load.py", line 45, in load_model_and_tokenizer
    model, _ = model_loader.load()
               ^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/telemetry/errors.py", line 127, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/loaders/model.py", line 186, in load
    lora_config = self._load_adapters()
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/loaders/model.py", line 396, in _load_adapters
    self.model, lora_config = load_adapter(
                              ^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/telemetry/errors.py", line 127, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/loaders/adapter.py", line 193, in load_adapter
    peft_model, lora_config = load_lora(model, cfg, inference=inference)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/loaders/adapter.py", line 154, in load_lora
    model = PeftModel.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/peft/peft_model.py", line 568, in from_pretrained
    load_result = model.load_adapter(
                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/peft/peft_model.py", line 1368, in load_adapter
    load_result = set_peft_model_state_dict(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/peft/utils/save_and_load.py", line 565, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 2629, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.layers.1.mlp.experts.base_layer.lora_A.default.weight: copying a param with shape torch.Size([2048, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2816]).
        size mismatch for base_model.model.model.layers.1.mlp.experts.base_layer.lora_B.default.weight: copying a param with shape torch.Size([1408, 2048]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
        size mismatch for base_model.model.model.layers.1.mlp.experts.lora_A.default.weight: copying a param with shape torch.Size([2048, 2816]) from checkpoint, the shape in current model is torch.Size([2048, 4096]).
        size mismatch for base_model.model.model.layers.1.mlp.experts.lora_B.default.weight: copying a param with shape torch.Size([4096, 2048]) from checkpoint, the shape in current model is torch.Size([1408, 2048]).
        size mismatch for base_model.model.model.layers.2.mlp.experts.base_layer.lora_A.default.weight: copying a param with shape torch.Size([2048, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2816]).
        size mismatch for base_model.model.model.layers.2.mlp.experts.base_layer.lora_B.default.weight: copying a param with shape torch.Size([1408, 2048]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
        size mismatch for base_model.model.model.layers.2.mlp.experts.lora_A.default.weight: copying a param with shape torch.Size([2048, 2816]) from checkpoint, the shape in current model is torch.Size([2048, 4096]).
        size mismatch for base_model.model.model.layers.2.mlp.experts.lora_B.default.weight: copying a param with shape torch.Size([4096, 2048]) from checkpoint, the shape in current model is torch.Size([1408, 2048]).
        size mismatch for base_model.model.model.layers.3.mlp.experts.base_layer.lora_A.default.weight: copying a param with shape torch.Size([2048, 4096]) from checkpoint, the shape in current model is torch.Size([2048, 2816]).
        size mismatch for base_model.model.model.layers.3.mlp.experts.base_layer.lora_B.default.weight: copying a param with shape torch.Size([1408, 2048]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
        size mismatch for base_model.model.model.layers.3.mlp.experts.lora_A.default.weight: copying a param with shape torch.Size([2048, 2816]) from checkpoint, the shape in current model is torch.Size([2048, 4096]).
        size mismatch for base_model.model.model.layers.3.mlp.experts.lora_B.default.weight: copying a param with shape torch.Size([4096, 2048]) from checkpoint, the shape in current model is torch.Size([1408, 2048]).
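
The errors above pair up: for each layer, the shapes expected at `experts.lora_*` and `experts.base_layer.lora_*` are swapped between checkpoint and current model. That is what one would expect if two wrappers on the same module were nested in opposite orders at save and load time. A toy sketch of that reasoning follows. It assumes each new wrapper encloses the previous one, so the first parameter wrapped ends up behind `.base_layer`. `adapter_key_layout` is hypothetical and does not call PEFT.

```python
def adapter_key_layout(module_path, wrap_order):
    """Return {state-dict key prefix: target parameter} for nested wrappers.

    Assumption for this sketch: each new wrapper encloses the previous one,
    so the first parameter wrapped ends up deepest, behind the most
    ".base_layer" hops, and the last one wrapped sits at module_path itself.
    """
    layout = {}
    depth = len(wrap_order) - 1
    for param in wrap_order:
        prefix = module_path + ".base_layer" * depth
        layout[prefix] = param
        depth -= 1
    return layout


path = "model.layers.1.mlp.experts"
saved = adapter_key_layout(path, ["gate_up_proj", "down_proj"])
loaded = adapter_key_layout(path, ["down_proj", "gate_up_proj"])

# Keys whose owning parameter differs between the two orders.
swapped = sorted(k for k in saved if saved[k] != loaded[k])
```

Both layouts contain exactly the same keys, so `load_state_dict` finds every key it expects and only the shapes disagree, which matches the "size mismatch" (rather than "missing key") errors in the log.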

ved1beta · 2026-03-11T02:05:03Z

You tried on the latest commit, right? I remember it working fine on my end 🤔

zerofata · 2026-03-11T02:29:55Z

Was using the below repo / branch.

git clone https://github.com/ved1beta/axolotl
git checkout moe-merge-patch

ved1beta · 2026-03-11T02:38:32Z

dw, looking into it

ved1beta · 2026-03-11T04:47:28Z

Some uncommitted changes, like always 🤕. It works now, thanks for reporting @zerofata
Tested with 4.7 flash and 4.5 air + e2e

…e-merge-patch

moe quant patch for merge miss match

a7fa611

coderabbitai Bot reviewed Mar 10, 2026

ved1beta added 2 commits March 10, 2026 12:30

lint

f006ed0

revert test + fix moe patch

e7ee8c2

NanoCode012 reviewed Mar 10, 2026

ved1beta added 2 commits March 10, 2026 18:47

comment fixxes

07a7369

e2e tests

e7d0a42

mismatch fixx tested

b038d35

Merge branch 'main' into moe-merge-patch

d436434

NanoCode012 added the wip label Mar 11, 2026

ved1beta and others added 4 commits March 12, 2026 12:21

mis match fix wwith vllm compatablity + test

26cf831

Merge branch 'moe-merge-patch' of github.com:ved1beta/axolotl into mo…

c39fc94

…e-merge-patch

comment lint

cc6592d

Merge branch 'main' into moe-merge-patch

f4d04d5

winglian approved these changes Mar 12, 2026

winglian added ready to merge and removed wip labels Mar 12, 2026

NanoCode012 added the hold don't merge this yet label Mar 13, 2026

NanoCode012 added 2 commits March 13, 2026 20:31

fix: missing os import, duplicate no op

70bf473

chore: simplify comments

dd417fa

NanoCode012 approved these changes Mar 13, 2026

NanoCode012 removed the hold don't merge this yet label Mar 13, 2026

winglian merged commit a806704 into axolotl-ai-cloud:main Mar 16, 2026
18 checks passed

coderabbitai Bot mentioned this pull request Mar 18, 2026

super nemo support #3508

Merged

winglian removed the ready to merge label Mar 22, 2026


4 participants
