FIX: monkey patch bitsandbytes oom on v5 by ved1beta · Pull Request #3395 · axolotl-ai-cloud/axolotl

ved1beta · 2026-02-09T17:11:32Z

Use target_params to detect LoRA params.

coderabbitai · 2026-02-09T17:14:10Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This pull request adds post-load quantization support for Mixture of Experts (MoE) expert parameters in QLoRA configurations. It introduces a mapping of MoE architecture parameter names, excludes these modules from LoRA targeting, and applies 4-bit quantization to 3D expert tensors after model building to prevent out-of-memory issues with transformers v5.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| MoE Architecture Parameter Mapping `src/axolotl/common/architectures.py` | Adds `MOE_EXPERT_PARAMS` constant mapping MoE architecture names to lists of parameter name groups identifying 3D nn.Parameter tensors requiring special quantization handling. |
| LoRA Configuration Updates `src/axolotl/loaders/adapter.py` | Imports `MOE_EXPERT_PARAMS` and augments LoRA configuration to exclude MoE expert modules from LoRA targeting via `exclude_modules` parameter passed to `LoraConfig`. |
| Model Loading & Quantization Trigger `src/axolotl/loaders/model.py` | Imports `MOE_EXPERT_PARAMS` and conditionally invokes MoE expert quantization after model building when using QLoRA with 4-bit loading on supported architectures, with runtime detection of `BitsAndBytesConfig` capabilities. |
| Post-Load MoE Quantization Logic `src/axolotl/monkeypatch/moe_quant.py` | New module providing `quantize_moe_expert_params()` function that applies bitsandbytes 4-bit quantization to 3D expert parameter tensors identified by architecture-specific parameter name mappings. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~23 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly addresses the PR's primary objective (patching bitsandbytes OOM issues on Transformers v5), which aligns with the core changes in model.py, adapter.py, and the new moe_quant.py module. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#3374`'s requirements: quantizes MoE expert parameters to free GPU memory before PEFT setup, excludes expert modules from LoRA to prevent ParametrizationList wrapping, and maps MoE architectures with their parameter patterns. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to resolving the OOM issue: introducing MOE_EXPERT_PARAMS mapping, adding quantization logic in model.py, configuring LoRA exclusions in adapter.py, and implementing the quantize_moe_expert_params function. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@src/axolotl/loaders/adapter.py`:
- Around line 121-126: The code mutates cfg.lora_exclude_modules in place by
assigning exclude_modules = cfg.lora_exclude_modules or [] and then appending to
exclude_modules; change this to work on a shallow copy so the original config is
not modified: create exclude_modules = list(cfg.lora_exclude_modules) if
cfg.lora_exclude_modules else [] (or use .copy()) before iterating over
MOE_EXPERT_PARAMS[cfg.model_config_type] / expert_param_names and appending,
ensuring no in-place modification of cfg.lora_exclude_modules in the logic
around exclude_modules and model_config_type.
- Around line 118-133: Only exclude MOE expert param names when they will
actually be wrapped by ParametrizationList (i.e., when MoE expert quantization
is enabled); change the unconditional block that appends
MOE_EXPERT_PARAMS[cfg.model_config_type] to exclude_modules to be guarded by the
quantization flag (e.g., cfg.quantize_moe_expert_params or the same condition
used for QLoRA/load_in_4bit), so that the code that builds exclude_modules
before constructing LoraConfig only adds expert_param_names when
quantize_moe_expert_params is true.

In `@src/axolotl/monkeypatch/moe_quant.py`:
- Line 31: Rename the unused loop variable module_name in the for loop over
model.named_modules() to a throwaway name (e.g., _module_name or _) to satisfy
the linter; update the loop header "for module_name, module in
model.named_modules()" to use the new unused-variable name while leaving the
used variable module intact so the rest of the body (which references module)
continues to work.

🧹 Nitpick comments (1)

src/axolotl/monkeypatch/moe_quant.py (1)

20-42: Quant parameters are hardcoded and may diverge from user's BitsAndBytesConfig.

quant_type defaults to "nf4" and compress_statistics to True, but the user may have configured different values (e.g., "fp4", or bnb_4bit_use_double_quant: false). The caller in model.py (line 197) doesn't forward these settings, so expert params could be quantized with different options than the rest of the model.

Consider reading quant_type and compress_statistics from the model's existing BitsAndBytesConfig and passing them through.
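
One way to keep the expert quantization in line with the rest of the model, sketched under assumptions: read the two settings off the active config and forward them. `bnb_4bit_quant_type` and `bnb_4bit_use_double_quant` are the `BitsAndBytesConfig` field names in transformers; the helper name `expert_quant_kwargs` and the stand-in config object are hypothetical, and how the result reaches `quantize_moe_expert_params` is left open because that signature is not shown here.

```python
from types import SimpleNamespace


def expert_quant_kwargs(bnb_config):
    """Derive expert quantization settings from a BitsAndBytesConfig-like object.

    Falls back to the currently hardcoded defaults ("nf4", True) when the
    config is missing or lacks a field.
    """
    return {
        "quant_type": getattr(bnb_config, "bnb_4bit_quant_type", "nf4"),
        "compress_statistics": getattr(bnb_config, "bnb_4bit_use_double_quant", True),
    }


# Stand-in for a user config that asks for fp4 without double quantization.
user_config = SimpleNamespace(bnb_4bit_quant_type="fp4", bnb_4bit_use_double_quant=False)
print(expert_quant_kwargs(user_config))
print(expert_quant_kwargs(None))
```

With this shape, the caller in model.py would only need to unpack the returned dict into the quantization call, assuming that function accepts these two keyword arguments.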

coderabbitai · 2026-02-09T17:20:53Z

+    # Exclude ParametrizationList modules created by MoE expert quantization.
+    # replace_parameter_4bit wraps quantized params in ParametrizationList child
+    # modules that PEFT doesn't support as LoRA targets.
+    exclude_modules = cfg.lora_exclude_modules or []
+    if cfg.model_config_type in MOE_EXPERT_PARAMS:
+        expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
+        for name in expert_param_names:
+            if name not in exclude_modules:
+                exclude_modules.append(name)
+
    lora_config = LoraConfig(
        r=cfg.lora_r,
        lora_alpha=cfg.lora_alpha,
        target_modules=lora_target_modules,
        target_parameters=lora_target_parameters,
+        exclude_modules=exclude_modules if exclude_modules else None,


⚠️ Potential issue | 🟠 Major

MoE expert params are excluded even for non-quantized LoRA, which may be unintended.

The exclusion is applied unconditionally whenever cfg.model_config_type in MOE_EXPERT_PARAMS, but the ParametrizationList wrappers that motivate this exclusion are only created when quantize_moe_expert_params runs (i.e., QLoRA + load_in_4bit). For plain LoRA without quantization, these params are normal nn.Parameter tensors and excluding them would unnecessarily prevent LoRA from targeting those modules.

Consider guarding the exclusion:

Proposed fix

     exclude_modules = cfg.lora_exclude_modules or []
-    if cfg.model_config_type in MOE_EXPERT_PARAMS:
+    if cfg.model_config_type in MOE_EXPERT_PARAMS and cfg.adapter == "qlora" and cfg.load_in_4bit:
         expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
         for name in expert_param_names:
             if name not in exclude_modules:
                 exclude_modules.append(name)
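
The guarded exclusion can also be sketched as a small standalone helper. This is an illustration, not the code in `adapter.py`: `build_exclude_modules` is a hypothetical name, the `cfg` objects are stand-ins, and the mapping entry is made up for the example.

```python
from types import SimpleNamespace

# Hypothetical mapping entry; the real MOE_EXPERT_PARAMS contents are not shown here.
MOE_EXPERT_PARAMS = {"example_moe": ["gate_up_proj", "down_proj"]}


def build_exclude_modules(cfg):
    """Return the exclude list, adding expert params only for 4-bit QLoRA."""
    exclude_modules = list(cfg.lora_exclude_modules or [])
    will_quantize_experts = (
        cfg.model_config_type in MOE_EXPERT_PARAMS
        and cfg.adapter == "qlora"
        and cfg.load_in_4bit
    )
    if will_quantize_experts:
        for name in MOE_EXPERT_PARAMS[cfg.model_config_type]:
            if name not in exclude_modules:
                exclude_modules.append(name)
    # LoraConfig receives None rather than an empty list.
    return exclude_modules or None


qlora_cfg = SimpleNamespace(
    model_config_type="example_moe", adapter="qlora",
    load_in_4bit=True, lora_exclude_modules=None,
)
lora_cfg = SimpleNamespace(
    model_config_type="example_moe", adapter="lora",
    load_in_4bit=False, lora_exclude_modules=None,
)
print(build_exclude_modules(qlora_cfg))  # expert params excluded
print(build_exclude_modules(lora_cfg))   # nothing excluded
```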

coderabbitai · 2026-02-09T17:20:53Z

+    exclude_modules = cfg.lora_exclude_modules or []
+    if cfg.model_config_type in MOE_EXPERT_PARAMS:
+        expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
+        for name in expert_param_names:
+            if name not in exclude_modules:
+                exclude_modules.append(name)


⚠️ Potential issue | 🟡 Minor

Mutating cfg.lora_exclude_modules list in place.

If cfg.lora_exclude_modules returns a mutable list, exclude_modules = cfg.lora_exclude_modules or [] aliases it, and then .append(name) mutates the original config. This could cause surprising side effects if load_lora is called multiple times (e.g., duplicated entries, or the config object retaining modifications).

Proposed fix — copy the list

-    exclude_modules = cfg.lora_exclude_modules or []
+    exclude_modules = list(cfg.lora_exclude_modules or [])
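
The aliasing is easy to reproduce in isolation. The sketch below uses a stand-in config object and a hypothetical `add_excludes` helper rather than the real `load_lora`; it only shows the difference between `x or []` and `list(x or [])`.

```python
from types import SimpleNamespace


def add_excludes(cfg, names, copy):
    """Append names to the exclude list, optionally working on a copy."""
    if copy:
        exclude_modules = list(cfg.lora_exclude_modules or [])
    else:
        # `or` returns the original list object when it is non-empty.
        exclude_modules = cfg.lora_exclude_modules or []
    for name in names:
        if name not in exclude_modules:
            exclude_modules.append(name)
    return exclude_modules


aliased_cfg = SimpleNamespace(lora_exclude_modules=["lm_head"])
add_excludes(aliased_cfg, ["down_proj"], copy=False)
print(aliased_cfg.lora_exclude_modules)  # the config itself was modified

copied_cfg = SimpleNamespace(lora_exclude_modules=["lm_head"])
add_excludes(copied_cfg, ["down_proj"], copy=True)
print(copied_cfg.lora_exclude_modules)  # the config is unchanged
```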

coderabbitai · 2026-02-09T17:20:53Z

+        return
+
+    count = 0
+    for module_name, module in model.named_modules():


⚠️ Potential issue | 🟡 Minor

Fix pipeline failure: rename unused loop variable.

The CI lint fails because module_name is not used in the loop body.

Proposed fix

-    for module_name, module in model.named_modules():
+    for _module_name, module in model.named_modules():

🧰 Tools

🪛 GitHub Actions: lint

[error] 31-31: B007 Loop control variable module_name not used within loop body. Rename unused module_name to _module_name.

🪛 Ruff (0.14.14)

[warning] 31-31: Loop control variable module_name not used within loop body

Rename unused module_name to _module_name

(B007)

codecov · 2026-02-09T17:24:19Z

Codecov Report

❌ Patch coverage is 18.82353% with 69 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/monkeypatch/moe_quant.py | 0.00% | 36 Missing ⚠️ |
| src/axolotl/monkeypatch/lora_kernels.py | 0.00% | 17 Missing ⚠️ |
| src/axolotl/loaders/adapter.py | 52.17% | 11 Missing ⚠️ |
| src/axolotl/loaders/model.py | 44.44% | 5 Missing ⚠️ |

NanoCode012 · 2026-02-10T07:28:10Z

+        for param_name in target_params:
+            if hasattr(module, param_name):
+                param = getattr(module, param_name)
+                if isinstance(param, torch.nn.Parameter) and param.ndim >= 2:


Any specific reason for `param.ndim >= 2`?

NanoCode012 · 2026-02-10T07:28:51Z

+        # Quantize MoE expert weights immediately after model build.
+        # In transformers v5, MoE expert weights are 3D nn.Parameter tensors that
+        # BnB quantization skips (it only handles nn.Linear). This causes OOM because
+        # expert weights stay in full precision. We quantize them here before any other
+        # operations that need GPU memory (like prepare_model_for_kbit_training).
+        if (
+            self.cfg.adapter == "qlora"
+            and self.cfg.load_in_4bit
+            and self.cfg.model_config_type in MOE_EXPERT_PARAMS
+        ):
+            import inspect
+
+            bnb_config_params = inspect.signature(
+                BitsAndBytesConfig.__init__
+            ).parameters
+            if "target_parameters" not in bnb_config_params:
+                from axolotl.monkeypatch.moe_quant import (
+                    quantize_moe_expert_params,
+                )
+
+                quantize_moe_expert_params(self.model, self.cfg.model_config_type)
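
The capability check in this hunk relies on `inspect.signature` to see whether the installed `BitsAndBytesConfig` declares a `target_parameters` argument. A minimal sketch of the same pattern, using two stand-in classes instead of the real transformers class (both class names and the `accepts_kwarg` helper are hypothetical):

```python
import inspect


class OldStyleConfig:
    """Stand-in for a config class without target_parameters."""

    def __init__(self, load_in_4bit=False):
        self.load_in_4bit = load_in_4bit


class NewStyleConfig:
    """Stand-in for a config class that accepts target_parameters."""

    def __init__(self, load_in_4bit=False, target_parameters=None):
        self.load_in_4bit = load_in_4bit
        self.target_parameters = target_parameters


def accepts_kwarg(cls, name):
    """Return True if cls.__init__ declares a parameter called `name`."""
    return name in inspect.signature(cls.__init__).parameters


for cls in (OldStyleConfig, NewStyleConfig):
    needs_fallback = not accepts_kwarg(cls, "target_parameters")
    print(cls.__name__, "needs post-load quantization:", needs_fallback)
```

The check only sees explicitly declared parameters, so a constructor that takes the argument through `**kwargs` would not be detected by it.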


Would regular bnb lora also have this issue?

NanoCode012 · 2026-02-10T07:29:31Z

+    # Exclude ParametrizationList modules created by MoE expert quantization.
+    # replace_parameter_4bit wraps quantized params in ParametrizationList child
+    # modules that PEFT doesn't support as LoRA targets.
+    exclude_modules = cfg.lora_exclude_modules or []


I'm not sure this config exists

NanoCode012 · 2026-02-10T07:31:14Z

+# the parameter names needed for `target_parameters` in BitsAndBytesConfig or for
+# post-load quantization via bitsandbytes.nn.parametrize.
+# Verified against transformers 5.0.0 source.
+MOE_EXPERT_PARAMS = {


I'm wondering whether there's a better way to handle this. This would require us to maintain this list. Could we use the dict above to detect whether it's a MoE model?

The keys for most of the MoE layers are mostly the same either way. Would there be potential conflicts if we default to all of them?

Co-authored-by: Wing Lian <wing.lian@gmail.com>

NanoCode012 · 2026-02-16T06:15:02Z

+            parts = name.split(".")
+            # Find the layer index (first numeric segment) and extract the
+            # repeating suffix after it.
+            # e.g. "model.layers.0.mlp.experts.gate_up_proj" -> "mlp.experts.gate_up_proj"


Maybe we should have some checks for the word "experts" or gate_up_proj / down_proj. They seem to be the common names used.
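
A sketch of how the suffix extraction and the suggested keyword check could fit together. The helper names and the keyword tuple are assumptions for illustration, not the code in this PR.

```python
# Names that commonly identify MoE expert weights (assumed list).
EXPERT_KEYWORDS = ("experts", "gate_up_proj", "down_proj")


def repeating_suffix(name):
    """Return the part of a parameter name after the first numeric segment.

    Returns None when the name has no layer index or nothing follows it.
    """
    parts = name.split(".")
    for i, part in enumerate(parts):
        if part.isdigit():
            suffix = ".".join(parts[i + 1:])
            return suffix or None
    return None


def looks_like_expert_param(name):
    """Check the suffix segments for one of the expert keywords."""
    suffix = repeating_suffix(name)
    if suffix is None:
        return False
    return any(segment in EXPERT_KEYWORDS for segment in suffix.split("."))


print(repeating_suffix("model.layers.0.mlp.experts.gate_up_proj"))
print(looks_like_expert_param("model.layers.0.mlp.experts.gate_up_proj"))
print(looks_like_expert_param("model.layers.0.self_attn.q_proj"))
```

A name check on its own would also match a dense `mlp.down_proj`, so it would presumably be combined with the tensor shape check rather than replace it.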

NanoCode012 · 2026-02-16T10:58:37Z

+            from axolotl.monkeypatch.moe_quant import quantize_moe_expert_params
+
+            self.model._moe_expert_param_names = find_moe_expert_param_names(self.model)
+            self.model._moe_experts_quantized = quantize_moe_expert_params(self.model)


This is called for LoRA, even though the inner function specifically calls replace_parameter_4bit. Do we need a dedicated replace_parameter_8bit for LoRA?

Narrowed the guard to load_in_4bit only now.

NanoCode012 · 2026-02-16T10:59:59Z

+  - v_proj
+  - k_proj
+  - o_proj
+


Maybe we should explicitly set the LoRA target parameters so it's clear what is being trained here.

It doesn't seem possible to not target those layers as well. It seems to always be on.

There are still some unclear points left.

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

lora_o_kernel: false

NanoCode012 · 2026-02-27T09:35:47Z

Mostly superseded by #3439. We'll move the LoRA-kernel-specific changes to a new PR.

ved1beta and others added 2 commits February 9, 2026 22:32

moe 4 bit quant for 3d packing , downstream

292818e

Merge branch 'main' into 3d_oom_tv5

8f81839

coderabbitai Bot reviewed Feb 9, 2026

NanoCode012 reviewed Feb 10, 2026

ved1beta added 2 commits February 10, 2026 14:37

exclude moe_params , + reviews

d8c0592

Merge branch '3d_oom_tv5' of github.com:ved1beta/axolotl into 3d_oom_tv5

ce1b473

NanoCode012 mentioned this pull request Feb 10, 2026

NCCL timeout when using lora_target_modules gate_proj, up_proj, down_proj with MOE models #3149

Open

8 tasks

ved1beta added 7 commits February 11, 2026 19:34

use targate parameters for moe

c69201e

patch with moe_quant revert

38f7987

adpter exclude modules

5a81c15

detected_expert_params

44eaef2

r".*\.parametrizations\..*"

4d46469

comment

1bab4c1

config

e97b14e

winglian reviewed Feb 12, 2026

Comment thread src/axolotl/loaders/adapter.py Outdated

ved1beta and others added 6 commits February 13, 2026 07:54

Update src/axolotl/loaders/adapter.py

13d85b6

Co-authored-by: Wing Lian <wing.lian@gmail.com>

Merge branch 'main' into 3d_oom_tv5

013d8f0

lint

c3c6893

fix: simplify defaults

d16a853

true

82dad00

Merge branch '3d_oom_tv5' of github.com:ved1beta/axolotl into HEAD

4d7da67

NanoCode012 reviewed Feb 16, 2026

used keywords exp_proj, down_proj, gate_proj

3da012f

NanoCode012 reviewed Feb 16, 2026

NanoCode012 previously approved these changes Feb 16, 2026

ved1beta and others added 2 commits February 16, 2026 16:47

Update examples/glm4.7/glm4.7-flash-qlora.yaml

41d30ff

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

Merge branch 'main' into 3d_oom_tv5

dc7caa1

ved1beta added 2 commits February 16, 2026 21:04

use lora_target_parameters

6c734e9

Merge branch '3d_oom_tv5' of github.com:ved1beta/axolotl into HEAD

004ba8f

ved1beta mentioned this pull request Feb 17, 2026

Feat:support lora_qkv_kernel lora_o_kernel for GLM4.7 flash #3418

Open

5 tasks

ved1beta and others added 3 commits February 18, 2026 09:47

rmv lora_qkv_kernel: false

2fe6f40

lora_o_kernel: false

support lora _o_proj

91b5aa7

chore: lint

e0b7e93

NanoCode012 mentioned this pull request Feb 26, 2026

Fix: quantize and target moe layers in transformers v5 for adapters and many misc fixes #3439

Merged

4 tasks

NanoCode012 closed this Feb 27, 2026

3 participants
