FIX: monkey patch bitsandbytes oom on v5 #3395

Closed
ved1beta wants to merge 25 commits into axolotl-ai-cloud:main from ved1beta:3d_oom_tv5

Conversation

@ved1beta (Member) commented Feb 9, 2026

Use target_params to detect LoRA params.

#3418
#3374

@coderabbitai (Bot, Contributor) commented Feb 9, 2026

Walkthrough

This pull request adds post-load quantization support for Mixture of Experts (MoE) expert parameters in QLoRA configurations. It introduces a mapping of MoE architecture parameter names, excludes these modules from LoRA targeting, and applies 4-bit quantization to 3D expert tensors after model building to prevent out-of-memory issues with transformers v5.
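
As a rough illustration of why leaving the 3D expert tensors unquantized matters, the arithmetic below compares storage for one expert tensor at 16 bits and at 4 bits per element. The shape is a made-up example, not taken from any model in this PR, and the 4-bit figure ignores the small overhead of quantization statistics.

```python
from math import prod


def tensor_bytes(shape, bits_per_element):
    """Storage for a dense tensor of the given shape, in bytes."""
    return prod(shape) * bits_per_element // 8


# Hypothetical 3D expert weight: (num_experts, hidden_size, intermediate_size).
example_shape = (64, 2048, 1024)

bf16_bytes = tensor_bytes(example_shape, 16)
nf4_bytes = tensor_bytes(example_shape, 4)  # excludes quantization statistics
savings_ratio = bf16_bytes / nf4_bytes
```

The ratio is independent of the chosen shape; only the absolute sizes depend on it.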

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| MoE Architecture Parameter Mapping | src/axolotl/common/architectures.py | Adds MOE_EXPERT_PARAMS constant mapping MoE architecture names to lists of parameter name groups identifying 3D nn.Parameter tensors requiring special quantization handling. |
| LoRA Configuration Updates | src/axolotl/loaders/adapter.py | Imports MOE_EXPERT_PARAMS and augments LoRA configuration to exclude MoE expert modules from LoRA targeting via exclude_modules parameter passed to LoraConfig. |
| Model Loading & Quantization Trigger | src/axolotl/loaders/model.py | Imports MOE_EXPERT_PARAMS and conditionally invokes MoE expert quantization after model building when using QLoRA with 4-bit loading on supported architectures, with runtime detection of BitsAndBytesConfig capabilities. |
| Post-Load MoE Quantization Logic | src/axolotl/monkeypatch/moe_quant.py | New module providing quantize_moe_expert_params() function that applies bitsandbytes 4-bit quantization to 3D expert parameter tensors identified by architecture-specific parameter name mappings. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~23 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly addresses the PR's primary objective (patching bitsandbytes OOM issues on Transformers v5), which aligns with the core changes in model.py, adapter.py, and the new moe_quant.py module. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #3374's requirements: quantizes MoE expert parameters to free GPU memory before PEFT setup, excludes expert modules from LoRA to prevent ParametrizationList wrapping, and maps MoE architectures with their parameter patterns. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to resolving the OOM issue: introducing MOE_EXPERT_PARAMS mapping, adding quantization logic in model.py, configuring LoRA exclusions in adapter.py, and implementing the quantize_moe_expert_params function. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/axolotl/loaders/adapter.py`:
- Around line 121-126: The code mutates cfg.lora_exclude_modules in place by
assigning exclude_modules = cfg.lora_exclude_modules or [] and then appending to
exclude_modules; change this to work on a shallow copy so the original config is
not modified: create exclude_modules = list(cfg.lora_exclude_modules) if
cfg.lora_exclude_modules else [] (or use .copy()) before iterating over
MOE_EXPERT_PARAMS[cfg.model_config_type] / expert_param_names and appending,
ensuring no in-place modification of cfg.lora_exclude_modules in the logic
around exclude_modules and model_config_type.
- Around line 118-133: Only exclude MOE expert param names when they will
actually be wrapped by ParametrizationList (i.e., when MoE expert quantization
is enabled); change the unconditional block that appends
MOE_EXPERT_PARAMS[cfg.model_config_type] to exclude_modules to be guarded by the
quantization flag (e.g., cfg.quantize_moe_expert_params or the same condition
used for QLoRA/load_in_4bit), so that the code that builds exclude_modules
before constructing LoraConfig only adds expert_param_names when
quantize_moe_expert_params is true.

In `@src/axolotl/monkeypatch/moe_quant.py`:
- Line 31: Rename the unused loop variable module_name in the for loop over
model.named_modules() to a throwaway name (e.g., _module_name or _) to satisfy
the linter; update the loop header "for module_name, module in
model.named_modules()" to use the new unused-variable name while leaving the
used variable module intact so the rest of the body (which references module)
continues to work.
🧹 Nitpick comments (1)
src/axolotl/monkeypatch/moe_quant.py (1)

20-42: Quant parameters are hardcoded and may diverge from user's BitsAndBytesConfig.

quant_type defaults to "nf4" and compress_statistics to True, but the user may have configured different values (e.g., "fp4", or bnb_4bit_use_double_quant: false). The caller in model.py (line 197) doesn't forward these settings, so expert params could be quantized with different options than the rest of the model.

Consider reading quant_type and compress_statistics from the model's existing BitsAndBytesConfig and passing them through.
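
One way the suggestion above might look, as a minimal sketch: `quant_kwargs_from_bnb_config` is a hypothetical helper (not part of this PR), and a `SimpleNamespace` stands in for the real config object. It assumes the config exposes `bnb_4bit_quant_type` and `bnb_4bit_use_double_quant`, and it falls back to the defaults mentioned in the comment ("nf4", True) when they are absent.

```python
from types import SimpleNamespace


def quant_kwargs_from_bnb_config(bnb_config, default_quant_type="nf4", default_compress=True):
    """Hypothetical helper: derive expert-quantization options from a BnB-style config."""
    if bnb_config is None:
        return {"quant_type": default_quant_type, "compress_statistics": default_compress}
    return {
        # Attribute names are assumed to match the user's 4-bit settings.
        "quant_type": getattr(bnb_config, "bnb_4bit_quant_type", default_quant_type),
        "compress_statistics": getattr(
            bnb_config, "bnb_4bit_use_double_quant", default_compress
        ),
    }


# Stand-in for a user config that asks for fp4 without double quantization.
user_config = SimpleNamespace(bnb_4bit_quant_type="fp4", bnb_4bit_use_double_quant=False)
kwargs = quant_kwargs_from_bnb_config(user_config)
```

The resulting dict could then be forwarded by the caller so that expert params and the rest of the model share the same quantization options.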

Comment thread src/axolotl/loaders/adapter.py Outdated
Comment on lines +118 to +133
# Exclude ParametrizationList modules created by MoE expert quantization.
# replace_parameter_4bit wraps quantized params in ParametrizationList child
# modules that PEFT doesn't support as LoRA targets.
exclude_modules = cfg.lora_exclude_modules or []
if cfg.model_config_type in MOE_EXPERT_PARAMS:
    expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
    for name in expert_param_names:
        if name not in exclude_modules:
            exclude_modules.append(name)

lora_config = LoraConfig(
    r=cfg.lora_r,
    lora_alpha=cfg.lora_alpha,
    target_modules=lora_target_modules,
    target_parameters=lora_target_parameters,
    exclude_modules=exclude_modules if exclude_modules else None,

Contributor

⚠️ Potential issue | 🟠 Major

MoE expert params are excluded even for non-quantized LoRA, which may be unintended.

The exclusion is applied unconditionally whenever cfg.model_config_type in MOE_EXPERT_PARAMS, but the ParametrizationList wrappers that motivate this exclusion are only created when quantize_moe_expert_params runs (i.e., QLoRA + load_in_4bit). For plain LoRA without quantization, these params are normal nn.Parameter tensors and excluding them would unnecessarily prevent LoRA from targeting those modules.

Consider guarding the exclusion:

Proposed fix
     exclude_modules = cfg.lora_exclude_modules or []
-    if cfg.model_config_type in MOE_EXPERT_PARAMS:
+    if cfg.model_config_type in MOE_EXPERT_PARAMS and cfg.adapter == "qlora" and cfg.load_in_4bit:
         expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
         for name in expert_param_names:
             if name not in exclude_modules:
                 exclude_modules.append(name)

Comment thread src/axolotl/loaders/adapter.py Outdated
Comment on lines +121 to +126
exclude_modules = cfg.lora_exclude_modules or []
if cfg.model_config_type in MOE_EXPERT_PARAMS:
    expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
    for name in expert_param_names:
        if name not in exclude_modules:
            exclude_modules.append(name)

Contributor

⚠️ Potential issue | 🟡 Minor

Mutating cfg.lora_exclude_modules list in place.

If cfg.lora_exclude_modules returns a mutable list, exclude_modules = cfg.lora_exclude_modules or [] aliases it, and then .append(name) mutates the original config. This could cause surprising side effects if load_lora is called multiple times (e.g., duplicated entries, or the config object retaining modifications).

Proposed fix — copy the list
-    exclude_modules = cfg.lora_exclude_modules or []
+    exclude_modules = list(cfg.lora_exclude_modules or [])
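
The two adapter.py comments (guard the exclusion, copy the list) can be combined in one place. The sketch below is only an illustration under assumptions: `build_exclude_modules` is a hypothetical helper, the `MOE_EXPERT_PARAMS` entries are placeholders, and a `SimpleNamespace` stands in for the real config object.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the real mapping; the entries are illustrative only.
MOE_EXPERT_PARAMS = {"example_moe": ["gate_up_proj", "down_proj"]}


def build_exclude_modules(cfg):
    """Return a fresh exclusion list without touching cfg.lora_exclude_modules."""
    # Copy first so later appends never alias the list stored on the config.
    exclude_modules = list(cfg.lora_exclude_modules or [])
    quantized_experts = (
        cfg.model_config_type in MOE_EXPERT_PARAMS
        and cfg.adapter == "qlora"
        and cfg.load_in_4bit
    )
    if quantized_experts:
        for name in MOE_EXPERT_PARAMS[cfg.model_config_type]:
            if name not in exclude_modules:
                exclude_modules.append(name)
    return exclude_modules or None


cfg = SimpleNamespace(
    model_config_type="example_moe",
    adapter="qlora",
    load_in_4bit=True,
    lora_exclude_modules=["lm_head"],
)
first = build_exclude_modules(cfg)
second = build_exclude_modules(cfg)  # repeated calls give the same result
```

Because the helper works on a copy, calling it more than once leaves `cfg.lora_exclude_modules` unchanged, and plain LoRA configs keep only the user-supplied exclusions.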

Comment thread src/axolotl/monkeypatch/moe_quant.py Outdated
return

count = 0
for module_name, module in model.named_modules():

Contributor

⚠️ Potential issue | 🟡 Minor

Fix pipeline failure: rename unused loop variable.

The CI lint fails because module_name is not used in the loop body.

Proposed fix
-    for module_name, module in model.named_modules():
+    for _module_name, module in model.named_modules():
🧰 Tools
🪛 GitHub Actions: lint

[error] 31-31: B007 Loop control variable module_name not used within loop body. Rename unused module_name to _module_name.

🪛 Ruff (0.14.14)

[warning] 31-31: Loop control variable module_name not used within loop body

Rename unused module_name to _module_name

(B007)

@codecov (Bot) commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 18.82353% with 69 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/monkeypatch/moe_quant.py | 0.00% | 36 Missing ⚠️ |
| src/axolotl/monkeypatch/lora_kernels.py | 0.00% | 17 Missing ⚠️ |
| src/axolotl/loaders/adapter.py | 52.17% | 11 Missing ⚠️ |
| src/axolotl/loaders/model.py | 44.44% | 5 Missing ⚠️ |

Comment thread src/axolotl/monkeypatch/moe_quant.py Outdated
for param_name in target_params:
    if hasattr(module, param_name):
        param = getattr(module, param_name)
        if isinstance(param, torch.nn.Parameter) and param.ndim >= 2:

Collaborator

Any specific reason for param.ndim >= 2?

Comment thread src/axolotl/loaders/model.py Outdated
Comment on lines +177 to +197
# Quantize MoE expert weights immediately after model build.
# In transformers v5, MoE expert weights are 3D nn.Parameter tensors that
# BnB quantization skips (it only handles nn.Linear). This causes OOM because
# expert weights stay in full precision. We quantize them here before any other
# operations that need GPU memory (like prepare_model_for_kbit_training).
if (
    self.cfg.adapter == "qlora"
    and self.cfg.load_in_4bit
    and self.cfg.model_config_type in MOE_EXPERT_PARAMS
):
    import inspect

    bnb_config_params = inspect.signature(
        BitsAndBytesConfig.__init__
    ).parameters
    if "target_parameters" not in bnb_config_params:
        from axolotl.monkeypatch.moe_quant import (
            quantize_moe_expert_params,
        )

        quantize_moe_expert_params(self.model, self.cfg.model_config_type)
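
The capability check in the hunk above relies on `inspect.signature`. A minimal, self-contained sketch of that pattern follows; `OldConfig` and `NewConfig` are hypothetical stand-ins for two versions of a config class, not the real BitsAndBytesConfig.

```python
import inspect


class OldConfig:
    """Stand-in for a config class that predates target_parameters."""

    def __init__(self, load_in_4bit=False):
        self.load_in_4bit = load_in_4bit


class NewConfig:
    """Stand-in for a config class that accepts target_parameters."""

    def __init__(self, load_in_4bit=False, target_parameters=None):
        self.load_in_4bit = load_in_4bit
        self.target_parameters = target_parameters


def accepts_parameter(cls, name):
    """True if cls.__init__ declares a parameter with the given name."""
    return name in inspect.signature(cls.__init__).parameters


# The post-load fallback is only needed when the keyword is not supported.
needs_fallback_old = not accepts_parameter(OldConfig, "target_parameters")
needs_fallback_new = not accepts_parameter(NewConfig, "target_parameters")
```

Note that a signature check like this would not see a keyword accepted only through `**kwargs`, so it is a heuristic rather than a guarantee.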

Collaborator

Would regular bnb lora also have this issue?

Comment thread src/axolotl/loaders/adapter.py Outdated
# Exclude ParametrizationList modules created by MoE expert quantization.
# replace_parameter_4bit wraps quantized params in ParametrizationList child
# modules that PEFT doesn't support as LoRA targets.
exclude_modules = cfg.lora_exclude_modules or []

Collaborator

I'm not sure this config exists

Comment thread src/axolotl/common/architectures.py Outdated
# the parameter names needed for `target_parameters` in BitsAndBytesConfig or for
# post-load quantization via bitsandbytes.nn.parametrize.
# Verified against transformers 5.0.0 source.
MOE_EXPERT_PARAMS = {

Collaborator

I'm wondering whether there's a better way to handle this. This would require us to maintain this list. Could we use the dict above to detect whether it's a moe model?

The keys for most of the MoE layers are mostly the same either way. Would there be potential conflicts if we default to all of them?

Comment thread src/axolotl/loaders/adapter.py Outdated
Comment on lines +85 to +88
parts = name.split(".")
# Find the layer index (first numeric segment) and extract the
# repeating suffix after it.
# e.g. "model.layers.0.mlp.experts.gate_up_proj" -> "mlp.experts.gate_up_proj"

Collaborator

Maybe we should have some checks for the word "experts" or gate_up_proj / down_proj. They seem to be the common names used.
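
A minimal sketch of the suffix extraction described in the hunk's comments, with the suggested "experts" check added. Both function names are hypothetical and this is not the PR's implementation; it only requires an `experts` segment, on the assumption that names such as down_proj can also occur in dense MLP layers.

```python
def repeating_suffix(param_name):
    """Return the part of a dotted parameter name after the first numeric segment."""
    parts = param_name.split(".")
    for index, part in enumerate(parts):
        if part.isdigit():
            return ".".join(parts[index + 1:])
    return None  # no layer index found


def looks_like_expert_param(param_name):
    """Heuristic: the per-layer suffix contains an 'experts' segment."""
    suffix = repeating_suffix(param_name)
    if suffix is None:
        return False
    return "experts" in suffix.split(".")


suffix = repeating_suffix("model.layers.0.mlp.experts.gate_up_proj")
```

Matching on whole dotted segments, rather than substrings, avoids accidental hits on names that merely contain the word.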

from axolotl.monkeypatch.moe_quant import quantize_moe_expert_params

self.model._moe_expert_param_names = find_moe_expert_param_names(self.model)
self.model._moe_experts_quantized = quantize_moe_expert_params(self.model)

Collaborator

This is called for LoRA, despite the inner function calling replace_parameter_4bit specifically. Do we need a specific replace_parameter_8bit for LoRA?

Member Author

narrowed the guard to load_in_4bit only now

NanoCode012 previously approved these changes Feb 16, 2026
Comment thread examples/glm4.7/glm4.7-flash-qlora.yaml Outdated
- v_proj
- k_proj
- o_proj

@NanoCode012 NanoCode012 Feb 16, 2026

Collaborator

Maybe we should explicitly set the LoRA target parameters so it's clear that they are being trained here.

It also doesn't seem possible to not target those layers; they seem to always be on.

Member Author

done !

@NanoCode012 NanoCode012 dismissed their stale review February 16, 2026 11:02

Still unclear points left

ved1beta and others added 2 commits February 16, 2026 16:47
@NanoCode012 (Collaborator)

Mostly superseded by #3439. We'll move the LoRA-kernel-specific changes to a new PR.
