FIX: monkey patch bitsandbytes oom on v5 #3395
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis pull request adds post-load quantization support for Mixture of Experts (MoE) expert parameters in QLoRA configurations. It introduces a mapping of MoE architecture parameter names, excludes these modules from LoRA targeting, and applies 4-bit quantization to 3D expert tensors after model building to prevent out-of-memory issues with transformers v5. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~23 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/axolotl/loaders/adapter.py`:
- Around line 121-126: The code mutates cfg.lora_exclude_modules in place by
assigning exclude_modules = cfg.lora_exclude_modules or [] and then appending to
exclude_modules; change this to work on a shallow copy so the original config is
not modified: create exclude_modules = list(cfg.lora_exclude_modules) if
cfg.lora_exclude_modules else [] (or use .copy()) before iterating over
MOE_EXPERT_PARAMS[cfg.model_config_type] / expert_param_names and appending,
ensuring no in-place modification of cfg.lora_exclude_modules in the logic
around exclude_modules and model_config_type.
- Around line 118-133: Only exclude MOE expert param names when they will
actually be wrapped by ParametrizationList (i.e., when MoE expert quantization
is enabled); change the unconditional block that appends
MOE_EXPERT_PARAMS[cfg.model_config_type] to exclude_modules to be guarded by the
quantization flag (e.g., cfg.quantize_moe_expert_params or the same condition
used for QLoRA/load_in_4bit), so that the code that builds exclude_modules
before constructing LoraConfig only adds expert_param_names when
quantize_moe_expert_params is true.
In `@src/axolotl/monkeypatch/moe_quant.py`:
- Line 31: Rename the unused loop variable module_name in the for loop over
model.named_modules() to a throwaway name (e.g., _module_name or _) to satisfy
the linter; update the loop header "for module_name, module in
model.named_modules()" to use the new unused-variable name while leaving the
used variable module intact so the rest of the body (which references module)
continues to work.
🧹 Nitpick comments (1)
src/axolotl/monkeypatch/moe_quant.py (1)
20-42: Quant parameters are hardcoded and may diverge from user's BitsAndBytesConfig.
quant_typedefaults to"nf4"andcompress_statisticstoTrue, but the user may have configured different values (e.g.,"fp4", orbnb_4bit_use_double_quant: false). The caller inmodel.py(line 197) doesn't forward these settings, so expert params could be quantized with different options than the rest of the model.Consider reading
quant_typeandcompress_statisticsfrom the model's existingBitsAndBytesConfigand passing them through.
| # Exclude ParametrizationList modules created by MoE expert quantization. | ||
| # replace_parameter_4bit wraps quantized params in ParametrizationList child | ||
| # modules that PEFT doesn't support as LoRA targets. | ||
| exclude_modules = cfg.lora_exclude_modules or [] | ||
| if cfg.model_config_type in MOE_EXPERT_PARAMS: | ||
| expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type] | ||
| for name in expert_param_names: | ||
| if name not in exclude_modules: | ||
| exclude_modules.append(name) | ||
|
|
||
| lora_config = LoraConfig( | ||
| r=cfg.lora_r, | ||
| lora_alpha=cfg.lora_alpha, | ||
| target_modules=lora_target_modules, | ||
| target_parameters=lora_target_parameters, | ||
| exclude_modules=exclude_modules if exclude_modules else None, |
There was a problem hiding this comment.
MoE expert params are excluded even for non-quantized LoRA, which may be unintended.
The exclusion is applied unconditionally whenever cfg.model_config_type in MOE_EXPERT_PARAMS, but the ParametrizationList wrappers that motivate this exclusion are only created when quantize_moe_expert_params runs (i.e., QLoRA + load_in_4bit). For plain LoRA without quantization, these params are normal nn.Parameter tensors and excluding them would unnecessarily prevent LoRA from targeting those modules.
Consider guarding the exclusion:
Proposed fix
exclude_modules = cfg.lora_exclude_modules or []
- if cfg.model_config_type in MOE_EXPERT_PARAMS:
+ if cfg.model_config_type in MOE_EXPERT_PARAMS and cfg.adapter == "qlora" and cfg.load_in_4bit:
expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type]
for name in expert_param_names:
if name not in exclude_modules:
exclude_modules.append(name)🤖 Prompt for AI Agents
In `@src/axolotl/loaders/adapter.py` around lines 118 - 133, Only exclude MOE
expert param names when they will actually be wrapped by ParametrizationList
(i.e., when MoE expert quantization is enabled); change the unconditional block
that appends MOE_EXPERT_PARAMS[cfg.model_config_type] to exclude_modules to be
guarded by the quantization flag (e.g., cfg.quantize_moe_expert_params or the
same condition used for QLoRA/load_in_4bit), so that the code that builds
exclude_modules before constructing LoraConfig only adds expert_param_names when
quantize_moe_expert_params is true.
| exclude_modules = cfg.lora_exclude_modules or [] | ||
| if cfg.model_config_type in MOE_EXPERT_PARAMS: | ||
| expert_param_names = MOE_EXPERT_PARAMS[cfg.model_config_type] | ||
| for name in expert_param_names: | ||
| if name not in exclude_modules: | ||
| exclude_modules.append(name) |
There was a problem hiding this comment.
Mutating cfg.lora_exclude_modules list in place.
If cfg.lora_exclude_modules returns a mutable list, exclude_modules = cfg.lora_exclude_modules or [] aliases it, and then .append(name) mutates the original config. This could cause surprising side effects if load_lora is called multiple times (e.g., duplicated entries, or the config object retaining modifications).
Proposed fix — copy the list
- exclude_modules = cfg.lora_exclude_modules or []
+ exclude_modules = list(cfg.lora_exclude_modules or [])🤖 Prompt for AI Agents
In `@src/axolotl/loaders/adapter.py` around lines 121 - 126, The code mutates
cfg.lora_exclude_modules in place by assigning exclude_modules =
cfg.lora_exclude_modules or [] and then appending to exclude_modules; change
this to work on a shallow copy so the original config is not modified: create
exclude_modules = list(cfg.lora_exclude_modules) if cfg.lora_exclude_modules
else [] (or use .copy()) before iterating over
MOE_EXPERT_PARAMS[cfg.model_config_type] / expert_param_names and appending,
ensuring no in-place modification of cfg.lora_exclude_modules in the logic
around exclude_modules and model_config_type.
| return | ||
|
|
||
| count = 0 | ||
| for module_name, module in model.named_modules(): |
There was a problem hiding this comment.
Fix pipeline failure: rename unused loop variable.
The CI lint fails because module_name is not used in the loop body.
Proposed fix
- for module_name, module in model.named_modules():
+ for _module_name, module in model.named_modules():📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for module_name, module in model.named_modules(): | |
| for _module_name, module in model.named_modules(): |
🧰 Tools
🪛 GitHub Actions: lint
[error] 31-31: B007 Loop control variable module_name not used within loop body. Rename unused module_name to _module_name.
🪛 Ruff (0.14.14)
[warning] 31-31: Loop control variable module_name not used within loop body
Rename unused module_name to _module_name
(B007)
🤖 Prompt for AI Agents
In `@src/axolotl/monkeypatch/moe_quant.py` at line 31, Rename the unused loop
variable module_name in the for loop over model.named_modules() to a throwaway
name (e.g., _module_name or _) to satisfy the linter; update the loop header
"for module_name, module in model.named_modules()" to use the new
unused-variable name while leaving the used variable module intact so the rest
of the body (which references module) continues to work.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| for param_name in target_params: | ||
| if hasattr(module, param_name): | ||
| param = getattr(module, param_name) | ||
| if isinstance(param, torch.nn.Parameter) and param.ndim >= 2: |
There was a problem hiding this comment.
Any specific reason for param.ndim >= 2:?
| # Quantize MoE expert weights immediately after model build. | ||
| # In transformers v5, MoE expert weights are 3D nn.Parameter tensors that | ||
| # BnB quantization skips (it only handles nn.Linear). This causes OOM because | ||
| # expert weights stay in full precision. We quantize them here before any other | ||
| # operations that need GPU memory (like prepare_model_for_kbit_training). | ||
| if ( | ||
| self.cfg.adapter == "qlora" | ||
| and self.cfg.load_in_4bit | ||
| and self.cfg.model_config_type in MOE_EXPERT_PARAMS | ||
| ): | ||
| import inspect | ||
|
|
||
| bnb_config_params = inspect.signature( | ||
| BitsAndBytesConfig.__init__ | ||
| ).parameters | ||
| if "target_parameters" not in bnb_config_params: | ||
| from axolotl.monkeypatch.moe_quant import ( | ||
| quantize_moe_expert_params, | ||
| ) | ||
|
|
||
| quantize_moe_expert_params(self.model, self.cfg.model_config_type) |
There was a problem hiding this comment.
Would regular bnb lora also have this issue?
| # Exclude ParametrizationList modules created by MoE expert quantization. | ||
| # replace_parameter_4bit wraps quantized params in ParametrizationList child | ||
| # modules that PEFT doesn't support as LoRA targets. | ||
| exclude_modules = cfg.lora_exclude_modules or [] |
There was a problem hiding this comment.
I'm not sure this config exists
| # the parameter names needed for `target_parameters` in BitsAndBytesConfig or for | ||
| # post-load quantization via bitsandbytes.nn.parametrize. | ||
| # Verified against transformers 5.0.0 source. | ||
| MOE_EXPERT_PARAMS = { |
There was a problem hiding this comment.
I'm wondering whether there's a better way to handle this. This would require us to maintain this list. Could we use the dict above to detect whether it's a moe model?
The keys for most of the moe layers are the mostly the same either way. Would there be potential conflicts if we default to all of them?
Co-authored-by: Wing Lian <wing.lian@gmail.com>
| parts = name.split(".") | ||
| # Find the layer index (first numeric segment) and extract the | ||
| # repeating suffix after it. | ||
| # e.g. "model.layers.0.mlp.experts.gate_up_proj" -> "mlp.experts.gate_up_proj" |
There was a problem hiding this comment.
Maybe we should have some checks for the word "experts" or gate_up_proj / down_proj. They seem to be the common names used.
| from axolotl.monkeypatch.moe_quant import quantize_moe_expert_params | ||
|
|
||
| self.model._moe_expert_param_names = find_moe_expert_param_names(self.model) | ||
| self.model._moe_experts_quantized = quantize_moe_expert_params(self.model) |
There was a problem hiding this comment.
This is called for LoRa, despite the inner function calling replace_parameter_4bit specifically. Do we need to have a specific replace_parameter_8bit for lora ?
There was a problem hiding this comment.
narrowed the guard to load_in_4bit only now
| - v_proj | ||
| - k_proj | ||
| - o_proj | ||
|
|
There was a problem hiding this comment.
Maybe we should explicitly set the lora target parameters so it's clear that it's being trained on here.
It doesn't seem possible to Not target those layers as well.It seems to always be on
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
lora_o_kernel: false
|
Mostly superseded by #3439 . We'll move the lora kernel specific changes to a new PR |
usee targate_params
to detect lora params
#3418
#3374