moe quant patch for merge miss match#3483
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe changes enhance MOE quantization and PEFT target parameter matching by introducing deterministic parameter sorting, conditional patching logic, and comprehensive tests to ensure consistent ParamWrapper nesting behavior during training and merging operations. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can suggest fixes for GitHub Check annotations.Configure the |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utils/schemas/validation/test_moe_quant.py`:
- Around line 279-281: Mypy complains about the dynamic attribute
_axolotl_patched on the function patch_peft_target_parameters_matching; fix it
by adding a mypy ignore for undefined attributes on the assignment/clear sites:
when setting patch_peft_target_parameters_matching._axolotl_patched = True and
when clearing it (patch_peft_target_parameters_matching._axolotl_patched =
False) add a trailing comment "# type: ignore[attr-defined]". Apply the same
ignore at every place this dynamic attribute is assigned (the earlier set and
the finally/cleanup clear) so mypy stops reporting the attribute-defined error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d27ce30a-15f3-4581-8dab-11034b582189
📒 Files selected for processing (3)
src/axolotl/loaders/patch_manager.pysrc/axolotl/monkeypatch/moe_quant.pytests/utils/schemas/validation/test_moe_quant.py
| finally: | ||
| BaseTuner._inject_parameters = original | ||
| patch_peft_target_parameters_matching._axolotl_patched = False |
There was a problem hiding this comment.
Fix mypy type error for dynamic function attribute.
The pipeline is failing due to mypy not recognizing the dynamically-set _axolotl_patched attribute on the function. This pattern is used in both line 156 and line 281.
🔧 Proposed fix using type: ignore comment
finally:
BaseTuner._inject_parameters = original
- patch_peft_target_parameters_matching._axolotl_patched = False
+ patch_peft_target_parameters_matching._axolotl_patched = False # type: ignore[attr-defined]Apply the same fix at line 156:
finally:
BaseTuner._inject_parameters = original
- patch_peft_target_parameters_matching._axolotl_patched = False
+ patch_peft_target_parameters_matching._axolotl_patched = False # type: ignore[attr-defined]And at line 423:
finally:
BaseTuner._inject_parameters = original_inject
- patch_peft_target_parameters_matching._axolotl_patched = False
+ patch_peft_target_parameters_matching._axolotl_patched = False # type: ignore[attr-defined]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| finally: | |
| BaseTuner._inject_parameters = original | |
| patch_peft_target_parameters_matching._axolotl_patched = False | |
| finally: | |
| BaseTuner._inject_parameters = original | |
| patch_peft_target_parameters_matching._axolotl_patched = False # type: ignore[attr-defined] |
🧰 Tools
🪛 GitHub Actions: lint
[error] 281-281: mypy error: 'Callable[[], Any]' has no attribute '_axolotl_patched' (attr-defined).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/utils/schemas/validation/test_moe_quant.py` around lines 279 - 281,
Mypy complains about the dynamic attribute _axolotl_patched on the function
patch_peft_target_parameters_matching; fix it by adding a mypy ignore for
undefined attributes on the assignment/clear sites: when setting
patch_peft_target_parameters_matching._axolotl_patched = True and when clearing
it (patch_peft_target_parameters_matching._axolotl_patched = False) add a
trailing comment "# type: ignore[attr-defined]". Apply the same ignore at every
place this dynamic attribute is assigned (the earlier set and the
finally/cleanup clear) so mypy stops reporting the attribute-defined error.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| replace_parameter_8bit(mod, pname) | ||
| _moe_load_state["count"] += 1 | ||
|
|
||
| # Release the bf16 tensor so CUDA memory is freed immediately. |
There was a problem hiding this comment.
Let's keep this comment. It's good to keep this note as it's the change that reduces vram cost in case we refactor in future.
| """Patch transformers weight loading to quantize MoE expert params on-the-fly. | ||
|
|
||
| Also patches PEFT's _inject_parameters whenever lora_target_parameters is set | ||
| (even without quantize_moe_experts) to ensure consistent ParamWrapper nesting | ||
| order between training and merge, preventing adapter key mismatches. | ||
| """ |
There was a problem hiding this comment.
Let's simplify the comments here
|
|
||
| @torch.no_grad() | ||
| def forward(self, quantized_param: torch.Tensor) -> torch.Tensor: | ||
| # Flatten 3D+ to 2D for BnB's dequant, then reshape back. |
There was a problem hiding this comment.
Let's keep this. Maybe instead of as a comment, can use as fn docstring
| module, param_name, Bnb8bitParametrization(row_stats), unsafe=True | ||
| ) | ||
|
|
||
| # Cache dequantized values during forward to avoid redundant dequantization. |
| # Sequential loading ensures only ONE bf16 expert tensor is on-GPU at a time. | ||
| # Force sequential tensor loading so we can quantize-and-free one expert at a time. | ||
| # Without this, transformers pre-fetches all bf16 expert tensors to GPU simultaneously. | ||
| os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1" |
There was a problem hiding this comment.
Unrelated to this PR but found a user saying that this is useful to also have on for QLoRA in general
| """Fix PEFT's _inject_parameters for suffix matching and portable adapter ordering. | ||
|
|
||
| 1. Expands short suffix targets (e.g. "mlp.experts.gate_up_proj") to full module | ||
| paths so the parametrized branch can match them. | ||
|
|
||
| 2. Makes the parametrized branch iterate module.parametrizations in insertion order | ||
| instead of PEFT's sorted(target_names), matching the standard branch. This ensures | ||
| adapters saved during training load correctly with vanilla PEFT, vLLM, and other | ||
| tools without requiring this patch. | ||
| """ |
| from peft.utils.integrations import init_empty_weights | ||
| from peft.utils.other import _get_submodules | ||
|
|
||
| def _patched_inject_parameters( |
There was a problem hiding this comment.
Could you make sure to manually review this fn changes incase it introduce some edge case issue?
The code I provided for this was generated without me verifying.
The concept would be: copy the upstream peft fn and just remove the sorted path to reuse the target_modules insert order flow.
NanoCode012
left a comment
There was a problem hiding this comment.
Could we also add a e2e test that trains an adapter (a few steps), then attempt to merge and ensure it doesn't fail?
|
Just tried this and still got error. |
|
you tried on the latest commit right ?? , i remember working fine on my end 🤔 |
|
Was using the below repo / branch. git clone https://github.com/ved1beta/axolotl |
|
dw , looking into it |
|
some uncommited changes like always 🤕 , it works now thannks for reporting @zerofata |
Description
Training with quantize_moe_experts=true + two lora_target_parameters on the same expert module (e.g. mlp.experts.gate_up_proj and mlp.experts.down_proj) produces a size mismatch when merging the adapter back.
patch_peft_target_parameters_matching() (moe_quant.py:234)
The existing PEFT patch now wraps the original_inject call inside _sorted_named_params_ctx(), so both training and merge paths always process parameters in the same alphabetical order → same nesting → consistent adapter keys.
Motivation and Context
issue reported on discord https://discord.com/channels/1104757954588196865/1111279858136383509/1480085723733561344
How has this been tested?
includes
test_adapter_save_load_roundtrip_no_size_mismatchAI Usage Disclaimer
claude wrote tests
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests