[Bugfix] Fix MLA weight access crash for quantized layers (NVFP4/INT4) #35576
lucaspirola wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces two important bug fixes for running quantized MLA models (NVFP4/INT4). The first change in mla_attention.py correctly handles layers that may not have a weight attribute by using getattr, preventing an AttributeError. It also adds a necessary check to cast input tensors only for float-like weight dtypes, which is crucial for integer-quantized layers. The second change in glm4_moe_lite.py adds a None check before adding a weight name to a set, preventing a TypeError during weight loading for shared experts. Both fixes are correct and directly address the described crashes, improving the robustness of model execution for quantized models.
Hi @lucaspirola, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Two fixes for crashes when running quantized MLA models:

1. mla_attention.py: Guard `self.kv_b_proj.weight` access — quantized layers (NVFP4 Marlin, INT4 compressed-tensors) store weights as `weight_packed` (int32), not `weight` (float). Use `getattr` with a fallback and only cast when the weight dtype is a float type.
2. glm4_moe_lite.py: Guard `loaded_params.add(name)` — `name` can be `None` during shared-expert weight loading, causing a crash.

Signed-off-by: Lucas Pirola <lucaspirola@users.noreply.github.com>
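The first guard can be sketched in isolation as follows. This is a minimal, dependency-free illustration of the pattern, not the actual vLLM code: `FakeTensor` stands in for a torch tensor, `FLOAT_LIKE` for the real torch dtype check, and `fp4_bmm_enabled` is a hypothetical name for the eligibility check.

```python
from types import SimpleNamespace

# Assumed stand-in for the float-like dtype check (BF16/FP16/FP8).
FLOAT_LIKE = {"bfloat16", "float16", "float8_e4m3fn"}

class FakeTensor:
    """Minimal stand-in for a torch tensor carrying only a dtype."""
    def __init__(self, dtype: str):
        self.dtype = dtype

def fp4_bmm_enabled(kv_b_proj) -> bool:
    # Quantized layers (Marlin NVFP4, compressed-tensors INT4) expose
    # weight_packed rather than weight, so a bare .weight access would
    # raise AttributeError. getattr(..., None) makes the access safe.
    weight = getattr(kv_b_proj, "weight", None)
    if weight is None:
        return False
    # Only enable the float path for float-like dtypes; int32-packed
    # storage must resolve to False without raising.
    return weight.dtype in FLOAT_LIKE

dense = SimpleNamespace(weight=FakeTensor("bfloat16"))
quantized = SimpleNamespace(weight_packed=FakeTensor("int32"))
print(fp4_bmm_enabled(dense))      # True
print(fp4_bmm_enabled(quantized))  # False
```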
68a836a to df6f557
…m-project#35576) When kv_b_proj is NVFP4-quantized, Marlin repacks weights to int32 storage and the .weight tensor's dtype is no longer bfloat16. Accessing .weight.dtype directly crashes with AttributeError. Use getattr(self.kv_b_proj, "weight", None) and guard the dtype check so that quantized projections safely resolve to is_aiter_triton_fp4_bmm_enabled=False without raising. Also guard loaded_params.add(name) in glm4_moe_lite.py against name=None to prevent TypeError when maybe_remap_kv_scale_name returns None.
…d auto-patching Mark PRs vllm-project#34822, vllm-project#35576, vllm-project#34577 as implemented (commits N1, N2, N3). Remove them from the "Critical Open PRs" section. Document that FlashInfer patches now run automatically at startup (Commit K rework) so the post-install script is no longer required.
Summary
Fix two crashes that prevent serving NVFP4 (modelopt) and INT4 (compressed-tensors) quantized MLA models such as GLM-4.7-Flash-REAP-23B-A3B:
- mla_attention.py: `self.kv_b_proj.weight.dtype` raises `AttributeError` when the layer uses quantized storage formats (e.g. Marlin stores `weight_packed` as int32, not `weight` as float). Fix: use `getattr(self.kv_b_proj, "weight", None)` and only cast when the weight is a float dtype (BF16/FP16/FP8).
- glm4_moe_lite.py: `loaded_params.add(name)` raises `TypeError` when `name is None` during shared-expert weight loading. Fix: add an `and name is not None` guard.

Reproduction
Crash 1 — mla_attention.py: Triggered during chunked prefill context computation when `kv_b_proj` uses Marlin (NVFP4) or compressed-tensors (INT4) quantization — these store weights as `weight_packed` (int32), not `weight` (float).

Crash 2 — glm4_moe_lite.py: Triggered during weight loading when shared-expert fusion produces `name=None` entries.

Test plan
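As a standalone illustration of the second guard (not a test of vLLM itself), the sketch below uses a hypothetical `maybe_remap` stand-in for `maybe_remap_kv_scale_name`, which per the PR description can return `None`; the weight names are invented for the example.

```python
def maybe_remap(name: str) -> "str | None":
    # Hypothetical stand-in: real maybe_remap_kv_scale_name returns
    # None when a scale name has no remapped target.
    return None if name.endswith("k_scale") else name

loaded_params: set = set()
for raw in ["model.layers.0.self_attn.k_scale",
            "model.layers.0.shared_experts.gate_proj.weight"]:
    name = maybe_remap(raw)
    # The added guard: never record a None name in loaded_params.
    if name is not None:
        loaded_params.add(name)

print(sorted(loaded_params))
# ['model.layers.0.shared_experts.gate_proj.weight']
```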
🤖 Generated with Claude Code