[Bugfix] Fix MLA weight access crash for quantized layers (NVFP4/INT4)#35576

Open
lucaspirola wants to merge 2 commits into vllm-project:main from lucaspirola:fix/glm4-mla-weight-guards

Conversation

@lucaspirola

Summary

Fix two crashes that prevent serving NVFP4 (modelopt) and INT4 (compressed-tensors) quantized MLA models such as GLM-4.7-Flash-REAP-23B-A3B:

  • mla_attention.py: self.kv_b_proj.weight.dtype raises AttributeError when the layer uses quantized storage formats (e.g. Marlin stores weight_packed as int32, not weight as float). Fix: use getattr(self.kv_b_proj, "weight", None) and only cast when the weight is a float dtype (BF16/FP16/FP8).
  • glm4_moe_lite.py: loaded_params.add(name) raises TypeError when name is None during shared-expert weight loading. Fix: add a name is not None guard before the add.

Reproduction

# Quantize GLM-4.7-Flash-REAP-23B-A3B to NVFP4 with nvidia-modelopt 0.41.0, then:
vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
    --trust-remote-code --dtype bfloat16 --quantization modelopt \
    --enforce-eager --max-model-len 4928
# Crashes on model loading without these patches

Crash 1 — mla_attention.py:

AttributeError: 'MarlinLinearMethod' object has no attribute 'weight'

Triggered during chunked prefill context computation when kv_b_proj uses Marlin (NVFP4) or compressed-tensors (INT4) quantization — these store weights as weight_packed (int32), not weight (float).

Crash 2 — glm4_moe_lite.py:

TypeError: unhashable type: 'NoneType'

Triggered during weight loading when shared-expert fusion produces name=None entries.
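A sketch of the second guard (the helper name is illustrative; in the PR the check is inline in glm4_moe_lite.py's weight-loading loop):

```python
def record_loaded(loaded_params: set, name) -> None:
    # Remapping helpers (e.g. maybe_remap_kv_scale_name) can return None
    # for entries produced by shared-expert fusion; skip those so weight
    # loading does not crash on loaded_params.add(name).
    if name is not None:
        loaded_params.add(name)
```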

Test plan

  • Tested with GLM-4.7-Flash-REAP-23B-A3B quantized to NVFP4 (modelopt 0.41.0) on RTX 5080 (SM12.0) with TRITON_MLA
  • Tested with GLM-4.7-Flash-REAP-23B-A3B quantized to INT4 (compressed-tensors) on the same hardware
  • Verified correct inference output (math, JSON structured output) after patches
  • Existing MLA models with standard float weights should be unaffected (guard only skips cast when weight is None or non-float)

🤖 Generated with Claude Code

mergify bot added the bug (Something isn't working) label on Feb 28, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces two bug fixes for running quantized MLA models (NVFP4/INT4). The change in mla_attention.py handles layers that lack a weight attribute by using getattr, preventing an AttributeError, and only casts input tensors when the weight dtype is float-like, which is necessary for integer-quantized layers. The change in glm4_moe_lite.py adds a None check before adding a weight name to a set, preventing a TypeError during shared-expert weight loading. Both fixes directly address the described crashes.


mergify bot commented Feb 28, 2026

Hi @lucaspirola, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Two fixes for crashes when running quantized MLA models:

1. mla_attention.py: Guard `self.kv_b_proj.weight` access — quantized
   layers (NVFP4 Marlin, INT4 compressed-tensors) store weights as
   `weight_packed` (int32), not `weight` (float). Use getattr with
   fallback and only cast when weight dtype is a float type.

2. glm4_moe_lite.py: Guard `loaded_params.add(name)` — `name` can be
   None during shared-expert weight loading, causing a crash.

Signed-off-by: Lucas Pirola <lucaspirola@users.noreply.github.com>
@lucaspirola lucaspirola force-pushed the fix/glm4-mla-weight-guards branch from 68a836a to df6f557 Compare February 28, 2026 04:43
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
…m-project#35576)

When kv_b_proj is NVFP4-quantized, Marlin repacks weights to int32 storage
and the .weight tensor's dtype is no longer bfloat16. Accessing .weight.dtype
directly crashes with AttributeError.

Use getattr(self.kv_b_proj, "weight", None) and guard the dtype check so
that quantized projections safely resolve to is_aiter_triton_fp4_bmm_enabled=False
without raising.

Also guard loaded_params.add(name) in glm4_moe_lite.py against name=None
to prevent TypeError when maybe_remap_kv_scale_name returns None.
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
…d auto-patching

Mark PRs vllm-project#34822, vllm-project#35576, vllm-project#34577 as implemented (commits N1, N2, N3).
Remove them from the "Critical Open PRs" section.
Document that FlashInfer patches now run automatically at startup (Commit K
rework) so the post-install script is no longer required.
scottgl9 referenced this pull request in further commits to scottgl9/vllm on Mar 4, Mar 5, and Mar 18, 2026, with the same commit message as above.