[Bugfix] Fix MLA weight access crash for quantized layers (NVFP4/INT4)#35576

Open
lucaspirola wants to merge 2 commits into vllm-project:main from lucaspirola:fix/glm4-mla-weight-guards

Conversation

@lucaspirola

Summary

Fix two crashes that prevent serving NVFP4 (modelopt) and INT4 (compressed-tensors) quantized MLA models such as GLM-4.7-Flash-REAP-23B-A3B:

  • mla_attention.py: self.kv_b_proj.weight.dtype raises AttributeError when the layer uses quantized storage formats (e.g. Marlin stores weight_packed as int32, not weight as float). Fix: use getattr(self.kv_b_proj, "weight", None) and only cast when the weight is a float dtype (BF16/FP16/FP8).
  • glm4_moe_lite.py: loaded_params.add(name) raises TypeError when name is None during shared-expert weight loading. Fix: add a name is not None guard before the add.

Reproduction

# Quantize GLM-4.7-Flash-REAP-23B-A3B to NVFP4 with nvidia-modelopt 0.41.0, then:
vllm serve ./GLM-4.7-Flash-REAP-23B-A3B-NVFP4 \
    --trust-remote-code --dtype bfloat16 --quantization modelopt \
    --enforce-eager --max-model-len 4928
# Crashes on model loading without these patches

Crash 1 — mla_attention.py:

AttributeError: 'MarlinLinearMethod' object has no attribute 'weight'

Triggered during chunked prefill context computation when kv_b_proj uses Marlin (NVFP4) or compressed-tensors (INT4) quantization — these store weights as weight_packed (int32), not weight (float).

Crash 2 — glm4_moe_lite.py:

TypeError: unhashable type: 'NoneType'

Triggered during weight loading when shared-expert fusion produces name=None entries.
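A sketch of the second guard (the helper name is illustrative; in the PR the check is inline in glm4_moe_lite.py's weight-loading loop):

```python
def record_loaded(loaded_params: set, name) -> None:
    # Remapping helpers (e.g. maybe_remap_kv_scale_name) can return None
    # for entries produced by shared-expert fusion; skip those so weight
    # loading does not crash on loaded_params.add(name).
    if name is not None:
        loaded_params.add(name)
```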

Test plan

  • Tested with GLM-4.7-Flash-REAP-23B-A3B quantized to NVFP4 (modelopt 0.41.0) on RTX 5080 (SM12.0) with TRITON_MLA
  • Tested with GLM-4.7-Flash-REAP-23B-A3B quantized to INT4 (compressed-tensors) on the same hardware
  • Verified correct inference output (math, JSON structured output) after patches
  • Existing MLA models with standard float weights should be unaffected (guard only skips cast when weight is None or non-float)

🤖 Generated with Claude Code

mergify bot added the bug (Something isn't working) label on Feb 28, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces two bug fixes for running quantized MLA models (NVFP4/INT4). The change in mla_attention.py handles layers that lack a weight attribute by using getattr, preventing an AttributeError, and only casts input tensors when the weight dtype is float-like, which is necessary for integer-quantized layers. The change in glm4_moe_lite.py adds a None check before adding a weight name to a set, preventing a TypeError during shared-expert weight loading. Both fixes directly address the described crashes.


mergify bot commented Feb 28, 2026

Hi @lucaspirola, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Two fixes for crashes when running quantized MLA models:

1. mla_attention.py: Guard `self.kv_b_proj.weight` access — quantized
   layers (NVFP4 Marlin, INT4 compressed-tensors) store weights as
   `weight_packed` (int32), not `weight` (float). Use getattr with
   fallback and only cast when weight dtype is a float type.

2. glm4_moe_lite.py: Guard `loaded_params.add(name)` — `name` can be
   None during shared-expert weight loading, causing a crash.

Signed-off-by: Lucas Pirola <lucaspirola@users.noreply.github.com>
@lucaspirola lucaspirola force-pushed the fix/glm4-mla-weight-guards branch from 68a836a to df6f557 Compare February 28, 2026 04:43
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
…m-project#35576)

When kv_b_proj is NVFP4-quantized, Marlin repacks weights to int32 storage
and the .weight tensor's dtype is no longer bfloat16. Accessing .weight.dtype
directly crashes with AttributeError.

Use getattr(self.kv_b_proj, "weight", None) and guard the dtype check so
that quantized projections safely resolve to is_aiter_triton_fp4_bmm_enabled=False
without raising.

Also guard loaded_params.add(name) in glm4_moe_lite.py against name=None
to prevent TypeError when maybe_remap_kv_scale_name returns None.
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
…d auto-patching

Mark PRs vllm-project#34822, vllm-project#35576, vllm-project#34577 as implemented (commits N1, N2, N3).
Remove them from the "Critical Open PRs" section.
Document that FlashInfer patches now run automatically at startup (Commit K
rework) so the post-install script is no longer required.
scottgl9 referenced this pull request in further commits to scottgl9/vllm on Mar 4, Mar 5, and Mar 18, 2026, with the same commit message as above.