
[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models#34695

Merged
MatthewBonanni merged 3 commits into vllm-project:main from haosdent:fix-34561
Mar 13, 2026
Conversation

@haosdent
Contributor

@haosdent haosdent commented Feb 17, 2026

Purpose

Fix AttributeError: 'ColumnParallelLinear' object has no attribute 'weight' when running MLA models with AWQ/GPTQ quantization (e.g., cyankiwi/GLM-4.7-Flash-AWQ-4bit).

Closes #34561.

Root cause: MLA attention code accesses self.kv_b_proj.weight.dtype in 3 places, but AWQ/GPTQ-quantized ColumnParallelLinear layers store weights as qweight (packed int32), not weight. The code only accounted for unquantized and FP8-quantized weights.
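The failure mode can be reproduced with a stand-in class (attribute names follow the PR description; the class is illustrative only, not vLLM's actual `ColumnParallelLinear`):

```python
# Stand-in for an AWQ/GPTQ-quantized ColumnParallelLinear: it exposes
# qweight (packed int32) and params_dtype, but no `weight` attribute.
class QuantizedLinearStub:
    def __init__(self):
        self.qweight = "packed int32 tensor"   # stored quantized weights
        self.params_dtype = "torch.bfloat16"   # compute dtype, always present

layer = QuantizedLinearStub()
try:
    layer.weight.dtype  # the access that crashes in MLA attention
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```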

Fix: Guard .weight.dtype accesses with hasattr(self.kv_b_proj, "weight") checks:

  • Line 406 (MLAAttention.__init__): Added hasattr guard in the and chain for the ROCm fp4 BMM check. Short-circuits to False for AWQ/GPTQ — correct, since packed int32 weights can't be used with fp4 BMM.
  • Lines 2357-2371 (MLACommonImpl._compute_prefill_context): Introduced a local _kv_b_proj_w_dtype that uses weight.dtype when available, falling back to params_dtype (always present on LinearBase) for quantized layers. params_dtype is the model's compute dtype (e.g., bf16), which is the correct input dtype that AWQ/GPTQ layers expect.
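The guard-plus-fallback described above can be sketched as a helper (simplified; the actual change inlines the guard in vLLM's MLA attention code, and the dtypes are real torch dtypes rather than the string stand-ins used here):

```python
from types import SimpleNamespace

def kv_b_proj_w_dtype(kv_b_proj):
    """Pick the dtype the way the PR does: prefer weight.dtype when the
    layer is unquantized or FP8, otherwise fall back to params_dtype."""
    if hasattr(kv_b_proj, "weight"):
        return kv_b_proj.weight.dtype
    # AWQ/GPTQ layers store qweight (packed int32) instead of weight;
    # params_dtype is the compute dtype (e.g. bf16) their inputs use.
    return kv_b_proj.params_dtype

unquantized = SimpleNamespace(weight=SimpleNamespace(dtype="bfloat16"),
                              params_dtype="bfloat16")
awq = SimpleNamespace(qweight="packed int32", params_dtype="bfloat16")

print(kv_b_proj_w_dtype(unquantized))  # bfloat16 (from weight.dtype)
print(kv_b_proj_w_dtype(awq))          # bfloat16 (params_dtype fallback)
```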

Correctness verified across all quantization methods:

| Quantization | hasattr(weight) | dtype used | fp4 BMM check | Prefill cast behavior |
|---|---|---|---|---|
| Unquantized bf16 | True | bf16 | Enabled if fp4bmm available | Cast to bf16 |
| Unquantized fp16 | True | fp16 | Disabled (fp16≠bf16) | Cast to fp16 |
| FP8 | True | float8_e4m3fn | Disabled (fp8≠bf16) | fp8 prefill: cast to fp8; no fp8 prefill: skip cast |
| AWQ (W4A16) | False | params_dtype (bf16) | Disabled (short-circuit) | Cast to bf16 |
| GPTQ (W4A16) | False | params_dtype (fp16) | Disabled (short-circuit) | Cast to fp16 |

Test Plan

  1. Syntax and structural verification — AST parse, confirmed exactly 2 hasattr guards and 1 params_dtype fallback.
  2. Mock-based logic verification — tested all 5 quantization scenarios (unquantized bf16, unquantized fp16, FP8 with/without FP8 prefill, AWQ) with mock objects simulating the ColumnParallelLinear layer behavior.
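The mock-based checks can be approximated like this (a hypothetical re-creation with string dtype stand-ins; the real tests use torch dtypes and the actual MLA code paths):

```python
from types import SimpleNamespace

def w_dtype(layer):
    # Same guard-plus-fallback logic the PR adds.
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype

scenarios = {
    "unquantized bf16": SimpleNamespace(weight=SimpleNamespace(dtype="bf16"),
                                        params_dtype="bf16"),
    "unquantized fp16": SimpleNamespace(weight=SimpleNamespace(dtype="fp16"),
                                        params_dtype="fp16"),
    "fp8":              SimpleNamespace(weight=SimpleNamespace(dtype="fp8"),
                                        params_dtype="bf16"),
    "awq":              SimpleNamespace(qweight="int32", params_dtype="bf16"),
    "gptq":             SimpleNamespace(qweight="int32", params_dtype="fp16"),
}

for name, layer in scenarios.items():
    # fp4 BMM candidate check: short-circuits to False when weight is absent
    fp4_bmm_candidate = hasattr(layer, "weight") and layer.weight.dtype == "bf16"
    print(f"{name}: dtype={w_dtype(layer)}, fp4_bmm_candidate={fp4_bmm_candidate}")
```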

Test Result

Unquantized bf16: hasattr(weight)=True, weight.dtype=bf16
FP8: hasattr(weight)=True, weight.dtype=fp8
AWQ: hasattr(weight)=False, params_dtype=bf16
Line 406 logic: PASS for all 3 scenarios
Unquant, no fp8 prefill: cast=True, dtype=torch.bfloat16 -- PASS
FP8, no fp8 prefill: cast=False -- PASS
FP8, fp8 prefill: cast=True, dtype=torch.float8_e4m3fn -- PASS
AWQ, no fp8 prefill: cast=True, dtype=torch.bfloat16 -- PASS

All verification checks PASSED

…-project#34561)

Fix AttributeError when using AWQ/GPTQ quantized MLA models (e.g.,
GLM-4.7-Flash-AWQ) by guarding `kv_b_proj.weight.dtype` accesses
with `hasattr` checks and falling back to `params_dtype`.

Signed-off-by: haosdent <haosdent@gmail.com>
@mergify mergify bot added the bug Something isn't working label Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a crash that occurs when running MLA models with AWQ/GPTQ quantization. The root cause, an AttributeError from accessing a non-existent .weight attribute on quantized layers, is correctly identified. The fix is clean and robust, using hasattr guards to prevent the error. The fallback to params_dtype for quantized layers in _compute_prefill_context is a logical and well-justified approach. The changes are minimal, targeted, and well-documented in the pull request description. Overall, this is an excellent bugfix.

@haosdent haosdent changed the title [Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models [WIP][Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models Feb 17, 2026
@mgoin mgoin requested a review from pavanimajety February 17, 2026 15:47
@mgoin
Member

mgoin commented Feb 17, 2026

cc @LucasWilkinson @MatthewBonanni @pavanimajety to review

@haosdent haosdent changed the title [WIP][Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models [Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models Feb 23, 2026
@cjackal
Contributor

cjackal commented Feb 23, 2026

I've checked that with this PR the DeepSeek V3 AWQ checkpoint loads successfully and gets normal accuracy (gsm8k: 0.945), naturally. The change looks accurate and simple; can we review this PR to unblock AWQ/GPTQ models from running on latest main with transformers v5?

@haosdent
Contributor Author

Thanks for the test, @cjackal. @LucasWilkinson @MatthewBonanni @pavanimajety, can you help review? Thank you in advance.

@babyplutokurt

I also verified this patch fixes the error when serving GLM-4.7-Flash-GPTQ-4bits, with both single and batch requests.

Collaborator

@MatthewBonanni MatthewBonanni left a comment


LGTM, thanks for the contribution!

@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 6, 2026
@MatthewBonanni MatthewBonanni enabled auto-merge (squash) March 6, 2026 12:33
@MatthewBonanni MatthewBonanni merged commit 6d53efd into vllm-project:main Mar 13, 2026
54 checks passed
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 added a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Restore three upstream changes in MLACommonImpl that were accidentally
removed in initial AITER commits:

1. Add back logger.info_once for backend selection (TRT-LLM, FlashInfer,
   CUDNN, FlashAttention) - helpful for debugging
2. Restore FA4 support in _pad_v logic - FA4 natively handles different
   head dimensions like FA3 on Hopper
3. Restore params_dtype fallback for AWQ/GPTQ quantized models (PR vllm-project#34695)
   - Quantized layers may lack .weight attribute

These changes are in MLACommonImpl (shared backend selector), not related
to AITER fused kernel functionality which is in MLAAttention class.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…-project#34695)

Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…-project#34695)

Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…-project#34695)

Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: GLM-4.7-Flash-AWQ fails with AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'

6 participants