
[BugFix] register quant scale tensors as buffer#31395

Merged
DarkLight1337 merged 2 commits into vllm-project:main from BoyuanFeng:bf/register-quant-scale-buffer on Dec 28, 2025

Conversation

Collaborator

@BoyuanFeng BoyuanFeng commented Dec 26, 2025

Prior to this PR, pytest -sv "tests/compile/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.TRITON_ATTN-nvidia/Llama-4-Scout-17B-16E-Instruct-FP8-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype1-256-128-40-8]" leads to an illegal memory access (IMA) error. I found that the IMA comes from torch.ops._C.static_scaled_fp8_quant, as described in #31377.

@lengrongfu pointed out that the scale tensor is not on the GPU, which is what triggers the IMA. But in tests/compile/test_fusion_attn.py we actually call model_unfused.to(device). So why is it not on the GPU?

model_unfused = model_unfused.to(device)

It turns out that _k_scale, _v_scale, _q_scale, and _prob_scale are not registered as buffers, and model_unfused.to(device) only moves parameters and buffers. The fix is to register these four tensors as buffers.

This fixes the unit test failure.
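As a minimal sketch of the mechanism (using hypothetical stand-in classes rather than the real torch.nn.Module), .to(device) walks only registered parameters and buffers, so a tensor stored as a plain attribute keeps its original device:

```python
# Hypothetical stand-ins for torch.Tensor / torch.nn.Module, illustrating
# why Module.to(device) misses plain tensor attributes: .to() only visits
# registered parameters and buffers.

class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


class FakeModule:
    def __init__(self):
        self._buffers = {}

    def register_buffer(self, name, tensor):
        self._buffers[name] = tensor

    def to(self, device):
        # Only registered buffers (and, in real torch, parameters) move.
        for name, t in self._buffers.items():
            self._buffers[name] = t.to(device)
        return self


m = FakeModule()
m.plain_scale = FakeTensor()              # plain attribute, like the old _k_scale
m.register_buffer("buf_scale", FakeTensor())
m.to("cuda:0")
print(m.plain_scale.device)               # stays "cpu" -> device mismatch / IMA
print(m._buffers["buf_scale"].device)     # moved to "cuda:0"
```

The real torch.nn.Module behaves the same way: only parameters and registered buffers participate in .to(), .cuda(), and state_dict().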

Closes #31377

Signed-off-by: Boyuan Feng <boyuan@meta.com>

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses an illegal memory access error by registering the quantization scale tensors (_k_scale, _v_scale, _q_scale, _prob_scale) as buffers. This is the proper solution to ensure these tensors are moved to the correct device when model.to(device) is called. While this fix is correct, I've identified a similar critical issue that has been overlooked in the same file. The q_range, k_range, and v_range tensors in the Attention and MLAAttention classes are also initialized as raw tensors and not registered as buffers. When calculate_kv_scales is enabled, this will lead to a device mismatch RuntimeError during calculations, which is the same class of bug. To prevent this, I strongly recommend also registering q_range, k_range, and v_range as buffers.
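As a hypothetical illustration (not vLLM code) of the bug class the review flags, PyTorch rejects arithmetic between tensors on different devices with a RuntimeError; a tiny stand-in class reproduces the check without needing a GPU:

```python
# Stand-in for a device-carrying tensor, mimicking PyTorch's rejection of
# cross-device arithmetic. If q_range is a plain CPU attribute while the
# activations have been moved to the GPU, any computation mixing them fails.

class DeviceTensor:
    def __init__(self, device):
        self.device = device

    def __mul__(self, other):
        # Mirrors PyTorch's same-device requirement for binary ops.
        if self.device != other.device:
            raise RuntimeError(
                f"Expected all tensors to be on the same device, "
                f"but found {self.device} and {other.device}"
            )
        return DeviceTensor(self.device)


q = DeviceTensor("cuda:0")     # activations, moved by model.to(device)
q_range = DeviceTensor("cpu")  # plain attribute left behind on CPU

try:
    _ = q * q_range
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```

Registering q_range, k_range, and v_range as buffers would let model.to(device) move them along with everything else, the same remedy this PR applies to the scale tensors.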

@ProExpertProg ProExpertProg enabled auto-merge (squash) December 26, 2025 21:25
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 26, 2025
@ProExpertProg
Collaborator

This doesn't actually solve the issue for me; I had to run in a Docker container to repro:

$ docker run -it --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/git/vllm:/vllm \
    --ipc=host \
    --entrypoint /bin/bash \
    vllm/vllm-openai:latest  # ~/git/vllm is the path to the local vllm repo
# in container
$ pip install tblib
$ pytest /vllm/tests/compile/test_fusion_attn.py -s -v

________________ test_attention_quant_pattern[AttentionBackendEnum.TRITON_ATTN-nvidia/Llama-4-Scout-17B-16E-Instruct-FP8-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype0-256-128-40-8] ________________

num_qo_heads = 40, num_kv_heads = 8, head_size = 128, batch_size = 256, dtype = torch.bfloat16, custom_ops = '+quant_fp8', model_name = 'nvidia/Llama-4-Scout-17B-16E-Instruct-FP8'
model_class = <class 'tests.compile.test_fusion_attn.TestAttentionFp8StaticQuantPatternModel'>, backend = <AttentionBackendEnum.TRITON_ATTN: 'vllm.v1.attention.backends.triton_attn.TritonAttentionBackend'>
dist_init = None

>   ???
E   AssertionError: Tensor-likes are not close!
E   
E   Mismatched elements: 1310660 / 1310720 (100.0%)
E   Greatest absolute difference: 6752.0 at index (156, 3552) (up to 0.01 allowed)
E   Greatest relative difference: inf at index (32, 0) (up to 0.01 allowed)

@BoyuanFeng
Collaborator Author

@ProExpertProg there are three separate issues with attn_fusion. One is the numeric mismatch you are seeing; @angelayi is fixing that.

Another is an IMA error, which is fixed by this PR.

There is also an implicit compilation-config cache issue, fixed by #31376.

@ProExpertProg
Collaborator

Ok, please post the PRs in #pr-merge-requests once CI is done so we can get the CI fixed ASAP

@robertgshaw2-redhat
Collaborator

nice job @BoyuanFeng

@robertgshaw2-redhat
Collaborator

note that the test failure is related

Signed-off-by: Boyuan Feng <boyuan@meta.com>
auto-merge was automatically disabled December 27, 2025 22:29

Head branch was pushed to by a user without write access

@BoyuanFeng
Collaborator Author

@robertgshaw2-redhat the unit test failure comes from dummy weight loading. This PR adds the q/k/v/prob scales to the state dict, so a dummy weight load initializes them with random values. When quantization is not used, the model checkpoint does not contain the q/k/v/prob scales, so their values remain the random values from the dummy load instead of the expected default (i.e., 1.0).

The issue is fixed by updating process_weights_after_loading to handle the case where quant weights should not be loaded.
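A hedged sketch of that follow-up fix (the function signature and attribute names here are illustrative, not the actual vLLM code): after a dummy weight load, reset the scale buffers to their default of 1.0 whenever the checkpoint carries no quantization scales.

```python
# Hypothetical sketch: scales registered as buffers now appear in the state
# dict, so a dummy load fills them with random values. Restore the default
# when the model is not quantized; leave loaded values alone otherwise.

DEFAULT_SCALE = 1.0

def process_weights_after_loading(layer, checkpoint_has_quant_scales):
    if not checkpoint_has_quant_scales:
        for name in ("_q_scale", "_k_scale", "_v_scale", "_prob_scale"):
            setattr(layer, name, DEFAULT_SCALE)


class Layer:  # stand-in for an attention layer
    pass


layer = Layer()
layer._q_scale = 0.37  # random value left over from a dummy weight load
process_weights_after_loading(layer, checkpoint_has_quant_scales=False)
print(layer._q_scale)  # 1.0
```

With a quantized checkpoint the scales loaded from the weights are kept untouched; only the unquantized path resets them.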

@DarkLight1337 DarkLight1337 merged commit 62def07 into vllm-project:main Dec 28, 2025
46 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Dec 28, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 29, 2025
### What this PR does / why we need it?
- Fixes vllm break:
1. [[BugFix] register quant scale tensors as buffer #31395]
(vllm-project/vllm#31395)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Dec 30, 2025
shenchuxiaofugui pushed a commit to shenchuxiaofugui/vllm-ascend that referenced this pull request Dec 31, 2025
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: torch.ops._C.static_scaled_fp8_quant IMA error

4 participants