
[BugFix] register quant scale tensors as buffer#31395

Merged
DarkLight1337 merged 2 commits into vllm-project:main from BoyuanFeng:bf/register-quant-scale-buffer on Dec 28, 2025

Conversation

Collaborator

@BoyuanFeng BoyuanFeng commented Dec 26, 2025

Prior to this PR, pytest -sv "tests/compile/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.TRITON_ATTN-nvidia/Llama-4-Scout-17B-16E-Instruct-FP8-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype1-256-128-40-8]" leads to an illegal memory access (IMA) error. I found that the IMA comes from torch.ops._C.static_scaled_fp8_quant, as described in #31377.

@lengrongfu pointed out that the scale tensor is not on the GPU, which is what triggers the IMA. But in tests/compile/test_fusion_attn.py we actually call model_unfused.to(device). So why is it not on the GPU?

model_unfused = model_unfused.to(device)

It turns out that _k_scale, _v_scale, _q_scale, and _prob_scale are not registered as buffers, and model_unfused.to(device) only moves parameters and buffers. The fix is to register these four tensors as buffers.

This fixes the unit test failure.
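As a minimal sketch of the mechanism (using hypothetical stand-in classes rather than the real torch.nn.Module), .to(device) walks only registered parameters and buffers, so a tensor stored as a plain attribute keeps its original device:

```python
# Hypothetical stand-ins for torch.Tensor / torch.nn.Module, illustrating
# why Module.to(device) misses plain tensor attributes: .to() only visits
# registered parameters and buffers.

class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


class FakeModule:
    def __init__(self):
        self._buffers = {}

    def register_buffer(self, name, tensor):
        self._buffers[name] = tensor

    def to(self, device):
        # Only registered buffers (and, in real torch, parameters) move.
        for name, t in self._buffers.items():
            self._buffers[name] = t.to(device)
        return self


m = FakeModule()
m.plain_scale = FakeTensor()              # plain attribute, like the old _k_scale
m.register_buffer("buf_scale", FakeTensor())
m.to("cuda:0")
print(m.plain_scale.device)               # stays "cpu" -> device mismatch / IMA
print(m._buffers["buf_scale"].device)     # moved to "cuda:0"
```

The real torch.nn.Module behaves the same way: only parameters and registered buffers participate in .to(), .cuda(), and state_dict().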

Closes #31377

Signed-off-by: Boyuan Feng <boyuan@meta.com>

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses an illegal memory access error by registering the quantization scale tensors (_k_scale, _v_scale, _q_scale, _prob_scale) as buffers. This is the proper solution to ensure these tensors are moved to the correct device when model.to(device) is called. While this fix is correct, I've identified a similar critical issue that has been overlooked in the same file. The q_range, k_range, and v_range tensors in the Attention and MLAAttention classes are also initialized as raw tensors and not registered as buffers. When calculate_kv_scales is enabled, this will lead to a device mismatch RuntimeError during calculations, which is the same class of bug. To prevent this, I strongly recommend also registering q_range, k_range, and v_range as buffers.
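As a hypothetical illustration (not vLLM code) of the bug class the review flags, PyTorch rejects arithmetic between tensors on different devices with a RuntimeError; a tiny stand-in class reproduces the check without needing a GPU:

```python
# Stand-in for a device-carrying tensor, mimicking PyTorch's rejection of
# cross-device arithmetic. If q_range is a plain CPU attribute while the
# activations have been moved to the GPU, any computation mixing them fails.

class DeviceTensor:
    def __init__(self, device):
        self.device = device

    def __mul__(self, other):
        # Mirrors PyTorch's same-device requirement for binary ops.
        if self.device != other.device:
            raise RuntimeError(
                f"Expected all tensors to be on the same device, "
                f"but found {self.device} and {other.device}"
            )
        return DeviceTensor(self.device)


q = DeviceTensor("cuda:0")     # activations, moved by model.to(device)
q_range = DeviceTensor("cpu")  # plain attribute left behind on CPU

try:
    _ = q * q_range
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```

Registering q_range, k_range, and v_range as buffers would let model.to(device) move them along with everything else, the same remedy this PR applies to the scale tensors.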

@ProExpertProg ProExpertProg enabled auto-merge (squash) December 26, 2025 21:25
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 26, 2025
@ProExpertProg
Collaborator

This doesn't actually solve the issue for me; I had to run in a Docker container to repro:

$ docker run -it --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/git/vllm:/vllm \
    --ipc=host \
    --entrypoint /bin/bash \
    vllm/vllm-openai:latest  # ~/git/vllm is the path to the local vllm repo
# in container
$ pip install tblib
$ pytest /vllm/tests/compile/test_fusion_attn.py -s -v

________________ test_attention_quant_pattern[AttentionBackendEnum.TRITON_ATTN-nvidia/Llama-4-Scout-17B-16E-Instruct-FP8-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype0-256-128-40-8] ________________

num_qo_heads = 40, num_kv_heads = 8, head_size = 128, batch_size = 256, dtype = torch.bfloat16, custom_ops = '+quant_fp8', model_name = 'nvidia/Llama-4-Scout-17B-16E-Instruct-FP8'
model_class = <class 'tests.compile.test_fusion_attn.TestAttentionFp8StaticQuantPatternModel'>, backend = <AttentionBackendEnum.TRITON_ATTN: 'vllm.v1.attention.backends.triton_attn.TritonAttentionBackend'>
dist_init = None

>   ???
E   AssertionError: Tensor-likes are not close!
E   
E   Mismatched elements: 1310660 / 1310720 (100.0%)
E   Greatest absolute difference: 6752.0 at index (156, 3552) (up to 0.01 allowed)
E   Greatest relative difference: inf at index (32, 0) (up to 0.01 allowed)

@BoyuanFeng
Collaborator Author

@ProExpertProg there are three separate issues with attn_fusion. One is the numeric mismatch you are seeing; @angelayi is fixing that.

Another is an IMA error, which is fixed by this PR.

There is also an implicit compilation-config cache issue, fixed by #31376.

@ProExpertProg
Collaborator

Ok, please post the PRs in #pr-merge-requests once CI is done so we can get the CI fixed ASAP

@robertgshaw2-redhat
Collaborator

nice job @BoyuanFeng

@robertgshaw2-redhat
Collaborator

note that the test failure is related

Signed-off-by: Boyuan Feng <boyuan@meta.com>
auto-merge was automatically disabled December 27, 2025 22:29

Head branch was pushed to by a user without write access

@BoyuanFeng
Collaborator Author

@robertgshaw2-redhat the unit test failure comes from dummy weight loading. This PR adds the q/k/v/prob scales to the state dict, so a dummy weight load initializes them with random values. When quantization is not used, the model checkpoint does not contain the q/k/v/prob scales, so their values remain the random values from the dummy load instead of the expected default (i.e., 1.0).

The issue is fixed by updating process_weights_after_loading to handle the case where quant weights should not be loaded.
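A hedged sketch of that follow-up fix (the function signature and attribute names here are illustrative, not the actual vLLM code): after a dummy weight load, reset the scale buffers to their default of 1.0 whenever the checkpoint carries no quantization scales.

```python
# Hypothetical sketch: scales registered as buffers now appear in the state
# dict, so a dummy load fills them with random values. Restore the default
# when the model is not quantized; leave loaded values alone otherwise.

DEFAULT_SCALE = 1.0

def process_weights_after_loading(layer, checkpoint_has_quant_scales):
    if not checkpoint_has_quant_scales:
        for name in ("_q_scale", "_k_scale", "_v_scale", "_prob_scale"):
            setattr(layer, name, DEFAULT_SCALE)


class Layer:  # stand-in for an attention layer
    pass


layer = Layer()
layer._q_scale = 0.37  # random value left over from a dummy weight load
process_weights_after_loading(layer, checkpoint_has_quant_scales=False)
print(layer._q_scale)  # 1.0
```

With a quantized checkpoint the scales loaded from the weights are kept untouched; only the unquantized path resets them.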

@DarkLight1337 DarkLight1337 merged commit 62def07 into vllm-project:main Dec 28, 2025
46 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Dec 28, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 29, 2025
### What this PR does / why we need it?
- Fixes vllm break:
1. [[BugFix] register quant scale tensors as buffer #31395]
(vllm-project/vllm#31395)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Dec 30, 2025
shenchuxiaofugui pushed a commit to shenchuxiaofugui/vllm-ascend that referenced this pull request Dec 31, 2025
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: torch.ops._C.static_scaled_fp8_quant IMA error

4 participants