[BugFix] register quant scale tensors as buffer#31395
DarkLight1337 merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Code Review
This pull request correctly addresses an illegal memory access error by registering the quantization scale tensors (_k_scale, _v_scale, _q_scale, _prob_scale) as buffers. This is the proper solution to ensure these tensors are moved to the correct device when model.to(device) is called. While this fix is correct, I've identified a similar critical issue that has been overlooked in the same file. The q_range, k_range, and v_range tensors in the Attention and MLAAttention classes are also initialized as raw tensors and not registered as buffers. When calculate_kv_scales is enabled, this will lead to a device mismatch RuntimeError during calculations, which is the same class of bug. To prevent this, I strongly recommend also registering q_range, k_range, and v_range as buffers.
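The distinction the review relies on can be demonstrated in a few lines of plain PyTorch (a standalone sketch, not vLLM code): `Module.to()` relocates parameters and registered buffers, but a plain tensor attribute is left behind on its original device and is also invisible to `state_dict()`.

```python
# Standalone sketch: Module.to() moves parameters and registered buffers,
# but NOT plain tensor attributes. The "meta" device is used here only so
# the example runs without a GPU.
import torch
import torch.nn as nn

class Attn(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain attribute: ignored by .to() and absent from state_dict().
        self._k_scale_raw = torch.tensor(1.0)
        # Registered buffer: moved by .to() and saved in state_dict().
        self.register_buffer("_k_scale", torch.tensor(1.0))

m = Attn().to("meta")
print(m._k_scale_raw.device)  # cpu  -- stranded
print(m._k_scale.device)      # meta -- traveled with the module
```

With a real CUDA device in place of `"meta"`, the stranded CPU scale tensor is exactly what a fused kernel then dereferences as if it were GPU memory, producing the illegal memory access.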
This doesn't actually solve the issue for me; I had to run in a Docker container to repro:
@ProExpertProg there are 3 separate issues for attn_fusion. One is the numeric-mismatch issue you have seen, which @angelayi is fixing. Another is an IMA error, which is fixed by this PR. There is also an implicit compilation-config cache issue, fixed by #31376.
Ok, please post the PRs in #pr-merge-requests once CI is done so we can get the CI fixed ASAP |
nice job @BoyuanFeng |
note that the test failure is related |
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Head branch was pushed to by a user without write access
@robertgshaw2-redhat the unit test failure comes from dummy weight loading. This PR adds the q/k/v/prob scales to the state dict, so the dummy weight load now randomly initializes their values. When quant is not used, the model checkpoint does not contain q/k/v/prob scales, so their values remain the random values from the dummy load instead of the expected default (i.e., 1.0). The issue is fixed by updating the dummy weight loader.
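The follow-up fix can be sketched roughly like this (all names here are illustrative, not vLLM's actual loader code): the dummy loader skips the scale entries it now sees in `state_dict()`, so they keep their default value of 1.0 while the real weights are still randomized.

```python
# Hedged sketch of the dummy-load fix; SCALE_SUFFIXES and dummy_init are
# illustrative names, not vLLM's actual loader API.
import torch
import torch.nn as nn

SCALE_SUFFIXES = ("_k_scale", "_v_scale", "_q_scale", "_prob_scale")

def dummy_init(model: nn.Module) -> None:
    """Fill every weight with random values, but leave quant scales alone."""
    for name, tensor in model.state_dict().items():
        if name.endswith(SCALE_SUFFIXES):
            continue  # keep the default scale of 1.0
        tensor.uniform_(-1e-3, 1e-3)  # in-place: shares storage with the module

class ToyAttn(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(8))
        self.register_buffer("_k_scale", torch.tensor(1.0))

m = ToyAttn()
dummy_init(m)
print(float(m._k_scale))  # 1.0 -- untouched by the dummy load
```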
Signed-off-by: Boyuan Feng <boyuan@meta.com>
### What this PR does / why we need it?
- Fixes vLLM break: [BugFix] register quant scale tensors as buffer (vllm-project/vllm#31395)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89

Signed-off-by: leo-pony <nengjunma@outlook.com>
Prior to this PR,

```
pytest -sv "tests/compile/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.TRITON_ATTN-nvidia/Llama-4-Scout-17B-16E-Instruct-FP8-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype1-256-128-40-8]"
```

leads to an illegal memory access (IMA) error. I found that the IMA comes from `torch.ops._C.static_scaled_fp8_quant`, as described in #31377. @lengrongfu pointed out that the scale tensor is not on the GPU, which triggers the IMA. But in `tests/compile/test_fusion_attn.py` (line 365 in 87f1b8c) we actually have `model_unfused.to(device)`, so why is the tensor not on the GPU?
It turns out that `_k_scale`, `_v_scale`, `_q_scale`, and `_prob_scale` are not registered as buffers, and `model_unfused.to(device)` only moves parameters and buffers. The fix is to register these 4 tensors as buffers, which resolves the unit test failure.
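A simplified sketch of what the fix looks like (condensed for illustration; the real change lives in vLLM's attention layer and carries dtype/device details omitted here). Registering the four scales as buffers makes them both movable by `.to(device)` and visible in `state_dict()`:

```python
# Simplified sketch of the fix: the four scale tensors become registered
# buffers, so model.to(device) moves them along with the rest of the module.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self):
        super().__init__()
        # Before: self._k_scale = torch.tensor(1.0)  -> stranded on CPU
        # After: buffers travel with the module on .to(device)
        self.register_buffer("_k_scale", torch.tensor(1.0))
        self.register_buffer("_v_scale", torch.tensor(1.0))
        self.register_buffer("_q_scale", torch.tensor(1.0))
        self.register_buffer("_prob_scale", torch.tensor(1.0))

attn = Attention()
print(sorted(attn.state_dict().keys()))
# ['_k_scale', '_prob_scale', '_q_scale', '_v_scale']
```

The `state_dict()` visibility is the side effect that surfaced the dummy-weight-load test failure discussed above: the scales now appear as loadable entries, so the loader must treat them specially when the checkpoint omits them.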
Closes: #31377