[BugFix] This PR aims to fix the precision issue of the LoRA feature i… #4046
liuchenbing wants to merge 5 commits into vllm-project:main from
Conversation
…n vllm-ascend. vLLM version: v0.11.0 vLLM main: vllm-project/vllm Signed-off-by: liuchenbing <chenliumail@163.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix a precision issue with LoRA features and to enable the bfloat16 kernels more broadly.

The change in vllm_ascend/lora/punica_npu.py correctly casts the input tensor x to torch.float32 in add_lora_embedding, which aligns with the kernel's expectation and should resolve the precision problem.

However, the changes in the C++ kernel files (bgmv_expand.cpp, bgmv_shrink.cpp, sgmv_expand.cpp, sgmv_shrink.cpp) introduce a critical compilation issue: while the calls to the bfloat16_t kernels are now unconditional, their definitions remain inside conditional compilation blocks (#if (__CCE_AICORE__ >= 220)). This will cause build failures on hardware with __CCE_AICORE__ < 220. Please address this by making the kernel definitions unconditional as well.
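To see why the float32 cast in add_lora_embedding matters, here is a minimal sketch of the underlying numerics. It uses numpy as a stand-in for torch, and the variable names are illustrative, not vllm-ascend's actual API: accumulating a long reduction in half precision drifts badly, while upcasting the input to float32 first (the fix this PR applies) keeps the result accurate.

```python
import numpy as np

# 4096 small activations, stored in half precision (as embeddings often are).
x = np.full(4096, 0.01, dtype=np.float16)

# Forced float16 accumulation: once the running sum passes ~32, the ulp of
# float16 (0.03125 there) exceeds 2 * 0.01, so further additions round away.
s_half = np.float16(0.0)
for v in x:
    s_half = np.float16(s_half + v)

# The fix: upcast to float32 before reducing, matching the kernel's expectation.
s_fp32 = x.astype(np.float32).sum()

print(float(s_half), float(s_fp32))  # half-precision sum stalls near 32; true sum is ~40.97
```

The same effect applies inside the LoRA matmul: each output element is a long dot product, so the accumulation dtype, not just the storage dtype, determines the precision of the result.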
    bgmv_expand_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, indices, indicesSize, yIn, yOut, batchSize,
                                                          numTokensPerCore, maxLoRARank, outputHiddenDim,
                                                          sliceOffset, outputFullDim);
You've removed the conditional compilation directive for the bgmv_expand_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using BGMV_EXPAND_TYPE_DECLARE(bfloat16_t) at line 346 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    bgmv_shrink_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, indices, indicesSize, y, batchSize, numTokensPerCore,
                                                          inputHiddenDim, maxLoRARank, scale);
You've removed the conditional compilation directive for the bgmv_shrink_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using BGMV_SHRINK_TYPE_DECLARE(bfloat16_t) at line 230 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    sgmv_expand_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, loraIndices, loraIndicesSize,
                                                          seqLen, seqLenSize, yIn, yOut, batchSize,
                                                          numTokensPerCore, maxLoRARank, outputHiddenDim,
                                                          sliceOffset, outputFullDim);
You've removed the conditional compilation directive for the sgmv_expand_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using SGMV_EXPAND_TYPE_DECLARE(bfloat16_t) at line 361 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    sgmv_shrink_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, loraIndices, loraIndicesSize,
                                                          seqLen, seqLenSize,
                                                          y, batchSize,
                                                          numTokensPerCore, inputHiddenDim, maxLoRARank,
                                                          scale);
You've removed the conditional compilation directive for the sgmv_shrink_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using SGMV_SHRINK_TYPE_DECLARE(bfloat16_t) at line 246 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
This PR can fix 2 bugs:
@liuchenbing Could you consider fixing this according to the Gemini review comments?
This PR is a duplicate of #4141. We'll concentrate on that one, and this one will be closed.
### What this PR does / why we need it?

This PR depends on PR #4046 and will only work once that PR is merged. It aims to solve issue #3240. The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the scenarios where LoRA weights are added to the q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, and lm_head modules.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

pytest -sv tests/e2e/singlecard/test_llama2_lora.py
pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: paulyu12 <507435917@qq.com>
vLLM version: v0.11.0
vLLM main: vllm-project/vllm

### What this PR does / why we need it?

### Does this PR introduce any user-facing change?

### How was this patch tested?

pytest tests/lora/test_llama_tp.py::test_llama_lora -s