[BugFix] This PR aims to fix the precision issue of the LoRA feature i… #4046
liuchenbing wants to merge 5 commits into vllm-project:main from
Conversation
…n vllm-ascend. vLLM version: v0.11.0 vLLM main: vllm-project/vllm Signed-off-by: liuchenbing <chenliumail@163.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix a precision issue with LoRA features and to enable the bfloat16 kernels more broadly.

The change in vllm_ascend/lora/punica_npu.py correctly casts the input tensor x to torch.float32 in add_lora_embedding, which aligns with the kernel's expectation and should resolve the precision problem.

However, the changes in the C++ kernel files (bgmv_expand.cpp, bgmv_shrink.cpp, sgmv_expand.cpp, sgmv_shrink.cpp) introduce a critical compilation issue: while the calls to the bfloat16_t kernels are now unconditional, their definitions remain inside conditional compilation blocks (#if (__CCE_AICORE__ >= 220)). This will cause build failures on hardware with __CCE_AICORE__ < 220. Please address this by making the kernel definitions unconditional as well.
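To see why the float32 cast in add_lora_embedding matters, here is a minimal sketch of the underlying numerics. It uses numpy as a stand-in for torch, and the variable names are illustrative, not vllm-ascend's actual API: accumulating a long reduction in half precision drifts badly, while upcasting the input to float32 first (the fix this PR applies) keeps the result accurate.

```python
import numpy as np

# 4096 small activations, stored in half precision (as embeddings often are).
x = np.full(4096, 0.01, dtype=np.float16)

# Forced float16 accumulation: once the running sum passes ~32, the ulp of
# float16 (0.03125 there) exceeds 2 * 0.01, so further additions round away.
s_half = np.float16(0.0)
for v in x:
    s_half = np.float16(s_half + v)

# The fix: upcast to float32 before reducing, matching the kernel's expectation.
s_fp32 = x.astype(np.float32).sum()

print(float(s_half), float(s_fp32))  # half-precision sum stalls near 32; true sum is ~40.97
```

The same effect applies inside the LoRA matmul: each output element is a long dot product, so the accumulation dtype, not just the storage dtype, determines the precision of the result.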
    bgmv_expand_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, indices, indicesSize, yIn, yOut, batchSize,
                                                          numTokensPerCore, maxLoRARank, outputHiddenDim,
                                                          sliceOffset, outputFullDim);
You've removed the conditional compilation directive for the bgmv_expand_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using BGMV_EXPAND_TYPE_DECLARE(bfloat16_t) at line 346 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    bgmv_shrink_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, indices, indicesSize, y, batchSize, numTokensPerCore,
                                                          inputHiddenDim, maxLoRARank, scale);
You've removed the conditional compilation directive for the bgmv_shrink_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using BGMV_SHRINK_TYPE_DECLARE(bfloat16_t) at line 230 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    sgmv_expand_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, loraIndices, loraIndicesSize,
                                                          seqLen, seqLenSize, yIn, yOut, batchSize,
                                                          numTokensPerCore, maxLoRARank, outputHiddenDim,
                                                          sliceOffset, outputFullDim);
You've removed the conditional compilation directive for the sgmv_expand_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using SGMV_EXPAND_TYPE_DECLARE(bfloat16_t) at line 361 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
    sgmv_shrink_bfloat16_t<<<blockDim, nullptr, stream>>>(x, weight, loraIndices, loraIndicesSize,
                                                          seqLen, seqLenSize,
                                                          y, batchSize,
                                                          numTokensPerCore, inputHiddenDim, maxLoRARank,
                                                          scale);
You've removed the conditional compilation directive for the sgmv_shrink_bfloat16_t kernel call, making it unconditional. However, the kernel's definition using SGMV_SHRINK_TYPE_DECLARE(bfloat16_t) at line 246 is still inside an #if (__CCE_AICORE__ >= 220) block. This will cause a compilation error on platforms where __CCE_AICORE__ < 220, as the function will be called but not defined. You should also remove the conditional compilation around the kernel declaration to fix this.
This PR can fix 2 bugs:
@liuchenbing Could you consider fixing this according to the Gemini review comments?
This PR is a duplicate of #4141. We'll concentrate on that one, and this one will be closed.
### What this PR does / why we need it?

This PR depends on PR #4046 and will only work once that PR is merged. It aims to solve issue #3240. The newly added Llama-2-7b-hf and Qwen3-0.6B test cases cover the scenarios where LoRA weights are added to the q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, and lm_head modules.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

pytest -sv tests/e2e/singlecard/test_llama2_lora.py
pytest -sv tests/e2e/singlecard/test_qwen3_multi_loras.py

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: paulyu12 <507435917@qq.com>
vLLM version: v0.11.0
vLLM main: vllm-project/vllm

### What this PR does / why we need it?

### Does this PR introduce any user-facing change?

### How was this patch tested?

pytest tests/lora/test_llama_tp.py::test_llama_lora -s