Add Custom Kernels For LoRA Performance (#2325)
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces custom SGMV (Segmented Grouped Matrix-Vector) kernels for LoRA to improve performance on Ascend devices. The changes include new C++ kernels for sgmv_expand and sgmv_shrink, and updates to the Python wrappers to utilize them.
While the performance improvements are valuable, my review has identified several critical issues that must be addressed. There are potential out-of-bounds memory access vulnerabilities in both of the new C++ kernels (sgmv_expand.cpp, sgmv_shrink.cpp) due to missing bounds checks. Additionally, a critical safety check for LoRA indices has been removed in the Python code (punica_npu.py), which could also lead to out-of-bounds access.
Furthermore, the Python API wrappers for the new SGMV operations (lora_ops.py) have several unused or ignored parameters, which makes the API confusing and should be cleaned up.
Please address these critical and high-severity issues to ensure the correctness and stability of the new kernels.
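As a minimal sketch of the kind of index validation the review asks to restore in punica_npu.py (the function name and pure-Python types here are hypothetical; the real code operates on torch tensors on the NPU), an out-of-range adapter id must be rejected before any kernel dereferences it:

```python
def validate_lora_indices(lora_indices, max_loras):
    """Hypothetical pre-kernel safety check: every per-token adapter id must
    be -1 (conventionally "no LoRA applied") or a valid slot in [0, max_loras).
    Without such a check, a bad index reaches the kernel and causes
    out-of-bounds memory access."""
    for idx in lora_indices:
        if idx < -1 or idx >= max_loras:
            raise ValueError(
                f"LoRA index {idx} out of range [-1, {max_loras})")
```

The equivalent guard inside the C++ kernels would clamp or skip rows whose index falls outside the loaded adapter table.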
vLLM version: v0.10.0 vLLM main: vllm-project/vllm@14a5d90 Signed-off-by: liuchn <909698896@qq.com>
    bgmv_shrink, sgmv_expand,
    sgmv_expand_slice, sgmv_shrink)
else:
    #print("This not is 310P")
Remove this useless comment.
### What this PR does / why we need it?
Add two custom operators (sgmv_shrink and sgmv_expand) to address the
performance issues of LoRA, and enable ACL graph mode for the LoRA
operators to improve model inference performance.
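The per-token semantics of the shrink step can be sketched in pure Python as a reference (an assumption-laden illustration, not the actual kernel, which runs compiled on the Ascend device over tensors; sgmv_expand is the complementary rank-to-hidden projection):

```python
def sgmv_shrink_ref(x, lora_a, lora_indices, scale):
    """Reference semantics assumed for sgmv_shrink: each token row in x
    (shape [n, hidden], as nested lists) is projected through its adapter's
    A matrix (hidden x rank) and scaled. Index -1 means "no adapter" and
    yields a zero row."""
    out = []
    for row, idx in zip(x, lora_indices):
        if idx == -1:
            out.append([0.0] * len(lora_a[0][0]))  # zero row of width rank
            continue
        a = lora_a[idx]  # this token's adapter: hidden x rank
        out.append([
            scale * sum(row[h] * a[h][r] for h in range(len(row)))
            for r in range(len(a[0]))
        ])
    return out
```

The "segmented grouped" aspect is what the real kernel adds: tokens sharing an adapter are batched into segments so the matmul is done per group rather than per token.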
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
Tested against the Qwen2.5 7B model using vllm-ascend
version v0.9.2.rc1: in ACL graph mode, TTFT, TPOT, and throughput
improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@1f83e7d
---------
Signed-off-by: liuchn <909698896@qq.com>
Co-authored-by: liuchn <909698896@qq.com>