Add Custom Kernels For LoRA Performance (#2325)
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces custom SGMV (Segmented Grouped Matrix-Vector) kernels for LoRA to improve performance on Ascend devices. The changes include new C++ kernels for sgmv_expand and sgmv_shrink, and updates to the Python wrappers to utilize them.
While the performance improvements are valuable, my review has identified several critical issues that must be addressed. There are potential out-of-bounds memory access vulnerabilities in both of the new C++ kernels (sgmv_expand.cpp, sgmv_shrink.cpp) due to missing bounds checks. Additionally, a critical safety check for LoRA indices has been removed in the Python code (punica_npu.py), which could also lead to out-of-bounds access.
Furthermore, the Python API wrappers for the new SGMV operations (lora_ops.py) have several unused or ignored parameters, which makes the API confusing and should be cleaned up.
Please address these critical and high-severity issues to ensure the correctness and stability of the new kernels.
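As a minimal sketch of the kind of index validation the review asks to restore in punica_npu.py (the function name and pure-Python types here are hypothetical; the real code operates on torch tensors on the NPU), an out-of-range adapter id must be rejected before any kernel dereferences it:

```python
def validate_lora_indices(lora_indices, max_loras):
    """Hypothetical pre-kernel safety check: every per-token adapter id must
    be -1 (conventionally "no LoRA applied") or a valid slot in [0, max_loras).
    Without such a check, a bad index reaches the kernel and causes
    out-of-bounds memory access."""
    for idx in lora_indices:
        if idx < -1 or idx >= max_loras:
            raise ValueError(
                f"LoRA index {idx} out of range [-1, {max_loras})")
```

The equivalent guard inside the C++ kernels would clamp or skip rows whose index falls outside the loaded adapter table.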
vLLM version: v0.10.0 vLLM main: vllm-project/vllm@14a5d90 Signed-off-by: liuchn <909698896@qq.com>
    bgmv_shrink, sgmv_expand,
    sgmv_expand_slice, sgmv_shrink)
else:
    #print("This not is 310P")
Remove this useless comment.
### What this PR does / why we need it?
Add two custom operators (sgmv_shrink and sgmv_expand) to address the
performance issues of LoRA, and enable ACL graph mode for the LoRA
operators to improve model inference performance.
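The per-token semantics of the shrink step can be sketched in pure Python as a reference (an assumption-laden illustration, not the actual kernel, which runs compiled on the Ascend device over tensors; sgmv_expand is the complementary rank-to-hidden projection):

```python
def sgmv_shrink_ref(x, lora_a, lora_indices, scale):
    """Reference semantics assumed for sgmv_shrink: each token row in x
    (shape [n, hidden], as nested lists) is projected through its adapter's
    A matrix (hidden x rank) and scaled. Index -1 means "no adapter" and
    yields a zero row."""
    out = []
    for row, idx in zip(x, lora_indices):
        if idx == -1:
            out.append([0.0] * len(lora_a[0][0]))  # zero row of width rank
            continue
        a = lora_a[idx]  # this token's adapter: hidden x rank
        out.append([
            scale * sum(row[h] * a[h][r] for h in range(len(row)))
            for r in range(len(a[0]))
        ])
    return out
```

The "segmented grouped" aspect is what the real kernel adds: tokens sharing an adapter are batched into segments so the matmul is done per group rather than per token.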
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
Tested against the Qwen2.5 7B model using vllm-ascend
version v0.9.2.rc1: in ACL graph mode, TTFT, TPOT, and throughput
improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@1f83e7d
---------
Signed-off-by: liuchn <909698896@qq.com>
Co-authored-by: liuchn <909698896@qq.com>