
Add Custom Kernels For LoRA Performance #2325

Merged
wangxiyuan merged 2 commits into vllm-project:main from liuchenbing:main_new
Aug 19, 2025

Conversation

@liuchenbing (Contributor) commented Aug 11, 2025

What this PR does / why we need it?

  Add two custom operators (sgmv_shrink and sgmv_expand) to address LoRA performance issues. Also enable the LoRA operators to run in ACL graph mode, improving model inference performance.
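  For reference, the semantics these two operators cover can be sketched in plain PyTorch as below. This is a minimal illustration only; the function names, tensor layouts, and the -1 "no LoRA" convention are assumptions, not the actual Ascend kernel API.

```python
import torch

def sgmv_shrink_ref(x, lora_a, lora_indices, scale):
    """Shrink: project each token through its adapter's A matrix.

    x: (num_tokens, hidden); lora_a: (num_loras, rank, hidden);
    lora_indices: (num_tokens,) adapter id per token, -1 = no LoRA.
    """
    out = torch.zeros(x.size(0), lora_a.size(1), dtype=x.dtype, device=x.device)
    mask = lora_indices >= 0
    if mask.any():
        a = lora_a[lora_indices[mask]]                      # (n, rank, hidden)
        out[mask] = torch.bmm(a, x[mask].unsqueeze(-1)).squeeze(-1) * scale
    return out

def sgmv_expand_ref(y, lora_b, lora_indices, output, add_inputs=True):
    """Expand: project the shrunk activations back and add into the output.

    y: (num_tokens, rank); lora_b: (num_loras, hidden_out, rank);
    output: (num_tokens, hidden_out), updated in place.
    """
    mask = lora_indices >= 0
    if mask.any():
        b = lora_b[lora_indices[mask]]                      # (n, hidden_out, rank)
        delta = torch.bmm(b, y[mask].unsqueeze(-1)).squeeze(-1)
        if add_inputs:
            output[mask] = output[mask] + delta
        else:
            output[mask] = delta
    return output
```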

Does this PR introduce any user-facing change?

  No user-facing change.

How was this patch tested?

  Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2.rc1: in ACL graph mode, TTFT, TPOT, and throughput improved by about 100%.
[screenshot: benchmark results]

Signed-off-by: liuchn <909698896@qq.com>

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces custom SGMV (Segmented Gather Matrix-Vector) kernels for LoRA to improve performance on Ascend devices. The changes include new C++ kernels for sgmv_expand and sgmv_shrink, and updates to the Python wrappers to utilize them.

While the performance improvements are valuable, my review has identified several critical issues that must be addressed. There are potential out-of-bounds memory access vulnerabilities in both of the new C++ kernels (sgmv_expand.cpp, sgmv_shrink.cpp) due to missing bounds checks. Additionally, a critical safety check for LoRA indices has been removed in the Python code (punica_npu.py), which could also lead to out-of-bounds access.
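For illustration, the kind of guard the removed check provided might look like the following. This is a hypothetical sketch; the function name and the -1 sentinel are assumptions, not the actual punica_npu.py code.

```python
import torch

def validate_lora_indices(lora_indices: torch.Tensor, num_loras: int) -> None:
    # -1 is assumed to mean "no LoRA for this token"; any index >= num_loras
    # would make the kernel read past the end of the stacked adapter weights.
    if lora_indices.numel() == 0:
        return
    max_idx = int(lora_indices.max())
    if max_idx >= num_loras:
        raise ValueError(
            f"LoRA index {max_idx} is out of range for {num_loras} loaded adapters")
```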

Furthermore, the Python API wrappers for the new SGMV operations (lora_ops.py) have several unused or ignored parameters, which makes the API confusing and should be cleaned up.

Please address these critical and high-severity issues to ensure the correctness and stability of the new kernels.

Comment thread csrc/kernels/sgmv_expand.cpp Outdated
Comment thread csrc/kernels/sgmv_shrink.cpp Outdated
Comment thread vllm_ascend/lora/punica_wrapper/punica_npu.py Outdated
Comment thread vllm_ascend/lora/punica_wrapper/lora_ops.py
Comment thread vllm_ascend/lora/punica_wrapper/lora_ops.py
Comment thread vllm_ascend/lora/punica_wrapper/lora_ops.py
vLLM version: v0.10.0
vLLM main: vllm-project/vllm@14a5d90
Signed-off-by: liuchn <909698896@qq.com>
bgmv_shrink, sgmv_expand,
sgmv_expand_slice, sgmv_shrink)
else:
#print("This not is 310P")
Collaborator: remove useless comment

wangxiyuan merged commit 3648d18 into vllm-project:main Aug 19, 2025
6 checks passed
Yikun mentioned this pull request Aug 19, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
