[Ascend] perf: optimize rope embedding with triton kernel for huge performance gain #5918
wangxiyuan merged 4 commits into vllm-project:main from
Conversation
Signed-off-by: ZCG12345 <2097562023@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a high-performance Triton kernel for Rotary Position Embedding (RoPE) on Ascend NPUs, which yields a significant performance improvement. The changes include the Triton kernel implementation, its registration as a custom PyTorch op, and integration into the existing RoPE logic. My review focuses on ensuring the correctness and maintainability of the new code. I've identified a potential critical issue regarding an unresolved reference that could lead to a NameError, and some maintainability issues with comments in Chinese that should be translated to English. Overall, the optimization is a great addition, and addressing these points will improve the quality of the contribution.
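For readers unfamiliar with the operator, the rotation RoPE applies can be sketched as a NumPy reference (a simplified neox-style sketch, not the PR's Triton kernel; shapes and names here are illustrative assumptions):

```python
import numpy as np

def rope_reference(x, cos, sin):
    # x: (num_tokens, num_heads, head_dim); cos/sin: (num_tokens, head_dim // 2).
    # Neox-style: the head dimension is split into two halves that are
    # rotated against each other, pair by pair.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    c = cos[:, None, :]  # broadcast over the head axis
    s = sin[:, None, :]
    return np.concatenate([x1 * c - x2 * s, x2 * c + x1 * s], axis=-1)

# Sanity check: rotating by angle 0 (cos=1, sin=0) is the identity.
x = np.arange(12, dtype=np.float64).reshape(1, 2, 6)
out = rope_reference(x, np.ones((1, 3)), np.zeros((1, 3)))
```

Because each pair is a plain 2-D rotation, the transform preserves vector norms, which makes a quick correctness check easy.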
```python
direct_register_custom_op(op_name="rope_forward_triton",
                          op_func=rope_forward_triton,
                          fake_impl=_rope_forward_triton_fake,
                          mutates_args=[],
                          dispatch_key="PrivateUse1")
```
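For context, `direct_register_custom_op` wires a "fake" (meta) implementation alongside the real kernel so the compiler can trace output shapes without executing it. A minimal sketch of that pattern, using NumPy as a stand-in for torch (names and signature are illustrative, not the PR's actual code):

```python
import numpy as np

def rope_forward_fake(positions, query, key):
    """Shape-only ('fake'/meta) implementation: returns outputs with the same
    shape and dtype as the real kernel would, without computing anything.
    A torch fake impl would use torch.empty_like the same way."""
    return np.empty_like(query), np.empty_like(key)

q = np.zeros((4, 8, 64), dtype=np.float16)
k = np.zeros((4, 2, 64), dtype=np.float16)
fq, fk = rope_forward_fake(np.arange(4), q, k)
```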
```python
cos = cos.view(-1, self.rotary_dim)#就是rope_dim 数据转换为2维
sin = sin.view(-1, self.rotary_dim)
```
The comment on line 193 is in Chinese. To ensure the codebase is maintainable and accessible to all contributors, please provide comments in English. If the code is self-explanatory, you can remove the comment.
Suggested change:
```diff
- cos = cos.view(-1, self.rotary_dim)#就是rope_dim 数据转换为2维
- sin = sin.view(-1, self.rotary_dim)
+ cos = cos.view(-1, self.rotary_dim)
+ sin = sin.view(-1, self.rotary_dim)
```
```python
q = query.contiguous().view(query.shape[0], -1,
                            self.head_size)#B,N,H,D 批次 tokens数量,头数,头的维度
k = key.contiguous().view(key.shape[0], -1, self.head_size)
```
This comment is in Chinese. Please translate it to English for better code maintainability. Additionally, the .view() call appears to be redundant as the query and key tensors should already have the correct 3D shape (num_tokens, num_heads, head_size) at this point. A simple .contiguous() call should be sufficient to ensure the tensors are contiguous in memory before passing them to the Triton kernel.
Suggested change:
```diff
- q = query.contiguous().view(query.shape[0], -1,
-                             self.head_size)#B,N,H,D 批次 tokens数量,头数,头的维度
- k = key.contiguous().view(key.shape[0], -1, self.head_size)
+ q = query.contiguous()
+ k = key.contiguous()
```
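The reviewer's point that the `.view()` is redundant is easy to check: reshaping a tensor to a shape it already has is a no-op that returns a view, not a copy. A quick demonstration with NumPy standing in for torch (shapes are illustrative):

```python
import numpy as np

head_size = 64
# (num_tokens, num_heads, head_size): already 3-D, as the reviewer notes.
query = np.zeros((4, 8, head_size), dtype=np.float32)

# Reshaping to (num_tokens, -1, head_size) when the tensor already has that
# layout changes nothing: same shape, same underlying data, no copy.
reshaped = query.reshape(query.shape[0], -1, head_size)
```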
```python
cos, sin = get_cos_and_sin_slice()
if HAS_TRITON:
```
```yaml
# - repo: https://github.com/rhysd/actionlint
#   rev: v1.7.7
#   hooks:
#     - id: actionlint
```
When I ran lint locally, this entry needed to be commented out. Sorry, I forgot to restore it after local testing finished.
```python
if self.cos_sin_cache.dtype != query.dtype:
    self.cos_sin_cache = self.cos_sin_cache.to(query.dtype)
cos, sin = get_cos_and_sin_slice()
if HAS_TRITON:
```
This will go into the Triton RoPE path whenever Triton is installed. We need more performance data to support this change.
Fine, I will change it to the torch_npu branch in the next commit; you can check again then.
Signed-off-by: ZCG12345 <2097562023@qq.com>
Signed-off-by: ZCG12345 <2097562023@qq.com>
```python
self.cos_sin_cache = self.cos_sin_cache.to(query.device)
if self.cos_sin_cache.dtype != query.dtype:
    self.cos_sin_cache = self.cos_sin_cache.to(query.dtype)
cos, sin = get_cos_and_sin_slice()
```
There might be problems with putting this index_select here. Please contact @Angazenn to make sure of this.
get_cos_and_sin_slice does not perform an index_select, so I think it's OK to move it here.
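The lazy-cast-then-gather pattern under discussion can be sketched as follows (NumPy stands in for torch; the class and method names mirror the snippet above but are illustrative, not the PR's actual code):

```python
import numpy as np

class RopeCache:
    """Sketch: cast the cached cos/sin table to the query dtype lazily
    (only when it differs, so the cast happens once rather than every
    forward), then gather the per-token rows."""

    def __init__(self, max_pos=16, rot_dim=8):
        # Precompute angles and store cos/sin halves side by side (float32).
        t = np.linspace(0.0, 1.0, max_pos)[:, None] * np.arange(rot_dim // 2)[None, :]
        self.cos_sin_cache = np.concatenate(
            [np.cos(t), np.sin(t)], axis=-1).astype(np.float32)

    def get_cos_and_sin_slice(self, positions, query_dtype):
        if self.cos_sin_cache.dtype != query_dtype:
            self.cos_sin_cache = self.cos_sin_cache.astype(query_dtype)
        rows = self.cos_sin_cache[positions]  # per-token gather (index_select analogue)
        half = rows.shape[-1] // 2
        return rows[..., :half], rows[..., half:]

cache = RopeCache()
cos, sin = cache.get_cos_and_sin_slice(np.array([0, 3, 3]), np.float16)
```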
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  - [CI] Upgrade CANN to 8.5.0 (vllm-project#6070)
  - Default enable MLAPO (vllm-project#5952)
  - [Doc] Supplement PD separation parameters of DeepSeek V3.1 (vllm-project#6053)
  - [Ascend] perf: optimize rope embedding with triton kernel for huge performance gain (vllm-project#5918)
  - [Ops] update causal_conv1d_update (vllm-project#5984)
  - [CI] Update triton ascend version in 3.2.0 (vllm-project#6067)
  - [bugfix] fix the complex and potentially problematic generate_kv_idx (vllm-project#5957)
…rformance gain (vllm-project#5918)

### What this PR does / why we need it?
1. Implement a **high-performance Triton custom kernel** for the rotary position embedding (RoPE) operator on the **Ascend NPU** platform.
2. Fix critical bugs in the Triton RoPE kernel registration and invocation process: an incorrect fake-impl function name match, the wrong torch ops namespace for the kernel call, a missing `self` parameter in cos/sin slice fetching, and syntax errors in function type annotations.
3. Achieve a major optimization of the core RoPE operator: single-invocation latency drops from **57.1 μs** to **9 μs**, a **6.34x speedup** (**84.24% latency reduction**).
4. The RoPE operator is a **hot path** executed in every transformer layer during LLM inference, so the optimization directly reduces overall inference latency and improves serving throughput on Ascend NPU.
5. Full backward compatibility: the Triton kernel is enabled only when `HAS_TRITON=True` and automatically falls back to the original Ascend NPU native implementation when Triton is unavailable; no functional regression.

### Does this PR introduce _any_ user-facing change?
**NO**
- No changes to any public APIs, interfaces, or inference behaviors of vLLM.
- No impact on the text generation quality or correctness of the model.
- The optimization is transparent to end users; only inference speed (latency/throughput) improves, with no functional change.

### How was this patch tested?
1. **Environment validation**: tested on the Ascend NPU platform with the vLLM-Ascend framework, with Triton installed and enabled (`HAS_TRITON=True`).
2. **Kernel registration test**: verified the Triton RoPE kernel (`rope_forward_triton`) registers successfully into the `torch.ops._C_ascend` namespace without any `ValueError`/`NameError`/`SyntaxError`.
3. **Functional correctness test**: ran large-model (GLM4/MoE) inference on the Ascend NPU platform; the generated text is correct (no garbled output, no logical errors) and consistent with the original implementation.
4. **Performance benchmark test**: measured the single-execution latency of the RoPE operator before and after the change; latency is stably reduced from 57.1 μs to 9 μs, and the gain is reproducible.
5. **Fallback mechanism test**: manually disabled Triton (`HAS_TRITON=False`) and verified the code falls back to the original Ascend NPU native RoPE implementation, with no crash and normal inference.
6. **Compatibility test**: tested with different query/key tensor shapes and sizes; all cases work correctly with the Triton kernel, with no shape-mismatch errors.

- Operator supplied by Hexiang Wang.
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@11b6af5

Signed-off-by: ZCG12345 <2097562023@qq.com>
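The fallback behavior described above amounts to a simple dispatch guard. A minimal sketch with illustrative names (not the PR's actual code; in the real module the flag is set by probing the `triton` import):

```python
# Illustrative flag; the real code derives it from a triton import probe.
HAS_TRITON = False

def _rope_native(positions, query, key):
    # Stand-in for the Ascend NPU native implementation.
    return "native"

def _rope_triton(positions, query, key):
    # Stand-in for the registered Triton kernel.
    return "triton"

def rope_forward(positions, query, key):
    """Dispatch guard: prefer the Triton kernel when available,
    otherwise fall back to the native path."""
    if HAS_TRITON:
        return _rope_triton(positions, query, key)
    return _rope_native(positions, query, key)
```

Because the guard sits at the call site, disabling Triton changes only which kernel runs, not the operator's interface.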