[Ops][Refactor] Remove custom rotary_embedding operator #6523

wangxiyuan merged 2 commits into vllm-project:main from
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @wangxiyuan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request streamlines the codebase by completely removing a custom Rotary Embedding operation. This includes its C++ kernel implementation, associated C++ and Python bindings, meta-registrations, and all dedicated test files. The removal simplifies the project structure and maintenance, indicating a shift towards relying on alternative or upstream implementations for this functionality.

Highlights
Activity
Code Review
This pull request cleanly removes the custom rotary_embedding operator, including its C++ kernel, PyTorch bindings, and associated tests. The implementation now falls back to the native torch_npu._npu_rotary_embedding operator. This is a good simplification of the codebase. The removal appears to be complete, with no dangling references left behind.
To improve the pull request, I recommend filling out the description to provide context for this change, as per the repository's contribution guidelines. I have provided a suggested title and summary below for your convenience.
Suggested PR Title:

[Ops][Refactor] Remove custom rotary_embedding operator

Suggested PR Summary:
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests.
The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain.
### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any user-facing changes.
### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
…to qwen3next_rebase

Merged 'main' of https://github.com/vllm-project/vllm-ascend, including:
- [Patch] Remove the patch of MiniCPM (vllm-project#5975)
- [P/D] layerwise connector support recompute scheduler (vllm-project#5900)
- [CI] Add workflow support for lint image build (vllm-project#6489)
- [Bugfix] Fix problematic dummy_run & improper input_batch_size in eagle (vllm-project#6517)
- [Refactor] 310p_e2e test case update (vllm-project#6539)
- [Refactor] refactor p2p connector (vllm-project#6551)
- [Refactor] refactor 310p attention impl and add ut (vllm-project#6579)
- [Refactor] refactor 310p ops and add ut (vllm-project#6591)
- [Ops][Refactor] Remove custom rotary_embedding operator (vllm-project#6523)
- [Lint] Style: Convert `vllm-ascend/` to ruff format (new Batch vllm-project#8) (vllm-project#6604)
- [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (vllm-project#5301)
- [CI] Fix broken CI (vllm-project#6599)
- [Lint] Style: Convert `vllm-ascend/` to ruff format (Batch vllm-project#10) (vllm-project#6173)
- [Lint] Style: Convert `vllm-ascend/` to ruff format (Batch vllm-project#11) (vllm-project#6176)
- [Lint] Style: Convert `vllm-ascend/` to ruff format (Batch vllm-project#8) (vllm-project#6129)
- [Lint] Style: Convert `vllm-ascend/` to ruff format (Batch vllm-project#7) (vllm-project#6023)
- [CI][Misc] Some improvement for github action (vllm-project#6587)
- [Image] Bump mooncake version to v0.3.8.post1 (vllm-project#6428)
…cache (#6602)

### What this PR does / why we need it?

This PR adapts #5450 and #6523 to v0.13.0, to fix #6612. It extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to accept `cos_sin_cache` and `positions` as inputs, fully aligning with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefits are:

1. Avoiding pre-computation of `cos` and `sin` before model execution, which removes redundant code.
2. Allowing the eagle3 draft model to use different RoPE parameters than the main model (see #6612). This helps to recover accept rate and accuracy in that case.

In addition, this kernel change introduces only a very small performance degradation: the `index_select` and `chunk` operations are replaced by simple memory accesses in the triton kernel.

**Highlights**
- **RoPE Cache Unification**: Replaced the separate `_sin` and `_cos` global tensors with a unified `cos_sin_cache` and an explicit `positions` tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated the Triton kernels (`split_qkv_rmsnorm_rope_kernel`, `_triton_rope`) to consume `cos_sin_cache` and `positions` directly for more efficient and integrated RoPE calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified `rope_forward_oot` to accept the new `cos_sin_cache` and `positions` arguments, enabling a more flexible and integrated RoPE application within the system.

---------

Signed-off-by: Angazenn <supperccell@163.com>
…#6523)

### What this PR does / why we need it?

This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests. The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring and does not introduce any user-facing changes.

### How was this patch tested?

The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
### What this PR does / why we need it?

This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring and does not introduce any user-facing changes.

### How was this patch tested?

The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings.
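For readers unfamiliar with the operator being removed, here is a minimal NumPy reference sketch of what a neox-style rotary embedding computes, using a vLLM-style `cos_sin_cache` (cos in the first half of each row, sin in the second). The function names, tensor layout, and shapes are illustrative assumptions for this sketch; they are not the `torch_npu._npu_rotary_embedding` API, whose signature is internal to torch_npu.

```python
import numpy as np

def build_cos_sin_cache(max_pos: int, rot_dim: int, base: float = 10000.0) -> np.ndarray:
    """Build a [max_pos, rot_dim] cache: cos in the first half, sin in the second."""
    inv_freq = 1.0 / base ** (np.arange(0, rot_dim, 2) / rot_dim)
    freqs = np.outer(np.arange(max_pos), inv_freq)     # [max_pos, rot_dim // 2]
    return np.concatenate([np.cos(freqs), np.sin(freqs)], axis=-1)

def apply_rope(positions: np.ndarray, x: np.ndarray, cache: np.ndarray) -> np.ndarray:
    """Neox-style RoPE: x is [num_tokens, num_heads, head_dim], positions is [num_tokens]."""
    half = x.shape[-1] // 2
    cos_sin = cache[positions]                         # gather one cache row per token
    cos = cos_sin[:, None, :half]                      # [num_tokens, 1, half] broadcasts over heads
    sin = cos_sin[:, None, half:]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[i], x2[i]) pair by the position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 2, 8))                     # 3 tokens, 2 heads, head_dim 8
cache = build_cos_sin_cache(max_pos=16, rot_dim=8)
out = apply_rope(np.array([0, 5, 11]), q, cache)
```

Two sanity properties make this easy to test against any backend: position 0 is the identity (all angles are zero), and a rotation never changes a vector's norm. CI tests for attention layers exercise the same behavior end to end through the native op.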