Skip to content

[Ops][Refactor] Remove custom rotary_embedding operator#6523

Merged
wangxiyuan merged 2 commits intovllm-project:mainfrom
wangxiyuan:remove_rotary_embedding
Feb 7, 2026
Merged

[Ops][Refactor] Remove custom rotary_embedding operator#6523
wangxiyuan merged 2 commits intovllm-project:mainfrom
wangxiyuan:remove_rotary_embedding

Conversation

@wangxiyuan
Copy link
Collaborator

@wangxiyuan wangxiyuan commented Feb 4, 2026

What this PR does / why we need it?

This PR removes the custom rotary_embedding operator and its associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native torch_npu._npu_rotary_embedding implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain.

Does this PR introduce any user-facing change?

No. This is an internal refactoring and does not introduce any user-facing changes.

How was this patch tested?

The tests for the custom rotary_embedding operator have been removed along with the operator itself. The correctness of the fallback to the native torch_npu implementation is verified by existing CI tests for attention layers and models that use rotary embeddings.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 4, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @wangxiyuan, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the codebase by completely removing a custom Rotary Embedding operation. This includes its C++ kernel implementation, associated C++ and Python bindings, meta-registrations, and all dedicated test files. The removal simplifies the project structure and maintenance, indicating a shift towards relying on alternative or upstream implementations for this functionality.

Highlights

  • Custom Rotary Embedding Kernel Removal: The entire custom C++ kernel implementation for Rotary Embedding (pos_encoding_kernels.cpp) has been removed from the codebase.
  • C++ Bindings and Registrations Clean-up: All declarations and implementations related to the custom Rotary Embedding operation have been removed from csrc/ops.h, csrc/torch_binding.cpp, and csrc/torch_binding_meta.cpp, including its registration with the PyTorch operator library.
  • Test File Deletion: Both unit tests (tests/ut/ops/test_rotary_embedding.py) and end-to-end tests (tests/e2e/nightly/single_node/ops/singlecard_ops/test_rotary_embedding.py) specifically designed for the custom Rotary Embedding have been removed.
  • Python Logic Simplification: The Python-side logic in vllm_ascend/ops/rotary_embedding.py that conditionally invoked the custom kernel has been removed, along with the _custom_rotary_embedding_enabled helper function.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • csrc/kernels/pos_encoding_kernels.cpp
    • Removed the entire file, which contained the custom Rotary Embedding kernel.
  • csrc/ops.h
    • Removed the external declaration for rotary_embedding_impl.
  • csrc/torch_binding.cpp
    • Removed the rotary_embedding function and its corresponding torch.ops._C_ascend registration.
  • csrc/torch_binding_meta.cpp
    • Removed the rotary_embedding_meta function and its meta-implementation registration.
  • tests/e2e/nightly/single_node/ops/singlecard_ops/test_rotary_embedding.py
    • Removed the end-to-end test file for the custom rotary embedding.
  • tests/ut/ops/test_rotary_embedding.py
    • Removed the unit test file for the custom rotary embedding.
  • vllm_ascend/meta_registration.py
    • Removed the rotary_embedding_meta function and its registration from the meta-registration utility.
  • vllm_ascend/ops/rotary_embedding.py
    • Removed the _custom_rotary_embedding_enabled function.
    • Removed the conditional logic that called the custom torch.ops._C_ascend.rotary_embedding kernel.
Activity
  • The pull request was opened by wangxiyuan with a title indicating a miscellaneous change to remove the custom Rotary Embedding operation.
  • The description body is empty, suggesting the change is self-explanatory given the title and file modifications.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request cleanly removes the custom rotary_embedding operator, including its C++ kernel, PyTorch bindings, and associated tests. The implementation now falls back to the native torch_npu._npu_rotary_embedding operator. This is a good simplification of the codebase. The removal appears to be complete, with no dangling references left behind.

To improve the pull request, I recommend filling out the description to provide context for this change, as per the repository's contribution guidelines. I have provided a suggested title and summary below for your convenience.

Suggested PR Title:

[Ops][Refactor] Remove custom rotary_embedding operator

Suggested PR Summary:

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native `torch_npu._npu_rotary_embedding` implementation. This change simplifies the codebase by removing custom, platform-specific kernel code and relying on the standard NPU library implementation, which is presumably more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed along with the operator itself. The correctness of the fallback to the native `torch_npu` implementation is verified by existing CI tests for attention layers and models that use rotary embeddings.

@wangxiyuan wangxiyuan changed the title [Misc] Remove custom op rotary_embedding [Ops][Refactor] Remove custom rotary_embedding operator Feb 4, 2026
@wangxiyuan wangxiyuan force-pushed the remove_rotary_embedding branch from 43b9bc2 to 4071c57 Compare February 4, 2026 01:37
@wangxiyuan wangxiyuan added ready read for review ready-for-test start test by label for PR labels Feb 4, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@wangxiyuan wangxiyuan force-pushed the remove_rotary_embedding branch from f9d64c3 to a86fcba Compare February 6, 2026 07:39
@wangxiyuan wangxiyuan merged commit 6c49f95 into vllm-project:main Feb 7, 2026
27 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 9, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Patch] Remove the patch of MiniCPM (vllm-project#5975)
  [P/D] layerwise connector support recompute scheduler (vllm-project#5900)
  [CI] Add workflow support for lint image build (vllm-project#6489)
  [Bugfix] Fix problematic dummy_run & improper input_batch_size in eagle (vllm-project#6517)
  [Refactor]310p_e2e test case update (vllm-project#6539)
  [Refactor]refactor p2p connector (vllm-project#6551)
  [Refactor]refactor 310p attention impl and add ut (vllm-project#6579)
  [Refactor]refactor 310p ops and add ut (vllm-project#6591)
  [Ops][Refactor] Remove custom rotary_embedding operator (vllm-project#6523)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(new Batch vllm-project#8) (vllm-project#6604)
  [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder  (vllm-project#5301)
  [CI] Fix broken CI (vllm-project#6599)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#10) (vllm-project#6173)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#11) (vllm-project#6176)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#8) (vllm-project#6129)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#7) (vllm-project#6023)
  [CI][Misc] Some improvement for github action (vllm-project#6587)
  [Image] Bump mooncake version to v0.3.8.post1 (vllm-project#6428)
wangxiyuan pushed a commit that referenced this pull request Feb 11, 2026
…cache (#6602)

### What this PR does / why we need it?
This PR adapts #5450, #6523 to v0.13.0, to fix #6612 .

This PR extends original `rope_triton_forward` and
`split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as
inputs. This fully aligns to vLLM RoPE api interface. Compared with
earlier implementation for RoPE, the benefits are:

1. avoiding pre-computation of `cos` `sin` before model execution, which
helps to remove redundant codes.
2. allowing eagle3 draft model to have different rope parameters with
main model (see #6612 ). This help to recover accept rate && accuracy in
that case.

In addition, this kernel change only introduces very small performance
degradation. Those `index_select` or `chunk` operations are now changed
into simple memory access in triton kernel

**Highlights**

- **RoPE Cache Unification**: Replaced separate _sin and _cos global
tensors with a unified cos_sin_cache and explicit positions tensor for
Rotary Positional Embeddings (RoPE), streamlining data handling.
- **Triton Kernel Integration**: Updated Triton kernels
(split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the
cos_sin_cache and positions for more efficient and integrated RoPE
calculations.
- **Custom Operation Registration**: Registered `rope_forward_oot` as a
new custom operation, allowing its use in fused compilation passes and
providing a dedicated entry point for the new RoPE implementation.
- **Refactored RoPE Forward Pass**: Modified the rope_forward_oot
function to accept the new cos_sin_cache and positions arguments,
enabling a more flexible and integrated RoPE application within the
system.

---------

Signed-off-by: Angazenn <supperccell@163.com>
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
…#6523)

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
@wangxiyuan wangxiyuan mentioned this pull request Feb 24, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…#6523)

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…#6523)

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…#6523)

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…#6523)

### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant