[v0.13.0][Ops] Make triton rope support index_selecting from cos_sin_cache#6602
Conversation
Signed-off-by: Angazenn <supperccell@163.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a significant refactoring of how Rotary Positional Embeddings (RoPE) are handled within the system. The core change involves transitioning from separate global `sin` and `cos` caches to a unified `cos_sin_cache` indexed by a `positions` tensor.
Activity
Code Review
This pull request refactors the Rotary Position Embedding (RoPE) implementation to use a unified cos_sin_cache and positions tensor, moving away from global sin and cos caches. This is a good change for code clarity and maintainability. The PR introduces a new Triton kernel and updates fusion passes accordingly. I've found a couple of critical syntax errors that will prevent the code from running. Please see the detailed comments. Additionally, the PR title and description should be updated to follow the repository's style guide.
Suggested PR Title:
[main][Attention][Feature] Refactor RoPE to use cos_sin_cache

Suggested PR Summary:
### What this PR does / why we need it?
This PR refactors the Rotary Position Embedding (RoPE) implementation to use a unified `cos_sin_cache` and `positions` tensor. This change simplifies the RoPE logic by removing global `sin` and `cos` caches and making data dependencies explicit. It introduces a new Triton kernel `rope_forward_triton_with_positions` and updates fusion passes to leverage this new implementation.
This refactoring improves code clarity, maintainability, and is a prerequisite for further performance optimizations by enabling more fusion opportunities.
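To make the unified-cache design concrete, here is a minimal, hedged sketch of a non-fused reference RoPE that consumes a `cos_sin_cache` and a `positions` tensor. The function name `rope_ref` and the cache layout (first half cos, second half sin per row, as in vLLM's `RotaryEmbedding`) are illustrative assumptions, not the PR's exact API:

```python
import torch

def rope_ref(positions: torch.Tensor, q: torch.Tensor,
             cos_sin_cache: torch.Tensor, rotary_dim: int) -> torch.Tensor:
    """Reference (non-fused) neox-style RoPE using a unified cos_sin_cache.

    positions:     [num_tokens] long tensor of token positions
    q:             [num_tokens, num_heads, head_size]
    cos_sin_cache: [max_position, rotary_dim], row = [cos | sin]

    Hypothetical sketch: names/layout assumed, not taken from the PR.
    """
    # Gather per-token cos/sin rows by position -- this is the index_select
    # that the fused Triton kernel replaces with direct in-kernel loads.
    cos_sin = cos_sin_cache.index_select(0, positions)   # [T, rotary_dim]
    cos, sin = cos_sin.chunk(2, dim=-1)                  # [T, rotary_dim//2] each
    cos = cos.unsqueeze(1)                               # broadcast over heads
    sin = sin.unsqueeze(1)

    x1 = q[..., : rotary_dim // 2]
    x2 = q[..., rotary_dim // 2 : rotary_dim]
    out = q.clone()
    out[..., : rotary_dim // 2] = x1 * cos - x2 * sin
    out[..., rotary_dim // 2 : rotary_dim] = x2 * cos + x1 * sin
    return out
```

A fused kernel would perform the same per-token cache lookup inside the kernel body rather than materializing the gathered rows first.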
### Does this PR introduce _any_ user-facing change?
No, this is a refactoring of the internal implementation and does not introduce any user-facing changes.
### How was this patch tested?
CI should pass with existing and any new unit tests. It is recommended to add specific unit tests for the new Triton kernels and to verify the correctness of the `qknorm_rope_fusion_pass`.

Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
…s not supported (#6749)

### What this PR does / why we need it?
With #6602, `npu_rotary_embedding` unifies all RoPE implementations in AscendRotaryEmbedding and allows a wider range of applications of the fusion op `split_qkv_rmsnorm_rope`. This PR restricts the fusion of `split_qkv_rmsnorm_rope` to only cases where `head_size` == 128 && `rotary_dim` == `head_size`. Further enhancement and generalization of this op will be accomplished by @whx-sjtu.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Angazenn <supperccell@163.com>
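The shape restriction described in #6749 amounts to a simple eligibility guard in the fusion pass. A minimal sketch, with a hypothetical function name (the actual pass structure in the repository may differ):

```python
def can_fuse_split_qkv_rmsnorm_rope(head_size: int, rotary_dim: int) -> bool:
    """Illustrative guard: after #6749 the fused split_qkv_rmsnorm_rope path
    is taken only for full-rotary, 128-dim heads; all other shapes fall back
    to the unfused ops. Name and placement are assumptions for illustration.
    """
    return head_size == 128 and rotary_dim == head_size
```

Models with partial rotary embeddings (`rotary_dim < head_size`) or other head sizes would therefore skip the fused path until the op is generalized.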
What this PR does / why we need it?
This PR adapts #5450 and #6523 to v0.13.0, to fix #6612.
This PR extends the original `rope_triton_forward` and `split_qkv_rmsnorm_rope` to support `cos_sin_cache` && `positions` as inputs. This fully aligns with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefit is that there is no need to generate `cos` and `sin` before model execution, which helps remove redundant code. In addition, this kernel change introduces only a very small performance degradation: the `index_select` or `chunk` operations are now replaced by simple memory accesses inside the Triton kernel.

Highlights
This PR registers `rope_forward_oot` as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.

Does this PR introduce any user-facing change?
How was this patch tested?
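The performance claim above rests on `index_select`/`chunk` gathers and direct indexed loads being equivalent, which is also the property a unit test for the new kernel would check. A small self-contained sketch of that equivalence (the real test would call the Triton kernel; here plain advanced indexing stands in for the in-kernel `tl.load`, and all names are illustrative):

```python
import torch

def gather_cos_sin_index_select(cache: torch.Tensor,
                                positions: torch.Tensor) -> torch.Tensor:
    # Pre-kernel approach: materialize per-token rows via index_select.
    return cache.index_select(0, positions)

def gather_cos_sin_direct(cache: torch.Tensor,
                          positions: torch.Tensor) -> torch.Tensor:
    # What the Triton kernel does per program: load cache[pos] directly;
    # advanced indexing stands in for tl.load in this sketch.
    return cache[positions]

positions = torch.tensor([3, 0, 2, 2])        # duplicates are allowed
cache = torch.randn(8, 128)                   # [max_position, rotary_dim]
assert torch.equal(gather_cos_sin_index_select(cache, positions),
                   gather_cos_sin_direct(cache, positions))
```

A correctness test for the kernel itself would follow the same shape: run the Triton path and a plain PyTorch reference on identical inputs and assert elementwise closeness.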