
[v0.13.0][Ops] Make triton rope support index_selecting from cos_sin_cache#6602

Merged
wangxiyuan merged 7 commits into vllm-project:releases/v0.13.0 from Angazenn:cos_sin_dev
Feb 11, 2026

Conversation


@Angazenn Angazenn commented Feb 6, 2026

What this PR does / why we need it?

This PR adapts #5450 and #6523 to v0.13.0, fixing #6612.

This PR extends the original rope_triton_forward and split_qkv_rmsnorm_rope to accept cos_sin_cache and positions as inputs, fully aligning with the vLLM RoPE API interface. Compared with the earlier RoPE implementation, the benefits are:

  1. avoiding pre-computation of cos/sin before model execution, which removes redundant code.
  2. allowing the eagle3 draft model to use RoPE parameters different from the main model's (see [Bug]: Potential accuracy && accept rate degradation if rope parameters in eagle3 draft model is different from main model. #6612). This helps recover accept rate and accuracy in that case.

In addition, this kernel change introduces only a very small performance overhead: the former index_select and chunk operations become simple memory accesses inside the Triton kernel.
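The gather-then-rotate that the kernel now performs can be sketched in plain PyTorch. This is a minimal neox-style reference, not the actual Triton code; the function name and the `[cos | sin]` row layout of the cache are assumptions for illustration:

```python
import torch

def rope_reference(positions: torch.Tensor, x: torch.Tensor,
                   cos_sin_cache: torch.Tensor, rotary_dim: int) -> torch.Tensor:
    """Neox-style RoPE reference, gathering cos/sin rows by token position.

    cos_sin_cache rows are assumed to be [cos | sin], each half
    rotary_dim // 2 wide; x is [num_tokens, num_heads, head_size].
    """
    cos_sin = cos_sin_cache.index_select(0, positions)   # [T, rotary_dim]
    cos, sin = cos_sin.chunk(2, dim=-1)                  # [T, rotary_dim // 2]
    cos = cos.unsqueeze(1)  # broadcast over heads: [T, 1, rotary_dim // 2]
    sin = sin.unsqueeze(1)
    x1, x2 = x[..., :rotary_dim].chunk(2, dim=-1)
    rotated = torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
    # Channels beyond rotary_dim pass through unrotated.
    return torch.cat((rotated, x[..., rotary_dim:]), dim=-1)
```

In the Triton version, the `index_select` and `chunk` above collapse into per-token pointer loads inside the kernel, which is why the extra overhead is small.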

Highlights

  • RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
  • Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations.
  • Custom Operation Registration: Registered rope_forward_oot as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
  • Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: Angazenn <supperccell@163.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @Angazenn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring of how Rotary Positional Embeddings (RoPE) are handled within the system. The core change involves transitioning from using separate sin and cos tensors, often managed globally, to a more integrated approach utilizing a cos_sin_cache tensor alongside explicit positions information. This update streamlines the RoPE application, especially within performance-critical Triton kernels and compilation passes, by providing a unified source for positional embedding data. The refactoring simplifies the RoPE implementation, removes global state dependencies, and prepares the groundwork for potentially more optimized RoPE computations.

Highlights

  • RoPE Cache Unification: Replaced separate _sin and _cos global tensors with a unified cos_sin_cache and explicit positions tensor for Rotary Positional Embeddings (RoPE), streamlining data handling.
  • Triton Kernel Integration: Updated Triton kernels (split_qkv_rmsnorm_rope_kernel, _triton_rope) to directly consume the cos_sin_cache and positions for more efficient and integrated RoPE calculations.
  • Refactored RoPE Forward Pass: Modified the rope_forward_oot function to accept the new cos_sin_cache and positions arguments, enabling a more flexible and integrated RoPE application within the system.
  • Custom Operation Registration: Registered rope_forward_triton_with_positions as a new custom operation, allowing its use in fused compilation passes and providing a dedicated entry point for the new RoPE implementation.
  • Deprecated Global Updates: Removed the update_cos_sin function and its calls, indicating a shift away from global state management for RoPE trigonometric values in favor of explicit cache passing.


Changelog
  • vllm_ascend/ascend_forward_context.py
    • Removed _sin and _cos global variables.
  • vllm_ascend/compilation/passes/qknorm_rope_fusion_pass.py
    • Updated get_inputs to provide cos_sin_cache and positions instead of cos and sin.
    • Modified pattern and replacement functions to use cos_sin_cache and positions for torch.ops.vllm.rope_forward_triton_with_positions and torch.ops.vllm.qkv_rmsnorm_rope.
  • vllm_ascend/ops/register_custom_ops.py
    • Imported rope_forward_triton_with_positions.
    • Registered rope_forward_triton_with_positions as a new custom operation.
  • vllm_ascend/ops/rotary_embedding.py
    • Renamed _rope_forward_oot to rope_forward_oot.
    • Modified rope_forward_oot to accept cos_sin_cache, head_size, and rotary_dim as direct arguments.
    • Integrated a new Triton-based RoPE path using torch.ops.vllm.rope_forward_triton_with_positions.
    • Updated AscendRotaryEmbedding's forward_oot and forward methods to pass the new RoPE arguments.
  • vllm_ascend/ops/triton/linearnorm/split_qkv_rmsnorm_rope.py
    • Modified split_qkv_rmsnorm_rope_kernel to use cos_sin_ptr and pos_ptr for fetching trigonometric values.
    • Updated split_qkv_rmsnorm_rope_impl and split_qkv_rmsnorm_rope_impl_fake to accept cos_sin_cache and positions.
  • vllm_ascend/ops/triton/rope.py
    • Added torch import.
    • Enhanced _triton_rope kernel to support loading cos and sin from a combined cos_sin_cache using positions.
    • Introduced rope_forward_triton_with_positions as a new entry point for RoPE with cached values and positions.
  • vllm_ascend/spec_decode/eagle_proposer.py
    • Removed import of update_cos_sin.
    • Eliminated calls to update_cos_sin in dummy_run, _propose, and _run_merged_draft.
Activity
  • The pull request is currently in a draft state, indicating ongoing development.
  • No specific review comments or approvals have been recorded yet.

@Angazenn Angazenn added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Feb 6, 2026
@Angazenn Angazenn marked this pull request as ready for review February 6, 2026 09:38

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Rotary Position Embedding (RoPE) implementation to use a unified cos_sin_cache and positions tensor, moving away from global sin and cos caches. This is a good change for code clarity and maintainability. The PR introduces a new Triton kernel and updates fusion passes accordingly. I've found a couple of critical syntax errors that will prevent the code from running. Please see the detailed comments. Additionally, the PR title and description should be updated to follow the repository's style guide.

Suggested PR Title:

[main][Attention][Feature] Refactor RoPE to use cos_sin_cache

Suggested PR Summary:

### What this PR does / why we need it?

This PR refactors the Rotary Position Embedding (RoPE) implementation to use a unified `cos_sin_cache` and `positions` tensor. This change simplifies the RoPE logic by removing global `sin` and `cos` caches and making data dependencies explicit. It introduces a new Triton kernel `rope_forward_triton_with_positions` and updates fusion passes to leverage this new implementation.

This refactoring improves code clarity, maintainability, and is a prerequisite for further performance optimizations by enabling more fusion opportunities.

### Does this PR introduce _any_ user-facing change?

No, this is a refactoring of the internal implementation and does not introduce any user-facing changes.

### How was this patch tested?

CI should pass with existing and any new unit tests. It is recommended to add specific unit tests for the new Triton kernels and to verify the correctness of the `qknorm_rope_fusion_pass`.

Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn marked this pull request as draft February 7, 2026 04:22
@Angazenn Angazenn marked this pull request as ready for review February 7, 2026 06:17
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
@Angazenn Angazenn changed the title from "[draft]support cos_sin_cache" to "[v0.13.0][Ops] Make triton rope support index_selecting from cos_sin_cache" Feb 10, 2026
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Angazenn <supperccell@163.com>
@wangxiyuan wangxiyuan merged commit 2dc55ac into vllm-project:releases/v0.13.0 Feb 11, 2026
13 checks passed
whx-sjtu pushed a commit that referenced this pull request Feb 14, 2026
…s not supported (#6749)

### What this PR does / why we need it?
With #6602, `npu_rotary_embedding` unifies all RoPE implementations in AscendRotaryEmbedding, but it also widens the range of cases in which the fusion op `split_qkv_rmsnorm_rope` can be applied. This PR restricts the fusion of `split_qkv_rmsnorm_rope` to cases where `head_size == 128` and `rotary_dim == head_size`. Further enhancement and generalization of this op will be accomplished by @whx-sjtu.
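The restriction described above amounts to a simple applicability guard before the fusion pass fires. A hypothetical sketch (the function name is illustrative, not the actual vllm-ascend code):

```python
def can_fuse_split_qkv_rmsnorm_rope(head_size: int, rotary_dim: int) -> bool:
    """Gate the split_qkv_rmsnorm_rope fusion to the only configuration
    the fused Triton op currently handles: 128-wide heads with the
    rotation applied to the full head."""
    return head_size == 128 and rotary_dim == head_size
```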

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Angazenn <supperccell@163.com>