
support triton of mrope #5664

Merged
wangxiyuan merged 1 commit into vllm-project:main from shiyuan680:mrope
Jan 13, 2026

Conversation

Contributor

@shiyuan680 shiyuan680 commented Jan 6, 2026

What this PR does / why we need it?

This PR adds support for a Triton mRoPE path analogous to cuda_forward; its performance is on par with the AscendC ops.
The Triton op requires CANN 8.5.0.

Does this PR introduce any user-facing change?

How was this patch tested?

Tested with Qwen3-VL-235B on TextVQA accuracy:
native: 81.82
NPU Triton: 81.58
CUDA Triton: 81.52

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Triton-based mRoPE, including a new end-to-end test, updates to unit tests, and the core implementation in AscendMRotaryEmbedding. While the feature is a good addition, I've found several critical issues that must be addressed. The new e2e test contains bugs in its reference implementation and test logic, which invalidates its results. More importantly, the AscendMRotaryEmbedding implementation has a critical flaw in its caching logic that will lead to incorrect computations by using stale data. I've also noted a minor issue in the unit test mocks. Please see the detailed comments for specifics and suggestions for fixes.


if mrope_section[1] > 0:
cos_row[t_end:h_end] = token_cos[t_end:h_end, 1]
sin_row[t_end:h_end] = token_cos[t_end:h_end, 1]

critical

There appears to be a copy-paste error here. sin_row is being assigned values from token_cos instead of token_sin. This makes the reference implementation incorrect, and the test will validate the Triton kernel against wrong values.

Suggested change
sin_row[t_end:h_end] = token_cos[t_end:h_end, 1]
sin_row[t_end:h_end] = token_sin[t_end:h_end, 1]
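The sectioning that this reference implementation performs can be sketched in plain Python. Everything below is illustrative, not vllm's API: `mrope_select` and the `(3, 2, 1)` split are stand-ins showing how each t/h/w section of the rotary half-dim takes its cos (or sin) values from the corresponding 3D position variant.

```python
# Illustrative sketch (not vllm's API) of mrope_section slicing: section i of
# the merged cos/sin row is taken from position variant i (t, h, w).

def mrope_select(rows, mrope_section):
    """rows: one cos (or sin) row per position variant, each of length
    sum(mrope_section). Returns the merged row."""
    out, start = [], 0
    for row, width in zip(rows, mrope_section):
        end = start + width
        out.extend(row[start:end])  # take this variant's slice of the dim
        start = end
    return out

# Toy example: rotary half-dim 6 split as t=3, h=2, w=1.
t_row = [10, 11, 12, 13, 14, 15]
h_row = [20, 21, 22, 23, 24, 25]
w_row = [30, 31, 32, 33, 34, 35]
merged = mrope_select([t_row, h_row, w_row], (3, 2, 1))
assert merged == [10, 11, 12, 23, 24, 35]
```

A small reference like this makes the flagged bug easy to unit-test: building the sin row from the cos rows would immediately fail such an assertion.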

Comment on lines +168 to +175
q_gold, k_gold = pytorch_forward_native(q_gold,
k_gold,
cos,
sin,
mrope_section,
num_tokens,
head_size,
True)

critical

The arguments passed to pytorch_forward_native are incorrect. The head_size parameter is receiving num_tokens, and the rotary_dim parameter is receiving head_size. The actual rotary_dim from the test parameters is not being used at all. This makes the test logic flawed and it will not correctly validate the Triton kernel's output.

Suggested change
q_gold, k_gold = pytorch_forward_native(q_gold,
k_gold,
cos,
sin,
mrope_section,
num_tokens,
head_size,
True)
q_gold, k_gold = pytorch_forward_native(q_gold,
k_gold,
cos,
sin,
mrope_section,
head_size,
rotary_dim,
True)
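Swapped-positional-argument bugs like this one are ruled out by calling with keyword arguments. The sketch below is hypothetical: the signature merely mirrors the parameter order described in the comment, and the stand-in body just echoes the two size parameters so the binding is visible.

```python
# Hypothetical sketch: keyword arguments make the swapped-argument call above
# impossible. Signature is illustrative, not the real test helper.

def pytorch_forward_native(q, k, cos, sin, mrope_section, head_size,
                           rotary_dim, is_neox_style):
    # Stand-in body: echo the two size parameters to show what got bound.
    return head_size, rotary_dim

num_tokens, head_size, rotary_dim = 4096, 128, 64

# Positional call: whatever sits in slot 6 becomes head_size, right or wrong.
wrong = pytorch_forward_native("q", "k", "cos", "sin", (3, 2, 1),
                               num_tokens, head_size, True)
assert wrong == (4096, 128)  # num_tokens silently bound to head_size

# Keyword call: each value is bound to the parameter it names.
right = pytorch_forward_native(q="q", k="k", cos="cos", sin="sin",
                               mrope_section=(3, 2, 1),
                               head_size=head_size, rotary_dim=rotary_dim,
                               is_neox_style=True)
assert right == (128, 64)
```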

Comment thread vllm_ascend/ops/rotary_embedding.py Outdated
Comment on lines +552 to +553
self.cos = None
self.sin = None

critical

These attributes are used in forward_triton to cache cos and sin values. However, this caching is stateful and incorrect because the values depend on positions, which can change between calls. This will lead to using stale cached data. These attributes should be removed, and the caching logic in forward_triton should be corrected to be stateless within the forward pass.

Comment on lines +575 to +594
if self.cos is None and self.sin is None:
cos_sin = self.cos_sin_cache[positions] # type: ignore
cos, sin = cos_sin.chunk(2, dim=-1)
self.cos = cos.contiguous()
self.sin = sin.contiguous()
query_shape = query.shape
key_shape = key.shape

assert self.mrope_section

q, k = triton_mrope(
query,
key,
self.cos,
self.sin,
self.mrope_section,
self.head_size,
self.rotary_dim,
self.mrope_interleaved,
)

critical

The caching logic for cos and sin is incorrect. These tensors depend on positions, which can change on each call to forward_triton. Caching them as instance attributes (self.cos, self.sin) will cause subsequent calls with different positions to use stale values, leading to incorrect results. The cos and sin tensors should be computed on every call and not stored on self.

        cos_sin = self.cos_sin_cache[positions]  # type: ignore
        cos, sin = cos_sin.chunk(2, dim=-1)
        query_shape = query.shape
        key_shape = key.shape

        assert self.mrope_section

        q, k = triton_mrope(
            query,
            key,
            cos.contiguous(),
            sin.contiguous(),
            self.mrope_section,
            self.head_size,
            self.rotary_dim,
            self.mrope_interleaved,
        )
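The failure mode being described can be reproduced with a minimal stand-in (all names here are illustrative, not vllm's): because cos depends on `positions`, computing it only once and storing it on `self` makes every later call reuse the first call's values.

```python
# Minimal stand-in for the stale-cache bug: cos depends on positions, so a
# once-only instance-attribute cache serves stale data on later calls.

class BadCache:
    def __init__(self, table):
        self.table = table   # stand-in for cos_sin_cache
        self.cos = None

    def forward(self, positions):
        if self.cos is None:                       # computed only once...
            self.cos = [self.table[p] for p in positions]
        return self.cos                            # ...later positions ignored

table = [0, 10, 20, 30]
m = BadCache(table)
assert m.forward([0, 1]) == [0, 10]
assert m.forward([2, 3]) == [0, 10]   # stale: should be [20, 30]

# The stateless fix is to index the table on every call:
assert [table[p] for p in [2, 3]] == [20, 30]
```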

Comment thread tests/ut/ops/test_rotary_embedding.py Outdated
@patch('vllm.triton_utils.HAS_TRITON', True)
@patch('vllm.config.ModelConfig.__post_init__', MagicMock())
@patch('vllm.config.VllmConfig.__post_init__', MagicMock())
@patch('vllm.triton_utils.HAS_TRITON', return_value=True)

high

This patch for vllm.triton_utils.HAS_TRITON is both redundant and incorrect. It's redundant because HAS_TRITON is already patched on line 457. It's incorrect because it uses return_value=True, which is meant for patching callables, but HAS_TRITON is a boolean constant. Please remove this line.
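The distinction can be sketched with `unittest.mock` against a throwaway module (the `flags` module below is a stand-in, not vllm's `triton_utils`): passing the replacement value directly swaps in the boolean, while `return_value=True` installs a MagicMock configured to return True when called, which a boolean constant never is.

```python
# Throwaway sketch: patching a boolean constant vs. a callable. The `flags`
# module is a stand-in, not vllm's triton_utils.
import sys
import types
from unittest import mock

flags = types.ModuleType("flags")
flags.HAS_TRITON = False
sys.modules["flags"] = flags

# Correct for a constant: pass the replacement value directly.
with mock.patch("flags.HAS_TRITON", True):
    assert flags.HAS_TRITON is True

# Incorrect for a constant: return_value installs a MagicMock, not a bool.
with mock.patch("flags.HAS_TRITON", return_value=True):
    assert isinstance(flags.HAS_TRITON, mock.MagicMock)
    assert flags.HAS_TRITON is not True  # truthy mock, but not the bool True

assert flags.HAS_TRITON is False  # both patches are undone on exit
```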


github-actions bot commented Jan 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@shiyuan680 shiyuan680 force-pushed the mrope branch 4 times, most recently from 9e4526f to 0fab5fb Compare January 7, 2026 08:09
@weijinqian0 weijinqian0 added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 8, 2026
@shiyuan680 shiyuan680 force-pushed the mrope branch 2 times, most recently from 23d51ed to fddc350 Compare January 8, 2026 08:17
Comment thread vllm_ascend/ops/rotary_embedding.py Outdated
"beta_fast": beta_fast,
"beta_slow": beta_slow
}
super().__init__(head_size, rotary_dim, max_position_embeddings, base,
Collaborator


remove init

@shiyuan680 shiyuan680 force-pushed the mrope branch 2 times, most recently from 0ab48f6 to e60f393 Compare January 9, 2026 01:40
Signed-off-by: shiyuan680 <917935075@qq.com>
@wangxiyuan wangxiyuan merged commit 7af3b88 into vllm-project:main Jan 13, 2026
16 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 13, 2026
…to eplb_refactor

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [CI] Unblock 4-cards test (vllm-project#5831)
  [Refactor] Provide a framework to accommodate operators for different hardware devices (vllm-project#5735)
  [Refactor] Modify the binding logic to allocate CPU cores for each NPU card (vllm-project#5555)
  [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#5519)
  support triton of mrope (vllm-project#5664)
  [bugfix] A2 Environment Pooling for Memcache Compatibility (vllm-project#5601)
  [Doc] Update community contributors and versioning naming to follow vLLM (vllm-project#5820)
  [Refactor] Add comments for Metadata classes in attention module (vllm-project#5789)
  [Bugfix] bugfix for the order of dummy run pad and sync (vllm-project#5777)
  [CI] Move nightly-a2 test to hk (vllm-project#5807)
  [CI] Show disk usage for CI shared volume (vllm-project#5821)
  Bump actions/checkout from 4 to 6 (vllm-project#5795)
  Bump actions/github-script from 7 to 8 (vllm-project#5796)
  [bugfix](cp) align max_context_chunk to cp_virtual_block_size (vllm-project#5767)
  [bugfix]limit graph replay sync (vllm-project#5761)
  [CI]Add Kimi k2 nightly test (vllm-project#5682)
  [Doc] add tls check to pd disaggregation readme  (vllm-project#5638)
  [CI] adpat v0.13.0 change (vllm-project#5793)
guanguan0308 pushed a commit to guanguan0308/vllm-ascend that referenced this pull request Jan 13, 2026
### What this PR does / why we need it?
this pr support use triton mrope like cuda_forward, which performance is
equal to ascendc ops
this triton ops should use cann 8.5.0
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test in qwen3-vl-235b acc textvqa
native 81.82
npu triton 81.58
cuda triton 81.52
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: shiyuan680 <917935075@qq.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
NickJudyHvv added a commit to NickJudyHvv/vllm-ascend that referenced this pull request Mar 2, 2026
Adapted from vllm-ascend PR vllm-project#5664. Adds forward_triton path to
AscendMRotaryEmbedding that uses vllm's triton_mrope kernel,
with corresponding test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NickJudyHvv added a commit to NickJudyHvv/vllm-ascend that referenced this pull request Mar 2, 2026
Adapted from vllm-ascend PR vllm-project#5664 and PR vllm-project#6042. Adds Triton-based mRoPE
support to AscendMRotaryEmbedding, with fix to only use Triton path when
mrope_interleaved is True.
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

module:ops, module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants