[Perf] [NPU]Fix rotary embedding dimension mismatch & add auto-adaptation for full-dim cos/sin in mindiesd RoP #2369

Open
Li-8916 wants to merge 8 commits into vllm-project:main from Li-8916:optimize-rope-on-npu

Li-8916 commented Mar 31, 2026

Purpose

This PR optimizes the MindIE-SD Rotary Position Embedding (RoPE) operator on Ascend NPUs to resolve dimension mismatch and redundant expansion issues in models such as Wan 2.2, where cos/sin tensors are pre-expanded to full dimension:
  • Fix dimension errors: Add automatic dimension detection. When cos.shape[-1] == x.shape[-1], automatically disable half_head_dim to avoid repeated expansion of already full-dimension cos/sin.
  • Support dual input formats: Compatible with both half-dimension cos/sin (D/2) and full-dimension cos/sin (D), adapting to the RoPE preprocessing logic of different models.
  • Improve robustness: Eliminate shape mismatch errors caused by hardcoded parameters and make the mindiesd RoPE operator adaptive to input layouts.
  • Maintain performance: No changes to the existing NPU fused acceleration logic; the lightweight dimension check introduces no performance overhead.
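
The detection rule described above can be sketched on shapes alone. This is a minimal illustration, not the code in this PR: the helper name `resolve_half_head_dim` is hypothetical, and only the `half_head_dim` flag and the `cos.shape[-1] == x.shape[-1]` comparison come from the description.

```python
from typing import Sequence


def resolve_half_head_dim(x_shape: Sequence[int], cos_shape: Sequence[int]) -> bool:
    """Decide whether cos/sin still need to be expanded from D/2 to D.

    Hypothetical helper: returns False when the tables already span the full
    head dimension, True when they span exactly half of it.
    """
    head_dim = x_shape[-1]
    rot_dim = cos_shape[-1]
    if rot_dim == head_dim:
        # Already full-dimension: expanding again would double the last axis.
        return False
    if rot_dim * 2 == head_dim:
        # Half-dimension tables: keep the original expansion path.
        return True
    raise ValueError(
        f"cos/sin last dim {rot_dim} matches neither D={head_dim} nor D/2"
    )


# Full-dimension tables (as described for Wan 2.2) disable the expansion.
half_head_dim = resolve_half_head_dim((1, 1024, 12, 128), (1024, 128))
```

Raising on any other size is a design choice of this sketch; the description does not say how the operator treats tables that are neither D nor D/2 wide.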

Test Plan

Test Environment
  • Hardware: Ascend NPU 910B / 910B4
  • Framework: vLLM-Omni + PyTorch for Ascend + MindIE-SD
  • Model: Wan 2.2 (main model using this RoPE operator)

Test Cases
  • Basic functionality test
      • Run with half-dimension cos/sin (D/2); verify correct output shape and values.
      • Run with full-dimension cos/sin (D); verify that the new logic is triggered and half_head_dim is set to False.
  • Dimension matching test
      • Verify that cos/sin are not expanded or repeated again when cos.shape[-1] == x.shape[-1].
      • Verify that the original expansion logic remains unchanged when the dimensions differ.
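
The equivalence these test cases check can be illustrated with a small pure-Python reference, with no NPU or MindIE-SD involved. It is a sketch under stated assumptions: the function names are hypothetical, and the interleaved pair layout, where elements (0, 1), (2, 3), ... rotate together and a D/2 table is expanded by repeating each entry twice, is assumed here rather than taken from the mindiesd operator.

```python
import math


def expand_interleaved(table):
    """Expand a D/2 table to D by repeating each entry: [a, b] -> [a, a, b, b]."""
    return [value for value in table for _ in range(2)]


def rope_reference(x, cos, sin):
    """Rotate one head vector x (length D) using cos/sin of length D/2 or D."""
    d = len(x)
    if len(cos) != len(sin):
        raise ValueError("cos and sin must have the same length")
    if len(cos) * 2 == d:
        # Half-dimension input: expand once.
        cos, sin = expand_interleaved(cos), expand_interleaved(sin)
    elif len(cos) != d:
        raise ValueError("cos/sin length must be D or D/2")
    out = [0.0] * d
    for i in range(0, d, 2):
        x1, x2 = x[i], x[i + 1]
        out[i] = x1 * cos[i] - x2 * sin[i]
        out[i + 1] = x1 * sin[i + 1] + x2 * cos[i + 1]
    return out


angles = [0.1, 0.7]
half_cos = [math.cos(a) for a in angles]
half_sin = [math.sin(a) for a in angles]
x = [1.0, 2.0, 3.0, 4.0]

out_half = rope_reference(x, half_cos, half_sin)
out_full = rope_reference(
    x, expand_interleaved(half_cos), expand_interleaved(half_sin)
)
# Both input formats must produce the same rotation.
assert out_half == out_full
```

A test against the real operator would compare its output with a reference like this one within a tolerance suited to bf16/fp16.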

Test Result

All test cases passed on Ascend NPU:

| Model | Image resolution | Seconds | FPS | Inference steps | Number of NPUs | E2E time | E2E time after RoPE operator replacement |
|---|---|---|---|---|---|---|---|
| wan2.2-diffusers | 832*480 | 2s | 16 | 40 | 4 (cfg1 ulysses4) | 86s | 72s |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. If your code doesn't require additional test scripts, please state why. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Li-8916 and others added 4 commits March 31, 2026 03:44
…rectly

Signed-off-by: Li-8916 <lishilin314@163.com>
Signed-off-by: Li-8916 <lishilin314@163.com>
Signed-off-by: Li-8916 <lishilin314@163.com>
Co-authored-by: vasede <1399968934@qq.com>
Li-8916 requested a review from hsliuustc0106 as a code owner March 31, 2026 09:52

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a25f1311b5

Comment on lines +427 to +428
query = self.rope(query,freqs_cos, freqs_sin)
key = self.rope(key,freqs_cos, freqs_sin)

P1: Cast rotary tensors before invoking NPU RoPE kernel

In the new NPU branch, self.rope is called with freqs_cos/freqs_sin directly, but the WAN rotary tables are produced as float32 while query/key are typically bf16/fp16, so this creates a mixed-dtype call path that exists only on mindiesd + NPU. The previous apply_rotary_emb_wan path wrote its result into an empty_like buffer, and apply_rope_to_qk in layers/rope.py explicitly casts cos/sin; the new call does neither. This can trigger dtype mismatch errors or silently force higher-precision execution in fused kernels, hurting correctness and performance for NPU inference.
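
One way to act on this suggestion is to cast the tables to the activation dtype before the fused call. The sketch below is illustrative only: `cast_rotary_tables` is a hypothetical helper, it relies only on a `.dtype` attribute and a `.to(dtype)` method (both of which `torch.Tensor` provides), and whether the mindiesd kernel requires matching dtypes is the reviewer's concern, not something confirmed here.

```python
def cast_rotary_tables(x, cos, sin):
    """Return cos/sin in x's dtype, leaving already-matching tables untouched.

    Duck-typed on `.dtype` and `.to(dtype)`, so it works with torch tensors
    without importing torch in this sketch.
    """
    def _match(table):
        # Skip the conversion when the dtype already matches.
        return table if table.dtype == x.dtype else table.to(x.dtype)

    return _match(cos), _match(sin)
```

In the branch quoted above this would be called as `freqs_cos, freqs_sin = cast_rotary_tables(query, freqs_cos, freqs_sin)` before `self.rope(...)`. Casting float32 tables down to bf16/fp16 lowers their precision, so the numerical effect would need checking against the previous path.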

Li-8916 added 2 commits March 31, 2026 19:18
Signed-off-by: Li-8916 <lishilin314@163.com>
Moved the import statement for find_spec to a new location.

Signed-off-by: Li-8916 <lishilin314@163.com>
freqs_cos, freqs_sin = rotary_emb
query = apply_rotary_emb_wan(query, freqs_cos, freqs_sin)
key = apply_rotary_emb_wan(key, freqs_cos, freqs_sin)
if find_spec("mindiesd") is not None and current_omni_platform.is_npu():
Collaborator

Are there any better ways to add support for mindiesd? cc @gcanlin

Collaborator

I prefer not to introduce it here. I will investigate how mindie-sd implements it.

gcanlin (Collaborator) commented Apr 1, 2026

PR #2393 probably has the same effect and provides a general optimization for all platforms.
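
For reference, the gating condition in the diff under discussion combines a platform check with a check that the optional package is importable. A minimal sketch of evaluating that condition once, for example at layer construction instead of on every forward pass, is shown here. The helper `use_fused_rope` and its parameters are hypothetical; only `find_spec("mindiesd")` and the NPU check come from the diff, and this is not the approach taken in #2393.

```python
from importlib.util import find_spec


def use_fused_rope(is_npu: bool, module_name: str = "mindiesd") -> bool:
    """Return True only on NPU and only when the optional package is importable.

    find_spec returns None for a missing top-level module, so nothing is
    imported until the fused path is actually selected.
    """
    return is_npu and find_spec(module_name) is not None


# Off NPU the answer is False regardless of what is installed.
fused_on_cpu = use_fused_rope(is_npu=False)
```

Storing the result on the module would keep the attention forward free of import-system lookups.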

Li-8916 added 2 commits March 31, 2026 19:28
Signed-off-by: Li-8916 <lishilin314@163.com>
Signed-off-by: Li-8916 <lishilin314@163.com>
Li-8916 changed the title from "Fix rotary embedding dimension mismatch & add auto-adaptation for full-dim cos/sin in mindiesd RoP" to "[Perf] [NPU]Fix rotary embedding dimension mismatch & add auto-adaptation for full-dim cos/sin in mindiesd RoP" Mar 31, 2026
3 participants