# [Model] GLM5 adaptation #6642
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
**Summary of Changes**: Hello @yydyzr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue, likely related to a specific configuration or version (implied by "dsv32"), within the Sparse Flash Attention (SFA) implementation.
Code Review
Suggested PR Title:
`[Attention][BugFix] Fix RoPE and lightning_indexer for DeepSeek-V2 support`

Suggested PR Summary:
### What this PR does / why we need it?
This pull request addresses issues with running DeepSeek-V2 models on Ascend NPUs by making two key changes:
1. **Corrects RoPE Implementation**: The Rotary Position Embedding (RoPE) style is changed from `neox_style` to non-`neox_style` within the Sparse Flash Attention (SFA) implementation. This is necessary for the correct functioning of DeepSeek-V2.
2. **Switches `lightning_indexer` Operator**: The custom C++ operator `npu_lightning_indexer` is replaced with the native `torch_npu.npu_lightning_indexer`. The build for the custom operator has been disabled accordingly.
These changes ensure that the attention mechanism behaves correctly for DeepSeek-V2.
### Does this PR introduce _any_ user-facing change?
No, this PR contains backend fixes and does not introduce any user-facing changes.
### How was this patch tested?
CI passed with existing tests.

```diff
  cos = cos.view(-1, self.qk_rope_head_dim)
  sin = sin.view(-1, self.qk_rope_head_dim)
- q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
+ q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)
```
Hardcoding `is_neox_style` to `False` can lead to incorrect behavior for models that require Neox-style RoPE. This parameter should be sourced from the model's configuration to ensure correctness across different models. The rotary embedding object, available as `self.rotary_emb`, likely carries this configuration, so using `self.rotary_emb.is_neox_style` would make the implementation robust and model-agnostic.
```diff
- q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)
+ q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=self.rotary_emb.is_neox_style)
```
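For context, here is a minimal PyTorch sketch of what the flag selects, mirroring vLLM's reference rotary implementation rather than the Triton kernel touched here: neox style rotates a half-split of the head dim, while non-neox (GPT-J) style rotates interleaved even/odd pairs.

```python
import torch

def rotate_neox(x: torch.Tensor) -> torch.Tensor:
    # Neox style: pair feature i with feature i + rope_dim // 2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_gptj(x: torch.Tensor) -> torch.Tensor:
    # Non-neox (GPT-J) style: pair adjacent even/odd features 2i and 2i + 1.
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
               is_neox_style: bool) -> torch.Tensor:
    # x: [num_tokens, rope_dim]; cos/sin: [num_tokens, rope_dim // 2].
    if is_neox_style:
        cos, sin = cos.repeat(1, 2), sin.repeat(1, 2)
        return x * cos + rotate_neox(x) * sin
    cos = cos.repeat_interleave(2, dim=-1)
    sin = sin.repeat_interleave(2, dim=-1)
    return x * cos + rotate_gptj(x) * sin
```

The two styles produce different outputs for the same weights, which is why a hardcoded flag silently corrupts models trained with the other convention.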
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from e8765cc to cf511d4.
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
```python
self.use_torch_npu_lightning_indexer = False
if self.vllm_config.model_config.hf_config.model_type in ["glm_moe_dsa"]:
    self.is_rope_neox_style = False
    self.use_torch_npu_lightning_indexer = True
```
It would be better to obtain `is_neox_style` from `model_config` instead of hardcoding it per `model_type`; refactor this later.
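A possible shape for that refactor, as a hedged sketch: derive both flags from the HF config instead of matching on `model_type`. The attribute names `rope_neox_style` and `use_lightning_indexer` are illustrative assumptions, not the actual GLM5 config schema.

```python
# Hypothetical refactor per the comment above; attribute names are assumed.
hf_config = self.vllm_config.model_config.hf_config
self.is_rope_neox_style = getattr(hf_config, "rope_neox_style", True)
self.use_torch_npu_lightning_indexer = getattr(
    hf_config, "use_lightning_indexer",
    hf_config.model_type in ["glm_moe_dsa"],
)
```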
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Merge branch 'main' of https://github.com/vllm-project/vllm-ascend into qwen3next_rebase:
- [Docs] Fix GLM-5 deploy command (vllm-project#6711)
- [npugraph_ex]enable npugraph_ex by default (vllm-project#6664)
- [doc]add GLM5.md (vllm-project#6709)
- [Model] GLM5 adaptation (vllm-project#6642)
- [Bugfix] Update target probs to target logits in rejection sample (vllm-project#6685)
- [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (vllm-project#5450)
- [CI]fix nightly multi node test error for wait for pod ready (vllm-project#6675)
- [main to main] upgrade main 0210 (vllm-project#6673)
- [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (vllm-project#6648)
- [Test][BugFix] Fix torch.rand usage in triton penalty test (vllm-project#6680)
- Add Worker Interface:check_health (vllm-project#6681)
### What this PR does / why we need it?
GLM5 adaptation:
1. Use `torch_npu.npu_lightning_indexer` for GLM5.
2. Forbid the eagle proposer when fullgraph mode is enabled, because of bugs (see the sketch below).
3. Add quantization config for GLM5.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
By CI.

- vLLM main: vllm-project/vllm@978a37c

---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
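For item 2, a minimal sketch of what such a guard could look like, assuming vLLM-style config objects; the attribute names (`speculative_config.method`, `compilation_config.mode`) and the `"fullgraph"` literal are illustrative assumptions, not the actual vllm-ascend code:

```python
# Hypothetical guard: reject the eagle proposer when fullgraph mode is on.
# Attribute names and the "fullgraph" literal are assumptions for illustration.
def check_speculative_config(vllm_config) -> None:
    spec = vllm_config.speculative_config
    if spec is None:
        return
    if spec.method == "eagle" and vllm_config.compilation_config.mode == "fullgraph":
        raise NotImplementedError(
            "Eagle proposer is disabled with fullgraph mode on vllm-ascend "
            "due to known bugs; disable fullgraph or pick another "
            "speculative decoding method.")
```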