
[Model] GLM5 adaptation #6642

Merged
wangxiyuan merged 2 commits into vllm-project:main from yydyzr:br_latest
Feb 11, 2026

Conversation

@yydyzr
Contributor

@yydyzr yydyzr commented Feb 9, 2026

What this PR does / why we need it?

GLM5 adaptation:

  1. Use torch_npu.npu_lightning_indexer for GLM5 (a dispatch sketch follows this section).
  2. Forbid the eagle proposer when fullgraph mode is enabled, because of known bugs.
  3. Add a quantization config for GLM5.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

By CI.
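
For context, the GLM5 change boils down to routing the indexer call through the packaged torch_npu operator instead of the custom AscendC build. Below is a minimal dispatch sketch, not the actual sfa_v1.py code; the operator's argument list is not shown on this page, so it is passed through untouched here.

```python
import torch
import torch_npu  # Ascend adapter that ships npu_lightning_indexer

def lightning_indexer(use_torch_npu: bool, *args, **kwargs):
    """Sketch only: choose the indexer backend at call time."""
    if use_torch_npu:
        # GLM5 path added by this PR: the packaged torch_npu operator.
        return torch_npu.npu_lightning_indexer(*args, **kwargs)
    # Previous path: the custom C++ op built from csrc/ (its build is
    # now disabled for ascend910_93 in build_aclnn.sh).
    return torch.ops._C_ascend.npu_lightning_indexer(*args, **kwargs)
```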

@github-actions
Contributor

github-actions Bot commented Feb 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yydyzr, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue, likely related to a specific configuration or version (implied by "dsv32"), within the vllm_ascend attention mechanism. The changes involve refining the application of Rotary Positional Embeddings (RoPE) by adjusting its style and refactoring the lightning_indexer functionality. Specifically, the custom build for lightning_indexer is disabled, and its usage is transitioned to a torch_npu module, streamlining the integration of this operation.

Highlights

  • Build Configuration Update: The lightning_indexer custom operation has been commented out from the build process for ascend910_93 SOC versions in csrc/build_aclnn.sh.
  • Rotary Positional Embedding (RoPE) Adjustment: The rope_forward_triton function call in vllm_ascend/attention/sfa_v1.py has been updated to use is_neox_style=False instead of True, changing the RoPE application style.
  • Lightning Indexer Call Migration: The npu_lightning_indexer call in vllm_ascend/attention/sfa_v1.py has been migrated from torch.ops._C_ascend.npu_lightning_indexer to torch_npu.npu_lightning_indexer, indicating a shift from a custom C++ operation to a torch_npu module implementation.


Changelog
  • csrc/build_aclnn.sh
    • Disabled the build of the lightning_indexer custom operation for ascend910_93 SOC versions by commenting out its entry.
  • vllm_ascend/attention/sfa_v1.py
    • Modified the rope_forward_triton function call to set is_neox_style from True to False.
    • Replaced the call to torch.ops._C_ascend.npu_lightning_indexer with torch_npu.npu_lightning_indexer.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[Attention][BugFix] Fix RoPE and lightning_indexer for DeepSeek-V2 support

Suggested PR Summary:

### What this PR does / why we need it?

This pull request addresses issues with running DeepSeek-V2 models on Ascend NPUs by making two key changes:

1.  **Corrects RoPE Implementation**: The Rotary Position Embedding (RoPE) style is changed from `neox_style` to non-`neox_style` within the Sparse Flash Attention (SFA) implementation. This is necessary for the correct functioning of DeepSeek-V2.
2.  **Switches `lightning_indexer` Operator**: The custom C++ operator `npu_lightning_indexer` is replaced with the native `torch_npu.npu_lightning_indexer`. The build for the custom operator has been disabled accordingly.

These changes ensure that the attention mechanism behaves correctly for DeepSeek-V2.

### Does this PR introduce _any_ user-facing change?

No, this PR contains backend fixes and does not introduce any user-facing changes.

### How was this patch tested?

CI passed with existing tests.

Comment thread: vllm_ascend/attention/sfa_v1.py (Outdated)

      cos = cos.view(-1, self.qk_rope_head_dim)
      sin = sin.view(-1, self.qk_rope_head_dim)
-     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
+     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)

critical

Hardcoding is_neox_style to False can lead to incorrect behavior for models that require the Neox-style RoPE. This parameter should be sourced from the model's configuration to ensure correctness across different models. The rotary_emb object, which is available as self.rotary_emb, likely contains this configuration. Using self.rotary_emb.is_neox_style would make the implementation robust and model-agnostic.

Suggested change:

-     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)
+     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=self.rotary_emb.is_neox_style)
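
For reviewers unfamiliar with the flag: the two RoPE styles apply the same rotation but pair dimensions differently, so the wrong flag silently mixes unrelated pairs. A minimal eager-mode sketch of the distinction, assuming cos/sin broadcast to the paired halves; this is not the repo's rope_forward_triton kernel:

```python
import torch

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
               is_neox_style: bool) -> torch.Tensor:
    # x: [..., rope_dim]; cos/sin broadcast to [..., rope_dim // 2].
    if is_neox_style:
        # Neox layout: the first half of the dim pairs with the second half.
        x1, x2 = torch.chunk(x, 2, dim=-1)
    else:
        # Interleaved (non-neox) layout: even dims pair with odd dims.
        x1, x2 = x[..., ::2], x[..., 1::2]
    o1 = x1 * cos - x2 * sin
    o2 = x2 * cos + x1 * sin
    if is_neox_style:
        return torch.cat((o1, o2), dim=-1)
    # Re-interleave the rotated pairs back into the original ordering.
    return torch.stack((o1, o2), dim=-1).flatten(-2)
```

This is why sourcing the flag from the model's rotary-embedding configuration, as suggested above, is safer than hardcoding either value.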

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@yydyzr yydyzr changed the title from "fix dsv32" to "[GLM5 adaptation]" on Feb 11, 2026
@yydyzr yydyzr changed the title from "[GLM5 adaptation]" to "[Model] GLM5 adaptation" on Feb 11, 2026
@yydyzr yydyzr force-pushed the br_latest branch 2 times, most recently from e8765cc to cf511d4, on February 11, 2026 at 11:16
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels on Feb 11, 2026
@linfeng-yuan linfeng-yuan added and removed the ready and ready-for-test labels on Feb 11, 2026
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
self.use_torch_npu_lightning_indexer = False
if self.vllm_config.model_config.hf_config.model_type in ["glm_moe_dsa"]:
    self.is_rope_neox_style = False
    self.use_torch_npu_lightning_indexer = True
Collaborator

@whx-sjtu whx-sjtu Feb 11, 2026


It's better to obtain is_neox_style from model_config. Refactor here later.
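
One possible shape for that later refactor, sketched under the assumption that the attention module can reach a rotary_emb object exposing is_neox_style (the attribute the earlier review comment points at); the glm_moe_dsa fallback mirrors the check this PR adds:

```python
# Hypothetical refactor: prefer the rotary-embedding config over a
# model-type allowlist. `self.rotary_emb` is an assumption here.
rotary_emb = getattr(self, "rotary_emb", None)
if rotary_emb is not None and hasattr(rotary_emb, "is_neox_style"):
    self.is_rope_neox_style = rotary_emb.is_neox_style
else:
    # Fall back to the model-type check introduced by this PR.
    model_type = self.vllm_config.model_config.hf_config.model_type
    self.is_rope_neox_style = model_type not in ["glm_moe_dsa"]
```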

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
@wangxiyuan wangxiyuan merged commit ff3a50d into vllm-project:main Feb 11, 2026
27 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 12, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Docs] Fix GLM-5 deploy command (vllm-project#6711)
  [npugraph_ex]enable npugraph_ex by default (vllm-project#6664)
  [doc]add GLM5.md (vllm-project#6709)
  [Model] GLM5 adaptation (vllm-project#6642)
  [Bugfix] Update target probs to target logits in rejection sample (vllm-project#6685)
  [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (vllm-project#5450)
  [CI]fix nightly multi node test error for wait for pod ready (vllm-project#6675)
  [main  to main] upgrade main 0210 (vllm-project#6673)
  [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (vllm-project#6648)
  [Test][BugFix] Fix torch.rand usage in triton penalty test (vllm-project#6680)
  Add Worker Interface:check_health (vllm-project#6681)
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
### What this PR does / why we need it?
GLM5 adaptation
1. use torch_npu.npu_lightning_indexer for GLM5
2. forbid eagle proposer when fullgraph mode is enabled because of bugs
3. add quantization config for GLM5
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
- vLLM main:
vllm-project/vllm@978a37c

---------

Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
@wangxiyuan wangxiyuan mentioned this pull request Feb 24, 2026
banxiaduhuo pushed a commit to banxiaduhuo/vllm-ascend that referenced this pull request Feb 26, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

ready (read for review) · ready-for-test (start test by label for PR)


6 participants