# [Model] GLM5 adaptation #6642
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
**Summary of Changes**: Hello @yydyzr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue, likely related to a specific configuration or version (implied by "dsv32"), within the Sparse Flash Attention (SFA) implementation.
Code Review
Suggested PR Title:
`[Attention][BugFix] Fix RoPE and lightning_indexer for DeepSeek-V2 support`

Suggested PR Summary:
### What this PR does / why we need it?
This pull request addresses issues with running DeepSeek-V2 models on Ascend NPUs by making two key changes:
1. **Corrects RoPE Implementation**: The Rotary Position Embedding (RoPE) style is changed from `neox_style` to non-`neox_style` within the Sparse Flash Attention (SFA) implementation. This is necessary for the correct functioning of DeepSeek-V2.
2. **Switches `lightning_indexer` Operator**: The custom C++ operator `npu_lightning_indexer` is replaced with the native `torch_npu.npu_lightning_indexer`. The build for the custom operator has been disabled accordingly.
These changes ensure that the attention mechanism behaves correctly for DeepSeek-V2.
### Does this PR introduce _any_ user-facing change?
No, this PR contains backend fixes and does not introduce any user-facing changes.
### How was this patch tested?
CI passed with existing tests.

```diff
  cos = cos.view(-1, self.qk_rope_head_dim)
  sin = sin.view(-1, self.qk_rope_head_dim)
- q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
+ q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)
```
Hardcoding `is_neox_style` to `False` can lead to incorrect behavior for models that require Neox-style RoPE. This parameter should be sourced from the model's configuration to ensure correctness across different models. The rotary embedding object, available as `self.rotary_emb`, likely carries this configuration, so using `self.rotary_emb.is_neox_style` would make the implementation robust and model-agnostic.
```diff
- q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=False)
+ q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=self.rotary_emb.is_neox_style)
```
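For context, here is a minimal PyTorch sketch of what the flag selects, mirroring vLLM's reference rotary implementation rather than the Triton kernel touched here: neox style rotates a half-split of the head dim, while non-neox (GPT-J) style rotates interleaved even/odd pairs.

```python
import torch

def rotate_neox(x: torch.Tensor) -> torch.Tensor:
    # Neox style: pair feature i with feature i + rope_dim // 2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_gptj(x: torch.Tensor) -> torch.Tensor:
    # Non-neox (GPT-J) style: pair adjacent even/odd features 2i and 2i + 1.
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
               is_neox_style: bool) -> torch.Tensor:
    # x: [num_tokens, rope_dim]; cos/sin: [num_tokens, rope_dim // 2].
    if is_neox_style:
        cos, sin = cos.repeat(1, 2), sin.repeat(1, 2)
        return x * cos + rotate_neox(x) * sin
    cos = cos.repeat_interleave(2, dim=-1)
    sin = sin.repeat_interleave(2, dim=-1)
    return x * cos + rotate_gptj(x) * sin
```

The two styles produce different outputs for the same weights, which is why a hardcoded flag silently corrupts models trained with the other convention.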
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from e8765cc to cf511d4.
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
```python
self.use_torch_npu_lightning_indexer = False
if self.vllm_config.model_config.hf_config.model_type in ["glm_moe_dsa"]:
    self.is_rope_neox_style = False
    self.use_torch_npu_lightning_indexer = True
```
It would be better to obtain `is_neox_style` from `model_config` instead of hardcoding it per `model_type`; refactor this later.
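A possible shape for that refactor, as a hedged sketch: derive both flags from the HF config instead of matching on `model_type`. The attribute names `rope_neox_style` and `use_lightning_indexer` are illustrative assumptions, not the actual GLM5 config schema.

```python
# Hypothetical refactor per the comment above; attribute names are assumed.
hf_config = self.vllm_config.model_config.hf_config
self.is_rope_neox_style = getattr(hf_config, "rope_neox_style", True)
self.use_torch_npu_lightning_indexer = getattr(
    hf_config, "use_lightning_indexer",
    hf_config.model_type in ["glm_moe_dsa"],
)
```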
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Merge branch 'main' of https://github.com/vllm-project/vllm-ascend into qwen3next_rebase:
- [Docs] Fix GLM-5 deploy command (vllm-project#6711)
- [npugraph_ex]enable npugraph_ex by default (vllm-project#6664)
- [doc]add GLM5.md (vllm-project#6709)
- [Model] GLM5 adaptation (vllm-project#6642)
- [Bugfix] Update target probs to target logits in rejection sample (vllm-project#6685)
- [Main][Ops] Make triton rope support index_selecting from cos_sin_cache (vllm-project#5450)
- [CI]fix nightly multi node test error for wait for pod ready (vllm-project#6675)
- [main to main] upgrade main 0210 (vllm-project#6673)
- [main][Quant] Remove unused rotation functions and parameters from W4A4 LAOS quantization (vllm-project#6648)
- [Test][BugFix] Fix torch.rand usage in triton penalty test (vllm-project#6680)
- Add Worker Interface:check_health (vllm-project#6681)
### What this PR does / why we need it?
GLM5 adaptation:
1. Use `torch_npu.npu_lightning_indexer` for GLM5.
2. Forbid the eagle proposer when fullgraph mode is enabled, because of bugs (see the sketch below).
3. Add quantization config for GLM5.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
By CI.

- vLLM main: vllm-project/vllm@978a37c

---------
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
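For item 2, a minimal sketch of what such a guard could look like, assuming vLLM-style config objects; the attribute names (`speculative_config.method`, `compilation_config.mode`) and the `"fullgraph"` literal are illustrative assumptions, not the actual vllm-ascend code:

```python
# Hypothetical guard: reject the eagle proposer when fullgraph mode is on.
# Attribute names and the "fullgraph" literal are assumptions for illustration.
def check_speculative_config(vllm_config) -> None:
    spec = vllm_config.speculative_config
    if spec is None:
        return
    if spec.method == "eagle" and vllm_config.compilation_config.mode == "fullgraph":
        raise NotImplementedError(
            "Eagle proposer is disabled with fullgraph mode on vllm-ascend "
            "due to known bugs; disable fullgraph or pick another "
            "speculative decoding method.")
```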