[CI] Main2main upgrade vllm to 0330 by 22dimensions · Pull Request #7962 · vllm-project/vllm-ascend

22dimensions · 2026-04-02T13:46:01Z

What this PR does / why we need it?

Main2main upgrade vllm to 0330
fix breaks:

Fix Mamba state corruption from referencing stale block table entries (#37728) (#37728) vllm#37728 add clear_row method for BlockTable
[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 vllm#37975 Adapt GatedDeltaNetAttention Refactor
[ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 vllm#37698 update maybe_update_config in vllm_ascend/quantization/modelslim_config.py to adapt this pr change
[Feature] Support per-draft-model MoE backend via --speculative-config vllm#37880 This pr add the feat where we can set different moe backends between draft and target model, we should overwrite it in the draft proposer
[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models vllm#37853 for now just to skip test_cpu_offloading.py test case utils this feature has been adapted.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI

vLLM version: v0.18.0
vLLM main: vllm-project/vllm@29e4870

gemini-code-assist · 2026-04-02T13:47:28Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the Ascend backend for vLLM by integrating a new, optimized Gated Delta Net attention mechanism and improving KV cache management. It also addresses compatibility issues with vLLM's buffer types in the model runner and introduces a critical fix for speculative decoding with async scheduling. These changes aim to enhance performance and stability for models running on Ascend hardware.

Highlights

GDN Attention Integration: Introduced a new Ascend-optimized Gated Delta Net (GDN) attention implementation, AscendGatedDeltaNetAttention, in vllm_ascend/ops/gdn.py. This includes custom operations for convolution and recurrent attention, specifically designed for models like Qwen3Next and Qwen3.5.
Refactored Batch Invariant Check: The vllm_is_batch_invariant function was moved from vllm.model_executor.layers.batch_invariant to vllm_ascend.batch_invariant and updated to be compatible with newer vLLM versions, ensuring correct behavior across different environments.
Enhanced KV Cache Handling: Modified attention mechanisms in attention_v1.py to directly pass kv_cache and include logic for initializing key_cache and value_cache for DecodeOnly mode, improving flexibility in KV cache management.
Model Runner Buffer Compatibility: Implemented utility methods and adjusted buffer handling within NPUModelRunner to ensure compatibility with both CpuGpuBuffer and plain torch.Tensor types for various internal buffers like positions, input IDs, and sequence lengths.
Speculative Decoding and Async Scheduling: Added a safeguard to disable async scheduling when speculative decoding is active on Ascend, addressing a current limitation where GPU-side token correction is not yet supported, preventing accuracy divergence.
KV Offload Manager Update: Updated the KV offload mechanism in vllm_ascend/kv_offload/npu.py to use CPUOffloadingManager, aligning with recent changes in vLLM's KV offload architecture.
Qwen3.5 Patch Consolidation: Removed dedicated Qwen3.5 Gated Delta Net and Decoder Layer patches, consolidating their functionality under the new AscendGatedDeltaNetAttention for a more streamlined implementation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files

Ignored by pattern: .github/workflows/** (6)
- .github/workflows/_e2e_test.yaml
- .github/workflows/bot_pr_create.yaml
- .github/workflows/dockerfiles/Dockerfile.lint
- .github/workflows/pr_test_full.yaml
- .github/workflows/pr_test_light.yaml
- .github/workflows/schedule_codecov_refresh.yaml

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

Suggested PR Title:

[Ops][Misc] Support GatedDeltaNet and improve vLLM version compatibility for Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements `AscendGatedDeltaNetAttention` in a dedicated ops file and refactors Qwen3.5 and Qwen3Next patches to use this centralized implementation. It significantly improves compatibility across vLLM versions by introducing abstraction methods in `NPUModelRunner` to handle both `CpuGpuBuffer` and plain Tensors. Key improvements include fixing `kv_cache` initialization for `DecodeOnly` mode, adding `clear_row` functionality to `BlockTable`, and disabling async scheduling during speculative decoding on Ascend to prevent accuracy divergence. A redundant global import in the worker patch initialization was identified as a potential conflict for 310P devices and should be restricted to the appropriate conditional block.

### Does this PR introduce _any_ user-facing change?
Async scheduling is now automatically disabled when speculative decoding is active on Ascend platforms to ensure model accuracy.

### How was this patch tested?
Tested with existing model suites and speculative decoding configurations on Ascend hardware.

github-actions · 2026-04-02T13:53:48Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

github-actions · 2026-04-03T12:39:15Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Potabk · 2026-04-07T09:07:33Z

vllm-project/vllm#37880 This pr add the feat where we can set different moe backends between draft and target model, we should overwrite it in the draft proposer

github-actions · 2026-04-07T11:03:52Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

This function is not available in all vLLM versions, particularly in versions where spec_decode was still evolving. By adding a fallback that returns the config as-is, we maintain compatibility with older vLLM releases while still supporting the newer enhanced version. Fixes 94 test failures in main2main CI run #23837548280. Co-Authored-By: Claude Code <noreply@anthropic.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

vLLM v0.18.0 calls clear_row() on block tables to reset request rows. Both BlockTable and MultiGroupBlockTable were missing this method, causing AttributeError when remove_request() was called. Co-Authored-By: Claude Code <noreply@anthropic.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

…t and Qwen3.5 Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

This fixes a method signature mismatch that was causing TypeErrors during engine initialization. The vLLM base class QuantizationConfig.maybe_update_config() expects three parameters (model_name, hf_config, revision), but the vllm-ascend AscendModelSlimConfig override was missing the hf_config parameter. This was causing 92% of CI test failures (47 out of 51 failed test cases). Co-Authored-By: Claude Code <noreply@anthropic.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

22dimensions requested review from MengqingCao, Yikun, nalinaly, realliujiaxu, wangxiyuan, weijinqian0, whx-sjtu and zzzzwwjj as code owners April 2, 2026 13:46

gemini-code-assist Bot reviewed Apr 2, 2026

View reviewed changes

Comment thread vllm_ascend/patch/worker/__init__.py Outdated

22dimensions force-pushed the main2main_0330 branch from 9c65690 to 147d20e Compare April 2, 2026 13:53

22dimensions added the ready read for review label Apr 2, 2026

github-actions Bot added ci/build module:ops module:core labels Apr 2, 2026

22dimensions added ready-for-test start test by label for PR and removed ci/build module:ops module:core labels Apr 2, 2026

github-actions Bot added the merge-conflicts label Apr 3, 2026

22dimensions force-pushed the main2main_0330 branch from 71e6f02 to e3a4ee9 Compare April 4, 2026 16:26

github-actions Bot removed the merge-conflicts label Apr 4, 2026

22dimensions force-pushed the main2main_0330 branch from 9fd6872 to 82c183e Compare April 7, 2026 11:03

github-actions Bot added the merge-conflicts label Apr 7, 2026

22dimensions force-pushed the main2main_0330 branch from 82c183e to 8dbea59 Compare April 7, 2026 11:06

github-actions Bot removed the merge-conflicts label Apr 7, 2026

22dimensions and others added 9 commits April 7, 2026 19:08

upgrade to 0330

eaf4380

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Nex…

a142a7f

…t and Qwen3.5 Signed-off-by: wxsIcey <1790571317@qq.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

support set different moe backend

984aa3e

Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

skip test_cpu_offloading

19bdfa4

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

fix lint

bdcfabd

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

update dockerfile vllm commit

d0772ad

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

22dimensions force-pushed the main2main_0330 branch from 8dbea59 to d0772ad Compare April 7, 2026 11:09

22dimensions changed the title ~~Main2main 0330~~ [CI] Main2main upgrade vllm to 0330 Apr 7, 2026

22dimensions changed the title ~~[CI] Main2main upgrade vllm to 0330~~ [CI] Main2main upgrade vllm to 0330 Apr 7, 2026

Apply suggestion from @gemini-code-assist[bot]

10dad5b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 22dimensions <waitingwind@foxmail.com>

22dimensions force-pushed the main2main_0330 branch from 79db9fe to 10dad5b Compare April 7, 2026 14:16

fix lint

5b2adff

Signed-off-by: 22dimensions <waitingwind@foxmail.com>

MengqingCao reviewed Apr 8, 2026

View reviewed changes

Comment thread vllm_ascend/ops/gdn.py

MengqingCao reviewed Apr 8, 2026

View reviewed changes

Comment thread vllm_ascend/ops/gdn.py

wangxiyuan approved these changes Apr 8, 2026

View reviewed changes

wangxiyuan merged commit 4140e84 into vllm-project:main Apr 8, 2026
41 checks passed

wxsIcey mentioned this pull request Apr 13, 2026

[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 #7581

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Main2main upgrade vllm to 0330#7962

[CI] Main2main upgrade vllm to 0330#7962
wangxiyuan merged 11 commits intovllm-project:mainfrom
22dimensions:main2main_0330

22dimensions commented Apr 2, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot commented Apr 2, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented Apr 2, 2026

Uh oh!

github-actions Bot commented Apr 3, 2026

Uh oh!

Potabk commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

22dimensions commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gemini-code-assist Bot commented Apr 2, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

github-actions Bot commented Apr 2, 2026

Uh oh!

github-actions Bot commented Apr 3, 2026

Uh oh!

Potabk commented Apr 7, 2026

Uh oh!

github-actions Bot commented Apr 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

22dimensions commented Apr 2, 2026 •

edited

Loading