Skip to content

[CI] Main2main upgrade vllm to 0330#7962

Merged
wangxiyuan merged 11 commits intovllm-project:mainfrom
22dimensions:main2main_0330
Apr 8, 2026
Merged

[CI] Main2main upgrade vllm to 0330#7962
wangxiyuan merged 11 commits intovllm-project:mainfrom
22dimensions:main2main_0330

Conversation

@22dimensions
Copy link
Copy Markdown
Collaborator

@22dimensions 22dimensions commented Apr 2, 2026

What this PR does / why we need it?

Main2main upgrade vllm to 0330
fix breaks:

  1. Fix Mamba state corruption from referencing stale block table entries (#37728) (#37728) vllm#37728 add clear_row method for BlockTable
  2. [Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 vllm#37975 Adapt GatedDeltaNetAttention Refactor
  3. [ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 vllm#37698 update maybe_update_config in vllm_ascend/quantization/modelslim_config.py to adapt this pr change
  4. [Feature] Support per-draft-model MoE backend via --speculative-config vllm#37880 This pr add the feat where we can set different moe backends between draft and target model, we should overwrite it in the draft proposer
  5. [kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models vllm#37853 for now just to skip test_cpu_offloading.py test case utils this feature has been adapted.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the Ascend backend for vLLM by integrating a new, optimized Gated Delta Net attention mechanism and improving KV cache management. It also addresses compatibility issues with vLLM's buffer types in the model runner and introduces a critical fix for speculative decoding with async scheduling. These changes aim to enhance performance and stability for models running on Ascend hardware.

Highlights

  • GDN Attention Integration: Introduced a new Ascend-optimized Gated Delta Net (GDN) attention implementation, AscendGatedDeltaNetAttention, in vllm_ascend/ops/gdn.py. This includes custom operations for convolution and recurrent attention, specifically designed for models like Qwen3Next and Qwen3.5.
  • Refactored Batch Invariant Check: The vllm_is_batch_invariant function was moved from vllm.model_executor.layers.batch_invariant to vllm_ascend.batch_invariant and updated to be compatible with newer vLLM versions, ensuring correct behavior across different environments.
  • Enhanced KV Cache Handling: Modified attention mechanisms in attention_v1.py to directly pass kv_cache and include logic for initializing key_cache and value_cache for DecodeOnly mode, improving flexibility in KV cache management.
  • Model Runner Buffer Compatibility: Implemented utility methods and adjusted buffer handling within NPUModelRunner to ensure compatibility with both CpuGpuBuffer and plain torch.Tensor types for various internal buffers like positions, input IDs, and sequence lengths.
  • Speculative Decoding and Async Scheduling: Added a safeguard to disable async scheduling when speculative decoding is active on Ascend, addressing a current limitation where GPU-side token correction is not yet supported, preventing accuracy divergence.
  • KV Offload Manager Update: Updated the KV offload mechanism in vllm_ascend/kv_offload/npu.py to use CPUOffloadingManager, aligning with recent changes in vLLM's KV offload architecture.
  • Qwen3.5 Patch Consolidation: Removed dedicated Qwen3.5 Gated Delta Net and Decoder Layer patches, consolidating their functionality under the new AscendGatedDeltaNetAttention for a more streamlined implementation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files
  • Ignored by pattern: .github/workflows/** (6)
    • .github/workflows/_e2e_test.yaml
    • .github/workflows/bot_pr_create.yaml
    • .github/workflows/dockerfiles/Dockerfile.lint
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/schedule_codecov_refresh.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Suggested PR Title:

[Ops][Misc] Support GatedDeltaNet and improve vLLM version compatibility for Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements `AscendGatedDeltaNetAttention` in a dedicated ops file and refactors Qwen3.5 and Qwen3Next patches to use this centralized implementation. It significantly improves compatibility across vLLM versions by introducing abstraction methods in `NPUModelRunner` to handle both `CpuGpuBuffer` and plain Tensors. Key improvements include fixing `kv_cache` initialization for `DecodeOnly` mode, adding `clear_row` functionality to `BlockTable`, and disabling async scheduling during speculative decoding on Ascend to prevent accuracy divergence. A redundant global import in the worker patch initialization was identified as a potential conflict for 310P devices and should be restricted to the appropriate conditional block.

### Does this PR introduce _any_ user-facing change?
Async scheduling is now automatically disabled when speculative decoding is active on Ascend platforms to ensure model accuracy.

### How was this patch tested?
Tested with existing model suites and speculative decoding configurations on Ascend hardware.

Comment thread vllm_ascend/patch/worker/__init__.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Potabk
Copy link
Copy Markdown
Collaborator

Potabk commented Apr 7, 2026

vllm-project/vllm#37880 This pr add the feat where we can set different moe backends between draft and target model, we should overwrite it in the draft proposer

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 7, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

22dimensions and others added 9 commits April 7, 2026 19:08
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
This function is not available in all vLLM versions, particularly
in versions where spec_decode was still evolving. By adding a
fallback that returns the config as-is, we maintain compatibility
with older vLLM releases while still supporting the newer
enhanced version.

Fixes 94 test failures in main2main CI run #23837548280.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
vLLM v0.18.0 calls clear_row() on block tables to reset request rows.
Both BlockTable and MultiGroupBlockTable were missing this method,
causing AttributeError when remove_request() was called.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
…t and Qwen3.5

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
This fixes a method signature mismatch that was causing TypeErrors during
engine initialization. The vLLM base class QuantizationConfig.maybe_update_config()
expects three parameters (model_name, hf_config, revision), but the vllm-ascend
AscendModelSlimConfig override was missing the hf_config parameter.

This was causing 92% of CI test failures (47 out of 51 failed test cases).

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
@22dimensions 22dimensions changed the title Main2main 0330 [CI] Main2main upgrade vllm to 0330 Apr 7, 2026
@22dimensions 22dimensions changed the title [CI] Main2main upgrade vllm to 0330 [CI] Main2main upgrade vllm to 0330 Apr 7, 2026
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Comment thread vllm_ascend/ops/gdn.py
Comment thread vllm_ascend/ops/gdn.py
@wangxiyuan wangxiyuan merged commit 4140e84 into vllm-project:main Apr 8, 2026
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants