[CI] Main2main: upgrade vLLM to 0330 #7962
Conversation
Summary of Changes (Gemini Code Assist): This pull request refines the Ascend backend for vLLM by integrating a new, optimized Gated Delta Net attention mechanism and improving KV cache management. It also addresses compatibility issues with vLLM's buffer types in the model runner and introduces a fix for speculative decoding with async scheduling. These changes aim to enhance performance and stability for models running on Ascend hardware.
Code Review
Suggested PR Title: [Ops][Misc] Support GatedDeltaNet and improve vLLM version compatibility for Ascend

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements `AscendGatedDeltaNetAttention` in a dedicated ops file and refactors Qwen3.5 and Qwen3Next patches to use this centralized implementation. It significantly improves compatibility across vLLM versions by introducing abstraction methods in `NPUModelRunner` to handle both `CpuGpuBuffer` and plain Tensors. Key improvements include fixing `kv_cache` initialization for `DecodeOnly` mode, adding `clear_row` functionality to `BlockTable`, and disabling async scheduling during speculative decoding on Ascend to prevent accuracy divergence. A redundant global import in the worker patch initialization was identified as a potential conflict for 310P devices and should be restricted to the appropriate conditional block.
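The buffer-compatibility abstraction described above could look like the following minimal sketch. The helper name `as_cpu_tensor` and the stand-in `CpuGpuBuffer` class are illustrative assumptions, not vLLM's actual code:

```python
import torch

class CpuGpuBuffer:
    """Stand-in for vLLM's CpuGpuBuffer: a CPU tensor paired with a device copy."""
    def __init__(self, *shape, dtype=torch.int32):
        self.cpu = torch.zeros(*shape, dtype=dtype)
        self.gpu = self.cpu.clone()  # would live on the NPU/GPU in practice

def as_cpu_tensor(buf):
    """Return the CPU-side tensor whether `buf` is a CpuGpuBuffer or a plain Tensor."""
    if isinstance(buf, torch.Tensor):
        return buf
    return buf.cpu
```

Routing all buffer access through one helper like this lets `NPUModelRunner` work unchanged whether the installed vLLM version hands it a `CpuGpuBuffer` or a plain tensor.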
### Does this PR introduce _any_ user-facing change?
Async scheduling is now automatically disabled when speculative decoding is active on Ascend platforms to ensure model accuracy.
### How was this patch tested?
Tested with existing model suites and speculative decoding configurations on Ascend hardware.
vllm-project/vllm#37880 adds the ability to set different MoE backends for the draft and target models; we should override it in the draft proposer.
This function is not available in all vLLM versions, particularly in versions where spec_decode was still evolving. By adding a fallback that returns the config as-is, we maintain compatibility with older vLLM releases while still supporting the newer enhanced version. Fixes 94 test failures in main2main CI run #23837548280.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
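The fallback pattern described in this commit can be sketched as a guarded import. The helper name `update_spec_decode_config` is hypothetical; the real function lives in vLLM and may be absent in older releases:

```python
# Guarded import: if the installed vLLM is too old to provide the helper,
# substitute a no-op that returns the config unchanged.
try:
    from vllm.config import update_spec_decode_config  # hypothetical name
except ImportError:
    def update_spec_decode_config(config):
        # Older vLLM: the helper does not exist, so return the config as-is.
        return config
```

The caller never has to branch on the vLLM version; it always calls the same name and gets either the enhanced behavior or a safe pass-through.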
vLLM v0.18.0 calls clear_row() on block tables to reset request rows. Both BlockTable and MultiGroupBlockTable were missing this method, causing AttributeError when remove_request() was called.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
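The missing method can be sketched as below. Both classes are simplified stand-ins for the vllm-ascend tables, and the attribute names are assumptions for illustration:

```python
import numpy as np

class BlockTable:
    """Simplified per-request block table with the clear_row() hook."""
    def __init__(self, max_num_reqs: int, max_num_blocks: int):
        self.block_table_np = np.zeros((max_num_reqs, max_num_blocks), dtype=np.int32)
        self.num_blocks_per_row = np.zeros(max_num_reqs, dtype=np.int32)

    def clear_row(self, row_idx: int) -> None:
        # Reset one request's row so the slot can be reused after remove_request().
        self.block_table_np[row_idx].fill(0)
        self.num_blocks_per_row[row_idx] = 0

class MultiGroupBlockTable:
    """Holds one BlockTable per KV-cache group and fans clear_row() out to each."""
    def __init__(self, block_tables):
        self.block_tables = block_tables

    def clear_row(self, row_idx: int) -> None:
        for table in self.block_tables:
            table.clear_row(row_idx)
```

With both classes exposing `clear_row`, vLLM's `remove_request()` path can reset a row without caring whether it is talking to a single-group or multi-group table.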
…t and Qwen3.5

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
This fixes a method signature mismatch that was causing TypeErrors during engine initialization. The vLLM base class QuantizationConfig.maybe_update_config() expects three parameters (model_name, hf_config, revision), but the vllm-ascend AscendModelSlimConfig override was missing the hf_config parameter. This was causing 92% of CI test failures (47 out of 51 failed test cases).

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
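The signature fix described above can be sketched as follows. Both classes here are simplified stand-ins, not the real vLLM or vllm-ascend implementations:

```python
# The base class mirrors vLLM's three-parameter
# QuantizationConfig.maybe_update_config(model_name, hf_config, revision).
class QuantizationConfig:
    @classmethod
    def maybe_update_config(cls, model_name, hf_config=None, revision=None):
        return {"model": model_name, "hf_config": hf_config, "revision": revision}

class AscendModelSlimConfig(QuantizationConfig):
    # Before the fix this override omitted hf_config, so vLLM's
    # three-argument call raised TypeError during engine initialization.
    @classmethod
    def maybe_update_config(cls, model_name, hf_config=None, revision=None):
        return super().maybe_update_config(model_name, hf_config, revision)
```

The rule it illustrates: an override must accept at least every parameter the base class's callers pass positionally, or the mismatch surfaces as a TypeError at call time rather than at definition time.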
What this PR does / why we need it?
Main2main: upgrade vLLM to 0330.
Fix breakages:
- `--speculative-config`: vllm-project/vllm#37880 adds the ability to set different MoE backends for the draft and target models; we should override it in the draft proposer.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI