
[Misc][Main2Main] Upgrade vLLM to 0427 #8899

Open

shen-shanshan wants to merge 35 commits into vllm-project:main from shen-shanshan:pr/8856

Conversation

Collaborator

@shen-shanshan shen-shanshan commented May 6, 2026

What this PR does / why we need it?

Based on #8856.

Sync with vLLM commit 4d51588e2381018348f1022dfa3a7698899805b7.


Fix:

| # | Error | Category | Upstream Commit | Affected vllm-ascend Path | Fix |
|---|-------|----------|-----------------|---------------------------|-----|
| 1 | `encoder_compilation_time` AttributeError | Code Bug | c08f3b2a6 (#39240) | worker/worker.py:567 | `getattr` fallback |
| 2 | `AscendRMSNormGated` activation TypeError | Code Bug | 893611813 (#40245) | ops/layernorm.py:160, _310p/ops/layernorm.py:43 | Accept `activation` kwarg |
| 3 | `AscendFusedMoEMethod.apply` topk_weights TypeError | Code Bug | many (e.g., 5e584ce9e (#35782), 809d83c2d (#40560), 4d51588e2 (#40860)) | ops/fused_moe/fused_moe.py:107 | Major refactor (follow-up PR) |
| 4 | `_all_lora_classes` is tuple | Code Bug | a250f1bd5 (#35077) | lora/utils.py:188 | Rebuild tuple instead of `.add()` |
| 5 | `ProfilingChunkScheduler` hash_block_size TypeError | Code Bug | 7b1bc0a3e (#40946) | core/scheduler_profiling_chunk.py:57 | Forward new kwarg |
| 6 | `_moe_C.topk_softmax` AttributeError | Code Bug | MoE router refactor | router dispatch override needed | Provide torch_npu topk-softmax (with Issue 4) |
| 7 | global experts shape mismatch | Code Bug | follow-on of Issue 4 | quantization/methods/w8a8_dynamic.py:198 | Resolve once Issue 4 is fixed |
Error 1: `AssertionError: assert common_attn_metadata.seq_lens_cpu_upper_bound is not None` (vllm-project/vllm#40654)

Root cause: upstream vLLM PR #40654 ("Avoid seq_lens_cpu GPU→CPU sync") added a new required field `seq_lens_cpu_upper_bound` to `CommonAttentionMetadata`. Several attention backends (including cross_attention.py) now assert this field is not None. vllm-ascend's `AscendCommonAttentionMetadata` subclass had 6 creation sites, none of which set this field.

Fix:
1. vllm_ascend/worker/model_runner_v1.py:2480 — primary fix: added `seq_lens_cpu_upper_bound=self.optimistic_seq_lens_cpu[:num_reqs_padded]` to the main `cm_base` construction (matches the upstream pattern, where `optimistic_seq_lens_cpu` serves as the upper bound).
2. vllm_ascend/attention/utils.py:216 — added propagation in the `unpadded()` method so sliced copies preserve the field.
3. vllm_ascend/spec_decode/dflash_proposer.py:179 — added `seq_lens_cpu_upper_bound` for the DFlash proposer's graph-capture metadata.
4. vllm_ascend/spec_decode/eagle_proposer.py (3 locations): 422 — graph-capture metadata for the EAGLE proposer; 1583 — `prepare_inputs()` post-rejection metadata; 1662 — `prepare_inputs_padded()` metadata propagation.
5. vllm_ascend/worker/v2/attn_utils.py:99 — V2 attention utilities metadata construction.
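A minimal sketch of the pattern behind fix items 1 and 2 above, assuming a stripped-down metadata dataclass; the real `AscendCommonAttentionMetadata` has many more fields, and only the propagation of the new field is illustrated here.

```python
from dataclasses import dataclass, replace

import torch

@dataclass
class CommonAttnMetadataSketch:
    seq_lens: torch.Tensor                  # device-side sequence lengths
    seq_lens_cpu_upper_bound: torch.Tensor  # CPU upper bound, avoids D2H sync

    def unpadded(self, num_reqs: int) -> "CommonAttnMetadataSketch":
        # Fix item 2: sliced copies must preserve the new field, otherwise the
        # backend assertion `seq_lens_cpu_upper_bound is not None` fires.
        return replace(
            self,
            seq_lens=self.seq_lens[:num_reqs],
            seq_lens_cpu_upper_bound=self.seq_lens_cpu_upper_bound[:num_reqs],
        )
```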
Error 2: Qwen3-MoE accuracy problem (vLLM PRs #35782, #40560)

Root cause: vLLM's MoE runner refactor (PRs #35782 and #40560) changed the TP all-reduce call path, causing a double TP all-reduce on Ascend in the ALLTOALL/MC2 communication modes.

Mechanism:
1. Old vLLM code: the MoE runner called layer.maybe_all_reduce_tensor_model_parallel(), which AscendFusedMoE overrides to handle each communication type correctly (for ALLTOALL/MC2, finalize() already performs the TP aggregation, so no further reduce is needed).
2. New vLLM code (after the upgrade): MoERunner.forward() calls vllm.distributed.tensor_model_parallel_all_reduce() directly, completely bypassing the Ascend override.
3. On Ascend A3 (910B), the default MoE communication types are ALLTOALL/MC2/FUSED_MC2: moe_comm_method.finalize() already includes the TP all-reduce for the routed experts, and _forward_shared_experts() already all-reduces the shared-expert output (line 678), but _maybe_reduce_final_output() in MoERunner.forward() performs yet another TP all-reduce → a double reduce that inflates the output.
4. On Ascend A2 (910A), the default communication type is AllGather: finalize() contains no TP all-reduce, _forward_shared_experts() skips its TP all-reduce (the condition is not met), and _maybe_reduce_final_output() performs the only TP all-reduce → correct.

Fix: two new overrides in AscendMoERunner in vllm_ascend/ops/fused_moe/fused_moe.py (sketched below):
• the _fused_output_is_reduced property returns True when the communication type is ALLTOALL/MC2/FUSED_MC2, telling the upstream MoERunner.forward() to skip the TP all-reduce in _maybe_reduce_final_output;
• the _maybe_reduce_shared_expert_output method returns shared_output unchanged (no extra reduce), since _forward_shared_experts has already handled it.
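A minimal sketch of the two overrides, assuming the hook names from the description and an illustrative string-valued communication type; the real AscendMoERunner subclasses vLLM's MoERunner, and exact signatures may differ.

```python
import torch

ALREADY_REDUCED = {"ALLTOALL", "MC2", "FUSED_MC2"}  # illustrative constants

class AscendMoERunnerSketch:
    def __init__(self, moe_comm_type: str):
        self.moe_comm_type = moe_comm_type

    @property
    def _fused_output_is_reduced(self) -> bool:
        # For ALLTOALL/MC2/FUSED_MC2, finalize() already performed the TP
        # all-reduce on the routed-expert output, so tell MoERunner.forward()
        # to skip _maybe_reduce_final_output.
        return self.moe_comm_type in ALREADY_REDUCED

    def _maybe_reduce_shared_expert_output(
            self, shared_output: torch.Tensor) -> torch.Tensor:
        # _forward_shared_experts() already reduced the shared-expert output,
        # so return it unchanged instead of reducing a second time.
        return shared_output
```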

Does this PR introduce any user-facing change?

How was this patch tested?

wxsIcey and others added 26 commits May 6, 2026 14:43
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate
PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with
a different capture() signature:
  - Prefill: capture(forward_fn, full_cg_attn_states, ...)
  - Decode:  capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture()
with only (forward_fn, attn_states), but EagleAclGraphManager still had
the old decode-style signature requiring 4 extra positional args, causing:
  TypeError: EagleAclGraphManager.capture() missing 4 required positional
  arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fix by importing PrefillEagleCudaGraphManager and dispatching capture() to
the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
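A minimal sketch of the dispatch described in this commit. Stub parent classes stand in for vLLM's PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager; the real capture() signatures are abbreviated with *args/**kwargs.

```python
class PrefillEagleCudaGraphManager:
    def capture(self, forward_fn, full_cg_attn_states, *args, **kwargs):
        ...  # prefill-style capture (vLLM PR #40410)

class DecodeEagleCudaGraphManager:
    def capture(self, forward_fn, model_state, input_buffers,
                block_tables, *args, **kwargs):
        ...  # decode-style capture (vLLM PR #40410)

class EagleAclGraphManager(PrefillEagleCudaGraphManager,
                           DecodeEagleCudaGraphManager):
    is_draft_model_prefill: bool = False

    def capture(self, forward_fn, *args, **kwargs):
        # Dispatch to the parent whose signature matches the caller:
        # the upstream speculator passes only (forward_fn, attn_states)
        # for prefill, but the full buffer set for decode.
        parent = (PrefillEagleCudaGraphManager if self.is_draft_model_prefill
                  else DecodeEagleCudaGraphManager)
        return parent.capture(self, forward_fn, *args, **kwargs)
```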
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to
InputBatch (a CPU upper-bound on seq_lens to avoid GPU->CPU sync).
AscendInputBatch inherits from InputBatch and must supply this field.

Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens,
zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch
in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
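A minimal sketch of the upper-bound computation described in this commit: per-request num_computed_tokens + num_scheduled_tokens, zero-padded to the padded request count. Array names mirror the commit message; the function wrapper is illustrative.

```python
import numpy as np

def seq_lens_cpu_upper_bound(num_computed_tokens_np: np.ndarray,
                             num_scheduled_tokens: np.ndarray,
                             num_reqs_padded: int) -> np.ndarray:
    num_reqs = num_computed_tokens_np.shape[0]
    out = np.zeros(num_reqs_padded, dtype=np.int32)
    # Upper bound per request: tokens already computed plus tokens scheduled.
    out[:num_reqs] = num_computed_tokens_np + num_scheduled_tokens
    return out

# Example: two requests, padded to four slots.
print(seq_lens_cpu_upper_bound(np.array([10, 3]), np.array([2, 5]), 4))
# -> [12  8  0  0]
```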
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced
resolve_kv_cache_block_sizes() into engine/core.py and added a restriction
that hybrid KV cache groups with multiple block sizes do not support context
parallelism (dcp_world_size/pcp_world_size > 1), raising:
  ValueError: Hybrid KV cache groups with multiple block sizes do not
  support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot
combine hybrid KV with CP), but Ascend has dedicated CP backends for
MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination.

Fix by patching resolve_kv_cache_block_sizes() to skip the ValueError for
multiple-groups + CP on Ascend, and instead compute scheduler_block_size as
lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
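A minimal sketch of the scheduler_block_size computation described in this commit; variable names follow the commit message, and the function wrapper is illustrative (the real change lives inside the patched resolve_kv_cache_block_sizes()).

```python
import math

def scheduler_block_size(group_block_sizes: list[int],
                         dcp_world_size: int,
                         pcp_world_size: int) -> int:
    # Align across all hybrid KV cache groups, then scale by the
    # context-parallel world sizes.
    return math.lcm(*group_block_sizes) * dcp_world_size * pcp_world_size

# Example: block sizes 64 and 128 with dcp=2, pcp=1 -> lcm(64,128)*2 = 256.
print(scheduler_block_size([64, 128], 2, 1))
```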
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request upgrades the vLLM dependency and introduces a comprehensive compatibility layer to support both the current pinned version and newer upstream releases. The changes focus on refactoring MoE layer logic, ensuring consistent patch application across engine-core processes, and enabling context parallelism on Ascend hardware by overriding KV cache block size resolution. These updates maintain feature parity while preparing the codebase for future vLLM version migrations.

Highlights

  • vLLM Dependency Upgrade: Upgraded the vLLM dependency to the latest commit hash to incorporate upstream improvements.
  • Cross-Version Compatibility: Implemented version-aware compatibility shims using 'vllm_version_is' to ensure stability across both pinned vLLM v0.19.1 and newer versions.
  • MoE Layer Refactoring: Refactored MoE layer implementations, unifying shared expert handling and cleaning up redundant class structures.
  • Global Patching Mechanism: Introduced a '_ensure_global_patch' mechanism to ensure Ascend-specific overrides are correctly applied in engine-core subprocesses.
  • KV Cache Resolution Patch: Added a patch for 'resolve_kv_cache_block_sizes' to support context parallelism on Ascend hardware, bypassing upstream restrictions.
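A minimal sketch of an idempotent global-patch helper in the spirit of the '_ensure_global_patch' highlight above; the module path and replacement function are assumptions, not the PR's actual code. Engine-core subprocesses re-import modules, so the patch must be applied on import and be safe to repeat.

```python
import importlib

_PATCHED = False

def _ascend_resolve_kv_cache_block_sizes(*args, **kwargs):
    ...  # hypothetical Ascend-aware replacement (see the #40860 commit above)

def _ensure_global_patch() -> None:
    global _PATCHED
    if _PATCHED:  # idempotent: engine-core subprocesses may call this again
        return
    core = importlib.import_module("vllm.v1.engine.core")  # assumed module path
    core.resolve_kv_cache_block_sizes = _ascend_resolve_kv_cache_block_sizes
    _PATCHED = True
```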

Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/_e2e_test.yaml
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml

@shen-shanshan shen-shanshan added the ready (read for review) and ready-for-test (start test by label for PR) labels May 6, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Suggested PR Title:

[Ops][Misc] Support vLLM v0.19.1 and upstream compatibility patches

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements compatibility updates for vLLM v0.19.1. It merges `SharedFusedMoE` logic into the base `FusedMoE` classes, handles version-specific changes in `CompilationTimes`, `LoRA` utilities, and `RMSNormGated`. Additionally, it introduces a global patching mechanism and a specific patch for `resolve_kv_cache_block_sizes` to enable hybrid KV cache with context parallelism on Ascend.

Feedback: Several critical issues were found, including missing imports for `vllm_version_is` in multiple files and the use of an undefined variable `is_legacy` in the MoE initialization logic.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated unit tests for MoE and Eagle proposer.

Comment thread vllm_ascend/ops/fused_moe/fused_moe.py
Comment thread vllm_ascend/ops/fused_moe/fused_moe.py Outdated
Comment thread vllm_ascend/worker/worker.py Outdated
@github-actions
Contributor

github-actions Bot commented May 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
@shen-shanshan shen-shanshan requested a review from weijinqian0 as a code owner May 6, 2026 07:17
@shen-shanshan shen-shanshan changed the title [Misc][Main2Main] Upgrade vLLM to 0429 [Misc][Main2Main] Upgrade vLLM to 0427 May 7, 2026

Labels

ci/build · documentation (Improvements or additions to documentation) · module:core · module:ops · module:tests · ready (read for review) · ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants