[Misc][Main2Main] Upgrade vLLM to 0427 #8899
shen-shanshan wants to merge 35 commits into vllm-project:main from
Conversation
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with a different capture() signature:
- Prefill: capture(forward_fn, full_cg_attn_states, ...)
- Decode: capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture() with only (forward_fn, attn_states), but EagleAclGraphManager still had the old decode-style signature requiring 4 extra positional args, causing:

TypeError: EagleAclGraphManager.capture() missing 4 required positional arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fixed by importing PrefillEagleCudaGraphManager and dispatching capture() to the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
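For clarity, a minimal self-contained sketch of the dispatch this commit describes. The two parent classes are stubs standing in for the real ones from vLLM PR #40410; their signatures are copied from the commit text above, not from the vLLM source.

```python
# Stubs standing in for the real vLLM classes from PR #40410; only the
# differing capture() signatures matter for this illustration.
class PrefillEagleCudaGraphManager:
    def capture(self, forward_fn, full_cg_attn_states):
        print("prefill capture")


class DecodeEagleCudaGraphManager:
    def capture(self, forward_fn, model_state, input_buffers,
                block_tables, attn_groups, kv_cache_config):
        print("decode capture")


class EagleAclGraphManager(PrefillEagleCudaGraphManager,
                           DecodeEagleCudaGraphManager):
    def __init__(self, is_draft_model_prefill: bool):
        self.is_draft_model_prefill = is_draft_model_prefill

    def capture(self, forward_fn, *args, **kwargs):
        # Dispatch to the parent whose signature matches the caller: the
        # upstream speculator passes only (forward_fn, attn_states) on the
        # prefill path, while decode keeps the four extra positional args.
        if self.is_draft_model_prefill:
            return PrefillEagleCudaGraphManager.capture(
                self, forward_fn, *args, **kwargs)
        return DecodeEagleCudaGraphManager.capture(
            self, forward_fn, *args, **kwargs)


# The prefill path now accepts the two-argument call from the speculator:
EagleAclGraphManager(is_draft_model_prefill=True).capture(
    lambda: None, "attn_states")
```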
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to InputBatch (a CPU upper bound on seq_lens to avoid a GPU->CPU sync). AscendInputBatch inherits from InputBatch and must supply this field. Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens, zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
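A hedged sketch of that computation, assuming numpy arrays shaped like the upstream InputBatch fields; the standalone function name compute_seq_lens_cpu_upper_bound is illustrative, not the real code layout.

```python
import numpy as np


def compute_seq_lens_cpu_upper_bound(
        num_computed_tokens_np: np.ndarray,
        num_scheduled_tokens: np.ndarray,
        num_reqs_padded: int) -> np.ndarray:
    # CPU-side upper bound on seq_lens: tokens already computed plus the
    # tokens scheduled this step, so no GPU->CPU sync is required.
    upper_bound = num_computed_tokens_np + num_scheduled_tokens
    # Zero-pad to the padded request count before handing the array to
    # AscendInputBatch in prepare_inputs().
    padded = np.zeros(num_reqs_padded, dtype=upper_bound.dtype)
    padded[:upper_bound.shape[0]] = upper_bound
    return padded
```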
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced resolve_kv_cache_block_sizes() into engine/core.py and added a restriction that hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1), raising:

ValueError: Hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot combine hybrid KV with CP), but Ascend has dedicated CP backends for MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination. Fixed by patching resolve_kv_cache_block_sizes() to skip the ValueError for multiple groups + CP on Ascend, and instead compute scheduler_block_size as lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
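A rough sketch of the alignment formula from this commit. The lcm expression follows the commit text; ascend_scheduler_block_size is an illustrative name, since the real patch replaces resolve_kv_cache_block_sizes() inside vLLM rather than living as a free function.

```python
import math
from functools import reduce


def ascend_scheduler_block_size(group_block_sizes: list[int],
                                dcp_world_size: int,
                                pcp_world_size: int) -> int:
    # Instead of raising for multiple hybrid KV cache groups + CP, align
    # the scheduler block size: lcm of all group block sizes, scaled by
    # both context-parallel world sizes.
    return (reduce(math.lcm, group_block_sizes)
            * dcp_world_size * pcp_world_size)


# Example: groups with block sizes 64 and 128, dcp=2, pcp=1 -> 256.
assert ascend_scheduler_block_size([64, 128], 2, 1) == 256
```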
Summary of Changes

This pull request upgrades the vLLM dependency and introduces a comprehensive compatibility layer to support both the current pinned version and newer upstream releases. The changes focus on refactoring MoE layer logic, ensuring consistent patch application across engine-core processes, and enabling context parallelism on Ascend hardware by overriding KV cache block size resolution. These updates maintain feature parity while preparing the codebase for future vLLM version migrations.
Code Review
Suggested PR Title:
[Ops][Misc] Support vLLM v0.19.1 and upstream compatibility patches

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements compatibility updates for vLLM v0.19.1. It merges `SharedFusedMoE` logic into the base `FusedMoE` classes, handles version-specific changes in `CompilationTimes`, `LoRA` utilities, and `RMSNormGated`. Additionally, it introduces a global patching mechanism and a specific patch for `resolve_kv_cache_block_sizes` to enable hybrid KV cache with context parallelism on Ascend.
Feedback: Several critical issues were found, including missing imports for `vllm_version_is` in multiple files and the use of an undefined variable `is_legacy` in the MoE initialization logic.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated unit tests for MoE and Eagle proposer.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
What this PR does / why we need it?
Based on #8856.
Sync to vLLM
4d51588e2381018348f1022dfa3a7698899805b7.

Fix:
1. `TypeError: rejection_sample() got an unexpected keyword argument 'synthetic_mode'` -> add `synthetic_mode` and `synthetic_conditional_rates` params to the Ascend `rejection_sample()`.
2. `encoder_compilation_time` `AttributeError` (c08f3b2a6, #39240) -> worker/worker.py:567, `getattr` fallback.
3. Ascend `RMSNormGated` activation `TypeError` (893611813, #40245) -> ops/layernorm.py:160 and _310p/ops/layernorm.py:43, `activation` kwarg.
4. `AscendFusedMoEMethod.apply` `topk_weights` `TypeError` (5e584ce9e #35782, 809d83c2d #40560, 4d51588e2 #40860) -> ops/fused_moe/fused_moe.py:107.
5. `_all_lora_classes` is now a tuple (a250f1bd5, #35077) -> lora/utils.py:188, drop `.add()`.
6. `ProfilingChunkScheduler` `hash_block_size` `TypeError` (7b1bc0a3e, #40946) -> core/scheduler_profiling_chunk.py:57.
7. `_moe_C.topk_softmax` `AttributeError` -> torch_npu topk-softmax (with issue 4), quantization/methods/w8a8_dynamic.py:198.
8. `assert common_attn_metadata.seq_lens_cpu_upper_bound is not None` -> propagate the new field (see the sketch after this list):
   1. …
   2. vllm_ascend/attention/utils.py:216 - added propagation in the unpadded() method so sliced copies preserve the field.
   3. vllm_ascend/spec_decode/dflash_proposer.py:179 - added seq_lens_cpu_upper_bound for the DFlash proposer's graph-capture metadata.
   4. vllm_ascend/spec_decode/eagle_proposer.py (3 locations): 422 - graph-capture metadata for the EAGLE proposer; 1583 - prepare_inputs() post-rejection metadata; 1662 - prepare_inputs_padded() metadata propagation.
   5. vllm_ascend/worker/v2/attn_utils.py:99 - V2 attention utilities metadata construction.
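A hedged sketch of item 2 above (propagation in unpadded()). CommonAttnMetadataSketch is a stand-in dataclass, not the real vllm_ascend metadata type; only the slice-and-carry pattern reflects the actual change.

```python
from dataclasses import dataclass, replace

import torch


@dataclass
class CommonAttnMetadataSketch:
    # Stand-in for the real attention metadata; only the two fields
    # relevant to this fix are modeled.
    seq_lens: torch.Tensor
    seq_lens_cpu_upper_bound: torch.Tensor

    def unpadded(self, num_reqs: int) -> "CommonAttnMetadataSketch":
        # Slicing must carry the new field along; otherwise downstream
        # code asserting on seq_lens_cpu_upper_bound would see a stale,
        # padded copy that no longer matches seq_lens.
        return replace(
            self,
            seq_lens=self.seq_lens[:num_reqs],
            seq_lens_cpu_upper_bound=(
                self.seq_lens_cpu_upper_bound[:num_reqs]),
        )
```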
The mechanism in detail:
1. Old vLLM code: the MoE runner called layer.maybe_all_reduce_tensor_model_parallel(), which AscendFusedMoE overrides to decide correctly per communication type (for ALLTOALL/MC2, finalize() already includes the TP aggregation, so no further reduce is needed).
2. New vLLM code (after the upgrade): MoERunner.forward() calls vllm.distributed.tensor_model_parallel_all_reduce() directly, bypassing the Ascend override entirely.
3. On Ascend A3 (910B), the default MoE communication type is ALLTOALL/MC2/FUSED_MC2: moe_comm_method.finalize() already includes the TP all-reduce of the routed experts, and _forward_shared_experts() already TP all-reduces the shared experts (line 678), but _maybe_reduce_final_output() in MoERunner.forward() performs yet another TP all-reduce -> double reduce, and the output is scaled up.
4. On Ascend A2 (910A), the default communication type is AllGather: finalize() contains no TP all-reduce, _forward_shared_experts() skips the TP all-reduce (its condition is not met), and _maybe_reduce_final_output() performs the single TP all-reduce -> correct.

Fix (see the sketch below):
- _fused_output_is_reduced property: returns True when the communication type is ALLTOALL/MC2/FUSED_MC2, telling the upstream MoERunner.forward() not to do the TP all-reduce in _maybe_reduce_final_output;
- _maybe_reduce_shared_expert_output method: returns shared_output directly (no extra reduce), since _forward_shared_experts has already handled it.
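A minimal sketch of those two overrides. The MoECommType enum and the class name AscendFusedMoESketch are illustrative stand-ins; the real hooks are consumed by upstream vLLM's MoERunner.forward().

```python
from enum import Enum, auto


class MoECommType(Enum):
    ALLGATHER = auto()
    ALLTOALL = auto()
    MC2 = auto()
    FUSED_MC2 = auto()


class AscendFusedMoESketch:
    def __init__(self, moe_comm_type: MoECommType):
        self.moe_comm_type = moe_comm_type

    @property
    def _fused_output_is_reduced(self) -> bool:
        # For ALLTOALL/MC2/FUSED_MC2 (the A3/910B default), finalize() has
        # already all-reduced the routed-expert output, so the upstream
        # MoERunner.forward() must skip _maybe_reduce_final_output to avoid
        # a double TP all-reduce. For AllGather (the A2/910A default) this
        # returns False and the upstream reduce remains the only one.
        return self.moe_comm_type in (MoECommType.ALLTOALL,
                                      MoECommType.MC2,
                                      MoECommType.FUSED_MC2)

    def _maybe_reduce_shared_expert_output(self, shared_output):
        # _forward_shared_experts() already all-reduced the shared-expert
        # output where needed, so return it unchanged.
        return shared_output
```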
Does this PR introduce any user-facing change?
How was this patch tested?