[Misc][Main2Main] Upgrade vLLM to 0.20.1 #8880
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the project to vLLM 0.20.1. The changes focus on reconciling the Ascend backend with upstream API modifications, particularly in MoE layers, KV cache management, and LoRA utilities. These updates ensure that the Ascend platform remains compatible with the latest vLLM features while maintaining performance optimizations.
This pull request has conflicts. Please resolve them before we can evaluate the pull request.
Code Review
This pull request upgrades the vLLM dependency to version 0.20.1 and implements extensive compatibility fixes across the Ascend-specific codebase. Key changes include adapting FusedMoE and SharedFusedMoE for both standard and 310P platforms, updating LayerNorm and LoRA utilities to handle vLLM API changes, and introducing version-aware logic for speculative decoding (Eagle proposer) and scheduler components. Additionally, a new patch for KV cache block size resolution was added to support context parallelism on Ascend. Feedback includes a style guide violation regarding the PR title and summary format, and a critical bug in AscendMoERunner where a missing argument in legacy vLLM versions would lead to incorrect initialization.
| "cann_image_tag": "8.5.1-910b-ubuntu22.04-py3.11", | ||
| # vLLM commit hash for main branch | ||
| "main_vllm_commit": "d886c26d4d4fef7d079696beb4ece1cfb4b008a8", | ||
| "main_vllm_commit": "132765e3560659ff63ebd236203672e991b70e08", |
The Pull Request title and summary do not adhere to the repository style guide. Please update them to follow the required format.
Suggested PR Title:
[Main2Main][Misc][Misc] Upgrade vLLM to 0.20.1

Suggested PR Summary:
### What this PR does / why we need it?
This PR upgrades the vLLM dependency to version 0.20.1. It includes necessary adaptations for Ascend-specific operators (FusedMoE, LayerNorm), worker logic, and speculative decoding components to maintain compatibility with the updated vLLM core.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with existing unit tests and new tests for MoE logical experts and Eagle proposer.

References
- The PR Title and Summary must follow specific formats defined in the Repository Style Guide. (link)
```python
self._shared_experts if is_legacy else kwargs.pop("shared_experts", None),
self.quant_method,
self.reduce_results,
self.vllm_config.parallel_config.enable_dbo,
```
The AscendMoERunner call is missing the reduce_results argument when is_legacy is True. In vLLM 0.19.1, DefaultMoERunner (which AscendMoERunner inherited from) required reduce_results as the 8th positional argument. Passing enable_dbo as the 8th argument will lead to incorrect initialization on older vLLM versions.
```python
self.quant_method,
*((self.reduce_results, self.vllm_config.parallel_config.enable_dbo)
  if is_legacy else
  (self.vllm_config.parallel_config.enable_dbo,)),
```
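For reference, here is a minimal, self-contained sketch of the version-aware argument ordering the review is asking for. The helper name `build_runner_args` and the demo values are purely illustrative; only `is_legacy`, `reduce_results`, `quant_method`, and `enable_dbo` come from the snippet above, and this is not the actual vllm-ascend implementation.

```python
# Hedged sketch: per the review, on legacy vLLM (0.19.x) the MoE runner
# constructor still expects reduce_results as an extra positional argument,
# so the trailing positional args must be built conditionally. All names here
# are illustrative stand-ins, not the real vllm-ascend code.

def build_runner_args(is_legacy: bool, quant_method, reduce_results: bool,
                      enable_dbo: bool) -> tuple:
    """Return the trailing positional args for the MoE runner constructor."""
    if is_legacy:
        # Legacy path: reduce_results is still a required positional parameter,
        # so it must come before enable_dbo.
        return (quant_method, reduce_results, enable_dbo)
    # Non-legacy path (per the snippet above): only enable_dbo follows
    # quant_method.
    return (quant_method, enable_dbo)


if __name__ == "__main__":
    print(build_runner_args(True, "w8a8", True, False))   # legacy: 3 trailing args
    print(build_runner_args(False, "w8a8", True, False))  # current: 2 trailing args
```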
Signed-off-by: wxsIcey <1790571317@qq.com>
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with a different capture() signature:

- Prefill: capture(forward_fn, full_cg_attn_states, ...)
- Decode: capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture() with only (forward_fn, attn_states), but EagleAclGraphManager still had the old decode-style signature requiring 4 extra positional args, causing:

TypeError: EagleAclGraphManager.capture() missing 4 required positional arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fix by importing PrefillEagleCudaGraphManager and dispatching capture() to the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
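To make the dispatch concrete, here is a simplified, runnable sketch of the pattern the commit describes: route capture() to the prefill-style or decode-style parent depending on is_draft_model_prefill. The class names below are shortened stand-ins for the vLLM/vllm-ascend classes named in the commit message and do not reflect the real signatures beyond what is stated there.

```python
# Illustrative sketch only: simplified stand-ins for the prefill/decode graph
# managers and the Ascend subclass that dispatches between them.

class PrefillGraphManager:
    def capture(self, forward_fn, attn_states):
        # Prefill-style signature: (forward_fn, attn_states)
        return f"prefill capture({forward_fn.__name__}, {attn_states})"


class DecodeGraphManager:
    def capture(self, forward_fn, model_state, input_buffers, block_tables):
        # Decode-style signature with extra positional buffers.
        return f"decode capture({forward_fn.__name__}, ...)"


class EagleGraphManager(PrefillGraphManager, DecodeGraphManager):
    def __init__(self, is_draft_model_prefill: bool):
        self.is_draft_model_prefill = is_draft_model_prefill

    def capture(self, forward_fn, *args, **kwargs):
        # Pick the parent whose capture() signature matches the current phase.
        if self.is_draft_model_prefill:
            return PrefillGraphManager.capture(self, forward_fn, *args, **kwargs)
        return DecodeGraphManager.capture(self, forward_fn, *args, **kwargs)


if __name__ == "__main__":
    def fwd():
        pass

    print(EagleGraphManager(True).capture(fwd, attn_states="full_cg"))
    print(EagleGraphManager(False).capture(fwd, None, [], []))
```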
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to InputBatch (a CPU upper bound on seq_lens to avoid GPU->CPU sync). AscendInputBatch inherits from InputBatch and must supply this field. Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens, zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
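As an illustration, a small NumPy sketch of the computation described above: computed plus scheduled tokens, zero-padded to the padded request count. The function name is hypothetical; the real logic lives in AscendInputBatch construction inside prepare_inputs().

```python
# Sketch of the upper-bound computation described in the commit message,
# using plain NumPy. Variable names follow the commit message.
import numpy as np


def compute_seq_lens_cpu_upper_bound(num_computed_tokens_np: np.ndarray,
                                     num_scheduled_tokens: np.ndarray,
                                     num_reqs_padded: int) -> np.ndarray:
    """CPU upper bound on seq_lens: computed + scheduled tokens, zero-padded."""
    upper_bound = num_computed_tokens_np + num_scheduled_tokens
    padded = np.zeros(num_reqs_padded, dtype=upper_bound.dtype)
    padded[:upper_bound.shape[0]] = upper_bound
    return padded


if __name__ == "__main__":
    computed = np.array([10, 32, 0], dtype=np.int64)
    scheduled = np.array([1, 1, 16], dtype=np.int64)
    print(compute_seq_lens_cpu_upper_bound(computed, scheduled, num_reqs_padded=4))
    # -> [11 33 16  0]
```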
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced resolve_kv_cache_block_sizes() into engine/core.py and added a restriction that hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1), raising:

ValueError: Hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot combine hybrid KV with CP), but Ascend has dedicated CP backends for MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination. Fix by patching resolve_kv_cache_block_sizes() to skip the ValueError for multiple-groups + CP on Ascend, and instead compute scheduler_block_size as lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
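A minimal sketch of the Ascend-side block-size alignment the commit describes, assuming the inputs are the per-group block sizes and the DCP/PCP world sizes. The function name is hypothetical; the actual change patches resolve_kv_cache_block_sizes() rather than adding a new helper.

```python
# Sketch: scheduler_block_size = lcm(group_block_sizes) * dcp * pcp,
# mirroring the formula in the commit message.
import math
from functools import reduce


def ascend_scheduler_block_size(group_block_sizes: list[int],
                                dcp_world_size: int,
                                pcp_world_size: int) -> int:
    """Align the scheduler block size to every KV cache group and CP degree."""
    lcm_block_size = reduce(math.lcm, group_block_sizes)
    return lcm_block_size * dcp_world_size * pcp_world_size


if __name__ == "__main__":
    # e.g. hybrid groups with block sizes 64 and 128, dcp=2, pcp=1 -> 256
    print(ascend_scheduler_block_size([64, 128], dcp_world_size=2, pcp_world_size=1))
```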
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
Based on #8856.
### Does this PR introduce any user-facing change?
### How was this patch tested?