
[Misc][Main2Main] Upgrade vLLM to 0.20.1 #8880

Closed
shen-shanshan wants to merge 29 commits into vllm-project:main from shen-shanshan:pr/8856

Conversation

@shen-shanshan
Collaborator

shen-shanshan commented May 6, 2026

What this PR does / why we need it?

Based on #8856.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Contributor

github-actions bot commented May 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the project to vLLM 0.20.1. The changes focus on reconciling the Ascend backend with upstream API modifications, particularly in MoE layers, KV cache management, and LoRA utilities. These updates ensure that the Ascend platform remains compatible with the latest vLLM features while maintaining performance optimizations.

Highlights

  • vLLM Upgrade: Upgraded the underlying vLLM dependency to version 0.20.1, necessitating various compatibility adjustments across the codebase.
  • MoE Refactoring: Extensively refactored FusedMoE and shared expert logic to align with upstream changes and improve performance on Ascend hardware.
  • Compatibility Patches: Introduced new patches for KV cache block size resolution and LoRA class registration to support newer vLLM APIs while maintaining backward compatibility (see the version-gating sketch after this list).
  • Logical Expert Calculation: Added a centralized utility to calculate logical experts, improving consistency across different quantization methods.
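
Much of this compatibility work follows a single pattern: detect the installed vLLM version and branch between the legacy and new code paths. Below is a minimal, self-contained sketch of that pattern; the helper name and the branch bodies are illustrative assumptions, not the repository's actual utilities.

    from importlib.metadata import version

    from packaging.version import Version

    def vllm_is_at_least(minimum: str) -> bool:
        """Illustrative helper: report whether the installed vLLM is at
        least `minimum`, so call sites can pick the matching API shape."""
        return Version(version("vllm")) >= Version(minimum)

    if vllm_is_at_least("0.20.1"):
        ...  # new upstream API path (e.g. shared_experts passed via kwargs)
    else:
        ...  # legacy 0.19.x path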

Ignored Files
  • Ignored by pattern: .github/workflows/** (4)
    • .github/workflows/_e2e_test.yaml
    • .github/workflows/pr_test_full.yaml
    • .github/workflows/pr_test_light.yaml
    • .github/workflows/scripts/config.yaml

@github-actions
Contributor

github-actions bot commented May 6, 2026

This pull request has conflicts; please resolve them before we can evaluate the pull request.

@gemini-code-assist
Contributor

gemini-code-assist bot left a comment


Code Review

This pull request upgrades the vLLM dependency to version 0.20.1 and implements extensive compatibility fixes across the Ascend-specific codebase. Key changes include adapting FusedMoE and SharedFusedMoE for both standard and 310P platforms, updating LayerNorm and LoRA utilities to handle vLLM API changes, and introducing version-aware logic for speculative decoding (Eagle proposer) and scheduler components. Additionally, a new patch for KV cache block size resolution was added to support context parallelism on Ascend. Feedback includes a style guide violation regarding the PR title and summary format, and a critical bug in AscendMoERunner where a missing argument in legacy vLLM versions would lead to incorrect initialization.

Comment thread: docs/source/conf.py

     "cann_image_tag": "8.5.1-910b-ubuntu22.04-py3.11",
     # vLLM commit hash for main branch
-    "main_vllm_commit": "d886c26d4d4fef7d079696beb4ece1cfb4b008a8",
+    "main_vllm_commit": "132765e3560659ff63ebd236203672e991b70e08",

high

The Pull Request title and summary do not adhere to the repository style guide. Please update them to follow the required format.

Suggested PR Title:

[Main2Main][Misc][Misc] Upgrade vLLM to 0.20.1

Suggested PR Summary:

### What this PR does / why we need it?
This PR upgrades the vLLM dependency to version 0.20.1. It includes necessary adaptations for Ascend-specific operators (FusedMoE, LayerNorm), worker logic, and speculative decoding components to maintain compatibility with the updated vLLM core.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested with existing unit tests and new tests for MoE logical experts and Eagle proposer.
References
  1. The PR Title and Summary must follow specific formats defined in the Repository Style Guide. (link)

Comment thread (AscendMoERunner call site):

    self._shared_experts if is_legacy else kwargs.pop("shared_experts", None),
    self.quant_method,
    self.reduce_results,
    self.vllm_config.parallel_config.enable_dbo,

high

The AscendMoERunner call is missing the reduce_results argument when is_legacy is True. In vLLM 0.19.1, DefaultMoERunner (which AscendMoERunner inherited from) required reduce_results as the 8th positional argument. Passing enable_dbo as the 8th argument will lead to incorrect initialization on older vLLM versions.

            self.quant_method,
            *((self.reduce_results, self.vllm_config.parallel_config.enable_dbo)
              if is_legacy
              else (self.vllm_config.parallel_config.enable_dbo,)),
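
To make the positional shift concrete, here is a small, self-contained illustration; the constructor signature is an assumption reconstructed from this comment, not vLLM's actual DefaultMoERunner code.

    class LegacyMoERunner:
        # Assumed 0.19.x-style signature: reduce_results is positional,
        # standing in for DefaultMoERunner's longer real parameter list.
        def __init__(self, quant_method, reduce_results, enable_dbo=False):
            self.reduce_results = reduce_results
            self.enable_dbo = enable_dbo

    enable_dbo = True
    # Buggy call shape: reduce_results omitted, so enable_dbo fills its slot.
    bad = LegacyMoERunner("w8a8", enable_dbo)
    assert bad.reduce_results is True and bad.enable_dbo is False  # misbound

    # The suggested fix restores the argument on the legacy path only.
    is_legacy, reduce_results = True, True
    good = LegacyMoERunner(
        "w8a8",
        *((reduce_results, enable_dbo) if is_legacy else (enable_dbo,)),
    )
    assert good.reduce_results is True and good.enable_dbo is True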

wxsIcey and others added 9 commits May 6, 2026 10:44
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin and others added 18 commits May 6, 2026 10:47
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate
PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with
a different capture() signature:
  - Prefill: capture(forward_fn, full_cg_attn_states, ...)
  - Decode:  capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture()
with only (forward_fn, attn_states), but EagleAclGraphManager still had
the old decode-style signature requiring 4 extra positional args, causing:
  TypeError: EagleAclGraphManager.capture() missing 4 required positional
  arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fix by importing PrefillEagleCudaGraphManager and dispatching capture() to
the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
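
A minimal sketch of that dispatch; the class names and is_draft_model_prefill come from the commit message above, while the stand-in base classes and method bodies are placeholders for the real vLLM managers.

    class PrefillEagleCudaGraphManager:          # stand-in for the vLLM class
        def capture(self, forward_fn, full_cg_attn_states):
            ...  # prefill-style capture

    class DecodeEagleCudaGraphManager:           # stand-in for the vLLM class
        def capture(self, forward_fn, model_state, input_buffers,
                    block_tables, attn_groups, kv_cache_config):
            ...  # decode-style capture

    class EagleAclGraphManager(PrefillEagleCudaGraphManager,
                               DecodeEagleCudaGraphManager):
        is_draft_model_prefill: bool

        def capture(self, forward_fn, *args, **kwargs):
            # Route to whichever parent matches the caller's signature.
            parent = (PrefillEagleCudaGraphManager
                      if self.is_draft_model_prefill
                      else DecodeEagleCudaGraphManager)
            return parent.capture(self, forward_fn, *args, **kwargs)
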
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to
InputBatch (a CPU upper-bound on seq_lens to avoid GPU->CPU sync).
AscendInputBatch inherits from InputBatch and must supply this field.

Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens,
zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch
in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
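
In sketch form, the computation described above looks like this; the names mirror the commit message, and the array contents are illustrative.

    import numpy as np

    num_reqs, num_reqs_padded = 3, 4
    num_computed_tokens_np = np.array([128, 64, 0])
    num_scheduled_tokens = np.array([16, 16, 512])

    # The upper bound lives on the CPU, so no GPU->CPU sync is needed.
    seq_lens_cpu_upper_bound = np.zeros(num_reqs_padded, dtype=np.int64)
    seq_lens_cpu_upper_bound[:num_reqs] = (
        num_computed_tokens_np + num_scheduled_tokens)
    # -> [144, 80, 512, 0], zero-padded to num_reqs_padded
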
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced
resolve_kv_cache_block_sizes() into engine/core.py and added a restriction
that hybrid KV cache groups with multiple block sizes do not support context
parallelism (dcp_world_size/pcp_world_size > 1), raising:
  ValueError: Hybrid KV cache groups with multiple block sizes do not
  support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot
combine hybrid KV with CP), but Ascend has dedicated CP backends for
MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination.

Fix by patching resolve_kv_cache_block_sizes() to skip the ValueError for
multiple-groups + CP on Ascend, and instead compute scheduler_block_size as
lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
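
The alignment rule described above reduces to a one-liner; a sketch with illustrative values follows (the function wrapper is hypothetical, only the formula comes from the commit message).

    import math

    def ascend_scheduler_block_size(group_block_sizes, dcp_world_size,
                                    pcp_world_size):
        # lcm keeps every hybrid KV cache group aligned; scaling by the
        # context-parallel world sizes matches the commit's formula.
        return math.lcm(*group_block_sizes) * dcp_world_size * pcp_world_size

    assert ascend_scheduler_block_size([64, 128], 2, 1) == 256
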
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
shen-shanshan deleted the pr/8856 branch May 6, 2026 06:50

Labels

  • ci/build
  • documentation (Improvements or additions to documentation)
  • module:core
  • module:ops
  • module:quantization
  • module:tests
  • ready (ready for review)
  • ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants