[Misc][Main2Main] Upgrade vLLM to 0.20.1 #8880
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the project to vLLM 0.20.1. The changes focus on reconciling the Ascend backend with upstream API modifications, particularly in MoE layers, KV cache management, and LoRA utilities. These updates ensure that the Ascend platform remains compatible with the latest vLLM features while maintaining performance optimizations.
This pull request has conflicts. Please resolve them before we can evaluate the pull request.
Code Review
This pull request upgrades the vLLM dependency to version 0.20.1 and implements extensive compatibility fixes across the Ascend-specific codebase. Key changes include adapting FusedMoE and SharedFusedMoE for both standard and 310P platforms, updating LayerNorm and LoRA utilities to handle vLLM API changes, and introducing version-aware logic for speculative decoding (Eagle proposer) and scheduler components. Additionally, a new patch for KV cache block size resolution was added to support context parallelism on Ascend. Feedback includes a style guide violation regarding the PR title and summary format, and a critical bug in AscendMoERunner where a missing argument in legacy vLLM versions would lead to incorrect initialization.
| "cann_image_tag": "8.5.1-910b-ubuntu22.04-py3.11", | ||
| # vLLM commit hash for main branch | ||
| "main_vllm_commit": "d886c26d4d4fef7d079696beb4ece1cfb4b008a8", | ||
| "main_vllm_commit": "132765e3560659ff63ebd236203672e991b70e08", |
The Pull Request title and summary do not adhere to the repository style guide. Please update them to follow the required format.
Suggested PR Title:
[Main2Main][Misc][Misc] Upgrade vLLM to 0.20.1

Suggested PR Summary:
### What this PR does / why we need it?
This PR upgrades the vLLM dependency to version 0.20.1. It includes necessary adaptations for Ascend-specific operators (FusedMoE, LayerNorm), worker logic, and speculative decoding components to maintain compatibility with the updated vLLM core.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested with existing unit tests and new tests for MoE logical experts and Eagle proposer.

References
- The PR Title and Summary must follow specific formats defined in the Repository Style Guide. (link)
```python
self._shared_experts if is_legacy else kwargs.pop("shared_experts", None),
self.quant_method,
self.reduce_results,
self.vllm_config.parallel_config.enable_dbo,
```
The AscendMoERunner call is missing the reduce_results argument when is_legacy is True. In vLLM 0.19.1, DefaultMoERunner (which AscendMoERunner inherited from) required reduce_results as the 8th positional argument. Passing enable_dbo as the 8th argument will lead to incorrect initialization on older vLLM versions.
```python
self.quant_method,
*((self.reduce_results, self.vllm_config.parallel_config.enable_dbo)
  if is_legacy else
  (self.vllm_config.parallel_config.enable_dbo,)),
```
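For reference, here is a minimal, self-contained sketch of the version-aware argument ordering the review is asking for. The helper name `build_runner_args` and the demo values are purely illustrative; only `is_legacy`, `reduce_results`, `quant_method`, and `enable_dbo` come from the snippet above, and this is not the actual vllm-ascend implementation.

```python
# Hedged sketch: per the review, on legacy vLLM (0.19.x) the MoE runner
# constructor still expects reduce_results as an extra positional argument,
# so the trailing positional args must be built conditionally. All names here
# are illustrative stand-ins, not the real vllm-ascend code.

def build_runner_args(is_legacy: bool, quant_method, reduce_results: bool,
                      enable_dbo: bool) -> tuple:
    """Return the trailing positional args for the MoE runner constructor."""
    if is_legacy:
        # Legacy path: reduce_results is still a required positional parameter,
        # so it must come before enable_dbo.
        return (quant_method, reduce_results, enable_dbo)
    # Non-legacy path (per the snippet above): only enable_dbo follows
    # quant_method.
    return (quant_method, enable_dbo)


if __name__ == "__main__":
    print(build_runner_args(True, "w8a8", True, False))   # legacy: 3 trailing args
    print(build_runner_args(False, "w8a8", True, False))  # current: 2 trailing args
```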
Signed-off-by: wxsIcey <1790571317@qq.com>
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with a different capture() signature:

- Prefill: capture(forward_fn, full_cg_attn_states, ...)
- Decode: capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture() with only (forward_fn, attn_states), but EagleAclGraphManager still had the old decode-style signature requiring 4 extra positional args, causing:

TypeError: EagleAclGraphManager.capture() missing 4 required positional arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fix by importing PrefillEagleCudaGraphManager and dispatching capture() to the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
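To make the dispatch concrete, here is a simplified, runnable sketch of the pattern the commit describes: route capture() to the prefill-style or decode-style parent depending on is_draft_model_prefill. The class names below are shortened stand-ins for the vLLM/vllm-ascend classes named in the commit message and do not reflect the real signatures beyond what is stated there.

```python
# Illustrative sketch only: simplified stand-ins for the prefill/decode graph
# managers and the Ascend subclass that dispatches between them.

class PrefillGraphManager:
    def capture(self, forward_fn, attn_states):
        # Prefill-style signature: (forward_fn, attn_states)
        return f"prefill capture({forward_fn.__name__}, {attn_states})"


class DecodeGraphManager:
    def capture(self, forward_fn, model_state, input_buffers, block_tables):
        # Decode-style signature with extra positional buffers.
        return f"decode capture({forward_fn.__name__}, ...)"


class EagleGraphManager(PrefillGraphManager, DecodeGraphManager):
    def __init__(self, is_draft_model_prefill: bool):
        self.is_draft_model_prefill = is_draft_model_prefill

    def capture(self, forward_fn, *args, **kwargs):
        # Pick the parent whose capture() signature matches the current phase.
        if self.is_draft_model_prefill:
            return PrefillGraphManager.capture(self, forward_fn, *args, **kwargs)
        return DecodeGraphManager.capture(self, forward_fn, *args, **kwargs)


if __name__ == "__main__":
    def fwd():
        pass

    print(EagleGraphManager(True).capture(fwd, attn_states="full_cg"))
    print(EagleGraphManager(False).capture(fwd, None, [], []))
```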
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to InputBatch (a CPU upper bound on seq_lens to avoid GPU->CPU sync). AscendInputBatch inherits from InputBatch and must supply this field. Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens, zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
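As an illustration, a small NumPy sketch of the computation described above: computed plus scheduled tokens, zero-padded to the padded request count. The function name is hypothetical; the real logic lives in AscendInputBatch construction inside prepare_inputs().

```python
# Sketch of the upper-bound computation described in the commit message,
# using plain NumPy. Variable names follow the commit message.
import numpy as np


def compute_seq_lens_cpu_upper_bound(num_computed_tokens_np: np.ndarray,
                                     num_scheduled_tokens: np.ndarray,
                                     num_reqs_padded: int) -> np.ndarray:
    """CPU upper bound on seq_lens: computed + scheduled tokens, zero-padded."""
    upper_bound = num_computed_tokens_np + num_scheduled_tokens
    padded = np.zeros(num_reqs_padded, dtype=upper_bound.dtype)
    padded[:upper_bound.shape[0]] = upper_bound
    return padded


if __name__ == "__main__":
    computed = np.array([10, 32, 0], dtype=np.int64)
    scheduled = np.array([1, 1, 16], dtype=np.int64)
    print(compute_seq_lens_cpu_upper_bound(computed, scheduled, num_reqs_padded=4))
    # -> [11 33 16  0]
```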
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced resolve_kv_cache_block_sizes() into engine/core.py and added a restriction that hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1), raising:

ValueError: Hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot combine hybrid KV with CP), but Ascend has dedicated CP backends for MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination. Fix by patching resolve_kv_cache_block_sizes() to skip the ValueError for multiple-groups + CP on Ascend, and instead compute scheduler_block_size as lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
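A minimal sketch of the Ascend-side block-size alignment the commit describes, assuming the inputs are the per-group block sizes and the DCP/PCP world sizes. The function name is hypothetical; the actual change patches resolve_kv_cache_block_sizes() rather than adding a new helper.

```python
# Sketch: scheduler_block_size = lcm(group_block_sizes) * dcp * pcp,
# mirroring the formula in the commit message.
import math
from functools import reduce


def ascend_scheduler_block_size(group_block_sizes: list[int],
                                dcp_world_size: int,
                                pcp_world_size: int) -> int:
    """Align the scheduler block size to every KV cache group and CP degree."""
    lcm_block_size = reduce(math.lcm, group_block_sizes)
    return lcm_block_size * dcp_world_size * pcp_world_size


if __name__ == "__main__":
    # e.g. hybrid groups with block sizes 64 and 128, dcp=2, pcp=1 -> 256
    print(ascend_scheduler_block_size([64, 128], dcp_world_size=2, pcp_world_size=1))
```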
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
Based on #8856.
### Does this PR introduce any user-facing change?
### How was this patch tested?