[Misc][Main2Main] Upgrade vLLM to 0427 #8899
shen-shanshan wants to merge 35 commits into vllm-project:main from
Conversation
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
…0410

vLLM PR #40410 split the single EagleCudaGraphManager into separate PrefillEagleCudaGraphManager and DecodeEagleCudaGraphManager, each with a different capture() signature:
- Prefill: capture(forward_fn, full_cg_attn_states, ...)
- Decode: capture(forward_fn, model_state, input_buffers, block_tables, ...)

The upstream speculator now calls self.prefill_cudagraph_manager.capture() with only (forward_fn, attn_states), but EagleAclGraphManager still had the old decode-style signature requiring 4 extra positional args, causing:

TypeError: EagleAclGraphManager.capture() missing 4 required positional arguments: 'input_buffers', 'block_tables', 'attn_groups', 'kv_cache_config'

Fixed by importing PrefillEagleCudaGraphManager and dispatching capture() to the correct parent class based on self.is_draft_model_prefill.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
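For clarity, a minimal self-contained sketch of the dispatch this commit describes. The two parent classes are stubs standing in for the real ones from vLLM PR #40410; their signatures are copied from the commit text above, not from the vLLM source.

```python
# Stubs standing in for the real vLLM classes from PR #40410; only the
# differing capture() signatures matter for this illustration.
class PrefillEagleCudaGraphManager:
    def capture(self, forward_fn, full_cg_attn_states):
        print("prefill capture")


class DecodeEagleCudaGraphManager:
    def capture(self, forward_fn, model_state, input_buffers,
                block_tables, attn_groups, kv_cache_config):
        print("decode capture")


class EagleAclGraphManager(PrefillEagleCudaGraphManager,
                           DecodeEagleCudaGraphManager):
    def __init__(self, is_draft_model_prefill: bool):
        self.is_draft_model_prefill = is_draft_model_prefill

    def capture(self, forward_fn, *args, **kwargs):
        # Dispatch to the parent whose signature matches the caller: the
        # upstream speculator passes only (forward_fn, attn_states) on the
        # prefill path, while decode keeps the four extra positional args.
        if self.is_draft_model_prefill:
            return PrefillEagleCudaGraphManager.capture(
                self, forward_fn, *args, **kwargs)
        return DecodeEagleCudaGraphManager.capture(
            self, forward_fn, *args, **kwargs)


# The prefill path now accepts the two-argument call from the speculator:
EagleAclGraphManager(is_draft_model_prefill=True).capture(
    lambda: None, "attn_states")
```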
…or vLLM PR #40654

vLLM PR #40654 added seq_lens_cpu_upper_bound as a new required field to InputBatch (a CPU upper bound on seq_lens to avoid a GPU->CPU sync). AscendInputBatch inherits from InputBatch and must supply this field. Compute it the same way as upstream: num_computed_tokens_np + num_scheduled_tokens, zero-padded to num_reqs_padded, then pass it when constructing AscendInputBatch in prepare_inputs().

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
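A hedged sketch of that computation, assuming numpy arrays shaped like the upstream InputBatch fields; the standalone function name compute_seq_lens_cpu_upper_bound is illustrative, not the real code layout.

```python
import numpy as np


def compute_seq_lens_cpu_upper_bound(
        num_computed_tokens_np: np.ndarray,
        num_scheduled_tokens: np.ndarray,
        num_reqs_padded: int) -> np.ndarray:
    # CPU-side upper bound on seq_lens: tokens already computed plus the
    # tokens scheduled this step, so no GPU->CPU sync is required.
    upper_bound = num_computed_tokens_np + num_scheduled_tokens
    # Zero-pad to the padded request count before handing the array to
    # AscendInputBatch in prepare_inputs().
    padded = np.zeros(num_reqs_padded, dtype=upper_bound.dtype)
    padded[:upper_bound.shape[0]] = upper_bound
    return padded
```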
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced resolve_kv_cache_block_sizes() into engine/core.py and added a restriction that hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1), raising:

ValueError: Hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1).

This restriction is correct for CUDA (the CUDA MLA implementation cannot combine hybrid KV with CP), but Ascend has dedicated CP backends for MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination. Fixed by patching resolve_kv_cache_block_sizes() to skip the ValueError for multiple groups + CP on Ascend, and instead compute scheduler_block_size as lcm(group_block_sizes) * dcp * pcp for proper alignment.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
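A rough sketch of the alignment formula from this commit. The lcm expression follows the commit text; ascend_scheduler_block_size is an illustrative name, since the real patch replaces resolve_kv_cache_block_sizes() inside vLLM rather than living as a free function.

```python
import math
from functools import reduce


def ascend_scheduler_block_size(group_block_sizes: list[int],
                                dcp_world_size: int,
                                pcp_world_size: int) -> int:
    # Instead of raising for multiple hybrid KV cache groups + CP, align
    # the scheduler block size: lcm of all group block sizes, scaled by
    # both context-parallel world sizes.
    return (reduce(math.lcm, group_block_sizes)
            * dcp_world_size * pcp_world_size)


# Example: groups with block sizes 64 and 128, dcp=2, pcp=1 -> 256.
assert ascend_scheduler_block_size([64, 128], 2, 1) == 256
```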
Summary of Changes

This pull request upgrades the vLLM dependency and introduces a comprehensive compatibility layer to support both the current pinned version and newer upstream releases. The changes focus on refactoring MoE layer logic, ensuring consistent patch application across engine-core processes, and enabling context parallelism on Ascend hardware by overriding KV cache block size resolution. These updates maintain feature parity while preparing the codebase for future vLLM version migrations.
Code Review
Suggested PR Title:
[Ops][Misc] Support vLLM v0.19.1 and upstream compatibility patches

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements compatibility updates for vLLM v0.19.1. It merges `SharedFusedMoE` logic into the base `FusedMoE` classes, handles version-specific changes in `CompilationTimes`, `LoRA` utilities, and `RMSNormGated`. Additionally, it introduces a global patching mechanism and a specific patch for `resolve_kv_cache_block_sizes` to enable hybrid KV cache with context parallelism on Ascend.
Feedback: Several critical issues were found, including missing imports for `vllm_version_is` in multiple files and the use of an undefined variable `is_legacy` in the MoE initialization logic.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated unit tests for MoE and Eagle proposer.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
What this PR does / why we need it?
Based on #8856.
Sync to vLLM
4d51588e2381018348f1022dfa3a7698899805b7.

Fix:
1. `TypeError: rejection_sample() got an unexpected keyword argument 'synthetic_mode'` -> add `synthetic_mode` and `synthetic_conditional_rates` params to the Ascend `rejection_sample()`.
2. `encoder_compilation_time` `AttributeError` (c08f3b2a6, #39240) -> worker/worker.py:567, `getattr` fallback.
3. Ascend `RMSNormGated` activation `TypeError` (893611813, #40245) -> ops/layernorm.py:160 and _310p/ops/layernorm.py:43, `activation` kwarg.
4. `AscendFusedMoEMethod.apply` `topk_weights` `TypeError` (5e584ce9e #35782, 809d83c2d #40560, 4d51588e2 #40860) -> ops/fused_moe/fused_moe.py:107.
5. `_all_lora_classes` is now a tuple (a250f1bd5, #35077) -> lora/utils.py:188, drop `.add()`.
6. `ProfilingChunkScheduler` `hash_block_size` `TypeError` (7b1bc0a3e, #40946) -> core/scheduler_profiling_chunk.py:57.
7. `_moe_C.topk_softmax` `AttributeError` -> torch_npu topk-softmax (with issue 4), quantization/methods/w8a8_dynamic.py:198.
8. `assert common_attn_metadata.seq_lens_cpu_upper_bound is not None` -> propagate the new field (see the sketch after this list):
   1. …
   2. vllm_ascend/attention/utils.py:216 - added propagation in the unpadded() method so sliced copies preserve the field.
   3. vllm_ascend/spec_decode/dflash_proposer.py:179 - added seq_lens_cpu_upper_bound for the DFlash proposer's graph-capture metadata.
   4. vllm_ascend/spec_decode/eagle_proposer.py (3 locations): 422 - graph-capture metadata for the EAGLE proposer; 1583 - prepare_inputs() post-rejection metadata; 1662 - prepare_inputs_padded() metadata propagation.
   5. vllm_ascend/worker/v2/attn_utils.py:99 - V2 attention utilities metadata construction.
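A hedged sketch of item 2 above (propagation in unpadded()). CommonAttnMetadataSketch is a stand-in dataclass, not the real vllm_ascend metadata type; only the slice-and-carry pattern reflects the actual change.

```python
from dataclasses import dataclass, replace

import torch


@dataclass
class CommonAttnMetadataSketch:
    # Stand-in for the real attention metadata; only the two fields
    # relevant to this fix are modeled.
    seq_lens: torch.Tensor
    seq_lens_cpu_upper_bound: torch.Tensor

    def unpadded(self, num_reqs: int) -> "CommonAttnMetadataSketch":
        # Slicing must carry the new field along; otherwise downstream
        # code asserting on seq_lens_cpu_upper_bound would see a stale,
        # padded copy that no longer matches seq_lens.
        return replace(
            self,
            seq_lens=self.seq_lens[:num_reqs],
            seq_lens_cpu_upper_bound=(
                self.seq_lens_cpu_upper_bound[:num_reqs]),
        )
```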
The mechanism in detail:
1. Old vLLM code: the MoE runner called layer.maybe_all_reduce_tensor_model_parallel(), which AscendFusedMoE overrides to decide correctly per communication type (for ALLTOALL/MC2, finalize() already includes the TP aggregation, so no further reduce is needed).
2. New vLLM code (after the upgrade): MoERunner.forward() calls vllm.distributed.tensor_model_parallel_all_reduce() directly, bypassing the Ascend override entirely.
3. On Ascend A3 (910B), the default MoE communication type is ALLTOALL/MC2/FUSED_MC2: moe_comm_method.finalize() already includes the TP all-reduce of the routed experts, and _forward_shared_experts() already TP all-reduces the shared experts (line 678), but _maybe_reduce_final_output() in MoERunner.forward() performs yet another TP all-reduce -> double reduce, and the output is scaled up.
4. On Ascend A2 (910A), the default communication type is AllGather: finalize() contains no TP all-reduce, _forward_shared_experts() skips the TP all-reduce (its condition is not met), and _maybe_reduce_final_output() performs the single TP all-reduce -> correct.

Fix (see the sketch below):
- _fused_output_is_reduced property: returns True when the communication type is ALLTOALL/MC2/FUSED_MC2, telling the upstream MoERunner.forward() not to do the TP all-reduce in _maybe_reduce_final_output;
- _maybe_reduce_shared_expert_output method: returns shared_output directly (no extra reduce), since _forward_shared_experts has already handled it.
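A minimal sketch of those two overrides. The MoECommType enum and the class name AscendFusedMoESketch are illustrative stand-ins; the real hooks are consumed by upstream vLLM's MoERunner.forward().

```python
from enum import Enum, auto


class MoECommType(Enum):
    ALLGATHER = auto()
    ALLTOALL = auto()
    MC2 = auto()
    FUSED_MC2 = auto()


class AscendFusedMoESketch:
    def __init__(self, moe_comm_type: MoECommType):
        self.moe_comm_type = moe_comm_type

    @property
    def _fused_output_is_reduced(self) -> bool:
        # For ALLTOALL/MC2/FUSED_MC2 (the A3/910B default), finalize() has
        # already all-reduced the routed-expert output, so the upstream
        # MoERunner.forward() must skip _maybe_reduce_final_output to avoid
        # a double TP all-reduce. For AllGather (the A2/910A default) this
        # returns False and the upstream reduce remains the only one.
        return self.moe_comm_type in (MoECommType.ALLTOALL,
                                      MoECommType.MC2,
                                      MoECommType.FUSED_MC2)

    def _maybe_reduce_shared_expert_output(self, shared_output):
        # _forward_shared_experts() already all-reduced the shared-expert
        # output where needed, so return it unchanged.
        return shared_output
```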
Does this PR introduce any user-facing change?
How was this patch tested?