[Misc][main2main] Adapt vllm-ascend to vLLM 0420 #8428
wangxiyuan merged 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request aligns the vllm-ascend extension with recent upstream changes in vLLM, specifically focusing on FusedMoE router compatibility and updating the tracked vLLM commit hash. These changes ensure that the Ascend-specific operations remain functional with the latest core vLLM developments.
Code Review
This pull request updates the vLLM commit hash in the documentation and introduces a _get_num_logical_experts helper function across several MoE implementation files to correctly handle expert counts in zero_experts_compute. Feedback indicates that using global_num_experts as a fallback is incorrect because it includes redundant padding experts, which can disable zero-expert logic. Additionally, the logic should be adjusted to account for shared experts when mix_placement is enabled.
Suggested PR Title:
[Ops][Misc] Update vLLM commit and refine logical expert count in MoE

Suggested PR Summary:
### What this PR does / why we need it?
This PR updates the vLLM commit hash in the documentation and introduces a `_get_num_logical_experts` helper function across several MoE implementation files to correctly handle expert counts in `zero_experts_compute`. Feedback indicates that using `global_num_experts` as a fallback is incorrect because it includes redundant padding experts, which can disable zero-expert logic. Additionally, the logic should be adjusted to account for shared experts when `mix_placement` is enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with existing tests.

```diff
 expert_indices=topk_ids,
 expert_scales=topk_weights,
-num_experts=global_num_experts,
+num_experts=_get_num_logical_experts(layer, global_num_experts),
```
The num_experts parameter in zero_experts_compute defines the boundary between real experts and padding (zero) experts. Using global_num_experts as the fallback is incorrect because it includes global_redundant_expert_num (padding), which effectively disables the zero-expert logic for those padding experts. The boundary should be the total number of non-padding experts (routed + shared).
```diff
-num_experts=_get_num_logical_experts(layer, global_num_experts),
+num_experts=_get_num_logical_experts(layer, global_num_experts - global_redundant_expert_num),
```
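The thread never shows the helper's body. As a minimal sketch, assuming the upstream router change exposes a `logical_num_experts` attribute on the layer (the attribute name comes from the compatibility table later in this thread; the lookup itself is a guess, not the PR's code):

```python
# Minimal sketch only -- the real helper lives in the PR's MoE files and
# the attribute lookup here is an assumption, not confirmed code.
def _get_num_logical_experts(layer, fallback: int) -> int:
    """Number of logical (non-padding) experts for a FusedMoE layer.

    Uses the layer's logical_num_experts when the upstream router
    exposes it; otherwise falls back to the caller-supplied count.
    """
    logical = getattr(layer, "logical_num_experts", None)
    return logical if logical is not None else fallback
```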
```diff
 expert_indices=topk_ids,
 expert_scales=topk_weights,
-num_experts=global_num_experts,
+num_experts=_get_num_logical_experts(layer, global_num_experts),
```
Using global_num_experts as the fallback for num_experts in zero_experts_compute fails to identify padding experts when global_redundant_expert_num > 0.
Additionally, when mix_placement is active, logical_num_experts (which typically represents the number of routed experts) should be incremented by n_shared_experts to ensure shared experts are not incorrectly treated as zero experts. Since valid_global_expert_num is already calculated as the routed expert count, it should be used as the fallback for the routed portion.
```diff
-num_experts=_get_num_logical_experts(layer, global_num_experts),
+num_experts=_get_num_logical_experts(layer, valid_global_expert_num) + (n_shared_experts if mix_placement else 0),
```
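Putting the reviewer's two points together: the boundary passed to `zero_experts_compute` should cover routed experts plus, under `mix_placement`, shared experts, and nothing from the padding range. A hypothetical helper illustrating that arithmetic (the names mirror the diff, but the function itself is an illustration, not PR code), building on the `_get_num_logical_experts` sketch above:

```python
# Illustration only: composes the reviewer's suggested boundary. The
# expert index space assumed here is [routed | shared | padding].
def zero_expert_boundary(layer, valid_global_expert_num: int,
                         n_shared_experts: int, mix_placement: bool) -> int:
    """First index that zero_experts_compute should treat as padding."""
    # Routed portion: prefer the logical count, fall back to the
    # already-computed valid (non-redundant) routed expert count.
    boundary = _get_num_logical_experts(layer, valid_global_expert_num)
    if mix_placement:
        # Shared experts sit past the routed range and must not be
        # misclassified as zero (padding) experts.
        boundary += n_shared_experts
    return boundary
```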
**Commit range:** `6f786f2`..`d886c26`

| Priority | Change | vLLM File | vllm-ascend File | Description |
|:---|:---|:---|:---|:---|
| P0 | Zero-expert logical expert boundary | `vllm/model_executor/layers/fused_moe/layer.py` | `vllm_ascend/ops/fused_moe/fused_moe.py`, `vllm_ascend/_310p/fused_moe/fused_moe.py`, `vllm_ascend/quantization/methods/w8a8_dynamic.py`, `vllm_ascend/_310p/quantization/methods/w8a8_dynamic.py` | Use `logical_num_experts` when available so Ascend zero-expert handling stays compatible with the upstream FusedMoE router changes on main. |
| P1 | Main2main commit reference bump | `docs/source/conf.py`, `.github/workflows/pr_test_full.yaml`, `.github/workflows/pr_test_light.yaml`, `.github/workflows/dockerfiles/Dockerfile.lint` | same | Update main-branch vLLM commit references from `6f786f2c506cb07f4566771fdc62e640e2c4a176` to `d886c26d4d4fef7d079696beb4ece1cfb4b008a8`. |

Changed files:

- `.github/workflows/dockerfiles/Dockerfile.lint`
- `.github/workflows/pr_test_full.yaml`
- `.github/workflows/pr_test_light.yaml`
- `docs/source/conf.py`
- `vllm_ascend/ops/fused_moe/fused_moe.py`
- `vllm_ascend/_310p/fused_moe/fused_moe.py`
- `vllm_ascend/quantization/methods/w8a8_dynamic.py`
- `vllm_ascend/_310p/quantization/methods/w8a8_dynamic.py`

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
CI Fix Summary (run ID: 24638481321)

Commit range: `6f786f2`..`f150107`

Issues Fixed

| Error | Upstream Cause Commit | Affected Files | Fix Description |
|:---|:---|:---|:---|
| `AttributeError: 'float' object has no attribute 'language_model'` | c08f3b2a — Measure encoder compile time seperate from llm backbone | `vllm_ascend/worker/worker.py`, `tests/ut/worker/test_worker_v1.py` | Return `CompilationTimes` on vLLM main, keep scalar return for v0.19.0 via `vllm_version_is("0.19.0")`. |
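A hedged sketch of what that version gate could look like. `CompilationTimes` and `vllm_version_is` are named in the fix above, but the function body, the `vllm_version_is` import path, and the stand-in class here are assumptions for illustration, not the PR's worker code:

```python
import time
from dataclasses import dataclass

from vllm_ascend.utils import vllm_version_is  # gate named in the fix; path assumed


@dataclass
class CompilationTimes:
    # Stand-in for vLLM main's CompilationTimes; the real class and its
    # fields (the error message mentions `language_model`) live upstream.
    language_model: float


def timed_warm_up(warm_up_fn):
    """Run a warm-up/compile step, returning a version-appropriate value."""
    start = time.perf_counter()
    warm_up_fn()
    elapsed = time.perf_counter() - start
    if vllm_version_is("0.19.0"):
        # v0.19.0 callers expect a plain float compile time.
        return elapsed
    # vLLM main measures encoder compile time separately from the LLM
    # backbone (upstream c08f3b2a) and expects a structured result.
    return CompilationTimes(language_model=elapsed)
```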
Commit range: `6f786f2`..`d886c26`

### What this PR does / why we need it?

Adapts vllm-ascend to vLLM main at `d886c26`: bumps the tracked vLLM commit references and uses `logical_num_experts` when available so Ascend zero-expert handling stays compatible with the upstream FusedMoE router changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

CI passed with existing tests.