[Feat] 310p support MoE W8A8 quantization #6641
wangxiyuan merged 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @pu-zhe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the vLLM Ascend backend by introducing support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P hardware. The changes implement a new dynamic quantization scheme and integrate it deeply into the fused MoE forward pass, covering expert selection, MLP computation, and inter-expert communication. This enables more efficient execution of MoE models on Ascend 310P by reducing memory footprint and potentially improving inference speed through quantized operations.
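To make the "W8A8 dynamic quantization" part of the summary concrete: dynamic activation quantization typically computes one scale per token at runtime from the activation's own range, rather than using a calibrated static scale. The snippet below is a minimal NumPy sketch of that idea only; it is not the Ascend kernel used in this PR, and the function name `dynamic_quant_w8a8` is illustrative.

```python
import numpy as np

def dynamic_quant_w8a8(x: np.ndarray):
    """Per-token dynamic int8 quantization: one scale per row (token)."""
    # Map each row's max magnitude onto the symmetric int8 range [-127, 127].
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

x = np.random.randn(4, 8).astype(np.float32)
x_q, scale = dynamic_quant_w8a8(x)
x_hat = x_q.astype(np.float32) * scale  # dequantize for a round-trip check
```

Because the scale is derived from each incoming batch, no offline calibration data is needed, which is what makes the scheme "dynamic".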
Activity
Code Review
This pull request introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. The changes include a new quantization scheme, a unified MLP implementation for 310P, and refactoring of MoE and quantization configuration logic. New e2e and unit tests are also added to verify the new functionality.
The implementation is well-structured, but I've found a few issues that need to be addressed:
- An inconsistency in an e2e test configuration.
- A bug in a unit test for the new dynamic quantization method.
- A type hint mismatch in the new MoE MLP implementation.
Additionally, the pull request description is currently empty. I've provided a suggestion for the title and summary below, following the repository's style guide.
Suggested PR Title:
[MoE][Feature] Support W8A8 quantization for MoE on 310P

Suggested PR Summary:
### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:
- Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.
### Does this PR introduce _any_ user-facing change?
Yes, users can now run MoE models with W8A8 quantization on 310P by setting `quantization="ascend"`.
### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp4_w8a8` to test MoE W8A8 quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method.

Signed-off-by: pu-zhe <zpuaa@outlook.com>
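The `experts_selector` component referenced in the unit-test list is the routing step of a MoE layer: per token, pick the top-k experts from the router logits and renormalize their weights. The sketch below illustrates that general mechanism in NumPy; `select_experts` is a hypothetical name, not the actual function tested in this PR.

```python
import numpy as np

def select_experts(router_logits: np.ndarray, top_k: int):
    """Top-k expert selection with softmax routing weights (sketch)."""
    # Softmax over experts, per token (numerically stabilized).
    e = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Indices of the top_k experts for each token, highest probability first.
    topk_ids = np.argsort(-probs, axis=-1)[:, :top_k]
    topk_weights = np.take_along_axis(probs, topk_ids, axis=-1)
    # Renormalize so the selected weights sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(axis=-1, keepdims=True)
    return topk_weights, topk_ids
```

Each token's hidden state is then dispatched to its selected experts and the expert outputs are combined using these weights.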
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: pu-zhe <zpuaa@outlook.com>
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend:
- [Feat] 310p support MoE W8A8 quantizaition (vllm-project#6641)
- [TEST]add a qwen3-30b acc case with mooncake mempool (vllm-project#6244)
- [MOE Refactor] Remove QuantType in prepare_finalize.py (vllm-project#6534)
- [EPLB] Avoiding eplb's dependency on a specified model (vllm-project#6528)
- [Doc][Misc] Restructure tutorial documentation (vllm-project#6501)
- implement batch invariant with ascendc (vllm-project#6590)
- [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (vllm-project#6629)
- [Misc] upgrade to vllm main (vllm-project#6646)
- [main][Docs] Fix spelling errors across documentation (vllm-project#6649)
- [bugfix]Fix no attribute 'data' when MLAPO is enable (vllm-project#6601)
- [DOC]Add Memcache Usage Guide (vllm-project#6476)
- [main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (vllm-project#6606)
- [Test][LoRA] Add e2e test for base model inference (vllm-project#6624)
- [refactor]Optimized the kvcache usage of Deepseek v3.2 (vllm-project#6610)
- [Feat](sfa,dcp) support dcp for sfa (vllm-project#6563)
- [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (vllm-project#6581)
- [fix bug] fix tensor mismatch bug in sigmoid operate test case (vllm-project#6619)
- [Kernel]: Optimize DispatchFFNCombine performance (vllm-project#6468)
- [MISC] Clean up useless env USE_OPTIMIZED_MODEL (vllm-project#6618)
### What this PR does / why we need it?

This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:

- Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method.
- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@d7e17aa

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
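The description's "unified MLP implementation that handles both quantized and unquantized paths" suggests a single entry point that dispatches on whether quantization scales are present. The NumPy sketch below illustrates that dispatch pattern under stated assumptions: `unified_apply_mlp_sketch` is a hypothetical name, and the real 310P `unified_apply_mlp` operates on NPU kernels with int8 GEMMs rather than this dequantize-then-matmul simplification.

```python
import numpy as np

def unified_apply_mlp_sketch(hidden, w1, w2, w1_scale=None, w2_scale=None):
    """One expert-MLP forward pass; routes to the quantized path when
    per-channel weight scales are provided, else uses float weights."""
    def matmul(x, w, scale):
        if scale is None:
            return x @ w  # unquantized path: float weights as-is
        # Quantized path (simplified): dequantize int8 weights by their
        # per-output-channel scale, then do a float matmul.
        return x @ (w.astype(np.float32) * scale)

    gate = matmul(hidden, w1, w1_scale)
    act = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU activation
    return matmul(act, w2, w2_scale)
```

Keeping both paths behind one function means the surrounding fused-MoE code does not need to branch on the quantization config at every call site.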