[Feat] 310p support MoE W8A8 quantization #6641
wangxiyuan merged 7 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @pu-zhe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the vLLM Ascend backend by introducing support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P hardware. The changes implement a new dynamic quantization scheme and integrate it deeply into the fused MoE forward pass, covering expert selection, MLP computation, and inter-expert communication. This enables more efficient execution of MoE models on Ascend 310P by reducing memory footprint and potentially improving inference speed through quantized operations.
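To make the "W8A8 dynamic quantization" part of the summary concrete: dynamic activation quantization typically computes one scale per token at runtime from the activation's own range, rather than using a calibrated static scale. The snippet below is a minimal NumPy sketch of that idea only; it is not the Ascend kernel used in this PR, and the function name `dynamic_quant_w8a8` is illustrative.

```python
import numpy as np

def dynamic_quant_w8a8(x: np.ndarray):
    """Per-token dynamic int8 quantization: one scale per row (token)."""
    # Map each row's max magnitude onto the symmetric int8 range [-127, 127].
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

x = np.random.randn(4, 8).astype(np.float32)
x_q, scale = dynamic_quant_w8a8(x)
x_hat = x_q.astype(np.float32) * scale  # dequantize for a round-trip check
```

Because the scale is derived from each incoming batch, no offline calibration data is needed, which is what makes the scheme "dynamic".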
Activity
Code Review
This pull request introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. The changes include a new quantization scheme, a unified MLP implementation for 310P, and refactoring of MoE and quantization configuration logic. New e2e and unit tests are also added to verify the new functionality.
The implementation is well-structured, but I've found a few issues that need to be addressed:
- An inconsistency in an e2e test configuration.
- A bug in a unit test for the new dynamic quantization method.
- A type hint mismatch in the new MoE MLP implementation.
Additionally, the pull request description is currently empty. I've provided a suggestion for the title and summary below, following the repository's style guide.
Suggested PR Title:
[MoE][Feature] Support W8A8 quantization for MoE on 310P

Suggested PR Summary:
### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:
- Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.
### Does this PR introduce _any_ user-facing change?
Yes, users can now run MoE models with W8A8 quantization on 310P by setting `quantization="ascend"`.
### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp4_w8a8` to test MoE W8A8 quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method.

Signed-off-by: pu-zhe <zpuaa@outlook.com>
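The `experts_selector` component referenced in the unit-test list is the routing step of a MoE layer: per token, pick the top-k experts from the router logits and renormalize their weights. The sketch below illustrates that general mechanism in NumPy; `select_experts` is a hypothetical name, not the actual function tested in this PR.

```python
import numpy as np

def select_experts(router_logits: np.ndarray, top_k: int):
    """Top-k expert selection with softmax routing weights (sketch)."""
    # Softmax over experts, per token (numerically stabilized).
    e = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Indices of the top_k experts for each token, highest probability first.
    topk_ids = np.argsort(-probs, axis=-1)[:, :top_k]
    topk_weights = np.take_along_axis(probs, topk_ids, axis=-1)
    # Renormalize so the selected weights sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(axis=-1, keepdims=True)
    return topk_weights, topk_ids
```

Each token's hidden state is then dispatched to its selected experts and the expert outputs are combined using these weights.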
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: pu-zhe <zpuaa@outlook.com>
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend:
- [Feat] 310p support MoE W8A8 quantizaition (vllm-project#6641)
- [TEST]add a qwen3-30b acc case with mooncake mempool (vllm-project#6244)
- [MOE Refactor] Remove QuantType in prepare_finalize.py (vllm-project#6534)
- [EPLB] Avoiding eplb's dependency on a specified model (vllm-project#6528)
- [Doc][Misc] Restructure tutorial documentation (vllm-project#6501)
- implement batch invariant with ascendc (vllm-project#6590)
- [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (vllm-project#6629)
- [Misc] upgrade to vllm main (vllm-project#6646)
- [main][Docs] Fix spelling errors across documentation (vllm-project#6649)
- [bugfix]Fix no attribute 'data' when MLAPO is enable (vllm-project#6601)
- [DOC]Add Memcache Usage Guide (vllm-project#6476)
- [main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (vllm-project#6606)
- [Test][LoRA] Add e2e test for base model inference (vllm-project#6624)
- [refactor]Optimized the kvcache usage of Deepseek v3.2 (vllm-project#6610)
- [Feat](sfa,dcp) support dcp for sfa (vllm-project#6563)
- [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (vllm-project#6581)
- [fix bug] fix tensor mismatch bug in sigmoid operate test case (vllm-project#6619)
- [Kernel]: Optimize DispatchFFNCombine performance (vllm-project#6468)
- [MISC] Clean up useless env USE_OPTIMIZED_MODEL (vllm-project#6618)
### What this PR does / why we need it?

This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:

- Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Added a new e2e test `test_qwen3_moe_tp2_w8a8` to test MoE W8A8 quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method.
- vLLM version: v0.15.0
- vLLM main: vllm-project/vllm@d7e17aa

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
Signed-off-by: mikequan0425 <mikequan0425@foxmail.com>
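The description's "unified MLP implementation that handles both quantized and unquantized paths" suggests a single entry point that dispatches on whether quantization scales are present. The NumPy sketch below illustrates that dispatch pattern under stated assumptions: `unified_apply_mlp_sketch` is a hypothetical name, and the real 310P `unified_apply_mlp` operates on NPU kernels with int8 GEMMs rather than this dequantize-then-matmul simplification.

```python
import numpy as np

def unified_apply_mlp_sketch(hidden, w1, w2, w1_scale=None, w2_scale=None):
    """One expert-MLP forward pass; routes to the quantized path when
    per-channel weight scales are provided, else uses float weights."""
    def matmul(x, w, scale):
        if scale is None:
            return x @ w  # unquantized path: float weights as-is
        # Quantized path (simplified): dequantize int8 weights by their
        # per-output-channel scale, then do a float matmul.
        return x @ (w.astype(np.float32) * scale)

    gate = matmul(hidden, w1, w1_scale)
    act = gate * (1.0 / (1.0 + np.exp(-gate)))  # SiLU activation
    return matmul(act, w2, w2_scale)
```

Keeping both paths behind one function means the surrounding fused-MoE code does not need to branch on the quantization config at every call site.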