
[Feat] 310P support MoE W8A8 quantization #6641

Merged
wangxiyuan merged 7 commits into vllm-project:main from pu-zhe:moe_w8a8
Feb 10, 2026

Conversation

@pu-zhe
Contributor

@pu-zhe pu-zhe commented Feb 9, 2026

What this PR does / why we need it?

This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:

  • Implementing a new quantization scheme AscendW8A8DynamicFusedMoEMethod310.
  • Adding a unified MLP implementation (unified_apply_mlp) for 310P that handles both quantized and unquantized paths.
  • Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
  • Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.
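As background, "W8A8 dynamic" means weights are pre-quantized to INT8 (typically per channel) while activations are quantized to INT8 per token at runtime, so no offline calibration set is needed. A minimal NumPy sketch of the per-token activation step, purely for illustration (the real implementation dispatches to fused NPU kernels via torch_npu; these function names are hypothetical):

```python
import numpy as np

def dynamic_quant_w8a8_activation(a: np.ndarray):
    # Per-token dynamic INT8 quantization: each token (row) gets its own
    # scale derived from its absolute maximum at inference time.
    scales = np.abs(a).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(a / scales), -128, 127).astype(np.int8)
    return q, scales

def dequant(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original values.
    return q.astype(np.float32) * scales
```

Because each token derives its scale from its own values, an outlier token does not degrade the precision of the others, which is one reason dynamic (rather than static, calibration-based) activation quantization is a common choice for MoE activations.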

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added a new e2e test test_qwen3_moe_tp2_w8a8 to test MoE W8A8 quantization in a multi-card setup.

  • Added several new unit tests for the 310P-specific MoE components, including experts_selector, fused_moe, moe_comm_method, moe_mlp, and the new w8a8_dynamic quantization method.

  • vLLM version: v0.15.0

  • vLLM main: vllm-project/vllm@d7e17aa

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist
Contributor

Summary of Changes

Hello @pu-zhe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the vLLM Ascend backend by introducing support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P hardware. The changes implement a new dynamic quantization scheme and integrate it into the fused MoE forward pass, covering expert selection, MLP computation, and inter-expert communication. This enables more efficient execution of MoE models on Ascend 310P by reducing memory footprint and potentially improving inference speed through quantized operations.

Highlights

  • W8A8 Dynamic Quantization for MoE: Implements and enables W8A8 dynamic quantization for Mixture-of-Experts (MoE) models specifically on the Ascend 310P platform, reducing memory footprint and potentially improving inference speed.
  • Unified MLP Application: Introduces a new unified_apply_mlp function that intelligently handles both quantized (W8A8) and unquantized operations within MoE layers, streamlining the forward pass logic.
  • MoE Communication Integration: Integrates W8A8 quantization into the MoE communication methods (e.g., AllGatherCommImpl), ensuring that quantized data is correctly dispatched and combined across experts during distributed inference.
  • Comprehensive Testing: Adds extensive unit tests for expert selection, fused MoE operations, MoE communication methods, MLP application, token dispatchers, and the W8A8 dynamic quantization method itself, alongside an end-to-end test for a Qwen3 MoE model with W8A8 quantization.
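The unified routing described above can be sketched as follows. This is a hypothetical float-emulation of the dispatch between the quantized and unquantized expert MLP paths (a SwiGLU-style expert MLP), not the actual 310P kernels; all function bodies here are illustrative assumptions:

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def _quant_per_token(a: np.ndarray):
    # Dynamic per-token scale from the row-wise absolute maximum.
    s = np.abs(a).max(axis=-1, keepdims=True) / 127.0
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(a / s), -128, 127).astype(np.int8), s

def _quant_per_channel(w: np.ndarray):
    # Static per-output-channel scale for a weight matrix.
    s = np.abs(w).max(axis=0, keepdims=True) / 127.0
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(w / s), -128, 127).astype(np.int8), s

def quant_apply_mlp(hidden, w13, w2):
    # W8A8 path (emulated): INT8 matmuls with INT32 accumulation,
    # dequantized by the product of activation and weight scales.
    qa, sa = _quant_per_token(hidden)
    qw13, sw13 = _quant_per_channel(w13)
    gate_up = (qa.astype(np.int32) @ qw13.astype(np.int32)) * sa * sw13
    gate, up = np.split(gate_up, 2, axis=-1)
    h = silu(gate) * up
    qh, sh = _quant_per_token(h)  # re-quantize before the down projection
    qw2, sw2 = _quant_per_channel(w2)
    return (qh.astype(np.int32) @ qw2.astype(np.int32)) * sh * sw2

def unquant_apply_mlp(hidden, w13, w2):
    # Floating-point path: fused gate/up projection, SiLU-gated, down projection.
    gate, up = np.split(hidden @ w13, 2, axis=-1)
    return (silu(gate) * up) @ w2

def unified_apply_mlp(hidden, w13, w2, with_quant: bool):
    # Single entry point that routes on the layer's quantization config.
    if with_quant:
        return quant_apply_mlp(hidden, w13, w2)
    return unquant_apply_mlp(hidden, w13, w2)
```

The design point is that callers (e.g. the fused-experts path in a communication method) need only one MLP entry point; whether the quantized kernels run is decided by the layer's quantization config rather than by branching at every call site.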


Changelog
  • tests/e2e/310p/multicard/test_moe_model_multicard.py
    • Added an end-to-end test case for Qwen3 MoE model with W8A8 quantization on Ascend.
  • tests/ut/_310p/fused_moe/test_experts_selector_310.py
    • Added unit tests for the select_experts function, covering various configurations.
  • tests/ut/_310p/fused_moe/test_fused_moe_310.py
    • Added comprehensive unit tests for fused MoE operations, including mocking distributed environment and NPU calls.
  • tests/ut/_310p/fused_moe/test_moe_comm_method_310.py
    • Added unit tests for MoE communication methods (AllGather, MC2, AlltoAll), verifying their prepare, finalize, and fused_experts functionalities.
  • tests/ut/_310p/fused_moe/test_moe_mlp_310.py
    • Added unit tests for the unified_apply_mlp function, validating both quantized and unquantized MLP computations.
  • tests/ut/_310p/fused_moe/test_token_dispatcher_310.py
    • Added unit tests for MoE token dispatchers, covering dispatch and combine operations with and without quantization.
  • tests/ut/_310p/quantization/test_modelslim_config_310.py
    • Added copyright header.
  • tests/ut/_310p/quantization/test_w8a8_310.py
    • Renamed to tests/ut/_310p/quantization/test_w8a8_static_310.py and added copyright header.
  • tests/ut/_310p/quantization/test_w8a8_dynamic_310.py
    • Added unit tests for the AscendW8A8DynamicFusedMoEMethod310, verifying weight and dynamic quantization parameter retrieval.
  • vllm_ascend/_310p/fused_moe/experts_selector.py
    • Removed unused weight prefetching logic.
  • vllm_ascend/_310p/fused_moe/fused_moe.py
    • Enabled W8A8 quantization support in get_quant_type and refactored forward_impl to integrate the new quantization scheme.
  • vllm_ascend/_310p/fused_moe/moe_comm_method.py
    • Implemented fused_experts method in AllGatherCommImpl310 to orchestrate quantized MLP operations.
  • vllm_ascend/_310p/fused_moe/moe_mlp.py
    • Added new file containing quant_apply_mlp, unquant_apply_mlp, and unified_apply_mlp for handling MLP computations with and without W8A8 quantization.
  • vllm_ascend/_310p/fused_moe/token_dispatcher.py
    • Streamlined token_dispatch by removing quantization-related parameters and checks, allowing for W8A8 integration.
  • vllm_ascend/_310p/quantization/methods/init.py
    • Enabled the import of w8a8_dynamic quantization method.
  • vllm_ascend/_310p/quantization/methods/w8a8_dynamic.py
    • Added new file implementing the AscendW8A8DynamicFusedMoEMethod310 for MoE layers.
  • vllm_ascend/_310p/quantization/methods/w8a8_static.py
    • Modified get_perchannel_param to simplify deq_scale dtype handling.
  • vllm_ascend/_310p/quantization/modelslim_config.py
    • Updated quantization configuration and routing logic to support W8A8 dynamic quantization for FusedMoE layers.
Activity
  • The pull request introduces a new feature: W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on the Ascend 310P.
  • It involves significant code additions for new quantization methods, MLP application, and extensive unit tests to validate the new functionality.
  • Existing MoE-related files have been modified to integrate the new quantization logic, ensuring seamless operation within the vLLM Ascend framework.
  • The modelslim_config.py was updated to correctly route MoE layers to the new W8A8 dynamic quantization scheme, enabling its use in model loading and execution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. The changes include a new quantization scheme, a unified MLP implementation for 310P, and refactoring of MoE and quantization configuration logic. New e2e and unit tests are also added to verify the new functionality.

The implementation is well-structured, but I've found a few issues that need to be addressed:

  • An inconsistency in an e2e test configuration.
  • A bug in a unit test for the new dynamic quantization method.
  • A type hint mismatch in the new MoE MLP implementation.

Additionally, the pull request description is currently empty. I've provided a suggestion for the title and summary below, following the repository's style guide.

Suggested PR Title:

[MoE][Feature] Support W8A8 quantization for MoE on 310P

Suggested PR Summary:

### What this PR does / why we need it?
This PR introduces support for W8A8 dynamic quantization for Mixture-of-Experts (MoE) models on Ascend 310P devices. This is achieved by:
- Implementing a new quantization scheme `AscendW8A8DynamicFusedMoEMethod310`.
- Adding a unified MLP implementation (`unified_apply_mlp`) for 310P that handles both quantized and unquantized paths.
- Refactoring the MoE and quantization configuration logic to correctly route to the new 310P-specific implementations.
- Adding new e2e and unit tests to verify the functionality of MoE W8A8 quantization.

### Does this PR introduce _any_ user-facing change?
Yes, users can now run MoE models with W8A8 quantization on 310P by setting `quantization="ascend"`.

### How was this patch tested?
- Added a new e2e test `test_qwen3_moe_tp4_w8a8` to test MoE W8A8 quantization in a multi-card setup.
- Added several new unit tests for the 310P-specific MoE components, including `experts_selector`, `fused_moe`, `moe_comm_method`, `moe_mlp`, and the new `w8a8_dynamic` quantization method.

Comment thread tests/e2e/310p/multicard/test_moe_model_multicard.py
Comment thread tests/ut/_310p/quantization/test_w8a8_dynamic_310.py Outdated
Comment thread vllm_ascend/_310p/fused_moe/moe_mlp.py Outdated
Signed-off-by: pu-zhe <zpuaa@outlook.com>
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: pu-zhe <zpuaa@outlook.com>
@pu-zhe pu-zhe requested a review from wangxiyuan February 10, 2026 08:03
@wangxiyuan wangxiyuan merged commit 02886e2 into vllm-project:main Feb 10, 2026
25 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 11, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend:
  [Feat] 310P support MoE W8A8 quantization (vllm-project#6641)
  [TEST]add a qwen3-30b acc case with mooncake mempool (vllm-project#6244)
  [MOE Refactor] Remove QuantType in prepare_finalize.py (vllm-project#6534)
  [EPLB] Avoiding eplb's dependency on a specified model (vllm-project#6528)
  [Doc][Misc] Restructure tutorial documentation (vllm-project#6501)
  implement batch invariant with ascendc (vllm-project#6590)
  [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (vllm-project#6629)
  [Misc] upgrade to vllm main (vllm-project#6646)
  [main][Docs] Fix spelling errors across documentation (vllm-project#6649)
  [bugfix]Fix no attribute 'data' when MLAPO is enable  (vllm-project#6601)
  [DOC]Add Memcache Usage Guide (vllm-project#6476)
  [main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (vllm-project#6606)
  [Test][LoRA] Add e2e test for base model inference (vllm-project#6624)
  [refactor]Optimized the kvcache usage of Deepseek v3.2 (vllm-project#6610)
  [Feat](sfa,dcp) support dcp for sfa (vllm-project#6563)
  [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (vllm-project#6581)
  [fix bug] fix tensor mismatch bug in sigmoid operate test case (vllm-project#6619)
  [Kernel]: Optimize DispatchFFNCombine performance (vllm-project#6468)
  [MISC] Clean up useless env USE_OPTIMIZED_MODEL (vllm-project#6618)
mikequan0425 pushed a commit to taoyao1221/vllm-ascend that referenced this pull request Feb 11, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
@wangxiyuan wangxiyuan mentioned this pull request Feb 24, 2026
banxiaduhuo pushed a commit to banxiaduhuo/vllm-ascend that referenced this pull request Feb 26, 2026
@pu-zhe pu-zhe deleted the moe_w8a8 branch February 28, 2026 02:48
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026