add mxfp8 moe quantization by Eric-dot · Pull Request #6670 · vllm-project/vllm-ascend

Eric-dot · 2026-02-10T13:36:12Z

What this PR does / why we need it?

support mxfp8 quantization (Qwen MOE )
Using adaptor to make the hardware-specific behavior clearer and more maintainable

How was this patch tested?

vLLM version: v0.15.0
vLLM main: vllm-project/vllm@1339784

github-actions · 2026-02-10T13:36:28Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

gemini-code-assist · 2026-02-10T13:36:40Z

Summary of Changes

Hello @Eric-dot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for MXFP8 quantization within the Mixture-of-Experts (MoE) framework, specifically targeting Ascend A5 devices. It refactors the quantization logic to allow for device-specific dispatch of quantized MLP operations and centralizes the management of quantization parameters. These changes enable more efficient execution of MoE models by utilizing mixed-precision formats for weights and activations, aiming to reduce memory footprint and improve performance.

Highlights

MXFP8 MoE Quantization: Introduced a new W8A8 MXFP8 dynamic quantization scheme specifically for Mixture-of-Experts (MoE) layers on Ascend devices.
A5-Specific MLP Quantization: Added a dedicated quant_apply_mlp_A5 function and integrated it via a DeviceOperator for handling A5-specific quantized MLP operations, centralizing the dispatch logic.
Centralized Quantization Parameter Parsing: Created a new quant_parser.py module to manage and parse A5-specific quantization configurations and types, improving modularity and maintainability.
MoE Forward Pass Integration: Updated the fused_experts method to leverage the new A5 quantization scheme, passing relevant parameters for mixed-precision computation and communication within the MoE framework.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

vllm_ascend/device/device_op.py
- Added quant_apply_mlp static method to A5DeviceAdaptor for dispatching quantized MLP operations.
vllm_ascend/ops/fused_moe/fused_moe.py
- Refactored QuantType import to a new location.
- Enhanced quantization type detection logic within _get_quant_type.
- Updated fused_experts to support A5-specific quantization and FP8 communication, including parsing new parameters.
vllm_ascend/ops/fused_moe/moe_comm_method.py
- Updated imports for quantization types and parse_a5_quant_params.
- Extended fused_experts to accept and pass A5-specific quantization arguments to unified_apply_mlp.
vllm_ascend/ops/fused_moe/moe_mlp.py
- Introduced a new A5-specific quantized MLP function (quant_apply_mlp_A5) for mixed-precision operations.
- Added a null check for weight_prefetch_method before calling post-processing.
- Refactored unified_apply_mlp to use the DeviceOperator for dispatching quantized MLP operations.
vllm_ascend/quantization/methods/w8a8_mxfp8.py
- Implemented a new MoE quantization scheme (AscendW8A8MXFP8DynamicFusedMoEMethod) for W8A8 MXFP8.
- Defined how weights and scales are handled and how the quantized MoE forward pass is executed, including expert selection and passing A5-specific quantization parameters.
vllm_ascend/quantization/quant_parser.py
- Added a new module to centralize quantization type mappings and provide utility functions for parsing A5-specific quantization parameters.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new W8A8_MXFP8 dynamic fused MoE quantization method for Ascend devices, specifically targeting A5 hardware. Key changes include refactoring the QuantType enum, adding a device-specific quant_apply_mlp_A5 function, and centralizing the dispatch of quantized MLP operations through a DeviceOperator class. A new quant_parser.py file is added to define and parse A5-specific quantization parameters. The fused_experts method in moe_comm_method.py and fused_moe.py is updated to accept and pass these new quantization parameters, enabling FP8 communication for MoE, though FP8 communication for MC2 operations is temporarily disabled due to accuracy issues. Review comments highlight a potential issue where the unified_apply_mlp function directly imports and calls DeviceOperator.quant_apply_mlp, creating a hard dependency that might bypass the intended device-specific dispatch mechanism, and also point out a redundant assertion in fused_experts.

gemini-code-assist · 2026-02-10T13:40:06Z

-            topk_scales=topk_scales,
-            need_trans=need_trans,
-        )
+def unified_apply_mlp(hidden_states: torch.Tensor,


The docstring mentions that the quant path is dispatched by DeviceOperator (A5 vs non-A5). However, the code directly imports and calls DeviceOperator.quant_apply_mlp. This creates a hard dependency and bypasses the intended dispatch mechanism. This could lead to incorrect behavior if the code is run on a device where DeviceOperator is not correctly configured. It's crucial to ensure that the dispatch mechanism is correctly used to select the appropriate implementation based on the device.

# A5 需要的量化参数统一透传（非A5忽略即可） act_quant_type=act_quant_type, weight_quant_type=weight_quant_type, scale_type=scale_type, per_token_scale_type=per_token_scale_type, use_bf16=kwargs.get("use_bf16", True),

gemini-code-assist · 2026-02-10T13:40:06Z

+        use_fp8_comm = False
+        act_quant_type, weight_quant_type, \
+            scale_type, per_token_scale_type, round_mode = parse_a5_quant_params(**kwargs)
+        assert moe_comm_method is not None, "Missing communication context"


The assertion assert moe_comm_method is not None, "Missing communication context" is repeated on line 139. Consider removing this redundant check.

SlightwindSec · 2026-02-11T01:55:42Z

 from vllm_ascend.ops.fused_moe.experts_selector import select_experts, zero_experts_compute
 from vllm_ascend.ops.fused_moe.moe_comm_method import AllGatherCommImpl, FusedExpertsResult, setup_moe_comm_method
-from vllm_ascend.quantization.methods.base import QuantType
+from vllm_ascend.ops.fused_moe.prepare_finalize import QuantType


There's no need to import QuantType from prepare_finalize, as it is already being imported there from vllm_ascend.quantization.methods.base. It's better to import it directly from the base module to avoid an indirect import.

The var is already Initialized hence no need to call it

SlightwindSec · 2026-02-11T01:56:38Z

+        quant_method = self.quant_method
+        if not hasattr(quant_method, "quant_method") or quant_method.quant_method is None:
+            return QuantType.NONE

-        if method is not None:
-            quant_type = getattr(method, "quant_type", QuantType.NONE)
+        method = quant_method.quant_method

-        return quant_type
+        if hasattr(method, "quant_type"):
+            from vllm_ascend.quantization.methods.base import QuantType as SchemeQuantType
+
+            scheme_quant_type = method.quant_type
+            if scheme_quant_type == SchemeQuantType.W8A8:
+                return QuantType.W8A8
+            elif scheme_quant_type == SchemeQuantType.W4A8:
+                return QuantType.W4A8
+
+        return QuantType.NONE


Since QuantType only contains NONE, W8A8, and W4A8, the original implementation of _get_quant_type is logically equivalent and much more concise. Let's stick with the original implementation to keep the code clean.

SlightwindSec · 2026-02-11T01:59:18Z

+                # shared_experts=shared_experts,
+                # quantized_x_for_share=quantized_x_for_share,
+                # dynamic_scale_for_share=dynamic_scale_for_share,


Please just delete the commented-out code.

SlightwindSec · 2026-02-11T01:59:43Z

+        # to save npu memory because they're no longer used.
+        dispose_tensor(unquantized_hidden_states)
+    else:
+        if dynamic_scale.ndim == 2: # w8a8mx gmm量化需要输入3维scale,拆分尾轴2出来


Please use English for all comments.

SlightwindSec · 2026-02-11T10:32:36Z

It looks like the DCO check failed. Please follow this guide to fix it: https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html

Signed-off-by: fangrongcan <17343701736@163.com> add mxfp8 moe quantization Signed-off-by: fangrongcan <17343701736@163.com>

Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: fangrongcan <17343701736@163.com>

Signed-off-by: fangrongcan <17343701736@163.com> cleancode Signed-off-by: fangrongcan <17343701736@163.com>

Signed-off-by: fangrongcan <17343701736@163.com>

linfeng-yuan

Some modifications should be done in token_dispatcher.py:

token_combine function is wrongly deleted which must lead to instantiation error with abstract class and function.
Some useless and vague branches should be merged with original implementation instead of adding extra codes.
It is not necessary to keep fp8_comm flag since we defautly enable communication with quant in moe_dispatch.