add mxfp8 moe quantization #6670
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes
Hello @Eric-dot, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request introduces comprehensive support for MXFP8 quantization within the Mixture-of-Experts (MoE) framework, specifically targeting Ascend A5 devices. It refactors the quantization logic to allow for device-specific dispatch of quantized MLP operations and centralizes the management of quantization parameters. These changes enable more efficient execution of MoE models by utilizing mixed-precision formats for weights and activations, aiming to reduce memory footprint and improve performance.
Code Review
This pull request introduces a new W8A8_MXFP8 dynamic fused MoE quantization method for Ascend devices, specifically targeting A5 hardware. Key changes include refactoring the QuantType enum, adding a device-specific quant_apply_mlp_A5 function, and centralizing the dispatch of quantized MLP operations through a DeviceOperator class. A new quant_parser.py file is added to define and parse A5-specific quantization parameters. The fused_experts method in moe_comm_method.py and fused_moe.py is updated to accept and pass these new quantization parameters, enabling FP8 communication for MoE, though FP8 communication for MC2 operations is temporarily disabled due to accuracy issues. Review comments highlight a potential issue where the unified_apply_mlp function directly imports and calls DeviceOperator.quant_apply_mlp, creating a hard dependency that might bypass the intended device-specific dispatch mechanism, and also point out a redundant assertion in fused_experts.
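To make the dispatch idea concrete, here is a minimal sketch (not the PR's actual code) of how a DeviceOperator-style adaptor could route the quantized MLP to an A5-specific or a default implementation; everything except the DeviceOperator and quant_apply_mlp names is an assumption for illustration.

```python
import torch


def _is_a5_device() -> bool:
    # Placeholder for the real SoC-version check used by vllm-ascend.
    return False


def _quant_apply_mlp_default(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
    # Stand-in for the existing (non-A5) quantized MLP path.
    return hidden_states


def _quant_apply_mlp_a5(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
    # Stand-in for the new A5 MXFP8 quantized MLP path.
    return hidden_states


class DeviceOperator:
    """Selects the device-specific quantized MLP implementation once, up front."""
    quant_apply_mlp = staticmethod(
        _quant_apply_mlp_a5 if _is_a5_device() else _quant_apply_mlp_default
    )


# Callers go through the adaptor instead of branching on the device themselves.
out = DeviceOperator.quant_apply_mlp(torch.randn(4, 8))
```

The point of routing every call through the adaptor is that callers such as unified_apply_mlp never need to know which device they are running on.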
    topk_scales=topk_scales,
    need_trans=need_trans,
)

def unified_apply_mlp(hidden_states: torch.Tensor,
The docstring mentions that the quant path is dispatched by DeviceOperator (A5 vs non-A5). However, the code directly imports and calls DeviceOperator.quant_apply_mlp. This creates a hard dependency and bypasses the intended dispatch mechanism. This could lead to incorrect behavior if the code is run on a device where DeviceOperator is not correctly configured. It's crucial to ensure that the dispatch mechanism is correctly used to select the appropriate implementation based on the device.
    # Pass the A5-specific quantization parameters through uniformly (non-A5 devices simply ignore them)
    act_quant_type=act_quant_type,
    weight_quant_type=weight_quant_type,
    scale_type=scale_type,
    per_token_scale_type=per_token_scale_type,
    use_bf16=kwargs.get("use_bf16", True),

use_fp8_comm = False
act_quant_type, weight_quant_type, \
    scale_type, per_token_scale_type, round_mode = parse_a5_quant_params(**kwargs)
assert moe_comm_method is not None, "Missing communication context"
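For readers unfamiliar with the new quant_parser.py, a hedged sketch of what a parse_a5_quant_params helper like the one called above might look like follows; the kwarg names match the call sites shown in the diff, but the types and defaults are assumptions, not the PR's definitions.

```python
from typing import Any, Optional, Tuple


def parse_a5_quant_params(**kwargs: Any) -> Tuple[Optional[str], Optional[str],
                                                  Optional[str], Optional[str],
                                                  Optional[str]]:
    """Pull the A5-specific quantization parameters out of kwargs.

    Non-A5 callers simply never pass these keys, so every value falls back to
    None here (illustrative defaults only).
    """
    act_quant_type = kwargs.get("act_quant_type")
    weight_quant_type = kwargs.get("weight_quant_type")
    scale_type = kwargs.get("scale_type")
    per_token_scale_type = kwargs.get("per_token_scale_type")
    round_mode = kwargs.get("round_mode")
    return (act_quant_type, weight_quant_type, scale_type,
            per_token_scale_type, round_mode)
```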
from vllm_ascend.ops.fused_moe.experts_selector import select_experts, zero_experts_compute
from vllm_ascend.ops.fused_moe.moe_comm_method import AllGatherCommImpl, FusedExpertsResult, setup_moe_comm_method
from vllm_ascend.quantization.methods.base import QuantType
from vllm_ascend.ops.fused_moe.prepare_finalize import QuantType
There's no need to import QuantType from prepare_finalize, as it is already being imported there from vllm_ascend.quantization.methods.base. It's better to import it directly from the base module to avoid an indirect import.
The variable is already initialized, so there is no need to call this again.
The variable is already initialized, so there is no need to call this again.
quant_method = self.quant_method
if not hasattr(quant_method, "quant_method") or quant_method.quant_method is None:
    return QuantType.NONE

if method is not None:
    quant_type = getattr(method, "quant_type", QuantType.NONE)
method = quant_method.quant_method

return quant_type
if hasattr(method, "quant_type"):
    from vllm_ascend.quantization.methods.base import QuantType as SchemeQuantType

    scheme_quant_type = method.quant_type
    if scheme_quant_type == SchemeQuantType.W8A8:
        return QuantType.W8A8
    elif scheme_quant_type == SchemeQuantType.W4A8:
        return QuantType.W4A8

return QuantType.NONE
Since QuantType only contains NONE, W8A8, and W4A8, the original implementation of _get_quant_type is logically equivalent and much more concise. Let's stick with the original implementation to keep the code clean.
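For reference, a minimal sketch of the more concise shape the reviewer is pointing back to; it is a method on the MoE layer class and assumes the scheme's quant_type attribute is already a value of the same QuantType enum, as the imports above suggest.

```python
from vllm_ascend.quantization.methods.base import QuantType


def _get_quant_type(self) -> QuantType:
    # Sketch only: relies on the attribute names visible in the diff above.
    quant_method = self.quant_method
    if not hasattr(quant_method, "quant_method") or quant_method.quant_method is None:
        return QuantType.NONE
    method = quant_method.quant_method
    # QuantType only defines NONE, W8A8 and W4A8, so the scheme's own value
    # can be returned directly, falling back to NONE when it is absent.
    return getattr(method, "quant_type", QuantType.NONE)
```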
# shared_experts=shared_experts,
# quantized_x_for_share=quantized_x_for_share,
# dynamic_scale_for_share=dynamic_scale_for_share,
Please just delete the commented-out code.
# to save npu memory because they're no longer used.
dispose_tensor(unquantized_hidden_states)
else:
    if dynamic_scale.ndim == 2:  # w8a8mx GMM quantization needs a 3-D scale input, so split a trailing axis of 2 off the last dimension
Please use English for all comments.
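For clarity, here is a minimal sketch of the reshape that the (now translated) comment above describes; the tensor shapes are made up for illustration, and only the "split a trailing axis of 2 off the last dimension" step mirrors the diff.

```python
import torch

# Stand-in for the per-token dynamic scale produced by the quantization step.
dynamic_scale = torch.rand(16, 8)  # (num_tokens, scale_cols), illustrative shape

# The w8a8mx grouped-matmul kernel expects a 3-D scale, so a 2-D scale is
# reshaped by splitting a trailing axis of size 2 off its last dimension.
if dynamic_scale.ndim == 2:
    dynamic_scale = dynamic_scale.reshape(*dynamic_scale.shape[:-1], -1, 2)

print(dynamic_scale.shape)  # torch.Size([16, 4, 2])
```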
It looks like the DCO check failed. Please follow this guide to fix it: https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html
Signed-off-by: fangrongcan <17343701736@163.com> add mxfp8 moe quantization Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: fangrongcan <17343701736@163.com>
Some modifications should be done in token_dispatcher.py:
- The token_combine function is wrongly deleted, which must lead to an instantiation error with the abstract class and function (see the sketch after this list).
- Some useless and vague branches should be merged with the original implementation instead of adding extra code.
- It is not necessary to keep the fp8_comm flag, since we enable communication with quant by default in moe_dispatch.
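To illustrate the first point: if token_combine is declared abstract on the base dispatcher, any concrete dispatcher that no longer implements it cannot be instantiated. The class names below are placeholders, not the real token_dispatcher.py classes.

```python
from abc import ABC, abstractmethod


class BaseTokenDispatcher(ABC):
    @abstractmethod
    def token_dispatch(self, hidden_states): ...

    @abstractmethod
    def token_combine(self, hidden_states): ...


class MyTokenDispatcher(BaseTokenDispatcher):
    def token_dispatch(self, hidden_states):
        return hidden_states
    # token_combine was deleted, so the subclass is still abstract.


MyTokenDispatcher()
# TypeError: Can't instantiate abstract class MyTokenDispatcher
# with abstract method token_combine
```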
Signed-off-by: linfeng-yuan <1102311262@qq.com>
…abled Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
…heck Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
    dynamic_eplb: bool = False,
    **kwargs,
) -> torch.Tensor:
    adaptor_cls = DeviceOperator
Use DeviceOperator directly.
We used it at the beginning, but it led to a mypy lint error. I will fix it at once.
@wangxiyuan We've narrowed the DeviceOperator type to pass mypy. Please have a quick check of the update~
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it? support mxfp8 quantization (Qwen MOE ) Using adaptor to make the hardware-specific behavior clearer and more maintainable ### How was this patch tested? - vLLM version: v0.15.0 - vLLM main: vllm-project/vllm@1339784 --------- Signed-off-by: fangrongcan <17343701736@163.com> Signed-off-by: wangyao-i <iwangyao@outlook.com> Signed-off-by: linfeng-yuan <1102311262@qq.com> Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com> Co-authored-by: fangrongcan <f00876277@china.huawei.com> Co-authored-by: wangyao-i <iwangyao@outlook.com> Co-authored-by: linfeng-yuan <1102311262@qq.com>
- ✅ **Review Quality:** He has completed [50+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan) since April 2025, covering graph mode, MoE, quantization, model support, and performance-related changes. In addition to regular review work, he has also participated in complex feature development and review, such as [#6670](#6670) (MoE MXFP8 quantization), where he helped with A5 MXFP8 integration, compatibility cleanup, dispatch updates, and implementation fixes. - ✅ **Sustained Contributions:** He has [60+ merged PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan) since April 2025, with continuous activity across major release cycles. - ✅ **Quality Contributions:** **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):** He was the Feature Owner for DeepSeek high-throughput inference under torchair graph mode and the Wide-EP project. He drove graph mode performance optimization ([#731](#731)), landed super-kernel fusion for quantized DSR1 ([#3485](#3485)), and added initial MoE support for Model Runner v2 ([#7922](#7922)). **Ascend950 (A5) — Feature Owner:** He authored the [RFC roadmap (#7157)](#7157) for A5 support, landed initial build support ([#7151](#7151)), co-authored MXFP8 and MXFP4 quantization support for A5 ([#6670](#6670), [#7877](#7877)), and fixed the MXFP8 scale normalization issue that unblocked A5 quantized inference ([#7573](#7573)). **DeepSeek Low-Latency & Post-Processing:** He improved DSv3.2 performance by eliminating HD synchronization ([#4805](#4805)), improved rejection sampler performance and eliminated D2H sync in TopKTopPSampler ([#4154](#4154)), and added a penalty-related Triton kernel for sampling performance ([#7794](#7794)). - ✅ **Community Involvement:** He led a 2-part torchair modeling refactor ([#2384](#2384), [#2459](#2459)) and deleted ~2K lines of redundant DeepSeek modeling code as upstream absorbed the changes ([#2849](#2849)). He also replaced scattered business kwargs with typed request objects across MoE stage boundaries ([#7024](#7024)). Since March 2026, he has taken part in issue triage and user support, responding to [30+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01) covering graph mode failures, quantization accuracy regressions, MoE deployment problems, and multi-node communication issues. - vLLM version: v0.19.1 - vLLM main: vllm-project/vllm@4d51588 Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
What this PR does / why we need it?
Support MXFP8 quantization (Qwen MoE).
Use an adaptor to make the hardware-specific behavior clearer and more maintainable.
How was this patch tested?