add mxfp8 moe quantization #6670

Merged
whx-sjtu merged 24 commits into vllm-project:main from Eric-dot:main
Mar 2, 2026

Conversation

Contributor

@Eric-dot Eric-dot commented Feb 10, 2026

What this PR does / why we need it?

Support MXFP8 quantization (Qwen MoE).
Use an adaptor to make the hardware-specific behavior clearer and more maintainable.

How was this patch tested?

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description when writing the commit message, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Eric-dot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for MXFP8 quantization within the Mixture-of-Experts (MoE) framework, specifically targeting Ascend A5 devices. It refactors the quantization logic to allow for device-specific dispatch of quantized MLP operations and centralizes the management of quantization parameters. These changes enable more efficient execution of MoE models by utilizing mixed-precision formats for weights and activations, aiming to reduce memory footprint and improve performance.

Highlights

  • MXFP8 MoE Quantization: Introduced a new W8A8 MXFP8 dynamic quantization scheme specifically for Mixture-of-Experts (MoE) layers on Ascend devices.
  • A5-Specific MLP Quantization: Added a dedicated quant_apply_mlp_A5 function and integrated it via a DeviceOperator for handling A5-specific quantized MLP operations, centralizing the dispatch logic (see the sketch after this list).
  • Centralized Quantization Parameter Parsing: Created a new quant_parser.py module to manage and parse A5-specific quantization configurations and types, improving modularity and maintainability.
  • MoE Forward Pass Integration: Updated the fused_experts method to leverage the new A5 quantization scheme, passing relevant parameters for mixed-precision computation and communication within the MoE framework.
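
For orientation, here is a minimal sketch of the adaptor-based dispatch described above. The quant_apply_mlp, A5DeviceAdaptor, and DeviceOperator names come from this PR's changelog; the class bodies and the platform check below are illustrative assumptions, not the actual implementation in vllm_ascend/device/device_op.py.

```python
# Hedged sketch of the adaptor-based dispatch; the method bodies and the
# platform check are assumptions, not the code that landed in device_op.py.
import torch


class BaseDeviceAdaptor:
    """Default (non-A5) behavior."""

    @staticmethod
    def quant_apply_mlp(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # The generic quantized MLP path would run here.
        return hidden_states


class A5DeviceAdaptor(BaseDeviceAdaptor):
    """A5-specific behavior: route to the MXFP8 mixed-precision MLP."""

    @staticmethod
    def quant_apply_mlp(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # quant_apply_mlp_A5 from moe_mlp.py would be invoked here.
        return hidden_states


def _running_on_a5() -> bool:
    # Placeholder for the real SoC/platform check.
    return False


# Callers such as unified_apply_mlp dispatch through DeviceOperator instead of
# branching on the device themselves.
DeviceOperator = A5DeviceAdaptor if _running_on_a5() else BaseDeviceAdaptor
```

With this shape, unified_apply_mlp only needs DeviceOperator.quant_apply_mlp(...), and the hardware-specific behavior stays inside the adaptor.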


Changelog
  • vllm_ascend/device/device_op.py
    • Added quant_apply_mlp static method to A5DeviceAdaptor for dispatching quantized MLP operations.
  • vllm_ascend/ops/fused_moe/fused_moe.py
    • Refactored QuantType import to a new location.
    • Enhanced quantization type detection logic within _get_quant_type.
    • Updated fused_experts to support A5-specific quantization and FP8 communication, including parsing new parameters.
  • vllm_ascend/ops/fused_moe/moe_comm_method.py
    • Updated imports for quantization types and parse_a5_quant_params.
    • Extended fused_experts to accept and pass A5-specific quantization arguments to unified_apply_mlp.
  • vllm_ascend/ops/fused_moe/moe_mlp.py
    • Introduced a new A5-specific quantized MLP function (quant_apply_mlp_A5) for mixed-precision operations.
    • Added a null check for weight_prefetch_method before calling post-processing.
    • Refactored unified_apply_mlp to use the DeviceOperator for dispatching quantized MLP operations.
  • vllm_ascend/quantization/methods/w8a8_mxfp8.py
    • Implemented a new MoE quantization scheme (AscendW8A8MXFP8DynamicFusedMoEMethod) for W8A8 MXFP8.
    • Defined how weights and scales are handled and how the quantized MoE forward pass is executed, including expert selection and passing A5-specific quantization parameters.
  • vllm_ascend/quantization/quant_parser.py
    • Added a new module to centralize quantization type mappings and provide utility functions for parsing A5-specific quantization parameters.
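
To illustrate the last changelog entry, here is a hedged sketch of what parse_a5_quant_params in quant_parser.py might look like. The function name and the five returned fields appear in the diff excerpts quoted later in this review; the mapping table and default values below are assumptions.

```python
# Hedged sketch of vllm_ascend/quantization/quant_parser.py. Only the function
# name and the five returned fields come from the PR diff; the concrete
# mapping and defaults below are illustrative assumptions.
from typing import Any, Tuple

# Assumed mapping from a quant-config key to the A5 kernel parameters.
A5_QUANT_TYPE_MAP = {
    "w8a8_mxfp8": {
        "act_quant_type": "mxfp8",
        "weight_quant_type": "mxfp8",
        "scale_type": "e8m0",
        "per_token_scale_type": "fp32",
        "round_mode": "rint",
    },
}


def parse_a5_quant_params(**kwargs: Any) -> Tuple[str, str, str, str, str]:
    """Resolve the A5-specific quantization parameters from MoE kwargs."""
    cfg = A5_QUANT_TYPE_MAP.get(kwargs.get("quant_type", "w8a8_mxfp8"),
                                A5_QUANT_TYPE_MAP["w8a8_mxfp8"])
    return (cfg["act_quant_type"], cfg["weight_quant_type"], cfg["scale_type"],
            cfg["per_token_scale_type"], cfg["round_mode"])
```

A caller can then unpack the tuple in one line, as the snippets quoted later in this review do.
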
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new W8A8_MXFP8 dynamic fused MoE quantization method for Ascend devices, specifically targeting A5 hardware. Key changes include refactoring the QuantType enum, adding a device-specific quant_apply_mlp_A5 function, and centralizing the dispatch of quantized MLP operations through a DeviceOperator class. A new quant_parser.py file is added to define and parse A5-specific quantization parameters. The fused_experts method in moe_comm_method.py and fused_moe.py is updated to accept and pass these new quantization parameters, enabling FP8 communication for MoE, though FP8 communication for MC2 operations is temporarily disabled due to accuracy issues. Review comments highlight a potential issue where the unified_apply_mlp function directly imports and calls DeviceOperator.quant_apply_mlp, creating a hard dependency that might bypass the intended device-specific dispatch mechanism, and also point out a redundant assertion in fused_experts.

Comment thread vllm_ascend/ops/fused_moe/moe_mlp.py Outdated
topk_scales=topk_scales,
need_trans=need_trans,
)
def unified_apply_mlp(hidden_states: torch.Tensor,
Contributor

critical

The docstring mentions that the quant path is dispatched by DeviceOperator (A5 vs non-A5). However, the code directly imports and calls DeviceOperator.quant_apply_mlp. This creates a hard dependency and bypasses the intended dispatch mechanism. This could lead to incorrect behavior if the code is run on a device where DeviceOperator is not correctly configured. It's crucial to ensure that the dispatch mechanism is correctly used to select the appropriate implementation based on the device.
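
To make the concern concrete, the sketch below shows a late-bound lookup in place of a hard import; register_adaptor and get_device_adaptor are hypothetical names used only for illustration, not existing vllm-ascend APIs.

```python
# Hedged illustration of the review concern: resolve the adaptor when the MLP
# is applied rather than binding a specific class at import time. The registry
# and helper names here are hypothetical.
_ADAPTORS = {}  # device name -> adaptor class, filled in by each backend


def register_adaptor(device: str):
    def _wrap(cls):
        _ADAPTORS[device] = cls
        return cls
    return _wrap


@register_adaptor("a5")
class _A5Adaptor:
    @staticmethod
    def quant_apply_mlp(hidden_states, **kwargs):
        return hidden_states  # A5 mixed-precision path would go here


def get_device_adaptor(device: str = "a5"):
    # Falling back (or raising a clear error) here keeps unified_apply_mlp from
    # silently using an A5-only implementation on other devices.
    return _ADAPTORS[device]
```

The point is only that the adaptor is resolved when the MLP is applied, so a non-A5 or mis-configured environment gets an explicit fallback or error instead of silently using the A5-only path.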

        # Pass through the quantization parameters required by A5 (non-A5 devices can ignore them)
        act_quant_type=act_quant_type,
        weight_quant_type=weight_quant_type,
        scale_type=scale_type,
        per_token_scale_type=per_token_scale_type,
        use_bf16=kwargs.get("use_bf16", True),

use_fp8_comm = False
act_quant_type, weight_quant_type, \
scale_type, per_token_scale_type, round_mode = parse_a5_quant_params(**kwargs)
assert moe_comm_method is not None, "Missing communication context"
Contributor

high

The assertion assert moe_comm_method is not None, "Missing communication context" is repeated on line 139. Consider removing this redundant check.

Comment thread vllm_ascend/ops/fused_moe/fused_moe.py Outdated
from vllm_ascend.ops.fused_moe.experts_selector import select_experts, zero_experts_compute
from vllm_ascend.ops.fused_moe.moe_comm_method import AllGatherCommImpl, FusedExpertsResult, setup_moe_comm_method
from vllm_ascend.quantization.methods.base import QuantType
from vllm_ascend.ops.fused_moe.prepare_finalize import QuantType
Contributor

There's no need to import QuantType from prepare_finalize, as it is already being imported there from vllm_ascend.quantization.methods.base. It's better to import it directly from the base module to avoid an indirect import.


The variable is already initialized, hence there is no need to call it.


Comment thread vllm_ascend/ops/fused_moe/fused_moe.py Outdated
Comment on lines +238 to +253
quant_method = self.quant_method
if not hasattr(quant_method, "quant_method") or quant_method.quant_method is None:
return QuantType.NONE

if method is not None:
quant_type = getattr(method, "quant_type", QuantType.NONE)
method = quant_method.quant_method

return quant_type
if hasattr(method, "quant_type"):
from vllm_ascend.quantization.methods.base import QuantType as SchemeQuantType

scheme_quant_type = method.quant_type
if scheme_quant_type == SchemeQuantType.W8A8:
return QuantType.W8A8
elif scheme_quant_type == SchemeQuantType.W4A8:
return QuantType.W4A8

return QuantType.NONE
Contributor

Since QuantType only contains NONE, W8A8, and W4A8, the original implementation of _get_quant_type is logically equivalent and much more concise. Let's stick with the original implementation to keep the code clean.
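
For reference, the concise shape the reviewer is pointing back to could look roughly like the following; the attribute chain mirrors the diff above, but the exact original body is an assumption, and the enum stub stands in for vllm_ascend.quantization.methods.base.QuantType.

```python
# Hedged sketch of the concise _get_quant_type the reviewer refers to; the
# enum below is a stand-in for the real QuantType, and the exact original
# method body is an assumption.
from enum import Enum


class QuantType(Enum):  # stand-in for the real QuantType (NONE / W8A8 / W4A8)
    NONE = "none"
    W8A8 = "w8a8"
    W4A8 = "w4a8"


def _get_quant_type(self) -> QuantType:  # method of the fused MoE layer
    quant_method = self.quant_method
    if not hasattr(quant_method, "quant_method") or quant_method.quant_method is None:
        return QuantType.NONE
    method = quant_method.quant_method
    # Because QuantType only defines NONE, W8A8 and W4A8, the scheme's own
    # quant_type can be returned directly instead of re-mapping each member.
    return getattr(method, "quant_type", QuantType.NONE)
```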

Comment on lines +161 to +163
# shared_experts=shared_experts,
# quantized_x_for_share=quantized_x_for_share,
# dynamic_scale_for_share=dynamic_scale_for_share,
Contributor

Please just delete the commented-out code.

Comment thread vllm_ascend/ops/fused_moe/moe_mlp.py Outdated
# to save npu memory because they're no longer used.
dispose_tensor(unquantized_hidden_states)
else:
if dynamic_scale.ndim == 2:  # w8a8mx GMM quantization requires a 3-D scale input, so split a trailing axis of 2 out of the last dimension
Contributor

Please use English for all comments.
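
As for the 3-D scale requirement mentioned in the hunk above, a minimal sketch of the reshape with made-up shapes is shown below; the exact layout the w8a8mx grouped matmul expects is an assumption here.

```python
# Hedged sketch of the reshape described in the quoted hunk: split a trailing
# axis of size 2 out of the last dimension so the grouped matmul receives a
# 3-D scale. Shapes are illustrative only.
import torch

num_tokens, num_scale_blocks = 8, 64          # made-up sizes
dynamic_scale = torch.rand(num_tokens, num_scale_blocks)

if dynamic_scale.ndim == 2:
    # (num_tokens, num_scale_blocks) -> (num_tokens, num_scale_blocks // 2, 2)
    dynamic_scale = dynamic_scale.view(num_tokens, num_scale_blocks // 2, 2)

assert dynamic_scale.ndim == 3
```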

@SlightwindSec
Contributor

It looks like the DCO check failed. Please follow this guide to fix it: https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/contribution/index.html

fangrongcan and others added 3 commits February 11, 2026 21:48
Signed-off-by: fangrongcan <17343701736@163.com>

add mxfp8 moe quantization

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: fangrongcan <17343701736@163.com>

cleancode

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: fangrongcan <17343701736@163.com>
Collaborator

@linfeng-yuan linfeng-yuan left a comment


Some modifications should be made in token_dispatcher.py:

  1. The token_combine function was wrongly deleted; because it is declared on an abstract class, this will cause an instantiation error (see the sketch after this list).
  2. Some useless and vague branches should be merged into the original implementation instead of adding extra code.
  3. It is not necessary to keep the fp8_comm flag, since we enable quantized communication by default in moe_dispatch.
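
As a generic Python illustration of point 1 (class names below are placeholders, not the real token_dispatcher classes):

```python
# Generic illustration of point 1: if a subclass stops overriding an abstract
# method such as token_combine, the subclass can no longer be instantiated.
from abc import ABC, abstractmethod


class TokenDispatcherBase(ABC):
    @abstractmethod
    def token_dispatch(self, tokens): ...

    @abstractmethod
    def token_combine(self, tokens): ...


class MyDispatcher(TokenDispatcherBase):
    def token_dispatch(self, tokens):
        return tokens
    # token_combine missing -> MyDispatcher remains abstract.


try:
    MyDispatcher()
except TypeError as exc:
    # e.g. "Can't instantiate abstract class MyDispatcher ... token_combine"
    print(exc)
```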

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…abled

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@github-actions
Contributor

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
…heck

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
linfeng-yuan and others added 4 commits February 26, 2026 20:18
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Feb 27, 2026
Comment thread vllm_ascend/ops/fused_moe/moe_mlp.py Outdated
dynamic_eplb: bool = False,
**kwargs,
) -> torch.Tensor:
adaptor_cls = DeviceOperator
Collaborator

use DeviceOperator directly.

Collaborator

We used it at the beginning, but it led to a mypy lint error. I will fix it at once.

Collaborator

@wangxiyuan We've narrowed the DeviceOperator type to pass mypy. Please have a quick check of the update~
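
For context, "narrowing" here usually means giving mypy one static type for the conditionally chosen adaptor, roughly as sketched below; the annotation and names are illustrative and may differ from the change that landed in this PR.

```python
# Hedged sketch of one way to narrow a conditionally assigned class so mypy
# accepts static-method access on it; names and the exact annotation are
# illustrative, not necessarily the fix that landed.
from typing import Type


class _AdaptorBase:
    @staticmethod
    def quant_apply_mlp(hidden_states, **kwargs):
        return hidden_states


class _A5Adaptor(_AdaptorBase):
    pass


_IS_A5 = False  # placeholder for the real platform check

# Without the annotation mypy may infer a vague type for the conditional
# assignment; Type[_AdaptorBase] gives it one concrete interface to check.
DeviceOperator: Type[_AdaptorBase] = _A5Adaptor if _IS_A5 else _AdaptorBase
```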

@Potabk Potabk added the accuracy-test (enable all accuracy test for PR) label on Feb 27, 2026
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan self-requested a review February 27, 2026 13:44
@whx-sjtu whx-sjtu merged commit 3c66a97 into vllm-project:main Mar 2, 2026
39 of 44 checks passed
wangyao-i added a commit to wangyao-i/vllm-ascend that referenced this pull request Mar 2, 2026
### What this PR does / why we need it?
support mxfp8 quantization (Qwen MOE )
Using adaptor to make the hardware-specific behavior clearer and more
maintainable
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
vllm-project/vllm@1339784

---------

Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
linfeng-yuan pushed a commit that referenced this pull request May 9, 2026
- ✅ **Review Quality:**
He has completed [50+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan)
since April 2025, covering graph mode, MoE, quantization, model support,
and performance-related changes.

In addition to regular review work, he has also participated in complex
feature development and review, such as
[#6670](#6670) (MoE
MXFP8 quantization), where he helped with A5 MXFP8 integration,
compatibility cleanup, dispatch updates, and implementation fixes.

- ✅ **Sustained Contributions:**
He has [60+ merged
PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan)
since April 2025, with continuous activity across major release cycles.

- ✅ **Quality Contributions:**

  **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):**
He was the Feature Owner for DeepSeek high-throughput inference under
torchair graph mode and the Wide-EP project. He drove graph mode
performance optimization
([#731](#731)), landed
super-kernel fusion for quantized DSR1
([#3485](#3485)), and
added initial MoE support for Model Runner v2
([#7922](#7922)).

  **Ascend950 (A5) — Feature Owner:**
He authored the [RFC roadmap
(#7157)](#7157) for A5
support, landed initial build support
([#7151](#7151)),
co-authored MXFP8 and MXFP4 quantization support for A5
([#6670](#6670),
[#7877](#7877)), and
fixed the MXFP8 scale normalization issue that unblocked A5 quantized
inference
([#7573](#7573)).

  **DeepSeek Low-Latency & Post-Processing:**
He improved DSv3.2 performance by eliminating HD synchronization
([#4805](#4805)),
improved rejection sampler performance and eliminated D2H sync in
TopKTopPSampler
([#4154](#4154)), and
added a penalty-related Triton kernel for sampling performance
([#7794](#7794)).

- ✅ **Community Involvement:**
He led a 2-part torchair modeling refactor
([#2384](#2384),
[#2459](#2459)) and
deleted ~2K lines of redundant DeepSeek modeling code as upstream
absorbed the changes
([#2849](#2849)). He
also replaced scattered business kwargs with typed request objects
across MoE stage boundaries
([#7024](#7024)).

Since March 2026, he has taken part in issue triage and user support,
responding to [30+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01)
covering graph mode failures, quantization accuracy regressions, MoE
deployment problems, and multi-node communication issues.

- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@4d51588

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
SOMEONEUNSEEN pushed a commit to SOMEONEUNSEEN/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026

Labels

accuracy-test (enable all accuracy test for PR), module:ops, module:quantization, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants