
[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties #7794

Merged
yiz-liu merged 1 commit into vllm-project:releases/v0.18.0 from linfeng-yuan:0180_triton_penalty on Mar 31, 2026

Conversation

@linfeng-yuan (Collaborator) commented on Mar 28, 2026

What this PR does / why we need it?

Implement get_token_bin_counts_and_mask and apply_penalties with Triton-Ascend kernels. This significantly reduces latency of the sampling process when repetition/frequency/presence penalties are enabled.

Cherry-picked from main PR #7569.
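
For context, below is a minimal PyTorch sketch of the reference (non-Triton) penalty semantics that such kernels accelerate, following the commonly used vLLM behavior: the repetition penalty rescales logits of any token seen in the prompt or output, while the frequency and presence penalties subtract terms based on generated-token counts. The function name, tensor shapes, and arguments are illustrative assumptions, not code from this PR.

```python
import torch


def apply_penalties_reference(
    logits: torch.Tensor,               # (num_seqs, vocab_size) float
    prompt_mask: torch.Tensor,          # (num_seqs, vocab_size) bool, tokens in the prompt
    output_counts: torch.Tensor,        # (num_seqs, vocab_size) int, generated-token counts
    presence_penalties: torch.Tensor,   # (num_seqs,)
    frequency_penalties: torch.Tensor,  # (num_seqs,)
    repetition_penalties: torch.Tensor, # (num_seqs,)
) -> torch.Tensor:
    output_mask = output_counts > 0
    seen = prompt_mask | output_mask

    # Repetition penalty: shrink positive logits and push negative logits
    # further down for every token already seen in the prompt or the output.
    rep = repetition_penalties.unsqueeze(1)
    logits = torch.where(seen & (logits > 0), logits / rep, logits)
    logits = torch.where(seen & (logits <= 0), logits * rep, logits)

    # Frequency penalty scales with how often a token was generated;
    # presence penalty is a flat subtraction for any generated token.
    logits = logits - frequency_penalties.unsqueeze(1) * output_counts
    logits = logits - presence_penalties.unsqueeze(1) * output_mask.to(logits.dtype)
    return logits
```

A fused kernel can compute the per-sequence token counts and apply these element-wise updates in a single pass, which is where the latency savings described above come from.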

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI passed.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of penalty application within the vllm-ascend project by leveraging Triton-Ascend kernels. The changes introduce highly optimized implementations for token bin counting and the application of repetition, frequency, and presence penalties, directly integrating these into the AscendSampler. This optimization is crucial for improving the efficiency of sampling operations on Ascend NPUs, ensuring faster and more resource-efficient model inference.

Highlights

  • Triton-Ascend Penalty Kernels: Introduced new Triton-Ascend kernels for efficient application of repetition, frequency, and presence penalties, aiming to improve performance on Ascend NPUs.
  • Optimized Token Bincount: Implemented a 2D tiled Triton-Ascend kernel for get_token_bin_counts_and_mask, which optimizes token occurrence counting across batches and sequence lengths, particularly beneficial for prefill stages.
  • AscendSampler Integration: Integrated the new Triton-Ascend penalty application into the AscendSampler, allowing it to conditionally use these optimized kernels when Triton is available.
  • Comprehensive Testing: Added a new end-to-end test to verify the equivalence and correctness of the Triton-Ascend penalty implementation against the existing PyTorch-based vLLM implementation across various scenarios and data types.
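
To make the "Optimized Token Bincount" highlight above more concrete, here is a minimal generic-Triton sketch of a 2D-tiled per-sequence bincount (one program per sequence and per tile of positions). It is an illustrative assumption of the approach, not the PR's Triton-Ascend kernel; the names, block size, and launch grid are hypothetical.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _bincount_kernel(
    token_ids_ptr,   # (num_seqs, max_len) token ids, padded per sequence
    counts_ptr,      # (num_seqs, vocab_size) int32 output, pre-zeroed
    seq_lens_ptr,    # (num_seqs,) number of valid tokens per sequence
    max_len,
    vocab_size,
    BLOCK: tl.constexpr,
):
    # 2D launch grid: axis 0 -> sequence, axis 1 -> tile of positions.
    seq_idx = tl.program_id(0)
    tile_idx = tl.program_id(1)

    offs = tile_idx * BLOCK + tl.arange(0, BLOCK)
    seq_len = tl.load(seq_lens_ptr + seq_idx)
    mask = offs < seq_len

    tokens = tl.load(token_ids_ptr + seq_idx * max_len + offs, mask=mask, other=0)
    mask = mask & (tokens < vocab_size)  # ignore padding ids outside the vocab

    # Atomics handle repeated tokens within and across tiles.
    tl.atomic_add(counts_ptr + seq_idx * vocab_size + tokens, 1, mask=mask)


def token_bin_counts(token_ids: torch.Tensor, seq_lens: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Per-sequence token occurrence counts; a simplified stand-in for the PR's kernel."""
    num_seqs, max_len = token_ids.shape
    counts = torch.zeros(num_seqs, vocab_size, dtype=torch.int32, device=token_ids.device)
    BLOCK = 128
    grid = (num_seqs, triton.cdiv(max_len, BLOCK))
    _bincount_kernel[grid](token_ids, counts, seq_lens, max_len, vocab_size, BLOCK=BLOCK)
    return counts
```

The resulting counts (and a `counts > 0` mask) feed directly into the penalty formulas sketched earlier in the PR description.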



@linfeng-yuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Mar 28, 2026

@gemini-code-assist (bot) left a comment


Code Review

Suggested PR Title:

[Ops][Feature] Implement Triton-Ascend kernels for sampling penalties

Suggested PR Summary:

What this PR does / why we need it?
This PR implements Triton-Ascend kernels for repetition, frequency, and presence penalties, including a specialized bincount kernel. These are integrated into the `AscendSampler` to optimize NPU performance. A critical issue was found in `vllm_ascend/sample/sampler.py` where `apply_penalties` is defined as a `@staticmethod`, which will cause a `TypeError` during fallback to the base class.

Does this PR introduce any user-facing change?
Yes, it provides optimized NPU performance for sampling penalties.

How was this patch tested?
A new E2E test `test_apply_all_penalties_v1_vs_ascend` was added to verify the Triton-Ascend implementation against the vLLM v1 reference.

Comment on lines +42 to +62
    @staticmethod
    def apply_penalties(
        logits: torch.Tensor,
        sampling_metadata: SamplingMetadata,
        output_token_ids: list[list[int]],
    ) -> torch.Tensor:
        """Use Triton-Ascend penalties on NPU when Triton is available; else vLLM default."""
        if not HAS_TRITON:
            return Sampler.apply_penalties(logits, sampling_metadata, output_token_ids)

        if sampling_metadata.no_penalties:
            return logits
        assert sampling_metadata.prompt_token_ids is not None
        return apply_all_penalties(
            logits,
            sampling_metadata.prompt_token_ids,
            sampling_metadata.presence_penalties,
            sampling_metadata.frequency_penalties,
            sampling_metadata.repetition_penalties,
            output_token_ids,
        )

critical

The apply_penalties method is incorrectly defined as a @staticmethod. The base class Sampler defines it as an instance method, so this override should also be an instance method to maintain signature compatibility. The current implementation will raise a TypeError at runtime when falling back to the default implementation because self is not passed to Sampler.apply_penalties.

To fix this, you should remove the @staticmethod decorator, add self as the first argument to the method, and use super().apply_penalties(...) for the fallback to correctly call the base class method.

    def apply_penalties(
        self,
        logits: torch.Tensor,
        sampling_metadata: SamplingMetadata,
        output_token_ids: list[list[int]],
    ) -> torch.Tensor:
        """Use Triton-Ascend penalties on NPU when Triton is available; else vLLM default."""
        if not HAS_TRITON:
            return super().apply_penalties(logits, sampling_metadata, output_token_ids)

        if sampling_metadata.no_penalties:
            return logits
        assert sampling_metadata.prompt_token_ids is not None
        return apply_all_penalties(
            logits,
            sampling_metadata.prompt_token_ids,
            sampling_metadata.presence_penalties,
            sampling_metadata.frequency_penalties,
            sampling_metadata.repetition_penalties,
            output_token_ids,
        )
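
For readers who want to see the failure mode concretely, here is a tiny self-contained illustration (hypothetical class names, unrelated to the PR's code) of why delegating to an instance method through the class without passing `self` raises a `TypeError`:

```python
class Base:
    def apply(self, x):
        return x + 1


class Child(Base):
    @staticmethod
    def apply(x):
        # Base.apply expects (self, x); here `x` is bound to `self`
        # and the real `x` is missing, so the call fails.
        return Base.apply(x)


Child().apply(1)  # TypeError: apply() missing 1 required positional argument: 'x'
```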

…er performance of penalties

Cherry-picked from upstream/main PR vllm-project#7569.

Included commits:
- 9817012 feat: add Triton-Ascend apply_penalties via AscendSampler
- ca14fe5 perf: 2D tile token bincount over batch and seq
- 25fac6c fix lint
- a4df2bf fix empty decode history
- b21da88 simplify code
- 329071c refactor ut

Co-authored-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan changed the title from "[v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties" to "[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties" on Mar 29, 2026
@yiz-liu added this to the v0.18.0rc1 milestone on Mar 31, 2026
@yiz-liu merged commit ed4ef1f into vllm-project:releases/v0.18.0 on Mar 31, 2026
28 checks passed

Labels

ready (read for review), ready-for-test (start test by label for PR)
