
[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler#4154

Merged
realliujiaxu merged 9 commits into vllm-project:main from linfeng-yuan:upgrade_top_k_top_p_main on Dec 24, 2025

Conversation

@linfeng-yuan (Collaborator) commented Nov 12, 2025

What this PR does / why we need it?

  1. Use the optimized apply_top_k_top_p for the NPU platform in the rejection sampler; avoiding the element scatter reduces TPOT by ~26 ms with bs=24 per DP (see the sketch after this list).
  2. ~~Avoid the D2H synchronization before calling npu_top_k_top_p introduced by parameter validation, which improves inference speed with async_scheduling enabled.~~ To eliminate the D2H synchronization that parameter validation introduces before calling `npu_top_k_top_p`, we drop this fused operator entirely, since its performance gain is insignificant compared to async_scheduling and it may cause accuracy problems.
  3. Refactor the implementation of AscendTopKTopPSampler to align with that of vLLM.
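
For context, below is a minimal sketch (simplified PyTorch, not the exact vllm-ascend code) of the sort-based top-k/top-p masking used by vLLM's generic path. The final scatter that restores the original token order is the costly per-element step on NPU that the optimized path avoids:

```python
from typing import Optional

import torch

def apply_top_k_top_p_sketch(logits: torch.Tensor,
                             k: Optional[torch.Tensor],
                             p: Optional[torch.Tensor]) -> torch.Tensor:
    """Sort-based top-k/top-p masking, sketched after vLLM's generic path.

    logits: (batch, vocab); k and p are per-request tensors or None.
    Assumes 1 <= k <= vocab_size and 0 < p <= 1 where given.
    """
    if k is None and p is None:
        return logits
    # Sort ascending so the tokens to be masked sit at the low end.
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)

    if k is not None:
        # Value of the k-th largest logit per row; mask everything below it.
        top_k_index = (logits_sort.size(-1) - k.to(torch.long)).unsqueeze(1)
        top_k_threshold = logits_sort.gather(1, top_k_index)
        logits_sort.masked_fill_(logits_sort < top_k_threshold, -float("inf"))

    if p is not None:
        # Mask the low-probability tail whose cumulative mass is <= 1 - p.
        probs_sum = logits_sort.softmax(dim=-1).cumsum(dim=-1)
        top_p_mask = probs_sum <= (1 - p).unsqueeze(1)
        top_p_mask[:, -1] = False  # always keep the most likely token
        logits_sort.masked_fill_(top_p_mask, -float("inf"))

    # Scatter the masked logits back to the original token order -- the
    # per-element scatter that is expensive on NPU.
    return logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
```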

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2E serving tests with `k=500` and `p=0.95`, with async_scheduling enabled, in single-node and wide-EP scenarios.

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request updates the _apply_top_k_top_p sampler function to leverage the npu_top_k_top_p operator when either k (top-k) or p (top-p) parameters are None. While the change correctly broadens the conditions for using the optimized NPU kernel, it introduces a potential for a runtime crash. My review focuses on a critical fix to handle empty tensors safely.
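
The suggested guard is not reproduced inline here, but a minimal early-exit along these lines would address the crash (hypothetical sketch; the actual change is in the sampler.py thread below):

```python
# Hypothetical guard, assuming `logits` can be empty (e.g., no requests
# need sampling in this step): computing kernel arguments from an empty
# tensor can raise at runtime, so return early instead.
if logits.numel() == 0:
    return logits
```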

Comment thread vllm_ascend/sample/sampler.py Outdated
@github-actions (Contributor)

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions (Contributor)

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
… with post-processing sampler

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from a8442f6 to 0e4262a on December 23, 2025 02:12
@linfeng-yuan linfeng-yuan changed the title from "[ops] npu_top_k_top_p supports k or p is None" to "[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler" Dec 23, 2025
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from abfd957 to b1e770f on December 23, 2025 02:29
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from 95c668e to 9088478 on December 23, 2025 05:53
@linfeng-yuan (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request refactors the top-k/top-p sampling logic to improve performance by avoiding D2H synchronization when asynchronous scheduling is enabled. The implementation of AscendTopKTopPSampler is also updated to better align with vLLM's coding patterns. While the refactoring in sampler.py is well-executed, I've identified a critical issue in rejection_sampler.py where the changes inadvertently remove the optimized NPU kernel path, leading to a performance regression. My review includes specific suggestions to correct this by ensuring the fused kernel is used when appropriate, thereby achieving both the performance and bug-fixing goals of this PR.

Comment thread vllm_ascend/sample/rejection_sampler.py
Comment thread vllm_ascend/sample/rejection_sampler.py
Comment thread vllm_ascend/sample/sampler.py Outdated
Comment thread vllm_ascend/sample/sampler.py Outdated
@linfeng-yuan linfeng-yuan added the ready-for-test (start test by label for PR) and ready (read for review) labels and removed the ready (read for review) label Dec 23, 2025
@linfeng-yuan (Collaborator, Author)

All the E2E CI passed (https://github.com/vllm-project/vllm-ascend/actions/runs/20461117976/job/58793951915?pr=4154). I will remove the npu_top_k_top_p op call in all cases.
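
For background on why the dropped validation mattered: any check that turns a device tensor into a Python value copies it device-to-host and synchronizes the stream, stalling the CPU that async_scheduling keeps running ahead of the device. A hedged illustration (hypothetical helper, not the actual vllm-ascend code):

```python
import torch

def can_use_fused_top_k_top_p(k: torch.Tensor, p: torch.Tensor,
                              vocab_size: int) -> bool:
    # Hypothetical validation of the kind that guarded npu_top_k_top_p.
    # Each bool(...) materializes a device scalar on the host, forcing a
    # D2H copy and a stream synchronization.
    k_ok = bool(((k > 0) & (k <= vocab_size)).all())  # D2H sync here
    p_ok = bool(((p > 0.0) & (p <= 1.0)).all())       # and here
    return k_ok and p_ok
```

Dropping the fused operator removes the need for such host-side checks, keeping post-processing fully asynchronous.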

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from 8577878 to 2cb6b71 on December 24, 2025 05:59
@realliujiaxu realliujiaxu merged commit 515267d into vllm-project:main Dec 24, 2025
14 checks passed
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

linfeng-yuan pushed a commit that referenced this pull request May 9, 2026
- ✅ **Review Quality:**
He has completed [50+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan)
since April 2025, covering graph mode, MoE, quantization, model support,
and performance-related changes.

In addition to regular review work, he has also participated in complex
feature development and review, such as
[#6670](#6670) (MoE
MXFP8 quantization), where he helped with A5 MXFP8 integration,
compatibility cleanup, dispatch updates, and implementation fixes.

- ✅ **Sustained Contributions:**
He has [60+ merged
PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan)
since April 2025, with continuous activity across major release cycles.

- ✅ **Quality Contributions:**

  **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):**
He was the Feature Owner for DeepSeek high-throughput inference under
torchair graph mode and the Wide-EP project. He drove graph mode
performance optimization
([#731](#731)), landed
super-kernel fusion for quantized DSR1
([#3485](#3485)), and
added initial MoE support for Model Runner v2
([#7922](#7922)).

  **Ascend950 (A5) — Feature Owner:**
He authored the [RFC roadmap
(#7157)](#7157) for A5
support, landed initial build support
([#7151](#7151)),
co-authored MXFP8 and MXFP4 quantization support for A5
([#6670](#6670),
[#7877](#7877)), and
fixed the MXFP8 scale normalization issue that unblocked A5 quantized
inference
([#7573](#7573)).

  **DeepSeek Low-Latency & Post-Processing:**
He improved DSv3.2 performance by eliminating HD synchronization
([#4805](#4805)),
improved rejection sampler performance and eliminated D2H sync in
TopKTopPSampler
([#4154](#4154)), and
added a penalty-related Triton kernel for sampling performance
([#7794](#7794)).

- ✅ **Community Involvement:**
He led a 2-part torchair modeling refactor
([#2384](#2384),
[#2459](#2459)) and
deleted ~2K lines of redundant DeepSeek modeling code as upstream
absorbed the changes
([#2849](#2849)). He
also replaced scattered business kwargs with typed request objects
across MoE stage boundaries
([#7024](#7024)).

Since March 2026, he has taken part in issue triage and user support,
responding to [30+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01)
covering graph mode failures, quantization accuracy regressions, MoE
deployment problems, and multi-node communication issues.

- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@4d51588

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
SOMEONEUNSEEN pushed a commit to SOMEONEUNSEEN/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants