
[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler#4154

Merged
realliujiaxu merged 9 commits into vllm-project:main from linfeng-yuan:upgrade_top_k_top_p_main on Dec 24, 2025

Conversation

@linfeng-yuan (Collaborator) commented Nov 12, 2025

What this PR does / why we need it?

  1. Use the optimized apply_top_k_top_p for the NPU platform in the rejection sampler; avoiding the element scatter reduces TPOT by ~26 ms with bs=24 per DP (see the sketch after this list).
  2. ~~Avoid the D2H synchronization before calling npu_top_k_top_p introduced by parameter validation, which improves inference speed with async_scheduling enabled.~~ To eliminate the D2H synchronization that parameter validation introduces before calling `npu_top_k_top_p`, we drop this fused operator entirely, since its performance gain is insignificant compared to async_scheduling and it may cause accuracy problems.
  3. Refactor the implementation of AscendTopKTopPSampler to align with that of vLLM.
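
For context, below is a minimal sketch (simplified PyTorch, not the exact vllm-ascend code) of the sort-based top-k/top-p masking used by vLLM's generic path. The final scatter that restores the original token order is the costly per-element step on NPU that the optimized path avoids:

```python
from typing import Optional

import torch

def apply_top_k_top_p_sketch(logits: torch.Tensor,
                             k: Optional[torch.Tensor],
                             p: Optional[torch.Tensor]) -> torch.Tensor:
    """Sort-based top-k/top-p masking, sketched after vLLM's generic path.

    logits: (batch, vocab); k and p are per-request tensors or None.
    Assumes 1 <= k <= vocab_size and 0 < p <= 1 where given.
    """
    if k is None and p is None:
        return logits
    # Sort ascending so the tokens to be masked sit at the low end.
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)

    if k is not None:
        # Value of the k-th largest logit per row; mask everything below it.
        top_k_index = (logits_sort.size(-1) - k.to(torch.long)).unsqueeze(1)
        top_k_threshold = logits_sort.gather(1, top_k_index)
        logits_sort.masked_fill_(logits_sort < top_k_threshold, -float("inf"))

    if p is not None:
        # Mask the low-probability tail whose cumulative mass is <= 1 - p.
        probs_sum = logits_sort.softmax(dim=-1).cumsum(dim=-1)
        top_p_mask = probs_sum <= (1 - p).unsqueeze(1)
        top_p_mask[:, -1] = False  # always keep the most likely token
        logits_sort.masked_fill_(top_p_mask, -float("inf"))

    # Scatter the masked logits back to the original token order -- the
    # per-element scatter that is expensive on NPU.
    return logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
```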

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2E serving tests with `k=500` and `p=0.95`, with async_scheduling enabled, in single-node and wide-EP scenarios.

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request updates the _apply_top_k_top_p sampler function to leverage the npu_top_k_top_p operator when either k (top-k) or p (top-p) parameters are None. While the change correctly broadens the conditions for using the optimized NPU kernel, it introduces a potential for a runtime crash. My review focuses on a critical fix to handle empty tensors safely.
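
The suggested guard is not reproduced inline here, but a minimal early-exit along these lines would address the crash (hypothetical sketch; the actual change is in the sampler.py thread below):

```python
# Hypothetical guard, assuming `logits` can be empty (e.g., no requests
# need sampling in this step): computing kernel arguments from an empty
# tensor can raise at runtime, so return early instead.
if logits.numel() == 0:
    return logits
```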

Comment thread vllm_ascend/sample/sampler.py Outdated
@github-actions (Contributor)

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions (Contributor)

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
… with post-processing sampler

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from a8442f6 to 0e4262a on December 23, 2025 02:12
@linfeng-yuan linfeng-yuan changed the title from "[ops] npu_top_k_top_p supports k or p is None" to "[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler" Dec 23, 2025
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from abfd957 to b1e770f on December 23, 2025 02:29
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from 95c668e to 9088478 on December 23, 2025 05:53
@linfeng-yuan (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request refactors the top-k/top-p sampling logic to improve performance by avoiding D2H synchronization when asynchronous scheduling is enabled. The implementation of AscendTopKTopPSampler is also updated to better align with vLLM's coding patterns. While the refactoring in sampler.py is well-executed, I've identified a critical issue in rejection_sampler.py where the changes inadvertently remove the optimized NPU kernel path, leading to a performance regression. My review includes specific suggestions to correct this by ensuring the fused kernel is used when appropriate, thereby achieving both the performance and bug-fixing goals of this PR.

Comment thread vllm_ascend/sample/rejection_sampler.py
Comment thread vllm_ascend/sample/rejection_sampler.py
Comment thread vllm_ascend/sample/sampler.py Outdated
Comment thread vllm_ascend/sample/sampler.py Outdated
@linfeng-yuan linfeng-yuan added the ready-for-test (start test by label for PR) and ready (read for review) labels and removed the ready (read for review) label Dec 23, 2025
@linfeng-yuan (Collaborator, Author)

All the E2E CI passed (https://github.com/vllm-project/vllm-ascend/actions/runs/20461117976/job/58793951915?pr=4154). I will remove the npu_top_k_top_p op call in all cases.
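
For background on why the dropped validation mattered: any check that turns a device tensor into a Python value copies it device-to-host and synchronizes the stream, stalling the CPU that async_scheduling keeps running ahead of the device. A hedged illustration (hypothetical helper, not the actual vllm-ascend code):

```python
import torch

def can_use_fused_top_k_top_p(k: torch.Tensor, p: torch.Tensor,
                              vocab_size: int) -> bool:
    # Hypothetical validation of the kind that guarded npu_top_k_top_p.
    # Each bool(...) materializes a device scalar on the host, forcing a
    # D2H copy and a stream synchronization.
    k_ok = bool(((k > 0) & (k <= vocab_size)).all())  # D2H sync here
    p_ok = bool(((p > 0.0) & (p <= 1.0)).all())       # and here
    return k_ok and p_ok
```

Dropping the fused operator removes the need for such host-side checks, keeping post-processing fully asynchronous.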

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the upgrade_top_k_top_p_main branch from 8577878 to 2cb6b71 on December 24, 2025 05:59
@realliujiaxu realliujiaxu merged commit 515267d into vllm-project:main Dec 24, 2025
14 checks passed
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
… HD synchronize in TopKTopPSampler (vllm-project#4154)

linfeng-yuan pushed a commit that referenced this pull request May 9, 2026
- ✅ **Review Quality:**
He has completed [50+
reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan)
since April 2025, covering graph mode, MoE, quantization, model support,
and performance-related changes.

In addition to regular review work, he has also participated in complex
feature development and review, such as
[#6670](#6670) (MoE
MXFP8 quantization), where he helped with A5 MXFP8 integration,
compatibility cleanup, dispatch updates, and implementation fixes.

- ✅ **Sustained Contributions:**
He has [60+ merged
PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan)
since April 2025, with continuous activity across major release cycles.

- ✅ **Quality Contributions:**

  **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):**
He was the Feature Owner for DeepSeek high-throughput inference under
torchair graph mode and the Wide-EP project. He drove graph mode
performance optimization
([#731](#731)), landed
super-kernel fusion for quantized DSR1
([#3485](#3485)), and
added initial MoE support for Model Runner v2
([#7922](#7922)).

  **Ascend950 (A5) — Feature Owner:**
He authored the [RFC roadmap
(#7157)](#7157) for A5
support, landed initial build support
([#7151](#7151)),
co-authored MXFP8 and MXFP4 quantization support for A5
([#6670](#6670),
[#7877](#7877)), and
fixed the MXFP8 scale normalization issue that unblocked A5 quantized
inference
([#7573](#7573)).

  **DeepSeek Low-Latency & Post-Processing:**
He improved DSv3.2 performance by eliminating HD synchronization
([#4805](#4805)),
improved rejection sampler performance and eliminated D2H sync in
TopKTopPSampler
([#4154](#4154)), and
added a penalty-related Triton kernel for sampling performance
([#7794](#7794)).

- ✅ **Community Involvement:**
He led a 2-part torchair modeling refactor
([#2384](#2384),
[#2459](#2459)) and
deleted ~2K lines of redundant DeepSeek modeling code as upstream
absorbed the changes
([#2849](#2849)). He
also replaced scattered business kwargs with typed request objects
across MoE stage boundaries
([#7024](#7024)).

Since March 2026, he has taken part in issue triage and user support,
responding to [30+
issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01)
covering graph mode failures, quantization accuracy regressions, MoE
deployment problems, and multi-node communication issues.

- vLLM version: v0.19.1
- vLLM main:
vllm-project/vllm@4d51588

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 10, 2026
SOMEONEUNSEEN pushed a commit to SOMEONEUNSEEN/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 11, 2026
ZhuQi-seu pushed a commit to ZhuQi-seu/vllm-ascend that referenced this pull request May 12, 2026

Labels

ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants