[perf][bugfix] improve performance of rejection sampler and eliminate HD synchronize in TopKTopPSampler#4154
Conversation
Code Review
This pull request updates the _apply_top_k_top_p sampler function to leverage the npu_top_k_top_p operator when either k (top-k) or p (top-p) is None. While the change correctly broadens the conditions for using the optimized NPU kernel, it introduces the potential for a runtime crash. My review focuses on a critical fix to handle empty tensors safely.
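To make the review point concrete, here is a minimal sketch of a top-k/top-p filter with the kind of empty-tensor guard the review asks for. This is not the vLLM-Ascend code: the function name, the early-return guard, and the sort-based fallback are illustrative assumptions (a fused device kernel such as npu_top_k_top_p is assumed to be unsafe on zero-row input; on CPU the guard merely avoids useless work).

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, k, p):
    # Hypothetical guard: fused device kernels may crash on an empty
    # batch, so bail out before dispatching any filtering work.
    if logits.shape[0] == 0:
        return logits
    if k is None and p is None:
        return logits
    # Reference fallback path: sort ascending, mask, restore order.
    sorted_logits, sorted_idx = logits.sort(dim=-1, descending=False)
    if k is not None:
        # Mask everything strictly below the k-th largest logit.
        top_k_mask = sorted_logits < sorted_logits[:, -k].unsqueeze(-1)
        sorted_logits.masked_fill_(top_k_mask, -float("inf"))
    if p is not None:
        probs = sorted_logits.softmax(dim=-1)
        cum = probs.cumsum(dim=-1)
        # Drop the low-probability head whose mass is <= 1 - p,
        # always keeping the most probable token (last column).
        top_p_mask = cum <= 1.0 - p
        top_p_mask[:, -1] = False
        sorted_logits.masked_fill_(top_p_mask, -float("inf"))
    # Undo the sort: gather with the inverse permutation.
    return sorted_logits.gather(-1, sorted_idx.argsort(-1))
```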
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
… with post-processing sampler Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed a8442f6 to 0e4262a
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed abfd957 to b1e770f
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed 95c668e to 9088478
/gemini review
Code Review
This pull request refactors the top-k/top-p sampling logic to improve performance by avoiding D2H synchronization when asynchronous scheduling is enabled. The implementation of AscendTopKTopPSampler is also updated to better align with vLLM's coding patterns. While the refactoring in sampler.py is well-executed, I've identified a critical issue in rejection_sampler.py where the changes inadvertently remove the optimized NPU kernel path, leading to a performance regression. My review includes specific suggestions to correct this by ensuring the fused kernel is used when appropriate, thereby achieving both the performance and bug-fixing goals of this PR.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
All the E2E CI passed (https://github.com/vllm-project/vllm-ascend/actions/runs/20461117976/job/58793951915?pr=4154). I would remove
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed 8577878 to 2cb6b71
… HD synchronize in TopKTopPSampler (vllm-project#4154)

### What this PR does / why we need it?
1. Use the optimized apply_top_k_top_p for the NPU platform in the rejection sampler (avoiding scatter of elements, which can reduce TPOT by ~26ms with bs=24 per DP);
2. <del>Avoid D2H synchronization before calling npu_top_k_top_p introduced by parameter validation, which improves inference speed with `async_scheduling` enabled;</del> In order to eliminate the D2H synchronization introduced by parameter validation before calling `npu_top_k_top_p`, we directly drop this fused operator, since the performance improvement is not significant compared to async_scheduling and it may bring potential accuracy problems;
3. Refactor the implementation of AscendTopKTopPSampler to align with that of vLLM.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
E2E serving test with combinations of `k=500` and `p=0.95` with async_scheduling in single-node and wide-EP scenarios.

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
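The "avoid scatter" idea in item 1 of the commit message can be sketched as follows. Instead of sorting, masking in sorted order, and scattering the mask back to original positions, a single per-row reduction finds the k-th largest value and an elementwise compare masks in place. The function name and uniform scalar `k` are illustrative assumptions, not the actual vLLM-Ascend implementation.

```python
import torch

def top_k_no_scatter(logits: torch.Tensor, k: int) -> torch.Tensor:
    vocab = logits.shape[-1]
    # The k-th largest value equals the (vocab - k + 1)-th smallest,
    # which kthvalue computes with one reduction per row.
    kth_largest = logits.kthvalue(vocab - k + 1, dim=-1).values
    # One elementwise compare replaces the sort/scatter round-trip.
    return logits.masked_fill(logits < kth_largest.unsqueeze(-1),
                              -float("inf"))
```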
- ✅ **Review Quality:** He has completed [50+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+reviewed-by%3Alinfeng-yuan) since April 2025, covering graph mode, MoE, quantization, model support, and performance-related changes. In addition to regular review work, he has also participated in complex feature development and review, such as [#6670](#6670) (MoE MXFP8 quantization), where he helped with A5 MXFP8 integration, compatibility cleanup, dispatch updates, and implementation fixes.
- ✅ **Sustained Contributions:** He has [60+ merged PRs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+is%3Amerged+author%3Alinfeng-yuan) since April 2025, with continuous activity across major release cycles.
- ✅ **Quality Contributions:**
  **Torchair Graph Mode & Wide-EP / MoE — Feature Owner (2025 Q2~Q4):** He was the Feature Owner for DeepSeek high-throughput inference under torchair graph mode and the Wide-EP project. He drove graph mode performance optimization ([#731](#731)), landed super-kernel fusion for quantized DSR1 ([#3485](#3485)), and added initial MoE support for Model Runner v2 ([#7922](#7922)).
  **Ascend950 (A5) — Feature Owner:** He authored the [RFC roadmap (#7157)](#7157) for A5 support, landed initial build support ([#7151](#7151)), co-authored MXFP8 and MXFP4 quantization support for A5 ([#6670](#6670), [#7877](#7877)), and fixed the MXFP8 scale normalization issue that unblocked A5 quantized inference ([#7573](#7573)).
  **DeepSeek Low-Latency & Post-Processing:** He improved DSv3.2 performance by eliminating HD synchronization ([#4805](#4805)), improved rejection sampler performance and eliminated D2H sync in TopKTopPSampler ([#4154](#4154)), and added a penalty-related Triton kernel for sampling performance ([#7794](#7794)).
- ✅ **Community Involvement:** He led a 2-part torchair modeling refactor ([#2384](#2384), [#2459](#2459)) and deleted ~2K lines of redundant DeepSeek modeling code as upstream absorbed the changes ([#2849](#2849)). He also replaced scattered business kwargs with typed request objects across MoE stage boundaries ([#7024](#7024)). Since March 2026, he has taken part in issue triage and user support, responding to [30+ issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue+commenter%3Alinfeng-yuan+updated%3A%3E2026-03-01) covering graph mode failures, quantization accuracy regressions, MoE deployment problems, and multi-node communication issues.
- vLLM version: v0.19.1
- vLLM main: vllm-project/vllm@4d51588

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
What this PR does / why we need it?
<del>Avoid D2H synchronization before calling npu_top_k_top_p introduced by parameter validation, which improves inference speed with `async_scheduling` enabled;</del> In order to eliminate the D2H synchronization introduced by parameter validation before calling `npu_top_k_top_p`, we directly drop this fused operator, since the performance improvement is not significant compared to async_scheduling and it may bring potential accuracy problems.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
E2E serving test with combinations of `k=500` and `p=0.95` with async_scheduling in single-node and wide-EP scenarios.
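The D2H synchronization discussed in this PR can be illustrated with a small sketch. Reading a device-side reduction back as a Python bool (as parameter validation typically does) forces the host to wait for the device, which stalls async scheduling; keeping the decision in host-side metadata avoids the sync. Both function names below are hypothetical illustrations, not the vLLM-Ascend code (and on CPU tensors there is no actual sync, only the same control flow).

```python
import torch

def needs_filtering_sync(p: torch.Tensor) -> bool:
    # Converting a device reduction to a Python bool is a D2H sync
    # point on an accelerator: the host blocks until the kernel that
    # produced `p` has finished.
    return bool((p < 1.0).any())

def needs_filtering_async(any_p_requested: bool) -> bool:
    # Host-side metadata (e.g. collected when requests were enqueued)
    # answers the same question without touching device tensors.
    return any_p_requested
```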