[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties#7794
Conversation
**Summary of Changes**

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the performance of penalty application within the `AscendSampler`.
Code Review
Suggested PR Title:

```markdown
[Ops][Feature] Implement Triton-Ascend kernels for sampling penalties
```

Suggested PR Summary:

```markdown
### What this PR does / why we need it?
This PR implements Triton-Ascend kernels for repetition, frequency, and presence penalties, including a specialized bincount kernel. These are integrated into the `AscendSampler` to optimize NPU performance. A critical issue was found in `vllm_ascend/sample/sampler.py` where `apply_penalties` is defined as a `@staticmethod`, which will cause a `TypeError` during fallback to the base class.

### Does this PR introduce _any_ user-facing change?
Yes, it provides optimized NPU performance for sampling penalties.

### How was this patch tested?
A new E2E test `test_apply_all_penalties_v1_vs_ascend` was added to verify the Triton-Ascend implementation against the vLLM v1 reference.
```
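The penalty semantics these kernels implement follow vLLM's standard definitions: the repetition penalty rescales logits of any previously seen token, while the frequency and presence penalties subtract terms derived from output-token counts. The sketch below is a hedged eager-mode reference of that math (the function name and shapes are illustrative assumptions, not this PR's actual code):

```python
import torch

def apply_penalties_ref(logits: torch.Tensor,        # [B, V] float
                        prompt_mask: torch.Tensor,    # [B, V] bool
                        output_counts: torch.Tensor,  # [B, V] int
                        presence_p: torch.Tensor,     # [B]
                        frequency_p: torch.Tensor,    # [B]
                        repetition_p: torch.Tensor):  # [B]
    """Illustrative eager reference for the penalty math; assumed to mirror
    vLLM v1 semantics, not the Triton-Ascend kernel from this PR."""
    output_mask = output_counts > 0
    seen = prompt_mask | output_mask  # token appeared in prompt or output
    rep = repetition_p.unsqueeze(1)
    # Repetition penalty: shrink positive logits, amplify negative ones.
    logits = torch.where(seen & (logits > 0), logits / rep, logits)
    logits = torch.where(seen & (logits <= 0), logits * rep, logits)
    # Frequency penalty scales with how often the token was generated;
    # presence penalty applies once per generated token.
    logits = logits - frequency_p.unsqueeze(1) * output_counts
    logits = logits - presence_p.unsqueeze(1) * output_mask.to(logits.dtype)
    return logits
```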
@staticmethod
def apply_penalties(
    logits: torch.Tensor,
    sampling_metadata: SamplingMetadata,
    output_token_ids: list[list[int]],
) -> torch.Tensor:
    """Use Triton-Ascend penalties on NPU when Triton is available; else vLLM default."""
    if not HAS_TRITON:
        return Sampler.apply_penalties(logits, sampling_metadata, output_token_ids)
    if sampling_metadata.no_penalties:
        return logits
    assert sampling_metadata.prompt_token_ids is not None
    return apply_all_penalties(
        logits,
        sampling_metadata.prompt_token_ids,
        sampling_metadata.presence_penalties,
        sampling_metadata.frequency_penalties,
        sampling_metadata.repetition_penalties,
        output_token_ids,
    )
The `apply_penalties` method is incorrectly defined as a `@staticmethod`. The base class `Sampler` defines it as an instance method, so this override should also be an instance method to maintain signature compatibility. The current implementation will raise a `TypeError` at runtime when falling back to the default implementation, because `self` is not passed to `Sampler.apply_penalties`.
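A minimal, self-contained repro of this failure mode (toy class and method names, not the vLLM ones):

```python
class Base:
    def apply(self, x):  # instance method: expects (self, x)
        return x

class Child(Base):
    @staticmethod
    def apply(x):
        # Base.apply expects (self, x); passing only x binds x to `self`
        # and leaves the real argument missing, so Python raises TypeError.
        return Base.apply(x)

Child.apply(1)  # TypeError: apply() missing 1 required positional argument: 'x'
```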
To fix this, you should remove the `@staticmethod` decorator, add `self` as the first argument to the method, and use `super().apply_penalties(...)` for the fallback to correctly call the base class method.
def apply_penalties(
    self,
    logits: torch.Tensor,
    sampling_metadata: SamplingMetadata,
    output_token_ids: list[list[int]],
) -> torch.Tensor:
    """Use Triton-Ascend penalties on NPU when Triton is available; else vLLM default."""
    if not HAS_TRITON:
        return super().apply_penalties(logits, sampling_metadata, output_token_ids)
    if sampling_metadata.no_penalties:
        return logits
    assert sampling_metadata.prompt_token_ids is not None
    return apply_all_penalties(
        logits,
        sampling_metadata.prompt_token_ids,
        sampling_metadata.presence_penalties,
        sampling_metadata.frequency_penalties,
        sampling_metadata.repetition_penalties,
        output_token_ids,
    )

[releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties

Cherry-pick from upstream/main PR vllm-project#7569: vllm-project#7569

Included commits:
- 9817012 feat: add Triton-Ascend apply_penalties via AscendSampler
- ca14fe5 perf: 2D tile token bincount over batch and seq
- 25fac6c fix lint
- a4df2bf fix empty decode history
- b21da88 simplify code
- 329071c refactor ut

Co-authored-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed from e75a3bc to 2efd3d5.
What this PR does / why we need it?
Implement `get_token_bin_counts_and_mask` and `apply_penalties` with Triton-Ascend kernels. This significantly reduces the latency of the sampling process when repetition/frequency/presence penalties are enabled.
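For context, the bincount step that the Triton kernel accelerates can be sketched in eager PyTorch roughly as follows; this is an illustrative reference under assumed shapes, not the kernel or the exact vLLM helper:

```python
import torch

def token_bin_counts_and_mask_ref(tokens: torch.Tensor, vocab_size: int):
    """Sketch: per-request token histogram over the vocabulary.
    Assumes tokens is [batch, seq_len], padded with `vocab_size` so
    padding slots fall into a throwaway bin."""
    counts = torch.zeros(tokens.size(0), vocab_size + 1,
                         dtype=torch.long, device=tokens.device)
    counts.scatter_add_(1, tokens, torch.ones_like(tokens))
    counts = counts[:, :vocab_size]  # drop the padding bin
    return counts, counts > 0       # counts and "token seen" mask
```

Per the included commit ca14fe5 ("perf: 2D tile token bincount over batch and seq"), the kernel tiles this computation in 2D over the batch and sequence dimensions.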
Cherry-pick from main PR #7569
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed.