[Perf] Optimize `group_topk` kernel, 1.9% Throughput improvement, 2.1% TPOT improvement by yewentao256 · Pull Request #30159 · vllm-project/vllm

yewentao256 · 2025-12-05T22:25:21Z

Purpose

We are optimizing the GLM-4.6 model. The `group_topk` kernel takes a lot of time, so we reduce its cost first.

This optimization can also be applied to other models, such as V3.2.

The main changes to the kernel:

- Make the scoring function a template parameter
- Unroll the loops for some common `n_group` values
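The two changes above can be illustrated with a small CPU-side sketch. It is not the code from this PR: every name in it (`Scoring`, `apply_scoring`, `group_max_fixed`, `dispatch_n_group`, `group_scores`) is hypothetical, the set of specialized group counts (1, 4, 8) is an assumption, and it only computes a per-group maximum score rather than the full grouped top-k. It shows the general pattern: the scoring function and the group count become compile-time constants, and a thin dispatcher maps runtime values onto the specialized instantiations. It assumes C++17 for `if constexpr`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// All names are hypothetical; this is a CPU illustration, not the vLLM kernel.
enum class Scoring { None, Sigmoid };

// The scoring function is a template parameter, so the choice is resolved at
// compile time instead of being branched on for every element.
template <Scoring SF>
inline float apply_scoring(float x) {
  if constexpr (SF == Scoring::Sigmoid) {
    return 1.0f / (1.0f + std::exp(-x));
  } else {
    return x;
  }
}

// Generic fallback: the group count is only known at run time.
template <Scoring SF>
std::vector<float> group_max_generic(const std::vector<float>& logits, int n_group) {
  assert(n_group > 0 && logits.size() % static_cast<std::size_t>(n_group) == 0);
  const int per_group = static_cast<int>(logits.size()) / n_group;
  std::vector<float> best(n_group, -std::numeric_limits<float>::infinity());
  for (int g = 0; g < n_group; ++g)
    for (int i = 0; i < per_group; ++i)
      best[g] = std::max(best[g], apply_scoring<SF>(logits[g * per_group + i]));
  return best;
}

// Specialized path: NGroup is a compile-time constant, so the compiler is
// free to fully unroll the outer loop.
template <Scoring SF, int NGroup>
std::vector<float> group_max_fixed(const std::vector<float>& logits) {
  static_assert(NGroup > 0, "group count must be positive");
  assert(logits.size() % NGroup == 0);
  const int per_group = static_cast<int>(logits.size()) / NGroup;
  std::vector<float> best(NGroup, -std::numeric_limits<float>::infinity());
  for (int g = 0; g < NGroup; ++g)
    for (int i = 0; i < per_group; ++i)
      best[g] = std::max(best[g], apply_scoring<SF>(logits[g * per_group + i]));
  return best;
}

// Map a runtime group count onto a specialized instantiation when one exists.
// The specialized values 1, 4 and 8 are assumed for illustration only.
template <Scoring SF>
std::vector<float> dispatch_n_group(const std::vector<float>& logits, int n_group) {
  switch (n_group) {
    case 1: return group_max_fixed<SF, 1>(logits);
    case 4: return group_max_fixed<SF, 4>(logits);
    case 8: return group_max_fixed<SF, 8>(logits);
    default: return group_max_generic<SF>(logits, n_group);
  }
}

// Entry point: the only runtime branch on the scoring function happens here.
std::vector<float> group_scores(const std::vector<float>& logits, int n_group, Scoring sf) {
  return sf == Scoring::Sigmoid ? dispatch_n_group<Scoring::Sigmoid>(logits, n_group)
                                : dispatch_n_group<Scoring::None>(logits, n_group);
}
```

Calling `group_scores(logits, 4, Scoring::Sigmoid)` takes the specialized path, while a group count outside the assumed list falls back to the generic loop. The trade-off is more template instantiations (and compile time) in exchange for removing per-element branches and runtime loop bounds.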

Test

export MODEL="zai-org/GLM-4.6-FP8"

Acc

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k
Now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9356|±  |0.0068|
|     |       |strict-match    |     5|exact_match|↑  |0.9310|±  |0.0070|
main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9325|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.9280|±  |0.0071|
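As a rough sanity check on the two tables above, the scores on this branch and on main can be compared against their reported standard errors. The helper below is a hypothetical sketch, not part of `lm_eval`; it treats the two runs as independent, which is a simplification because both use the same gsm8k test set.

```cpp
#include <cmath>

// Hypothetical helper: returns true when two scores differ by no more than
// the combined standard error, treating the two runs as independent.
bool within_noise(double score_a, double stderr_a, double score_b, double stderr_b) {
  const double combined = std::sqrt(stderr_a * stderr_a + stderr_b * stderr_b);
  return std::fabs(score_a - score_b) <= combined;
}
```

With the reported values, `within_noise(0.9356, 0.0068, 0.9325, 0.0069)` (flexible-extract) and `within_noise(0.9310, 0.0070, 0.9280, 0.0071)` (strict-match) both hold: the differences of about 0.003 are smaller than the combined standard error of about 0.01, so the accuracy results are consistent with no change.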

Perf

vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  21.81     
Total input tokens:                      2048      
Total generated tokens:                  131072    
Request throughput (req/s):              46.95     
Output token throughput (tok/s):         6009.90   
Peak output token throughput (tok/s):    7157.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          6103.80   
---------------Time to First Token----------------
Mean TTFT (ms):                          969.39    
Median TTFT (ms):                        1037.20   
P99 TTFT (ms):                           1195.60   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.53    
Median TPOT (ms):                        162.71    
P99 TPOT (ms):                           162.94    
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.54    
Median ITL (ms):                         161.80    
P99 ITL (ms):                            188.35    
==================================================

Main
============ Serving Benchmark Result ============
Successful requests:                     1024      
Failed requests:                         0         
Benchmark duration (s):                  22.24     
Total input tokens:                      2048      
Total generated tokens:                  131072    
Request throughput (req/s):              46.05     
Output token throughput (tok/s):         5894.52   
Peak output token throughput (tok/s):    6715.00   
Peak concurrent requests:                1024.00   
Total Token throughput (tok/s):          5986.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          966.52    
Median TTFT (ms):                        1066.92   
P99 TTFT (ms):                           1080.13   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          166.03    
Median TPOT (ms):                        166.09    
P99 TPOT (ms):                           166.41    
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.05    
Median ITL (ms):                         164.11    
P99 ITL (ms):                            206.18    
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>

gemini-code-assist

Code Review

This pull request introduces performance optimizations to the group_topk kernel by leveraging C++ templates for compile-time specialization based on the scoring function, renormalization, and group size. These changes appear to correctly implement the intended optimizations and should yield the performance improvements described. My review focuses on several instances of significant code duplication that have been introduced. While the optimizations are valuable, the duplicated code harms maintainability and increases the risk of future bugs. I've provided suggestions to refactor these sections to be more DRY (Don't Repeat Yourself) while retaining the performance benefits.

csrc/moe/grouped_topk_kernels.cu

mgoin

Nice work! Just some optional nits

csrc/moe/grouped_topk_kernels.cu

yewentao256 added 3 commits December 5, 2025 20:54

first version

220761b

further optimize

9472d90

Merge branch 'main' into wentao-optimize-group-topk

a70d5f5

gemini-code-assist bot reviewed Dec 5, 2025

yewentao256 added 3 commits December 5, 2025 22:31

reduce code

bcf6077

reduce code

bb310e3

remove renorm template

241e0b3

yewentao256 added the "ready" label (label description: ONLY add when PR is ready to merge/full CI is needed) Dec 5, 2025

yewentao256 and others added 2 commits December 6, 2025 10:59

Merge branch 'main' into wentao-optimize-group-topk

2ea2bb2

Merge branch 'main' into wentao-optimize-group-topk

e60e552

mgoin approved these changes Dec 9, 2025

mgoin added the "moe" and "performance" (Performance-related issues) labels Dec 9, 2025

mgoin merged commit 0ee6416 into main Dec 9, 2025
96 of 97 checks passed

mgoin deleted the wentao-optimize-group-topk branch December 9, 2025 00:44

yewentao256 mentioned this pull request Dec 12, 2025

[Refactor] Small refactor for group topk #30562

Merged

yewentao256 mentioned this pull request Jan 5, 2026

[Feature]: Optimizations for MOE models (GLM4.7, DeepSeek series) #31755

Closed

5 tasks

mgoin merged 8 commits into main from wentao-optimize-group-topk
