[Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement #30159
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request introduces performance optimizations to the group_topk kernel by leveraging C++ templates for compile-time specialization based on the scoring function, renormalization, and group size. These changes appear to correctly implement the intended optimizations and should yield the performance improvements described. My review focuses on several instances of significant code duplication that have been introduced. While the optimizations are valuable, the duplicated code harms maintainability and increases the risk of future bugs. I've provided suggestions to refactor these sections to be more DRY (Don't Repeat Yourself) while retaining the performance benefits.
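To illustrate the kind of compile-time specialization the review describes, here is a minimal host-side C++17 sketch. It is not the kernel code from this PR: every name in it (`ScoringFunc`, `score_weights`, `score_weights_dispatch`) is hypothetical, and it shows only two of the three specialized parameters (scoring function and renormalization, not group size). The idea is that runtime flags are mapped once to a template instantiation, so the hot loop carries no per-element branch on those flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the vLLM kernel.
enum class ScoringFunc { kNone, kSigmoid };

// The scoring function and the renormalization flag are template parameters,
// so each instantiation compiles in only the branch it needs.
template <ScoringFunc SF, bool kRenormalize>
std::vector<float> score_weights(const std::vector<float>& logits) {
  std::vector<float> out(logits.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    if constexpr (SF == ScoringFunc::kSigmoid) {
      out[i] = 1.0f / (1.0f + std::exp(-logits[i]));
    } else {
      out[i] = logits[i];
    }
    sum += out[i];
  }
  if constexpr (kRenormalize) {
    assert(sum != 0.0f);  // the sketch does not handle an all-zero row
    for (float& w : out) w /= sum;
  }
  return out;
}

// Runtime-to-compile-time dispatch: the flags are inspected once here,
// outside the per-element loop.
std::vector<float> score_weights_dispatch(const std::vector<float>& logits,
                                          ScoringFunc sf, bool renormalize) {
  if (sf == ScoringFunc::kSigmoid) {
    return renormalize ? score_weights<ScoringFunc::kSigmoid, true>(logits)
                       : score_weights<ScoringFunc::kSigmoid, false>(logits);
  }
  return renormalize ? score_weights<ScoringFunc::kNone, true>(logits)
                     : score_weights<ScoringFunc::kNone, false>(logits);
}
```

A single templated body with `if constexpr` is also one way to address the duplication concern raised above: the specializations share one source definition instead of being copied per variant.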
mgoin left a comment
Nice work! Just some optional nits
…% TPOT improvement (vllm-project#30159)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
We are optimizing the GLM-4.6 model. The group_topk kernel takes a large share of the time, so we reduce its cost first.
This optimization could also be used for other models, such as V3.2.
Optimize the kernel, mainly:
Test
export MODEL="zai-org/GLM-4.6-FP8"
Acc
Perf
vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 128 --request-rate inf --num-prompts 1024

Now
============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  21.81
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.95
Output token throughput (tok/s):         6009.90
Peak output token throughput (tok/s):    7157.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          6103.80
---------------Time to First Token----------------
Mean TTFT (ms):                          969.39
Median TTFT (ms):                        1037.20
P99 TTFT (ms):                           1195.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.53
Median TPOT (ms):                        162.71
P99 TPOT (ms):                           162.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.54
Median ITL (ms):                         161.80
P99 ITL (ms):                            188.35
==================================================

Main
============ Serving Benchmark Result ============
Successful requests:                     1024
Failed requests:                         0
Benchmark duration (s):                  22.24
Total input tokens:                      2048
Total generated tokens:                  131072
Request throughput (req/s):              46.05
Output token throughput (tok/s):         5894.52
Peak output token throughput (tok/s):    6715.00
Peak concurrent requests:                1024.00
Total Token throughput (tok/s):          5986.63
---------------Time to First Token----------------
Mean TTFT (ms):                          966.52
Median TTFT (ms):                        1066.92
P99 TTFT (ms):                           1080.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          166.03
Median TPOT (ms):                        166.09
P99 TPOT (ms):                           166.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.05
Median ITL (ms):                         164.11
P99 ITL (ms):                            206.18
==================================================
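For readers who want to check the headline percentages against the two runs, the arithmetic can be sketched as below. The helper names `pct_increase` and `pct_decrease` are hypothetical, and the choice of output token throughput and mean TPOT as the compared metrics is an assumption about how the title figures were derived.

```cpp
#include <cassert>
#include <cmath>

// Relative gain in percent for a higher-is-better metric (e.g. throughput).
double pct_increase(double baseline, double candidate) {
  assert(baseline != 0.0);
  return (candidate - baseline) / baseline * 100.0;
}

// Relative reduction in percent for a lower-is-better metric (e.g. latency).
double pct_decrease(double baseline, double candidate) {
  assert(baseline != 0.0);
  return (baseline - candidate) / baseline * 100.0;
}

// Values copied from the "Main" and "Now" benchmark results.
const double kMainOutputTokPerSec = 5894.52;
const double kNowOutputTokPerSec = 6009.90;
const double kMainMeanTpotMs = 166.03;
const double kNowMeanTpotMs = 162.53;
```

With these values, `pct_increase(kMainOutputTokPerSec, kNowOutputTokPerSec)` is about 1.96 and `pct_decrease(kMainMeanTpotMs, kNowMeanTpotMs)` is about 2.11, in line with the 1.9% throughput and 2.1% TPOT figures in the title.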