[Kernel] Enable fused_qknorm_rope_kernel supports partial rope by jeejeelee · Pull Request #30821 · vllm-project/vllm

jeejeelee · 2025-12-16T23:36:18Z

Purpose

Enable fused_qknorm_rope_kernel to support partial RoPE, which slightly improves performance for models such as GLM4.6-MoE. The gain is small because the torch combo kernel has already fused q and norm into a single kernel, so this fusion only consolidates 2 kernels into 1; see #27165.

cc @mgoin @ProExpertProg @yewentao256

Metrics

Test Script

Server

vllm serve zai-org/GLM-4.6 -tp 8 -dp 1 --max-num-seqs 128 --served-model-name glm46 --compilation-config '{"pass_config": {"enable_qk_norm_rope_fusion": "1"}}' --no-enable-prefix-caching

lm-eval

lm_eval --model local-completions --model_args "model=glm46,base_url=http://localhost:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

Benchmark

vllm bench serve --tokenizer zai-org/GLM-4.6 --dataset-name random --ignore-eos --metric-percentiles 90 --model glm46 --num-prompts 100 --random-input-len 2048 --random-output-len 1024 --request-rate inf

GSM8K

disable

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9378|±  |0.0067|
|     |       |strict-match    |     5|exact_match|↑  |0.9333|±  |0.0069|

enable

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9386|±  |0.0066|
|     |       |strict-match    |     5|exact_match|↑  |0.9340|±  |0.0068|
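As a rough sanity check on the two GSM8K tables above, the sketch below compares the enabled and disabled scores against their reported standard errors. It treats the two runs as independent samples, which is a simplifying assumption (both runs use the same test set), and the helper name `z_score` is illustrative only.

```python
import math


def z_score(value_a, stderr_a, value_b, stderr_b):
    """Difference between two scores in units of their combined standard error.

    Assumes the two measurements are independent, which is only an
    approximation for two runs over the same benchmark questions.
    """
    return (value_b - value_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


# (value, stderr) pairs copied from the tables above: disable first, enable second.
flexible = z_score(0.9378, 0.0067, 0.9386, 0.0066)
strict = z_score(0.9333, 0.0069, 0.9340, 0.0068)

print(f"flexible-extract z = {flexible:.3f}")
print(f"strict-match     z = {strict:.3f}")
```

Both values come out well below 1, so under this assumption the accuracy difference between the two configurations is within noise.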

Performance

disable

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  74.06     
Total input tokens:                      204800    
Total generated tokens:                  102400    
Request throughput (req/s):              1.35      
Output token throughput (tok/s):         1382.67   
Peak output token throughput (tok/s):    2017.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4148.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          10940.08  
Median TTFT (ms):                        10941.97  
P90 TTFT (ms):                           19154.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.27     
Median TPOT (ms):                        61.30     
P90 TPOT (ms):                           68.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.27     
Median ITL (ms):                         52.56     
P90 ITL (ms):                            54.70     
==================================================

enable

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  73.48     
Total input tokens:                      204800    
Total generated tokens:                  102400    
Request throughput (req/s):              1.36      
Output token throughput (tok/s):         1393.50   
Peak output token throughput (tok/s):    2100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4180.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          10950.81  
Median TTFT (ms):                        10955.08  
P90 TTFT (ms):                           19177.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          60.70     
Median TPOT (ms):                        60.73     
P90 TPOT (ms):                           68.31     
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.70     
Median ITL (ms):                         51.93     
P90 ITL (ms):                            53.94     
==================================================
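To put the two benchmark runs above side by side, the following sketch computes the relative change of a few metrics when the fusion is enabled. The numbers are copied from the two result blocks; the helper `pct_change` and the choice of metrics are illustrative, and a single run per configuration says nothing about run-to-run variance.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (disable, enable) pairs taken from the serving benchmark results above.
metrics = {
    "Output token throughput (tok/s)": (1382.67, 1393.50),
    "Mean TPOT (ms)": (61.27, 60.70),
    "Median ITL (ms)": (52.56, 51.93),
    "Mean TTFT (ms)": (10940.08, 10950.81),
}

for name, (disabled, enabled) in metrics.items():
    print(f"{name}: {pct_change(disabled, enabled):+.2f}%")
```

Output throughput rises by a little under 1% and mean TPOT drops by a little under 1%, while mean TTFT is essentially unchanged, which matches the "not very noticeable" gain described in the Purpose section.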

Test Plan

Test Result


Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

chatgpt-codex-connector · 2025-12-16T23:36:33Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

gemini-code-assist

Code Review

This pull request successfully introduces support for partial Rotary Positional Embeddings (RoPE) in the fused_qknorm_rope_kernel. The changes correctly propagate the rotary_dim parameter through the kernel and launch functions, ensuring that RoPE is applied only to the specified dimensions. The input validation for cos_sin_cache has been updated to reflect the new rotary_dim constraint, and the test suite has been extended to cover both full and partial RoPE scenarios. The logic for calculating pairOffset and conditionally applying RoPE within the kernel appears to be correctly adapted for this functionality.
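For readers unfamiliar with partial RoPE, here is a minimal pure-Python reference of the computation the review describes: RMS-normalize one head vector, then rotate only its first `rotary_dim` elements and pass the rest through unchanged. This is a sketch, not the kernel's code. The NeoX-style pairing (element `i` with element `i + rotary_dim / 2`), the `base` of 10000, the `eps` value, and all function names are assumptions made for illustration; the actual kernel reads cos/sin values from `cos_sin_cache` and may pair elements differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMS-normalize a single head vector and apply a per-element weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def partial_rope_neox(x, position, rotary_dim, base=10000.0):
    """Rotate only the first `rotary_dim` elements of x (assumed NeoX pairing)."""
    assert rotary_dim % 2 == 0 and rotary_dim <= len(x)
    half = rotary_dim // 2
    out = list(x)  # elements at index >= rotary_dim are passed through as-is
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def qk_norm_then_partial_rope(head, weight, position, rotary_dim):
    """Unfused reference: normalization followed by partial rotation."""
    return partial_rope_neox(rms_norm(head, weight), position, rotary_dim)


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
weight = [1.0] * len(head)
normed = rms_norm(head, weight)
rotated = qk_norm_then_partial_rope(head, weight, position=3, rotary_dim=4)
print(rotated)
```

With `rotary_dim` equal to the head dimension this reduces to full RoPE, which is why one code path can cover both the full and partial cases mentioned in the review.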


tjtanaa · 2025-12-22T14:54:27Z

@jeejeelee This PR broke ROCm builds. I will open an issue to track this.

DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o -MF CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o.d -o CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/fused_qknorm_rope_kernel.hip
#13 132.3 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/fused_qknorm_rope_kernel.hip:409:11: error: unused variable 'rotary_dim' [-Werror,-Wunused-variable]
#13 132.3   409 |   int64_t rotary_dim = cos_sin_cache.size(1);
#13 132.3       |           ^~~~~~~~~~
#13 132.3 1 error generated when compiling for gfx90a.
#13 138.6 [8/37] Building CXX object CMakeFiles/_rocm_C.dir/csrc/rocm/torch_bindings.cpp.o
#13 138.6 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
#13 138.7 [9/37] Building CXX object CMakeFiles/_moe_C.dir/csrc/moe/torch_bindings.cpp.o
#13 138.7 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
#13 143.6 [10/37] Building CXX object CMakeFiles/_C.dir/csrc/torch_bindings.cpp.o
#13 143.6 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++


Done

7b0a43c

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

jeejeelee requested review from WoosukKwon, mgoin, tlrmchlsmth and yewentao256 as code owners December 16, 2025 23:36

jeejeelee marked this pull request as draft December 16, 2025 23:36

Merge branch 'main' into fused-qknorm-partial-rope

c48eb25

gemini-code-assist bot reviewed Dec 16, 2025


jeejeelee added 5 commits December 18, 2025 11:48

Merge branch 'main' into fused-qknorm-partial-rope

1af47d2

Move fowarad

0d726db

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Revert

b26aaf7

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Merge branch 'main' into fused-qknorm-partial-rope

d2f1dd2

Move fowarad

046533e

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

jeejeelee marked this pull request as ready for review December 19, 2025 10:53

Merge branch 'main' into fused-qknorm-partial-rope

db6ab2c

jeejeelee requested a review from ProExpertProg December 19, 2025 10:53

jeejeelee added 2 commits December 19, 2025 11:06

Move fowarad

3a2f1b5

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Merge branch 'main' into fused-qknorm-partial-rope

deacddf

ProExpertProg approved these changes Dec 20, 2025


ProExpertProg added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Dec 20, 2025

jeejeelee added 2 commits December 21, 2025 09:03

Merge branch 'main' into fused-qknorm-partial-rope

421fb15

Merge branch 'main' into fused-qknorm-partial-rope

dd5d190

vllm-bot merged commit 097978a into vllm-project:main Dec 22, 2025
88 of 92 checks passed

jeejeelee deleted the fused-qknorm-partial-rope branch December 22, 2025 03:03

tjtanaa mentioned this pull request Dec 22, 2025

[Bug] [ROCm] [Critical]: ROCm build broken #31155

Closed

1 task

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025

[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (vllm-…

0452d74

…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025

[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (vllm-…

986b25b

…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>

jeejeelee mentioned this pull request Jan 6, 2026

[Feature]: Optimizations for MOE models (GLM4.7, DeepSeek series) #31755

Closed

5 tasks

dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026

[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (vllm-…

3008ca9

…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

[Kernel] Enable fused_qknorm_rope_kernel supports partial rope (vllm-…

ce4c5a2

…project#30821) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

vllm-bot merged 12 commits into vllm-project:main from jeejeelee:fused-qknorm-partial-rope


4 participants
