[Kernel] Enable fused_qknorm_rope_kernel supports partial rope #30821

Merged
vllm-bot merged 12 commits into vllm-project:main from jeejeelee:fused-qknorm-partial-rope on Dec 22, 2025
Collaborator

@jeejeelee jeejeelee commented Dec 16, 2025

Purpose

Enable fused_qknorm_rope_kernel to support partial RoPE, which can slightly improve the performance of models like GLM4.6-MoE. The gains are not very noticeable because the torch combo kernel has already fused q and norm into a single kernel, meaning this fusion only consolidates 2 kernels into 1 kernel. See #27165.

cc @mgoin @ProExpertProg @yewentao256

Metrics

Test Script

  • Server
vllm serve zai-org/GLM-4.6 -tp 8 -dp 1 --max-num-seqs 128 --served-model-name glm46 --compilation-config '{"pass_config": {"enable_qk_norm_rope_fusion": "1"}}' --no-enable-prefix-caching
  • lm-eval
lm_eval --model local-completions --model_args "model=glm46,base_url=http://localhost:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
  • Benchmark
vllm bench serve --tokenizer zai-org/GLM-4.6 --dataset-name random --ignore-eos --metric-percentiles 90 --model glm46 --num-prompts 100 --random-input-len 2048 --random-output-len 1024 --request-rate inf
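
For an A/B comparison, the server command above can be assembled programmatically so that the JSON passed to `--compilation-config` is always quoted correctly. The following is a minimal sketch, not part of the PR: the helper name `build_serve_command` is hypothetical, and it assumes the "disable" run simply omits the `--compilation-config` flag, which the description does not state.

```python
import json
import shlex


def build_serve_command(model, served_name, enable_fusion):
    """Return the `vllm serve` command line from the test script as one string.

    Hypothetical helper for illustration. Assumption: the "disable" run is
    launched by leaving out --compilation-config entirely.
    """
    cmd = [
        "vllm", "serve", model,
        "-tp", "8", "-dp", "1",
        "--max-num-seqs", "128",
        "--served-model-name", served_name,
        "--no-enable-prefix-caching",
    ]
    if enable_fusion:
        # Same pass_config payload as in the PR description (value kept as "1").
        config = {"pass_config": {"enable_qk_norm_rope_fusion": "1"}}
        cmd += ["--compilation-config", json.dumps(config)]
    # shlex.join quotes the JSON argument so it survives the shell intact.
    return shlex.join(cmd)


if __name__ == "__main__":
    print(build_serve_command("zai-org/GLM-4.6", "glm46", enable_fusion=True))
    print(build_serve_command("zai-org/GLM-4.6", "glm46", enable_fusion=False))
```

Building the argument list first and quoting it once at the end avoids hand-escaping the nested double quotes in the JSON.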

GSM8K

disable

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9378|±  |0.0067|
|     |       |strict-match    |     5|exact_match||0.9333|±  |0.0069|

enable

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9386|±  |0.0066|
|     |       |strict-match    |     5|exact_match||0.9340|±  |0.0068|
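
As a rough sanity check on the two tables above, the accuracy difference can be expressed in units of the reported standard errors. This sketch treats the two runs as independent when combining the standard errors, which is a simplifying assumption (both runs score the same GSM8K questions); the function name is illustrative only.

```python
import math

# (value, stderr) pairs copied from the two GSM8K tables above.
results = {
    "flexible-extract": {"disable": (0.9378, 0.0067), "enable": (0.9386, 0.0066)},
    "strict-match": {"disable": (0.9333, 0.0069), "enable": (0.9340, 0.0068)},
}


def delta_in_stderr_units(before, after):
    """Difference (after - before) divided by the combined standard error.

    Assumption: the runs are independent, so the variances add.
    """
    (v0, s0), (v1, s1) = before, after
    combined = math.sqrt(s0 ** 2 + s1 ** 2)
    return (v1 - v0) / combined


ratios = {
    name: delta_in_stderr_units(r["disable"], r["enable"])
    for name, r in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: delta = {ratio:+.2f} combined standard errors")
```

Both differences come out at a small fraction of one combined standard error, so the numbers do not indicate an accuracy change from enabling the fusion.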

Performance

disable

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  74.06     
Total input tokens:                      204800    
Total generated tokens:                  102400    
Request throughput (req/s):              1.35      
Output token throughput (tok/s):         1382.67   
Peak output token throughput (tok/s):    2017.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4148.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          10940.08  
Median TTFT (ms):                        10941.97  
P90 TTFT (ms):                           19154.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.27     
Median TPOT (ms):                        61.30     
P90 TPOT (ms):                           68.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.27     
Median ITL (ms):                         52.56     
P90 ITL (ms):                            54.70     
==================================================

enable

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  73.48     
Total input tokens:                      204800    
Total generated tokens:                  102400    
Request throughput (req/s):              1.36      
Output token throughput (tok/s):         1393.50   
Peak output token throughput (tok/s):    2100.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          4180.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          10950.81  
Median TTFT (ms):                        10955.08  
P90 TTFT (ms):                           19177.07  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          60.70     
Median TPOT (ms):                        60.73     
P90 TPOT (ms):                           68.31     
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.70     
Median ITL (ms):                         51.93     
P90 ITL (ms):                            53.94     
==================================================
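
To put "slightly improve" into numbers, the relative change between the two benchmark reports can be computed directly. This is a small sketch using only figures copied from the reports above; it covers a single run per configuration, so run-to-run noise is not accounted for.

```python
# Selected metrics copied from the "disable" and "enable" benchmark reports.
disable = {
    "Benchmark duration (s)": 74.06,
    "Output token throughput (tok/s)": 1382.67,
    "Mean TTFT (ms)": 10940.08,
    "Mean TPOT (ms)": 61.27,
    "Median ITL (ms)": 52.56,
}
enable = {
    "Benchmark duration (s)": 73.48,
    "Output token throughput (tok/s)": 1393.50,
    "Mean TTFT (ms)": 10950.81,
    "Mean TPOT (ms)": 60.70,
    "Median ITL (ms)": 51.93,
}


def relative_change_pct(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {
    name: relative_change_pct(disable[name], enable[name]) for name in disable
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

The changes are all around one percent or less: throughput rises and per-token latency falls slightly, while mean TTFT is essentially unchanged.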

Test Plan

Test Result


Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee marked this pull request as draft December 16, 2025 23:36
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully introduces support for partial Rotary Positional Embeddings (RoPE) in the fused_qknorm_rope_kernel. The changes correctly propagate the rotary_dim parameter through the kernel and launch functions, ensuring that RoPE is applied only to the specified dimensions. The input validation for cos_sin_cache has been updated to reflect the new rotary_dim constraint, and the test suite has been extended to cover both full and partial RoPE scenarios. The logic for calculating pairOffset and conditionally applying RoPE within the kernel appears to be correctly adapted for this functionality.
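
What "partial RoPE" means for a single Q or K head can be sketched in plain Python: RMSNorm is applied over the whole head, then only the first `rotary_dim` elements are rotated and the remainder passes through unchanged. This is a reference sketch under stated assumptions, not the kernel's implementation: it uses the NeoX-style pairing (element `i` with element `i + rotary_dim / 2`), a frequency base of 10000, and illustrative function names that do not come from the PR.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def partial_rope_neox(x, cos, sin, rotary_dim):
    """Rotate only the first rotary_dim elements (NeoX-style pairing assumed).

    Elements at index >= rotary_dim are copied through untouched.
    """
    half = rotary_dim // 2
    out = list(x)
    for i in range(half):
        a, b = x[i], x[i + half]
        out[i] = a * cos[i] - b * sin[i]
        out[i + half] = b * cos[i] + a * sin[i]
    return out


def qk_norm_partial_rope(head, weight, position, rotary_dim,
                         base=10000.0, eps=1e-6):
    """Unfused reference: RMSNorm, then RoPE on the first rotary_dim dims.

    The base value and the on-the-fly cos/sin computation are assumptions;
    a real kernel would typically read them from a precomputed cache.
    """
    normed = rms_norm(head, weight, eps)
    half = rotary_dim // 2
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    cos = [math.cos(position * f) for f in inv_freq]
    sin = [math.sin(position * f) for f in inv_freq]
    return partial_rope_neox(normed, cos, sin, rotary_dim)


if __name__ == "__main__":
    head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    weight = [1.0] * len(head)
    # head_dim = 8, rotary_dim = 4: the last four elements are only normalized.
    print(qk_norm_partial_rope(head, weight, position=5, rotary_dim=4))
```

With `rotary_dim` equal to the head dimension this reduces to ordinary full RoPE, which is why one code path can serve both the full and partial test cases the review mentions.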

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee marked this pull request as ready for review December 19, 2025 10:53
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@ProExpertProg ProExpertProg added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Dec 20, 2025
@vllm-bot vllm-bot merged commit 097978a into vllm-project:main Dec 22, 2025
88 of 92 checks passed
@jeejeelee jeejeelee deleted the fused-qknorm-partial-rope branch December 22, 2025 03:03
Collaborator

tjtanaa commented Dec 22, 2025

@jeejeelee This PR broke ROCm builds. I will open an issue to track this.

DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o -MF CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o.d -o CMakeFiles/_C.dir/csrc/fused_qknorm_rope_kernel.hip.o -x hip -c /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/fused_qknorm_rope_kernel.hip
#13 132.3 /app/vllm/build/temp.linux-x86_64-cpython-312/csrc/fused_qknorm_rope_kernel.hip:409:11: error: unused variable 'rotary_dim' [-Werror,-Wunused-variable]
#13 132.3   409 |   int64_t rotary_dim = cos_sin_cache.size(1);
#13 132.3       |           ^~~~~~~~~~
#13 132.3 1 error generated when compiling for gfx90a.
#13 138.6 [8/37] Building CXX object CMakeFiles/_rocm_C.dir/csrc/rocm/torch_bindings.cpp.o
#13 138.6 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
#13 138.7 [9/37] Building CXX object CMakeFiles/_moe_C.dir/csrc/moe/torch_bindings.cpp.o
#13 138.7 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
#13 143.6 [10/37] Building CXX object CMakeFiles/_C.dir/csrc/torch_bindings.cpp.o
#13 143.6 cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…project#30821)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…project#30821)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026