
[kernel][perf] support uncontiguous input for rms_norm kernel #28103

Merged
vllm-bot merged 14 commits into vllm-project:main from izhuhaoran:rmsnorm-noncontiguous-input on Nov 21, 2025
Conversation

@izhuhaoran
Contributor

@izhuhaoran izhuhaoran commented Nov 5, 2025

Purpose

Currently, the main branch has a TODO:

vllm/vllm/_custom_ops.py

Lines 331 to 332 in 14a125a

# TODO: Remove this contiguous call when the kernel is updated to support non-contiguous input
input_contiguous = input.contiguous()

As titled, this PR adds support for non-contiguous input to the norm kernel and resolves that TODO, which was introduced by #17735.

Previously, the RMS norm kernel required a .contiguous() call because the q/k tensors used in qk-norm are sliced from qkv via .split() and then reshaped to [num_tokens, num_heads, head_dim]. This yields non-contiguous tensors whose first dimension retains qkv's original stride. The original kernel used input.view({-1, hidden_size}), which fails or produces incorrect results for such tensors.
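To see why the split produces non-contiguous tensors, here is a minimal NumPy sketch (illustrative only — vLLM uses PyTorch tensors and the sizes below are toy values, but the stride behavior is the same): slicing q out of the fused qkv buffer and reshaping it are both views, so the per-token stride still spans the full qkv row.

```python
import numpy as np

num_tokens, num_heads, head_dim = 4, 8, 64
q_size = num_heads * head_dim           # 512
kv_size = q_size                        # toy choice: k and v same size as q
# Fused qkv buffer: each row holds [q | k | v]
qkv = np.arange(num_tokens * (q_size + 2 * kv_size), dtype=np.float32)
qkv = qkv.reshape(num_tokens, q_size + 2 * kv_size)

# "Split" q out of qkv and reshape: both operations return views, not copies
q = qkv[:, :q_size].reshape(num_tokens, num_heads, head_dim)

# The per-token stride still spans the full qkv row, so q is non-contiguous
token_stride = q.strides[0] // q.itemsize
print(token_stride, q.flags["C_CONTIGUOUS"])  # 1536 False
```

A plain `q.reshape(-1, hidden_size)`-style flattening cannot represent this layout without a copy, which is exactly why the old kernel forced `.contiguous()`.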

This PR extends the kernel to accept explicit stride information and supports both 2D and 3D non-contiguous inputs (with the last dimension required to be contiguous).
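As a reference for what "explicit strides with a contiguous last dimension" means, the following is a hedged NumPy sketch of a stride-aware RMS norm. It illustrates the indexing contract only — it is not the CUDA kernel, and the helper name and epsilon default are illustrative.

```python
import numpy as np

def rms_norm_strided(x, weight, eps=1e-6):
    """Reference RMS norm over the last dim. Works for non-contiguous
    2D/3D inputs as long as the last dimension itself is contiguous,
    mirroring the constraint described above."""
    assert x.strides[-1] == x.itemsize, "last dim must be contiguous"
    out = np.empty(x.shape, dtype=x.dtype)
    for idx in np.ndindex(x.shape[:-1]):
        # NumPy resolves each row's base offset from x's explicit strides,
        # analogous to what a stride-aware kernel does per thread block
        row = x[idx].astype(np.float64)
        rms = np.sqrt(np.mean(row * row) + eps)
        out[idx] = (row / rms) * weight
    return out

# Non-contiguous 3D input sliced out of a fused qkv-like buffer
buf = np.random.rand(4, 3 * 8 * 16).astype(np.float32)
q = buf[:, : 8 * 16].reshape(4, 8, 16)
w = np.ones(16, dtype=np.float32)
assert not q.flags["C_CONTIGUOUS"]
y = rms_norm_strided(q, w)
```

Because only the leading strides differ from a packed layout, the result matches running the same computation on a contiguous copy of the data.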

BTW, this PR should be merged after #27165

Test Result

Timeline profile trace

  • Main: (profile trace image)
  • This PR: (profile trace image)

bench serve

Setting: qwen3-0.6b, tp1, num_requests=32, max_concurrency=8, in_len=out_len=1024
Result: mean TTFT improves from 86.76 ms to 85.73 ms, mean TPOT from 3.46 ms to 3.31 ms

  • Main
============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  14.49     
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              2.21      
Output token throughput (tok/s):         2261.23   
Peak output token throughput (tok/s):    2472.00   
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          4522.47   
---------------Time to First Token----------------
Mean TTFT (ms):                          86.76     
Median TTFT (ms):                        95.76     
P99 TTFT (ms):                           100.81    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.46      
Median TPOT (ms):                        3.45      
P99 TPOT (ms):                           3.52      
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.46      
Median ITL (ms):                         3.48      
P99 ITL (ms):                            4.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3621.38   
Median E2EL (ms):                        3618.82   
P99 E2EL (ms):                           3634.57   
==================================================
  • This PR
============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  13.91     
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              2.30      
Output token throughput (tok/s):         2356.32   
Peak output token throughput (tok/s):    2528.00   
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          4712.63   
---------------Time to First Token----------------
Mean TTFT (ms):                          85.73     
Median TTFT (ms):                        93.82     
P99 TTFT (ms):                           97.15     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.31      
Median TPOT (ms):                        3.31      
P99 TPOT (ms):                           3.38      
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.31      
Median ITL (ms):                         3.33      
P99 ITL (ms):                            3.87      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          3475.61   
Median E2EL (ms):                        3476.51   
P99 E2EL (ms):                           3485.74   
==================================================

lm_eval

lm_eval --model local-completions --tasks gsm8k --batch_size 128 --model_args model=/mnt/data/nas/zhr/models/Qwen3-0.6B,base_url=http://localhost:8000/v1/completions,max_retries=3

  • Main
local-completions (model=/mnt/data/nas/zhr/models/Qwen3-0.6B,base_url=http://localhost:8000/v1/completions,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4071|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4094|±  |0.0135|
  • This PR
local-completions (model=/mnt/data/nas/zhr/models/Qwen3-0.6B,base_url=http://localhost:8000/v1/completions,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4086|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.4124|±  |0.0136|

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds support for non-contiguous inputs to the rms_norm kernel, which resolves a TODO and improves performance by avoiding an explicit .contiguous() call. The changes in the CUDA kernel to handle 2D and 3D tensors with explicit strides are well-implemented. My review includes one suggestion to refactor duplicated code in the C++ dispatcher function to improve maintainability.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds support for non-contiguous inputs to the RMS norm kernel, which removes a .contiguous() call and provides a performance improvement. The changes in the CUDA kernel to handle 2D and 3D non-contiguous tensors using explicit strides are well-implemented.

However, I've identified a critical issue where the output tensor out is not guaranteed to be contiguous, which will cause a runtime failure in the C++ kernel. I've left a specific comment with details on how to address this. Once that is fixed, this PR should be in great shape.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds support for non-contiguous inputs to the RMS norm kernel, which removes a .contiguous() call and provides a performance improvement as shown in the benchmarks. The changes in the CUDA kernel correctly handle both 2D and 3D non-contiguous tensors by using explicit stride information. The Python and C++ wrapper code is updated accordingly. My feedback includes one suggestion to refactor the C++ code to improve maintainability by reducing code duplication.

@izhuhaoran izhuhaoran changed the title [kernel][perf] support uncontiguous input for norm kernel [kernel][perf] support uncontiguous input for rms_norm kernel Nov 5, 2025
@izhuhaoran
Contributor Author

@ProExpertProg, would you please take a look when you have time?

@izhuhaoran
Contributor Author

@ProExpertProg I think this PR is ready for review. Would you please have a look?

Collaborator

@ProExpertProg ProExpertProg left a comment


Just one nit vis-à-vis dispatching

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2025
@ProExpertProg
Collaborator

cc @yewentao256

Member

@yewentao256 yewentao256 left a comment


Could you also test a larger model, e.g. R1, using lm_eval and vllm bench as well?

@izhuhaoran
Contributor Author

izhuhaoran commented Nov 20, 2025

> Could you also test a larger model, e.g. R1, using lm_eval and vllm bench as well?

Actually, could we test Qwen3-235b-fp8 instead? R1-fp8 is too large for my current hardware and would result in an OOM.
Here are the test results of Qwen3-235b-fp8:

bench serve

  • Main
============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  327.34    
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         100.10    
Peak output token throughput (tok/s):    104.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          200.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          670.19    
Median TTFT (ms):                        730.73    
P99 TTFT (ms):                           746.87    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.34     
Median TPOT (ms):                        79.32     
P99 TPOT (ms):                           79.81     
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.34     
Median ITL (ms):                         79.25     
P99 ITL (ms):                            80.00     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          81831.64  
Median E2EL (ms):                        81859.35  
P99 E2EL (ms):                           81888.10  
==================================================
  • This PR
============ Serving Benchmark Result ============
Successful requests:                     32        
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  325.45    
Total input tokens:                      32768     
Total generated tokens:                  32768     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         100.68    
Peak output token throughput (tok/s):    104.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          201.37    
---------------Time to First Token----------------
Mean TTFT (ms):                          663.54    
Median TTFT (ms):                        723.32    
P99 TTFT (ms):                           739.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.88     
Median TPOT (ms):                        78.86     
P99 TPOT (ms):                           79.35     
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.88     
Median ITL (ms):                         78.80     
P99 ITL (ms):                            79.56     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          81360.23  
Median E2EL (ms):                        81384.90  
P99 E2EL (ms):                           81407.92  
==================================================

lm_eval

  • Main
local-completions (model=/mnt/data/nas/models/Qwen3-235B-A22B-Thinking-2507-FP8,base_url=http://localhost:8000/v1/completions,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6732|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.6209|±  |0.0134|
  • This PR
local-completions (model=/mnt/data/nas/models/Qwen3-235B-A22B-Thinking-2507-FP8,base_url=http://localhost:8000/v1/completions,max_retries=3), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 128
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6823|±  |0.0128|
|     |       |strict-match    |     5|exact_match|↑  |0.6217|±  |0.0134|

izhuhaoran and others added 4 commits November 20, 2025 11:07
@izhuhaoran izhuhaoran force-pushed the rmsnorm-noncontiguous-input branch from 270c8cc to 5f39dcd Compare November 20, 2025 10:55
@izhuhaoran
Contributor Author

@yewentao256 I've updated the test results for the larger model and fixed the CI issues; please take a look when you have time. Also cc @ProExpertProg

BTW, there's currently a CI error: "ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine." This appears unrelated to this PR.

Member

@yewentao256 yewentao256 left a comment


LGTM, thanks for the work!

@yewentao256 yewentao256 enabled auto-merge (squash) November 20, 2025 16:54
@izhuhaoran
Contributor Author

@yewentao256 The CI failure is unrelated to this PR. The failing Plamo3 test is also failing on main and should be fixed by #29092.

@vllm-bot vllm-bot merged commit a982f5b into vllm-project:main Nov 21, 2025
87 of 89 checks passed
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
jikunshang added a commit to jikunshang/vllm that referenced this pull request Dec 12, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Jan 6, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed

4 participants