[kernel][perf] support uncontiguous input for rms_norm kernel #28103
vllm-bot merged 14 commits into vllm-project:main
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request successfully adds support for non-contiguous inputs to the rms_norm kernel, which resolves a TODO and improves performance by avoiding an explicit .contiguous() call. The changes in the CUDA kernel to handle 2D and 3D tensors with explicit strides are well-implemented. My review includes one suggestion to refactor duplicated code in the C++ dispatcher function to improve maintainability.
Code Review
This pull request successfully adds support for non-contiguous inputs to the RMS norm kernel, which removes a .contiguous() call and provides a performance improvement. The changes in the CUDA kernel to handle 2D and 3D non-contiguous tensors using explicit strides are well-implemented.
However, I've identified a critical issue: the output tensor out is not guaranteed to be contiguous, which will cause a runtime failure in the C++ kernel. I've left a specific comment with details on how to address this. Once that is fixed, this PR should be in great shape.
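To make the reviewer's point concrete, here is a minimal pure-Python sketch (illustrative only, not vLLM code) of how contiguity can be decided from a tensor's shape and strides, which is the property that must hold for `out` before the kernel writes into it:

```python
# Hedged sketch: decide contiguity from shape + strides, the same rule
# torch uses for Tensor.is_contiguous(). Names here are illustrative.
def is_contiguous(shape, strides):
    """A view is contiguous iff each stride equals the product of the
    sizes of all trailing dimensions (size-1 dims are ignored)."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

# A q view sliced from a packed qkv buffer: shape [4, 2, 3] but the token
# dimension keeps qkv's row stride (18 instead of 6) -> not contiguous.
assert not is_contiguous([4, 2, 3], [18, 3, 1])
# A freshly allocated output buffer of the same shape is contiguous.
assert is_contiguous([4, 2, 3], [6, 3, 1])
```

A dispatcher could run this check (or `TORCH_CHECK(out.is_contiguous(), ...)` on the C++ side) and fail fast instead of writing garbage through mismatched strides.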
Code Review
This pull request successfully adds support for non-contiguous inputs to the RMS norm kernel, which removes a .contiguous() call and provides a performance improvement as shown in the benchmarks. The changes in the CUDA kernel correctly handle both 2D and 3D non-contiguous tensors by using explicit stride information. The Python and C++ wrapper code is updated accordingly. My feedback includes one suggestion to refactor the C++ code to improve maintainability by reducing code duplication.
@ProExpertProg, would you please take a look when you have time?
@ProExpertProg I think this PR is ready for review. Would you please have a look?
ProExpertProg left a comment
Just one nit vis-a-vis dispatching
cc @yewentao256
Actually, could we test Qwen3-235b-fp8 instead? R1-fp8 is too large for my current hardware and would result in an OOM.
bench serve
lm_eval
Force-pushed from 270c8cc to 5f39dcd
@yewentao256 I've updated the test results for the larger models and fixed the CI issues; please take a look when you have time. Also cc @ProExpertProg. BTW, there's currently a CI error:
yewentao256 left a comment
LGTM, thanks for the work!
@yewentao256 The CI failure is unrelated to this PR. The failing Plamo3 test is also failing on main and should be fixed by #29092 |
Purpose
Currently, the main branch has a TODO:
vllm/vllm/_custom_ops.py
Lines 331 to 332 in 14a125a
As titled, this PR supports non-contiguous input for the norm kernel and resolves this TODO, which was introduced by #17735.
Previously, the RMS norm kernel required `.contiguous()` because the `q`/`k` tensors used in qk-norm are sliced from `qkv` via `.split()` and then reshaped to `[num_tokens, num_heads, head_dim]`. This yields non-contiguous tensors whose first dimension keeps `qkv`'s original stride. The original kernel used `input.view({-1, hidden_size})`, which fails or produces incorrect results for such tensors.

This PR extends the kernel to accept explicit stride information and supports both 2D and 3D non-contiguous inputs (the last dimension must still be contiguous).
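The stride situation described above can be sketched in plain Python (illustrative names and sizes only, not vLLM code): a `q` slice of a packed `qkv` buffer keeps `qkv`'s row stride, and indexing it with that explicit stride, as the extended kernel does, matches what an explicit contiguous copy would produce:

```python
# Hedged sketch of why the q view is non-contiguous and how explicit
# strides recover correct indexing. All names/sizes are illustrative.
num_tokens, num_heads, head_dim = 4, 2, 3
hidden = num_heads * head_dim      # 6
qkv_row = 3 * hidden               # q|k|v packed per token -> row stride 18

# Flat qkv buffer; each element's value is its flat offset, for easy checks.
qkv = list(range(num_tokens * qkv_row))

# "q = qkv.split(hidden)[0].view(num_tokens, num_heads, head_dim)":
# the view keeps qkv's row stride (18, not 6), so it is NOT contiguous.
input_stride = qkv_row

def q_strided(token, head, d):
    # What the extended kernel does: index via the explicit row stride.
    return qkv[token * input_stride + head * head_dim + d]

# The old path: materialize a contiguous copy of the q slice first.
q_contig = [qkv[t * input_stride + i]
            for t in range(num_tokens) for i in range(hidden)]

def q_copy(token, head, d):
    return q_contig[token * hidden + head * head_dim + d]

assert all(
    q_strided(t, h, d) == q_copy(t, h, d)
    for t in range(num_tokens)
    for h in range(num_heads)
    for d in range(head_dim)
)
```

Passing `input_stride` into the kernel thus makes the `.contiguous()` copy (and its extra memory traffic) unnecessary.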
BTW, this PR should be merged after #27165
Test Result
timeline profile trace
bench serve
setting: qwen3-0.6b, tp1, num_requests=32, max_concurrency=8, in_len=out_len=1024
result: TTFT from 86.76ms to 85.73ms, TPOT from 3.46ms to 3.31ms
lm_eval
lm_eval --model local-completions --tasks gsm8k --batch_size 128 --model_args model=/mnt/data/nas/zhr/models/Qwen3-0.6B,base_url=http://localhost:8000/v1/completions,max_retries=3