[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model#27165
Conversation
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Code Review
This pull request introduces a fused CUDA kernel for QK Normalization and RoPE for the Qwen model, aiming to improve inference performance. The fusion is implemented as a torch.compile pass. The changes include the CUDA kernel, its PyTorch bindings, the fusion pass logic, and integration into the model and build system. A new test is also added to verify the fusion.
The overall approach is solid and follows existing patterns in the codebase for custom ops and fusions. However, I've found a critical issue in the fusion pass implementation that causes the fusion to produce incorrect results. The output of the fused operation is not correctly propagated in the graph, making the fusion effectively a no-op. Please see the detailed comment for the fix.
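The no-op failure mode described above is a general hazard in graph-rewrite passes: if the pass inserts the fused node but downstream consumers still read the old nodes' outputs, the fused op is dead code. A minimal plain-Python sketch (not actual torch.fx; `Node` and `replace_all_uses` are stand-ins for fx's node and `Node.replace_all_uses_with`):

```python
# Sketch of why a fusion pass must rewire users of the replaced nodes.
class Node:
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)

def replace_all_uses(graph, old, new):
    """Point every consumer of `old` at `new` (what fx's
    Node.replace_all_uses_with does)."""
    for node in graph:
        node.inputs = [new if i is old else i for i in node.inputs]

# Original subgraph: q_proj -> q_norm -> rope -> attn
q = Node("q_proj")
q_norm = Node("q_norm", [q])
rope = Node("rope", [q_norm])
attn = Node("attn", [rope])
graph = [q, q_norm, rope, attn]

# Insert the fused op but forget to rewire: attn still reads `rope`,
# so the fusion is effectively a no-op (the bug described above).
fused = Node("fused_qk_norm_rope", [q])
graph.insert(1, fused)
assert attn.inputs == [rope]

# Correct pass: redirect all users of the old output to the fused node.
replace_all_uses(graph, rope, fused)
assert attn.inputs == [fused]
```

After the rewire, dead-code elimination can drop the now-unused `q_norm`/`rope` nodes.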
The target graph for replacement is quite large. Using pattern matching here, as we do in other passes, may not scale effectively and could become a maintenance burden.
ProExpertProg left a comment:
Two nits in the kernel, otherwise LGTM!
Force-pushed b9cee22 to 32e0171
Force-pushed 32e0171 to b23467c
Hi @izhuhaoran, this PR breaks loading of Qwen3VL MoE and Dense models on ROCm RDNA3. @tjtanaa @DarkLight1337 can you take a look, please?
@ZJY0516 @izhuhaoran @tjtanaa Yes, PR #28500 solves the problem.
Summary: vllm-project#27165 introduced an issue where, when running on AMD hardware, we would try to load `FUSED_QK_ROPE_OP = torch.ops._C.fused_qk_norm_rope.default`, which is CUDA-only, and fail with:

```
(Worker_TP5 pid=4058) ERROR 11-11 18:07:51 [multiproc_executor.py:639] WorkerProc failed to start.
(Worker_TP5 pid=4058) ERROR 11-11 18:07:51 [multiproc_executor.py:639] Traceback (most recent call last):
  File ".../vllm/v1/executor/multiproc_executor.py", line 613, in worker_main
    worker = WorkerProc(*args, **kwargs)
  File ".../vllm/v1/executor/multiproc_executor.py", line 468, in __init__
    self.worker.load_model()
  File ".../vllm/v1/worker/gpu_worker.py", line 266, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File ".../vllm/v1/worker/gpu_model_runner.py", line 3033, in load_model
    self.model = model_loader.load_model(
  File ".../vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
    model = initialize_model(
  File ".../vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File ".../vllm/model_executor/models/deepseek_v2.py", line 1349, in __init__
    self.model = DeepseekV2Model(
  File ".../vllm/compilation/decorators.py", line 293, in __init__
    TorchCompileWrapperWithCustomDispatcher.__init__(
  File ".../vllm/compilation/wrapper.py", line 42, in __init__
    backend = vllm_config.compilation_config.init_backend(vllm_config)
  File ".../vllm/config/compilation.py", line 770, in init_backend
    from vllm.compilation.backends import VllmBackend
  File ".../vllm/compilation/backends.py", line 40, in <module>
    from .pass_manager import PostGradPassManager
  File ".../vllm/compilation/pass_manager.py", line 20, in <module>
    from .qk_norm_rope_fusion import QKNormRoPEFusionPass
  File ".../vllm/compilation/qk_norm_rope_fusion.py", line 24, in <module>
    FUSED_QK_ROPE_OP = torch.ops._C.fused_qk_norm_rope.default
  File ".../torch/_ops.py", line 1361, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' '_C' object has no attribute 'fused_qk_norm_rope'
```

(Worker_TP5 and Worker_TP6 emitted the same traceback interleaved; one copy is shown.)

We should only import `QKNormRoPEFusionPass` when `is_cuda`, instead of `is_cuda_alike`, which includes ROCm.

Test Plan: Patch the change and confirm vLLM starts properly on AMD (DeepSeek):

```
Ran 500/500 requests in 144.51s
Success rate: 100.00%
QPS: 3.46
Avg latency: 4.489s
Avg TTFT (client): 161.38ms
P50 TTFT (client): 143.11ms
P99 TTFT (client): 266.50ms
Avg TTIT (client): 28.85ms
P50 TTIT (client): 28.94ms
P99 TTIT (client): 29.38ms
Avg TTFT (server): 224.00ms
Avg TTIT (server): 28.62ms
Avg prefill len: 3293.05 tokens
P50 prefill len: 3293.00 tokens
P99 prefill len: 3335.00 tokens
Avg decode len: 150.00 tokens
P50 decode len: 150.00 tokens
P99 decode len: 150.00 tokens
Peak TPGS: 66.375
```

```
[2025-11-12 16:54:55,483] [rank 0] [INFO] Evaluation results on task gsm8k.8_shot.1_gen: em: 0.960576 | f1: 0.960576 | em_maj1@1: 0.960576 | f1_maj1@1: 0.960576
```

Differential Revision: D86838348
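The fix amounts to gating the pass behind an explicit CUDA check. A hedged sketch of the idea, not vLLM's actual code — `build_passes` and the pass names are illustrative stand-ins for the pass-manager wiring; only `QKNormRoPEFusionPass` and the `is_cuda` vs `is_cuda_alike` distinction come from the log above:

```python
# Sketch: register the CUDA-only fusion pass only on genuine CUDA.
def build_passes(platform: str) -> list[str]:
    passes = ["NoOpEliminationPass"]  # illustrative always-on pass
    # torch.ops._C.fused_qk_norm_rope only exists in the CUDA build,
    # so importing QKNormRoPEFusionPass on ROCm raises AttributeError.
    # Gate on "cuda" (is_cuda), not "cuda-alike" (which includes ROCm).
    if platform == "cuda":
        passes.append("QKNormRoPEFusionPass")
    return passes

assert "QKNormRoPEFusionPass" in build_passes("cuda")
assert "QKNormRoPEFusionPass" not in build_passes("rocm")
```

Deferring the `torch.ops._C.*` lookup until after the platform check also avoids paying for the import at module load time on unsupported backends.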
…del (vllm-project#27165) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
I think this is broken: #33295
Purpose
Inspired by TensorRT-LLM. This PR is a follow-up to #27018; it fuses QNorm, KNorm, and RoPE into a single CUDA kernel for the Qwen3 model, improving inference performance. The fusion is implemented as a custom torch.compile pass that users can opt into.
See #27018 for more details.
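For intuition, the three ops being fused have simple unfused reference semantics: RMS-normalize each q/k head, then apply rotary embedding to the normalized values. A tiny plain-Python sketch (hypothetical helper names, toy dimensions — not the kernel or vLLM's API; the real kernel does this per token in one memory pass):

```python
import math

def rms_norm(x, w, eps=1e-6):
    # RMSNorm: scale by reciprocal RMS, then elementwise weight.
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale * wi for v, wi in zip(x, w)]

def rope(x, pos, base=10000.0):
    # Rotate pairs (x[2i], x[2i+1]) by position-dependent angles.
    out = []
    for i in range(0, len(x), 2):
        theta = pos * base ** (-i / len(x))
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def fused_qk_norm_rope(q, k, qw, kw, pos):
    # Semantics the fused CUDA kernel must reproduce (unfused reference).
    return rope(rms_norm(q, qw), pos), rope(rms_norm(k, kw), pos)

q, k = [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]
w = [1.0] * 4
q_out, k_out = fused_qk_norm_rope(q, k, w, w, pos=0)
# At pos=0 the rotation angle is zero, so RoPE is the identity.
assert q_out == rms_norm(q, w)
```

Fusing these avoids materializing the normalized q/k tensors between the norm and RoPE kernels, which is where the performance win comes from.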
Result GPU Trace
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.