Fix where we load CUDA-only kernel when running on AMD hardware #28605

liuzijing2014 wants to merge 1 commit into vllm-project:main from
Conversation
Summary:

vllm-project#27165 introduced an issue where, when running on AMD hardware, we would try to load `FUSED_QK_ROPE_OP = torch.ops._C.fused_qk_norm_rope.default`, which is CUDA-only, and fail with:

```
(Worker_TP5 pid=4058) ERROR 11-11 18:07:51 [multiproc_executor.py:639] WorkerProc failed to start.
Traceback (most recent call last):
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/v1/executor/multiproc_executor.py", line 613, in worker_main
    worker = WorkerProc(*args, **kwargs)
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/v1/executor/multiproc_executor.py", line 468, in __init__
    self.worker.load_model()
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/v1/worker/gpu_worker.py", line 266, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/v1/worker/gpu_model_runner.py", line 3033, in load_model
    self.model = model_loader.load_model(
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
    model = initialize_model(
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/model_executor/models/deepseek_v2.py", line 1349, in __init__
    self.model = DeepseekV2Model(
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/compilation/decorators.py", line 293, in __init__
    TorchCompileWrapperWithCustomDispatcher.__init__(
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/compilation/wrapper.py", line 42, in __init__
    backend = vllm_config.compilation_config.init_backend(vllm_config)
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/config/compilation.py", line 770, in init_backend
    from vllm.compilation.backends import VllmBackend
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/compilation/backends.py", line 40, in <module>
    from .pass_manager import PostGradPassManager
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/compilation/pass_manager.py", line 20, in <module>
    from .qk_norm_rope_fusion import QKNormRoPEFusionPass
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/vllm/compilation/qk_norm_rope_fusion.py", line 24, in <module>
    FUSED_QK_ROPE_OP = torch.ops._C.fused_qk_norm_rope.default
  File "/dev/shm/uid-99/83e08bb2-seed-nspid4026555323_cgpid2465342-ns-4026555243/torch/_ops.py", line 1361, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' '_C' object has no attribute 'fused_qk_norm_rope'
```

We should only import `QKNormRoPEFusionPass` when `is_cuda`, instead of `is_cuda_alike`, which includes ROCm.

Test Plan:

Patched the change and was able to start vLLM on AMD properly (DeepSeek):

```
Ran 500/500 requests in 144.51s
Success rate: 100.00%
QPS: 3.46
Avg latency: 4.489s
Avg TTFT (client): 161.38ms
P50 TTFT (client): 143.11ms
P99 TTFT (client): 266.50ms
Avg TTIT (client): 28.85ms
P50 TTIT (client): 28.94ms
P99 TTIT (client): 29.38ms
Avg TTFT (server): 224.00ms
Avg TTIT (server): 28.62ms
Avg prefill len: 3293.05 tokens
P50 prefill len: 3293.00 tokens
P99 prefill len: 3335.00 tokens
Avg decode len: 150.00 tokens
P50 decode len: 150.00 tokens
P99 decode len: 150.00 tokens
Peak TPGS: 66.375
```

```
[2025-11-12 16:54:55,483] [rank 0] [INFO] Evaluation results on task gsm8k.8_shot.1_gen: em: 0.960576 | f1: 0.960576 | em_maj1@1: 0.960576 | f1_maj1@1: 0.960576
```

Differential Revision: D86838348
Code Review
This pull request correctly addresses a crash on AMD hardware by ensuring that the CUDA-only fused_qk_norm_rope kernel and its associated fusion passes are only loaded on CUDA platforms. The changes in csrc/ops.h, vllm/compilation/fix_functionalization.py, and vllm/compilation/pass_manager.py are logical and effectively solve the reported issue. However, I've identified a related inconsistency in the configuration validation logic that was not updated as part of this change. This could lead to a NameError on ROCm systems if a user enables the corresponding fusion pass, as detailed in my comment.
```diff
-from .qk_norm_rope_fusion import QKNormRoPEFusionPass
+if current_platform.is_cuda():
+    from .qk_norm_rope_fusion import QKNormRoPEFusionPass
```
While moving this import under `is_cuda()` is correct to fix the crash on ROCm, it introduces a potential inconsistency. The configuration validation for this fusion in `vllm/config/compilation.py` still uses `is_cuda_alike()`.

This means a user on a ROCm platform could set `enable_qk_norm_rope_fusion=True`, and it would pass the configuration check. However, this would lead to a `NameError` here, as `QKNormRoPEFusionPass` would not be imported.

To fix this, please also update the check in `vllm/config/compilation.py` (line 187) to use `current_platform.is_cuda()` instead of `current_platform.is_cuda_alike()`.
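The mismatch can be made concrete with a small sketch (hypothetical helper functions, not vLLM's actual code; the real predicates live on `current_platform`): `is_cuda_alike` is true on both CUDA and ROCm, `is_cuda` only on CUDA, so the validation site and the pass-registration site must agree on one predicate.

```python
# Hypothetical sketch of keeping both guard sites on the same predicate.

def validate_config(enable_qk_norm_rope_fusion: bool, is_cuda: bool) -> None:
    """Config-time check, aligned with the import guard in build_pass_list.

    Before the suggested fix, this site effectively used is_cuda_alike,
    so a ROCm config passed validation and failed later with a NameError.
    """
    if enable_qk_norm_rope_fusion and not is_cuda:
        raise ValueError("enable_qk_norm_rope_fusion requires a CUDA platform")


def build_pass_list(enable_qk_norm_rope_fusion: bool, is_cuda: bool) -> list:
    """Mirrors the pass manager: the fusion pass is only importable on CUDA."""
    passes = []
    if is_cuda and enable_qk_norm_rope_fusion:
        # Stands in for: from .qk_norm_rope_fusion import QKNormRoPEFusionPass
        passes.append("QKNormRoPEFusionPass")
    return passes
```

With both sites checking `is_cuda`, a ROCm user who sets the flag gets a clear configuration error up front instead of a `NameError` deep inside pass construction.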
💡 Codex Review

Here are some automated review suggestions for this pull request.
```cpp
#ifndef USE_ROCM
void fused_qk_norm_rope(torch::Tensor& qkv, int64_t num_heads_q,
                        int64_t num_heads_k, int64_t num_heads_v,
                        int64_t head_dim, double eps, torch::Tensor& q_weight,
                        torch::Tensor& k_weight, torch::Tensor& cos_sin_cache,
                        bool is_neox, torch::Tensor& position_ids);
#endif
```
Guard removes declaration but binding still references op
The new `#ifndef USE_ROCM` guard in `csrc/ops.h` (lines 95-101) removes the declaration of `fused_qk_norm_rope` on ROCm builds, but `csrc/torch_bindings.cpp` still unconditionally registers the custom op at lines 178-184 via `ops.impl("fused_qk_norm_rope", torch::kCUDA, &fused_qk_norm_rope);`. When building with `USE_ROCM` defined, the compiler no longer sees any declaration of that symbol before it is used, so `torch_bindings.cpp` fails to compile on AMD/ROCm even though the function definition still exists in `fused_qknorm_rope_kernel.cu`. Either the declaration needs to remain available, or the binding needs to be wrapped in the same guard; otherwise every ROCm build is broken.
I think this error has been solved by #28500