[Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE #31729
c0de128 wants to merge 1 commit into vllm-project:main
Conversation
/ci-run
Code Review
This pull request correctly addresses the issue of hardcoded device="cuda" in AITER MLA and Fused MOE components. The changes are well-implemented, replacing the hardcoded device with dynamic device information from existing tensors or function parameters. This is a crucial fix for improving ROCm compatibility and ensuring correctness in multi-GPU setups. The approach of adding a device parameter with a backward-compatible default is sound. Overall, this is a solid bugfix that improves the robustness and portability of the code.
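The pattern the review praises can be sketched as follows. This is a hypothetical minimal function (the real signatures live in `rocm_aiter_mla_sparse.py`): helper tensors are allocated on the input's device rather than a hardcoded `device="cuda"`.

```python
import torch

def mqa_logits_sketch(q: torch.Tensor):
    """Hypothetical sketch of the fix: helper tensors follow the input's
    device instead of a hardcoded device="cuda", so they land on the
    correct GPU in multi-GPU runs and work on ROCm builds of PyTorch."""
    # Before the fix: torch.arange(q.shape[0], device="cuda")
    positions = torch.arange(q.shape[0], device=q.device)
    # Before the fix: torch.empty(..., device="cuda")
    logits = torch.empty(q.shape[0], q.shape[0],
                         dtype=torch.float32, device=q.device)
    return positions, logits
```

Because the device is taken from `q`, the same code runs unchanged on CPU, CUDA, or ROCm, and on whichever GPU index `q` lives on.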
This pull request has merge conflicts that must be resolved before it can be merged.
Replace hardcoded `device="cuda"` with dynamic device inference from input tensors or platform configuration:

1. `rocm_aiter_mla_sparse.py`: use `q.device` or `q_fp8.device` (5 instances)
2. `rocm_aiter_fused_moe.py`: add a `device` parameter to `init_aiter_topK_meta_data`
3. `layer.py`: pass `current_platform.device_type` to the init function

This ensures tensors are created on the correct device in multi-GPU setups and improves ROCm compatibility.

Signed-off-by: c0de128 <kevin.mckay@outlook.com>
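The backward-compatible shape of change 2 can be sketched as below (hypothetical function name and parameters; the real `init_aiter_topK_meta_data()` lives in `rocm_aiter_fused_moe.py`): keeping `device="cuda"` as the default means existing call sites are unaffected, while new callers pass the device explicitly.

```python
import torch

def init_topk_meta_data_sketch(num_tokens: int, topk: int,
                               device: str = "cuda"):
    """Hypothetical sketch of adding a device parameter with a
    backward-compatible default: old callers that pass no device still
    get "cuda"; new callers supply the platform's device explicitly."""
    # Before the fix, these allocations used device="cuda" unconditionally.
    topk_ids = torch.zeros(num_tokens, topk,
                           dtype=torch.int32, device=device)
    topk_weights = torch.ones(num_tokens, topk,
                              dtype=torch.float32, device=device)
    return topk_ids, topk_weights
```

A default parameter is the lowest-risk way to thread a device through: the signature change is additive, so no existing caller breaks.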
Force-pushed from 9f73123 to d87dadc
/buildkite run
Closing this PR to reduce maintainer review burden. The fix is available in this branch if needed in the future. Thank you for your time!
Summary

Fix hardcoded `device="cuda"` in AITER MLA sparse attention and Fused MOE initialization code. This ensures tensors are created on the correct device in multi-GPU setups and improves ROCm compatibility.

Changes

1. `vllm/attention/ops/rocm_aiter_mla_sparse.py`

Replace 5 instances of `device="cuda"` with `device=q.device` or `device=q_fp8.device`:
- `q.device` in `fp8_mqa_logits_torch()`
- `q.device` in `fp8_paged_mqa_logits_torch()`
- `q_fp8.device` in `rocm_fp8_paged_mqa_logits()`

2. `vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`

Add a `device` parameter to the `init_aiter_topK_meta_data()` function:
- Replace hardcoded `device="cuda"` with the `device` parameter
- The default of `device="cuda"` maintains backward compatibility

3. `vllm/model_executor/layers/fused_moe/layer.py`

Update the call site to pass `current_platform.device_type` to the init function.

Test Plan
cc @hongxiayang @tjtanaa
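The call-site change in `layer.py` amounts to threading the platform's device string into the init function. A rough sketch follows, using a stand-in for vLLM's `current_platform.device_type` (not imported here) and a hypothetical stand-in for the init function:

```python
import torch

# Stand-in for vllm.platforms.current_platform.device_type. On both
# NVIDIA and ROCm builds of PyTorch this resolves to "cuda" (assumption
# for this sketch), and to "cpu" when no GPU is present.
device_type = "cuda" if torch.cuda.is_available() else "cpu"

def init_meta_data_sketch(n: int, device: str = "cuda"):
    # Hypothetical stand-in for init_aiter_topK_meta_data().
    return torch.zeros(n, dtype=torch.int32, device=device)

# After the fix, the call site passes the platform device explicitly
# instead of relying on the hardcoded "cuda" default.
buf = init_meta_data_sketch(8, device=device_type)
```

Resolving the device once at the call site keeps the init function itself platform-agnostic.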
Note

Ensures tensors are allocated on the correct device (CUDA/ROCm) instead of being hardcoded to CUDA, improving multi-GPU correctness and ROCm compatibility.

- `vllm/v1/attention/ops/rocm_aiter_mla_sparse.py`: uses `q.device`/`q_fp8.device` for `torch.arange`, tensor fills, and logits buffers.
- `vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`: adds a `device` param to `init_aiter_topK_meta_data()` and switches tensor allocations to `device=device`.
- `vllm/model_executor/layers/fused_moe/layer.py`: passes `current_platform.device_type` to `init_aiter_topK_meta_data()` for shared-expert TopK buffers.

Written by Cursor Bugbot for commit d87dadc.