
[Bugfix][Hardware][AMD] Fix hardcoded device in AITER MLA and Fused MOE#31729

Closed
c0de128 wants to merge 1 commit into vllm-project:main from c0de128:fix-aiter-device-hardcoding

Conversation

Contributor

@c0de128 c0de128 commented Jan 5, 2026

Summary

Fix hardcoded device="cuda" in AITER MLA sparse attention and Fused MOE initialization code. This ensures tensors are created on the correct device in multi-GPU setups and improves ROCm compatibility.

Changes

1. vllm/attention/ops/rocm_aiter_mla_sparse.py

Replace 5 instances of device="cuda" with device=q.device or device=q_fp8.device:

  • Lines 46, 49: Use q.device in fp8_mqa_logits_torch()
  • Lines 127, 135: Use q.device in fp8_paged_mqa_logits_torch()
  • Line 194: Use q_fp8.device in rocm_fp8_paged_mqa_logits()
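The device-propagation pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual vLLM code: `FakeTensor` substitutes for `torch.Tensor`, and `alloc_before`/`alloc_after` are hypothetical names standing in for the allocations inside `fp8_mqa_logits_torch` (which in reality use calls like `torch.arange(n, device=q.device)`).

```python
class FakeTensor:
    """Minimal stand-in for torch.Tensor, carrying only a .device attribute."""
    def __init__(self, device: str):
        self.device = device

def alloc_before(q: FakeTensor) -> FakeTensor:
    # Before the fix: allocation pinned to the default CUDA device,
    # regardless of where the query tensor actually lives.
    return FakeTensor(device="cuda")

def alloc_after(q: FakeTensor) -> FakeTensor:
    # After the fix: follow the query tensor's device, mirroring
    # torch.arange(..., device=q.device) in the patched code.
    return FakeTensor(device=q.device)

q = FakeTensor(device="cuda:1")       # query placed on a non-default GPU
print(alloc_before(q).device)         # "cuda"   -> wrong device in multi-GPU setups
print(alloc_after(q).device)          # "cuda:1" -> matches q
```

The same substitution applies to all five call sites: the new tensor inherits its placement from the tensor it will be combined with, so no cross-device copy (or crash) occurs when the model runs on a GPU other than device 0.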

2. vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py

Add device parameter to init_aiter_topK_meta_data() function:

  • Replace 3 instances of device="cuda" with device=device parameter
  • Default value device="cuda" maintains backward compatibility

3. vllm/model_executor/layers/fused_moe/layer.py

Update call site to pass current_platform.device_type to the init function.
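Changes 2 and 3 together can be sketched like this. The signature below is a simplified assumption (the real `init_aiter_topK_meta_data()` takes more arguments and allocates torch tensors); the point is the backward-compatible default plus the explicit device at the call site.

```python
def init_aiter_topK_meta_data(num_experts: int, top_k: int,
                              device: str = "cuda") -> dict:
    # All internal buffer allocations now use device=device instead of
    # a hardcoded device="cuda" (simplified: real code creates tensors).
    return {
        "topk_ids_device": device,
        "topk_weights_device": device,
    }

# Existing call sites keep working unchanged (backward compatible):
meta = init_aiter_topK_meta_data(num_experts=8, top_k=2)
assert meta["topk_ids_device"] == "cuda"

# Updated call site in layer.py passes the platform's device type.
# Shown symbolically here; the real code uses current_platform.device_type.
device_type = "cuda"
meta = init_aiter_topK_meta_data(num_experts=8, top_k=2, device=device_type)
```

Keeping `device="cuda"` as the default means no other caller has to change, while the one updated call site makes the shared-expert TopK buffers land on the platform's actual device.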

Test Plan

  • Code inspection verified correct device propagation
  • Linting passed (ruff format, ruff check)
  • The fix follows existing vLLM patterns for device handling

cc @hongxiayang @tjtanaa


Note

Ensures tensors are allocated on the correct device (CUDA/ROCm) instead of being hardcoded to CUDA, improving multi-GPU correctness and ROCm compatibility.

  • In vllm/v1/attention/ops/rocm_aiter_mla_sparse.py, uses q.device/q_fp8.device for torch.arange, tensor fills, and logits buffers.
  • In vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py, adds device param to init_aiter_topK_meta_data() and switches tensor allocations to device=device.
  • In vllm/model_executor/layers/fused_moe/layer.py, passes current_platform.device_type to init_aiter_topK_meta_data() for shared-expert TopK buffers.

Written by Cursor Bugbot for commit d87dadc.

Contributor Author

c0de128 commented Jan 5, 2026

/ci-run

@mergify mergify bot added the rocm Related to AMD ROCm label Jan 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses the issue of hardcoded device="cuda" in AITER MLA and Fused MOE components. The changes are well-implemented, replacing the hardcoded device with dynamic device information from existing tensors or function parameters. This is a crucial fix for improving ROCm compatibility and ensuring correctness in multi-GPU setups. The approach of adding a device parameter with a backward-compatible default is sound. Overall, this is a solid bugfix that improves the robustness and portability of the code.

Contributor Author

c0de128 commented Jan 5, 2026

Hi @tjtanaa, AMD CI passed (#2374). This fixes hardcoded device parameters in AITER MLA and Fused MOE code. Would appreciate a review when you have time. Thanks!


mergify bot commented Jan 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @c0de128.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 9, 2026
Replace hardcoded device="cuda" with dynamic device inference from input
tensors or platform configuration:

1. rocm_aiter_mla_sparse.py: Use q.device or q_fp8.device (5 instances)
2. rocm_aiter_fused_moe.py: Add device parameter to init_aiter_topK_meta_data
3. layer.py: Pass current_platform.device_type to init function

This ensures tensors are created on the correct device in multi-GPU setups
and improves ROCm compatibility.

Signed-off-by: c0de128 <kevin.mckay@outlook.com>
@c0de128 c0de128 force-pushed the fix-aiter-device-hardcoding branch from 9f73123 to d87dadc on January 9, 2026 at 23:28
Contributor Author

c0de128 commented Jan 9, 2026

/buildkite run

@mergify mergify bot added v1 and removed needs-rebase labels Jan 9, 2026
Contributor Author

c0de128 commented Jan 12, 2026

Closing this PR to reduce maintainer review burden. The fix is available in this branch if needed in the future. Thank you for your time!

@c0de128 c0de128 closed this Jan 12, 2026

Labels

rocm Related to AMD ROCm v1
