[Bugfix][Hardware][AMD] Fix device parameter and exception handling #31552

c0de128 wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces two important bug fixes for ROCm/AMD hardware support. First, it correctly parameterizes the device in several tensor creation helper functions in vllm/compilation/fusion.py, removing a hardcoded "cuda" value and allowing for explicit device selection. Second, it improves exception handling in vllm/model_executor/layers/quantization/kernels/scaled_mm/aiter.py by replacing a broad except Exception with a more specific except (ImportError, ModuleNotFoundError), which prevents masking unrelated errors. The changes are well-implemented and improve the robustness and cross-platform compatibility of the codebase. I've suggested a further improvement in vllm/compilation/fusion.py to make the default device platform-aware, which will enhance support for other hardware backends like XPU.
vllm/compilation/fusion.py
Outdated
```python
def empty_bf16(*args, device="cuda", **kwargs):
    return torch.empty(*args, **kwargs, dtype=torch.bfloat16, device=device)
```
To improve platform-agnosticism and ensure correctness on non-CUDA/ROCm devices (like XPU), it's better to use a dynamic default for the device. Using current_platform.device_type will correctly select the device type for the active platform.
```diff
-def empty_bf16(*args, device="cuda", **kwargs):
+def empty_bf16(*args, device=current_platform.device_type, **kwargs):
     return torch.empty(*args, **kwargs, dtype=torch.bfloat16, device=device)
```
vllm/compilation/fusion.py
Outdated
```python
def empty_fp32(*args, device="cuda", **kwargs):
    return torch.empty(*args, **kwargs, dtype=torch.float32, device=device)
```
Similar to the empty_bf16 function, using current_platform.device_type as the default device will make this helper more robust across different hardware platforms.
```diff
-def empty_fp32(*args, device="cuda", **kwargs):
+def empty_fp32(*args, device=current_platform.device_type, **kwargs):
     return torch.empty(*args, **kwargs, dtype=torch.float32, device=device)
```
vllm/compilation/fusion.py
Outdated
```python
def empty_i32(*args, device="cuda", **kwargs):
    return torch.empty(*args, **kwargs, dtype=torch.int32, device=device)
```
To maintain consistency and improve platform support, please update the default device to current_platform.device_type.
```diff
-def empty_i32(*args, device="cuda", **kwargs):
+def empty_i32(*args, device=current_platform.device_type, **kwargs):
     return torch.empty(*args, **kwargs, dtype=torch.int32, device=device)
```
vllm/compilation/fusion.py
Outdated
```python
def empty_i64(*args, device="cuda", **kwargs):
    return torch.empty(*args, **kwargs, dtype=torch.int64, device=device)
```
Finally, please update this function to use current_platform.device_type for the default device to ensure consistent, platform-agnostic behavior.
```diff
-def empty_i64(*args, device="cuda", **kwargs):
+def empty_i64(*args, device=current_platform.device_type, **kwargs):
     return torch.empty(*args, **kwargs, dtype=torch.int64, device=device)
```
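All four suggestions apply the same pattern: replace a hardcoded `"cuda"` default with a platform-derived one. Below is a runnable, torch-free sketch of how that default behaves; `current_platform` here is a hypothetical stand-in for vLLM's `vllm.platforms.current_platform`, and a dict return replaces `torch.empty` so the example needs no GPU. The key point is that Python evaluates default argument values once, at function definition time, so `current_platform.device_type` is resolved when the module is imported, while callers can still override it per call:

```python
from types import SimpleNamespace

# Hypothetical stand-in for vllm.platforms.current_platform; on ROCm
# this reports "cuda" (HIP reuses the CUDA device type), on XPU "xpu".
current_platform = SimpleNamespace(device_type="cuda")

def empty_bf16(*args, device=current_platform.device_type, **kwargs):
    # In fusion.py this would call torch.empty(...); returning a dict
    # keeps the sketch runnable without torch installed.
    return {"shape": args, "dtype": "bfloat16", "device": device, **kwargs}

# The default picks up the platform's device type, resolved at import time:
t = empty_bf16(4, 4)
assert t["device"] == "cuda"

# Callers can still pin a specific device in multi-GPU setups:
t1 = empty_bf16(4, 4, device="cuda:1")
assert t1["device"] == "cuda:1"
```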
Fix two ROCm-related issues:

1. fusion.py helper functions (vllm/compilation/fusion.py):
   - Bug: Hardcoded device="cuda" in empty_bf16, empty_fp32, etc.
   - Fix: Add a device parameter with a "cuda" default for flexibility
   - This allows explicit device selection in multi-GPU scenarios

2. AITER import exception handling (scaled_mm/aiter.py):
   - Bug: Used broad `except Exception:`, which masks unexpected errors
   - Fix: Use specific `except (ImportError, ModuleNotFoundError):`
   - This prevents masking driver errors, OOM, etc. during imports

Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Force-pushed e83c89b to 023eb38
Applied Gemini's suggestion - now using `current_platform.device_type`.
📊 Hardware Verification (MI300X)

Verified on AMD Instinct MI300X VF (gfx942, ROCm 6.2).

Changes Applied:
Validation Results:
Platform Portability:
Device Info:
Closing this PR to reduce maintainer review burden. The fix is available in this branch if needed in the future. Thank you for your time!
Summary

Fix two ROCm-related issues:

1. Fusion Helper Functions (vllm/compilation/fusion.py)

Bug: Hardcoded device="cuda" in helper functions prevents explicit device selection.

Fix: Add a device parameter with a "cuda" default. This allows explicit device selection in multi-GPU scenarios while maintaining backward compatibility with the default.

2. AITER Import Exception Handling (scaled_mm/aiter.py)

Bug: Used a broad except Exception:, which masks unexpected errors like driver issues, OOM, etc.

Fix: Use the specific except (ImportError, ModuleNotFoundError):. This ensures only import-related errors are caught, allowing other errors to propagate for debugging.
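The narrowed import guard can be sketched as follows, using `importlib` and a hypothetical module name rather than the exact code in aiter.py. Note that `ModuleNotFoundError` is a subclass of `ImportError`, so the tuple is equivalent to catching `ImportError` alone; listing both simply makes the intent explicit:

```python
import importlib

def try_import(module_name):
    """Import module_name, swallowing only import failures."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, ModuleNotFoundError):
        # Only a missing or broken import returns None; driver faults,
        # OOM, or other runtime errors during import still propagate
        # so they can be debugged.
        return None

# Missing optional dependency -> None instead of a crash:
assert try_import("_hypothetical_aiter_kernels") is None
# A present module imports normally:
assert try_import("json") is not None
```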
Test Plan
🤖 Generated with Claude Code