fix(studio): add ROCm DeviceType and correct AMD GPU detection in hardware.py#4449
GoldenGrapeGentleman wants to merge 4 commits into
ROCm implements the CUDA API via HIP, so torch.cuda.is_available() returns True on AMD GPUs. Without this fix, detect_hardware() always reports DeviceType.CUDA on AMD hosts.

Changes:
- Add DeviceType.ROCM = 'rocm' after CUDA in the enum
- detect_hardware(): use torch.version.hip to distinguish CUDA vs ROCm
- clear_gpu_cache(): extend to DeviceType.ROCM (same torch.cuda API)
- get_gpu_memory_info(): extend to DeviceType.ROCM
- get_gpu_utilization(): allow ROCm through; VRAM backfill via torch.cuda works
- get_package_versions(): add 'rocm' key (torch.version.hip)
- get_physical_gpu_count(): add rocm-smi fallback for AMD hosts

Tested on 8×AMD MI355X (ROCm 7.1, 288 GB HBM3e per GPU):
  detect_hardware() -> DeviceType.ROCM ✅
  get_gpu_memory_info() total_gb=288.0 ✅
  get_package_versions() rocm='7.1.25...' cuda=None ✅

Co-authored-by: billishyahao <bill.he@amd.com>
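The detection rule in the commit message can be sketched roughly as follows. This is a simplified illustration, not the code from hardware.py: the `DeviceType` enum here is a trimmed stand-in, and `classify_gpu_backend` is a hypothetical helper that takes the two PyTorch values as plain arguments so the logic can be run without a GPU.

```python
from enum import Enum
from typing import Optional


class DeviceType(Enum):
    # Trimmed stand-in for the project's enum (assumption: the real one has more members).
    CUDA = "cuda"
    ROCM = "rocm"
    CPU = "cpu"


def classify_gpu_backend(cuda_available: bool, hip_version: Optional[str]) -> DeviceType:
    """Hypothetical helper: tell NVIDIA CUDA apart from AMD ROCm.

    cuda_available is meant to be torch.cuda.is_available().
    hip_version is meant to be torch.version.hip, which is None on
    CUDA builds of PyTorch and a version string on ROCm builds.
    """
    if not cuda_available:
        return DeviceType.CPU
    # On ROCm, torch.cuda.is_available() is True as well, so the HIP
    # version string is the signal that separates the two backends.
    return DeviceType.ROCM if hip_version else DeviceType.CUDA
```

With PyTorch installed, the call would presumably look like `classify_gpu_backend(torch.cuda.is_available(), torch.version.hip)`.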
Force-pushed from 62c406a to 8adbdde.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62c406ae33
        DEVICE = DeviceType.CUDA
        print(f"Hardware detected: CUDA — {device_name}")
    else:
        DEVICE = DeviceType.ROCM
Keep PyTorch device string on ROCm as cuda
On ROCm hosts this branch returns DeviceType.ROCM, whose .value is "rocm", but downstream inference code uses get_device().value as a PyTorch device string (InferenceBackend.__init__ sets self.device, then .to(self.device) is called in generation paths). PyTorch HIP still uses the CUDA device namespace, so "rocm" is not a valid target for Tensor.to(...), which can break model load/inference specifically in ROCm environments.
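One way to address this, sketched under assumptions: keep `DeviceType.ROCM` for labelling, but translate it to `"cuda"` wherever a PyTorch device string is needed. The `DeviceType` enum below is a trimmed stand-in and `torch_device_string` is a hypothetical helper, not a function from the PR.

```python
from enum import Enum


class DeviceType(Enum):
    # Trimmed stand-in for the project's enum (assumption).
    CUDA = "cuda"
    ROCM = "rocm"
    CPU = "cpu"


def torch_device_string(device: DeviceType) -> str:
    """Hypothetical helper: map a DeviceType to a string PyTorch accepts.

    ROCm builds of PyTorch expose AMD GPUs under the "cuda" device
    namespace, so "rocm" must not be passed to Tensor.to(...).
    """
    if device is DeviceType.ROCM:
        return "cuda"
    return device.value
```

Call sites that currently pass `get_device().value` to `.to(...)` could go through such a helper instead, leaving the `"rocm"` value free to serve as a UI label.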
Problem

ROCm implements the CUDA API via HIP, so `torch.cuda.is_available()` returns `True` on AMD GPUs. Without this fix, `detect_hardware()` always reports `DeviceType.CUDA` on AMD hosts, causing incorrect backend labels in the Studio UI.

Additionally, `get_physical_gpu_count()` relies solely on `nvidia-smi` and therefore returns 1 on AMD systems.

Changes

- `DeviceType.ROCM = "rocm"` added after `CUDA` in the enum
- `detect_hardware()`: check `torch.version.hip` to distinguish NVIDIA CUDA from AMD ROCm
- `clear_gpu_cache()`: extend to `DeviceType.ROCM` (same `torch.cuda` API via HIP)
- `get_gpu_memory_info()`: extend to `DeviceType.ROCM`
- `get_gpu_utilization()`: allow ROCm through; VRAM backfill via `torch.cuda` works on ROCm
- `get_package_versions()`: add `rocm` key (`torch.version.hip`)
- `get_physical_gpu_count()`: add `rocm-smi` fallback for AMD hosts

Testing

Verified on 8×AMD MI355X (ROCm 7.1, 288 GB HBM3e per GPU):

- `detect_hardware()` → `DeviceType.ROCM` ✅
- `get_gpu_memory_info()` `total_gb=288.0` ✅
- `get_package_versions()` `rocm='7.1.25...'` `cuda=None` ✅
- `clear_gpu_cache()` no exception ✅

Co-authored-by: billishyahao <bill.he@amd.com>
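The `rocm-smi` fallback for the GPU count might look something like the sketch below. It is not the PR's implementation. It assumes `nvidia-smi -L` prints one `GPU N: ...` line per device, and that `rocm-smi --showid --json` emits a JSON object with one top-level `cardN` key per device. The function names are hypothetical.

```python
import json
import shutil
import subprocess


def count_cards_from_rocm_smi_json(text: str) -> int:
    """Count top-level keys shaped like 'card0', 'card1', ... (assumed layout)."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return 0
    return sum(1 for key in data if key.startswith("card") and key[4:].isdigit())


def _run(cmd):
    """Run a command and return its stdout, or None on any failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def physical_gpu_count(default: int = 1) -> int:
    """Hypothetical fallback chain: nvidia-smi first, then rocm-smi, then default."""
    if shutil.which("nvidia-smi"):
        out = _run(["nvidia-smi", "-L"])
        if out:
            count = sum(1 for line in out.splitlines() if line.startswith("GPU "))
            if count:
                return count
    if shutil.which("rocm-smi"):
        out = _run(["rocm-smi", "--showid", "--json"])
        if out:
            try:
                count = count_cards_from_rocm_smi_json(out)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                count = 0
            if count:
                return count
    return default
```

Keeping the parser separate from the subprocess call lets the counting logic be checked against sample output on machines without AMD hardware.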