
fix(studio): add ROCm DeviceType and correct AMD GPU detection in hardware.py#4449

Closed
GoldenGrapeGentleman wants to merge 4 commits into
unslothai:mainfrom
GoldenGrapeGentleman:fix/hardware-rocm-device

@GoldenGrapeGentleman
Contributor

Problem

ROCm implements the CUDA API via HIP, so torch.cuda.is_available() returns True on AMD GPUs. Without this fix, detect_hardware() always reports DeviceType.CUDA on AMD hosts, causing incorrect backend labels in the Studio UI.

Additionally, get_physical_gpu_count() relies solely on nvidia-smi, so it reports 1 on AMD systems regardless of how many GPUs are installed.

Changes

  • DeviceType.ROCM = "rocm" added after CUDA in the enum
  • detect_hardware(): check torch.version.hip to distinguish NVIDIA CUDA from AMD ROCm
  • clear_gpu_cache(): extend to DeviceType.ROCM (same torch.cuda API via HIP)
  • get_gpu_memory_info(): extend to DeviceType.ROCM
  • get_gpu_utilization(): allow ROCm through; VRAM backfill via torch.cuda works on ROCm
  • get_package_versions(): add rocm key (torch.version.hip)
  • get_physical_gpu_count(): add rocm-smi fallback for AMD hosts
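
The detection change can be sketched as follows. This is a minimal illustration, not the code from hardware.py: the DeviceType enum here is a hypothetical stand-in with only three members, and classify_backend and detect_backend are illustrative names. The only PyTorch facts it relies on are torch.cuda.is_available() and torch.version.hip, which is a version string on ROCm builds and None on CUDA builds. Other backends the real detect_hardware() may handle are omitted.

```python
from enum import Enum
from typing import Optional


class DeviceType(Enum):
    # Hypothetical stand-in for the Studio enum; only the members needed here.
    CUDA = "cuda"
    ROCM = "rocm"
    CPU = "cpu"


def classify_backend(cuda_available: bool, hip_version: Optional[str]) -> DeviceType:
    """Pick a backend label from the two facts PyTorch exposes."""
    if not cuda_available:
        return DeviceType.CPU
    # On ROCm builds torch.version.hip is a version string; on CUDA builds it is None.
    return DeviceType.ROCM if hip_version else DeviceType.CUDA


def detect_backend() -> DeviceType:
    """Query the installed PyTorch, if any, and classify it."""
    try:
        import torch
    except ImportError:
        # No PyTorch available: treat as CPU-only for this sketch.
        return DeviceType.CPU
    return classify_backend(
        torch.cuda.is_available(),
        getattr(torch.version, "hip", None),
    )
```

Keeping the decision in a pure function such as classify_backend makes the CUDA/ROCm branch testable on machines without any GPU.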

Testing

Verified on 8×AMD MI355X (ROCm 7.1, 288 GB HBM3e per GPU):

  • detect_hardware() → DeviceType.ROCM ✅
  • get_gpu_memory_info() → total_gb=288.0 ✅
  • get_package_versions() → rocm='7.1.25...' cuda=None ✅
  • clear_gpu_cache() → no exception ✅

Co-authored-by: billishyahao <bill.he@amd.com>



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62c406ae33


    DEVICE = DeviceType.CUDA
    print(f"Hardware detected: CUDA — {device_name}")
else:
    DEVICE = DeviceType.ROCM

P1: Keep PyTorch device string on ROCm as cuda

On ROCm hosts this branch returns DeviceType.ROCM, whose .value is "rocm", but downstream inference code uses get_device().value as a PyTorch device string (InferenceBackend.__init__ sets self.device, then .to(self.device) is called in generation paths). PyTorch HIP still uses the CUDA device namespace, so "rocm" is not a valid target for Tensor.to(...), which can break model load/inference specifically in ROCm environments.
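
One possible way to address this concern is to keep DeviceType.ROCM as a label for the UI while mapping it back to "cuda" wherever a PyTorch device string is needed. The sketch below is an assumption about how that could look, not code from this pull request: the enum is a hypothetical stand-in and torch_device_string is an illustrative helper name.

```python
from enum import Enum


class DeviceType(Enum):
    # Hypothetical stand-in for the Studio enum; only the members needed here.
    CUDA = "cuda"
    ROCM = "rocm"
    CPU = "cpu"


def torch_device_string(device: DeviceType) -> str:
    """Return a string that is safe to pass to Tensor.to(...)."""
    # ROCm builds of PyTorch expose AMD GPUs under the "cuda" device namespace,
    # so the "rocm" label must not be used as a device target.
    if device is DeviceType.ROCM:
        return "cuda"
    return device.value
```

With a helper like this, call sites that currently read get_device().value would call the helper instead, so the backend label and the device target stay separate.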
