Add AMD ROCm/HIP support across installer and hardware detection#4720
Conversation
Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases. Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess. Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice(). Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs. Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware. Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1).
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7290199c26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| amd-smi version 2>/dev/null | awk -F'ROCm version: ' \ | ||
| 'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \ |
There was a problem hiding this comment.
Continue ROCm fallback when amd-smi version is malformed
The amd-smi probe currently returns success as soon as it sees a ROCm version: field, even when the extracted version is empty/non-numeric (for example N/A), so the || fallback chain never reaches /opt/rocm/.info/version, hipconfig, or package metadata. In that case _rocm_tag is later cleared by validation and the installer incorrectly falls back to CPU on ROCm hosts that have other valid version sources.
Useful? React with 👍 / 👎.
| probe = subprocess.run( | ||
| [ | ||
| sys.executable, | ||
| "-c", | ||
| "import torch; print(torch.version.hip or torch.version.cuda or '')", | ||
| ], | ||
| stdout = subprocess.PIPE, | ||
| stderr = subprocess.DEVNULL, | ||
| timeout = 30, |
There was a problem hiding this comment.
Catch torch probe timeout before deciding reinstall
The ROCm torch probe uses subprocess.run(..., timeout=30) without exception handling, so a slow or hung import torch raises TimeoutExpired and aborts install_python_stack() instead of continuing install/update. This turns a defensive probe into a hard failure on degraded environments; the timeout path should be handled explicitly and treated as a recoverable probe miss.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces AMD ROCm support across the installation scripts and backend hardware utilities, including new tests and regression checks. Key changes involve adding ROCm version detection, configuring appropriate PyTorch index URLs, and ensuring bitsandbytes is installed with AMD support. Review feedback focused on improving resource management by using context managers for file operations and refining path filtering logic for better code clarity.
| os.path.join(rocm_root, "lib", "rocm_version"), | ||
| ): | ||
| try: | ||
| parts = open(path).read().strip().split("-")[0].split(".") |
| shutil.which("rocm-smi"), | ||
| ] | ||
| rocm_paths = ["/opt/rocm", os.environ.get("ROCM_PATH", "")] | ||
| if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths if p): |
There was a problem hiding this comment.
The use of any(os.path.isdir(p) for p in rocm_paths if p) is good, but rocm_paths contains an empty string if ROCM_PATH is not set. While if p handles this, it is cleaner to filter the list first.
rocm_paths = [p for p in ["/opt/rocm", os.environ.get("ROCM_PATH")] if p]
if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths):Address review findings from 8 independent reviewers: - Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this) - Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index - Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks - Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution) - Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs - Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0* |rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index - Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering - Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements AMD ROCm support by adding hardware detection logic, updating installation scripts to fetch ROCm-compatible PyTorch and bitsandbytes wheels, and refining asset resolution for prebuilt binaries. It also introduces a comprehensive test suite and updates error messages for AMD users. The review feedback identifies a critical version typo for the bitsandbytes package (0.49.1) that must be corrected to a valid version like 0.45.0 to prevent installation failures. Furthermore, the reviewer recommends replacing silent broad exception handlers in the hardware probing logic with logging to improve maintainability and debuggability.
| case "$TORCH_INDEX_URL" in | ||
| */rocm*) | ||
| substep "installing bitsandbytes for AMD ROCm..." | ||
| run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1" |
There was a problem hiding this comment.
The version constraint bitsandbytes>=0.49.1 appears to be a typo. As of current releases, bitsandbytes is in the 0.45.x range. ROCm support was officially added in 0.44.0. Using a non-existent version will cause the installation to fail. Please verify the intended version (likely 0.45.1 or similar).
| run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1" | |
| run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.45.0" |
| pip_install( | ||
| "bitsandbytes (AMD)", | ||
| "--no-cache-dir", | ||
| "bitsandbytes>=0.49.1", |
There was a problem hiding this comment.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Avoid using broad, silent exception handlers like except Exception: pass. While this is a probe, it is better to at least include a comment explaining why the exception is ignored or log it at a debug level to assist in troubleshooting if version detection fails unexpectedly.
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Broad exception handler detected. Per general rules, it is recommended to log the exception or add a descriptive comment instead of a silent pass, even for hardware probing logic.
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2cb0b52be7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| rocm6.*|rocm7.0*|rocm7.1*) | ||
| echo "$_base/$_rocm_tag" ;; | ||
| *) | ||
| # ROCm 7.2+ (including future 10.x+): cap to rocm7.1 | ||
| echo "$_base/rocm7.1" ;; |
There was a problem hiding this comment.
Restrict ROCm tag remap to supported versions
This remap logic treats every detected ROCm tag outside rocm6.*, rocm7.0*, and rocm7.1* as rocm7.1, so older stacks like rocm5.x are incorrectly forced onto the 7.1 wheel index; at the same time, rocm7.1* also matches rocm7.10, which emits a likely non-existent .../rocm7.10 index. In both cases the installer can request incompatible/unavailable wheels and fail instead of falling back cleanly. The mapping should only pass through explicitly supported tags and only cap versions >= 7.2 to rocm7.1.
Useful? React with 👍 / 👎.
Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if p` guard in the any() call.
All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path: - install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection - install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds - install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths Also: - Shell test mock amd-smi now handles "list" subcommand - Python tests updated to mock _has_rocm_gpu where needed - Added test_no_gpu_with_rocm_tools_skips to verify the new guard - Test index lookups now use sorted() to match production code
|
/gemini review |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for AMD ROCm across the installation and hardware detection pathways. Key changes include updating install.sh to detect AMD GPUs and select appropriate ROCm-enabled PyTorch wheels, adding ROCm detection logic to the Python stack installer, and updating hardware utility modules to distinguish between NVIDIA CUDA and AMD ROCm (HIP). The PR also includes extensive new tests to verify these pathways without requiring physical AMD hardware. Feedback focuses on improving the robustness of version string parsing and subprocess probing to prevent potential runtime errors like IndexError or AttributeError during hardware detection.
| _rocm_tag="" | ||
| _rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \ | ||
| amd-smi version 2>/dev/null | awk -F'ROCm version: ' \ | ||
| 'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \ |
There was a problem hiding this comment.
The awk script for parsing amd-smi version might fail if the version string contains unexpected characters or if the split results in fewer than two elements. While the validation guard at line 1017 handles malformed tags, ensuring that a[1] and a[2] are populated before printing would be more robust.
| parts = fh.read().strip().split("-")[0].split(".") | ||
| return int(parts[0]), int(parts[1]) |
There was a problem hiding this comment.
Accessing parts[0] and parts[1] without checking the length of the parts list could raise an IndexError. While this is wrapped in a broad try...except Exception, repository rules advise against silent broad exception handlers. It is better to validate the list length explicitly and ensure any caught exceptions are logged for debugging purposes.
| parts = fh.read().strip().split("-")[0].split(".") | |
| return int(parts[0]), int(parts[1]) | |
| parts = fh.read().strip().split("-")[0].split(".") | |
| if len(parts) >= 2: | |
| return int(parts[0]), int(parts[1]) |
References
- Avoid using broad, silent exception handlers like
except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
| raw = result.stdout.decode().strip().split("\n")[0] | ||
| parts = raw.split(".") | ||
| if len(parts) >= 2 and parts[0].isdigit(): | ||
| return int(parts[0]), int(parts[1]) |
There was a problem hiding this comment.
Similar to the version file parsing, parts[1] could raise an IndexError if hipconfig --version returns a string without at least one dot. Adding a length check ensures robustness.
| return int(parts[0]), int(parts[1]) | |
| if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit(): | |
| return int(parts[0]), int(parts[1]) |
| [ | ||
| sys.executable, | ||
| "-c", | ||
| "import torch; print(torch.version.hip or torch.version.cuda or '')", |
There was a problem hiding this comment.
The probe script import torch; print(torch.version.hip or torch.version.cuda or '') might raise an AttributeError if torch.version does not have a hip or cuda attribute (e.g., in very old or custom builds). Using getattr is safer.
| "import torch; print(torch.version.hip or torch.version.cuda or '')", | |
| "import torch; v=torch.version; print(getattr(v, 'hip', None) or getattr(v, 'cuda', None) or '')", |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4286525c53
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \ | ||
| amd-smi version 2>/dev/null | awk -F'ROCm version: ' \ | ||
| 'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \ | ||
| { [ -r /opt/rocm/.info/version ] && \ | ||
| awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \ |
There was a problem hiding this comment.
Prevent ROCm version probe from aborting install.sh
Because install.sh runs with set -e, this _rocm_tag=$(...) assignment will terminate the script when every version probe in the || chain fails. That happens on real ROCm hosts when GPU detection succeeds (via rocminfo) but version sources are unavailable/unreadable (for example custom ROCm layouts where /opt/rocm/.info/version, hipconfig, and package metadata are missing). In that case the installer exits before reaching the intended CPU fallback path.
Useful? React with 👍 / 👎.
| if result.returncode == 0 and result.stdout.strip(): | ||
| if marker is None or marker in result.stdout.lower(): | ||
| return True |
There was a problem hiding this comment.
Require real device rows in amd-smi GPU detection
The amd-smi path treats any non-empty amd-smi list output as proof of an AMD GPU because marker is None, so informational/header-only output can be misclassified as a real adapter. When that happens on hosts with ROCm tools but no AMD GPU, _ensure_rocm_torch() can incorrectly force a ROCm torch reinstall and fail dependency setup; this check should validate actual device entries (similar to the shell script’s row filtering) rather than just stdout.strip().
Useful? React with 👍 / 👎.
- Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion) - Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
- Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output. - Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior). - Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency.
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10ec0cdefa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (["rocminfo"], "gfx"), | ||
| (["amd-smi", "list"], "gpu"), |
There was a problem hiding this comment.
Require concrete AMD device markers in ROCm probe
The new ROCm detector uses ("amd-smi", "list") with marker "gpu" and then accepts any output containing that substring, which can misclassify non-AMD hosts because informational text (for example messages like “No GPUs found” or header-only output) still includes gpu. In environments without usable NVIDIA GPUs, that false positive routes selection toward HIP/ROCm prebuilts instead of CPU, leading to unnecessary validation failures and source-build fallback behavior.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements AMD ROCm support across the installation and hardware detection stack. Key updates include ROCm version detection in the shell installer, the introduction of an IS_ROCM flag in the hardware utility module, and logic in the Python stack installer to handle ROCm-specific PyTorch and bitsandbytes wheels. Additionally, prebuilt binary resolution now supports HIP assets for both Linux and Windows. Feedback was provided regarding the ROCm detection logic in detect_host, which is likely to fail on Windows due to its reliance on Linux-specific CLI tools.
| has_rocm = False | ||
| if not is_macos: | ||
| for _cmd, _marker in ( | ||
| (["rocminfo"], "gfx"), | ||
| (["amd-smi", "list"], "gpu"), | ||
| ): | ||
| _exe = shutil.which(_cmd[0]) | ||
| if not _exe: | ||
| continue | ||
| try: | ||
| _result = run_capture([_exe, *_cmd[1:]], timeout = 10) | ||
| except Exception: | ||
| continue | ||
| if _result.returncode == 0 and _result.stdout.strip(): | ||
| if _marker in _result.stdout.lower(): | ||
| has_rocm = True | ||
| break |
There was a problem hiding this comment.
The ROCm detection logic in detect_host is likely to fail on native Windows systems. It relies on rocminfo (which is Linux-only) and amd-smi (which is often not in the system PATH on Windows). To improve detection on Windows, consider checking for the presence of amdhip64.dll in standard locations (e.g., C:\Windows\System32 or the ROCm installation directory) or using wmic to query the GPU vendor.
The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable. Now detect_host() uses platform-specific detection: - Linux: rocminfo (check for gfx GPU names) or amd-smi list - Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for AMD ROCm (HIP) across the installation scripts, backend hardware monitoring, and kernel utilities. Key enhancements include automated AMD GPU and ROCm version detection, the ability to install ROCm-specific PyTorch wheels and bitsandbytes, and hardware monitoring via amd-smi. It also adds support for llama.cpp ROCm/HIP prebuilts on both Linux and Windows and expands RDNA architecture support in the kernel utilities. I have no feedback to provide as there are no review comments.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for AMD ROCm GPUs across the installation, hardware detection, and runtime components. Key changes include enhanced GPU detection logic in the installation script to differentiate between NVIDIA and AMD ROCm, enabling the selection of appropriate PyTorch index URLs and direct installation of bitsandbytes for ROCm. A new amd.py module was added to mirror NVIDIA GPU monitoring, providing detailed metrics for ROCm devices, and the core hardware.py was updated to dynamically dispatch to the correct monitoring backend and handle ROCm-specific environment variables for GPU visibility. New functions were introduced in install_python_stack.py to detect ROCm versions and ensure the installation of ROCm-compatible PyTorch wheels and bitsandbytes, with specific warnings for Windows users. The install_llama_prebuilt.py now includes has_rocm in host detection and prioritizes ROCm prebuilts for both Linux and Windows, selecting compatible versions. Kernel utilities expanded RDNA architecture detection, and tokenizer utilities updated error messages for AMD ROCm users. Review comments highlighted potential issues in studio/backend/utils/hardware/amd.py regarding truthiness checks for vram_total_mb and power_limit that could lead to None results if values are exactly 0.0, and suggested more robust type checking for amd-smi output to prevent AttributeError when data is not a dictionary.
| round((vram_used_mb / vram_total_mb) * 100, 1) | ||
| if vram_used_mb is not None and vram_total_mb and vram_total_mb > 0 | ||
| else None | ||
| ) |
There was a problem hiding this comment.
The truthiness check and vram_total_mb will evaluate to False if vram_total_mb is 0.0, which is a valid numeric value. While the subsequent vram_total_mb > 0 check correctly handles the division-by-zero case, the redundant truthiness check might lead to unexpected None results if a GPU reports exactly zero total VRAM (e.g., in some virtualized or error states).
vram_util = (
round((vram_used_mb / vram_total_mb) * 100, 1)
if vram_used_mb is not None and vram_total_mb is not None and vram_total_mb > 0
else None
)| round((power_draw / power_limit) * 100, 1) | ||
| if power_draw is not None and power_limit and power_limit > 0 | ||
| else None | ||
| ) |
There was a problem hiding this comment.
Similar to the VRAM utilization check, using truthiness for power_limit will return None if the power limit is exactly 0.0. It is safer to use an explicit is not None check.
power_util = (
round((power_draw / power_limit) * 100, 1)
if power_draw is not None and power_limit is not None and power_limit > 0
else None
)| if isinstance(data, list): | ||
| return len(data) | ||
| # Some versions return a dict with a "gpu" key | ||
| gpus = data.get("gpu", data.get("gpus", [])) |
There was a problem hiding this comment.
If data is not a dictionary (e.g., if amd-smi returns a JSON array or a scalar unexpectedly), calling .get() will raise an AttributeError. Given that _run_amd_smi returns the result of json.loads(), it is safer to verify the type before accessing keys.
| gpus = data.get("gpu", data.get("gpus", [])) | |
| gpus = data.get("gpu", data.get("gpus", [])) if isinstance(data, dict) else [] |
| } | ||
|
|
||
| gpu_list = ( | ||
| data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data])) |
There was a problem hiding this comment.
If data is not a dictionary, data.get() will raise an AttributeError. Although _run_amd_smi usually returns a structured object, adding a type check ensures robustness against malformed SMI output.
gpu_list = (
data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data]))
if isinstance(data, dict) else [data]
)…k, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table).
for more information, see https://pre-commit.ci
…p to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bae24218b0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if command -v rocminfo >/dev/null 2>&1 && \ | ||
| rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then |
There was a problem hiding this comment.
Broaden rocminfo GPU match beyond Name: gfx*
The ROCm probe currently requires rocminfo lines to match Name: ... gfx... with gfx immediately after whitespace, but many valid ROCm outputs report GPU ISA as strings like Name: amdgcn-amd-amdhsa--gfx90a.... In those environments (especially minimal/runtime-only installs where amd-smi is absent), _has_amd_rocm_gpu() returns false and get_torch_index_url() incorrectly falls back to CPU wheels on AMD hosts. Please match gfx[1-9] anywhere in relevant rocminfo device lines rather than only the strict prefix form.
Useful? React with 👍 / 👎.
…n.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3a. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false
During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.
…n HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths - Guards get_chat_template calls (unavailable without Unsloth import) - Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path) The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X) so 16-bit loading is practical for inference. Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424): - Simple generation: PASS - Compare mode (base vs finetuned): PASS - GGUF inference + tool calling: PASS (unaffected by this change)
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb7dbd4361
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if _IS_ROCM_ENV: | ||
| FastLanguageModel = None # Loaded on-demand only on NVIDIA | ||
| FastVisionModel = None | ||
| get_chat_template = None |
There was a problem hiding this comment.
Keep ROCm non-text loaders callable
Do not null out FastLanguageModel/FastVisionModel at import time for ROCm unless every downstream callsite is guarded. load_model() still uses FastVisionModel in the vision path and FastLanguageModel in the SNAC audio path, so on ROCm these branches now crash with AttributeError before any fallback logic runs, breaking vision/audio model loading on AMD hosts.
Useful? React with 👍 / 👎.
| from peft import PeftModel | ||
|
|
||
| model = PeftModel.from_pretrained(model, config.path) | ||
| tokenizer = AutoTokenizer.from_pretrained(config.path) |
There was a problem hiding this comment.
Load LoRA tokenizer from base model in ROCm fallback
In the ROCm LoRA branch the base weights are loaded from _base, but the tokenizer is loaded from config.path (the adapter path). Adapter-only PEFT checkpoints commonly omit tokenizer files, so this can fail on ROCm even when the base model is valid. Use the base model as tokenizer source (or fallback to it when adapter tokenizer files are absent) to avoid adapter-only load failures.
Useful? React with 👍 / 👎.
- Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD) - Remove unused `import os as _os` from the ROCm changes
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98f0215116
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _IS_ROCM_ENV = getattr(__import__("torch").version, "hip", None) is not None | ||
|
|
||
| if _IS_ROCM_ENV: | ||
| FastLanguageModel = None # Loaded on-demand only on NVIDIA |
There was a problem hiding this comment.
Gate ROCm loader nulling on active ROCm hardware
_IS_ROCM_ENV is derived from torch.version.hip (build-time capability), so HIP-built torch on a non-ROCm runtime (e.g., CPU-only host or broken AMD stack) still sets FastLanguageModel = None at import time. Later, load_model() branches on _hw_module.IS_ROCM (runtime detection) and can take the non-ROCm path, which then calls FastLanguageModel.from_pretrained(...) and crashes with AttributeError. Fresh evidence in this patch is the new split between import-time _IS_ROCM_ENV and runtime _hw_module.IS_ROCM, which can legitimately diverge.
Useful? React with 👍 / 👎.
…sage, temperature)
amd-smi on recent ROCm versions (7.x) wraps metric output in a
{"gpu_data": [...]} envelope instead of returning a raw list. This
caused get_primary_gpu_utilization() and get_visible_gpu_utilization()
to fail silently (returning available=False) because the GPU data
dict was never unwrapped.
Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" /
"used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous
dict.get() chain returned the "N/A" string instead of falling
through to "hotspot". Changed to a loop that checks each key until
a parseable value is found.
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x):
- GPU utilization: 0% (idle), up to 100% during training
- Temperature: 40-44C (from hotspot sensor)
- VRAM: 0.28/191.69 GB (idle)
- Power: 158-211W draw
for more information, see https://pre-commit.ci
AMD ROCm Testing Report -- MI300X VF (ROCm 7.2, HIP 7.1.25424)Environment
Fixes Added (3 commits)
Test ResultsStudio Launch & Hardware Detection
GGUF Inference (4 models, all PASS)
GGUF Tool Calling (7 tests, all PASS)
Fine-tuning (5 models, all PASS)
Export (all PASS)
GGUF Memory & Context
KV Cache Quantization (all PASS)q8_0, q4_0, f16 all produce correct "Berlin" for Germany capital. GPU Monitoring (amd-smi)
Compare Mode (base vs finetuned)
Embedding FinetuningQwen3-Embedding-0.6B with all-nli dataset, 10 steps: PASS (loss converged) Vision
NVIDIA/Windows/Mac SafetyAll changes verified gated behind |
* Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
#4954) * Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on every ROCm target: - CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a broken blocksize=32/64 warp64 GEMV kernel whose tests were explicitly skipped with ROCM_WARP_SIZE_64 guards because the code was known broken. - RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time BNB_WARP_SIZE macro in the host-side dispatch that resolves to 64 when the multi-arch wheel is compiled with CDNA as the primary target, so num_blocks is wrong on RDNA and half the GEMV output is never written. At decode shape (1, 1, hidden) both bugs produce NaN. Training is unaffected because training shapes are (batch, seq_len > 1, hidden) and never touch the GEMV path. The crash during autoregressive inference surfaces as _assert_async_cuda_kernel in torch.multinomial which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of a clean Python error. Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA", PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a runtime hipDeviceGetAttribute query and ships a working CDNA warp64 kernel. That commit has not shipped to PyPI yet, but continuous-release_main wheels are published on every push to bnb main via GitHub Releases. Point the ROCm install path at the continuous-release_main x86_64 and aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is unreachable (offline installs, firewalled hosts, or architectures not covered by the pre-release wheels). Drop the pin once bnb cuts a 0.50+ tag on PyPI. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1 (no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit sampling generation works end-to-end. NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is gated on the ROCm torch index and platform.machine() respectively. * Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode The 16-bit fallback in studio/backend/core/inference/inference.py was added as a workaround for a bug that this PR already fixes at the install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel on every ROCm target, which NaNs at decode shape (seq_len=1) and crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in 0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this PR) restores correct 4-bit decode on MI300X and verified working end-to-end with full Unsloth + for_inference + sampling. Revert the dual code path so ROCm and NVIDIA both go through the normal FastLanguageModel.from_pretrained + for_inference flow: - Remove the conditional `from unsloth import` that skipped the import on ROCm. The monkey-patches it was trying to avoid were never the cause of the crash; bnb 4-bit GEMV was. - Remove the `if _hw_module.IS_ROCM:` branch in load_model that loaded with plain transformers + PEFT + bfloat16, and the `_resolve_fp16_base` helper it relied on. - Remove the `get_chat_template is not None` fallback in _load_chat_template_info -- get_chat_template is now always imported. - Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM directly instead of the removed _IS_ROCM_ENV global. Audio and vision on ROCm still need separate validation (FastVisionModel and the CSM audio codecs were never tested on HIP) so the guard stays for now. Add _bnb_rocm_4bit_ok() as a runtime safety net for users who install from this PR before the install.sh bnb pin kicks in, or whose installer fell back to the PyPI pin because the continuous- release wheel was unreachable. When the installed bnb is < 0.50 on ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit / -bnb-4bit suffix from the model path so a pre-quantized repo resolves to its FP16 sibling instead of pulling bnb back in via the repo's quantization_config. LoRA adapters whose base is a pre-quantized repo on old bnb will still fail inside Unsloth's loader -- the only real fix there is `unsloth studio update`. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): - HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized repo): loads in 4-bit via the fixed GEMV, generation returns "Paris." for greedy and sampling. - SAFETY-NET path (simulated old bnb, suffix-stripped to the FP16 sibling, load_in_4bit=False): loads in bf16, generation returns "Paris." for greedy and sampling. Net diff is ~45 lines smaller than the pre-revert state because the entire plain-transformers 16-bit branch is gone. * Cache _bnb_rocm_4bit_ok() with functools.cache load_model() can be called many times in a single session but the bnb version and hardware state cannot change at runtime, so memoise the check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes` inside the try block), subsequent calls drop to sub-microsecond dict lookups. Zero behavioral change. * Shorten verbose bnb/ROCm comments Comment-only cleanup across install.sh, studio/install_python_stack.py, and studio/backend/core/inference/inference.py. No behavioral change. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove _bnb_rocm_4bit_ok safety net from inference.py Studio's ROCm support is brand new (PR #4720, merged today) and every fresh install pulls the bnb continuous-release_main wheel via install.sh / install_python_stack.py in this same PR. There are no existing ROCm Studio installs carrying bnb < 0.50, so the defensive version-check fallback is guarding against a scenario that cannot actually occur. Delete the helper, the functools import, and the safety-net block -- inference.py now calls FastLanguageModel.from_pretrained directly with no ROCm branching. * Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix Vision inference was blocked by the same bnb 4-bit GEMV bug that affected text inference (vision models use bnb 4-bit for the LM backbone). With bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit loaded in 4-bit via FastVisionModel + for_inference returns a correct answer to a multimodal prompt. Audio (CSM) was never actually blocked by HIP — on this hardware CSM loads and runs its backbone forward pass fine with bnb 0.50, then fails during generate() with a transformers-level kwarg validation mismatch in generation_csm.py (`backbone_last_hidden_state` rejected). That's a pre-existing transformers/CSM integration bug that reproduces identically on NVIDIA, so the ROCm-gated guard was never actually protecting users from anything HIP-specific. Remove the combined audio/vision guard and the now-unused _hw_module import. Also restore the one-word "Can be" in an inline comment that drifted during the earlier comment-shortening pass, so the inference.py delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and nothing else. * Shorten max_seq_length=0 guard comment to one line --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Summary
Add AMD ROCm GPU support across the installer, Python dependency stack, llama.cpp prebuilt selection, and hardware detection module. All existing CUDA, CPU, macOS, and Windows pathways are preserved -- AMD ROCm only activates as a last resort when no NVIDIA GPU is found.
Built on top of the original work from @edamamez (#4708) and @GoldenGrapeGentleman (#4428, #4448, #4449, #4450) -- their commits are preserved with attribution. Supersedes those PRs with bug fixes, validation hardening, and a comprehensive test suite.
Test suite (95 Python + 23 shell tests) is in companion PR #4824.
Radeon-specific wheel installation from @iswaryaalex (#4770) is included.
Changes by file
1.
install.sh-- ROCm detection + Radeon wheel selectionExtend
get_torch_index_url()to detect AMD ROCm whennvidia-smiis not found. Detection chain:rocminfo>amd-smi>/opt/rocmversion file >hipconfig>dpkg-query>rpm. ROCm 7.2+ capped torocm7.1index (torch 2.11.0 exceeds current<2.11.0bound). ROCm < 6.0 rejected (no PyTorch wheels exist). Includes:rocm.from garbled amd-smi)rocminfoMarketing Namerepo.radeon.comwith fallback to standard ROCm indexamd-smi listprobe to require GPU data rows, not just headersrocminfoprobe to reject gfx000 CPU agent (gfx[1-9] instead of gfx[0-9])2.
install_python_stack.py-- ROCm torch reinstallAdd
_detect_rocm_version()and_ensure_rocm_torch()to handle the case whereunsloth studio updategives a venv CPU-only torch on a ROCm host. Probes the host, checks if torch has the HIP backend, reinstalls from the correct ROCm wheel index if needed. Also adds:_has_usable_nvidia_gpu()to validate nvidia-smi reports a real GPU before giving NVIDIA precedence_has_rocm_gpu()with hardened amd-smi and rocminfo probesamd-smi versionas additional ROCm version detection fallback3.
install_llama_prebuilt.py-- ROCm prebuilt selectionAdd
has_rocmtoHostInfo, extenddetect_host()to probe for ROCm, and route ROCm hosts to upstream prebuilts:Also adds
runtime_payload_health_groupsforlinux-rocmandwindows-hipinstall kinds.4. Hardware
IS_ROCMflag + AMD monitoringdetect_hardware()setsIS_ROCM=Truewhentorch.version.hipis present (DeviceType stays CUDA)get_package_versions()returns bothcudaandrocmkeysIS_ROCMexported via__getattr__for live value access after detectionamd.pymodule for AMD GPU monitoring via amd-smi with unit-aware VRAM parsing (GB vs GiB distinction)HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICESsupport in_get_parent_visible_gpu_spec()andapply_gpu_ids()5.
is_rdna()RDNA2/3/3.5/4 expansionExpand
is_rdna()to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201).6.
worker.pyROCm Mamba/SSM source buildstorch.version.hipin causal-conv1d environment detectionuv pip installwhen available for faster dependency resolution7.
tokenizer_utils.pyerror messageReplace "We do not support AMD / Intel machines yet" with a helpful message pointing to ROCm install docs.
Test plan
bash -n install.sh-- syntax check passedpython -m py_compileon all modified Python files -- passedIS_ROCM=Falseon NVIDIAbash tests/sh/test_get_torch_index_url.sh-- 23/23 (in companion PR Add ROCm test suite (companion to #4720) #4824)python -m pytest tests/studio/install/test_rocm_support.py -v-- 95/95 (in companion PR Add ROCm test suite (companion to #4720) #4824)Closes #4708, closes #4428, closes #4448, closes #4449, closes #4450.