Skip to content

Add AMD ROCm/HIP support across installer and hardware detection#22

Closed
danielhanchen wants to merge 58 commits into
mainfrom
pr-4720-head
Closed

Add AMD ROCm/HIP support across installer and hardware detection#22
danielhanchen wants to merge 58 commits into
mainfrom
pr-4720-head

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

Staging mirror of unslothai#4720

Original PR: unslothai#4720
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

Add AMD ROCm GPU support across the installer, Python dependency stack, llama.cpp prebuilt selection, and hardware detection module. All existing CUDA, CPU, macOS, and Windows pathways are preserved -- AMD ROCm only activates as a last resort when no NVIDIA GPU is found.

Built on top of the original work from @edamamez (unslothai#4708) and @GoldenGrapeGentleman (unslothai#4428, unslothai#4448, unslothai#4449, unslothai#4450) -- their commits are preserved with attribution. Supersedes those PRs with bug fixes, validation hardening, and a comprehensive test suite.

Test suite (95 Python + 23 shell tests) is in companion PR unslothai#4824.

Radeon-specific wheel installation from @iswaryaalex (unslothai#4770) is included.

Changes by file

1. install.sh -- ROCm detection + Radeon wheel selection

Extend get_torch_index_url() to detect AMD ROCm when nvidia-smi is not found. Detection chain: rocminfo > amd-smi > /opt/rocm version file > hipconfig > dpkg-query > rpm. ROCm 7.2+ capped to rocm7.1 index (torch 2.11.0 exceeds current <2.11.0 bound). ROCm < 6.0 rejected (no PyTorch wheels exist). Includes:

  • Validation guard rejecting malformed tags (e.g. rocm. from garbled amd-smi)
  • Debian epoch prefix stripping for dpkg-query
  • ROCm status messaging and bitsandbytes install for AMD
  • Radeon GPU auto-detection via rocminfo Marketing Name
  • Radeon-specific wheel installation from repo.radeon.com with fallback to standard ROCm index
  • Empty triton arg handling (conditional inclusion to avoid PEP508 error)
  • Relaxed wheel suffix matching for manylinux_2_28_x86_64 wheels
  • Hardened amd-smi list probe to require GPU data rows, not just headers
  • Hardened rocminfo probe to reject gfx000 CPU agent (gfx[1-9] instead of gfx[0-9])

2. install_python_stack.py -- ROCm torch reinstall

Add _detect_rocm_version() and _ensure_rocm_torch() to handle the case where unsloth studio update gives a venv CPU-only torch on a ROCm host. Probes the host, checks if torch has the HIP backend, reinstalls from the correct ROCm wheel index if needed. Also adds:

  • _has_usable_nvidia_gpu() to validate nvidia-smi reports a real GPU before giving NVIDIA precedence
  • _has_rocm_gpu() with hardened amd-smi and rocminfo probes
  • amd-smi version as additional ROCm version detection fallback
  • Windows AMD GPU warning that validates actual GPU presence
  • Only skips reinstall when HIP is already present (not stale CUDA builds)

3. install_llama_prebuilt.py -- ROCm prebuilt selection

Add has_rocm to HostInfo, extend detect_host() to probe for ROCm, and route ROCm hosts to upstream prebuilts:

Host Behavior
Linux x86_64 ROCm (no NVIDIA) Try ROCm 7.2 prebuilt, fall back to source build with HIP
Windows x86_64 ROCm (no NVIDIA) Try HIP prebuilt, fall back to CPU with log message
Any with NVIDIA Normal CUDA path (ROCm ignored)

Also adds runtime_payload_health_groups for linux-rocm and windows-hip install kinds.

4. Hardware `I

edamamez and others added 30 commits March 31, 2026 08:37
Add AMD ROCm GPU detection to get_torch_index_url() in install.sh.
When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm
version file, hipconfig, dpkg-query, and rpm.

Includes validation guard for malformed _rocm_tag, Debian epoch prefix
stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install,
and status messaging. Shell tests expanded to 23 cases.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a
Linux host has ROCm but the venv received CPU-only torch, and reinstall
with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a
30-second timeout on the torch GPU probe subprocess.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm
via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream
prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP
prebuilt with CPU fallback). Add linux-rocm and windows-hip install
kinds to runtime_patterns_for_choice().

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add IS_ROCM flag to hardware.py detect_hardware() (set when
torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM
from __init__.py. Add "rocm" key to get_package_versions().

Replace "We do not support AMD" error in tokenizer_utils.py with a
helpful message pointing to ROCm installation docs.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add tests/studio/install/test_rocm_support.py covering all ROCm code
paths across install_llama_prebuilt.py, install_python_stack.py,
hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks
and run without AMD hardware.

Covers: asset selection (11), runtime patterns (5), HostInfo (4),
ROCm version detection (9), torch reinstall (9), index mapping (8),
hardware flag (8), tokenizer message (2), install.sh structure (10),
and live regression (1).
Address review findings from 8 independent reviewers:

- Wrap _ensure_rocm_torch() torch probe in try/except for
  TimeoutExpired and OSError so a hung or broken torch import does not
  crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to
  prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid
  resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of
  default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to
  reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0*
  |rocm7.1* pass through, everything else caps to rocm7.1) so future
  ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with
  proper assertions verifying reinstall proceeds after timeout
Filter None from the ROCM_PATH env var lookup at list construction time
instead of relying on the inline `if p` guard in the any() call.
All 8 reviewers across 2 cycles independently flagged that ROCm
detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core)
as a proxy for GPU presence, which would misroute CPU-only or NVIDIA
hosts that happen to have ROCm tools installed.

Now all 3 detection points (install.sh, install_python_stack.py,
install_llama_prebuilt.py) probe for an actual AMD GPU before
entering the ROCm path:

- install.sh: check rocminfo for gfx* GPU names, or amd-smi list
  for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes
  rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi
  list instead of just checking tool existence or directory paths

Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code
- Add parts[1].isdigit() check in hipconfig version parsing to handle
  versions like "6.3-HIP" where the minor component has non-numeric
  suffix (strip "-" prefix before int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or
  custom torch builds that may lack torch.version.hip/cuda attributes
- Change amd-smi list detection from any-non-empty-output to requiring
  "gpu" marker in output, matching the shell-side NR>1 check. Prevents
  false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed
  AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and
  install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py
  detect_host() for consistency.
The previous detect_host() ROCm check used rocminfo and amd-smi list
which are Linux-only tools. On Windows, has_rocm would always be False,
making the Windows HIP prebuilt path at line 1794 unreachable.

Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH

This allows Windows AMD users to get the HIP prebuilt binary instead
of silently falling through to the CPU prebuilt.
…essaging, RDNA expansion

- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check
  for hipcc before ROCm source builds, improve status messages and error
  reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json,
  mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization,
  visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that
  ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032),
  RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)
- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi
  output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before
  giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer
  amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly
  handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index
  so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels
  exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s;
  fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts
… validation

- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm
  before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with
  HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi
  JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly;
  fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU
  presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP
  detection after tool-based checks, so Windows HIP installs without CLI
  tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role
Use explicit None checks instead of Python `or` operator when reading
HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string
("") is correctly honored as "no visible GPUs" rather than silently
falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.
…x visible GPU count

- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in
  both install.sh and install_python_stack.py to prevent resolver from
  selecting incompatible companion packages from ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence
  without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which
  respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts
The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3
(gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201)
architectures is based on the original work from PR unslothai#4428.

Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Adopt main's $TORCH_CONSTRAINT variable for the default PyTorch install
path instead of hardcoded version strings.
Move test_rocm_support.py and shell test additions to a separate PR
to keep the main ROCm support PR focused on implementation changes.
- Fix empty _tri_arg passed to uv pip install in Radeon path (causes
  "Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when
  repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64
  wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live
  value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES
  on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm GPUs, including hardware detection, monitoring via a new amd-smi backend, and automated installation of ROCm-compatible PyTorch wheels. Key changes include the addition of an AMD-specific hardware utility, improved asset resolution for ROCm/HIP prebuilts, and expanded RDNA architecture support in kernels. The review feedback suggests moving local re module imports to the top of the files to align with PEP 8 standards.

Comment thread studio/install_llama_prebuilt.py Outdated
pass

# Detect AMD ROCm (HIP) -- require actual GPU, not just tools installed
import re as _re
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to PEP 8, imports should usually be at the top of the file. The re module is imported multiple times within this file (here in detect_host and again in resolve_upstream_asset_choice). It would be cleaner and more conventional to have a single import re at the top of the script.

timeout = 5,
)
if result.returncode == 0:
import re
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to PEP 8, imports should usually be at the top of the file. The re module is imported multiple times within different functions in this file (e.g., here in _detect_rocm_version, in _has_rocm_gpu, and as _re_win later). It would be cleaner and more conventional to have a single import re at the top of the script.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 810b833b01

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread install.sh
Comment on lines +1281 to +1283
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_tri_whl" "$_torch_whl" "$_tv_whl" "$_ta_whl"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce TORCH_CONSTRAINT when installing Radeon wheel URLs

This Radeon fast-path installs direct wheel URLs without applying $TORCH_CONSTRAINT, even though the resolver picks the newest matching wheel from the repo listing; that bypasses the <2.11.0 guard used in the regular ROCm index path. As Radeon publishes newer torch builds, this path can silently install unsupported torch/vision/audio versions and destabilize later dependency overrides/runtime behavior.

Useful? React with 👍 / 👎.

Comment on lines +1360 to +1362
if IS_ROCM:
os.environ["HIP_VISIBLE_DEVICES"] = value
os.environ["ROCR_VISIBLE_DEVICES"] = value
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Update HIP visibility vars without relying on pre-detected IS_ROCM

apply_gpu_ids only syncs HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES when IS_ROCM is already true, but worker entrypoints call apply_gpu_ids(...) before hardware detection initializes that flag. In ROCm environments where HIP/ROCR visibility is pre-set (common under schedulers), the requested GPU mask will not be applied to the effective backend env vars, so jobs can run on unintended devices.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

@danielhanchen danielhanchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2 review of commit 810b833. The goal of this PR is to add first-class AMD ROCm/HIP support across installer, Python stack, llama.cpp prebuilts, and hardware detection. As a summary, the commit under review fixed several round 1 findings but introduced two regressions in install_llama_prebuilt.py (Windows ROCm CPU fallback) that need follow-up.

Consolidated findings from 12 reviewers (7 parallel reviewer.py runs + 3 Sonnet subagents + gemini-code-assist + chatgpt-codex-connector) on commit 810b833b:

Reviewers Severity File Finding
6/12 High studio/install_llama_prebuilt.py:3155 Reverting the and not host.has_rocm guard on published_choice means Windows AMD ROCm hosts always resolve to the published windows-cpu bundle and never reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives
6/12 High studio/install_llama_prebuilt.py:4236 validate_server() adds --n-gpu-layers 1 for any host.has_rocm host, including CPU fallback bundles on AMD Windows without a HIP prebuilt; the CPU binary is then launched as if it were GPU-capable and preflight fails
3/12 Med studio/install_python_stack.py:204 _ensure_rocm_torch() returns before the bitsandbytes install when HIP torch is already present, so unsloth studio update runs on existing ROCm setups never install the AMD bitsandbytes dependency
2/12 Med install.sh:1307 The AMD bitsandbytes install only runs in the fresh-install branch; migrated environments (_MIGRATED=true) skip it even on ROCm hosts
2/12 Med install.sh:1269-1276 Radeon wheel picker selects the newest version for each of torch/torchvision/torchaudio independently, which can assemble a version-mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0 from the live rocm-rel-7.2.1 index)
2/12 Med install.sh:1032 get_torch_index_url() returns /rocm* for aarch64 / arm64 Linux hosts even though PyTorch only publishes ROCm wheels for x86_64
2/12 Med studio/backend/utils/hardware/amd.py:208 + studio/install_python_stack.py:107 + studio/install_llama_prebuilt.py:2573 Newly added ROCm detection paths ignore HIP_VISIBLE_DEVICES="" / ROCR_VISIBLE_DEVICES=-1; a process explicitly masking AMD GPUs still triggers ROCm install and ROCm torch repair
1/12 Med studio/install_llama_prebuilt.py:2549 Older nvidia-smi versions without -L support raise an exception in the new -L try block, leaving has_physical_nvidia=False even when the --query-gpu=index,uuid,compute_cap block succeeds; linux diagnostics at line ~4390 silently skip
1/12 Low studio/backend/utils/hardware/amd.py:220-223 _first_visible_amd_gpu_id() uses break after an empty-first-token env var, silently returning "0" for malformed inputs like HIP_VISIBLE_DEVICES=",2" instead of falling through to lower-priority vars
1/12 Low install.sh:1000-1010 _has_usable_nvidia_gpu() has no timeout protection on nvidia-smi -L; on a stuck driver the install script hangs indefinitely
1/12 Low studio/install_llama_prebuilt.py:2573 + 3003 New import re as _re / import re as _re_rocm local aliases duplicate the top-level import re in the file

See inline comments for concrete code suggestions.

Concrete suggestions for each finding below.

Comment thread studio/install_llama_prebuilt.py Outdated

published_choice: AssetChoice | None = None
if host.is_windows and host.is_x86_64:
# Always try the published Windows CPU bundle, even on AMD ROCm
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[6/12 reviewers - High] Reverting the and not host.has_rocm guard here means every Windows x86_64 host (including AMD ROCm ones) sets published_choice = published_asset_choice_for_kind(release, "windows-cpu"), then apply_approved_hashes([published_choice]) returns successfully, and resolve_asset_choice() -- which is the only entry point to the new resolve_upstream_asset_choice() Windows HIP branch -- is never reached. The commit message assumption that the upstream resolver would pick HIP first is wrong because the upstream resolver is only consulted when published_choice raises PrebuiltFallback, which it does not for a hash-approved CPU bundle. Re-instate the guard and rely on resolve_upstream_asset_choice() for the HIP -> CPU fallback chain.

Suggested change
# Always try the published Windows CPU bundle, even on AMD ROCm
published_choice: AssetChoice | None = None
if host.is_windows and host.is_x86_64:
# AMD Windows hosts should prefer a hash-approved published
# Windows HIP bundle when one exists, but otherwise fall through
# to resolve_asset_choice() so the upstream HIP prebuilt is
# tried before the CPU fallback. Hard-pinning the published
# windows-cpu bundle here would make the new HIP path
# unreachable.
if host.has_rocm:
published_choice = published_asset_choice_for_kind(
release, "windows-hip"
)
else:
published_choice = published_asset_choice_for_kind(
release, "windows-cpu"
)

Comment thread studio/install_llama_prebuilt.py Outdated
if host.has_usable_nvidia or (host.is_macos and host.is_arm64):
if (
host.has_usable_nvidia
or host.has_rocm
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[6/12 reviewers - High] Gating --n-gpu-layers 1 on host.has_rocm breaks the intentional CPU fallback on AMD Windows hosts without a HIP prebuilt: when resolve_upstream_asset_choice() falls back to windows-cpu, the CPU binary is still launched with --n-gpu-layers 1 and preflight validation fails. Switch to gating on the actual install_kind the resolver chose, and thread it through from the caller.

Suggested change
or host.has_rocm
def validate_server(
server_path: Path,
probe_path: Path,
host: HostInfo,
install_dir: Path,
*,
runtime_line: str | None = None,
install_kind: str | None = None,
) -> None:

and in the command builder a few lines down:

_gpu_kinds = {"linux-cuda", "linux-rocm", "windows-cuda", "windows-hip", "macos-arm64"}
if install_kind is not None:
    _enable_gpu_layers = install_kind in _gpu_kinds
else:
    _enable_gpu_layers = host.has_usable_nvidia or (
        host.is_macos and host.is_arm64
    )
if _enable_gpu_layers:
    command.extend(["--n-gpu-layers", "1"])

Then pass install_kind = choice.install_kind from the call site in validate_prebuilt_choice().

Comment thread studio/install_python_stack.py Outdated
except (OSError, subprocess.TimeoutExpired):
probe = None
if probe is not None and probe.returncode == 0 and probe.stdout.decode().strip():
return # torch already has HIP/ROCm backend
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[3/12 reviewers - Med] _ensure_rocm_torch() exits before the bitsandbytes install when torch already links against HIP, so the normal unsloth studio update path on existing ROCm venvs never installs the AMD bitsandbytes build. Fix: gate the torch reinstall on has_hip_torch but keep the bitsandbytes install unconditional so it runs for both the fresh-reinstall and the already-HIP case.

Suggested change
return # torch already has HIP/ROCm backend
has_hip_torch = (
probe is not None
and probe.returncode == 0
and probe.stdout.decode().strip() != ""
)
if not has_hip_torch:
# (existing torch wheel install block -- indent one level)

Then move the bitsandbytes install out of the if not has_hip_torch block so it always runs once this function returns.

Comment thread install.sh Outdated
fi
# AMD ROCm: install bitsandbytes (once, after torch, for all ROCm paths)
case "$TORCH_INDEX_URL" in
*/rocm*)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/12 reviewers - Med] The AMD bitsandbytes install only runs inside the elif [ -n "$TORCH_INDEX_URL" ] fresh-install branch. Migrated environments (_MIGRATED=true, reached when the venv layout changes between versions) skip it even on ROCm hosts, leaving the AMD bitsandbytes build un-installed for upgrades. Add a matching block to the migrated branch.

Suggested change
*/rocm*)
# AMD ROCm: install bitsandbytes even in migrated environments so
# existing ROCm installs gain the AMD bitsandbytes build without a
# fresh reinstall.
if [ "$SKIP_TORCH" = false ]; then
case "$TORCH_INDEX_URL" in
*/rocm*)
substep "installing bitsandbytes for AMD ROCm..."
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"
;;
esac
fi

Comment thread install.sh
if [ -z "$_torch_whl" ] || [ -z "$_tv_whl" ] || [ -z "$_ta_whl" ]; then
substep "[WARN] Radeon repo lacks a complete wheel set for this Python; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/12 reviewers - Med] The Radeon wheel picker independently selects the newest version for each of torch, torchvision, torchaudio, which can assemble a version-mismatched trio from the live Radeon listing (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0 on the current rocm-rel-7.2.1 index). Let uv resolve a compatible set by passing version constraints + --no-index --find-links instead of explicit wheel URLs.

Suggested change
"$TORCH_CONSTRAINT" torchvision torchaudio \
# Use version constraints + --find-links + --no-index so
# uv resolves a compatible torch / torchvision / torchaudio
# set from the Radeon listing (instead of picking the
# highest-version wheel for each package independently,
# which can assemble a version-mismatched stack).
if [ -n "$_tri_whl" ]; then
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--no-index \
--find-links "$_RADEON_BASE_URL" \
"$TORCH_CONSTRAINT" \
"torchvision<0.26.0" \
"torchaudio<2.11.0" \
"triton<3.7"
else
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
--no-index \
--find-links "$_RADEON_BASE_URL" \
"$TORCH_CONSTRAINT" \
"torchvision<0.26.0" \
"torchaudio<2.11.0"
fi

Comment thread install.sh
_smi="/usr/bin/nvidia-smi"
fi
fi
if [ -z "$_smi" ]; then
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/12 reviewers - Med] get_torch_index_url() happily returns $_base/rocm* for aarch64 / arm64 Linux hosts even though PyTorch only publishes ROCm wheels for linux_x86_64. The resulting uv pip install ... --index-url https://download.pytorch.org/whl/rocmX.Y fails to resolve a wheel for the running interpreter. Skip the ROCm branch entirely on non-x86_64 architectures and fall back to CPU wheels.

Suggested change
if [ -z "$_smi" ]; then
if [ -z "$_smi" ]; then
# No NVIDIA GPU -- check for AMD ROCm GPU.
# PyTorch only publishes ROCm wheels for linux-x86_64; skip the
# ROCm branch entirely on aarch64 / arm64 / other architectures
# so non-x86_64 Linux hosts fall back cleanly to CPU wheels.
case "$(uname -m)" in
x86_64|amd64) : ;;
*) echo "$_base/cpu"; return ;;
esac
if ! _has_amd_rocm_gpu; then
echo "$_base/cpu"; return
fi

Honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES
in that order (HIP respects all three). Returns ``"0"`` when none are
set, and ``None`` when the env var explicitly narrows to zero GPUs
("" or "-1"), so callers can short-circuit to "available: False".
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/12 reviewers - Med] Follow-up on the round 1 visibility helper: the same masking should also apply to detect_host() in studio/install_llama_prebuilt.py and _has_rocm_gpu() in studio/install_python_stack.py. A user explicitly setting HIP_VISIBLE_DEVICES="" or ROCR_VISIBLE_DEVICES=-1 to disable AMD GPUs for their process currently still triggers ROCm install and ROCm torch reinstall because those call sites do not consult the visibility env vars. This helper gets the AMD monitoring path right but the installer-level ROCm detectors do not.

Suggest adding a shared helper near the top of both installer files:

def _rocm_visible_devices_disabled() -> bool:
    for env_name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        raw = os.environ.get(env_name)
        if raw is not None and raw.strip() in ("", "-1"):
            return True
    return False

and short-circuiting _has_rocm_gpu() and the detect_host() ROCm loop when it returns True.

Comment thread studio/backend/utils/hardware/amd.py Outdated
first = raw.split(",", 1)[0].strip()
if first:
return first
break
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1/12 reviewers - Low] With break here, a malformed env var like HIP_VISIBLE_DEVICES=",2" (leading comma, non-empty) gives first="" and exits the loop, returning "0" even though the lower-priority ROCR_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES may still hold a valid GPU id. Use continue to fall through to the next env var in priority order.

Suggested change
break
# Leading comma or all-whitespace first token -- fall through to
# the next env var in priority order rather than silently
# returning GPU 0.
continue

Follow-up to 810b833 addressing review findings on the first round of
hardening commits:

- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the
  resolved install_kind instead of host.has_rocm. AMD Windows hosts
  without a HIP prebuilt fall back to windows-cpu and must not be
  validated with GPU layers; thread install_kind through from the
  caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the
  "not has_rocm" guard on the published windows-cpu bundle so AMD
  Windows hosts reach resolve_upstream_asset_choice() where the new
  HIP prebuilt path lives. Prefer a published windows-hip bundle first
  when one exists, fall through to upstream HIP + upstream CPU
  otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia
  when the secondary --query-gpu block confirms a working NVIDIA GPU,
  so older nvidia-smi versions without -L support do not silently skip
  the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" /
  "import re as _re_rocm" local aliases in favour of the existing
  top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD
  bitsandbytes install unconditionally after the HIP-torch probe so
  "unsloth studio update" on venvs that already have ROCm torch still
  gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so
  aarch64 / arm64 Linux hosts do not hit the ROCm wheel index
  (PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment
  branch so upgrades pick it up for ROCm hosts instead of only the
  fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints +
  --no-index --find-links to uv instead of explicit wheel URLs so a
  version-compatible torch / torchvision / torchaudio triple is
  resolved, rather than picking the highest-version wheel for each
  package independently.
- studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall
  through to lower-priority visibility env vars when the first entry
  is malformed (leading comma, all-whitespace first token) instead of
  silently returning GPU 0.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm (HIP) across the installation scripts and backend utilities. Key changes include the addition of GPU detection helpers for ROCm, logic to select and install appropriate ROCm PyTorch wheels (including from repo.radeon.com for Radeon users), and a new AMD-specific GPU monitoring backend. The PR also enables ROCm prebuilt asset resolution for llama-server and expands RDNA architecture support in the kernel utilities. Review feedback highlights several opportunities to reduce code duplication regarding GPU detection logic and package installation steps across the shell and Python scripts.

Comment thread install.sh
Comment on lines +1025 to +1031
if _has_usable_nvidia_gpu; then
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
fi
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block for finding the nvidia-smi executable is redundant, as the same logic already exists within the _has_usable_nvidia_gpu helper function. To improve maintainability and eliminate code duplication, consider modifying _has_usable_nvidia_gpu to return or export the path to nvidia-smi when a usable GPU is found. The caller can then use this path directly instead of re-discovering it.

Comment thread install.sh
Comment on lines +1251 to +1261
# AMD ROCm: install bitsandbytes even in migrated environments so
# existing ROCm installs gain the AMD bitsandbytes build without a
# fresh reinstall.
if [ "$SKIP_TORCH" = false ]; then
case "$TORCH_INDEX_URL" in
*/rocm*)
substep "installing bitsandbytes for AMD ROCm..."
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"
;;
esac
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for installing bitsandbytes for AMD ROCm is duplicated here and again on lines 1340-1346. You can avoid this repetition by moving this block to after the main if/elif structure that handles migrated vs. fresh installs. This would centralize the bitsandbytes installation logic for ROCm, making the script cleaner and easier to maintain.

Comment on lines +772 to +800
if IS_WINDOWS and not NO_TORCH and not _has_usable_nvidia_gpu():
# Validate actual AMD GPU presence (not just tool existence)
import re as _re_win

def _win_amd_smi_has_gpu(stdout: str) -> bool:
return bool(_re_win.search(r"(?im)^gpu\s*[:\[]\s*\d", stdout))

_win_amd_gpu = False
for _wcmd, _check_fn in (
(["hipinfo"], lambda out: "gcnarchname" in out.lower()),
(["amd-smi", "list"], _win_amd_smi_has_gpu),
):
_wexe = shutil.which(_wcmd[0])
if not _wexe:
continue
try:
_wr = subprocess.run(
[_wexe, *_wcmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 10,
)
except Exception:
continue
if _wr.returncode == 0 and _check_fn(_wr.stdout):
_win_amd_gpu = True
break
if _win_amd_gpu:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to detect an AMD GPU on Windows is very similar to the logic in studio/install_llama_prebuilt.py. To improve maintainability and avoid duplication, consider refactoring this detection logic into a shared helper function that both scripts can import and use. This would centralize the logic and make future updates easier.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Member Author

@danielhanchen danielhanchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 3 review of commit 8636fa6. The goal of this PR is to add first-class AMD ROCm/HIP support across the installer, Python stack, llama.cpp prebuilt selection, hardware detection, and runtime monitoring. As a summary, the commit under review cleanly fixed the round 2 regressions around validate_server and Windows HIP routing, but round 3 surfaced three remaining issues: the Studio ROCm torch repair path still misses an x86_64 guard, the Linux ROCm llama.cpp prebuilt picker ignores the host ROCm runtime version, and the round 2 --no-index switch in the Radeon install branch unintentionally breaks transitive dependency resolution.

Consolidated findings from 12 reviewers (7 parallel reviewer.py runs + 3 Sonnet subagents + gemini-code-assist + chatgpt-codex-connector) on commit 8636fa63:

Reviewers Severity File Finding
7/12 High studio/install_python_stack.py:168 _ensure_rocm_torch() is missing the x86_64 guard that was added to install.sh. On non-x86_64 Linux ROCm hosts, unsloth studio update tries to install from download.pytorch.org/whl/rocm* where no wheels exist for that platform
4/12 High install.sh:1276 Round 2's --no-index --find-links switch breaks transitive dependency resolution for the Radeon wheel path: --no-index instructs uv to consult only the Radeon listing, which has no filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions etc., so a fresh-venv install fails to resolve the dep graph
3/12 High studio/install_llama_prebuilt.py:3011 Linux ROCm prebuilt picker always picks the numerically newest rocm-<version> tarball. A ROCm 6.4 host would now download the rocm-7.2 tarball, fail preflight, and fall back to source build even though a compatible 6.4 prebuilt exists. The picker should clip to <= host_rocm_version when the host version is known
2/12 Med studio/install_python_stack.py:246 The new unconditional bitsandbytes>=0.49.1 install is a no-op when a CPU/CUDA bitsandbytes>=0.49.1 is already present in the venv; existing environments never upgrade to the AMD build. Use --force-reinstall
2/12 Med Hardware / install_python_stack / install_llama_prebuilt ROCm detection Pre-existing: ROCm detection paths still ignore HIP_VISIBLE_DEVICES="" / ROCR_VISIBLE_DEVICES=-1. Processes that explicitly mask AMD GPUs still trigger ROCm install and ROCm torch repair
1/12 Med studio/install_llama_prebuilt.py:3061 New Windows HIP prebuilt lookup only matches llama-<tag>-bin-win-hip-radeon-x64.zip. Pinned older upstream llama.cpp tags (e.g. b4832) use llama-<tag>-bin-win-hip-x64-gfx*.zip; those installs silently drop to CPU
1/12 Med studio/backend/core/training/worker.py:291 The new timeout = 1800 if is_hip else 300 timestamps applies a 300s cap to non-HIP causal-conv1d / mamba-ssm source builds too; pre-PR behavior was "no timeout" for non-HIP. Slow non-ROCm builds (Linux aarch64, unsupported torch/CUDA combos) can now be aborted after 5 minutes
1/12 Low studio/install_llama_prebuilt.py:4295 validate_server() fallback path when install_kind is None excludes host.has_rocm. Dead code today (the only caller passes install_kind) but a future call site omitting it would silently skip GPU probe on ROCm hosts
1/12 Low studio/install_python_stack.py:227 When _detect_rocm_version() returns a version older than any entry in _ROCM_TORCH_INDEX (ROCm < 6.0), _ensure_rocm_torch() returns early and the bitsandbytes install is not reached

See inline comments for concrete fixes.

Concrete suggestions for each finding below.

"""
# Explicit OS guard so the helper is safe to call from any context --
# ROCm wheels are only published for Linux x86_64.
if IS_WINDOWS or IS_MACOS:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[7/12 reviewers - High] _ensure_rocm_torch() is missing the x86_64 guard that was added to install.sh in round 2. On a Linux aarch64 / arm64 ROCm host, unsloth studio update will force-reinstall from download.pytorch.org/whl/rocm*, which only publishes linux_x86_64 wheels, and the updater aborts with a missing-wheel error instead of cleanly no-oping. Fix: mirror the shell guard with platform.machine().

Suggested change
if IS_WINDOWS or IS_MACOS:
# Explicit OS / architecture guards so the helper is safe to call
# from any context -- PyTorch only publishes ROCm wheels for
# linux_x86_64, so aarch64 / arm64 hosts must skip this repair path
# instead of failing the update with a missing-wheel error.
if IS_WINDOWS or IS_MACOS:
return
if platform.machine().lower() not in {"x86_64", "amd64"}:
return

Comment thread install.sh Outdated
# newest version-compatible triple.
if [ -n "$_tri_whl" ]; then
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--no-index \
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[4/12 reviewers - High] Round 2 added --no-index together with --find-links and version constraints to avoid mixing Radeon wheels with PyPI defaults, but --no-index instructs uv to consult only the Radeon HTML listing, which does not contain torch's transitive dependencies (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...). On a fresh venv (this is the first install step in the fresh-install path), uv fails to resolve the dependency graph and aborts.

Fix: drop --no-index, keep --find-links, and switch back to passing explicit torch / torchvision / torchaudio wheel URLs. Address the version-mismatch concern that motivated the round 2 change by validating that the three wheels share a matching torch / torchvision / torchaudio release pair before installing them, and falling through to the standard ROCm index otherwise.

continue
_parts = tuple(int(p) for p in _m.group(1).split("."))
rocm_candidates.append((_parts, _name))
rocm_candidates.sort(reverse = True)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[3/12 reviewers - High] The Linux ROCm prebuilt picker sorts rocm_candidates descending and always takes the top entry. A ROCm 6.4 host would download a published rocm-7.2 prebuilt, fail preflight, and fall back to a source build even though a compatible rocm-6.4 prebuilt exists in the same release.

Fix: detect the host's ROCm major.minor (best-effort from /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose (major, minor) is <= host version. Fall back to the newest candidate only when no compatible one exists, so preflight can still catch true incompatibilities.

Comment thread studio/install_python_stack.py Outdated
"--no-cache-dir",
"bitsandbytes>=0.49.1",
constrain = False,
)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/12 reviewers - Med] The new unconditional bitsandbytes>=0.49.1 install is a no-op when a CPU or CUDA bitsandbytes>=0.49.1 is already present in the venv. A user running unsloth studio update on a machine where the prior venv already had a standard bitsandbytes never gets upgraded to the AMD ROCm build. Fix: add --force-reinstall so the AMD build replaces whatever was there.

Suggested change
)
pip_install(
"bitsandbytes (AMD)",
"--force-reinstall",
"--no-cache-dir",
"bitsandbytes>=0.49.1",
constrain = False,
)

Comment thread studio/backend/core/training/worker.py Outdated
]

# Source compilation on ROCm can take 10-30 minutes; use a generous timeout
timeout = 1800 if is_hip else 300
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1/12 reviewers - Med] timeout = 1800 if is_hip else 300 applies a 300s cap to all non-HIP causal-conv1d / mamba-ssm source builds. The pre-PR behaviour was no timeout at all; this change silently aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combinations) after 5 minutes. Fix: keep the generous ROCm timeout but omit timeout entirely for non-HIP runs so the pre-PR no-timeout behaviour is preserved.

Suggested change
timeout = 1800 if is_hip else 300
# Source compilation on ROCm can take 10-30 minutes; use a generous
# timeout. Non-HIP installs preserve the pre-existing "no timeout"
# behaviour so unrelated slow installs (e.g. causal-conv1d source
# build on Linux aarch64 or unsupported torch/CUDA combinations)
# are not aborted at 5 minutes by this PR.
_run_kwargs: dict[str, Any] = {
"stdout": _sp.PIPE,
"stderr": _sp.STDOUT,
"text": True,
}
if is_hip:
_run_kwargs["timeout"] = 1800

Comment thread studio/install_llama_prebuilt.py Outdated
if install_kind is not None:
_enable_gpu_layers = install_kind in _gpu_kinds
else:
_enable_gpu_layers = host.has_usable_nvidia or (
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1/12 reviewers - Low] The install_kind is None fallback no longer includes host.has_rocm. Dead code in the current single-caller world (validate_prebuilt_choice always passes install_kind), but a future call site that omits install_kind would silently skip the GPU probe on ROCm hosts, which is inconsistent with the documented intent. Either document the exclusion or restore host.has_rocm to the fallback expression.

Address issues surfaced by the round 3 reviewers on top of 8636fa6:

- install_python_stack.py _ensure_rocm_torch: add the same `x86_64`
  guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts
  must skip the repair path entirely; PyTorch only publishes ROCm
  wheels for linux_x86_64, and without this guard
  `unsloth studio update` aborts with a missing-wheel error on non
  x86_64 hosts.
- install_llama_prebuilt.py resolve_upstream_asset_choice: add a
  best-effort _detect_host_rocm_version() helper (reading
  /opt/rocm/.info/version, amd-smi version, hipconfig --version) and
  filter rocm_candidates to entries whose major.minor is <= host
  version. Falls back to the newest candidate only when no compatible
  one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being
  handed the numerically newest rocm-7.2 bundle (which fails preflight
  and forces a source build).
- install.sh: remove the round 2 --no-index switch from the Radeon
  wheel branch. --no-index forced uv to ignore PyPI entirely, which
  broke transitive dependency resolution (filelock, sympy, networkx,
  jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv.
  Restore the round 1 explicit wheel URL invocation but add a
  torch / torchvision / torchaudio version-pair sanity check so a
  mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio
  2.9.0) falls back to the standard ROCm index instead of installing a
  broken combination.
- install_python_stack.py _ensure_rocm_torch: restructure the
  "tag is None" path so it no longer short-circuits the bitsandbytes
  install. On a ROCm runtime older than anything in
  _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the
  AMD bitsandbytes install.
- studio/backend/core/training/worker.py: restore the pre-PR
  "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source
  builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts
  slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos)
  after 5 minutes; omit timeout for the non-HIP branch so the cap
  only applies to ROCm source builds.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

@danielhanchen
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements comprehensive support for AMD ROCm GPUs, adding automated hardware detection, specialized PyTorch wheel selection, and source compilation support for training kernels. Key changes include new GPU monitoring utilities for AMD and expanded architecture support for RDNA GPUs. Feedback suggests increasing SMI query timeouts to handle slower systems, refining exception handling in the hardware backend to prevent masking errors, and adopting more flexible version constraints during PyTorch reinstalls.

Comment on lines +23 to +42
def _run_amd_smi(*args: str, timeout: int = 5) -> Optional[Any]:
"""Run amd-smi with the given arguments and return parsed JSON, or None."""
try:
result = subprocess.run(
["amd-smi", *args, "--json"],
capture_output = True,
text = True,
timeout = timeout,
)
except (OSError, subprocess.TimeoutExpired) as e:
logger.warning("amd-smi query failed: %s", e)
return None
if result.returncode != 0 or not result.stdout.strip():
logger.warning("amd-smi returned code %d", result.returncode)
return None
try:
return json.loads(result.stdout)
except json.JSONDecodeError:
logger.warning("Failed to parse amd-smi JSON output")
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _run_amd_smi function uses a hardcoded timeout of 5 seconds. Given that AMD SMI operations can be slow, especially on systems with many GPUs, this might lead to unnecessary failures. Consider making the timeout configurable or increasing it to a more robust value.

Comment on lines +405 to +431
def _smi_query(func_name: str, *args, **kwargs) -> Optional[Dict[str, Any]]:
"""Run a query against the appropriate SMI backend (amd-smi or nvidia-smi).

Returns the result dict if available, or None on failure/unavailability.
"""
if IS_ROCM:
backend_name = "amd-smi"
try:
from . import amd as _backend
except Exception as e:
logger.warning("%s import failed: %s", backend_name, e)
return None
else:
backend_name = "nvidia-smi"
try:
from . import nvidia as _backend
except Exception as e:
logger.warning("%s import failed: %s", backend_name, e)
return None
try:
func = getattr(_backend, func_name)
result = func(*args, **kwargs)
if result.get("available"):
return result
except Exception as e:
logger.warning("%s %s query failed: %s", backend_name, func_name, e)
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _smi_query function uses a broad except Exception block. This can mask unexpected errors during backend initialization or query execution. It is better to catch specific exceptions like ImportError or subprocess.SubprocessError to ensure that genuine issues are not swallowed.

Comment on lines +234 to +244
pip_install(
f"ROCm torch ({tag})",
"--force-reinstall",
"--no-cache-dir",
"torch>=2.4,<2.11.0",
"torchvision<0.26.0",
"torchaudio<2.11.0",
"--index-url",
index_url,
constrain = False,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pip_install call for ROCm torch uses a hardcoded constraint torch>=2.4,<2.11.0. If the environment requires a different version due to system-level ROCm updates, this might cause conflicts. Consider using a variable or a more flexible version range if possible.

Comment thread studio/install_llama_prebuilt.py Fixed
Comment thread studio/install_llama_prebuilt.py Fixed
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…andbytes gate

Address remaining issues surfaced by the round 4 reviewers:

- studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the
  selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever
  the caller already had a ROCm visibility env var set, not only when
  IS_ROCM has already been set by detect_hardware(). Training and
  inference workers call apply_gpu_ids() before detect_hardware()
  runs, so the old guard would leave a forked ROCm worker with a
  stale HIP_VISIBLE_DEVICES mask that no longer matched the
  narrowed CUDA_VISIBLE_DEVICES selection.
- install.sh get_radeon_wheel_url: accept X.Y ROCm versions in
  addition to X.Y.Z. The `/opt/rocm/.info/version` file and some
  hipconfig versions report only two components, and the Radeon
  repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/
  directories, so treating X.Y as invalid caused Radeon hosts to fall
  back to the generic ROCm index even when a matching AMD wheel set
  existed.
- install_python_stack.py _ensure_rocm_torch: only install the AMD
  bitsandbytes build when the venv actually has a ROCm-compatible
  torch (either already present or just installed by this function).
  Previously the bitsandbytes install ran unconditionally, which
  could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch
  on hosts where the ROCm runtime is older than any entry in
  _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing
  CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades.
Copy link
Copy Markdown
Member Author

@danielhanchen danielhanchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 4 (final) review of the PR at 5341e46. The goal of this PR is to add first-class AMD ROCm/HIP support across the installer, Python stack, llama.cpp prebuilt selection, hardware detection, and runtime monitoring. As a summary, after three rounds of hardening the PR is substantially cleaner; the remaining findings are smaller and split into two groups: concrete bugs the round 4 reviewers flagged in the updater / installer, and intentional design trade-offs that reviewers suggest but which can be followed up after merge.

Consolidated findings from 12 reviewers (7 parallel reviewer.py runs + 3 Sonnet subagents + gemini-code-assist + chatgpt-codex-connector) on commit 5341e466:

Reviewers Severity File Finding
3/12 High studio/backend/utils/hardware/hardware.py:1357 apply_gpu_ids() only mirrors the narrowed selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES when IS_ROCM is already True, but workers call apply_gpu_ids() before detect_hardware() runs in the worker process. A ROCm worker inheriting HIP_VISIBLE_DEVICES=2,3 and selecting only [3] ends up with CUDA_VISIBLE_DEVICES=3 / HIP_VISIBLE_DEVICES=2,3, desyncing the ROCm runtime mask
2/12 High install.sh get_radeon_wheel_url Only accepts X.Y.Z ROCm versions. /opt/rocm/.info/version and some hipconfig versions report only X.Y, so the Radeon repo path silently falls back to the generic ROCm index even when rocm-rel-X.Y/ is available
3/12 Med studio/install_python_stack.py:241-246 The round 3 fix that runs bitsandbytes>=0.49.1 unconditionally leaves an AMD bitsandbytes layered on top of a CPU/CUDA torch when the host ROCm runtime is older than every entry in _ROCM_TORCH_INDEX (the "tag is None" branch). Only install bitsandbytes when the venv actually has a ROCm-compatible torch
2/12 Med studio/install_python_stack.py:241-246 The new bitsandbytes install does not use --force-reinstall, so on upgrades where the venv already has a CPU/CUDA bitsandbytes>=0.49.1, the AMD build is never swapped in
1/12 Med studio/install_llama_prebuilt.py detect_host + studio/install_python_stack.py _has_rocm_gpu Pre-existing design decision: these installer-level detectors intentionally ignore HIP_VISIBLE_DEVICES="" / ROCR_VISIBLE_DEVICES=-1 (they provision based on physical hardware, not env-var masks). Not a showstopper, the Studio runtime amd.py and hardware.py already honour the masks correctly
1/12 Low install.sh Radeon version-pair check comment Incorrect comment says "torchvision = torch.minor - 5" while the actual arithmetic (and the correct relation) is + 15
1/12 Low studio/backend/core/training/worker.py:309-314 _run_kwargs.get("timeout") in the TimeoutExpired handler would emit "timed out after None" on a non-HIP path, but is dead code (subprocess.run without timeout= never raises TimeoutExpired)
1/12 Info studio/install_llama_prebuilt.py _detect_host_rocm_version Near-duplicate of _detect_rocm_version() in install_python_stack.py. Follow-up refactor opportunity, not a blocker
1/12 Info studio/install_python_stack.py Windows ROCm torch install Currently prints a warning pointing to docs instead of auto-installing pinned Windows Radeon wheels. Treated as a follow-up feature, not a regression

See inline comments for the concrete fixes.

Concrete suggestions for each finding below. The three High findings have been addressed in a follow-up commit (5305c319).

pre-commit-ci Bot and others added 11 commits April 8, 2026 11:11
…ped GPU id

Two medium-severity defensive fixes from the gemini-code-assist review on
the AMD monitoring backend:

1. _extract_gpu_metrics may return a dict where every value is None when
   amd-smi succeeds (zero exit) but the JSON envelope contains no usable
   fields (error response, unsupported card). The new _has_real_metrics
   helper lets get_primary_gpu_utilization surface available:False and
   lets get_visible_gpu_utilization skip ghost device rows so the UI
   does not render placeholder cards with empty numbers.

2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit":
   "none"}, including the per-GPU id. The previous int(raw_id) call
   silently fell back to the enumeration index in that case, losing the
   real GPU id. Routing raw_id through the existing _parse_numeric
   helper handles bare ints, floats, strings, and the dict shape
   uniformly, with a debug log on parse failure.
…le parser

Both _detect_rocm_version (install_python_stack.py) and
_detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version
or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed
parts[1]. The surrounding broad `except Exception: pass` already swallowed
the resulting IndexError, so a one-component file like "6\n" did fall
through to the next detection source -- but the control flow relied on
exception handling instead of an explicit check.

Add `if len(parts) >= 2:` guards in both helpers so the loop falls through
on its own without raising. Behaviour is unchanged for the common multi-
component case; the previously-silent IndexError path becomes an explicit
no-op.
When validate_server is called without an explicit install_kind (older
call sites that have not been updated), the fallback was only enabling
--n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts
fell through to the CPU validation path even though the prebuilt being
exercised was a HIP binary.

Add host.has_rocm to the fallback expression so the GPU offload flag is
applied consistently with the install_kind=='linux-rocm' / 'windows-hip'
branches above.
…ry_mb

The previous heuristic divided any bare number above 10_000_000 by
1024*1024 on the assumption that large unit-less values were bytes.
This misclassified small VRAM allocations: 5 MB of used VRAM reported
as 5_242_880 bytes without a unit would be taken at face value and
render as 5_242_880 MB (~5 TB) in the monitoring UI.

Modern amd-smi always provides explicit units (MiB/GiB dict form),
and legacy amd-smi returns bare numbers in MB -- the heuristic never
had a real workload to handle. Drop it and default to MB for bare
numeric input, keeping the existing unit-aware branches for dict /
string inputs unchanged.

The unrelated gemini suggestion to "default minor to 0" in the
amd-smi version awk parser was intentionally NOT applied: rocm7.0
and rocm7.1 ship different wheel sets, so silently substituting 0
for a missing minor could install the wrong wheels. The existing
reject-and-fall-through behaviour is safer.
…sing

Three medium findings from gemini-code-assist addressed in this commit:

1. _pick_radeon_wheel used grep -o and sort -V, both GNU extensions
   that are not in POSIX and break on BSD/BusyBox coreutils. install.sh
   has a #!/bin/sh shebang so the whole pipeline was rewritten as a
   single awk script that extracts all href="..." hits on each line,
   filters to wheels matching the package prefix and python tag, and
   picks the newest version via zero-padded lexical comparison. No
   external sort or grep is needed.

2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a
   leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to
   the next env var", which is surprising given the clear intent to
   narrow to device 1. Filter empty tokens after the split and return
   the first real one. An all-commas value ("," / ",,,") still falls
   through because no real tokens exist; the empty-string and "-1"
   explicit-zero cases are unchanged.

The unrelated amd-smi version awk parser suggestion was not applied
(see round 4 commit message for rationale: defaulting a missing minor
to 0 could silently install the wrong ROCm wheel set).
…k, bnb, backend label

Consolidated fix batch from a 20-parallel reviewer.py run on the current
head. Each fix is drawn from a high-consensus finding and addresses a
real bug or feature gap, not a stylistic preference.

1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five
   call sites so this branch no longer regresses main's version floor
   (main bumped to 2026.4.4 in unslothai#4876). Without this, merging 4720 would
   silently downgrade the minimum version pin for fresh installs.

2. install.sh: URL-decode Radeon wheel names before extracting the
   torch / torchvision / torchaudio version strings. Real wheel URLs
   from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...")
   so the previous `[+-]` terminator in the sed regex never matched,
   `_torch_ver` stayed empty, `_radeon_versions_match` stayed false,
   and every Radeon consumer install silently fell back to the generic
   ROCm index. Now decode %2B -> + first, then extract, then validate.

3. install.sh: the two AMD bitsandbytes install lines were running
   `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`,
   so upgrades where the venv already has a CPU/CUDA bitsandbytes
   satisfying the constraint would keep the stale non-AMD wheel. Add
   `--force-reinstall --no-cache-dir` to both call sites, matching the
   pattern already used in install_python_stack.py::_ensure_rocm_torch.

4. install_python_stack.py and install_llama_prebuilt.py: add
   `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the
   Python-side ROCm version detectors so they match the chain in
   install.sh::get_torch_index_url. Package-managed ROCm installs
   (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via
   rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig,
   or amd-smi `version` output -- without these fallbacks, `unsloth
   studio update` on such hosts returned None and skipped the ROCm
   torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before
   parsing so epoch-annotated packages parse correctly.

5. hardware.py: add a `_backend_label(device)` helper that returns
   "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and
   use it for every `"backend": ...` emission in JSON responses served
   to the Studio frontend. Internally we still represent ROCm hosts as
   DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API
   surface), but the user-facing API now correctly reports "rocm" on
   AMD boxes instead of labeling them as "cuda".

All 250 simulation scenarios pass (was 233 before this batch: added 17
new regression tests covering the version pin, %2B decoding, bnb
force-reinstall flags, dpkg/rpm fallback presence, and the
_backend_label helper's four-way truth table).
…p to 6.4

Two rounds of fixes in one commit, plus a full URL audit of every PyPI /
download.pytorch.org / repo.radeon.com reference the PR introduces.

amd.py (4 medium gemini findings on commit b3627bc):

1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util
   gate. The follow-up `vram_total_mb > 0` already handles the division
   guard, but the truthiness check was redundant and slightly surprising
   for a 0.0 valid value. Replace with explicit `is not None and > 0`
   for both vram_util and power_util.

2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding
   for non-dict envelopes. A scalar / string JSON response from amd-smi
   would raise AttributeError. Add an isinstance(data, dict) check and
   return None for unexpected shapes.

3. get_visible_gpu_utilization had the same .get() exposure on the outer
   envelope. Rewrite the gpu_list extraction as an explicit
   list/dict/else cascade so a malformed scalar envelope produces
   gpu_list=[data] and continues without raising.

4. The same function's per-entry loop also called gpu_data.get() on
   whatever was inside gpu_list. If a scalar ever leaks into the list
   (directly or via the previous fix's fallback), _extract_gpu_metrics
   would raise on the first .get() inside the helper. Skip non-dict
   entries in the loop before extracting metrics.

install.sh (URL audit finding, previously flagged by 20-reviewer as #13):

5. get_torch_index_url used `rocm6.*` in the rocm tag case statement,
   which matched rocm6.5 and rocm6.6 and emitted
   download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because
   PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the
   supported 6.x minors explicitly and add a rocm6.* fallback branch
   that clips to rocm6.4 (the last supported 6.x wheel set).

URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,
  rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,
  6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for
  6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z
  -> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the
  llama.cpp GitHub releases API endpoint all return 200.

Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but
  vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported
  6.x minors to rocm6.4 instead of producing a 403 index URL
…n.py backend label

Three high-confidence findings from a second 20-parallel reviewer.py run
on commit 7effb3a. Triaged 15 total findings and applied the three that
were confirmed as real bugs; the rest were either false positives (e.g.
"migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream
via setup.sh regardless), design decisions (e.g. visibility mask env
vars not consulted in installer detection), or edge cases the existing
fallback logic already handles.

1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe
   runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then
   only raises if `torch.cuda.is_available()` is False. On ROCm torch,
   torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.*
   API), so the guard becomes dead code on AMD hosts and multi-GPU AMD
   setups slip through even though unsloth does not support them yet.
   Add a torch.cuda.device_count() > 1 fallback inside the except so
   AMD multi-visible-device setups are flagged consistently with the
   original CUDA memory check.

2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm
   ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when
   SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user
   running `install.sh --no-torch` on an AMD host would still pull in
   bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the
   case block in an outer `[ "$SKIP_TORCH" = false ]` guard.

3. studio/backend/main.py [3/20]: the /api/system endpoint returned
   `"device_backend": get_device().value`, which is "cuda" on ROCm
   hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints
   (hardware.py) already use the _backend_label helper which swaps
   "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same
   helper so the Studio UI reports the backend consistently across all
   endpoints.

4. studio/backend/tests/test_utils.py: update test_backend_matches_device
   to call _backend_label(get_device()) instead of raw get_device().value
   so the test matches the new contract and still passes on CUDA hosts.

Tests: 258 -> 261. New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false
# a single component (e.g. "6\n" on a partial install).
if len(parts) >= 2:
return int(parts[0]), int(parts[1])
except Exception:
m = re.search(r"ROCm version:\s*(\d+)\.(\d+)", result.stdout)
if m:
return int(m.group(1)), int(m.group(2))
except Exception:
and parts[1].split("-")[0].isdigit()
):
return int(parts[0]), int(parts[1].split("-")[0])
except Exception:
if gpu_lines:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
except Exception:
# a single component (e.g. "6\n" on a partial install).
if len(parts) >= 2:
return int(parts[0]), int(parts[1])
except Exception:
m = re.search(r"ROCm version:\s*(\d+)\.(\d+)", result.stdout)
if m:
return int(m.group(1)), int(m.group(2))
except Exception:
and parts[1].split("-")[0].isdigit()
):
return int(parts[0]), int(parts[1].split("-")[0])
except Exception:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants