Skip to content

Add AMD ROCm/HIP support across installer and hardware detection#4720

Merged
danielhanchen merged 67 commits into
mainfrom
feature/rocm-support-v2
Apr 10, 2026
Merged

Add AMD ROCm/HIP support across installer and hardware detection#4720
danielhanchen merged 67 commits into
mainfrom
feature/rocm-support-v2

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

@danielhanchen danielhanchen commented Mar 31, 2026

Summary

Add AMD ROCm GPU support across the installer, Python dependency stack, llama.cpp prebuilt selection, and hardware detection module. All existing CUDA, CPU, macOS, and Windows pathways are preserved -- AMD ROCm only activates as a last resort when no NVIDIA GPU is found.

Built on top of the original work from @edamamez (#4708) and @GoldenGrapeGentleman (#4428, #4448, #4449, #4450) -- their commits are preserved with attribution. Supersedes those PRs with bug fixes, validation hardening, and a comprehensive test suite.

Test suite (95 Python + 23 shell tests) is in companion PR #4824.

Radeon-specific wheel installation from @iswaryaalex (#4770) is included.

Changes by file

1. install.sh -- ROCm detection + Radeon wheel selection

Extend get_torch_index_url() to detect AMD ROCm when nvidia-smi is not found. Detection chain: rocminfo > amd-smi > /opt/rocm version file > hipconfig > dpkg-query > rpm. ROCm 7.2+ capped to rocm7.1 index (torch 2.11.0 exceeds current <2.11.0 bound). ROCm < 6.0 rejected (no PyTorch wheels exist). Includes:

  • Validation guard rejecting malformed tags (e.g. rocm. from garbled amd-smi)
  • Debian epoch prefix stripping for dpkg-query
  • ROCm status messaging and bitsandbytes install for AMD
  • Radeon GPU auto-detection via rocminfo Marketing Name
  • Radeon-specific wheel installation from repo.radeon.com with fallback to standard ROCm index
  • Empty triton arg handling (conditional inclusion to avoid PEP508 error)
  • Relaxed wheel suffix matching for manylinux_2_28_x86_64 wheels
  • Hardened amd-smi list probe to require GPU data rows, not just headers
  • Hardened rocminfo probe to reject gfx000 CPU agent (gfx[1-9] instead of gfx[0-9])

2. install_python_stack.py -- ROCm torch reinstall

Add _detect_rocm_version() and _ensure_rocm_torch() to handle the case where unsloth studio update gives a venv CPU-only torch on a ROCm host. Probes the host, checks if torch has the HIP backend, reinstalls from the correct ROCm wheel index if needed. Also adds:

  • _has_usable_nvidia_gpu() to validate nvidia-smi reports a real GPU before giving NVIDIA precedence
  • _has_rocm_gpu() with hardened amd-smi and rocminfo probes
  • amd-smi version as additional ROCm version detection fallback
  • Windows AMD GPU warning that validates actual GPU presence
  • Only skips reinstall when HIP is already present (not stale CUDA builds)

3. install_llama_prebuilt.py -- ROCm prebuilt selection

Add has_rocm to HostInfo, extend detect_host() to probe for ROCm, and route ROCm hosts to upstream prebuilts:

Host Behavior
Linux x86_64 ROCm (no NVIDIA) Try ROCm 7.2 prebuilt, fall back to source build with HIP
Windows x86_64 ROCm (no NVIDIA) Try HIP prebuilt, fall back to CPU with log message
Any with NVIDIA Normal CUDA path (ROCm ignored)

Also adds runtime_payload_health_groups for linux-rocm and windows-hip install kinds.

4. Hardware IS_ROCM flag + AMD monitoring

  • detect_hardware() sets IS_ROCM=True when torch.version.hip is present (DeviceType stays CUDA)
  • get_package_versions() returns both cuda and rocm keys
  • IS_ROCM exported via __getattr__ for live value access after detection
  • New amd.py module for AMD GPU monitoring via amd-smi with unit-aware VRAM parsing (GB vs GiB distinction)
  • HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES support in _get_parent_visible_gpu_spec() and apply_gpu_ids()

5. is_rdna() RDNA2/3/3.5/4 expansion

Expand is_rdna() to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201).

6. worker.py ROCm Mamba/SSM source builds

  • Probe torch.version.hip in causal-conv1d environment detection
  • ROCm source compilation path with hipcc validation and 1800s timeout
  • Prefer uv pip install when available for faster dependency resolution

7. tokenizer_utils.py error message

Replace "We do not support AMD / Intel machines yet" with a helpful message pointing to ROCm install docs.

Test plan

  • bash -n install.sh -- syntax check passed
  • python -m py_compile on all modified Python files -- passed
  • Live B200 test: CUDA path unchanged, IS_ROCM=False on NVIDIA
  • Backwards compatibility audit: no regressions for NVIDIA/CPU/macOS/Windows users
  • Shell and Python AMD detection probe tests passed
  • bash tests/sh/test_get_torch_index_url.sh -- 23/23 (in companion PR Add ROCm test suite (companion to #4720) #4824)
  • python -m pytest tests/studio/install/test_rocm_support.py -v -- 95/95 (in companion PR Add ROCm test suite (companion to #4720) #4824)

Closes #4708, closes #4428, closes #4448, closes #4449, closes #4450.

edamamez and others added 5 commits March 31, 2026 08:37
Add AMD ROCm GPU detection to get_torch_index_url() in install.sh.
When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm
version file, hipconfig, dpkg-query, and rpm.

Includes validation guard for malformed _rocm_tag, Debian epoch prefix
stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install,
and status messaging. Shell tests expanded to 23 cases.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a
Linux host has ROCm but the venv received CPU-only torch, and reinstall
with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a
30-second timeout on the torch GPU probe subprocess.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm
via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream
prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP
prebuilt with CPU fallback). Add linux-rocm and windows-hip install
kinds to runtime_patterns_for_choice().

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add IS_ROCM flag to hardware.py detect_hardware() (set when
torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM
from __init__.py. Add "rocm" key to get_package_versions().

Replace "We do not support AMD" error in tokenizer_utils.py with a
helpful message pointing to ROCm installation docs.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Add tests/studio/install/test_rocm_support.py covering all ROCm code
paths across install_llama_prebuilt.py, install_python_stack.py,
hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks
and run without AMD hardware.

Covers: asset selection (11), runtime patterns (5), HostInfo (4),
ROCm version detection (9), torch reinstall (9), index mapping (8),
hardware flag (8), tokenizer message (2), install.sh structure (10),
and live regression (1).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7290199c26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread install.sh
Comment on lines +989 to +990
amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Continue ROCm fallback when amd-smi version is malformed

The amd-smi probe currently returns success as soon as it sees a ROCm version: field, even when the extracted version is empty/non-numeric (for example N/A), so the || fallback chain never reaches /opt/rocm/.info/version, hipconfig, or package metadata. In that case _rocm_tag is later cleared by validation and the installer incorrectly falls back to CPU on ROCm hosts that have other valid version sources.

Useful? React with 👍 / 👎.

Comment thread studio/install_python_stack.py Outdated
Comment on lines +99 to +107
probe = subprocess.run(
[
sys.executable,
"-c",
"import torch; print(torch.version.hip or torch.version.cuda or '')",
],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
timeout = 30,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Catch torch probe timeout before deciding reinstall

The ROCm torch probe uses subprocess.run(..., timeout=30) without exception handling, so a slow or hung import torch raises TimeoutExpired and aborts install_python_stack() instead of continuing install/update. This turns a defensive probe into a hard failure on degraded environments; the timeout path should be handled explicitly and treated as a recoverable probe miss.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces AMD ROCm support across the installation scripts and backend hardware utilities, including new tests and regression checks. Key changes involve adding ROCm version detection, configuring appropriate PyTorch index URLs, and ensuring bitsandbytes is installed with AMD support. Review feedback focused on improving resource management by using context managers for file operations and refining path filtering logic for better code clarity.

Comment thread studio/install_python_stack.py Outdated
os.path.join(rocm_root, "lib", "rocm_version"),
):
try:
parts = open(path).read().strip().split("-")[0].split(".")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The file reading operation is not wrapped in a context manager. Using with open(path) as f: is safer and ensures the file descriptor is closed properly.

with open(path) as f:
            parts = f.read().strip().split("-")[0].split(".")

Comment thread studio/install_llama_prebuilt.py Outdated
shutil.which("rocm-smi"),
]
rocm_paths = ["/opt/rocm", os.environ.get("ROCM_PATH", "")]
if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths if p):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of any(os.path.isdir(p) for p in rocm_paths if p) is good, but rocm_paths contains an empty string if ROCM_PATH is not set. While if p handles this, it is cleaner to filter the list first.

rocm_paths = [p for p in ["/opt/rocm", os.environ.get("ROCM_PATH")] if p]
        if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths):

Address review findings from 8 independent reviewers:

- Wrap _ensure_rocm_torch() torch probe in try/except for
  TimeoutExpired and OSError so a hung or broken torch import does not
  crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to
  prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid
  resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of
  default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to
  reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0*
  |rocm7.1* pass through, everything else caps to rocm7.1) so future
  ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with
  proper assertions verifying reinstall proceeds after timeout
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements AMD ROCm support by adding hardware detection logic, updating installation scripts to fetch ROCm-compatible PyTorch and bitsandbytes wheels, and refining asset resolution for prebuilt binaries. It also introduces a comprehensive test suite and updates error messages for AMD users. The review feedback identifies a critical version typo for the bitsandbytes package (0.49.1) that must be corrected to a valid version like 0.45.0 to prevent installation failures. Furthermore, the reviewer recommends replacing silent broad exception handlers in the hardware probing logic with logging to improve maintainability and debuggability.

Comment thread install.sh Outdated
case "$TORCH_INDEX_URL" in
*/rocm*)
substep "installing bitsandbytes for AMD ROCm..."
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The version constraint bitsandbytes>=0.49.1 appears to be a typo. As of current releases, bitsandbytes is in the 0.45.x range. ROCm support was officially added in 0.44.0. Using a non-existent version will cause the installation to fail. Please verify the intended version (likely 0.45.1 or similar).

Suggested change
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.45.0"

Comment thread studio/install_python_stack.py Outdated
pip_install(
"bitsandbytes (AMD)",
"--no-cache-dir",
"bitsandbytes>=0.49.1",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The version constraint bitsandbytes>=0.49.1 appears to be a typo, consistent with the same issue in install.sh. Please verify and correct to a valid released version that supports ROCm (e.g., >=0.45.0).

Suggested change
"bitsandbytes>=0.49.1",
"bitsandbytes>=0.45.0",

Comment on lines +59 to +60
except Exception:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using broad, silent exception handlers like except Exception: pass. While this is a probe, it is better to at least include a comment explaining why the exception is ignored or log it at a debug level to assist in troubleshooting if version detection fails unexpectedly.

References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines 1431 to 1432
except Exception:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Broad exception handler detected. Per general rules, it is recommended to log the exception or add a descriptive comment instead of a silent pass, even for hardware probing logic.

References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2cb0b52be7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread install.sh Outdated
Comment on lines +1014 to +1018
rocm6.*|rocm7.0*|rocm7.1*)
echo "$_base/$_rocm_tag" ;;
*)
# ROCm 7.2+ (including future 10.x+): cap to rocm7.1
echo "$_base/rocm7.1" ;;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict ROCm tag remap to supported versions

This remap logic treats every detected ROCm tag outside rocm6.*, rocm7.0*, and rocm7.1* as rocm7.1, so older stacks like rocm5.x are incorrectly forced onto the 7.1 wheel index; at the same time, rocm7.1* also matches rocm7.10, which emits a likely non-existent .../rocm7.10 index. In both cases the installer can request incompatible/unavailable wheels and fail instead of falling back cleanly. The mapping should only pass through explicitly supported tags and only cap versions >= 7.2 to rocm7.1.

Useful? React with 👍 / 👎.

Filter None from the ROCM_PATH env var lookup at list construction time
instead of relying on the inline `if p` guard in the any() call.
All 8 reviewers across 2 cycles independently flagged that ROCm
detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core)
as a proxy for GPU presence, which would misroute CPU-only or NVIDIA
hosts that happen to have ROCm tools installed.

Now all 3 detection points (install.sh, install_python_stack.py,
install_llama_prebuilt.py) probe for an actual AMD GPU before
entering the ROCm path:

- install.sh: check rocminfo for gfx* GPU names, or amd-smi list
  for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes
  rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi
  list instead of just checking tool existence or directory paths

Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm across the installation and hardware detection pathways. Key changes include updating install.sh to detect AMD GPUs and select appropriate ROCm-enabled PyTorch wheels, adding ROCm detection logic to the Python stack installer, and updating hardware utility modules to distinguish between NVIDIA CUDA and AMD ROCm (HIP). The PR also includes extensive new tests to verify these pathways without requiring physical AMD hardware. Feedback focuses on improving the robustness of version string parsing and subprocess probing to prevent potential runtime errors like IndexError or AttributeError during hardware detection.

Comment thread install.sh
_rocm_tag=""
_rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \
amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The awk script for parsing amd-smi version might fail if the version string contains unexpected characters or if the split results in fewer than two elements. While the validation guard at line 1017 handles malformed tags, ensuring that a[1] and a[2] are populated before printing would be more robust.

Comment thread studio/install_python_stack.py Outdated
Comment on lines +57 to +58
parts = fh.read().strip().split("-")[0].split(".")
return int(parts[0]), int(parts[1])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accessing parts[0] and parts[1] without checking the length of the parts list could raise an IndexError. While this is wrapped in a broad try...except Exception, repository rules advise against silent broad exception handlers. It is better to validate the list length explicitly and ensure any caught exceptions are logged for debugging purposes.

Suggested change
parts = fh.read().strip().split("-")[0].split(".")
return int(parts[0]), int(parts[1])
parts = fh.read().strip().split("-")[0].split(".")
if len(parts) >= 2:
return int(parts[0]), int(parts[1])
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment thread studio/install_python_stack.py Outdated
raw = result.stdout.decode().strip().split("\n")[0]
parts = raw.split(".")
if len(parts) >= 2 and parts[0].isdigit():
return int(parts[0]), int(parts[1])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the version file parsing, parts[1] could raise an IndexError if hipconfig --version returns a string without at least one dot. Adding a length check ensures robustness.

Suggested change
return int(parts[0]), int(parts[1])
if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
return int(parts[0]), int(parts[1])

Comment thread studio/install_python_stack.py Outdated
[
sys.executable,
"-c",
"import torch; print(torch.version.hip or torch.version.cuda or '')",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The probe script import torch; print(torch.version.hip or torch.version.cuda or '') might raise an AttributeError if torch.version does not have a hip or cuda attribute (e.g., in very old or custom builds). Using getattr is safer.

Suggested change
"import torch; print(torch.version.hip or torch.version.cuda or '')",
"import torch; v=torch.version; print(getattr(v, 'hip', None) or getattr(v, 'cuda', None) or '')",

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4286525c53

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread install.sh
Comment on lines +1001 to +1005
_rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \
amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \
{ [ -r /opt/rocm/.info/version ] && \
awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent ROCm version probe from aborting install.sh

Because install.sh runs with set -e, this _rocm_tag=$(...) assignment will terminate the script when every version probe in the || chain fails. That happens on real ROCm hosts when GPU detection succeeds (via rocminfo) but version sources are unavailable/unreadable (for example custom ROCm layouts where /opt/rocm/.info/version, hipconfig, and package metadata are missing). In that case the installer exits before reaching the intended CPU fallback path.

Useful? React with 👍 / 👎.

Comment on lines +102 to +104
if result.returncode == 0 and result.stdout.strip():
if marker is None or marker in result.stdout.lower():
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require real device rows in amd-smi GPU detection

The amd-smi path treats any non-empty amd-smi list output as proof of an AMD GPU because marker is None, so informational/header-only output can be misclassified as a real adapter. When that happens on hosts with ROCm tools but no AMD GPU, _ensure_rocm_torch() can incorrectly force a ROCm torch reinstall and fail dependency setup; this check should validate actual device entries (similar to the shell script’s row filtering) rather than just stdout.strip().

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits March 31, 2026 09:37
- Add parts[1].isdigit() check in hipconfig version parsing to handle
  versions like "6.3-HIP" where the minor component has non-numeric
  suffix (strip "-" prefix before int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or
  custom torch builds that may lack torch.version.hip/cuda attributes
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

- Change amd-smi list detection from any-non-empty-output to requiring
  "gpu" marker in output, matching the shell-side NR>1 check. Prevents
  false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed
  AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and
  install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py
  detect_host() for consistency.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10ec0cdefa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/install_llama_prebuilt.py Outdated
Comment on lines +1438 to +1439
(["rocminfo"], "gfx"),
(["amd-smi", "list"], "gpu"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require concrete AMD device markers in ROCm probe

The new ROCm detector uses ("amd-smi", "list") with marker "gpu" and then accepts any output containing that substring, which can misclassify non-AMD hosts because informational text (for example messages like “No GPUs found” or header-only output) still includes gpu. In environments without usable NVIDIA GPUs, that false positive routes selection toward HIP/ROCm prebuilts instead of CPU, leading to unnecessary validation failures and source-build fallback behavior.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements AMD ROCm support across the installation and hardware detection stack. Key updates include ROCm version detection in the shell installer, the introduction of an IS_ROCM flag in the hardware utility module, and logic in the Python stack installer to handle ROCm-specific PyTorch and bitsandbytes wheels. Additionally, prebuilt binary resolution now supports HIP assets for both Linux and Windows. Feedback was provided regarding the ROCm detection logic in detect_host, which is likely to fail on Windows due to its reliance on Linux-specific CLI tools.

Comment on lines +1435 to +1451
has_rocm = False
if not is_macos:
for _cmd, _marker in (
(["rocminfo"], "gfx"),
(["amd-smi", "list"], "gpu"),
):
_exe = shutil.which(_cmd[0])
if not _exe:
continue
try:
_result = run_capture([_exe, *_cmd[1:]], timeout = 10)
except Exception:
continue
if _result.returncode == 0 and _result.stdout.strip():
if _marker in _result.stdout.lower():
has_rocm = True
break
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ROCm detection logic in detect_host is likely to fail on native Windows systems. It relies on rocminfo (which is Linux-only) and amd-smi (which is often not in the system PATH on Windows). To improve detection on Windows, consider checking for the presence of amdhip64.dll in standard locations (e.g., C:\Windows\System32 or the ROCm installation directory) or using wmic to query the GPU vendor.

The previous detect_host() ROCm check used rocminfo and amd-smi list
which are Linux-only tools. On Windows, has_rocm would always be False,
making the Windows HIP prebuilt path at line 1794 unreachable.

Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH

This allows Windows AMD users to get the HIP prebuilt binary instead
of silently falling through to the CPU prebuilt.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm (HIP) across the installation scripts, backend hardware monitoring, and kernel utilities. Key enhancements include automated AMD GPU and ROCm version detection, the ability to install ROCm-specific PyTorch wheels and bitsandbytes, and hardware monitoring via amd-smi. It also adds support for llama.cpp ROCm/HIP prebuilts on both Linux and Windows and expands RDNA architecture support in the kernel utilities. I have no feedback to provide as there are no review comments.

@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm GPUs across the installation, hardware detection, and runtime components. Key changes include enhanced GPU detection logic in the installation script to differentiate between NVIDIA and AMD ROCm, enabling the selection of appropriate PyTorch index URLs and direct installation of bitsandbytes for ROCm. A new amd.py module was added to mirror NVIDIA GPU monitoring, providing detailed metrics for ROCm devices, and the core hardware.py was updated to dynamically dispatch to the correct monitoring backend and handle ROCm-specific environment variables for GPU visibility. New functions were introduced in install_python_stack.py to detect ROCm versions and ensure the installation of ROCm-compatible PyTorch wheels and bitsandbytes, with specific warnings for Windows users. The install_llama_prebuilt.py now includes has_rocm in host detection and prioritizes ROCm prebuilts for both Linux and Windows, selecting compatible versions. Kernel utilities expanded RDNA architecture detection, and tokenizer utilities updated error messages for AMD ROCm users. Review comments highlighted potential issues in studio/backend/utils/hardware/amd.py regarding truthiness checks for vram_total_mb and power_limit that could lead to None results if values are exactly 0.0, and suggested more robust type checking for amd-smi output to prevent AttributeError when data is not a dictionary.

Comment on lines +174 to +177
round((vram_used_mb / vram_total_mb) * 100, 1)
if vram_used_mb is not None and vram_total_mb and vram_total_mb > 0
else None
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The truthiness check and vram_total_mb will evaluate to False if vram_total_mb is 0.0, which is a valid numeric value. While the subsequent vram_total_mb > 0 check correctly handles the division-by-zero case, the redundant truthiness check might lead to unexpected None results if a GPU reports exactly zero total VRAM (e.g., in some virtualized or error states).

    vram_util = (
        round((vram_used_mb / vram_total_mb) * 100, 1)
        if vram_used_mb is not None and vram_total_mb is not None and vram_total_mb > 0
        else None
    )

Comment on lines +179 to +182
round((power_draw / power_limit) * 100, 1)
if power_draw is not None and power_limit and power_limit > 0
else None
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the VRAM utilization check, using truthiness for power_limit will return None if the power limit is exactly 0.0. It is safer to use an explicit is not None check.

    power_util = (
        round((power_draw / power_limit) * 100, 1)
        if power_draw is not None and power_limit is not None and power_limit > 0
        else None
    )

if isinstance(data, list):
return len(data)
# Some versions return a dict with a "gpu" key
gpus = data.get("gpu", data.get("gpus", []))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If data is not a dictionary (e.g., if amd-smi returns a JSON array or a scalar unexpectedly), calling .get() will raise an AttributeError. Given that _run_amd_smi returns the result of json.loads(), it is safer to verify the type before accessing keys.

Suggested change
gpus = data.get("gpu", data.get("gpus", []))
gpus = data.get("gpu", data.get("gpus", [])) if isinstance(data, dict) else []

Comment thread studio/backend/utils/hardware/amd.py Outdated
}

gpu_list = (
data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data]))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If data is not a dictionary, data.get() will raise an AttributeError. Although _run_amd_smi usually returns a structured object, adding a type check ensures robustness against malformed SMI output.

    gpu_list = (
        data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data]))
        if isinstance(data, dict) else [data]
    )

danielhanchen and others added 4 commits April 9, 2026 00:57
…k, bnb, backend label

Consolidated fix batch from a 20-parallel reviewer.py run on the current
head. Each fix is drawn from a high-consensus finding and addresses a
real bug or feature gap, not a stylistic preference.

1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five
   call sites so this branch no longer regresses main's version floor
   (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would
   silently downgrade the minimum version pin for fresh installs.

2. install.sh: URL-decode Radeon wheel names before extracting the
   torch / torchvision / torchaudio version strings. Real wheel URLs
   from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...")
   so the previous `[+-]` terminator in the sed regex never matched,
   `_torch_ver` stayed empty, `_radeon_versions_match` stayed false,
   and every Radeon consumer install silently fell back to the generic
   ROCm index. Now decode %2B -> + first, then extract, then validate.

3. install.sh: the two AMD bitsandbytes install lines were running
   `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`,
   so upgrades where the venv already has a CPU/CUDA bitsandbytes
   satisfying the constraint would keep the stale non-AMD wheel. Add
   `--force-reinstall --no-cache-dir` to both call sites, matching the
   pattern already used in install_python_stack.py::_ensure_rocm_torch.

4. install_python_stack.py and install_llama_prebuilt.py: add
   `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the
   Python-side ROCm version detectors so they match the chain in
   install.sh::get_torch_index_url. Package-managed ROCm installs
   (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via
   rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig,
   or amd-smi `version` output -- without these fallbacks, `unsloth
   studio update` on such hosts returned None and skipped the ROCm
   torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before
   parsing so epoch-annotated packages parse correctly.

5. hardware.py: add a `_backend_label(device)` helper that returns
   "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and
   use it for every `"backend": ...` emission in JSON responses served
   to the Studio frontend. Internally we still represent ROCm hosts as
   DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API
   surface), but the user-facing API now correctly reports "rocm" on
   AMD boxes instead of labeling them as "cuda".

All 250 simulation scenarios pass (was 233 before this batch: added 17
new regression tests covering the version pin, %2B decoding, bnb
force-reinstall flags, dpkg/rpm fallback presence, and the
_backend_label helper's four-way truth table).
…p to 6.4

Two rounds of fixes in one commit, plus a full URL audit of every PyPI /
download.pytorch.org / repo.radeon.com reference the PR introduces.

amd.py (4 medium gemini findings on commit b3627bc):

1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util
   gate. The follow-up `vram_total_mb > 0` already handles the division
   guard, but the truthiness check was redundant and slightly surprising
   for a 0.0 valid value. Replace with explicit `is not None and > 0`
   for both vram_util and power_util.

2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding
   for non-dict envelopes. A scalar / string JSON response from amd-smi
   would raise AttributeError. Add an isinstance(data, dict) check and
   return None for unexpected shapes.

3. get_visible_gpu_utilization had the same .get() exposure on the outer
   envelope. Rewrite the gpu_list extraction as an explicit
   list/dict/else cascade so a malformed scalar envelope produces
   gpu_list=[data] and continues without raising.

4. The same function's per-entry loop also called gpu_data.get() on
   whatever was inside gpu_list. If a scalar ever leaks into the list
   (directly or via the previous fix's fallback), _extract_gpu_metrics
   would raise on the first .get() inside the helper. Skip non-dict
   entries in the loop before extracting metrics.

install.sh (URL audit finding, previously flagged by 20-reviewer as #13):

5. get_torch_index_url used `rocm6.*` in the rocm tag case statement,
   which matched rocm6.5 and rocm6.6 and emitted
   download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because
   PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the
   supported 6.x minors explicitly and add a rocm6.* fallback branch
   that clips to rocm6.4 (the last supported 6.x wheel set).

URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,
  rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,
  6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for
  6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z
  -> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the
  llama.cpp GitHub releases API endpoint all return 200.

Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but
  vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported
  6.x minors to rocm6.4 instead of producing a 403 index URL
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bae24218b0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread install.sh Outdated
Comment on lines +986 to +987
if command -v rocminfo >/dev/null 2>&1 && \
rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Broaden rocminfo GPU match beyond Name: gfx*

The ROCm probe currently requires rocminfo lines to match Name: ... gfx... with gfx immediately after whitespace, but many valid ROCm outputs report GPU ISA as strings like Name: amdgcn-amd-amdhsa--gfx90a.... In those environments (especially minimal/runtime-only installs where amd-smi is absent), _has_amd_rocm_gpu() returns false and get_torch_index_url() incorrectly falls back to CPU wheels on AMD hosts. Please match gfx[1-9] anywhere in relevant rocminfo device lines rather than only the strict prefix form.

Useful? React with 👍 / 👎.

…n.py backend label

Three high-confidence findings from a second 20-parallel reviewer.py run
on commit 7effb3a. Triaged 15 total findings and applied the three that
were confirmed as real bugs; the rest were either false positives (e.g.
"migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream
via setup.sh regardless), design decisions (e.g. visibility mask env
vars not consulted in installer detection), or edge cases the existing
fallback logic already handles.

1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe
   runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then
   only raises if `torch.cuda.is_available()` is False. On ROCm torch,
   torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.*
   API), so the guard becomes dead code on AMD hosts and multi-GPU AMD
   setups slip through even though unsloth does not support them yet.
   Add a torch.cuda.device_count() > 1 fallback inside the except so
   AMD multi-visible-device setups are flagged consistently with the
   original CUDA memory check.

2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm
   ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when
   SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user
   running `install.sh --no-torch` on an AMD host would still pull in
   bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the
   case block in an outer `[ "$SKIP_TORCH" = false ]` guard.

3. studio/backend/main.py [3/20]: the /api/system endpoint returned
   `"device_backend": get_device().value`, which is "cuda" on ROCm
   hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints
   (hardware.py) already use the _backend_label helper which swaps
   "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same
   helper so the Studio UI reports the backend consistently across all
   endpoints.

4. studio/backend/tests/test_utils.py: update test_backend_matches_device
   to call _backend_label(get_device()) instead of raw get_device().value
   so the test matches the new contract and still passes on CUDA hosts.

Tests: 258 -> 261. New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false
danielhanchen and others added 4 commits April 9, 2026 06:39
During install, bitsandbytes was installed without --no-deps, causing
uv to resolve torch from PyPI (CUDA build) and silently overwrite the
ROCm wheels that were just installed in the previous step.

This happened in three places:
- install.sh: bitsandbytes install in both migrated and fresh paths
- install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch()

Additionally, multiple install steps in install_python_stack.py (extras,
overrides, studio deps) can pull in CUDA torch via transitive
dependencies. A final _ensure_rocm_torch() call at the end of the
install sequence ensures ROCm torch is always in place at runtime.

All changes are gated behind ROCm-specific conditions and do not affect
NVIDIA, CPU-only, macOS, or Windows install paths.

Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms
torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.
…n HIP

On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path:

1. Unsloth's global monkey-patching of transformers model classes
   (LlamaRotaryEmbedding, attention modules) triggers
   _assert_async_cuda_kernel crashes on HIP during generation.
   Training uses different code paths and works fine.

2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion
   failures on MI300X (CDNA3 / gfx942), even without Unsloth patching.

This commit adds a ROCm-specific inference fallback that:
- Skips importing Unsloth at module level (prevents global patching)
- Loads models in 16-bit with plain transformers + PEFT instead
- Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx")
  since pre-quantized HF repos still trigger bnb codepaths
- Guards get_chat_template calls (unavailable without Unsloth import)
- Fixes max_seq_length=0 being passed to from_pretrained (GGUF
  semantics don't apply to transformers path)

The NVIDIA path is completely unchanged -- Unsloth import and
for_inference() optimization remain active. GGUF inference (via
llama-server/HIP) is unaffected since it never imports Python model
classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X)
so 16-bit loading is practical for inference.

Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424):
- Simple generation: PASS
- Compare mode (base vs finetuned): PASS
- GGUF inference + tool calling: PASS (unaffected by this change)
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb7dbd4361

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +19 to +22
if _IS_ROCM_ENV:
FastLanguageModel = None # Loaded on-demand only on NVIDIA
FastVisionModel = None
get_chat_template = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep ROCm non-text loaders callable

Do not null out FastLanguageModel/FastVisionModel at import time for ROCm unless every downstream callsite is guarded. load_model() still uses FastVisionModel in the vision path and FastLanguageModel in the SNAC audio path, so on ROCm these branches now crash with AttributeError before any fallback logic runs, breaking vision/audio model loading on AMD hosts.

Useful? React with 👍 / 👎.

from peft import PeftModel

model = PeftModel.from_pretrained(model, config.path)
tokenizer = AutoTokenizer.from_pretrained(config.path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Load LoRA tokenizer from base model in ROCm fallback

In the ROCm LoRA branch the base weights are loaded from _base, but the tokenizer is loaded from config.path (the adapter path). Adapter-only PEFT checkpoints commonly omit tokenizer files, so this can fail on ROCm even when the base model is valid. Use the base model as tokenizer source (or fallback to it when adapter tokenizer files are absent) to avoid adapter-only load failures.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits April 9, 2026 13:05
- Add clear RuntimeError for audio/vision model inference on ROCm
  (these paths use Unsloth's FastModel/FastVisionModel which would
  crash on HIP; GGUF inference is the supported path on AMD)
- Remove unused `import os as _os` from the ROCm changes
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98f0215116

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +15 to +18
_IS_ROCM_ENV = getattr(__import__("torch").version, "hip", None) is not None

if _IS_ROCM_ENV:
FastLanguageModel = None # Loaded on-demand only on NVIDIA
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate ROCm loader nulling on active ROCm hardware

_IS_ROCM_ENV is derived from torch.version.hip (build-time capability), so HIP-built torch on a non-ROCm runtime (e.g., CPU-only host or broken AMD stack) still sets FastLanguageModel = None at import time. Later, load_model() branches on _hw_module.IS_ROCM (runtime detection) and can take the non-ROCm path, which then calls FastLanguageModel.from_pretrained(...) and crashes with AttributeError. Fresh evidence in this patch is the new split between import-time _IS_ROCM_ENV and runtime _hw_module.IS_ROCM, which can legitimately diverge.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits April 9, 2026 20:41
…sage, temperature)

amd-smi on recent ROCm versions (7.x) wraps metric output in a
{"gpu_data": [...]} envelope instead of returning a raw list. This
caused get_primary_gpu_utilization() and get_visible_gpu_utilization()
to fail silently (returning available=False) because the GPU data
dict was never unwrapped.

Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" /
  "used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous
  dict.get() chain returned the "N/A" string instead of falling
  through to "hotspot". Changed to a loop that checks each key until
  a parseable value is found.

Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x):
- GPU utilization: 0% (idle), up to 100% during training
- Temperature: 40-44C (from hotspot sensor)
- VRAM: 0.28/191.69 GB (idle)
- Power: 158-211W draw
@danielhanchen
Copy link
Copy Markdown
Member Author

AMD ROCm Testing Report -- MI300X VF (ROCm 7.2, HIP 7.1.25424)

Environment

  • GPU: AMD Instinct MI300X VF (192GB VRAM, gfx942 CDNA3)
  • ROCm: 7.2.0, HIP 7.1.25424
  • PyTorch: 2.10.0+rocm7.1
  • Unsloth: 2026.4.4
  • Transformers: 4.57.6

Fixes Added (3 commits)

  1. install.sh + install_python_stack.py (83567f5): Prevent bitsandbytes from overwriting ROCm torch with CUDA wheels (--no-deps + post-install repair).

  2. inference.py (7cbc51d + 98f0215): ROCm inference fallback -- Unsloth's patched kernels and bnb 4-bit crash on HIP during inference. On ROCm, skips Unsloth import (prevents global monkey-patching), loads models in 16-bit with plain transformers+PEFT, resolves pre-quantized model names. NVIDIA path is byte-for-byte unchanged.

  3. amd.py (4a87946): Fix amd-smi parsing for newer output format -- gpu_data envelope unwrapping, mem_usage key for VRAM, temperature fallback from edge (N/A on VF) to hotspot.

Test Results

Studio Launch & Hardware Detection

Check Result
device_backend: "rocm" PASS
GPU name: AMD Instinct MI300X VF PASS
VRAM: 191.69 GB PASS
chat_only: false PASS

GGUF Inference (4 models, all PASS)

Model Variant Quality
Qwen3.5-4B-GGUF Q4_K_XL Correct: math, creative, knowledge
gemma-3-4b-it-GGUF Q4_K_M Correct: math, creative, knowledge
Llama-3.2-3B-Instruct-GGUF Q4_K_M Correct: math, creative, knowledge
Llama-3.2-1B-Instruct-GGUF Q4_K_M Correct: math, creative, knowledge

GGUF Tool Calling (7 tests, all PASS)

Tool Model Result
Python code execution Qwen3.5-4B PASS -- computed 100 prime sum (24133), 2^100, 20!
Bash/terminal Qwen3.5-4B PASS -- listed files correctly
Web search Qwen3.5-4B PASS -- found current president of France
All tools combined Qwen3.5-4B PASS
Non-streaming tool call Qwen3.5-4B PASS
Python execution Llama-3.2-3B PASS -- sqrt(144) = 12
Bash execution Llama-3.2-3B PASS -- date and hostname

Fine-tuning (5 models, all PASS)

Model Loss Grad Norms
Qwen3-0.6B 2.27 -> 1.52 Healthy (1.5-17.6)
Llama-3.2-1B-Instruct 1.85 -> 1.40 Healthy (1.1-4.3)
Mistral-7B-4bit 1.18 -> 0.94 Healthy (1.8-11.6)
Qwen3-4B 1.98 -> 1.17 Healthy (0.6-9.5)
gemma-3-4b-it 2.11 -> 1.19 Mostly healthy (2 NaN/inf -- known Gemma issue)

Export (all PASS)

Format Size Inference Quality
GGUF Q4_K_M 0.75 GB "Tokyo" for Japan, good haiku
16-bit merged 2.30 GB "Jupiter" for largest planet, "50" for 10*5
LoRA adapter adapter_model.safetensors "blue" for sky, "Python" for language

GGUF Memory & Context

Model Context Native Memory
Qwen3.5-4B-GGUF 262,144 262,144 Correct
Llama-3.2-1B-GGUF 131,072 131,072 Correct
gemma-3-4b-it-GGUF 131,072 131,072 Correct

KV Cache Quantization (all PASS)

q8_0, q4_0, f16 all produce correct "Berlin" for Germany capital.

GPU Monitoring (amd-smi)

Metric Status Value
GPU utilization PASS 0-2% (idle)
Temperature PASS 40-44C (hotspot)
VRAM PASS 0.28-3.2GB / 191.69GB
Power draw PASS 158-211W
Backend label PASS "rocm"

Compare Mode (base vs finetuned)

Test Result
Simple generation PASS ("2 + 2 = 4")
Base model (adapter off) PASS (coherent sky description)
Finetuned (adapter on) PASS (coherent, slightly different)

Embedding Finetuning

Qwen3-Embedding-0.6B with all-nli dataset, 10 steps: PASS (loss converged)

Vision

  • GGUF inference (Qwen3.5-4B vision model): PASS
  • Vision finetuning (Qwen2-VL-2B): Needs max_seq_length >= 1024 for image tokens (not AMD-specific)

NVIDIA/Windows/Mac Safety

All changes verified gated behind IS_ROCM, _IS_ROCM_ENV, IS_WINDOWS, IS_MACOS. NVIDIA else branch in inference.py is byte-for-byte identical to original. No unconditional changes affect non-AMD paths.

* Bug fix detecting radeon

* Expanding GPU target for gfx1100*

* Generalize gfx family-prefix filter to cover gfx10/gfx12 as well

rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the
specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures
the bare family prefix from the generic line, and passing that to
-DGPU_TARGETS breaks the HIP build because clang only accepts specific
gfxNNN ids.

The previous filter only special-cased gfx11. Generalize it so any bare
2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a
specific sibling target is present in the same list. No real AMD GPU has
a 2-digit gfx id, so the filter can only ever drop family prefixes and
never a real target.

Covers the existing gfx11 cases unchanged, and extends the same fix to
gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4),
which would otherwise hit the same build failure on newer rocminfo.

---------

Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
@danielhanchen danielhanchen merged commit cad8c6a into main Apr 10, 2026
5 checks passed
@danielhanchen danielhanchen deleted the feature/rocm-support-v2 branch April 10, 2026 08:56
danielhanchen added a commit that referenced this pull request Apr 10, 2026
#4954)

* Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix

bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on
every ROCm target:

  - CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a
    broken blocksize=32/64 warp64 GEMV kernel whose tests were
    explicitly skipped with ROCM_WARP_SIZE_64 guards because the
    code was known broken.
  - RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time
    BNB_WARP_SIZE macro in the host-side dispatch that resolves to
    64 when the multi-arch wheel is compiled with CDNA as the
    primary target, so num_blocks is wrong on RDNA and half the GEMV
    output is never written.

At decode shape (1, 1, hidden) both bugs produce NaN. Training is
unaffected because training shapes are (batch, seq_len > 1, hidden)
and never touch the GEMV path. The crash during autoregressive
inference surfaces as _assert_async_cuda_kernel in torch.multinomial
which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of
a clean Python error.

Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable
blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA",
PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a
runtime hipDeviceGetAttribute query and ships a working CDNA warp64
kernel. That commit has not shipped to PyPI yet, but
continuous-release_main wheels are published on every push to bnb
main via GitHub Releases.

Point the ROCm install path at the continuous-release_main x86_64 and
aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is
unreachable (offline installs, firewalled hosts, or architectures not
covered by the pre-release wheels). Drop the pin once bnb cuts a
0.50+ tag on PyPI.

Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct
bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1
(no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit
sampling generation works end-to-end.

NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is
gated on the ROCm torch index and platform.machine() respectively.

* Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode

The 16-bit fallback in studio/backend/core/inference/inference.py was
added as a workaround for a bug that this PR already fixes at the
install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel
on every ROCm target, which NaNs at decode shape (seq_len=1) and
crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in
0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this
PR) restores correct 4-bit decode on MI300X and verified working
end-to-end with full Unsloth + for_inference + sampling.

Revert the dual code path so ROCm and NVIDIA both go through the
normal FastLanguageModel.from_pretrained + for_inference flow:

  - Remove the conditional `from unsloth import` that skipped the
    import on ROCm. The monkey-patches it was trying to avoid were
    never the cause of the crash; bnb 4-bit GEMV was.
  - Remove the `if _hw_module.IS_ROCM:` branch in load_model that
    loaded with plain transformers + PEFT + bfloat16, and the
    `_resolve_fp16_base` helper it relied on.
  - Remove the `get_chat_template is not None` fallback in
    _load_chat_template_info -- get_chat_template is now always
    imported.
  - Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM
    directly instead of the removed _IS_ROCM_ENV global. Audio and
    vision on ROCm still need separate validation (FastVisionModel
    and the CSM audio codecs were never tested on HIP) so the guard
    stays for now.

Add _bnb_rocm_4bit_ok() as a runtime safety net for users who
install from this PR before the install.sh bnb pin kicks in, or
whose installer fell back to the PyPI pin because the continuous-
release wheel was unreachable. When the installed bnb is < 0.50 on
ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit /
-bnb-4bit suffix from the model path so a pre-quantized repo
resolves to its FP16 sibling instead of pulling bnb back in via
the repo's quantization_config. LoRA adapters whose base is a
pre-quantized repo on old bnb will still fail inside Unsloth's
loader -- the only real fix there is `unsloth studio update`.

Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1):

  - HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized
    repo): loads in 4-bit via the fixed GEMV, generation returns
    "Paris." for greedy and sampling.
  - SAFETY-NET path (simulated old bnb, suffix-stripped to the
    FP16 sibling, load_in_4bit=False): loads in bf16, generation
    returns "Paris." for greedy and sampling.

Net diff is ~45 lines smaller than the pre-revert state because
the entire plain-transformers 16-bit branch is gone.

* Cache _bnb_rocm_4bit_ok() with functools.cache

load_model() can be called many times in a single session but the bnb
version and hardware state cannot change at runtime, so memoise the
check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes`
inside the try block), subsequent calls drop to sub-microsecond dict
lookups. Zero behavioral change.

* Shorten verbose bnb/ROCm comments

Comment-only cleanup across install.sh, studio/install_python_stack.py,
and studio/backend/core/inference/inference.py. No behavioral change.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove _bnb_rocm_4bit_ok safety net from inference.py

Studio's ROCm support is brand new (PR #4720, merged today) and every
fresh install pulls the bnb continuous-release_main wheel via
install.sh / install_python_stack.py in this same PR. There are no
existing ROCm Studio installs carrying bnb < 0.50, so the defensive
version-check fallback is guarding against a scenario that cannot
actually occur. Delete the helper, the functools import, and the
safety-net block -- inference.py now calls FastLanguageModel.from_pretrained
directly with no ROCm branching.

* Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix

Vision inference was blocked by the same bnb 4-bit GEMV bug that affected
text inference (vision models use bnb 4-bit for the LM backbone). With
bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works
end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
loaded in 4-bit via FastVisionModel + for_inference returns a correct
answer to a multimodal prompt.

Audio (CSM) was never actually blocked by HIP — on this hardware CSM
loads and runs its backbone forward pass fine with bnb 0.50, then fails
during generate() with a transformers-level kwarg validation mismatch
in generation_csm.py (`backbone_last_hidden_state` rejected). That's a
pre-existing transformers/CSM integration bug that reproduces identically
on NVIDIA, so the ROCm-gated guard was never actually protecting users
from anything HIP-specific.

Remove the combined audio/vision guard and the now-unused _hw_module
import. Also restore the one-word "Can be" in an inline comment that
drifted during the earlier comment-shortening pass, so the inference.py
delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and
nothing else.

* Shorten max_seq_length=0 guard comment to one line

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
danielhanchen added a commit that referenced this pull request Apr 11, 2026
95 Python tests and 23 shell tests covering ROCm detection,
torch index URL selection, hardware flags, prebuilt asset selection,
and install pathway logic. All tests use mocks -- no AMD hardware required.

Companion to #4720 (AMD ROCm/HIP support).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants