Add AMD ROCm/HIP support across installer and hardware detection by danielhanchen · Pull Request #4720 · unslothai/unsloth

danielhanchen · 2026-03-31T08:38:49Z

Summary

Add AMD ROCm GPU support across the installer, Python dependency stack, llama.cpp prebuilt selection, and hardware detection module. All existing CUDA, CPU, macOS, and Windows pathways are preserved -- AMD ROCm only activates as a last resort when no NVIDIA GPU is found.

Built on top of the original work from @edamamez (#4708) and @GoldenGrapeGentleman (#4428, #4448, #4449, #4450) -- their commits are preserved with attribution. Supersedes those PRs with bug fixes, validation hardening, and a comprehensive test suite.

Test suite (95 Python + 23 shell tests) is in companion PR #4824.

Radeon-specific wheel installation from @iswaryaalex (#4770) is included.

Changes by file

1. `install.sh` -- ROCm detection + Radeon wheel selection

Extend get_torch_index_url() to detect AMD ROCm when nvidia-smi is not found. Detection chain: rocminfo > amd-smi > /opt/rocm version file > hipconfig > dpkg-query > rpm. ROCm 7.2+ capped to rocm7.1 index (torch 2.11.0 exceeds current <2.11.0 bound). ROCm < 6.0 rejected (no PyTorch wheels exist). Includes:

Validation guard rejecting malformed tags (e.g. rocm. from garbled amd-smi)
Debian epoch prefix stripping for dpkg-query
ROCm status messaging and bitsandbytes install for AMD
Radeon GPU auto-detection via rocminfo Marketing Name
Radeon-specific wheel installation from repo.radeon.com with fallback to standard ROCm index
Empty triton arg handling (conditional inclusion to avoid PEP508 error)
Relaxed wheel suffix matching for manylinux_2_28_x86_64 wheels
Hardened amd-smi list probe to require GPU data rows, not just headers
Hardened rocminfo probe to reject gfx000 CPU agent (gfx[1-9] instead of gfx[0-9])

2. `install_python_stack.py` -- ROCm torch reinstall

Add _detect_rocm_version() and _ensure_rocm_torch() to handle the case where unsloth studio update gives a venv CPU-only torch on a ROCm host. Probes the host, checks if torch has the HIP backend, reinstalls from the correct ROCm wheel index if needed. Also adds:

_has_usable_nvidia_gpu() to validate nvidia-smi reports a real GPU before giving NVIDIA precedence
_has_rocm_gpu() with hardened amd-smi and rocminfo probes
amd-smi version as additional ROCm version detection fallback
Windows AMD GPU warning that validates actual GPU presence
Only skips reinstall when HIP is already present (not stale CUDA builds)

3. `install_llama_prebuilt.py` -- ROCm prebuilt selection

Add has_rocm to HostInfo, extend detect_host() to probe for ROCm, and route ROCm hosts to upstream prebuilts:

Host	Behavior
Linux x86_64 ROCm (no NVIDIA)	Try ROCm 7.2 prebuilt, fall back to source build with HIP
Windows x86_64 ROCm (no NVIDIA)	Try HIP prebuilt, fall back to CPU with log message
Any with NVIDIA	Normal CUDA path (ROCm ignored)

Also adds runtime_payload_health_groups for linux-rocm and windows-hip install kinds.

4. Hardware `IS_ROCM` flag + AMD monitoring

detect_hardware() sets IS_ROCM=True when torch.version.hip is present (DeviceType stays CUDA)
get_package_versions() returns both cuda and rocm keys
IS_ROCM exported via __getattr__ for live value access after detection
New amd.py module for AMD GPU monitoring via amd-smi with unit-aware VRAM parsing (GB vs GiB distinction)
HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES support in _get_parent_visible_gpu_spec() and apply_gpu_ids()

5. `is_rdna()` RDNA2/3/3.5/4 expansion

Expand is_rdna() to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201).

6. `worker.py` ROCm Mamba/SSM source builds

Probe torch.version.hip in causal-conv1d environment detection
ROCm source compilation path with hipcc validation and 1800s timeout
Prefer uv pip install when available for faster dependency resolution

7. `tokenizer_utils.py` error message

Replace "We do not support AMD / Intel machines yet" with a helpful message pointing to ROCm install docs.

Test plan

bash -n install.sh -- syntax check passed
python -m py_compile on all modified Python files -- passed
Live B200 test: CUDA path unchanged, IS_ROCM=False on NVIDIA
Backwards compatibility audit: no regressions for NVIDIA/CPU/macOS/Windows users
Shell and Python AMD detection probe tests passed
bash tests/sh/test_get_torch_index_url.sh -- 23/23 (in companion PR Add ROCm test suite (companion to #4720) #4824)
python -m pytest tests/studio/install/test_rocm_support.py -v -- 95/95 (in companion PR Add ROCm test suite (companion to #4720) #4824)

Closes #4708, closes #4428, closes #4448, closes #4449, closes #4450.

Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases. Co-authored-by: Daniel Han <danielhanchen@gmail.com>

Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess. Co-authored-by: Daniel Han <danielhanchen@gmail.com>

Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice(). Co-authored-by: Daniel Han <danielhanchen@gmail.com>

Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs. Co-authored-by: Daniel Han <danielhanchen@gmail.com>

Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware. Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1).

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7290199c26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T08:43:07Z

+            amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
+                'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \


Continue ROCm fallback when amd-smi version is malformed

The amd-smi probe currently returns success as soon as it sees a ROCm version: field, even when the extracted version is empty/non-numeric (for example N/A), so the || fallback chain never reaches /opt/rocm/.info/version, hipconfig, or package metadata. In that case _rocm_tag is later cleared by validation and the installer incorrectly falls back to CPU on ROCm hosts that have other valid version sources.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-31T08:43:07Z

+    probe = subprocess.run(
+        [
+            sys.executable,
+            "-c",
+            "import torch; print(torch.version.hip or torch.version.cuda or '')",
+        ],
+        stdout = subprocess.PIPE,
+        stderr = subprocess.DEVNULL,
+        timeout = 30,


Catch torch probe timeout before deciding reinstall

The ROCm torch probe uses subprocess.run(..., timeout=30) without exception handling, so a slow or hung import torch raises TimeoutExpired and aborts install_python_stack() instead of continuing install/update. This turns a defensive probe into a hard failure on degraded environments; the timeout path should be handled explicitly and treated as a recoverable probe miss.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request introduces AMD ROCm support across the installation scripts and backend hardware utilities, including new tests and regression checks. Key changes involve adding ROCm version detection, configuring appropriate PyTorch index URLs, and ensuring bitsandbytes is installed with AMD support. Review feedback focused on improving resource management by using context managers for file operations and refining path filtering logic for better code clarity.

gemini-code-assist · 2026-03-31T08:45:53Z

+        os.path.join(rocm_root, "lib", "rocm_version"),
+    ):
+        try:
+            parts = open(path).read().strip().split("-")[0].split(".")


The file reading operation is not wrapped in a context manager. Using with open(path) as f: is safer and ensures the file descriptor is closed properly.

with open(path) as f: parts = f.read().strip().split("-")[0].split(".")

gemini-code-assist · 2026-03-31T08:45:53Z

+            shutil.which("rocm-smi"),
+        ]
+        rocm_paths = ["/opt/rocm", os.environ.get("ROCM_PATH", "")]
+        if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths if p):


The use of any(os.path.isdir(p) for p in rocm_paths if p) is good, but rocm_paths contains an empty string if ROCM_PATH is not set. While if p handles this, it is cleaner to filter the list first.

rocm_paths = [p for p in ["/opt/rocm", os.environ.get("ROCM_PATH")] if p] if any(rocm_hints) or any(os.path.isdir(p) for p in rocm_paths):

Address review findings from 8 independent reviewers: - Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this) - Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index - Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks - Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution) - Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs - Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0* |rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index - Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering - Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout

danielhanchen · 2026-03-31T08:58:13Z

/gemini review

gemini-code-assist

Code Review

This pull request implements AMD ROCm support by adding hardware detection logic, updating installation scripts to fetch ROCm-compatible PyTorch and bitsandbytes wheels, and refining asset resolution for prebuilt binaries. It also introduces a comprehensive test suite and updates error messages for AMD users. The review feedback identifies a critical version typo for the bitsandbytes package (0.49.1) that must be corrected to a valid version like 0.45.0 to prevent installation failures. Furthermore, the reviewer recommends replacing silent broad exception handlers in the hardware probing logic with logging to improve maintainability and debuggability.

gemini-code-assist · 2026-03-31T09:04:00Z

+        case "$TORCH_INDEX_URL" in
+            */rocm*)
+                substep "installing bitsandbytes for AMD ROCm..."
+                run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"


The version constraint bitsandbytes>=0.49.1 appears to be a typo. As of current releases, bitsandbytes is in the 0.45.x range. ROCm support was officially added in 0.44.0. Using a non-existent version will cause the installation to fail. Please verify the intended version (likely 0.45.1 or similar).

Suggested change

run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"

run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.45.0"

gemini-code-assist · 2026-03-31T09:04:00Z

+    pip_install(
+        "bitsandbytes (AMD)",
+        "--no-cache-dir",
+        "bitsandbytes>=0.49.1",


The version constraint bitsandbytes>=0.49.1 appears to be a typo, consistent with the same issue in install.sh. Please verify and correct to a valid released version that supports ROCm (e.g., >=0.45.0).

Suggested change

"bitsandbytes>=0.49.1",

"bitsandbytes>=0.45.0",

gemini-code-assist · 2026-03-31T09:04:01Z

+        except Exception:
+            pass


Avoid using broad, silent exception handlers like except Exception: pass. While this is a probe, it is better to at least include a comment explaining why the exception is ignored or log it at a debug level to assist in troubleshooting if version detection fails unexpectedly.

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

gemini-code-assist · 2026-03-31T09:04:01Z

        except Exception:
            pass


Broad exception handler detected. Per general rules, it is recommended to log the exception or add a descriptive comment instead of a silent pass, even for hardware probing logic.

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2cb0b52be7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T09:05:15Z

+                rocm6.*|rocm7.0*|rocm7.1*)
+                    echo "$_base/$_rocm_tag" ;;
+                *)
+                    # ROCm 7.2+ (including future 10.x+): cap to rocm7.1
+                    echo "$_base/rocm7.1" ;;


Restrict ROCm tag remap to supported versions

This remap logic treats every detected ROCm tag outside rocm6.*, rocm7.0*, and rocm7.1* as rocm7.1, so older stacks like rocm5.x are incorrectly forced onto the 7.1 wheel index; at the same time, rocm7.1* also matches rocm7.10, which emits a likely non-existent .../rocm7.10 index. In both cases the installer can request incompatible/unavailable wheels and fail instead of falling back cleanly. The mapping should only pass through explicitly supported tags and only cap versions >= 7.2 to rocm7.1.

Useful? React with 👍 / 👎.

Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if p` guard in the any() call.

All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path: - install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection - install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds - install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths Also: - Shell test mock amd-smi now handles "list" subcommand - Python tests updated to mock _has_rocm_gpu where needed - Added test_no_gpu_with_rocm_tools_skips to verify the new guard - Test index lookups now use sorted() to match production code

danielhanchen · 2026-03-31T09:26:02Z

/gemini review

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T09:26:20Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

gemini-code-assist

Code Review

This pull request introduces comprehensive support for AMD ROCm across the installation and hardware detection pathways. Key changes include updating install.sh to detect AMD GPUs and select appropriate ROCm-enabled PyTorch wheels, adding ROCm detection logic to the Python stack installer, and updating hardware utility modules to distinguish between NVIDIA CUDA and AMD ROCm (HIP). The PR also includes extensive new tests to verify these pathways without requiring physical AMD hardware. Feedback focuses on improving the robustness of version string parsing and subprocess probing to prevent potential runtime errors like IndexError or AttributeError during hardware detection.

gemini-code-assist · 2026-03-31T09:31:50Z

+        _rocm_tag=""
+        _rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \
+            amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
+                'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \


The awk script for parsing amd-smi version might fail if the version string contains unexpected characters or if the split results in fewer than two elements. While the validation guard at line 1017 handles malformed tags, ensuring that a[1] and a[2] are populated before printing would be more robust.

gemini-code-assist · 2026-03-31T09:31:50Z

+                parts = fh.read().strip().split("-")[0].split(".")
+            return int(parts[0]), int(parts[1])


Accessing parts[0] and parts[1] without checking the length of the parts list could raise an IndexError. While this is wrapped in a broad try...except Exception, repository rules advise against silent broad exception handlers. It is better to validate the list length explicitly and ensure any caught exceptions are logged for debugging purposes.

Suggested change

parts = fh.read().strip().split("-")[0].split(".")

return int(parts[0]), int(parts[1])

parts = fh.read().strip().split("-")[0].split(".")

if len(parts) >= 2:

return int(parts[0]), int(parts[1])

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

gemini-code-assist · 2026-03-31T09:31:50Z

+                raw = result.stdout.decode().strip().split("\n")[0]
+                parts = raw.split(".")
+                if len(parts) >= 2 and parts[0].isdigit():
+                    return int(parts[0]), int(parts[1])


Similar to the version file parsing, parts[1] could raise an IndexError if hipconfig --version returns a string without at least one dot. Adding a length check ensures robustness.

Suggested change

return int(parts[0]), int(parts[1])

if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():

return int(parts[0]), int(parts[1])

gemini-code-assist · 2026-03-31T09:31:50Z

+            [
+                sys.executable,
+                "-c",
+                "import torch; print(torch.version.hip or torch.version.cuda or '')",


The probe script import torch; print(torch.version.hip or torch.version.cuda or '') might raise an AttributeError if torch.version does not have a hip or cuda attribute (e.g., in very old or custom builds). Using getattr is safer.

Suggested change

"import torch; print(torch.version.hip or torch.version.cuda or '')",

"import torch; v=torch.version; print(getattr(v, 'hip', None) or getattr(v, 'cuda', None) or '')",

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4286525c53

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T09:35:20Z

+        _rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \
+            amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
+                'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \
+            { [ -r /opt/rocm/.info/version ] && \
+                awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \


Prevent ROCm version probe from aborting install.sh

Because install.sh runs with set -e, this _rocm_tag=$(...) assignment will terminate the script when every version probe in the || chain fails. That happens on real ROCm hosts when GPU detection succeeds (via rocminfo) but version sources are unavailable/unreadable (for example custom ROCm layouts where /opt/rocm/.info/version, hipconfig, and package metadata are missing). In that case the installer exits before reaching the intended CPU fallback path.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-31T09:35:20Z

+        if result.returncode == 0 and result.stdout.strip():
+            if marker is None or marker in result.stdout.lower():
+                return True


Require real device rows in amd-smi GPU detection

The amd-smi path treats any non-empty amd-smi list output as proof of an AMD GPU because marker is None, so informational/header-only output can be misclassified as a real adapter. When that happens on hosts with ROCm tools but no AMD GPU, _ensure_rocm_torch() can incorrectly force a ROCm torch reinstall and fail dependency setup; this check should validate actual device entries (similar to the shell script’s row filtering) rather than just stdout.strip().

Useful? React with 👍 / 👎.

- Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion) - Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T09:37:52Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

- Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output. - Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior). - Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency.

danielhanchen · 2026-03-31T09:52:09Z

/gemini review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10ec0cdefa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T09:58:31Z

+            (["rocminfo"], "gfx"),
+            (["amd-smi", "list"], "gpu"),


Require concrete AMD device markers in ROCm probe

The new ROCm detector uses ("amd-smi", "list") with marker "gpu" and then accepts any output containing that substring, which can misclassify non-AMD hosts because informational text (for example messages like “No GPUs found” or header-only output) still includes gpu. In environments without usable NVIDIA GPUs, that false positive routes selection toward HIP/ROCm prebuilts instead of CPU, leading to unnecessary validation failures and source-build fallback behavior.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request implements AMD ROCm support across the installation and hardware detection stack. Key updates include ROCm version detection in the shell installer, the introduction of an IS_ROCM flag in the hardware utility module, and logic in the Python stack installer to handle ROCm-specific PyTorch and bitsandbytes wheels. Additionally, prebuilt binary resolution now supports HIP assets for both Linux and Windows. Feedback was provided regarding the ROCm detection logic in detect_host, which is likely to fail on Windows due to its reliance on Linux-specific CLI tools.

gemini-code-assist · 2026-03-31T09:58:56Z

+    has_rocm = False
+    if not is_macos:
+        for _cmd, _marker in (
+            (["rocminfo"], "gfx"),
+            (["amd-smi", "list"], "gpu"),
+        ):
+            _exe = shutil.which(_cmd[0])
+            if not _exe:
+                continue
+            try:
+                _result = run_capture([_exe, *_cmd[1:]], timeout = 10)
+            except Exception:
+                continue
+            if _result.returncode == 0 and _result.stdout.strip():
+                if _marker in _result.stdout.lower():
+                    has_rocm = True
+                    break


The ROCm detection logic in detect_host is likely to fail on native Windows systems. It relies on rocminfo (which is Linux-only) and amd-smi (which is often not in the system PATH on Windows). To improve detection on Windows, consider checking for the presence of amdhip64.dll in standard locations (e.g., C:\Windows\System32 or the ROCm installation directory) or using wmic to query the GPU vendor.

The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable. Now detect_host() uses platform-specific detection: - Linux: rocminfo (check for gfx GPU names) or amd-smi list - Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt.

danielhanchen · 2026-04-08T12:17:39Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces comprehensive support for AMD ROCm (HIP) across the installation scripts, backend hardware monitoring, and kernel utilities. Key enhancements include automated AMD GPU and ROCm version detection, the ability to install ROCm-specific PyTorch wheels and bitsandbytes, and hardware monitoring via amd-smi. It also adds support for llama.cpp ROCm/HIP prebuilts on both Linux and Windows and expands RDNA architecture support in the kernel utilities. I have no feedback to provide as there are no review comments.

danielhanchen · 2026-04-09T00:18:41Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces comprehensive support for AMD ROCm GPUs across the installation, hardware detection, and runtime components. Key changes include enhanced GPU detection logic in the installation script to differentiate between NVIDIA and AMD ROCm, enabling the selection of appropriate PyTorch index URLs and direct installation of bitsandbytes for ROCm. A new amd.py module was added to mirror NVIDIA GPU monitoring, providing detailed metrics for ROCm devices, and the core hardware.py was updated to dynamically dispatch to the correct monitoring backend and handle ROCm-specific environment variables for GPU visibility. New functions were introduced in install_python_stack.py to detect ROCm versions and ensure the installation of ROCm-compatible PyTorch wheels and bitsandbytes, with specific warnings for Windows users. The install_llama_prebuilt.py now includes has_rocm in host detection and prioritizes ROCm prebuilts for both Linux and Windows, selecting compatible versions. Kernel utilities expanded RDNA architecture detection, and tokenizer utilities updated error messages for AMD ROCm users. Review comments highlighted potential issues in studio/backend/utils/hardware/amd.py regarding truthiness checks for vram_total_mb and power_limit that could lead to None results if values are exactly 0.0, and suggested more robust type checking for amd-smi output to prevent AttributeError when data is not a dictionary.

gemini-code-assist · 2026-04-09T00:34:17Z

+        round((vram_used_mb / vram_total_mb) * 100, 1)
+        if vram_used_mb is not None and vram_total_mb and vram_total_mb > 0
+        else None
+    )


The truthiness check and vram_total_mb will evaluate to False if vram_total_mb is 0.0, which is a valid numeric value. While the subsequent vram_total_mb > 0 check correctly handles the division-by-zero case, the redundant truthiness check might lead to unexpected None results if a GPU reports exactly zero total VRAM (e.g., in some virtualized or error states).

vram_util = ( round((vram_used_mb / vram_total_mb) * 100, 1) if vram_used_mb is not None and vram_total_mb is not None and vram_total_mb > 0 else None )

gemini-code-assist · 2026-04-09T00:34:17Z

+        round((power_draw / power_limit) * 100, 1)
+        if power_draw is not None and power_limit and power_limit > 0
+        else None
+    )


Similar to the VRAM utilization check, using truthiness for power_limit will return None if the power limit is exactly 0.0. It is safer to use an explicit is not None check.

power_util = ( round((power_draw / power_limit) * 100, 1) if power_draw is not None and power_limit is not None and power_limit > 0 else None )

gemini-code-assist · 2026-04-09T00:34:17Z

+    if isinstance(data, list):
+        return len(data)
+    # Some versions return a dict with a "gpu" key
+    gpus = data.get("gpu", data.get("gpus", []))


If data is not a dictionary (e.g., if amd-smi returns a JSON array or a scalar unexpectedly), calling .get() will raise an AttributeError. Given that _run_amd_smi returns the result of json.loads(), it is safer to verify the type before accessing keys.

Suggested change

gpus = data.get("gpu", data.get("gpus", []))

gpus = data.get("gpu", data.get("gpus", [])) if isinstance(data, dict) else []

gemini-code-assist · 2026-04-09T00:34:17Z

+        }
+
+    gpu_list = (
+        data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data]))


If data is not a dictionary, data.get() will raise an AttributeError. Although _run_amd_smi usually returns a structured object, adding a type check ensures robustness against malformed SMI output.

gpu_list = ( data if isinstance(data, list) else data.get("gpus", data.get("gpu", [data])) if isinstance(data, dict) else [data] )

…k, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table).

for more information, see https://pre-commit.ci

…p to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bae24218b0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T01:59:14Z

+    if command -v rocminfo >/dev/null 2>&1 && \
+       rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then


Broaden rocminfo GPU match beyond Name: gfx*

The ROCm probe currently requires rocminfo lines to match Name: ... gfx... with gfx immediately after whitespace, but many valid ROCm outputs report GPU ISA as strings like Name: amdgcn-amd-amdhsa--gfx90a.... In those environments (especially minimal/runtime-only installs where amd-smi is absent), _has_amd_rocm_gpu() returns false and get_torch_index_url() incorrectly falls back to CPU wheels on AMD hosts. Please match gfx[1-9] anywhere in relevant rocminfo device lines rather than only the strict prefix form.

Useful? React with 👍 / 👎.

…n.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3a. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false

During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.

…n HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths - Guards get_chat_template calls (unavailable without Unsloth import) - Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path) The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X) so 16-bit loading is practical for inference. Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424): - Simple generation: PASS - Compare mode (base vs finetuned): PASS - GGUF inference + tool calling: PASS (unaffected by this change)

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb7dbd4361

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T16:31:36Z

+if _IS_ROCM_ENV:
+    FastLanguageModel = None  # Loaded on-demand only on NVIDIA
+    FastVisionModel = None
+    get_chat_template = None


Keep ROCm non-text loaders callable

Do not null out FastLanguageModel/FastVisionModel at import time for ROCm unless every downstream callsite is guarded. load_model() still uses FastVisionModel in the vision path and FastLanguageModel in the SNAC audio path, so on ROCm these branches now crash with AttributeError before any fallback logic runs, breaking vision/audio model loading on AMD hosts.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-09T16:31:36Z

+                        from peft import PeftModel
+
+                        model = PeftModel.from_pretrained(model, config.path)
+                        tokenizer = AutoTokenizer.from_pretrained(config.path)


Load LoRA tokenizer from base model in ROCm fallback

In the ROCm LoRA branch the base weights are loaded from _base, but the tokenizer is loaded from config.path (the adapter path). Adapter-only PEFT checkpoints commonly omit tokenizer files, so this can fail on ROCm even when the base model is valid. Use the base model as tokenizer source (or fallback to it when adapter tokenizer files are absent) to avoid adapter-only load failures.

Useful? React with 👍 / 👎.

- Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD) - Remove unused `import os as _os` from the ROCm changes

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98f0215116

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T20:17:45Z

+_IS_ROCM_ENV = getattr(__import__("torch").version, "hip", None) is not None
+
+if _IS_ROCM_ENV:
+    FastLanguageModel = None  # Loaded on-demand only on NVIDIA


Gate ROCm loader nulling on active ROCm hardware

_IS_ROCM_ENV is derived from torch.version.hip (build-time capability), so HIP-built torch on a non-ROCm runtime (e.g., CPU-only host or broken AMD stack) still sets FastLanguageModel = None at import time. Later, load_model() branches on _hw_module.IS_ROCM (runtime detection) and can take the non-ROCm path, which then calls FastLanguageModel.from_pretrained(...) and crashes with AttributeError. Fresh evidence in this patch is the new split between import-time _IS_ROCM_ENV and runtime _hw_module.IS_ROCM, which can legitimately diverge.

Useful? React with 👍 / 👎.

…sage, temperature) amd-smi on recent ROCm versions (7.x) wraps metric output in a {"gpu_data": [...]} envelope instead of returning a raw list. This caused get_primary_gpu_utilization() and get_visible_gpu_utilization() to fail silently (returning available=False) because the GPU data dict was never unwrapped. Additionally: - VRAM data moved from "vram" to "mem_usage" with "total_vram" / "used_vram" keys. Added fallback key lookup. - Temperature "edge" sensor returns "N/A" on MI300X VF; the previous dict.get() chain returned the "N/A" string instead of falling through to "hotspot". Changed to a loop that checks each key until a parseable value is found. Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x): - GPU utilization: 0% (idle), up to 100% during training - Temperature: 40-44C (from hotspot sensor) - VRAM: 0.28/191.69 GB (idle) - Power: 158-211W draw

for more information, see https://pre-commit.ci

danielhanchen · 2026-04-09T20:42:24Z

AMD ROCm Testing Report -- MI300X VF (ROCm 7.2, HIP 7.1.25424)

Environment

GPU: AMD Instinct MI300X VF (192GB VRAM, gfx942 CDNA3)
ROCm: 7.2.0, HIP 7.1.25424
PyTorch: 2.10.0+rocm7.1
Unsloth: 2026.4.4
Transformers: 4.57.6

Fixes Added (3 commits)

install.sh + install_python_stack.py (83567f5): Prevent bitsandbytes from overwriting ROCm torch with CUDA wheels (--no-deps + post-install repair).
inference.py (7cbc51d + 98f0215): ROCm inference fallback -- Unsloth's patched kernels and bnb 4-bit crash on HIP during inference. On ROCm, skips Unsloth import (prevents global monkey-patching), loads models in 16-bit with plain transformers+PEFT, resolves pre-quantized model names. NVIDIA path is byte-for-byte unchanged.
amd.py (4a87946): Fix amd-smi parsing for newer output format -- gpu_data envelope unwrapping, mem_usage key for VRAM, temperature fallback from edge (N/A on VF) to hotspot.

Test Results

Studio Launch & Hardware Detection

Check	Result
`device_backend: "rocm"`	PASS
GPU name: AMD Instinct MI300X VF	PASS
VRAM: 191.69 GB	PASS
`chat_only: false`	PASS

GGUF Inference (4 models, all PASS)

Model	Variant	Quality
Qwen3.5-4B-GGUF	Q4_K_XL	Correct: math, creative, knowledge
gemma-3-4b-it-GGUF	Q4_K_M	Correct: math, creative, knowledge
Llama-3.2-3B-Instruct-GGUF	Q4_K_M	Correct: math, creative, knowledge
Llama-3.2-1B-Instruct-GGUF	Q4_K_M	Correct: math, creative, knowledge

GGUF Tool Calling (7 tests, all PASS)

Tool	Model	Result
Python code execution	Qwen3.5-4B	PASS -- computed 100 prime sum (24133), 2^100, 20!
Bash/terminal	Qwen3.5-4B	PASS -- listed files correctly
Web search	Qwen3.5-4B	PASS -- found current president of France
All tools combined	Qwen3.5-4B	PASS
Non-streaming tool call	Qwen3.5-4B	PASS
Python execution	Llama-3.2-3B	PASS -- sqrt(144) = 12
Bash execution	Llama-3.2-3B	PASS -- date and hostname

Fine-tuning (5 models, all PASS)

Model	Loss	Grad Norms
Qwen3-0.6B	2.27 -> 1.52	Healthy (1.5-17.6)
Llama-3.2-1B-Instruct	1.85 -> 1.40	Healthy (1.1-4.3)
Mistral-7B-4bit	1.18 -> 0.94	Healthy (1.8-11.6)
Qwen3-4B	1.98 -> 1.17	Healthy (0.6-9.5)
gemma-3-4b-it	2.11 -> 1.19	Mostly healthy (2 NaN/inf -- known Gemma issue)

Export (all PASS)

Format	Size	Inference Quality
GGUF Q4_K_M	0.75 GB	"Tokyo" for Japan, good haiku
16-bit merged	2.30 GB	"Jupiter" for largest planet, "50" for 10*5
LoRA adapter	adapter_model.safetensors	"blue" for sky, "Python" for language

GGUF Memory & Context

Model	Context	Native	Memory
Qwen3.5-4B-GGUF	262,144	262,144	Correct
Llama-3.2-1B-GGUF	131,072	131,072	Correct
gemma-3-4b-it-GGUF	131,072	131,072	Correct

KV Cache Quantization (all PASS)

q8_0, q4_0, f16 all produce correct "Berlin" for Germany capital.

GPU Monitoring (amd-smi)

Metric	Status	Value
GPU utilization	PASS	0-2% (idle)
Temperature	PASS	40-44C (hotspot)
VRAM	PASS	0.28-3.2GB / 191.69GB
Power draw	PASS	158-211W
Backend label	PASS	"rocm"

Compare Mode (base vs finetuned)

Test	Result
Simple generation	PASS ("2 + 2 = 4")
Base model (adapter off)	PASS (coherent sky description)
Finetuned (adapter on)	PASS (coherent, slightly different)

Embedding Finetuning

Qwen3-Embedding-0.6B with all-nli dataset, 10 steps: PASS (loss converged)

Vision

GGUF inference (Qwen3.5-4B vision model): PASS
Vision finetuning (Qwen2-VL-2B): Needs max_seq_length >= 1024 for image tokens (not AMD-specific)

NVIDIA/Windows/Mac Safety

All changes verified gated behind IS_ROCM, _IS_ROCM_ENV, IS_WINDOWS, IS_MACOS. NVIDIA else branch in inference.py is byte-for-byte identical to original. No unconditional changes affect non-AMD paths.

* Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>

#4954) * Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on every ROCm target: - CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a broken blocksize=32/64 warp64 GEMV kernel whose tests were explicitly skipped with ROCM_WARP_SIZE_64 guards because the code was known broken. - RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time BNB_WARP_SIZE macro in the host-side dispatch that resolves to 64 when the multi-arch wheel is compiled with CDNA as the primary target, so num_blocks is wrong on RDNA and half the GEMV output is never written. At decode shape (1, 1, hidden) both bugs produce NaN. Training is unaffected because training shapes are (batch, seq_len > 1, hidden) and never touch the GEMV path. The crash during autoregressive inference surfaces as _assert_async_cuda_kernel in torch.multinomial which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of a clean Python error. Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA", PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a runtime hipDeviceGetAttribute query and ships a working CDNA warp64 kernel. That commit has not shipped to PyPI yet, but continuous-release_main wheels are published on every push to bnb main via GitHub Releases. Point the ROCm install path at the continuous-release_main x86_64 and aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is unreachable (offline installs, firewalled hosts, or architectures not covered by the pre-release wheels). Drop the pin once bnb cuts a 0.50+ tag on PyPI. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1 (no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit sampling generation works end-to-end. NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is gated on the ROCm torch index and platform.machine() respectively. * Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode The 16-bit fallback in studio/backend/core/inference/inference.py was added as a workaround for a bug that this PR already fixes at the install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel on every ROCm target, which NaNs at decode shape (seq_len=1) and crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in 0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this PR) restores correct 4-bit decode on MI300X and verified working end-to-end with full Unsloth + for_inference + sampling. Revert the dual code path so ROCm and NVIDIA both go through the normal FastLanguageModel.from_pretrained + for_inference flow: - Remove the conditional `from unsloth import` that skipped the import on ROCm. The monkey-patches it was trying to avoid were never the cause of the crash; bnb 4-bit GEMV was. - Remove the `if _hw_module.IS_ROCM:` branch in load_model that loaded with plain transformers + PEFT + bfloat16, and the `_resolve_fp16_base` helper it relied on. - Remove the `get_chat_template is not None` fallback in _load_chat_template_info -- get_chat_template is now always imported. - Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM directly instead of the removed _IS_ROCM_ENV global. Audio and vision on ROCm still need separate validation (FastVisionModel and the CSM audio codecs were never tested on HIP) so the guard stays for now. Add _bnb_rocm_4bit_ok() as a runtime safety net for users who install from this PR before the install.sh bnb pin kicks in, or whose installer fell back to the PyPI pin because the continuous- release wheel was unreachable. When the installed bnb is < 0.50 on ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit / -bnb-4bit suffix from the model path so a pre-quantized repo resolves to its FP16 sibling instead of pulling bnb back in via the repo's quantization_config. LoRA adapters whose base is a pre-quantized repo on old bnb will still fail inside Unsloth's loader -- the only real fix there is `unsloth studio update`. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): - HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized repo): loads in 4-bit via the fixed GEMV, generation returns "Paris." for greedy and sampling. - SAFETY-NET path (simulated old bnb, suffix-stripped to the FP16 sibling, load_in_4bit=False): loads in bf16, generation returns "Paris." for greedy and sampling. Net diff is ~45 lines smaller than the pre-revert state because the entire plain-transformers 16-bit branch is gone. * Cache _bnb_rocm_4bit_ok() with functools.cache load_model() can be called many times in a single session but the bnb version and hardware state cannot change at runtime, so memoise the check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes` inside the try block), subsequent calls drop to sub-microsecond dict lookups. Zero behavioral change. * Shorten verbose bnb/ROCm comments Comment-only cleanup across install.sh, studio/install_python_stack.py, and studio/backend/core/inference/inference.py. No behavioral change. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove _bnb_rocm_4bit_ok safety net from inference.py Studio's ROCm support is brand new (PR #4720, merged today) and every fresh install pulls the bnb continuous-release_main wheel via install.sh / install_python_stack.py in this same PR. There are no existing ROCm Studio installs carrying bnb < 0.50, so the defensive version-check fallback is guarding against a scenario that cannot actually occur. Delete the helper, the functools import, and the safety-net block -- inference.py now calls FastLanguageModel.from_pretrained directly with no ROCm branching. * Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix Vision inference was blocked by the same bnb 4-bit GEMV bug that affected text inference (vision models use bnb 4-bit for the LM backbone). With bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit loaded in 4-bit via FastVisionModel + for_inference returns a correct answer to a multimodal prompt. Audio (CSM) was never actually blocked by HIP — on this hardware CSM loads and runs its backbone forward pass fine with bnb 0.50, then fails during generate() with a transformers-level kwarg validation mismatch in generation_csm.py (`backbone_last_hidden_state` rejected). That's a pre-existing transformers/CSM integration bug that reproduces identically on NVIDIA, so the ROCm-gated guard was never actually protecting users from anything HIP-specific. Remove the combined audio/vision guard and the now-unused _hw_module import. Also restore the one-word "Can be" in an inline comment that drifted during the earlier comment-shortening pass, so the inference.py delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and nothing else. * Shorten max_seq_length=0 guard comment to one line --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

95 Python tests and 23 shell tests covering ROCm detection, torch index URL selection, hardware flags, prebuilt asset selection, and install pathway logic. All tests use mocks -- no AMD hardware required. Companion to #4720 (AMD ROCm/HIP support).

edamamez and others added 5 commits March 31, 2026 08:37

danielhanchen requested review from mmathew23 and rolandtannous as code owners March 31, 2026 08:38

[pre-commit.ci] auto fixes from pre-commit.com hooks

7290199

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 31, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

danielhanchen added 2 commits March 31, 2026 09:09

Clean up rocm_paths list construction in detect_host()

56098e5

Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if p` guard in the any() call.

[pre-commit.ci] auto fixes from pre-commit.com hooks

4286525

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed Mar 31, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

danielhanchen and others added 2 commits March 31, 2026 09:37

[pre-commit.ci] auto fixes from pre-commit.com hooks

fd43235

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 31, 2026

View reviewed changes

danielhanchen added 2 commits March 31, 2026 03:25

Merge branch 'main' into feature/rocm-support-v2

c22312b

gemini-code-assist Bot reviewed Apr 8, 2026

View reviewed changes

gemini-code-assist Bot reviewed Apr 9, 2026

View reviewed changes

danielhanchen and others added 4 commits April 9, 2026 00:57

[pre-commit.ci] auto fixes from pre-commit.com hooks

1d387d6

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

bae2421

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Apr 9, 2026

View reviewed changes

danielhanchen and others added 4 commits April 9, 2026 06:39

Merge branch 'main' into feature/rocm-support-v2

659c19c

[pre-commit.ci] auto fixes from pre-commit.com hooks

cb7dbd4

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Apr 9, 2026

View reviewed changes

danielhanchen and others added 2 commits April 9, 2026 13:05

Merge branch 'main' into feature/rocm-support-v2

e54ed86

chatgpt-codex-connector Bot reviewed Apr 9, 2026

View reviewed changes

danielhanchen and others added 2 commits April 9, 2026 20:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

1c58bb9

for more information, see https://pre-commit.ci

danielhanchen merged commit cad8c6a into main Apr 10, 2026
5 checks passed

danielhanchen deleted the feature/rocm-support-v2 branch April 10, 2026 08:56

		amd-smi version 2>/dev/null \| awk -F'ROCm version: ' \
		'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } \|\| \

	run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.49.1"
	run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" "bitsandbytes>=0.45.0"

		parts = fh.read().strip().split("-")[0].split(".")
		return int(parts[0]), int(parts[1])

	return int(parts[0]), int(parts[1])
	if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
	return int(parts[0]), int(parts[1])

	"import torch; print(torch.version.hip or torch.version.cuda or '')",
	"import torch; v=torch.version; print(getattr(v, 'hip', None) or getattr(v, 'cuda', None) or '')",

	gpus = data.get("gpu", data.get("gpus", []))
	gpus = data.get("gpu", data.get("gpus", [])) if isinstance(data, dict) else []

		if command -v rocminfo >/dev/null 2>&1 && \
		rocminfo 2>/dev/null \| awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then

Uh oh!

Conversation

danielhanchen commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes by file

1. install.sh -- ROCm detection + Radeon wheel selection

2. install_python_stack.py -- ROCm torch reinstall

3. install_llama_prebuilt.py -- ROCm prebuilt selection

4. Hardware IS_ROCM flag + AMD monitoring

5. is_rdna() RDNA2/3/3.5/4 expansion

6. worker.py ROCm Mamba/SSM source builds

7. tokenizer_utils.py error message

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Mar 31, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Mar 31, 2026

Uh oh!

chatgpt-codex-connector Bot commented Mar 31, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Mar 31, 2026 •

edited

Loading

1. `install.sh` -- ROCm detection + Radeon wheel selection

2. `install_python_stack.py` -- ROCm torch reinstall

3. `install_llama_prebuilt.py` -- ROCm prebuilt selection

4. Hardware `IS_ROCM` flag + AMD monitoring

5. `is_rdna()` RDNA2/3/3.5/4 expansion

6. `worker.py` ROCm Mamba/SSM source builds

7. `tokenizer_utils.py` error message