fix/strix halo and windows AMD ROCm support #5301
…ng workers Training workers are spawned via multiprocessing spawn before detect_hardware() runs, so IS_ROCM is still False. If the user never set HIP_VISIBLE_DEVICES in their shell, _inherits_rocm_visibility is also False, leaving the worker with only CUDA_VISIBLE_DEVICES set. On ROCm hosts the HIP runtime honors HIP_VISIBLE_DEVICES over CUDA_VISIBLE_DEVICES, so the worker saw the full device list and torch raised "no usable HIP accelerator" on some setups. Fall back to probing torch.version.hip (a build-time attribute, safe to read before GPU init) to detect ROCm when neither IS_ROCM nor inherited env vars are available. Mirrors the existing fix in llama_cpp.py for llama-server subprocess GPU pinning. Fixes unslothai#5180
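A minimal sketch of the fallback order this commit describes, assuming a hypothetical helper named `pin_worker_gpus`; the real `apply_gpu_ids` may be structured differently. The `hip_version` argument stands in for `getattr(torch.version, "hip", None)` so the sketch runs without torch installed.

```python
def looks_like_rocm(is_rocm, env, hip_version):
    """Return True when the worker should be treated as a ROCm process.

    is_rocm     -- flag normally set by detect_hardware() (may still be False)
    env         -- mapping of environment variables
    hip_version -- stand-in for getattr(torch.version, "hip", None)
    """
    if is_rocm:
        return True
    # Visibility variables inherited from the user's shell.
    if "HIP_VISIBLE_DEVICES" in env or "ROCR_VISIBLE_DEVICES" in env:
        return True
    # Build-time attribute: readable before any GPU initialisation.
    return hip_version is not None


def pin_worker_gpus(gpu_ids, env, is_rocm=False, hip_version=None):
    """Hypothetical helper: write the visibility variables for a worker."""
    spec = ",".join(str(i) for i in gpu_ids)
    env["CUDA_VISIBLE_DEVICES"] = spec
    if looks_like_rocm(is_rocm, env, hip_version):
        # Without this, the HIP runtime would still see every device.
        env["HIP_VISIBLE_DEVICES"] = spec
    return env
```

In a real worker the caller would pass `os.environ` and the value probed from torch; plain dictionaries keep the example testable.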
Replace loose OR chain with exact string matches, split into three focused tests, and add a guard check for the try/except wrapper.
for more information, see https://pre-commit.ci
…lback amd-smi on iGPUs with shared/unified memory (e.g. Radeon 8060S on Strix Halo) reports only the dedicated VRAM slice (~512 MB) in its metric output, so get_visible_gpu_utilization() was returning usable_gb ≈ 0.35 GB instead of the full GTT pool (~128 GB). torch.cuda.mem_get_info() already surfaces the correct unified-pool size. Add _reconcile_rocm_unified_memory(): after amd-smi returns a valid result on a ROCm device, cross-check each device's vram_total_gb against torch.cuda.mem_get_info(). When torch reports a larger total, replace the amd-smi VRAM fields in-place. No-op for discrete AMD GPUs where the two sources agree. Fixes: "Falling back to all visible GPUs -- model may not fit" on AMD iGPU machines even when 100+ GB of unified memory is available.
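The per-device correction can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function name `correct_unified_memory` and the `vram_used_gb` key are hypothetical, while `vram_total_gb` and the `(free, total)` byte pair returned by `torch.cuda.mem_get_info()` come from the description above.

```python
GIB = 1024 ** 3


def correct_unified_memory(metrics, torch_free_bytes, torch_total_bytes):
    """Overwrite amd-smi VRAM fields in place when torch reports a larger pool.

    metrics is one device's dict as parsed from amd-smi; the two byte
    counts are what torch.cuda.mem_get_info() returns for that device.
    Returns True when a replacement happened.
    """
    torch_total_gb = torch_total_bytes / GIB
    smi_total_gb = metrics.get("vram_total_gb") or 0.0
    if torch_total_gb <= smi_total_gb:
        # Discrete GPU: the two sources agree, so leave amd-smi's values.
        return False
    metrics["vram_total_gb"] = round(torch_total_gb, 2)
    # "vram_used_gb" is an assumed field name for this sketch.
    metrics["vram_used_gb"] = round((torch_total_bytes - torch_free_bytes) / GIB, 2)
    return True
```

For a Strix Halo style device, `correct_unified_memory({"vram_total_gb": 0.5}, free, total)` would replace the 0.5 GB slice with the unified-pool size, while a 16 GB discrete card is left untouched.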
Code Review
This pull request introduces a _reconcile_rocm_unified_memory function to correctly report VRAM on AMD iGPUs by reconciling amd-smi data with torch memory information, and integrates it into get_visible_gpu_utilization. Feedback suggests applying this same reconciliation logic to get_gpu_utilization to ensure consistent reporting across the module.
The visible-GPU path was already corrected for AMD iGPUs with unified memory
(Strix Halo / Radeon 8060S), but get_gpu_utilization was still returning the
raw 512 MB amd-smi VRAM slice. Studio's /api/train/hardware endpoint and the
live GPU monitor read from this primary path, so users continued seeing the
wrong total even after auto_select_gpu_ids picked the right device.
Refactor to share the per-device correction:
* _apply_unified_memory_correction(metrics, torch_info) -- the actual
replacement logic, in-place on a single metrics dict.
* _reconcile_rocm_unified_memory(...) -- multi-device,
iterates utilization["devices"] (visible-GPU path).
* _reconcile_primary_rocm_unified_memory(...) -- single flat
metrics dict (primary-GPU path), uses parent_visible_spec to pick the
primary index, falls back to ordinal 0 when no visibility env is set.
get_gpu_utilization now calls the primary reconciler under IS_ROCM, so both
endpoints surface the real unified-memory pool on iGPUs while leaving
discrete AMD GPUs untouched (torch_total <= smi_total -> no replace).
Two small follow-ups to the apply_gpu_ids ROCm fallback:
1. Match detect_hardware()'s 'getattr(torch.version, "hip", None) is not None' form so the entire codebase has one canonical 'this torch was built with HIP' check. On every shipping torch wheel, hip is either None or a non-empty version string, so the new form agrees with the old bool() form on every real install.
2. Log the probe failure at debug level instead of swallowing it silently. The broad 'except Exception' is intentional (we never want apply_gpu_ids to crash a worker over a probe), but the silent pass made it impossible to tell whether the fallback was firing or being skipped.
…ec before IS_ROCM is set When a user has HIP_VISIBLE_DEVICES set in their shell (e.g. "1" to select GPU 1) but detect_hardware() has not yet run in the Studio parent process, IS_ROCM is still False. _get_parent_visible_gpu_spec() was gated on IS_ROCM so it fell through to CUDA_VISIBLE_DEVICES (unset), saw all physical GPUs, and auto-selected index 0. apply_gpu_ids then overwrote HIP_VISIBLE_DEVICES with "0", making the intended GPU invisible to ROCm torch in the worker, which triggered the "no usable HIP accelerator" error (issue unslothai#5180). Apply the same _inherits_rocm_visibility pattern already used in apply_gpu_ids: check for HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES in the environment regardless of IS_ROCM so the correct GPU index is preserved.
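A small sketch of the lookup order described above, assuming a hypothetical function `parent_visible_gpu_spec` that takes the environment as a plain mapping. The relative priority of HIP_VISIBLE_DEVICES over ROCR_VISIBLE_DEVICES is an assumption of this sketch.

```python
ROCM_VISIBILITY_VARS = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES")


def parent_visible_gpu_spec(env):
    """Return the device spec the parent process inherited, or None.

    ROCm visibility variables are consulted regardless of whether
    hardware detection has already flagged the host as ROCm.
    """
    for name in ROCM_VISIBILITY_VARS:
        value = env.get(name)
        if value is not None:
            return value
    # No ROCm variable inherited: fall back to the CUDA variable.
    return env.get("CUDA_VISIBLE_DEVICES")
```

With `HIP_VISIBLE_DEVICES=1` in the shell, the function returns `"1"` even before detection has run, so GPU 1 is not replaced by index 0.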
…ry' into fix/rocm-strix-halo-unified-memory
…tered setups
The previous rocminfo awk pattern could miss discrete GPUs on machines where HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES is used to mask an integrated GPU: the env vars filter rocminfo output but may not propagate into the install script subprocess, causing detection to fail entirely.
Two changes:
- Tighten the rocminfo pattern from /gfx[0-9]/ && !/gfx000/ to /gfx[1-9][0-9]/, which is simpler and still excludes the CPU agent (gfx000) without a second, negated pattern.
- Add a sysfs KFD topology fallback that reads /sys/class/kfd/kfd/topology/nodes/*/gpu_id, a kernel-level view unaffected by HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES.
Fixes detection failure reported in Discord by Chains (gfx1201 + iGPU machine where env var exclusion of the iGPU caused rocminfo to return no usable device).
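To illustrate the pattern change, the two awk tests can be mirrored with Python's `re` module. The sample lines below are made up for the sketch and are not captured rocminfo output.

```python
import re

NEW_PATTERN = re.compile(r"gfx[1-9][0-9]")


def old_match(line):
    """Equivalent of the awk test /gfx[0-9]/ && !/gfx000/."""
    return bool(re.search(r"gfx[0-9]", line)) and not re.search(r"gfx000", line)


def new_match(line):
    """Equivalent of the awk test /gfx[1-9][0-9]/."""
    return bool(NEW_PATTERN.search(line))


# Illustrative agent-name lines (assumed shape).
SAMPLE = ["  Name: gfx000", "  Name: gfx1201", "  Name: gfx90a", "  Name: AMD Ryzen"]
```

Both forms reject the gfx000 CPU agent and accept real GPU arch names; the new one does so with a single expression.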
The fallback added by this PR reads /sys/class/kfd/kfd/topology/nodes/*/gpu_id files but matches the literal token 'gpu_id' against their content. Those files contain only a single decimal value (e.g. '0' for CPU agents, '50432' for GPU agents), so the regex never matches and 'found' stays 0, making the fallback a no-op on every host. The properties file in the same directory contains key/value lines like 'gpu_id 50432' which is what the existing awk pattern expects. Reproduced with a synthetic sysfs layout: against gpu_id files awk exits 1; against properties files awk exits 0 when any node reports gpu_id > 0.
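The difference between the two sysfs files can be reproduced with a synthetic layout like the one mentioned above. This sketch assumes hypothetical helper names (`kfd_gpu_ids`, `make_synthetic_topology`); only the `gpu_id <value>` key/value shape of the properties file is taken from the description.

```python
import tempfile
from pathlib import Path


def kfd_gpu_ids(topology_root):
    """Collect non-zero gpu_id values from nodes/*/properties files."""
    ids = []
    for props in sorted(Path(topology_root).glob("nodes/*/properties")):
        for line in props.read_text().splitlines():
            fields = line.split()
            # Only "gpu_id <decimal>" lines matter; CPU agents report 0.
            if len(fields) == 2 and fields[0] == "gpu_id" and fields[1].isdigit():
                if int(fields[1]) > 0:
                    ids.append(int(fields[1]))
    return ids


def make_synthetic_topology(root):
    """Create one CPU node (gpu_id 0) and one GPU node (gpu_id 50432)."""
    for node, gpu_id in (("0", 0), ("1", 50432)):
        node_dir = Path(root) / "nodes" / node
        node_dir.mkdir(parents=True)
        (node_dir / "properties").write_text(f"cpu_cores_count 0\ngpu_id {gpu_id}\n")
        # The bare gpu_id file holds only the number, with no key to match.
        (node_dir / "gpu_id").write_text(f"{gpu_id}\n")
```

Matching the literal token `gpu_id` against the bare `gpu_id` files can never succeed because they contain only a decimal value, which is why the properties file is the one to parse.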
…setup.sh
setup.ps1 only checked nvidia-smi and fell straight to "gpu: none" on AMD machines. setup.sh already probed rocminfo/amd-smi/hipconfig/hipinfo. Add three-tier detection mirroring install_llama_prebuilt.py's detect_host():
1. hipinfo: gcnArchName in output confirms a real HIP GPU (not just SDK)
2. amd-smi list: "GPU: <digit>" data rows as fallback
3. WMI Win32_VideoController: last resort -- detects AMD GPU even without HIP SDK, then guides user to install it rather than silently going CPU
Also corrects the "none" message to mention AMD ROCm alongside NVIDIA so users with AMD hardware understand the requirement.
Fixes: rohit-style install where Strix Halo (Radeon 8060S) showed "gpu: none" even with the HIP SDK present.
…h setup.ps1
install.ps1 had the same nvidia-smi-only GPU detection as setup.ps1 before the setup.ps1 fix. Applies the same three-tier AMD detection:
1. hipinfo: gcnArchName confirms real HIP GPU
2. amd-smi list: GPU data rows as fallback
3. WMI Win32_VideoController: detects AMD GPU without HIP SDK and guides user to install it
Fixes: install.ps1 showing "gpu: none" while setup.ps1 correctly showed "AMD GPU detected" on the same machine (reported by rohit, RX 7600 XT).
install_python_stack.py:
- Add _ROCM_WINDOWS_WHEEL_BASE and _ROCM_WINDOWS_RELEASES constants pointing to AMD repo.radeon.com (ROCm 7.2 -> torch 2.9.1+rocm7.2.1)
- Extend _ensure_rocm_torch() with a Windows branch: detects ROCm via _has_rocm_gpu() / _detect_rocm_version(), requires Python 3.12 (cp312 is the only ABI AMD publishes for Windows), installs the direct wheel URL from repo.radeon.com
install.ps1:
- Capture ROCmVersion during AMD detection via hipconfig --version / amd-smi version (needed for wheel URL selection)
- After Get-TorchIndexUrl, add an AMD wheel override block: when HasROCm and Python 3.12 detected, set ROCmTorchWheelUrl to AMD wheel URL
- Expand torch install branch to handle ROCmTorchWheelUrl with uv pip install --force-reinstall --no-cache-dir
AMD publishes matching torchvision-0.24.1+rocm7.2.1 and torchaudio-2.9.1+rocm7.2.1 cp312 wheels at the same repo.radeon.com release folder. Install all three in both install.ps1 and install_python_stack.py Windows ROCm path.
AMD uses a different version string for 7.1.1 wheels: 2.9.0+rocmsdk20251116 (date-tagged) instead of +rocm7.1.1. Adds the 7.1.1 release folder to both install.ps1 and install_python_stack.py so users with ROCm 7.1 get ROCm torch instead of falling back to CPU.
The AMD Windows torch wheels declare rocm[libraries]==<ver> as a hard dependency. Without installing rocm_sdk_core and rocm_sdk_libraries_custom from the same AMD release folder, uv cannot resolve the dependency and fails with 'No solution found'. Include all 5 wheels in one install call.
@array splatting inside a scriptblock only works when the native command is prefixed with '&'. Invoke-InstallCommand uses '& $Command' to run the block, so @ROCmAllWheelUrls was not being expanded. Extract to scalar variables $rw0-$rw4 which are captured correctly by the closure.
uv's resolver looks up rocm[libraries]==0.1.dev0 on PyPI during dependency resolution before downloading any wheels, and fails because the package doesn't exist on PyPI. --no-deps skips resolution entirely and installs all 5 AMD wheels directly. The GPU runtime dependency is satisfied by the HIP SDK, not a Python package.
…Windows setup.ps1 was always setting CuTag='cpu' for non-NVIDIA hosts and installing cpu-only PyTorch, overwriting the ROCm torch installed by install.ps1. Adds the same AMD wheel selection logic (ROCm version detection, Python 3.12 check, 5-wheel install with --no-deps) to setup.ps1's torch install block. install_python_stack.py: remove IS_WINDOWS guard from _ensure_rocm_torch() call site so the Windows path in _ensure_rocm_torch() is reachable during 'unsloth studio update' as well.
… fix progress counter
- Gate the 'must be installed manually' warning on torch.version.hip being empty so it doesn't fire when our ROCm torch install succeeded
- Update _TOTAL counter to include the 3 ROCm steps on Windows now that _ensure_rocm_torch() is called there (fixes 10/9 display)
…unter
- Add 'rocm' step after 'cuda' in setup.ps1 showing ROCm version or HIP SDK missing
- Move ROCm version detection up to GPU detection block so it's available early
- Suppress 'must be installed manually' warning when torch.version.hip is set
- Fix _TOTAL counter to include ROCm steps on Windows (fixes 10/9 display)
… is unset
AMD's repo.radeon.com wheels (e.g. 2.9.0+rocmsdk20251116) do not set torch.version.hip, leaving it None. All three probes that relied solely on torch.version.hip now also check for 'rocm' in torch.__version__.lower():
- hardware.py detect_hardware(): IS_ROCM was never set, causing the studio to report 'Hardware detected: CPU' even after AMD wheels were installed and HIP DLLs were on PATH.
- install_python_stack.py _ensure_rocm_torch(): skip-if-already-installed probe would always reinstall on subsequent runs.
- install_python_stack.py Windows AMD warning: suppression check always failed, so the 'must be installed manually' note kept appearing after a successful AMD wheel install.
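The combined probe can be written as a pure function. This is a sketch with a hypothetical name (`is_rocm_torch_build`); a real caller would pass `torch.__version__` and `getattr(torch.version, "hip", None)`.

```python
def is_rocm_torch_build(version, hip):
    """Return True when a torch build looks like a ROCm build.

    version -- the torch.__version__ string
    hip     -- torch.version.hip, which may be None on AMD's Windows wheels
    """
    if hip is not None:
        return True
    # Date-tagged wheels such as 2.9.0+rocmsdk20251116 still carry "rocm".
    return "rocm" in version.lower()
```

The version-string check only matters when `hip` is None; wheels that do set `torch.version.hip` are recognised by the first branch.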
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
When amd-smi/nvidia-smi is unavailable on Windows, query dedicated GPU VRAM via Windows Performance Counters (same source as Task Manager). This gives system-wide cross-process usage, fixing the near-zero reading caused by torch.cuda.mem_get_info only seeing the Studio server process. Linux fallback path unchanged (mem_get_info is system-wide on ROCm).
This function is specific to AMD ROCm: amd-smi is absent on Windows when only the HIP SDK is installed. It is scoped to IS_ROCM so the NVIDIA Windows path is untouched (nvidia-smi handles that case).
Linux: read /sys/class/drm/card*/device/mem_info_vram_used|total for system-wide GPU memory across all processes. No tools required, always present on Linux AMD systems.
Windows: Windows Performance Counter API (already added).
Both paths are gated on IS_ROCM and only fire when amd-smi is absent. torch mem_get_info remains as last resort (process-local).
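A rough sketch of the Linux sysfs read, assuming a hypothetical function `read_drm_vram` and that both counter files hold a byte count as a decimal string. The `drm_root` parameter exists only so the sketch can be pointed at a synthetic directory.

```python
import tempfile
from pathlib import Path

GIB = 1024 ** 3


def read_drm_vram(drm_root="/sys/class/drm"):
    """Return [(card_name, used_gb, total_gb)] for cards exposing VRAM counters."""
    results = []
    for dev in sorted(Path(drm_root).glob("card*/device")):
        used_file = dev / "mem_info_vram_used"
        total_file = dev / "mem_info_vram_total"
        if not (used_file.is_file() and total_file.is_file()):
            continue  # not an AMD card, or a connector entry
        try:
            used = int(used_file.read_text().strip())
            total = int(total_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable counter: skip rather than fail
        results.append((dev.parent.name, used / GIB, total / GIB))
    return results
```

An empty list signals that the caller should move on to the next fallback, such as the process-local torch query.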
…nto staging/rocm-super-merge
…s and Linux fallback paths
- Windows: GPU utilization via \GPU Engine(*engtype_3D*)\Utilization Percentage perf counter
- Windows: temperature and power via ADL (atiadlxx.dll, ships with Adrenalin)
- Linux: GPU utilization via DRM sysfs gpu_busy_percent
- Linux: temperature via hwmon temp1_input (millidegrees C)
- Linux: power via hwmon power1_average / power1_input (microwatts)
All paths are no-op fallbacks (None) when the source is unavailable. Mirrors what nvidia-smi provides on the CUDA path.
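The Linux unit conversions can be sketched as below. The function name `read_amdgpu_sensors`, the returned key names, and the assumption that the hwmon directory sits under the card's device directory are all illustrative; the file names and units come from the list above.

```python
import tempfile
from pathlib import Path


def _read_int(path):
    """Read an integer from a sysfs-style file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_amdgpu_sensors(device_dir):
    """Return utilization (%), temperature (C) and power (W); None when missing."""
    device_dir = Path(device_dir)
    util = _read_int(device_dir / "gpu_busy_percent")
    temp_c = power_w = None
    for hwmon in sorted(device_dir.glob("hwmon/hwmon*")):
        milli_c = _read_int(hwmon / "temp1_input")
        if milli_c is not None:
            temp_c = milli_c / 1000.0  # millidegrees C -> degrees C
        micro_w = _read_int(hwmon / "power1_average")
        if micro_w is None:
            micro_w = _read_int(hwmon / "power1_input")
        if micro_w is not None:
            power_w = micro_w / 1_000_000.0  # microwatts -> watts
    return {"utilization_pct": util, "temperature_c": temp_c, "power_w": power_w}
```

Every field degrades to None independently, matching the no-op fallback behaviour the commit describes.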
* fix: guard torch.distributed attrs missing on ROCm Windows
On Windows, the ROCm build of PyTorch ships without the distributed C extension (torch._C._distributed_c10d), so torch.distributed loads as a partial stub with several attributes missing entirely, including is_initialized, is_torchelastic_launched, and get_rank. unsloth_zoo/utils.py grabbed all three at module import time with bare attribute access, causing an AttributeError the moment the module was imported -- even in code paths like GGUF export that never use distributed features at all.
Fix: replace the three bare grabs with getattr(..., lambda: False/0) fallbacks so importing unsloth_zoo never crashes on platforms where torch.distributed is unavailable or stubbed.
Root cause tracked in ROCm/TheRock#3284 (libuv / torch.distributed missing on Windows ROCm builds). Companion fix for the export subprocess in unsloth/unsloth#5301.
Ref: ROCm/TheRock#3284
Ref: unslothai/unsloth#5301
* fix: stub torchao on Windows ROCm to unblock LoRA training
On Windows, the ROCm PyTorch build ships without the full torch.distributed C-extension stack (torch._C._distributed_c10d, DeviceMesh, ProcessGroup, etc.). torchao imports the entire distributed chain at module level, so `import torchao` crashes even in code paths that never touch distributed features (plain LoRA / GRPO training).
Fix: if torchao can't be imported, install a sys.meta_path hook (_ROCmTorchaoFinder) that intercepts every "torchao" and "torchao.*" import and returns a self-contained stub:
  * Sub-module imports (torchao.dtypes, torchao.quantization …) get proper ModuleType stubs registered in sys.modules so the import machinery is satisfied.
  * Direct attribute access on a stub (e.g. AffineQuantizedTensor) returns a sentinel class created via _ROCmSentinelMeta, which is a real Python type. This makes isinstance(weight, AffineQuantizedTensor) return False (as expected -- no weight is ever an instance of the sentinel) instead of raising TypeError.
  * Sentinel classes are chainable via metaclass __getattr__, so patterns like torchao.quantization.Float8WeightOnlyConfig resolve cleanly.
The hook is only installed when `import torchao` actually fails, so Linux / CUDA environments are completely unaffected. Companion to the torch.distributed attribute-guard in unsloth_zoo/utils.py (commit 115d849) and the torchao export-subprocess stub in unsloth/unsloth#5301.
Ref: ROCm/TheRock#3284
* fix: accept *args/**kwargs in torch.distributed fallback lambdas
The three no-op lambdas used as getattr fallbacks when torch.distributed attrs are missing on Windows ROCm only accepted zero arguments. Any caller passing a group argument (e.g. get_rank(group=...)) would hit a TypeError. Use lambda *args, **kwargs instead, and switch to the already-imported dist alias for consistency. Suggested by Gemini code review on PR #703.
* Optimize torch.distributed attr binding for PR #703
Bind dist.is_initialized/is_torchelastic_launched/get_rank directly in a try/except instead of per-name getattr, so the common path allocates no fallback lambdas; only the ROCm Windows stub hits AttributeError.
---------
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
Follow-up cleanups to the merged AMD ROCm support PR unslothai#5301:
1. De-duplicate the torchao Windows-ROCm import stub into a single shared module (studio/backend/core/_torchao_stub.py); both workers call one install_torchao_windows_rocm_stub() entrypoint.
2. Align the gfx name/arch comment columns in setup.sh and setup.ps1.
3. Isolate the float16 dtype fallback to AMD without native bf16; NVIDIA keeps dtype=None so unsloth's own bf16/fp16/FORCE_FLOAT32 detection is honored.
4. Hoist unconditional stdlib imports (gc, glob, re, subprocess, copy, types, sys, importlib.metadata) from function bodies to module top across the PR unslothai#5301-touched files; heavy/optional/relative imports stay lazy.
5. bitsandbytes Windows-ROCm install now uses plain pip (force_pip=True) instead of UV_SKIP_WHEEL_FILENAME_CHECK, per the AMD hackathon docs.
Also adds scripts/verify_import_hoist.py (a scope-aware LEGB AST resolver that catches dangling-alias and rename-clash bugs in import-hoist refactors) and wires it into the Lint CI source-lint job as a self-test plus a pull_request compare gate.

Summary
This PR started as a fix for AMD iGPU unified memory reporting on Strix Halo and grew into full AMD ROCm support for Windows plus targeted Linux fixes. 27 files changed, ~3,700 lines added.
1. Strix Halo / AMD iGPU: unified memory detection (hardware.py, amd.py)
Problem: amd-smi on iGPUs with unified/shared memory (e.g. Radeon 8060S / gfx1151 on Strix Halo) reports only the dedicated VRAM slice (~512 MB - 1 GB) rather than the full unified pool (~128 GB). get_visible_gpu_utilization() was returning usable_gb ~= 0.35 GB, producing a spurious "Falling back to all visible GPUs -- model may not fit" warning even with 100+ GB of unified memory available.
Fix: Added
_reconcile_rocm_unified_memory() and _reconcile_primary_rocm_unified_memory() (sharing a common _apply_unified_memory_correction() helper): after amd-smi returns a result on a ROCm device, cross-check each device's vram_total_gb against torch.cuda.mem_get_info(). When torch reports a larger total, replace the amd-smi VRAM fields in-place. No-op for discrete AMD GPUs where the two sources agree. Applied in both get_gpu_utilization and get_visible_gpu_utilization.
2. Windows: AMD GPU detection (install.ps1, setup.ps1)
Both scripts now detect AMD GPUs on Windows with the same reliability as the Linux installer.
- hipinfo (PATH) -> hipinfo (HIP_PATH\bin / ROCM_PATH\bin) -> amd-smi list -> amd-smi static --asic -> WMI marketing-name fallback. The AMD HIP SDK sets HIP_PATH / ROCM_PATH but does not always add the bin dir to PATH, so the env-var fallback handles this silently.
- "AMD GPU detected -- not ROCm-accessible (HIP 7.1.xxx)" with driver-issue explanation
- "AMD GPU detected -- HIP SDK not found" with install link
- hipconfig --version string shown as substeps under the gpu step on successful detection.
- gcnArchName tokens are lowercased and feature suffixes stripped (e.g. gfx1151:xnack- -> gfx1151) before wheel-index lookup.
- [regex]::Matches() is used instead of -match to capture all GPU entries on multi-GPU hosts, and results are wrapped with @() to prevent PowerShell unwrapping a single-item result to a scalar string (which was previously causing the arch token to be indexed by character, returning "g" instead of "gfx1200").
- HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES respected when selecting which GPU arch to install for on multi-GPU hosts.
3. Windows: AMD ROCm PyTorch wheel installation (install.ps1, install_python_stack.py)
- https://repo.amd.com/rocm/whl/{arch_family}/. Supported families: gfx1200/gfx1201 (RDNA 4), gfx1100/gfx1101/gfx1102/gfx1103 (RDNA 3), gfx1150/gfx1151 (Strix). UNSLOTH_ROCM_WINDOWS_MIRROR overrides the base URL for air-gapped installs.
- torch, torchvision, torchaudio, rocm_sdk_core, and rocm_sdk_libraries_gfx* installed together; wheel version cross-checking ensures the three torch packages are a matched set.
- …torchao or unsloth) overwrites the ROCm torch with a CPU wheel and reinstalls from the AMD index.
- HIP_PATH fallback in Python stack: _detect_windows_gfx_arch() probes HIP_PATH\bin / ROCM_PATH\bin when hipinfo is not on PATH, then falls back to amd-smi as a third tier, so _ensure_rocm_torch() does not silently skip ROCm wheels on runtime-only AMD installs.
- rocm7.2 index enabled: Previously commented out of _ROCM_TORCH_INDEX pending a torch upper bound bump. Now enabled with a matching _ROCM_TORCH_PKG_SPECS entry (torch>=2.11.0,<2.12.0).
4. Windows: ROCm runtime patches (worker.py, main.py, hardware.py)
Several issues in AMD's current Windows wheel require workarounds at startup. All patches are scoped to Windows ROCm only (gated on active HIP runtime via
torch.version.hip or 'rocm' in torch.__version__) and self-retire when AMD ships fixes.
torch._C._distributed_c10d missing
torch._C is a compiled C extension on Windows ROCm. Python cannot do submodule imports from it, so torch._C._distributed_c10d does not exist, causing import failures when torch.distributed is loaded. A populated sys.modules stub is injected before import torch.distributed, exposing FakeProcessGroup, ProcessGroup, Work, Store, and all other symbols. The if name not in sys.modules guard means this self-retires the moment AMD fixes their build (tracked: ROCm/TheRock#3284).
torchao stubs
torchao imports torch.distributed._functional_collectives at module level, which crashes on Windows ROCm via the missing C backend. torchao is stubbed with a custom _StubTypeMeta metaclass so isinstance(x, StubClass) returns False (not TypeError) since peft's LoRA torchao path calls isinstance() on the imported types. A _StubSubpackageFinder meta path finder auto-stubs any depth of torchao.xxx.yyy imports. Scoped to Windows ROCm only so real torchao is unaffected on Windows NVIDIA.
BNB_ROCM_VERSION in server process
The server process (
main.py) and training worker are separate processes, so env vars set in the worker are not visible to the server. main.py now scans libbitsandbytes_rocm*.dll at startup to detect the installed DLL suffix and sets BNB_ROCM_VERSION before any bitsandbytes import, eliminating the "Configured ROCm binary not found" error. DLL directory handles are retained at module scope in _ROCM_DLL_HANDLES in both main.py and worker.py so they are not garbage-collected. Falls back to "72" when no DLL is found.
_grouped_mm null kernel (gfx1200, ROCm <= 7.12)
On Windows ROCm with HIP < 7.13,
torch._grouped_mm dispatches to a null HIP kernel on gfx1200, causing a 0xC0000005 access violation (crash) during training. A Python fallback implementation is patched in via torch.library.Library on the CUDA dispatch key when HIP < 7.13. The fallback handles 2-D (mm), 3-D batched (bmm/matmul), and grouped-with-offsets inputs. Bypassed entirely on HIP >= 7.13 where AMD fixed the kernel. The torch.library.Library registration is kept at module scope in _WINDOWS_ROCM_GROUPED_MM_LIB to prevent garbage collection mid-run.
amd-smi circuit breaker
amd-smi on some Windows installs returns non-zero or opens console popups. GPU polling disables itself after 3 consecutive failures (_AMD_SMI_FAILURE_LIMIT = 3) and all amd-smi subprocess calls suppress console windows via windows_hidden_subprocess_kwargs(). The default timeout is also raised from 5s to 30s on Windows (10s on Linux) to accommodate the cold-hardware ROCm runtime initialisation cost.
HIP_VISIBLE_DEVICES support
HIP_VISIBLE_DEVICES is now honoured in _get_parent_visible_gpu_spec, and apply_gpu_ids sets HIP_VISIBLE_DEVICES for ROCm training workers, parallel to CUDA_VISIBLE_DEVICES on NVIDIA.
5. Windows: ROCm inference fix (llama_cpp.py)
Problem: The llama.cpp prebuilt bundles its own
rocblas.dll next to the binary but not the Tensile kernel library files it depends on at runtime (rocblas/library/TensileLibrary*.dat + .hsaco). When rocBLAS initialises for the first real GEMM (prefill), it searches for those files relative to its own location, which does not exist in the prebuilt install tree, causing a crash with WinError 10054/10061. This is why the crash only appeared on the first generation request and not during model load.
Fix: Set
ROCBLAS_TENSILE_LIBPATH in the llama-server subprocess env to <HIP_PATH>\bin\rocblas\library when the directory exists. Uses setdefault so a user-supplied value is never overwritten. No-op on CUDA/CPU and Linux.
6. Windows: install_llama_prebuilt.py
- _resolve_exe() added: checks PATH then HIP_PATH\bin / ROCM_PATH\bin before giving up, so has_rocm is not silently False on machines where the HIP SDK bin dir is not on PATH.
- …llama-{tag}-bin-win-hip-radeon-x64.zip) added to the simple-policy Windows path so the HIP prebuilt is selected when ROCm is detected.
- --has-rocm flag added: setup.ps1 passes its own detection result through so the prebuilt installer does not need to re-run hipinfo / amd-smi internally.
7. Linux: Strix Halo + ROCm 7.1 routing fix (install.sh, install_python_stack.py)
Problem: ROCm 7.1 has a driver bug causing a segfault in
torch._grouped_mm (moe_utils.py:167) on gfx1151/gfx1150. The Radeon repo began shipping Python 3.13 wheels for rocm-rel-7.1, so the installer was silently landing on the broken combo even for users who expected rocm7.2.
Fix: When gfx1151 or gfx1150 is detected and
TORCH_INDEX_URL is */rocm7.1, the installer overrides to https://repo.amd.com/rocm/whl/gfx1151/ (AMD's arch-specific index), landing users on torch==2.11.0+rocm7.13.0 which has the actual _grouped_mm kernel fix. UNSLOTH_AMD_ROCM_MIRROR overrides the base URL for air-gapped installs. HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES are respected on mixed iGPU+dGPU setups so the override only fires when the runtime-target GPU is actually a Strix arch. Also: ROCm tag normalisation now uses explicit case arms (e.g. rocm7.1|rocm7.1.*) so a future rocm7.10 does not match rocm7.1*.
install.sh also adds a sysfs KFD topology fallback (/sys/class/kfd/kfd/topology/nodes/*/properties) for AMD detection when neither rocminfo nor amd-smi is available, and explicit warning messages when ROCm version cannot be determined.
8. Linux: Ubuntu 24.04 + ROCm 7.x llama.cpp build fix (setup.sh)
Problem: ROCm 7.x ships clang-20, which on Ubuntu 24.04 selects
/usr/lib/gcc/x86_64-linux-gnu/14 (a runtime-only dir with no C++ headers) as its default GCC install dir, causing fatal error: 'cstdlib' file not found during the llama.cpp HIP source build.
Fix:
setup.sh uses gcc -print-multiarch to determine the correct multiarch path (falling back to $(uname -m)-linux-gnu), then iterates gcc 14 -> 11 to find the first version with both a runtime dir and /usr/include/c++/<ver> headers, and passes --gcc-install-dir=<path> via CMAKE_HIP_FLAGS.
9. GPU OOM guard (worker.py)
Problem: On ROCm GPUs (confirmed on RDNA 4 gfx1200/gfx1201 and Strix Halo gfx1151), exhausting VRAM/unified memory can cause the HIP driver to hang the GPU ring buffer rather than raising a Python
OutOfMemoryError, resulting in a full system freeze requiring a hard reboot.Fix:
torch.cuda.set_per_process_memory_fraction() is called at worker startup on ROCm only:
- 0.90 -- leaves ~1.6 GB headroom on a 16 GB card for HIP context and allocator fragmentation.
- 0.80 -- the unified pool is shared with the host OS and page cache. 0.90 x 128 GB would leave only ~13 GB for the entire system, reproducing the freeze rather than preventing it.
Classification uses
gcnArchName stripped of feature suffixes (e.g. gfx1151:xnack- -> gfx1151), checked against {"gfx1150", "gfx1151"}. This is naming-independent and avoids the marketing-name ambiguity between Strix Point (890M) and Strix Halo (8060S). The OOM exception handler also surfaces a clear, actionable UI message with concrete remediation steps instead of a raw HIP error string.
10. Tests (tests/studio/install/test_rocm_support.py)
~1,231 lines of new test coverage:
- TestStrixHaloGfxArchDetection -- arch detection waterfall (hipinfo PATH / HIP_PATH / amd-smi)
- TestHipSdkEnvPathResolution -- HIP_PATH / ROCM_PATH fallback for hipinfo / hipconfig
- TestHipSdkDetectedSubstep -- terminal output substeps on AMD detection
- TestStrixRocm71Override -- Strix routing to AMD arch-specific index; UNSLOTH_AMD_ROCM_MIRROR; non-Strix ROCm 7.1 stays on pytorch.org
- TestSetupShGccInstallDir -- Ubuntu 24.04 gcc-install-dir fix
- TestServerStartupRocmFixes -- BNB_ROCM_VERSION, distributed stubs, FakeProcessGroup
- TestHipSdkInstalledButDeviceInaccessible -- SDK-found-but-not-ROCm-accessible branch
- …HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, Strix sibling handling
231 passed, 1 skipped.
Files changed
- install.ps1
- studio/setup.ps1
- install.sh
- studio/install_python_stack.py: _ensure_rocm_torch(), _detect_windows_gfx_arch(), rocm7.2 enabled, Strix routing parity with install.sh
- studio/setup.sh: --gcc-install-dir fix for HIP llama.cpp builds
- studio/backend/main.py: BNB_ROCM_VERSION auto-detection; ROCm DLL directory registration (_ROCM_DLL_HANDLES)
- studio/backend/core/training/worker.py: _grouped_mm Python fallback; _ROCM_DLL_HANDLES; OOM guard; BNB_ROCM_VERSION detection
- studio/backend/utils/hardware/hardware.py: _reconcile_rocm_unified_memory(); torch.distributed stubs; HIP_VISIBLE_DEVICES in apply_gpu_ids and _get_parent_visible_gpu_spec; broad ROCm detection
- studio/backend/utils/hardware/amd.py: amd-smi circuit breaker; CREATE_NO_WINDOW suppression; raised timeout; non-integer GPU id warning
- studio/backend/core/inference/llama_cpp.py: ROCBLAS_TENSILE_LIBPATH for Windows ROCm inference
- studio/install_llama_prebuilt.py: _resolve_exe() HIP_PATH fallback; HIP asset in simple policy; --has-rocm flag
- tests/studio/install/test_rocm_support.py
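As a closing illustration of the arch classification used by the GPU OOM guard (section 9), the suffix stripping and fraction choice can be sketched as below. The helper names are hypothetical; the two fractions and the Strix arch set are taken from the summary, and treating every non-Strix arch as the 0.90 case is an assumption of this sketch.

```python
STRIX_ARCHES = {"gfx1150", "gfx1151"}


def base_arch(gcn_arch_name):
    """Strip feature suffixes, e.g. 'gfx1151:xnack-' -> 'gfx1151'."""
    return gcn_arch_name.split(":", 1)[0].strip().lower()


def memory_fraction_for(gcn_arch_name):
    """Pick the per-process memory fraction for a ROCm device."""
    if base_arch(gcn_arch_name) in STRIX_ARCHES:
        return 0.80  # unified pool shared with the host OS and page cache
    return 0.90      # assumed default for all other (discrete) arches
```

A worker would then pass the result to `torch.cuda.set_per_process_memory_fraction(fraction, device)` at startup.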