Closed
62 commits
7fa0192
Add ROCm detection to install.sh and expand shell tests
edamamez Mar 31, 2026
f3cc758
Add ROCm torch reinstall support to install_python_stack.py
GoldenGrapeGentleman Mar 31, 2026
062e25f
Add ROCm support to llama.cpp prebuilt installer
edamamez Mar 31, 2026
450f5de
Add IS_ROCM hardware flag and fix AMD error message
GoldenGrapeGentleman Mar 31, 2026
f6c2eb8
Add comprehensive ROCm support test suite (68 tests)
danielhanchen Mar 31, 2026
7290199
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
2cb0b52
Harden ROCm support: probe error handling, version cap, validation
danielhanchen Mar 31, 2026
56098e5
Clean up rocm_paths list construction in detect_host()
danielhanchen Mar 31, 2026
9e33c25
Require actual AMD GPU presence before selecting ROCm paths
danielhanchen Mar 31, 2026
4286525
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
7d6ac65
Harden hipconfig version parsing and torch probe compatibility
danielhanchen Mar 31, 2026
fd43235
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
10ec0cd
Strengthen AMD GPU detection and add NVIDIA precedence guard
danielhanchen Mar 31, 2026
c22312b
Merge branch 'main' into feature/rocm-support-v2
danielhanchen Mar 31, 2026
726fab1
Add Windows-specific ROCm/HIP detection in detect_host()
danielhanchen Mar 31, 2026
f17e007
Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows m…
danielhanchen Mar 31, 2026
1482326
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
134638d
Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage
danielhanchen Mar 31, 2026
4dc1dca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
3881539
Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU…
danielhanchen Mar 31, 2026
326a971
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
2d55e77
Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec
danielhanchen Mar 31, 2026
478bc7f
Fix IS_ROCM test assertion for multi-line formatting
danielhanchen Mar 31, 2026
a067110
Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fi…
danielhanchen Mar 31, 2026
9484014
Merge branch 'main' into feature/rocm-support-v2
danielhanchen Mar 31, 2026
d1e3858
Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428
danielhanchen Apr 1, 2026
d1de729
Support AMD Radeon for studio (#4770)
iswaryaalex Apr 3, 2026
f4d64ff
Merge main into feature/rocm-support-v2, resolve install.sh conflict
danielhanchen Apr 3, 2026
f9a738a
Remove ROCm test files from main PR
danielhanchen Apr 3, 2026
4148591
Fix installer and hardware detection issues for PR #4720
danielhanchen Apr 3, 2026
ae0f9af
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 3, 2026
ec12f9b
Fix GPU detection false positives and add missing health groups
danielhanchen Apr 3, 2026
848c92a
Fix _ensure_rocm_torch and Windows AMD warning false positives
danielhanchen Apr 3, 2026
86735ff
Fix amd-smi GPU detection for GPU[N] output format
danielhanchen Apr 3, 2026
96ba872
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 3, 2026
3caaf30
Harden AMD GPU detection against false positives
danielhanchen Apr 3, 2026
263470e
Remove duplicate comment from pre-commit merge
danielhanchen Apr 3, 2026
1b98f6d
Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean …
danielhanchen Apr 5, 2026
9d7c2e7
Merge main into feature/rocm-support-v2
danielhanchen Apr 5, 2026
b37f7e6
Fix VRAM parsing for string values and GB/GiB consistency
danielhanchen Apr 5, 2026
543e721
Add --no-cache to uv for ROCm HIP source builds
danielhanchen Apr 5, 2026
84a9c55
Fix critical: initialize _amd_gpu_radeon before case block
danielhanchen Apr 5, 2026
c6f5b3a
Fix Windows AMD: route has_rocm hosts to HIP prebuilt path
danielhanchen Apr 5, 2026
810b833
Harden ROCm detection, Radeon wheel fallback, and HIP visibility
danielhanchen Apr 8, 2026
8636fa6
Fix round 2 regressions: ROCm validate_server and Windows HIP routing
danielhanchen Apr 8, 2026
5341e46
Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps
danielhanchen Apr 8, 2026
5305c31
Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bits…
danielhanchen Apr 8, 2026
7d27b2e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2026
f98aaef
Fix gemini findings: amd-smi metric envelope validation and dict-wrap…
danielhanchen Apr 8, 2026
37432b6
Fix gemini round 2 findings: explicit length guard on ROCm version fi…
danielhanchen Apr 8, 2026
c12e8b7
Fix gemini round 3: include has_rocm in validate_server fallback path
danielhanchen Apr 8, 2026
d25c570
Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memo…
danielhanchen Apr 8, 2026
b3627bc
Fix gemini round 5: POSIX compliance and leading-comma visibility par…
danielhanchen Apr 8, 2026
5211328
Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallbac…
danielhanchen Apr 9, 2026
1d387d6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 9, 2026
7effb3a
Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ cli…
danielhanchen Apr 9, 2026
bae2421
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 9, 2026
a24b27e
Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, mai…
danielhanchen Apr 9, 2026
435c1d8
Split: keep only 10 file(s)
danielhanchen Apr 9, 2026
3989cd2
Fix ROCm visibility masks, published bundle routing, and amd-smi JSON…
danielhanchen Apr 9, 2026
282bf9b
Fix NVIDIA visibility masking and ROCm env var precedence
danielhanchen Apr 9, 2026
087053a
Fix layered ROCm visibility: ROCR narrows physical set, HIP/CUDA sele…
danielhanchen Apr 9, 2026
445 changes: 433 additions & 12 deletions install.sh

Large diffs are not rendered by default.

122 changes: 103 additions & 19 deletions studio/backend/core/training/worker.py
@@ -86,6 +86,7 @@ def _probe_causal_conv1d_env() -> dict[str, str] | None:
                 "'python_tag': f'cp{sys.version_info.major}{sys.version_info.minor}', "
                 "'torch_mm': torch_mm, "
                 "'cuda_major': str(int(str(torch.version.cuda).split('.', 1)[0])) if torch.version.cuda else '', "
+                "'hip_version': str(torch.version.hip) if getattr(torch.version, 'hip', None) else '', "
                 "'cxx11abi': str(torch._C._GLIBCXX_USE_CXX11_ABI).upper()"
                 "}))"
             ),
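Review note: the added `hip_version` field uses `getattr(torch.version, 'hip', None)`, so it yields an empty string both when the attribute is `None` (CUDA builds) and when it is missing altogether. Below is a minimal sketch of that guard. It uses `types.SimpleNamespace` stand-ins for `torch.version` so it runs without PyTorch installed; the helper name `hip_version_field` and the version strings are illustrative assumptions, not part of the PR.

```python
from types import SimpleNamespace


def hip_version_field(version_module) -> str:
    """Return the HIP version as a string, or '' when the build has none."""
    hip = getattr(version_module, "hip", None)
    return str(hip) if hip else ""


# Stand-ins for torch.version on different builds (illustrative values only).
rocm_build = SimpleNamespace(cuda=None, hip="6.2.41133")
cuda_build = SimpleNamespace(cuda="12.4", hip=None)
old_build = SimpleNamespace(cuda="11.8")  # no 'hip' attribute at all

print(hip_version_field(rocm_build))        # 6.2.41133
print(repr(hip_version_field(cuda_build)))  # ''
print(repr(hip_version_field(old_build)))   # ''
```

Downstream, `_install_package_wheel_first` treats a non-empty `hip_version` as the signal to take the ROCm source-build path.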
@@ -237,28 +238,111 @@ def _install_package_wheel_first(
     else:
         logger.info("No published %s wheel found: %s", display_name, wheel_url)

-    _send_status(event_queue, f"Installing {display_name} from PyPI...")
-    pypi_cmd = [
-        sys.executable,
-        "-m",
-        "pip",
-        "install",
-        "--no-build-isolation",
-        "--no-deps",
-        "--no-cache-dir",
-        f"{pypi_name}=={pypi_version}",
-    ]
-    result = _sp.run(
-        pypi_cmd,
-        stdout = _sp.PIPE,
-        stderr = _sp.STDOUT,
-        text = True,
-    )
+    is_hip = env and env.get("hip_version")
+    if is_hip and not shutil.which("hipcc"):
+        logger.error(
+            "%s requires hipcc for source compilation on ROCm. "
+            "Install the ROCm HIP SDK: https://rocm.docs.amd.com",
+            display_name,
+        )
+        _send_status(
+            event_queue,
+            f"{display_name}: hipcc not found (ROCm HIP SDK required)",
+        )
+        return
+
+    if is_hip:
+        _send_status(
+            event_queue,
+            f"Compiling {display_name} from source for ROCm "
+            "(this may take several minutes)...",
+        )
+    else:
+        _send_status(event_queue, f"Installing {display_name} from PyPI...")
+
+    # Prefer uv for faster dependency resolution when available
+    if shutil.which("uv"):
+        pypi_cmd = [
+            "uv",
+            "pip",
+            "install",
+            "--python",
+            sys.executable,
+            "--no-build-isolation",
+            "--no-deps",
+        ]
+        # Avoid stale cache artifacts from partial HIP source builds
+        if is_hip:
+            pypi_cmd.append("--no-cache")
+        pypi_cmd.append(f"{pypi_name}=={pypi_version}")
+    else:
+        pypi_cmd = [
+            sys.executable,
+            "-m",
+            "pip",
+            "install",
+            "--no-build-isolation",
+            "--no-deps",
+            "--no-cache-dir",
+            f"{pypi_name}=={pypi_version}",
+        ]
+
+    # Source compilation on ROCm can take 10-30 minutes; use a generous
+    # timeout. Non-HIP installs preserve the pre-existing "no timeout"
+    # behaviour so unrelated slow installs (e.g. causal-conv1d source
+    # build on Linux aarch64 or unsupported torch/CUDA combinations)
+    # are not aborted at 5 minutes by this PR.
+    _run_kwargs: dict[str, Any] = {
+        "stdout": _sp.PIPE,
+        "stderr": _sp.STDOUT,
+        "text": True,
+    }
+    if is_hip:
+        _run_kwargs["timeout"] = 1800
+
+    try:
+        result = _sp.run(pypi_cmd, **_run_kwargs)
+    except _sp.TimeoutExpired:
+        logger.error(
+            "%s installation timed out after %ds",
+            display_name,
+            _run_kwargs.get("timeout"),
+        )
+        _send_status(
+            event_queue,
+            f"{display_name} installation timed out after "
+            f"{_run_kwargs.get('timeout')}s",
+        )
+        return
+
     if result.returncode != 0:
-        logger.error("Failed to install %s from PyPI:\n%s", display_name, result.stdout)
+        if is_hip:
+            # Surface a clear error for ROCm source build failures
+            error_lines = (result.stdout or "").strip().splitlines()
+            snippet = "\n".join(error_lines[-5:]) if error_lines else "(no output)"
+            logger.error(
+                "Failed to compile %s for ROCm:\n%s",
+                display_name,
+                result.stdout,
+            )
+            _send_status(
+                event_queue,
+                f"Failed to compile {display_name} for ROCm. "
+                "Check that hipcc and ROCm development headers are installed.\n"
+                f"{snippet}",
+            )
+        else:
+            logger.error(
+                "Failed to install %s from PyPI:\n%s",
+                display_name,
+                result.stdout,
+            )
         return

-    logger.info("Installed %s from PyPI", display_name)
+    if is_hip:
+        logger.info("Compiled and installed %s from source for ROCm", display_name)
+    else:
+        logger.info("Installed %s from PyPI", display_name)


 def _ensure_causal_conv1d_fast_path(event_queue: Any, model_name: str) -> None:
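Review note: the hunk above combines three decisions: which installer to invoke, whether to bypass the cache, and whether to bound the run with a timeout. The sketch below restates that control flow as standalone helpers so each branch can be exercised in isolation. The names `build_install_cmd`, `run_with_optional_timeout`, and `tail` are hypothetical and do not appear in the PR; the flags are copied from the diff, and the `have_uv` parameter is an assumption added only to make the branches testable.

```python
import shutil
import subprocess
import sys
from typing import Any, Optional


def build_install_cmd(spec: str, is_hip: bool, have_uv: Optional[bool] = None) -> list:
    """Prefer uv when present, otherwise fall back to `python -m pip`."""
    if have_uv is None:
        have_uv = shutil.which("uv") is not None
    if have_uv:
        cmd = ["uv", "pip", "install", "--python", sys.executable,
               "--no-build-isolation", "--no-deps"]
        if is_hip:
            cmd.append("--no-cache")  # only HIP source builds skip the uv cache
        cmd.append(spec)
        return cmd
    # The pip fallback always disables its cache, as in the diff.
    return [sys.executable, "-m", "pip", "install",
            "--no-build-isolation", "--no-deps", "--no-cache-dir", spec]


def run_with_optional_timeout(cmd: list, timeout: Optional[int]):
    """Run cmd; pass a timeout only when one is given. Returns None on timeout."""
    kwargs: dict = {"stdout": subprocess.PIPE, "stderr": subprocess.STDOUT, "text": True}
    if timeout is not None:
        kwargs["timeout"] = timeout
    try:
        return subprocess.run(cmd, **kwargs)
    except subprocess.TimeoutExpired:
        return None


def tail(output: Optional[str], n: int = 5) -> str:
    """Last n lines of captured output, or a placeholder when there is none."""
    lines = (output or "").strip().splitlines()
    return "\n".join(lines[-n:]) if lines else "(no output)"


# "example-pkg==1.0" is a placeholder spec, not a real dependency of the PR.
print(build_install_cmd("example-pkg==1.0", is_hip=True, have_uv=True))
```

Leaving `timeout` out of the keyword arguments entirely, rather than passing a large value, is what preserves the previous unbounded behaviour for non-HIP installs.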
6 changes: 5 additions & 1 deletion studio/backend/main.py
@@ -237,6 +237,7 @@ async def get_system_info():
     import platform
     import psutil
     from utils.hardware import get_device
+    from utils.hardware.hardware import _backend_label

     visibility_info = get_backend_visible_gpu_info()
     gpu_info = {
@@ -250,7 +251,10 @@ async def get_system_info():
     return {
         "platform": platform.platform(),
         "python_version": platform.python_version(),
-        "device_backend": get_device().value,
+        # Use the centralized _backend_label helper so the /api/system
+        # endpoint reports "rocm" on AMD hosts instead of "cuda", matching
+        # the /api/hardware and /api/gpu-visibility endpoints.
+        "device_backend": _backend_label(get_device()),
         "cpu_count": psutil.cpu_count(),
         "memory": {
             "total_gb": round(memory.total / 1e9, 2),
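Review note: ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API, which is why the raw device value reads "cuda" on AMD hosts. The body of `_backend_label` is not shown in this diff, so the following is only a guess at its shape, assuming it maps the CUDA device type to "rocm" when the ROCm flag is set. The enum members and the explicit `is_rocm` parameter are hypothetical stand-ins.

```python
from enum import Enum


class DeviceType(Enum):
    """Hypothetical stand-in; the real enum lives in utils.hardware."""
    CUDA = "cuda"
    MPS = "mps"
    CPU = "cpu"


def backend_label(device: DeviceType, is_rocm: bool) -> str:
    """Report 'rocm' for a CUDA-typed device on a ROCm host, else the raw value."""
    if device is DeviceType.CUDA and is_rocm:
        return "rocm"
    return device.value


print(backend_label(DeviceType.CUDA, is_rocm=True))   # rocm
print(backend_label(DeviceType.CUDA, is_rocm=False))  # cuda
print(backend_label(DeviceType.CPU, is_rocm=True))    # cpu
```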
10 changes: 10 additions & 0 deletions studio/backend/utils/hardware/__init__.py
@@ -5,6 +5,7 @@
 Hardware detection and GPU utilities
 """

+from . import hardware as _hardware
 from .hardware import (
     DeviceType,
     DEVICE,
@@ -49,6 +50,7 @@
     "DeviceType",
     "DEVICE",
     "CHAT_ONLY",
+    "IS_ROCM",
     "detect_hardware",
     "get_device",
     "is_apple_silicon",
@@ -81,3 +83,11 @@
     "extract_arch_config",
     "estimate_training_vram",
 ]
+
+
+def __getattr__(name: str):
+    """Resolve IS_ROCM at access time so callers always see the live value
+    after detect_hardware() runs (it flips the flag in hardware.py)."""
+    if name == "IS_ROCM":
+        return getattr(_hardware, "IS_ROCM")
+    raise AttributeError(name)
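Review note: the module-level `__getattr__` (PEP 562) matters because a plain `from .hardware import IS_ROCM` re-export would copy the flag's value once at import time. The sketch below contrasts the two approaches using in-memory modules built with `types.ModuleType`. The module names `eager_pkg` and `lazy_pkg` are hypothetical, and flipping the flag by hand stands in for what `detect_hardware()` is described as doing.

```python
import types

# Stand-in for hardware.py: owns the mutable flag.
hardware = types.ModuleType("hardware")
hardware.IS_ROCM = False

# Eager re-export: copies the value once, like `from .hardware import IS_ROCM`.
eager_pkg = types.ModuleType("eager_pkg")
eager_pkg.IS_ROCM = hardware.IS_ROCM

# Lazy re-export: module-level __getattr__ is consulted only when normal
# attribute lookup fails, so it reads the live value on every access.
lazy_pkg = types.ModuleType("lazy_pkg")


def _lazy_getattr(name: str):
    if name == "IS_ROCM":
        return getattr(hardware, "IS_ROCM")
    raise AttributeError(name)


lazy_pkg.__getattr__ = _lazy_getattr

hardware.IS_ROCM = True  # simulate hardware detection flipping the flag

print(eager_pkg.IS_ROCM)  # False (stale copy)
print(lazy_pkg.IS_ROCM)   # True (live value)
```

Callers only benefit if they read the attribute through the package at use time; a `from package import IS_ROCM` statement in caller code still binds whatever value the flag holds at that moment.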