Closed
59 commits
7fa0192
Add ROCm detection to install.sh and expand shell tests
edamamez Mar 31, 2026
f3cc758
Add ROCm torch reinstall support to install_python_stack.py
GoldenGrapeGentleman Mar 31, 2026
062e25f
Add ROCm support to llama.cpp prebuilt installer
edamamez Mar 31, 2026
450f5de
Add IS_ROCM hardware flag and fix AMD error message
GoldenGrapeGentleman Mar 31, 2026
f6c2eb8
Add comprehensive ROCm support test suite (68 tests)
danielhanchen Mar 31, 2026
7290199
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
2cb0b52
Harden ROCm support: probe error handling, version cap, validation
danielhanchen Mar 31, 2026
56098e5
Clean up rocm_paths list construction in detect_host()
danielhanchen Mar 31, 2026
9e33c25
Require actual AMD GPU presence before selecting ROCm paths
danielhanchen Mar 31, 2026
4286525
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
7d6ac65
Harden hipconfig version parsing and torch probe compatibility
danielhanchen Mar 31, 2026
fd43235
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
10ec0cd
Strengthen AMD GPU detection and add NVIDIA precedence guard
danielhanchen Mar 31, 2026
c22312b
Merge branch 'main' into feature/rocm-support-v2
danielhanchen Mar 31, 2026
726fab1
Add Windows-specific ROCm/HIP detection in detect_host()
danielhanchen Mar 31, 2026
f17e007
Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows m…
danielhanchen Mar 31, 2026
1482326
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
134638d
Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage
danielhanchen Mar 31, 2026
4dc1dca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
3881539
Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU…
danielhanchen Mar 31, 2026
326a971
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 31, 2026
2d55e77
Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec
danielhanchen Mar 31, 2026
478bc7f
Fix IS_ROCM test assertion for multi-line formatting
danielhanchen Mar 31, 2026
a067110
Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fi…
danielhanchen Mar 31, 2026
9484014
Merge branch 'main' into feature/rocm-support-v2
danielhanchen Mar 31, 2026
d1e3858
Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428
danielhanchen Apr 1, 2026
d1de729
Support AMD Radeon for studio (#4770)
iswaryaalex Apr 3, 2026
f4d64ff
Merge main into feature/rocm-support-v2, resolve install.sh conflict
danielhanchen Apr 3, 2026
f9a738a
Remove ROCm test files from main PR
danielhanchen Apr 3, 2026
4148591
Fix installer and hardware detection issues for PR #4720
danielhanchen Apr 3, 2026
ae0f9af
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 3, 2026
ec12f9b
Fix GPU detection false positives and add missing health groups
danielhanchen Apr 3, 2026
848c92a
Fix _ensure_rocm_torch and Windows AMD warning false positives
danielhanchen Apr 3, 2026
86735ff
Fix amd-smi GPU detection for GPU[N] output format
danielhanchen Apr 3, 2026
96ba872
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 3, 2026
3caaf30
Harden AMD GPU detection against false positives
danielhanchen Apr 3, 2026
263470e
Remove duplicate comment from pre-commit merge
danielhanchen Apr 3, 2026
1b98f6d
Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean …
danielhanchen Apr 5, 2026
9d7c2e7
Merge main into feature/rocm-support-v2
danielhanchen Apr 5, 2026
b37f7e6
Fix VRAM parsing for string values and GB/GiB consistency
danielhanchen Apr 5, 2026
543e721
Add --no-cache to uv for ROCm HIP source builds
danielhanchen Apr 5, 2026
84a9c55
Fix critical: initialize _amd_gpu_radeon before case block
danielhanchen Apr 5, 2026
c6f5b3a
Fix Windows AMD: route has_rocm hosts to HIP prebuilt path
danielhanchen Apr 5, 2026
810b833
Harden ROCm detection, Radeon wheel fallback, and HIP visibility
danielhanchen Apr 8, 2026
8636fa6
Fix round 2 regressions: ROCm validate_server and Windows HIP routing
danielhanchen Apr 8, 2026
5341e46
Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps
danielhanchen Apr 8, 2026
5305c31
Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bits…
danielhanchen Apr 8, 2026
7d27b2e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 8, 2026
f98aaef
Fix gemini findings: amd-smi metric envelope validation and dict-wrap…
danielhanchen Apr 8, 2026
37432b6
Fix gemini round 2 findings: explicit length guard on ROCm version fi…
danielhanchen Apr 8, 2026
c12e8b7
Fix gemini round 3: include has_rocm in validate_server fallback path
danielhanchen Apr 8, 2026
d25c570
Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memo…
danielhanchen Apr 8, 2026
b3627bc
Fix gemini round 5: POSIX compliance and leading-comma visibility par…
danielhanchen Apr 8, 2026
5211328
Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallbac…
danielhanchen Apr 9, 2026
1d387d6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 9, 2026
7effb3a
Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ cli…
danielhanchen Apr 9, 2026
bae2421
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 9, 2026
a24b27e
Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, mai…
danielhanchen Apr 9, 2026
ed97b1f
Split: keep only 1 file(s)
danielhanchen Apr 9, 2026
10 changes: 5 additions & 5 deletions install.sh
@@ -1040,15 +1040,15 @@ if [ "$_MIGRATED" = true ]; then
# to prevent transitive torch resolution.
run_install_cmd "install unsloth (migrated no-torch)" uv pip install --python "$_VENV_PY" --no-deps \
--reinstall-package unsloth --reinstall-package unsloth-zoo \
-"unsloth>=2026.4.2" unsloth-zoo
+"unsloth>=2026.4.4" unsloth-zoo
_NO_TORCH_RT="$(_find_no_torch_runtime)"
if [ -n "$_NO_TORCH_RT" ]; then
run_install_cmd "install no-torch runtime deps" uv pip install --python "$_VENV_PY" --no-deps -r "$_NO_TORCH_RT"
fi
else
run_install_cmd "install unsloth (migrated)" uv pip install --python "$_VENV_PY" \
--reinstall-package unsloth --reinstall-package unsloth-zoo \
-"unsloth>=2026.4.2" unsloth-zoo
+"unsloth>=2026.4.4" unsloth-zoo
fi
if [ "$STUDIO_LOCAL_INSTALL" = true ]; then
substep "overlaying local repo (editable)..."
@@ -1070,7 +1070,7 @@ elif [ -n "$TORCH_INDEX_URL" ]; then
# runtime deps (typer, safetensors, transformers, etc.) with --no-deps.
run_install_cmd "install unsloth (no-torch)" uv pip install --python "$_VENV_PY" --no-deps \
--upgrade-package unsloth --upgrade-package unsloth-zoo \
-"unsloth>=2026.4.2" unsloth-zoo
+"unsloth>=2026.4.4" unsloth-zoo
_NO_TORCH_RT="$(_find_no_torch_runtime)"
if [ -n "$_NO_TORCH_RT" ]; then
run_install_cmd "install no-torch runtime deps" uv pip install --python "$_VENV_PY" --no-deps -r "$_NO_TORCH_RT"
@@ -1081,7 +1081,7 @@ elif [ -n "$TORCH_INDEX_URL" ]; then
fi
elif [ "$STUDIO_LOCAL_INSTALL" = true ]; then
run_install_cmd "install unsloth (local)" uv pip install --python "$_VENV_PY" \
---upgrade-package unsloth "unsloth>=2026.4.2" unsloth-zoo
+--upgrade-package unsloth "unsloth>=2026.4.4" unsloth-zoo
substep "overlaying local repo (editable)..."
run_install_cmd "overlay local repo" uv pip install --python "$_VENV_PY" -e "$_REPO_ROOT" --no-deps
else
@@ -1092,7 +1092,7 @@ else
# Fallback: GPU detection failed to produce a URL -- let uv resolve torch
substep "installing unsloth (this may take a few minutes)..."
if [ "$STUDIO_LOCAL_INSTALL" = true ]; then
-run_install_cmd "install unsloth (auto torch backend)" uv pip install --python "$_VENV_PY" unsloth-zoo "unsloth>=2026.4.2" --torch-backend=auto
+run_install_cmd "install unsloth (auto torch backend)" uv pip install --python "$_VENV_PY" unsloth-zoo "unsloth>=2026.4.4" --torch-backend=auto
substep "overlaying local repo (editable)..."
run_install_cmd "overlay local repo" uv pip install --python "$_VENV_PY" -e "$_REPO_ROOT" --no-deps
else
51 changes: 13 additions & 38 deletions studio/backend/core/training/worker.py
@@ -306,37 +306,15 @@ def _ensure_mamba_ssm(event_queue: Any, model_name: str) -> None:


def _activate_transformers_version(model_name: str) -> None:
-"""Activate the correct transformers version BEFORE any ML imports.
-
-If the model needs transformers 5.x, prepend the pre-installed .venv_t5/
-directory to sys.path. Otherwise do nothing (default 4.57.x in .venv/).
-"""
+"""Activate the correct transformers version BEFORE any ML imports."""
# Ensure backend is on path for utils imports
backend_path = str(Path(__file__).resolve().parent.parent.parent)
if backend_path not in sys.path:
sys.path.insert(0, backend_path)

-from utils.transformers_version import (
-needs_transformers_5,
-_resolve_base_model,
-_ensure_venv_t5_exists,
-_VENV_T5_DIR,
-)
+from utils.transformers_version import activate_transformers_for_subprocess

-resolved = _resolve_base_model(model_name)
-if needs_transformers_5(resolved):
-if not _ensure_venv_t5_exists():
-raise RuntimeError(
-f"Cannot activate transformers 5.x: .venv_t5 missing at {_VENV_T5_DIR}"
-)
-if _VENV_T5_DIR not in sys.path:
-sys.path.insert(0, _VENV_T5_DIR)
-logger.info("Activated transformers 5.x from %s", _VENV_T5_DIR)
-# Propagate to child subprocesses (e.g. GGUF converter)
-_pp = os.environ.get("PYTHONPATH", "")
-os.environ["PYTHONPATH"] = _VENV_T5_DIR + (os.pathsep + _pp if _pp else "")
-else:
-logger.info("Using default transformers (4.57.x) for %s", model_name)
+activate_transformers_for_subprocess(model_name)


def run_training_process(
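The hunk above replaces the inline activation logic with a call to `activate_transformers_for_subprocess`, whose body is not part of this diff. A minimal sketch of what such a helper might do, reconstructed only from the removed lines, is given here. The name `activate_transformers_sketch`, its parameters, and the `os.path.isdir` check (standing in for `_ensure_venv_t5_exists`) are assumptions for illustration, and base-model resolution via `_resolve_base_model` is left out.

```python
import os
from typing import Callable, List, MutableMapping


def activate_transformers_sketch(
    model_name: str,
    needs_v5: Callable[[str], bool],  # hypothetical stand-in for needs_transformers_5
    venv_t5_dir: str,                 # hypothetical stand-in for _VENV_T5_DIR
    path: List[str],                  # normally sys.path
    env: MutableMapping[str, str],    # normally os.environ
) -> bool:
    """Prepend venv_t5_dir to `path` and PYTHONPATH when the model needs 5.x.

    Returns True if the 5.x directory was activated, False otherwise.
    """
    if not needs_v5(model_name):
        # Default transformers stays active; nothing to change.
        return False
    if not os.path.isdir(venv_t5_dir):
        raise RuntimeError(
            f"Cannot activate transformers 5.x: .venv_t5 missing at {venv_t5_dir}"
        )
    if venv_t5_dir not in path:
        path.insert(0, venv_t5_dir)
    # Propagate to child subprocesses, as the removed inline code did.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = venv_t5_dir + (os.pathsep + existing if existing else "")
    return True
```

Passing `path` and `env` explicitly is a choice made for this sketch so the behaviour can be checked without touching the real `sys.path` or `os.environ`; the actual helper presumably mutates those globals directly.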
@@ -386,25 +364,22 @@ def run_training_process(
)
return

-# ── 1a. Auto-enable trust_remote_code for unsloth/* transformers 5.x models ──
-# Some newer architectures (e.g. NemotronH) have config parsing bugs in
-# transformers that require trust_remote_code=True as a workaround.
-# Only auto-enable for unsloth/* prefixed models (trusted source).
-# Exclude Gemma 4 since it is a native transformers 5.5 model and
-# trust_remote_code=True would bypass the compiler (disabling fused CE).
-from utils.transformers_version import needs_transformers_5
-
+# ── 1a. Auto-enable trust_remote_code for NemotronH/Nano models ──
+# NemotronH has config parsing bugs in transformers that require
+# trust_remote_code=True as a workaround. Other transformers 5.x models
+# (Qwen3.5, Gemma 4, etc.) are native and do NOT need it — enabling it
+# bypasses the compiler (disabling fused CE).
+# NOTE: Must NOT match Llama-Nemotron (standard Llama architecture).
+_NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
_lowered = model_name.lower()
-_is_native_t5 = any(x in _lowered for x in ("gemma-4", "gemma4"))
if (
-needs_transformers_5(model_name)
-and _lowered.startswith("unsloth/")
-and not _is_native_t5
+any(sub in _lowered for sub in _NEMOTRON_TRUST_SUBSTRINGS)
+and (_lowered.startswith("unsloth/") or _lowered.startswith("nvidia/"))
and not config.get("trust_remote_code", False)
):
config["trust_remote_code"] = True
logger.info(
-"Auto-enabled trust_remote_code for unsloth/* transformers 5.x model: %s",
+"Auto-enabled trust_remote_code for Nemotron model: %s",
model_name,
)
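The new condition in this hunk can be isolated as a small predicate, which makes the intended matches and non-matches easier to check. The sketch below copies the substrings and prefixes from the diff; the function name `should_auto_trust` and the model names in the usage note are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Substrings and prefixes taken from the diff above.
NEMOTRON_TRUST_SUBSTRINGS = ("nemotron_h", "nemotron-h", "nemotron-3-nano")
TRUSTED_PREFIXES = ("unsloth/", "nvidia/")


def should_auto_trust(model_name: str, config: Dict[str, Any]) -> bool:
    """Return True when trust_remote_code would be auto-enabled for this model."""
    lowered = model_name.lower()
    return (
        # Only NemotronH / Nemotron-3-Nano style names qualify.
        any(sub in lowered for sub in NEMOTRON_TRUST_SUBSTRINGS)
        # Only repositories under the trusted organisations qualify.
        and lowered.startswith(TRUSTED_PREFIXES)
        # Skip if the caller already set the flag.
        and not config.get("trust_remote_code", False)
    )
```

With these rules, a name such as `unsloth/Nemotron-H-8B` would qualify, while a Llama-Nemotron name such as `nvidia/Llama-3.1-Nemotron-70B-Instruct` would not, because it contains none of the three substrings. `str.startswith` accepts a tuple of prefixes, which is equivalent to the two `or`-joined calls in the diff.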

8 changes: 7 additions & 1 deletion studio/backend/tests/test_utils.py
@@ -191,8 +191,14 @@ def test_has_backend_key(self):
assert "backend" in get_gpu_memory_info()

def test_backend_matches_device(self):
+# The backend field uses _backend_label, which swaps "cuda" for
+# "rocm" when running on an AMD host (IS_ROCM=True) so the UI
+# can render the correct label. On CUDA / XPU / MLX / CPU hosts
+# it is equivalent to `get_device().value`.
+from utils.hardware.hardware import _backend_label
+
result = get_gpu_memory_info()
-assert result["backend"] == get_device().value
+assert result["backend"] == _backend_label(get_device())

# --- When a GPU IS available ---
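The test comment describes `_backend_label` as swapping "cuda" for "rocm" on AMD hosts and otherwise returning the device value. Its implementation is not in this diff, so the following is only a sketch of that described behaviour. The `DeviceType` enum, the function name `backend_label`, and the explicit `is_rocm` parameter are assumptions; the real helper presumably reads the `IS_ROCM` flag itself.

```python
from enum import Enum


class DeviceType(Enum):
    """Hypothetical device enum; the real one lives in the hardware module."""
    CUDA = "cuda"
    XPU = "xpu"
    MLX = "mlx"
    CPU = "cpu"


def backend_label(device: DeviceType, is_rocm: bool) -> str:
    """Return the label for the UI: "rocm" for CUDA devices on an AMD host."""
    if device is DeviceType.CUDA and is_rocm:
        # ROCm builds of torch report the device as "cuda", so relabel it.
        return "rocm"
    return device.value
```

Under this sketch, `backend_label(DeviceType.CUDA, is_rocm=False)` equals `DeviceType.CUDA.value`, which matches the comment's claim that the old assertion and the new one agree on non-AMD hosts.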
