fix(rocm): cap RDNA2 to rocm6.2 torch and fix inference BF16 on RDNA2 #3

Closed
LeoBorcherding wants to merge 1 commit into main from fix/llvm-error-issue-5337

@LeoBorcherding

Owner

Summary

Two bugs affecting RDNA2 GPUs (gfx1030-gfx1036, e.g. RX 6600/6700/6800/6900) when ROCm 7.x is installed:

Bug 1 — Installer puts a dev/nightly PyTorch build in the Studio venv

When ROCm 7.x is detected, the installer selects the rocm7.x PyTorch index, which serves dev builds (the version string contains a git hash, e.g. 2.10.0+rocm7.2.0.gitb6ee5fde). These builds segfault during unsloth's import/patching phase on RDNA2 hardware. In the reported case, the user's system pip had a stable 2.7.1+rocm6.2.4, but the Studio venv got the broken dev build.
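For illustration only (this check is not part of the PR), a dev build of the kind described above can be recognised from its version string. The sketch below assumes the git hash always appears in the local version segment after the `+`, as in the example string; `looks_like_dev_build` is a hypothetical helper name.

```python
import re

# Assumption: dev builds carry "git<hex hash>" somewhere after the "+",
# as in "2.10.0+rocm7.2.0.gitb6ee5fde". Stable builds such as
# "2.7.1+rocm6.2.4" have no such component.
_GIT_SUFFIX = re.compile(r"\+.*\bgit[0-9a-f]{7,}", re.IGNORECASE)


def looks_like_dev_build(version: str) -> bool:
    """Return True if the torch version string embeds a git hash."""
    return bool(_GIT_SUFFIX.search(version))


if __name__ == "__main__":
    for v in ("2.10.0+rocm7.2.0.gitb6ee5fde", "2.7.1+rocm6.2.4"):
        print(v, looks_like_dev_build(v))
```

A check like this could help confirm which build ended up in a given venv by passing it the value of `torch.__version__`.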

Fix: detect the GPU's gfx code at install time. If it is RDNA2, cap the torch install to the rocm6.2 index (torch 2.7.x) regardless of the system ROCm version. The cap is applied in both install.sh and install_python_stack.py, with --force-reinstall so that existing broken builds get replaced.
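The selection logic can be sketched roughly as follows. This is a minimal illustration, not the code in install_python_stack.py: the helper names `is_rdna2` and `pick_torch_index` are hypothetical, the non-RDNA2 branch (deriving the index name from the system ROCm major.minor version) is an assumption, and how the gfx code is actually read from the system is left out.

```python
# gfx1030 through gfx1036, the RDNA2 range named in this PR.
RDNA2_GFX = frozenset(f"gfx103{i}" for i in range(7))


def is_rdna2(gfx_code: str) -> bool:
    """Return True for gfx1030..gfx1036."""
    # Some tools append feature flags such as ":xnack-"; drop them.
    base = gfx_code.strip().lower().split(":", 1)[0]
    return base in RDNA2_GFX


def pick_torch_index(gfx_code: str, system_rocm: str) -> str:
    """Choose a PyTorch wheel index name for the detected GPU."""
    if is_rdna2(gfx_code):
        # Cap RDNA2 to the stable rocm6.2 index regardless of system ROCm.
        return "rocm6.2"
    # Assumption: other GPUs follow the system ROCm major.minor version.
    major_minor = ".".join(system_rocm.split(".")[:2])
    return f"rocm{major_minor}"


if __name__ == "__main__":
    print(pick_torch_index("gfx1030", "7.2.0"))
    print(pick_torch_index("gfx1100", "7.2.0"))
```

Matching against an explicit set of gfx codes, rather than a numeric range, keeps the cap from accidentally applying to codes outside the range the PR names.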

Bug 2 — Inference subprocess crashes with LLVM ERROR on RDNA2

The training path fix from unslothai#5301 only covered trainer.py. The inference subprocess (InferenceBackend.load_model()) still passed dtype=None to FastLanguageModel/FastVisionModel.from_pretrained(), which auto-selected bfloat16 on RDNA2, triggering:

LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.fdot2.bf16.bf16

Fix: apply the same is_bfloat16_supported() guard from trainer.py to InferenceBackend.load_model(), forcing float16 on hardware that doesn't support bfloat16.
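The shape of the guard can be sketched as below. This is a simplified stand-in, not the code in inference.py: `resolve_inference_dtype` is a hypothetical name, dtypes are represented as strings to keep the example self-contained, and the capability check is passed in as a callable where the real code would call is_bfloat16_supported(). Leaving an explicitly requested dtype untouched is also an assumption.

```python
from typing import Callable, Optional


def resolve_inference_dtype(
    requested: Optional[str],
    bf16_supported: Callable[[], bool],
) -> Optional[str]:
    """Pick the dtype to pass to from_pretrained() for inference."""
    if requested is not None:
        # Assumption: an explicit caller choice is left untouched.
        return requested
    if not bf16_supported():
        # dtype=None would auto-select bfloat16; force float16 instead.
        return "float16"
    # bfloat16 works on this hardware, so keep auto-selection.
    return None


if __name__ == "__main__":
    # Simulate an RDNA2 GPU that does not support bfloat16.
    print(resolve_inference_dtype(None, lambda: False))
```

Passing the capability check in as a callable makes the fallback easy to exercise without a GPU present.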

Files changed

  • install.sh — RDNA2 gfx detection + rocm6.2 cap in the shell installer
  • studio/install_python_stack.py — same cap in the Python stack installer (unsloth studio update)
  • studio/backend/core/inference/inference.py — float16 fallback for inference on RDNA2

Closes unslothai#5337

Two bugs affecting RDNA2 (gfx1030-gfx1036, e.g. RX 6600) with ROCm 7.x:

1. Installer selects the rocm7.x PyTorch index when ROCm 7.x is detected,
   landing dev/nightly builds (e.g. 2.10.0+rocm7.2.0.gitXXXXXXXX) in the
   Studio venv. These builds segfault during unsloth import on RDNA2.
   Fix: detect RDNA2 gfx code at install time and cap to the rocm6.2 index
   (torch 2.7.x) in both install.sh and install_python_stack.py, replacing
   any existing broken dev build via --force-reinstall.

2. Inference subprocess passed dtype=None to FastLanguageModel/FastVisionModel
   which auto-selected bfloat16 on RDNA2, crashing with:
     LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.fdot2.bf16.bf16
   Fix: apply the same is_bfloat16_supported() guard from trainer.py to
   InferenceBackend.load_model(), forcing float16 on RDNA2 for inference.

Fixes: unslothai#5337

Development

Successfully merging this pull request may close these issues.

[Bug] LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.fdot2.bf16.bf16