fix(rocm): cap RDNA2 to rocm6.2 torch and fix inference BF16 on RDNA2 #3
Closed
LeoBorcherding wants to merge 1 commit into
Two bugs affecting RDNA2 (gfx1030-gfx1036, e.g. RX 6600) with ROCm 7.x:
1. Installer selects the rocm7.x PyTorch index when ROCm 7.x is detected,
landing dev/nightly builds (e.g. 2.10.0+rocm7.2.0.gitXXXXXXXX) in the
Studio venv. These builds segfault during unsloth import on RDNA2.
Fix: detect RDNA2 gfx code at install time and cap to the rocm6.2 index
(torch 2.7.x) in both install.sh and install_python_stack.py, replacing
any existing broken dev build via --force-reinstall.
2. Inference subprocess passed dtype=None to FastLanguageModel/FastVisionModel
which auto-selected bfloat16 on RDNA2, crashing with:
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.fdot2.bf16.bf16
Fix: apply the same is_bfloat16_supported() guard from trainer.py to
InferenceBackend.load_model(), forcing float16 on RDNA2 for inference.
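
For reviewers, the float16 fallback in item 2 can be sketched roughly as below. This is a minimal illustration, not the code in inference.py: the function name resolve_inference_dtype and its parameters are hypothetical, and the bfloat16 check is passed in as a callable so the sketch runs without a GPU stack installed.

```python
from typing import Callable, Optional


def resolve_inference_dtype(
    requested: Optional[str],
    bf16_supported: Callable[[], bool],
) -> Optional[str]:
    """Pick the dtype to hand to the model loader (hypothetical helper).

    A return value of None means "let the loader auto-select".
    """
    # An explicit request from the caller is passed through unchanged.
    if requested is not None:
        return requested
    # Auto-selection would pick bfloat16; on hardware without bfloat16
    # support (e.g. RDNA2) force float16 instead.
    if not bf16_supported():
        return "float16"
    # bfloat16 is available, so auto-selection is safe.
    return None
```

In the real backend, the callable would presumably be is_bfloat16_supported() and the result would be mapped to a torch dtype before being passed as the dtype argument.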
Fixes: unslothai#5337
Summary
Two bugs affecting RDNA2 GPUs (gfx1030-gfx1036, e.g. RX 6600/6700/6800/6900) when ROCm 7.x is installed:

Bug 1: Installer puts a dev/nightly PyTorch build in the Studio venv

When ROCm 7.x is detected, the installer selects the rocm7.x PyTorch index, which serves dev builds (the version string contains a git hash, e.g. 2.10.0+rocm7.2.0.gitb6ee5fde). These builds segfault during unsloth's import/patching phase on RDNA2 hardware. The user's system pip had a stable 2.7.1+rocm6.2.4, but the Studio venv got the broken dev build.

Fix: detect the runtime GPU gfx code at install time. If it is RDNA2, cap the torch install to the rocm6.2 index (torch 2.7.x) regardless of the system ROCm version. Applied in both install.sh and install_python_stack.py, with --force-reinstall so existing broken builds get replaced.

Bug 2: Inference subprocess crashes with LLVM ERROR on RDNA2

The training path fix from unslothai#5301 only covered trainer.py. The inference subprocess (InferenceBackend.load_model()) still passed dtype=None to FastLanguageModel/FastVisionModel.from_pretrained(), which auto-selected bfloat16 on RDNA2, triggering:

    LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.fdot2.bf16.bf16

Fix: apply the same is_bfloat16_supported() guard from trainer.py to InferenceBackend.load_model(), forcing float16 on hardware that doesn't support bfloat16.

Files changed

- install.sh: RDNA2 gfx detection + rocm6.2 cap in the shell installer
- studio/install_python_stack.py: same cap in the Python stack installer (unsloth studio update)
- studio/backend/core/inference/inference.py: float16 fallback for inference on RDNA2

Closes unslothai#5337
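
The install-time cap from Bug 1 can be sketched along these lines. This is an illustration under stated assumptions rather than the installer's actual code: is_rdna2 and cap_torch_index are hypothetical names, the index is represented by a plain tag string such as "rocm6.2", and how the gfx code is read from the system is left out.

```python
import re

# RDNA2 parts report gfx1030 through gfx1036, per the PR description.
_RDNA2_GFX = re.compile(r"gfx103[0-6]")


def is_rdna2(gfx: str) -> bool:
    """Return True if the gfx code names an RDNA2 GPU (hypothetical helper)."""
    return _RDNA2_GFX.fullmatch(gfx.strip().lower()) is not None


def cap_torch_index(gfx: str, detected_index: str) -> str:
    """Choose the PyTorch index tag to install from (hypothetical helper).

    detected_index is whatever the installer would pick from the system
    ROCm version, e.g. "rocm7.2".
    """
    # On RDNA2, ignore the system ROCm version and pin to the stable index.
    if is_rdna2(gfx):
        return "rocm6.2"
    # Other architectures keep the index chosen from the system ROCm version.
    return detected_index
```

For example, cap_torch_index("gfx1032", "rocm7.2") yields the capped tag, while a non-RDNA2 code such as "gfx1100" keeps the detected one.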