[Studio] Fix GPU detection for AMD/Intel — add Vulkan VRAM fallback by HellBoxyz · Pull Request #4874 · unslothai/unsloth

HellBoxyz · 2026-04-06T13:57:39Z

Problem

Unsloth Studio doesn't detect the GPU on AMD/Intel systems. The VRAM detection (_get_gpu_free_memory()) only calls nvidia-smi, so on non-NVIDIA hardware it returns an empty list. As a result:

Studio thinks there's no GPU at all
Context length stays at the full native size (e.g. 128K) without auto-reduction
KV cache doesn't fit in VRAM and spills into system RAM
Inference is slow because data has to move between GPU and system RAM

Fix

Add a vulkaninfo fallback that kicks in when nvidia-smi is not available:

Parses Vulkan memory heap budgets (VK_EXT_memory_budget)
Correctly handles multi-GPU systems and GPUs with multiple DEVICE_LOCAL heaps
nvidia-smi still has priority, so there is zero impact on NVIDIA setups
When nvidia-smi succeeds (return code 0), its result is authoritative: an empty list means no visible GPUs, and there is no fallback to vulkaninfo
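
The priority rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `detect_free_vram` and the two probe callables are hypothetical names, and each probe is assumed to return `None` when its tool is missing or exits non-zero, and a (possibly empty) list when it ran successfully.

```python
from typing import Callable, Optional

GpuList = list[tuple[int, int]]  # (gpu index, free MiB)


def detect_free_vram(
    probe_nvidia: Callable[[], Optional[GpuList]],
    probe_vulkan: Callable[[], Optional[GpuList]],
) -> GpuList:
    """Return free VRAM per GPU, preferring nvidia-smi over vulkaninfo.

    Hypothetical helper: a probe returns None if its tool could not be run
    successfully, otherwise the list it parsed (which may be empty).
    """
    nvidia = probe_nvidia()
    if nvidia is not None:
        # nvidia-smi ran successfully: its answer is final, even if empty.
        return nvidia
    vulkan = probe_vulkan()
    return vulkan if vulkan is not None else []


# nvidia-smi succeeded but saw no GPUs: no Vulkan fallback.
print(detect_free_vram(lambda: [], lambda: [(0, 7382)]))
# nvidia-smi unavailable: the Vulkan result is used.
print(detect_free_vram(lambda: None, lambda: [(0, 7382)]))
```

Passing the probes in as callables keeps the decision logic testable without either tool installed; in practice each probe would wrap a subprocess call.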

Before / After

Before (AMD GPU):

GPUs free: [], selected: None, fit: True
→ 128K context, KV cache in RAM, slow

After (AMD GPU):

Vulkan GPU memory detected: GPU0=7382MiB
GPUs free: [(0, 7382)], selected: [0], fit: False
→ Context auto-reduced to fit VRAM, everything on GPU
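
To make the "fit: False, so reduce the context" step concrete, here is a rough back-of-the-envelope sketch. It is not Studio's actual auto-reduction logic, which this PR does not show. The function name, the `reserve_mib` headroom, and the per-token KV cache size are all assumptions chosen for illustration.

```python
def max_context_that_fits(
    free_mib: int,
    model_mib: int,
    kv_bytes_per_token: int,
    native_ctx: int,
    reserve_mib: int = 512,
) -> int:
    """Largest context whose KV cache fits in VRAM next to the model weights.

    Hypothetical helper: real KV cache sizing depends on the model
    architecture and cache quantization.
    """
    budget_bytes = (free_mib - model_mib - reserve_mib) * 1024 * 1024
    if budget_bytes <= 0:
        return 0  # the weights alone do not fit
    return min(native_ctx, budget_bytes // kv_bytes_per_token)


# Illustrative numbers only: 7382 MiB free (from the log above), a model of
# roughly 4800 MiB, and a made-up 64 KiB of KV cache per token.
print(max_context_that_fits(7382, 4800, 64 * 1024, 128 * 1024))
```

Without a detected GPU there is no `free_mib` to plug in, which is why the context previously stayed at the native length.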

Tested on

AMD Radeon RX 5700 XT (8 GB), Windows 11, Vulkan 1.4.341
Model: gemma-4-E4B-it Q4_K_XL (4.8 GB)
Context properly auto-reduced, full GPU offload with -ngl -1
12 unit tests covering parser + orchestrator edge cases

gemini-code-assist

Code Review

This pull request introduces a fallback mechanism for detecting free GPU memory using vulkaninfo, enabling support for AMD, Intel, and other Vulkan-compatible hardware when nvidia-smi is unavailable. The review feedback identifies a logic error in the parsing of vulkaninfo output, where multiple memory heaps are incorrectly treated as distinct GPUs, and provides a more robust implementation that groups heaps by physical device.

gemini-code-assist · 2026-04-06T14:02:29Z

+        # Split output into per-heap blocks at each "	memoryHeaps[N]:"
+        # marker, then check each block for DEVICE_LOCAL flag and budget.
+        heap_sections = re.split(r"(?=\tmemoryHeaps\[\d+\]:)", output)
+        budget_re = re.compile(r"budget\s*=\s*(\d+)")
+
+        gpus: list[tuple[int, int]] = []
+        gpu_idx = 0
+        for section in heap_sections:
+            if not section.strip().startswith("memoryHeaps["):
+                continue
+            if "MEMORY_HEAP_DEVICE_LOCAL_BIT" not in section:
+                continue
+            budget_m = budget_re.search(section)
+            if not budget_m:
+                continue
+            budget_bytes = int(budget_m.group(1))
+            free_mib = budget_bytes // (1024 * 1024)
+            if free_mib > 0:
+                gpus.append((gpu_idx, free_mib))
+                gpu_idx += 1


The current parsing logic for vulkaninfo output is not robust for all systems. It treats every device-local memory heap as a separate GPU, which is incorrect for multi-GPU systems or single GPUs that expose multiple device-local heaps. This can lead to misreporting the number of GPUs and their available memory, causing issues with GPU selection and model offloading.

A more robust approach is to group memory heaps by physical device and report the largest available memory budget for each. This ensures that each physical GPU is represented as a single entry with its correct available VRAM.

# Split output by physical device. vulkaninfo typically separates devices
# with headers like "GPU0", "GPU1", etc. on their own lines.
# The lookahead (?=...) keeps the delimiter.
device_sections = re.split(r"(?=^GPU\d+\n)", output, flags=re.MULTILINE)
if len(device_sections) > 1:
    # Filter out any non-GPU sections (like the header before GPU0)
    device_sections = [s for s in device_sections if s.strip().startswith("GPU")]
# If no GPUn headers, device_sections contains the whole output as one element.

budget_re = re.compile(r"budget\s*=\s*(\d+)")
gpus: list[tuple[int, int]] = []
for gpu_idx, device_section in enumerate(device_sections):
    # For each physical device, find the largest device-local memory heap budget.
    # A single GPU can have multiple device-local heaps.
    max_free_mib = 0
    heap_sections = re.split(r"(?=\tmemoryHeaps\[\d+\]:)", device_section)
    for section in heap_sections:
        if "MEMORY_HEAP_DEVICE_LOCAL_BIT" in section:
            budget_m = budget_re.search(section)
            if budget_m:
                budget_bytes = int(budget_m.group(1))
                free_mib = budget_bytes // (1024 * 1024)
                if free_mib > max_free_mib:
                    max_free_mib = free_mib
    if max_free_mib > 0:
        gpus.append((gpu_idx, max_free_mib))
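
The grouping idea in this suggestion can be tried on a small input with a self-contained sketch like the one below. The sample text is synthetic and heavily trimmed, not captured vulkaninfo output, and `parse_vulkan_budgets` is a hypothetical name. As a further assumption, the device-header pattern here also tolerates a trailing colon (`GPU0:`), in case the tool prints headers that way.

```python
import re

# Lookahead splits keep each header at the start of its own section.
DEVICE_SPLIT = re.compile(r"(?=^GPU\d+:?[ \t]*$)", re.MULTILINE)
HEAP_SPLIT = re.compile(r"(?=^[ \t]*memoryHeaps\[\d+\]:)", re.MULTILINE)
BUDGET = re.compile(r"budget\s*=\s*(\d+)")
LOCAL_FLAG = "MEMORY_HEAP_DEVICE_LOCAL_BIT"


def parse_vulkan_budgets(output: str) -> list[tuple[int, int]]:
    """Return (device index, free MiB), one entry per physical device."""
    devices = [s for s in DEVICE_SPLIT.split(output) if s.startswith("GPU")]
    if not devices:
        devices = [output]  # no per-device headers: treat as a single device
    gpus: list[tuple[int, int]] = []
    for idx, device in enumerate(devices):
        best_mib = 0
        for heap in HEAP_SPLIT.split(device):
            if LOCAL_FLAG not in heap:
                continue  # host-visible heap or the device header chunk
            match = BUDGET.search(heap)
            if match:
                best_mib = max(best_mib, int(match.group(1)) // (1024 * 1024))
        if best_mib > 0:
            gpus.append((idx, best_mib))
    return gpus


# Synthetic sample: GPU0 has one device-local heap, GPU1 has two.
SAMPLE = "\n".join([
    "GPU0:",
    "\tmemoryHeaps[0]:",
    "\t\tbudget = 7740588032",
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT",
    "\tmemoryHeaps[1]:",
    "\t\tbudget = 17179869184",
    "\t\t\tNone",
    "GPU1:",
    "\tmemoryHeaps[0]:",
    "\t\tbudget = 268435456",
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT",
    "\tmemoryHeaps[1]:",
    "\t\tbudget = 4294967296",
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT",
])

print(parse_vulkan_budgets(SAMPLE))  # [(0, 7382), (1, 4096)]
```

A parser that counted every device-local heap as its own GPU would report three devices for this input instead of two.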

_get_gpu_free_memory() relied exclusively on nvidia-smi, returning an empty list on non-NVIDIA systems. This caused the VRAM-aware context auto-reduction logic to be skipped entirely: models launched with full native context (e.g. 128K+), KV caches spilled into system RAM, and inference performance degraded significantly.

Add a vulkaninfo fallback that parses VK_EXT_memory_budget heap data to detect DEVICE_LOCAL VRAM budget on AMD, Intel, and any Vulkan-capable GPU. Handles multi-GPU systems (split by GPU device headers) and GPUs with multiple DEVICE_LOCAL heaps (takes largest budget per device). nvidia-smi retains priority, so there is zero impact on NVIDIA setups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rolandtannous · 2026-04-06T16:13:07Z

This is a duplicate of #4720

* Studio: probe AMD GPUs in llama-server VRAM detection

_get_gpu_free_memory in studio/backend/core/inference/llama_cpp.py only queried nvidia-smi. On AMD ROCm hosts that returns nothing, so the GPU list is empty, the auto-fit logic falls into the no-gpus branch, and llama-server gets --fit on with no -ngl to anchor it. The model loads on CPU even though the GPU is detected elsewhere in Studio. Addresses #5106.

Add a torch-based fallback that runs after nvidia-smi fails or returns empty:

    import torch
    if torch.cuda.is_available() and hasattr(torch.cuda, "mem_get_info"):
        for ordinal in range(torch.cuda.device_count()):
            free, _total = torch.cuda.mem_get_info(ordinal)
            gpus.append((ordinal, free // (1024 * 1024)))

Works on AMD because the ROCm torch wheels Studio installs reuse the entire torch.cuda.* namespace via HIP. Also rescues NVIDIA hosts where nvidia-smi is missing from PATH (a secondary cause of the bug on Windows). Matches the convention studio/backend/utils/hardware/hardware.py:412 already uses for the same fallback purpose.

Verified locally: nvidia-smi path returns the expected GPU and free MiB; torch fallback returns valid VRAM when nvidia-smi is forced to fail.

Note: PR #4874 is a draft taking a different approach (parsing vulkaninfo); the two are complementary.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review feedback on PR #5172

torch.cuda.device_count() enumerates GPUs RELATIVE to the current CUDA_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES on ROCm). Returning those visible ordinals directly lets _select_gpus rewrite CUDA_VISIBLE_DEVICES with the wrong physical IDs: a process started with CUDA_VISIBLE_DEVICES=2,3 would get its child llama-server relaunched with CUDA_VISIBLE_DEVICES=0,1, targeting the wrong GPUs and violating any scheduler pinning.

Translate visible ordinals back through the active CVD/HIP/ROCR mask before returning. Falls through to bare ordinal when no mask is set.

Also drop the redundant int() cast on // -- bytes // 2**20 already returns int.

Verified: with CUDA_VISIBLE_DEVICES=6 and nvidia-smi forced to fail, the torch fallback now returns (6, free_mib) instead of (0, free_mib).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: fix ROCm visibility precedence + narrow ROCm child env

Two reviewer-flagged correctness bugs in the AMD GPU probe path.

1) ROCm visibility precedence was reversed. torch.cuda enumerates GPUs relative to HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES on ROCm builds, but the probe's env-var lookup checked CUDA_VISIBLE_DEVICES first. With CUDA_VISIBLE_DEVICES=0,1 and HIP_VISIBLE_DEVICES=6,7 the probe returned [(0, ...), (1, ...)] when torch's view was actually [(6, ...), (7, ...)]. The wrong physical IDs flowed downstream into CUDA_VISIBLE_DEVICES for the llama-server subprocess, pinning it to GPUs 0,1 instead of 6,7. Fix: branch on torch.version.hip. On ROCm, prefer HIP > ROCR > CUDA (matches torch's own ordering). On NVIDIA, use CUDA only -- ignoring any HIP/ROCR vars the parent happens to have set.

2) Child env narrowing only set CUDA_VISIBLE_DEVICES. On ROCm, llama-server honors HIP/ROCR; if the parent shell exported HIP_VISIBLE_DEVICES=4,5 and the selector picked just GPU 4, the child still saw both because we never narrowed HIP/ROCR. Now we set all three on ROCm so the AMD subprocess actually sees the planned subset.

Both branches verified via temp/pr_simulation/sim_5172_rocm_precedence.py (7/7 cases pass), including the reviewer's verbatim R5 case (CVD=0,1 + HIP/ROCR=6,7).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: sort GPU probe result + honor explicitly empty ROCm masks

Two reviewer-flagged correctness nits on top of eff55fb.

1) Gemini medium: the torch fallback returned an unsorted list when the visibility mask was non-sequential (e.g. CUDA_VISIBLE_DEVICES=5,2,9), diverging from the docstring guarantee and the nvidia-smi path. Now sorted by physical id.

2) Codex P2: an explicitly empty HIP_VISIBLE_DEVICES="" should mean "no GPUs" per the codebase convention in utils/hardware/hardware.py::_get_parent_visible_gpu_spec. The previous `or` chain treated empty string as falsy and silently fell through to ROCR / CUDA, producing wrong physical IDs. Switch to `is not None` checks to match.

Verified via sim_5172_rocm_precedence.py (9/9 cases pass) including the two new R8 (sort) and R9 (empty-HIP honored) cases.

* Studio: align nvidia-smi probe with torch fallback (sort + robust CVD)

Two follow-up Gemini-medium nits on PR #5172.

1) Fragile CVD parsing on the nvidia-smi path: `cvd.split(",")` would raise ValueError on a trailing comma like "0,1," because the empty trailing token is not skipped. The torch fallback already filters empty tokens via `if x.strip()`; mirror that here.

2) Missing sort guarantee on the nvidia-smi path: the docstring promises sort-by-id, the torch fallback now sorts, but the nvidia-smi path relied on driver enumeration order. Add an explicit sort.

Both changes match what shipped in 6b1cccd for the torch fallback, so the two probe paths now have identical CVD parsing + ordering semantics.

* Studio: drop cvd.strip() truthiness so empty CVD filters all GPUs

Reviewer-flagged correctness bug. The previous `if cvd is not None and cvd.strip():` guard treated `CUDA_VISIBLE_DEVICES=""` as if the variable were unset, leaving `allowed=None` (and `physical_ids=None` on the torch path). On the nvidia-smi path that mattered: nvidia-smi ignores CVD entirely, so the probe's `allowed` filter is the only thing that respects the parent's "no GPUs" intent. Pre-fix the probe returned every physical GPU when the parent had explicitly hidden them.

Drop the `.strip()` truthiness check on both paths. The downstream `if x.strip()` token filter still keeps trailing-comma masks like "0,1," safe, and an empty mask now produces an empty allowed/physical set as expected (matching utils/hardware/hardware.py convention).

Verified via sim_5172_rocm_precedence.py R10 + R11 (now 11/11 cases pass): nvidia-smi path with `CUDA_VISIBLE_DEVICES=""` now returns [] instead of leaking the hidden GPUs.

* Studio: log ROCm env-var failures instead of silently swallowing

Reviewer-flagged defensive logging gap. The bare `except Exception: pass` around the HIP/ROCR env-var assignment would mask anything from a missing torch import to an unexpected version object shape. Log at debug so a failed AMD child-env narrowing is at least traceable. Behavior is unchanged: torch missing or version probe failing still leaves the child with only CUDA_VISIBLE_DEVICES set.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
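
The mask handling described across these commits (precedence on ROCm, explicit empty masks, trailing commas, translation to physical ids, sorting) can be sketched in isolation as below. This is an illustration under assumptions, not the code merged in #5172: `parse_mask`, `active_mask`, and `to_physical` are hypothetical names, `is_rocm` stands in for a check on `torch.version.hip`, and only integer masks are handled (UUID-style entries are out of scope).

```python
from typing import Mapping, Optional


def parse_mask(value: Optional[str]) -> Optional[list[int]]:
    """None means 'variable unset'; an empty list means 'no GPUs visible'."""
    if value is None:
        return None
    # Skip empty tokens so a trailing comma ("0,1,") does not raise.
    return [int(tok) for tok in value.split(",") if tok.strip()]


def active_mask(env: Mapping[str, str], is_rocm: bool) -> Optional[list[int]]:
    """Pick the visibility mask that torch would enumerate against."""
    names = (
        ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")
        if is_rocm
        else ("CUDA_VISIBLE_DEVICES",)
    )
    for name in names:
        value = env.get(name)
        if value is not None:  # an explicit "" must win, not fall through
            return parse_mask(value)
    return None


def to_physical(
    visible_free_mib: list[int], mask: Optional[list[int]]
) -> list[tuple[int, int]]:
    """Map visible ordinals (list positions) to physical ids, sorted by id."""
    if mask is None:
        pairs = list(enumerate(visible_free_mib))
    else:
        pairs = list(zip(mask, visible_free_mib))
    return sorted(pairs)


env = {"CUDA_VISIBLE_DEVICES": "0,1", "HIP_VISIBLE_DEVICES": "6,7"}
print(active_mask(env, is_rocm=True))   # [6, 7]
print(active_mask(env, is_rocm=False))  # [0, 1]
print(to_physical([100, 200, 300], parse_mask("5,2,9")))
```

Using `is not None` rather than truthiness is what lets an explicitly empty mask mean "no GPUs" instead of being treated as unset.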

HellBoxyz requested a review from rolandtannous as a code owner April 6, 2026 13:57

gemini-code-assist Bot reviewed Apr 6, 2026

HellBoxyz changed the title ~~[Studio] Add vulkaninfo fallback for GPU memory detection on AMD/Intel GPUs~~ [Studio] Fix GPU detection for AMD/Intel — add Vulkan VRAM fallback Apr 6, 2026

HellBoxyz force-pushed the fix/vulkan-gpu-memory-detection branch from 23d2dc0 to a73223d Compare April 6, 2026 16:10

HellBoxyz requested a review from danielhanchen as a code owner April 6, 2026 16:10

rolandtannous marked this pull request as draft April 6, 2026 18:27

danielhanchen mentioned this pull request Apr 24, 2026

Studio: probe AMD GPUs in llama-server VRAM detection #5172

Merged

3 tasks

oobabooga mentioned this pull request May 27, 2026

Studio: add Vulkan llama.cpp support #5819

Open

HellBoxyz wants to merge 1 commit into unslothai:main from HellBoxyz:fix/vulkan-gpu-memory-detection

2 participants
