[Studio] Fix GPU detection for AMD/Intel — add Vulkan VRAM fallback#4874

Draft
HellBoxyz wants to merge 1 commit into
unslothai:mainfrom
HellBoxyz:fix/vulkan-gpu-memory-detection

@HellBoxyz HellBoxyz commented Apr 6, 2026

Problem

Unsloth Studio doesn't detect the GPU on AMD/Intel systems. The VRAM probe (_get_gpu_free_memory()) only calls nvidia-smi, so on non-NVIDIA hardware it returns an empty list. This means:

  • Studio thinks there's no GPU at all
  • Context length stays at full native (e.g. 128K) without auto-reduction
  • KV cache doesn't fit in VRAM and spills into system RAM
  • Inference is slow because data is constantly transferred between GPU memory and system RAM

Fix

Add a vulkaninfo fallback that kicks in when nvidia-smi is not available:

  • Parses Vulkan memory heap budgets (VK_EXT_memory_budget)
  • Correctly handles multi-GPU systems and GPUs with multiple DEVICE_LOCAL heaps
  • nvidia-smi still has priority — zero impact on NVIDIA setups
  • When nvidia-smi succeeds (returncode 0), its result is authoritative: an empty list means no visible GPUs, and there is no fallback to Vulkan
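
The priority rule in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `get_gpu_free_memory`, `probe_nvidia` and `probe_vulkan` are hypothetical names, and the two probes are passed in as callables so the ordering logic can be read (and tested) without nvidia-smi or vulkaninfo installed. The assumption is that the NVIDIA probe returns None when nvidia-smi is missing or exits non-zero, and a possibly empty list when it exits with returncode 0.

```python
from typing import Callable, Optional

GpuList = list[tuple[int, int]]  # (gpu index, free MiB)


def get_gpu_free_memory(
    probe_nvidia: Callable[[], Optional[GpuList]],
    probe_vulkan: Callable[[], GpuList],
) -> GpuList:
    """Return free VRAM per GPU, giving the nvidia-smi probe priority."""
    nvidia = probe_nvidia()
    if nvidia is not None:
        # nvidia-smi ran and exited 0: its answer is authoritative,
        # even when the list is empty (no visible GPUs).
        return nvidia
    # nvidia-smi is unavailable or failed: fall back to the Vulkan probe.
    return probe_vulkan()
```

With these assumptions, a successful but empty nvidia-smi result stays empty, and only a failed or missing nvidia-smi reaches the Vulkan path.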

Before / After

Before (AMD GPU):

GPUs free: [], selected: None, fit: True
→ 128K context, KV cache in RAM, slow

After (AMD GPU):

Vulkan GPU memory detected: GPU0=7382MiB
GPUs free: [(0, 7382)], selected: [0], fit: False
→ Context auto-reduced to fit VRAM, everything on GPU

Tested on

  • AMD Radeon RX 5700 XT (8 GB), Windows 11, Vulkan 1.4.341
  • Model: gemma-4-E4B-it Q4_K_XL (4.8 GB)
  • Context properly auto-reduced, full GPU offload with -ngl -1
  • 12 unit tests covering parser + orchestrator edge cases

@HellBoxyz HellBoxyz requested a review from rolandtannous as a code owner April 6, 2026 13:57

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a fallback mechanism for detecting free GPU memory using vulkaninfo, enabling support for AMD, Intel, and other Vulkan-compatible hardware when nvidia-smi is unavailable. The review feedback identifies a logic error in the parsing of vulkaninfo output, where multiple memory heaps are incorrectly treated as distinct GPUs, and provides a more robust implementation that groups heaps by physical device.

Comment on lines +405 to +424
# Split output into per-heap blocks at each " memoryHeaps[N]:"
# marker, then check each block for DEVICE_LOCAL flag and budget.
heap_sections = re.split(r"(?=\tmemoryHeaps\[\d+\]:)", output)
budget_re = re.compile(r"budget\s*=\s*(\d+)")

gpus: list[tuple[int, int]] = []
gpu_idx = 0
for section in heap_sections:
    if not section.strip().startswith("memoryHeaps["):
        continue
    if "MEMORY_HEAP_DEVICE_LOCAL_BIT" not in section:
        continue
    budget_m = budget_re.search(section)
    if not budget_m:
        continue
    budget_bytes = int(budget_m.group(1))
    free_mib = budget_bytes // (1024 * 1024)
    if free_mib > 0:
        gpus.append((gpu_idx, free_mib))
        gpu_idx += 1

high

The current parsing logic for vulkaninfo output is not robust for all systems. It treats every device-local memory heap as a separate GPU, which is incorrect for multi-GPU systems or single GPUs that expose multiple device-local heaps. This can lead to misreporting the number of GPUs and their available memory, causing issues with GPU selection and model offloading.

A more robust approach is to group memory heaps by physical device and report the largest available memory budget for each. This ensures that each physical GPU is represented as a single entry with its correct available VRAM.

# Split output by physical device. vulkaninfo typically separates devices
# with headers like "GPU0", "GPU1", etc. on their own lines.
# The lookahead (?=...) keeps the delimiter.
device_sections = re.split(r"(?=^GPU\d+\n)", output, flags=re.MULTILINE)
if len(device_sections) > 1:
    # Filter out any non-GPU sections (like the header before GPU0)
    device_sections = [s for s in device_sections if s.strip().startswith("GPU")]
# If no GPUn headers, device_sections contains the whole output as one element.

budget_re = re.compile(r"budget\s*=\s*(\d+)")
gpus: list[tuple[int, int]] = []

for gpu_idx, device_section in enumerate(device_sections):
    # For each physical device, find the largest device-local memory heap budget.
    # A single GPU can have multiple device-local heaps.
    max_free_mib = 0
    heap_sections = re.split(r"(?=\tmemoryHeaps\[\d+\]:)", device_section)
    for section in heap_sections:
        if "MEMORY_HEAP_DEVICE_LOCAL_BIT" in section:
            budget_m = budget_re.search(section)
            if budget_m:
                budget_bytes = int(budget_m.group(1))
                free_mib = budget_bytes // (1024 * 1024)
                if free_mib > max_free_mib:
                    max_free_mib = free_mib
    
    if max_free_mib > 0:
        gpus.append((gpu_idx, max_free_mib))
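
To make the difference concrete, the sketch below runs both strategies on the same input. `SAMPLE` is a hand-written, simplified stand-in for vulkaninfo output (two devices, the first with two device-local heaps); it was not captured from a real run, and real output may use different headers and layout. The function names `count_per_heap` and `group_per_device` are hypothetical. The first assumes the index advances only when a heap is recorded; the second follows the grouping idea described in this comment.

```python
import re

# Synthetic, simplified text; NOT real vulkaninfo output.
SAMPLE = (
    "GPU0\n"
    "\tmemoryHeaps[0]:\n"
    "\t\tbudget = 7740588032\n"
    "\t\tflags:\n"
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT\n"
    "\tmemoryHeaps[1]:\n"
    "\t\tbudget = 268435456\n"
    "\t\tflags:\n"
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT\n"
    "\tmemoryHeaps[2]:\n"
    "\t\tbudget = 17179869184\n"
    "\t\tflags:\n"
    "\t\t\tNone\n"
    "GPU1\n"
    "\tmemoryHeaps[0]:\n"
    "\t\tbudget = 4294967296\n"
    "\t\tflags:\n"
    "\t\t\tMEMORY_HEAP_DEVICE_LOCAL_BIT\n"
)

HEAP_SPLIT = re.compile(r"(?=\tmemoryHeaps\[\d+\]:)")
BUDGET_RE = re.compile(r"budget\s*=\s*(\d+)")


def _device_local_mib(heap: str) -> int:
    """Budget of one heap block in MiB, or 0 if it is not device-local."""
    if "MEMORY_HEAP_DEVICE_LOCAL_BIT" not in heap:
        return 0
    match = BUDGET_RE.search(heap)
    return int(match.group(1)) // (1024 * 1024) if match else 0


def count_per_heap(output: str) -> list[tuple[int, int]]:
    """Every device-local heap becomes its own 'GPU' entry."""
    gpus = []
    for heap in HEAP_SPLIT.split(output):
        mib = _device_local_mib(heap)
        if mib > 0:
            gpus.append((len(gpus), mib))
    return gpus


def group_per_device(output: str) -> list[tuple[int, int]]:
    """One entry per device, using its largest device-local heap budget."""
    sections = re.split(r"(?=^GPU\d+\n)", output, flags=re.MULTILINE)
    sections = [s for s in sections if s.startswith("GPU")] or [output]
    gpus = []
    for idx, device in enumerate(sections):
        best = max((_device_local_mib(h) for h in HEAP_SPLIT.split(device)), default=0)
        if best > 0:
            gpus.append((idx, best))
    return gpus


print(count_per_heap(SAMPLE))    # three entries for two devices
print(group_per_device(SAMPLE))  # one entry per device
```

On this sample the per-heap version reports a phantom third GPU built from the small 256 MiB heap, while the grouped version reports the two devices with 7382 MiB and 4096 MiB.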

@HellBoxyz HellBoxyz changed the title from "[Studio] Add vulkaninfo fallback for GPU memory detection on AMD/Intel GPUs" to "[Studio] Fix GPU detection for AMD/Intel — add Vulkan VRAM fallback" Apr 6, 2026

_get_gpu_free_memory() relied exclusively on nvidia-smi, returning an
empty list on non-NVIDIA systems. This caused the VRAM-aware context
auto-reduction logic to be skipped entirely: models launched with full
native context (e.g. 128K+), KV caches spilled into system RAM, and
inference performance degraded significantly.

Add a vulkaninfo fallback that parses VK_EXT_memory_budget heap data
to detect DEVICE_LOCAL VRAM budget on AMD, Intel, and any Vulkan-capable
GPU. Handles multi-GPU systems (split by GPU device headers) and GPUs
with multiple DEVICE_LOCAL heaps (takes largest budget per device).

nvidia-smi retains priority — zero impact on NVIDIA setups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@HellBoxyz HellBoxyz force-pushed the fix/vulkan-gpu-memory-detection branch from 23d2dc0 to a73223d Compare April 6, 2026 16:10
@HellBoxyz HellBoxyz requested a review from danielhanchen as a code owner April 6, 2026 16:10
@rolandtannous

This is a duplicate of #4720

@rolandtannous rolandtannous marked this pull request as draft April 6, 2026 18:27
danielhanchen added a commit that referenced this pull request Apr 24, 2026
* Studio: probe AMD GPUs in llama-server VRAM detection

_get_gpu_free_memory in studio/backend/core/inference/llama_cpp.py
only queried nvidia-smi. On AMD ROCm hosts that returns nothing, so
the GPU list is empty, the auto-fit logic falls into the no-gpus
branch, and llama-server gets --fit on with no -ngl to anchor it.
The model loads on CPU even though the GPU is detected elsewhere in
Studio. Addresses #5106.

Add a torch-based fallback that runs after nvidia-smi fails or returns
empty:

    import torch
    if torch.cuda.is_available() and hasattr(torch.cuda, "mem_get_info"):
        for ordinal in range(torch.cuda.device_count()):
            free, _total = torch.cuda.mem_get_info(ordinal)
            gpus.append((ordinal, free // (1024 * 1024)))

Works on AMD because the ROCm torch wheels Studio installs reuse the
entire torch.cuda.* namespace via HIP. Also rescues NVIDIA hosts
where nvidia-smi is missing from PATH (a secondary cause of the bug
on Windows). Matches the convention
studio/backend/utils/hardware/hardware.py:412 already uses for the
same fallback purpose.

Verified locally: nvidia-smi path returns the expected GPU and free
MiB; torch fallback returns valid VRAM when nvidia-smi is forced to
fail. Note: PR #4874 is a draft taking a different approach
(parsing vulkaninfo); the two are complementary.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review feedback on PR #5172

torch.cuda.device_count() enumerates GPUs RELATIVE to the current
CUDA_VISIBLE_DEVICES (or HIP_VISIBLE_DEVICES on ROCm). Returning
those visible ordinals directly lets _select_gpus rewrite
CUDA_VISIBLE_DEVICES with the wrong physical IDs: a process started
with CUDA_VISIBLE_DEVICES=2,3 would get its child llama-server
relaunched with CUDA_VISIBLE_DEVICES=0,1, targeting the wrong GPUs
and violating any scheduler pinning.

Translate visible ordinals back through the active CVD/HIP/ROCR
mask before returning. Falls through to bare ordinal when no mask
is set. Also drop the redundant int() cast on // -- bytes // 2**20
already returns int.

Verified: with CUDA_VISIBLE_DEVICES=6 and nvidia-smi forced to fail,
the torch fallback now returns (6, free_mib) instead of (0, free_mib).
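
A minimal sketch of the ordinal translation described above, assuming a purely numeric, comma-separated mask. `visible_to_physical` is a hypothetical helper name, not the function in the commit, and UUID-style masks are deliberately left untranslated here.

```python
from typing import Optional


def visible_to_physical(ordinals: list[int], mask: Optional[str]) -> list[int]:
    """Map visible device ordinals to physical GPU ids via a visibility mask."""
    if mask is None:
        # No mask set: visible ordinals already equal physical ids.
        return list(ordinals)
    tokens = [t.strip() for t in mask.split(",") if t.strip()]
    if not all(t.isdigit() for t in tokens):
        # Non-numeric (e.g. UUID) masks are out of scope for this sketch.
        return list(ordinals)
    physical = [int(t) for t in tokens]
    # Ordinal i refers to the i-th entry of the mask.
    return [physical[i] for i in ordinals if i < len(physical)]
```

For example, with a mask of "2,3" the visible ordinals 0 and 1 map back to physical GPUs 2 and 3, which is what the child process needs to be pinned to.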

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: fix ROCm visibility precedence + narrow ROCm child env

Two reviewer-flagged correctness bugs in the AMD GPU probe path.

1) ROCm visibility precedence was reversed. torch.cuda enumerates GPUs
   relative to HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES on ROCm builds,
   but the probe's env-var lookup checked CUDA_VISIBLE_DEVICES first. With
   CUDA_VISIBLE_DEVICES=0,1 and HIP_VISIBLE_DEVICES=6,7 the probe returned
   [(0, ...), (1, ...)] when torch's view was actually [(6, ...), (7, ...)].
   The wrong physical IDs flowed downstream into CUDA_VISIBLE_DEVICES for
   the llama-server subprocess, pinning it to GPUs 0,1 instead of 6,7.

   Fix: branch on torch.version.hip. On ROCm, prefer HIP > ROCR > CUDA
   (matches torch's own ordering). On NVIDIA, use CUDA only -- ignoring
   any HIP/ROCR vars the parent happens to have set.

2) Child env narrowing only set CUDA_VISIBLE_DEVICES. On ROCm, llama-server
   honors HIP/ROCR; if the parent shell exported HIP_VISIBLE_DEVICES=4,5
   and the selector picked just GPU 4, the child still saw both because
   we never narrowed HIP/ROCR. Now we set all three on ROCm so the AMD
   subprocess actually sees the planned subset.

Both branches verified via temp/pr_simulation/sim_5172_rocm_precedence.py
(7/7 cases pass), including the reviewer's verbatim R5 case
(CVD=0,1 + HIP/ROCR=6,7).
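
The child-env narrowing in point 2 might look roughly like this. It is a sketch under assumptions: `narrow_child_env` is a hypothetical name, and `is_rocm` stands in for a check such as `torch.version.hip is not None`, passed as a parameter so the example runs without torch.

```python
def narrow_child_env(
    env: dict[str, str], selected: list[int], is_rocm: bool
) -> dict[str, str]:
    """Return a copy of env restricted to the selected physical GPU ids."""
    child = dict(env)
    value = ",".join(str(i) for i in selected)
    child["CUDA_VISIBLE_DEVICES"] = value
    if is_rocm:
        # On ROCm the subprocess honors HIP/ROCR masks, so narrow those too;
        # otherwise a wider mask inherited from the parent would still apply.
        child["HIP_VISIBLE_DEVICES"] = value
        child["ROCR_VISIBLE_DEVICES"] = value
    return child
```

With a parent that exported HIP_VISIBLE_DEVICES=4,5 and a selector that picked only GPU 4, all three variables in the child end up as "4" on ROCm, while on NVIDIA only CUDA_VISIBLE_DEVICES is rewritten.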

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: sort GPU probe result + honor explicitly empty ROCm masks

Two reviewer-flagged correctness nits on top of eff55fb.

1) Gemini medium: the torch fallback returned an unsorted list when the
   visibility mask was non-sequential (e.g. CUDA_VISIBLE_DEVICES=5,2,9),
   diverging from the docstring guarantee and the nvidia-smi path. Now
   sorted by physical id.

2) Codex P2: an explicitly empty HIP_VISIBLE_DEVICES="" should mean
   "no GPUs" per the codebase convention in
   utils/hardware/hardware.py::_get_parent_visible_gpu_spec. The previous
   `or` chain treated empty string as falsy and silently fell through to
   ROCR / CUDA, producing wrong physical IDs. Switch to `is not None`
   checks to match.

Verified via sim_5172_rocm_precedence.py (9/9 cases pass) including the
two new R8 (sort) and R9 (empty-HIP honored) cases.
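
The mask lookup with the precedence and empty-string behavior described in this commit could be sketched as below. `active_mask` is a hypothetical name; the point is the `is not None` test, which lets an explicitly empty variable win instead of falling through the way an `or` chain would.

```python
from typing import Mapping, Optional


def active_mask(env: Mapping[str, str], is_rocm: bool) -> Optional[str]:
    """Pick the visibility mask that applies, or None if none is set."""
    if is_rocm:
        names = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")
    else:
        # On NVIDIA builds, ignore any HIP/ROCR variables the parent exported.
        names = ("CUDA_VISIBLE_DEVICES",)
    for name in names:
        value = env.get(name)
        if value is not None:
            # "" is a valid answer meaning "no GPUs"; do not skip it.
            return value
    return None
```

An empty HIP_VISIBLE_DEVICES therefore returns "" rather than the ROCR or CUDA value, and the caller can treat that as an empty device set.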

* Studio: align nvidia-smi probe with torch fallback (sort + robust CVD)

Two follow-up Gemini-medium nits on PR #5172.

1) Fragile CVD parsing on the nvidia-smi path: `cvd.split(",")` would
   raise ValueError on a trailing comma like "0,1," because the empty
   trailing token is not skipped. The torch fallback already filters
   empty tokens via `if x.strip()`; mirror that here.

2) Missing sort guarantee on the nvidia-smi path: the docstring promises
   sort-by-id, the torch fallback now sorts, but the nvidia-smi path
   relied on driver enumeration order. Add an explicit sort.

Both changes match what shipped in 6b1cccd for the torch fallback, so
the two probe paths now have identical CVD parsing + ordering semantics.

* Studio: drop cvd.strip() truthiness so empty CVD filters all GPUs

Reviewer-flagged correctness bug. The previous `if cvd is not None and
cvd.strip():` guard treated `CUDA_VISIBLE_DEVICES=""` as if the variable
were unset, leaving `allowed=None` (and `physical_ids=None` on the torch
path). On the nvidia-smi path that mattered: nvidia-smi ignores CVD
entirely, so the probe's `allowed` filter is the only thing that
respects the parent's "no GPUs" intent. Pre-fix the probe returned every
physical GPU when the parent had explicitly hidden them.

Drop the `.strip()` truthiness check on both paths. The downstream
`if x.strip()` token filter still keeps trailing-comma masks like
"0,1," safe, and an empty mask now produces an empty allowed/physical
set as expected (matching utils/hardware/hardware.py convention).

Verified via sim_5172_rocm_precedence.py R10 + R11 (now 11/11 cases
pass): nvidia-smi path with `CUDA_VISIBLE_DEVICES=""` now returns []
instead of leaking the hidden GPUs.
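
Taken together, the nvidia-smi path's mask handling from this commit and the previous one (token filtering, explicit sort, empty mask hides everything) can be sketched as follows. `filter_and_sort` is a hypothetical helper, and the sketch assumes a numeric mask; a UUID-style mask would make `int()` raise here.

```python
from typing import Optional


def filter_and_sort(
    gpus: list[tuple[int, int]], cvd: Optional[str]
) -> list[tuple[int, int]]:
    """Apply a CUDA_VISIBLE_DEVICES-style mask to (id, free MiB) pairs."""
    if cvd is None:
        allowed = None  # variable unset: no filtering
    else:
        # Skip empty tokens so "0,1," parses; "" yields an empty set.
        allowed = {int(x) for x in cvd.split(",") if x.strip()}
    kept = [g for g in gpus if allowed is None or g[0] in allowed]
    # Sort by physical id rather than relying on enumeration order.
    return sorted(kept, key=lambda g: g[0])
```

The distinction between `None` and an empty set is what separates "variable unset" from "parent explicitly hid every GPU".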

* Studio: log ROCm env-var failures instead of silently swallowing

Reviewer-flagged defensive logging gap. The bare `except Exception: pass`
around the HIP/ROCR env-var assignment would mask anything from a
missing torch import to an unexpected version object shape. Log at
debug so a failed AMD child-env narrowing is at least traceable.
Behavior is unchanged: torch missing or version probe failing still
leaves the child with only CUDA_VISIBLE_DEVICES set.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>