Studio: add Vulkan llama.cpp support #5819

Open
oobabooga wants to merge 16 commits into unslothai:main from oobabooga:vulkan-support


@oobabooga oobabooga commented May 27, 2026


Summary

This adds Vulkan support to Studio so Intel GPUs get GPU-accelerated llama.cpp inference instead of falling back to CPU. Intel GPUs use Vulkan, not CUDA or ROCm, so today Studio treats them as having no GPU at all. The change reuses Studio's existing VRAM and GPU-selection code, so once we know how much free VRAM there is, context auto-sizing, multi-GPU selection, and layer offload all work the same way they do for NVIDIA and AMD.

How it works

Reading free VRAM through llama.cpp's own Vulkan library

Instead of parsing the text output of a command-line tool, Studio loads the Vulkan library that already ships with llama.cpp (libggml-vulkan.so on Linux, ggml-vulkan.dll on Windows) and calls ggml_backend_vk_get_device_count and ggml_backend_vk_get_device_memory directly to get free and total VRAM per GPU. A side effect is that the GPU indices come back in llama.cpp's own order, which is what the GPU-pinning step below needs. The library is loaded in a short-lived helper process so no Vulkan context lives inside the main Studio process.
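A minimal sketch of what reading through the library can look like with ctypes. The two ggml function names are the ones named above; their signatures are assumed to match the declarations in ggml-vulkan.h (`int ggml_backend_vk_get_device_count(void)` and `void ggml_backend_vk_get_device_memory(int device, size_t *free, size_t *total)`). `read_vulkan_memory` is a hypothetical name used for illustration, not the function in this PR.

```python
import ctypes


def read_vulkan_memory(lib):
    """Return [(index, free_bytes, total_bytes)] in the library's own device order.

    `lib` is an already-loaded ggml Vulkan library, e.g. the result of
    ctypes.CDLL(path_to_libggml_vulkan). Signatures below are assumptions
    based on ggml-vulkan.h.
    """
    lib.ggml_backend_vk_get_device_count.restype = ctypes.c_int
    lib.ggml_backend_vk_get_device_count.argtypes = []
    lib.ggml_backend_vk_get_device_memory.restype = None
    lib.ggml_backend_vk_get_device_memory.argtypes = [
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
    ]

    devices = []
    for index in range(lib.ggml_backend_vk_get_device_count()):
        free = ctypes.c_size_t(0)
        total = ctypes.c_size_t(0)
        # The library writes both values through the pointers.
        lib.ggml_backend_vk_get_device_memory(
            index, ctypes.pointer(free), ctypes.pointer(total)
        )
        devices.append((index, free.value, total.value))
    return devices
```

Because the index in each tuple is the loop position in the library's own enumeration, it can be passed on to device pinning without any remapping. On a real install the library may also need its sibling ggml libraries to be loadable from the same directory.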

_get_gpu_free_memory() now dispatches by backend: on a Vulkan build it uses the Vulkan reader; otherwise it runs the existing NVIDIA (nvidia-smi/torch) and AMD ROCm path. The NVIDIA/ROCm implementation is unchanged, just moved into a renamed helper.
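The dispatch and the short-lived helper process can be pictured together roughly as follows. This is a sketch under stated assumptions: `PROBE_CODE` is a hypothetical stand-in for the real probe script and just prints a fixed JSON list, and `probe_in_subprocess` and `get_gpu_free_memory` are illustrative names with simplified signatures. Raising on a non-zero exit code is one possible choice; logging stderr and falling back would be another.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the real probe: prints one JSON list and exits.
PROBE_CODE = 'import json; print(json.dumps([{"index": 0, "free": 6, "total": 8}]))'


def probe_in_subprocess(code=PROBE_CODE, timeout=30):
    """Run the probe in a child interpreter so nothing it loads stays in this process."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Surface the failure instead of silently reporting "no GPUs".
        raise RuntimeError(
            f"probe exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def get_gpu_free_memory(backend, read_cuda_rocm):
    """Dispatch by backend: Vulkan builds use the probe, others the existing reader."""
    if backend == "vulkan":
        return probe_in_subprocess()
    return read_cuda_rocm()
```

Keeping the probe in a child process means a driver crash or a hung Vulkan loader costs one failed call rather than taking down the main process.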

Leaving host headroom on integrated GPUs

On an integrated GPU, ggml reports shared system RAM as "VRAM" (it sums every memory heap), so auto-sizing context and offload against that number could claim nearly all of RAM and push the host into swap or the OOM killer. For integrated GPUs the reader leaves a per-device host margin of a flat 1 GiB, matching llama.cpp's own --fit-target default, instead of inventing a larger reserve. Whether a device is integrated comes straight from ggml's device type (GGML_BACKEND_DEVICE_TYPE_IGPU), read in the same helper process, so discrete cards are never touched (no VRAM-vs-RAM ratio guessing).
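The margin rule is simple enough to show directly. In this sketch the device records are plain dicts with hypothetical keys (`free`, `integrated`), and clamping at zero is an assumption added for safety rather than something stated above.

```python
GIB = 1024 ** 3
IGPU_HOST_MARGIN = 1 * GIB  # flat per-device margin for integrated GPUs


def apply_host_margin(devices):
    """Subtract the host margin from integrated devices; leave discrete ones alone.

    `devices` is a list of dicts with 'free' (bytes) and 'integrated' (bool).
    """
    adjusted = []
    for dev in devices:
        free = dev["free"]
        if dev["integrated"]:
            # Assumption: never report negative free memory.
            free = max(0, free - IGPU_HOST_MARGIN)
        adjusted.append({**dev, "free": free})
    return adjusted
```

For example, an integrated device reporting 16 GiB free is presented to the auto-sizer as 15 GiB, while a discrete card reporting 8 GiB stays at 8 GiB.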

Pinning the GPU with GGML_VK_VISIBLE_DEVICES

When a model fits, Studio already picks the GPU(s) and puts all layers on them (-ngl -1). For CUDA and ROCm it does this with CUDA_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES. The Vulkan backend ignores those, so on a Vulkan build we use its own setting, GGML_VK_VISIBLE_DEVICES. The indices come from the same library that read the VRAM, so the pin always targets the right device. Multi-GPU works the same way it does for CUDA and ROCm.
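A sketch of building the child-process environment for a Vulkan build, assuming GGML_VK_VISIBLE_DEVICES takes a comma-separated list of device indices. `vulkan_pin_env` is a hypothetical helper name.

```python
import os


def vulkan_pin_env(indices, base_env=None):
    """Return a copy of the environment that pins llama.cpp to the given Vulkan devices."""
    env = dict(os.environ if base_env is None else base_env)
    # Indices must be in the Vulkan backend's own order, i.e. the order
    # returned by the same library that reported the VRAM numbers.
    env["GGML_VK_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env
```

The returned dict would then be passed as `env=` when launching the llama.cpp server process, leaving the parent's environment untouched.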

Picking the Vulkan build at install time

The Vulkan builds (llama-<tag>-bin-ubuntu-vulkan-x64.tar.gz and llama-<tag>-bin-win-vulkan-x64.zip) come from upstream ggml-org/llama.cpp at the same tag, the same way the Windows-CUDA, CPU, and macOS builds already do, so there is nothing new to build or publish.
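As an illustration of how the two upstream asset names follow from the release tag, here is a small sketch. The function name and the `"linux"` / `"windows"` keys are hypothetical, and the tag in the usage note is a placeholder.

```python
VULKAN_ASSET_PATTERNS = {
    "linux": "llama-{tag}-bin-ubuntu-vulkan-x64.tar.gz",
    "windows": "llama-{tag}-bin-win-vulkan-x64.zip",
}


def vulkan_asset_name(tag, platform):
    """Build the upstream Vulkan asset file name for a release tag and platform."""
    try:
        pattern = VULKAN_ASSET_PATTERNS[platform]
    except KeyError:
        raise ValueError(f"no Vulkan prebuilt for platform {platform!r}") from None
    return pattern.format(tag=tag)
```

With a placeholder tag `bXXXX`, `vulkan_asset_name("bXXXX", "linux")` gives `llama-bXXXX-bin-ubuntu-vulkan-x64.tar.gz`.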

The installer picks the Vulkan build only when an Intel GPU is detected and there is no NVIDIA or ROCm GPU. Intel detection works as follows:

  • Linux: reads the GPU vendor id from /sys/class/drm/card*/device/vendor and checks for 0x8086 (Intel).
  • Windows: runs Get-CimInstance Win32_VideoController and looks for "Intel".
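The Linux check can be sketched as below. `has_intel_gpu` is a hypothetical name, and the `drm_root` parameter exists only so the function can be pointed at a test directory instead of the real sysfs tree.

```python
import glob
import os

INTEL_VENDOR_ID = 0x8086


def has_intel_gpu(drm_root="/sys/class/drm"):
    """Return True if any DRM card under drm_root reports the Intel PCI vendor id."""
    pattern = os.path.join(drm_root, "card*", "device", "vendor")
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                # The file holds a hex string such as "0x8086".
                if int(fh.read().strip(), 16) == INTEL_VENDOR_ID:
                    return True
        except (OSError, ValueError):
            # Unreadable or malformed entry: skip it and keep looking.
            continue
    return False
```

Comparing the parsed integer rather than the raw string avoids surprises from letter case or trailing whitespace in the vendor file.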

The new linux-vulkan and windows-vulkan install kinds are wired through the same places the others already go through (runtime file patterns, GPU validation, install health check).

Letting AMD users opt into Vulkan

AMD GPUs run Vulkan too, but by default Studio installs the ROCm build for them. Setting UNSLOTH_FORCE_VULKAN=1 before setup installs the upstream Vulkan build instead. This helps where ROCm is awkward (e.g. some integrated parts), and the integrated-GPU host margin above keeps it safe on shared-memory hardware like Strix Halo. The flag is scoped to the llama.cpp inference backend; the torch/training stack is a separate installer and still uses ROCm. macOS ignores the flag, since it uses Metal and has no Vulkan prebuilt.
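The selection rules, including the opt-in flag, can be summarized in a small decision function. This is a sketch, not the planner in this PR: only the `linux-vulkan` and `windows-vulkan` kind names come from the description, the other returned names are hypothetical labels, and leaving NVIDIA hosts unaffected by the flag is an assumption.

```python
import os


def force_vulkan_requested(environ=None):
    """True when UNSLOTH_FORCE_VULKAN=1 is set in the given environment."""
    environ = os.environ if environ is None else environ
    return environ.get("UNSLOTH_FORCE_VULKAN") == "1"


def pick_install_kind(os_name, has_nvidia, has_rocm, has_intel, force_vulkan=False):
    """Pick an install kind for os_name in {'linux', 'windows', 'macos'}."""
    if os_name == "macos":
        return "macos"  # Metal build; the flag is ignored here
    if has_nvidia:
        return f"{os_name}-cuda"  # assumption: flag does not change NVIDIA hosts
    if has_rocm:
        # AMD hosts default to ROCm and switch to Vulkan only on opt-in.
        return f"{os_name}-vulkan" if force_vulkan else f"{os_name}-rocm"
    if has_intel:
        return f"{os_name}-vulkan"
    return f"{os_name}-cpu"
```

A caller would combine the two, for example `pick_install_kind("linux", False, True, False, force_vulkan=force_vulkan_requested())`.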

Why read from llama.cpp instead of parsing text

There is an earlier draft (#4874) that solves the same VRAM-reading problem by running vulkaninfo and parsing its text output for memory budgets. Reading from llama.cpp's library instead is better because:

  • No text parsing. vulkaninfo output is meant for humans and changes between versions, so parsing it is fragile. We call functions and read numbers.
  • The GPU indices match. Text-parsed ordering may not match llama.cpp's own GPU order, so a parsed index can pin the wrong GPU. Reading from the library gives indices in the same order GGML_VK_VISIBLE_DEVICES expects.
  • Accurate free VRAM straight from the driver, instead of a number scraped from text.
  • Covers the whole flow. #4874 only fills in the memory number; it does not handle GPU pinning or installer selection. This change does all of it: install selection, VRAM reading, context fit, and GPU pinning.

Related issues & PRs

Testing notes

  • The VRAM reader was tested against a real upstream Vulkan build on a multi-GPU host; free/total match nvidia-smi within normal allocation noise, and the GPU order matches llama.cpp's Vulkan order (which is different from nvidia-smi's order).
  • Install selection was tested for Intel hosts on Linux and Windows through the production (--simple-policy) planner, confirming Vulkan is picked first with CPU as the fallback, and that NVIDIA, ROCm, and no-GPU hosts are unaffected.
  • The integrated-GPU margin was exercised end-to-end against a real upstream Vulkan build: an integrated device has a flat 1 GiB carved off its reported free VRAM, while discrete cards are returned untouched.
  • UNSLOTH_FORCE_VULKAN was tested through the same --simple-policy planner: a forced AMD host on Linux and Windows resolves to the upstream Vulkan asset (with CPU fallback), while the same host without the flag still resolves to its ROCm build.

@oobabooga oobabooga requested a review from rolandtannous as a code owner May 27, 2026 16:33

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It adds a short-lived subprocess probe script to query Vulkan VRAM without initializing Vulkan in the main process, handles device pinning via GGML_VK_VISIBLE_DEVICES, and updates the prebuilt installer to detect Intel GPUs and fetch the appropriate Vulkan assets. The review feedback highlights two key improvements: resolving a potential AttributeError on Windows where ctypes.RTLD_GLOBAL is undefined, and checking the subprocess return code and logging stderr to prevent silent failures during the Vulkan GPU probe.

Comment thread studio/backend/core/inference/llama_cpp.py Outdated
Comment thread studio/backend/core/inference/llama_cpp.py
oobabooga commented May 29, 2026


Tested install + inference on my dual NVIDIA GPU setup by overriding the NVIDIA detection and forcing Vulkan. Ended up discovering a bug that I fixed in 4fefeeb.

After that, the test worked perfectly end to end, with a small model pinning successfully to only one of the GPUs and generating text.

@oobabooga oobabooga requested a review from danielhanchen as a code owner May 31, 2026 21:40
@oobabooga


/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It introduces a standalone _vulkan_probe.py script to query free VRAM, integrates Vulkan memory detection and device pinning in llama_cpp.py, adds regression tests, and updates the prebuilt installer to detect Intel GPUs and support Vulkan prebuilt installations. A critical issue was identified in _vulkan_probe.py where accessing ctypes.RTLD_GLOBAL on Windows will raise an AttributeError and crash the probe subprocess; a suggestion was provided to conditionally apply RTLD_GLOBAL only on non-Windows platforms.

Comment thread studio/backend/core/inference/_vulkan_probe.py
oobabooga and others added 2 commits June 9, 2026 00:15
# Conflicts:
#	studio/backend/core/inference/llama_cpp.py
#	studio/install_llama_prebuilt.py

Development

Successfully merging this pull request may close these issues.

[Feature] Add llama.cpp vulkan backend support for non-nvidia cards

2 participants