Studio: add Vulkan llama.cpp support #5819
for more information, see https://pre-commit.ci
Code Review
This pull request introduces support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It adds a short-lived subprocess probe script to query Vulkan VRAM without initializing Vulkan in the main process, handles device pinning via GGML_VK_VISIBLE_DEVICES, and updates the prebuilt installer to detect Intel GPUs and fetch the appropriate Vulkan assets. The review feedback highlights two key improvements: resolving a potential AttributeError on Windows where ctypes.RTLD_GLOBAL is undefined, and checking the subprocess return code and logging stderr to prevent silent failures during the Vulkan GPU probe.
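The second point, checking the probe's return code and logging stderr, might look roughly like the sketch below. It is an illustration, not the PR's code: `run_probe`, the logger name, and the assumption that the probe prints JSON on stdout are all hypothetical.

```python
import json
import logging
import subprocess
import sys

logger = logging.getLogger("vulkan_probe_sketch")  # hypothetical logger name


def run_probe(cmd, timeout=10.0):
    """Run a probe command and return its parsed JSON output, or None on failure.

    Assumes (for illustration only) that the probe prints JSON on stdout.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        logger.warning("Vulkan probe could not run: %s", exc)
        return None
    if proc.returncode != 0:
        # Surface the failure instead of silently treating it as "no GPU".
        logger.warning(
            "Vulkan probe exited with code %d: %s",
            proc.returncode,
            proc.stderr.strip(),
        )
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        logger.warning("Vulkan probe produced unparseable output")
        return None


if __name__ == "__main__":
    # Stand-in probe: a child Python process that prints an empty device list.
    print(run_probe([sys.executable, "-c", "print('{\"devices\": []}')"]))
```

Returning `None` on every failure path lets the caller fall back to CPU while the log still records why the probe failed.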
Tested install + inference on my dual NVIDIA GPU setup by overriding the NVIDIA detection and forcing Vulkan. Ended up discovering a bug that I fixed at 4fefeeb. After that, the test worked perfectly end to end, with a small model pinning successfully to only one of the GPUs and generating text.
/gemini review
Code Review
This pull request adds support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It introduces a standalone _vulkan_probe.py script to query free VRAM, integrates Vulkan memory detection and device pinning in llama_cpp.py, adds regression tests, and updates the prebuilt installer to detect Intel GPUs and support Vulkan prebuilt installations. A critical issue was identified in _vulkan_probe.py where accessing ctypes.RTLD_GLOBAL on Windows will raise an AttributeError and crash the probe subprocess; a suggestion was provided to conditionally apply RTLD_GLOBAL only on non-Windows platforms.
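The suggested guard, applying `RTLD_GLOBAL` only off Windows, could be sketched as follows. The helper name `dlopen_kwargs` is hypothetical, and the `getattr` fallback is an extra precaution added here rather than something the review spells out.

```python
import ctypes
import sys


def dlopen_kwargs(platform=sys.platform):
    """Return keyword arguments for ctypes.CDLL, adding RTLD_GLOBAL off Windows only."""
    if platform.startswith("win"):
        # ctypes ignores the mode argument on Windows, so pass nothing extra.
        return {}
    # getattr keeps this safe even if the attribute were ever missing.
    return {"mode": getattr(ctypes, "RTLD_GLOBAL", 0)}
```

A caller would then load the library with something like `ctypes.CDLL(path, **dlopen_kwargs())`, so the Windows path never touches the flag at all.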
# Conflicts: # studio/backend/core/inference/llama_cpp.py # studio/install_llama_prebuilt.py
Summary
This adds Vulkan support to Studio so Intel GPUs get GPU-accelerated llama.cpp inference instead of falling back to CPU. Intel GPUs use Vulkan, not CUDA or ROCm, so today Studio treats them as having no GPU at all. The change reuses Studio's existing VRAM and GPU-selection code, so once we know how much free VRAM there is, context auto-sizing, multi-GPU selection, and layer offload all work the same way they do for NVIDIA and AMD.
How it works
Reading free VRAM through llama.cpp's own Vulkan library
Instead of parsing the text output of a command-line tool, this change loads the Vulkan library that already ships with llama.cpp (`libggml-vulkan.so` on Linux, `ggml-vulkan.dll` on Windows) and calls `ggml_backend_vk_get_device_count` and `ggml_backend_vk_get_device_memory` directly to get free and total VRAM per GPU. A side effect is that the GPU indices come back in llama.cpp's own order, which is what the GPU-pinning step below needs. The library is loaded in a short-lived helper process so no Vulkan context lives inside the main Studio process.
`_get_gpu_free_memory()` now dispatches by backend: on a Vulkan build it uses the Vulkan reader; otherwise it runs the existing NVIDIA (nvidia-smi/torch) and AMD ROCm path. The NVIDIA/ROCm implementation is unchanged, just moved into a renamed helper.
Leaving host headroom on integrated GPUs
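As a rough illustration of the margin described in this section, the sketch below subtracts a flat 1 GiB from the reported free memory of an integrated device and leaves discrete devices untouched. The function name, the clamp at zero, and applying the margin to the free figure (rather than the total) are assumptions made for illustration.

```python
GIB = 1024 ** 3
HOST_MARGIN_BYTES = 1 * GIB  # flat per-device margin for integrated GPUs


def usable_free_bytes(free_bytes: int, is_integrated: bool) -> int:
    """Return the memory budget to plan against for one device (hypothetical helper)."""
    if not is_integrated:
        # Discrete cards report dedicated VRAM, so the figure is used as-is.
        return free_bytes
    # Integrated GPUs share system RAM: keep 1 GiB back for the host, never go negative.
    return max(0, free_bytes - HOST_MARGIN_BYTES)
```

For example, an integrated device reporting 16 GiB free would be planned against 15 GiB, while a discrete card reporting the same figure keeps all 16 GiB.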
On an integrated GPU, ggml reports shared system RAM as "VRAM" (it sums every memory heap), so auto-sizing context and offload against that number could claim nearly all of RAM and push the host into swap or the OOM killer. For integrated GPUs the reader leaves a per-device host margin of a flat 1 GiB, matching llama.cpp's own `--fit-target` default, instead of inventing a larger reserve. Whether a device is integrated comes straight from ggml's device type (`GGML_BACKEND_DEVICE_TYPE_IGPU`), read in the same helper process, so discrete cards are never touched (no VRAM-vs-RAM ratio guessing).
Pinning the GPU with `GGML_VK_VISIBLE_DEVICES`
When a model fits, Studio already picks the GPU(s) and puts all layers on them (`-ngl -1`). For CUDA and ROCm it does this with `CUDA_VISIBLE_DEVICES` / `HIP_VISIBLE_DEVICES` / `ROCR_VISIBLE_DEVICES`. The Vulkan backend ignores those, so on a Vulkan build we use its own setting, `GGML_VK_VISIBLE_DEVICES`. The indices come from the same library that read the VRAM, so the pin always targets the right device. Multi-GPU works the same way it does for CUDA and ROCm.
Picking the Vulkan build at install time
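For orientation, the Linux side of the Intel check described in this section amounts to scanning sysfs vendor files for the PCI vendor ID `0x8086`. A minimal sketch, assuming a hypothetical helper name and a configurable root directory (added so it can be exercised without real hardware); it is not the installer's actual code:

```python
import glob
import os

INTEL_VENDOR_ID = "0x8086"


def has_intel_gpu(drm_root: str = "/sys/class/drm") -> bool:
    """Return True if any DRM card under drm_root reports Intel's PCI vendor ID."""
    pattern = os.path.join(drm_root, "card*", "device", "vendor")
    for path in glob.glob(pattern):
        try:
            with open(path, "r", encoding="ascii", errors="ignore") as fh:
                vendor = fh.read().strip().lower()
        except OSError:
            # Unreadable entries are skipped rather than treated as fatal.
            continue
        if vendor == INTEL_VENDOR_ID:
            return True
    return False
```

On a host without `/sys/class/drm` the glob matches nothing and the function simply returns `False`.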
The Vulkan builds come from upstream ggml-org/llama.cpp at the same tag, the same way the Windows-CUDA, CPU, and macOS builds already do, so there is nothing new to build or publish (`llama-<tag>-bin-ubuntu-vulkan-x64.tar.gz` and `llama-<tag>-bin-win-vulkan-x64.zip`).
The installer picks the Vulkan build only when there is no NVIDIA GPU, no ROCm GPU, and an Intel GPU is detected. Intel detection is:
- Linux: reads `/sys/class/drm/card*/device/vendor` and checks for `0x8086` (Intel).
- Windows: runs `Get-CimInstance Win32_VideoController` and looks for "Intel".
The new `linux-vulkan` and `windows-vulkan` install kinds are wired through the same places the others already go through (runtime file patterns, GPU validation, install health check).
Letting AMD users opt into Vulkan
AMD GPUs run Vulkan too, but by default Studio installs the ROCm build for them. Setting `UNSLOTH_FORCE_VULKAN=1` before setup installs the upstream Vulkan build instead. This helps where ROCm is awkward (e.g. some integrated parts), and the integrated-GPU host margin above keeps it safe on shared-memory hardware like Strix Halo. It's scoped to the llama.cpp inference backend. The torch/training stack is a separate installer and still uses ROCm. macOS ignores the flag (Metal, no Vulkan prebuilt).
Why read from llama.cpp instead of parsing text
There is an earlier draft (#4874) that solves the same VRAM-reading problem by running `vulkaninfo` and parsing its text output for memory budgets. Reading from llama.cpp's library instead is better because:
- `vulkaninfo` output is meant for humans and changes between versions, so parsing it is fragile. We call functions and read numbers.
- [...] `GGML_VK_VISIBLE_DEVICES` expects.
Related issues & PRs
- [...] `UNSLOTH_FORCE_VULKAN`; fine-tuning is not part of this change.)
- [...] `vulkaninfo`, and it covers the whole flow rather than just detection.
Testing notes
- [...] `nvidia-smi` within normal allocation noise, and the GPU order matches llama.cpp's Vulkan order (which is different from nvidia-smi's order).
- [...] `--simple-policy`) planner, confirming Vulkan is picked first with CPU as the fallback, and that NVIDIA, ROCm, and no-GPU hosts are unaffected.
- `UNSLOTH_FORCE_VULKAN` was tested through the same `--simple-policy` planner: a forced AMD host on Linux and Windows resolves to the upstream Vulkan asset (with CPU fallback), while the same host without the flag still resolves to its ROCm build.
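The selection rule these planner runs exercise can be summarized in a few lines. This is a sketch under stated assumptions, not the installer's code: the function name and the `cuda`, `rocm`, and `cpu` labels are hypothetical, only `linux-vulkan` and `windows-vulkan` are names from this PR, the CPU fallback for CUDA and ROCm is assumed, and treating `UNSLOTH_FORCE_VULKAN` as affecting AMD hosts only is an assumption drawn from the description above. macOS is left out because it ignores the flag.

```python
VULKAN_KINDS = {"linux": "linux-vulkan", "windows": "windows-vulkan"}


def plan_install(os_name, has_nvidia, has_rocm, has_intel, force_vulkan=False):
    """Return install kinds to try, in order (hypothetical planner sketch)."""
    if os_name not in VULKAN_KINDS:
        raise ValueError(f"sketch only covers linux and windows, got {os_name!r}")
    vulkan_kind = VULKAN_KINDS[os_name]
    if has_nvidia:
        # NVIDIA hosts are unaffected by the Vulkan change.
        return ["cuda", "cpu"]
    if has_rocm:
        # AMD hosts keep ROCm unless the user opts into Vulkan.
        return [vulkan_kind, "cpu"] if force_vulkan else ["rocm", "cpu"]
    if has_intel:
        # Intel with no NVIDIA or ROCm GPU: Vulkan first, CPU as the fallback.
        return [vulkan_kind, "cpu"]
    return ["cpu"]
```

Keeping the rule as a pure function of detected hardware and the flag is what makes it easy to check each host type without the hardware present.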