Studio: add Vulkan llama.cpp support by oobabooga · Pull Request #5819 · unslothai/unsloth

oobabooga · 2026-05-27T16:33:07Z

Summary

This adds Vulkan support to Studio so Intel GPUs get GPU-accelerated llama.cpp inference instead of falling back to CPU. Intel GPUs use Vulkan, not CUDA or ROCm, so today Studio treats them as having no GPU at all. The change reuses Studio's existing VRAM and GPU-selection code, so once we know how much free VRAM there is, context auto-sizing, multi-GPU selection, and layer offload all work the same way they do for NVIDIA and AMD.

How it works

Reading free VRAM through llama.cpp's own Vulkan library

Instead of parsing the text output of a command-line tool, Studio loads the Vulkan library that already ships with llama.cpp (libggml-vulkan.so on Linux, ggml-vulkan.dll on Windows) and calls ggml_backend_vk_get_device_count and ggml_backend_vk_get_device_memory directly to get free and total VRAM per GPU. A side effect is that the GPU indices come back in llama.cpp's own order, which is what the GPU-pinning step below needs. The library is loaded in a short-lived helper process so no Vulkan context lives inside the main Studio process.
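
A minimal sketch of the direct-call approach, assuming Python's ctypes. The two ggml function names come from the description above; their C signatures here are my reading of ggml's Vulkan header and should be treated as an assumption, as should the helper names `load_vulkan_lib` and `read_vk_memory`. This is not the code in the PR, which runs the probe in a separate helper process.

```python
import ctypes


def load_vulkan_lib(path):
    """Load ggml's Vulkan backend library and declare the two entry points.

    `path` is hypothetical, e.g. the libggml-vulkan.so next to llama-server.
    Assumed C signatures:
        int  ggml_backend_vk_get_device_count(void);
        void ggml_backend_vk_get_device_memory(int, size_t *free, size_t *total);
    """
    lib = ctypes.CDLL(path)
    lib.ggml_backend_vk_get_device_count.restype = ctypes.c_int
    lib.ggml_backend_vk_get_device_count.argtypes = []
    lib.ggml_backend_vk_get_device_memory.restype = None
    lib.ggml_backend_vk_get_device_memory.argtypes = [
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
    ]
    return lib


def read_vk_memory(lib):
    """Return one dict per GPU, in the order the library enumerates devices."""
    devices = []
    for index in range(lib.ggml_backend_vk_get_device_count()):
        free = ctypes.c_size_t(0)
        total = ctypes.c_size_t(0)
        # The library writes the byte counts through the two out-pointers.
        lib.ggml_backend_vk_get_device_memory(
            index, ctypes.pointer(free), ctypes.pointer(total)
        )
        devices.append({"index": index, "free": free.value, "total": total.value})
    return devices
```

Because the list position is the library's own device index, the same number can later be handed to GGML_VK_VISIBLE_DEVICES without any remapping.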

_get_gpu_free_memory() now dispatches by backend: on a Vulkan build it uses the Vulkan reader; otherwise it runs the existing NVIDIA (nvidia-smi/torch) and AMD ROCm path. The NVIDIA/ROCm implementation is unchanged, just moved into a renamed helper.

Leaving host headroom on integrated GPUs

On an integrated GPU, ggml reports shared system RAM as "VRAM" (it sums every memory heap), so auto-sizing context and offload against that number could claim nearly all of RAM and push the host into swap or the OOM killer. For integrated GPUs the reader leaves a per-device host margin of a flat 1 GiB, matching llama.cpp's own --fit-target default, instead of inventing a larger reserve. Whether a device is integrated comes straight from ggml's device type (GGML_BACKEND_DEVICE_TYPE_IGPU), read in the same helper process, so discrete cards are never touched (no VRAM-vs-RAM ratio guessing).
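
The margin rule can be sketched as follows. The 1 GiB figure is from the description above; the function name, the dict keys, and the clamp at zero are assumptions made for illustration, not the PR's actual code.

```python
GIB = 1024 ** 3
IGPU_HOST_MARGIN_BYTES = 1 * GIB  # flat per-device margin described above


def apply_igpu_margin(devices):
    """Subtract the host margin from integrated GPUs only.

    Each device is a dict with "free" (bytes) and "integrated" (bool).
    Both key names are hypothetical. Clamping at zero is an assumption.
    """
    adjusted = []
    for dev in devices:
        free = dev["free"]
        if dev["integrated"]:
            free = max(0, free - IGPU_HOST_MARGIN_BYTES)
        # Discrete cards pass through with their reported value unchanged.
        adjusted.append({**dev, "free": free})
    return adjusted
```

Keying the rule off the reported device type, rather than off a VRAM-to-RAM ratio, means a discrete card with a small amount of VRAM is never mistaken for an integrated one.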

Pinning the GPU with `GGML_VK_VISIBLE_DEVICES`

When a model fits, Studio already picks the GPU(s) and puts all layers on them (-ngl -1). For CUDA and ROCm it does this with CUDA_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES. The Vulkan backend ignores those, so on a Vulkan build we use its own setting, GGML_VK_VISIBLE_DEVICES. The indices come from the same library that read the VRAM, so the pin always targets the right device. Multi-GPU works the same way it does for CUDA and ROCm.
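
A hedged sketch of the pinning step: build the environment for the llama-server child process and set the variable that the backend in use actually reads. The environment variable names are from the description above; `pinned_env` and its parameters are hypothetical, and the ROCm variables are left out for brevity.

```python
import os


def pinned_env(gpu_indices, is_vulkan_build, base_env=None):
    """Return a copy of the environment with the chosen GPUs made visible.

    `gpu_indices` must be in the order the backend itself enumerates devices.
    """
    env = dict(os.environ if base_env is None else base_env)
    ids = ",".join(str(i) for i in gpu_indices)
    if is_vulkan_build:
        # The Vulkan backend ignores the CUDA/HIP variables.
        env["GGML_VK_VISIBLE_DEVICES"] = ids
    else:
        env["CUDA_VISIBLE_DEVICES"] = ids
    return env
```

The returned mapping would be passed as `env=` when spawning the server, so the parent process's own environment is left untouched.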

Picking the Vulkan build at install time

The Vulkan builds come from upstream ggml-org/llama.cpp at the same tag, the same way the Windows-CUDA, CPU, and macOS builds already do, so there is nothing new to build or publish (llama-<tag>-bin-ubuntu-vulkan-x64.tar.gz and llama-<tag>-bin-win-vulkan-x64.zip).

The installer picks the Vulkan build only when there is no NVIDIA GPU, no ROCm GPU, and an Intel GPU is detected. Intel detection is:

Linux: reads the GPU vendor id from /sys/class/drm/card*/device/vendor and checks for 0x8086 (Intel).
Windows: runs Get-CimInstance Win32_VideoController and looks for "Intel".
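
The Linux check can be sketched like this. The sysfs path and the 0x8086 vendor id are from the list above; the function name and the `drm_root` parameter (added so the check can be pointed at a test directory) are assumptions.

```python
import glob
import os

INTEL_VENDOR_ID = 0x8086


def has_intel_gpu(drm_root="/sys/class/drm"):
    """Return True if any DRM card reports Intel's PCI vendor id."""
    pattern = os.path.join(drm_root, "card*", "device", "vendor")
    for vendor_file in glob.glob(pattern):
        try:
            with open(vendor_file) as fh:
                # The file holds a hex string such as "0x8086".
                vendor = int(fh.read().strip(), 16)
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
        if vendor == INTEL_VENDOR_ID:
            return True
    return False
```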

The new linux-vulkan and windows-vulkan install kinds are wired through the same places the others already go through (runtime file patterns, GPU validation, install health check).

Letting AMD users opt into Vulkan

AMD GPUs run Vulkan too, but by default Studio installs the ROCm build for them. Setting UNSLOTH_FORCE_VULKAN=1 before setup installs the upstream Vulkan build instead. This helps where ROCm is awkward (e.g. some integrated parts), and the integrated-GPU host margin above keeps it safe on shared-memory hardware like Strix Halo. It's scoped to the llama.cpp inference backend. The torch/training stack is a separate installer and still uses ROCm. macOS ignores the flag (Metal, no Vulkan prebuilt).
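
Taken together, the selection rules in this section and the previous one could be summarised by a sketch like the one below. It is not the installer's planner: the function name, the parameters, and the returned labels are hypothetical. It also assumes the flag does not override an NVIDIA host, which the description does not state, and it leaves out the CPU fallback used when a Vulkan asset cannot be installed.

```python
def pick_llama_backend(has_nvidia, has_rocm, has_intel, platform, env):
    """Pick which prebuilt llama.cpp backend to install (illustrative only)."""
    force_vulkan = env.get("UNSLOTH_FORCE_VULKAN") == "1"
    if platform == "macos":
        return "metal"  # the flag is ignored: no Vulkan prebuilt on macOS
    if has_nvidia:
        return "cuda"  # assumption: the flag does not apply to NVIDIA hosts
    if has_rocm:
        return "vulkan" if force_vulkan else "rocm"
    if has_intel:
        return "vulkan"
    return "cpu"
```

For example, an AMD host resolves to "rocm" by default and to "vulkan" when `UNSLOTH_FORCE_VULKAN=1` is set before setup.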

Why read from llama.cpp instead of parsing text

There is an earlier draft (#4874) that solves the same VRAM-reading problem by running vulkaninfo and parsing its text output for memory budgets. Reading from llama.cpp's library instead is better because:

No text parsing. vulkaninfo output is meant for humans and changes between versions, so parsing it is fragile. We call functions and read numbers.
The GPU indices match. Text-parsed ordering may not match llama.cpp's own GPU order, so a parsed index can pin the wrong GPU. Reading from the library gives indices in the same order GGML_VK_VISIBLE_DEVICES expects.
Accurate free VRAM straight from the driver, instead of a number scraped from text.
Covers the whole flow. #4874 only fills in the memory number; it does not handle GPU pinning or installer selection. This change does all of it: install selection, VRAM reading, context fit, and GPU pinning.

Related issues & PRs

Closes [Feature] Add llama.cpp vulkan backend support for non-nvidia cards #4497. This is the feature it asks for.
Addresses the Intel part of [Question] Unsloth Studio: When would you add INTEL and AMD support to the new webui? #4452. Intel GPUs now get GPU inference in Studio. (AMD users can opt into Vulkan with UNSLOTH_FORCE_VULKAN; fine-tuning is not part of this change.)
Supersedes [Studio] Fix GPU detection for AMD/Intel — add Vulkan VRAM fallback #4874. Same goal, but it reads from llama.cpp's library instead of parsing vulkaninfo, and it covers the whole flow rather than just detection.
Related to enable studio for intel GPU #4724. Overlapping Intel work; this change is the Vulkan llama.cpp inference part specifically.

Testing notes

The VRAM reader was tested against a real upstream Vulkan build on a multi-GPU host; free/total match nvidia-smi within normal allocation noise, and the GPU order matches llama.cpp's Vulkan order (which is different from nvidia-smi's order).
Install selection was tested for Intel hosts on Linux and Windows through the production (--simple-policy) planner, confirming Vulkan is picked first with CPU as the fallback, and that NVIDIA, ROCm, and no-GPU hosts are unaffected.
The integrated-GPU margin was exercised end-to-end against a real upstream Vulkan build: an integrated device has a flat 1 GiB carved off its reported free VRAM, while discrete cards are returned untouched.
UNSLOTH_FORCE_VULKAN was tested through the same --simple-policy planner: a forced AMD host on Linux and Windows resolves to the upstream Vulkan asset (with CPU fallback), while the same host without the flag still resolves to its ROCm build.

gemini-code-assist

Code Review

This pull request introduces support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It adds a short-lived subprocess probe script to query Vulkan VRAM without initializing Vulkan in the main process, handles device pinning via GGML_VK_VISIBLE_DEVICES, and updates the prebuilt installer to detect Intel GPUs and fetch the appropriate Vulkan assets. The review feedback highlights two key improvements: resolving a potential AttributeError on Windows where ctypes.RTLD_GLOBAL is undefined, and checking the subprocess return code and logging stderr to prevent silent failures during the Vulkan GPU probe.
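
The second piece of feedback, checking the return code and logging stderr, might look like the sketch below. `run_probe` is a hypothetical name, and the assumption that the probe prints JSON on stdout is mine rather than something the review states.

```python
import json
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_probe(argv, timeout=30):
    """Run a helper process and return its parsed JSON stdout, or None on failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        logger.warning("Vulkan probe could not run: %s", exc)
        return None
    if proc.returncode != 0:
        # Surface the child's stderr instead of failing silently.
        logger.warning(
            "Vulkan probe exited with %d: %s", proc.returncode, proc.stderr.strip()
        )
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        logger.warning("Vulkan probe printed unparseable output")
        return None
```

Returning None on every failure path lets the caller fall back to its no-GPU behaviour while the log still records why the probe failed.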

oobabooga · 2026-05-29T03:11:09Z

Tested install + inference on my dual NVIDIA GPU setup by overriding the NVIDIA detection and forcing vulkan. Ended up discovering a bug that I fixed at 4fefeeb.

After that, the test worked perfectly end to end, with a small model pinning successfully to only one of the GPUs and generating text.

oobabooga · 2026-05-31T21:53:55Z

/gemini review

gemini-code-assist

Code Review

This pull request adds support for the Vulkan backend in llama.cpp, enabling GPU acceleration on Intel and other non-NVIDIA/non-AMD GPUs. It introduces a standalone _vulkan_probe.py script to query free VRAM, integrates Vulkan memory detection and device pinning in llama_cpp.py, adds regression tests, and updates the prebuilt installer to detect Intel GPUs and support Vulkan prebuilt installations. A critical issue was identified in _vulkan_probe.py where accessing ctypes.RTLD_GLOBAL on Windows will raise an AttributeError and crash the probe subprocess; a suggestion was provided to conditionally apply RTLD_GLOBAL only on non-Windows platforms.

Studio: add Vulkan llama.cpp support

9e9729b

oobabooga requested a review from rolandtannous as a code owner May 27, 2026 16:33

[pre-commit.ci] auto fixes from pre-commit.com hooks

c401f10

for more information, see https://pre-commit.ci

oobabooga added a commit to oobabooga/llama.cpp that referenced this pull request May 27, 2026

Add Vulkan builds (Windows and Linux) to llama-prebuilt-sha256.json for unslothai/unsloth#5819

8d19e60

gemini-code-assist Bot reviewed May 27, 2026

Comment thread studio/backend/core/inference/llama_cpp.py Outdated

Comment thread studio/backend/core/inference/llama_cpp.py

oobabooga and others added 6 commits May 27, 2026 09:49

Address gemini's feedback

84d98a7

Studio: move the Vulkan VRAM probe into a standalone script

11acf22

[pre-commit.ci] auto fixes from pre-commit.com hooks

7dd21f3

for more information, see https://pre-commit.ci

Merge branch 'main' into vulkan-support

2401515

Improve Vulkan probe error reporting

e50b0af

Resolve llama-server symlink so Vulkan build is detected

4fefeeb

oobabooga added 5 commits May 31, 2026 08:31

Merge branch 'main' into vulkan-support

ea7cd94

Drop unreachable Vulkan fallback in GPU free-memory dispatcher

10faad1

Skip the Intel GPU probe when NVIDIA or ROCm is present

dafeb79

Reserve host RAM headroom for Vulkan integrated GPUs

1980e59

Add a UNSLOTH_FORCE_VULKAN environment variable

31f4a36

oobabooga requested a review from danielhanchen as a code owner May 31, 2026 21:40

[pre-commit.ci] auto fixes from pre-commit.com hooks

c3482d4

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed May 31, 2026

Comment thread studio/backend/core/inference/_vulkan_probe.py

oobabooga and others added 2 commits June 9, 2026 00:15

Merge branch 'main' into vulkan-support

cca5ca5

# Conflicts: # studio/backend/core/inference/llama_cpp.py # studio/install_llama_prebuilt.py

[pre-commit.ci] auto fixes from pre-commit.com hooks

7563d91

for more information, see https://pre-commit.ci

oobabooga wants to merge 16 commits into unslothai:main from oobabooga:vulkan-support
2 participants