vulkan: Implement set_tensor_async and the event interfaces by jeffbolznv · Pull Request #18047 · ggml-org/llama.cpp

jeffbolznv · 2025-12-15T04:49:26Z

The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works, and the loads themselves are faster. With host-visible vidmem, though, I think the cost of allocating and mapping vidmem just moves elsewhere and becomes more expensive, so I don't see a benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a significant improvement in model loading time.

It would be interesting to test on Linux how this interacts with #18012.

0cc4m · 2025-12-21T08:36:35Z

I tested loading gpt-oss 120B mxfp4 on Linux, and this is what I got:

MMAP: 56s
No MMAP: 34s
PR: 25s

That's a decent improvement.

0cc4m · 2025-12-21T08:52:32Z

But I am getting a bunch of validation warnings:

Validation Warning: [ BestPractices-Event-SignalSignaledEvent ] | MessageID = 0x8302d873
vkQueueSubmit(): pSubmits[0].pCommandBuffers[0] VkCommandBuffer 0x555f5efaf9f0 sets event VkEvent 0xe900000000e9 which is already in the signaled state (set by previously submitted command buffers or from the host). If this is not the desired behavior, the event must be reset before it is set again.
Objects: 2
    [0] VkCommandBuffer 0x555f5efaf9f0
    [1] VkEvent 0xe900000000e9

jeffbolznv · 2025-12-21T17:27:10Z

I had tested with VVL before creating the PR, and still can't reproduce those warnings. What version of the Vulkan SDK are you using? I'm on 1.4.335. What command line?

MMAP: 56s
No MMAP: 34s
PR: 25s

This is a nice speedup. Had you cherry-picked #18012 for this testing? I've gone ahead and rebased so the PR now has that code in the baseline.

0cc4m · 2025-12-21T17:51:34Z

I'm also on 1.4.335, yes, but not using the SDK directly, just the Arch Linux packages. The command line was this:

time build_vk/bin/llama-completion -c 16384 -n 1 --ignore-eos -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -no-cnv --no-mmap

It just loads the model, generates one token and exits, and times how long that process took.

I didn't cherry-pick #18012, but I always use a master branch and merge the PR I'm testing, so it was using it.

jeffbolznv · 2025-12-21T18:26:19Z

I found that my Vulkan Configurator settings were filtering out warnings, and I can reproduce it now. It's a false positive, though: the BestPractices layer isn't tracking vkResetEvent calls. I'll file a VVL issue.

jeffbolznv · 2025-12-21T18:28:28Z

I added it to the other VVL issue I had already filed: KhronosGroup/Vulkan-ValidationLayers#11288 (comment)

LostRuins · 2025-12-25T17:15:27Z

Hello @jeffbolznv, after this PR I am unable to use Vulkan on my iGPU with GPT-OSS 20B. I get llama_model_load: error loading model: vk::CommandBuffer::begin: ErrorOutOfHostMemory. Before this commit it worked fine.

If I use mmap, loading works fine. To add on: this only applies to my Intel iGPU. My NVIDIA dGPU is okay.

C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64>llama-cli.exe -ngl 99 --model D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf -fa off --device Vulkan0 -c 4096 --no-mmap
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-cpu-haswell.dll

Loading model... |llama_model_load: error loading model: vk::CommandBuffer::begin: ErrorOutOfHostMemory
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf'
Failed to load the modeld to load model, 'D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf'

After reverting to llama-b7501-bin-win-vulkan-x64, it works fine.

Revert "vulkan: Implement set_tensor_async and the event interfaces (ggml-org#18047)" This reverts commit e1f15b4. (+1 squashed commits) Squashed commits: [3cfbc7b1a] Revert "vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (ggml-org#18302)" This reverts commit 2a9ea20.

d-shehu · 2026-01-03T22:01:17Z

@jeffbolznv I believe this is also causing an issue for me with llama-server and twin Vulkan GPUs (Radeon r9700). As soon as I updated from b7501 to the release with this PR, it always crashes while loading any of my test models, including Qwen3 Next 80B and GPT-OSS 120B.

I disable mmap because it causes crashes when loading larger models like Qwen3 Coder 480B, which otherwise load with RAM + VRAM.

I tried adding GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 but it still crashes.

[40957] load_tensors: offloading output layer to GPU
[40957] load_tensors: offloading 47 repeating layers to GPU
[40957] load_tensors: offloaded 49/49 layers to GPU
[40957] load_tensors: RPC0[192.168.5.2:26001] model buffer size = 11950.43 MiB
[40957] load_tensors:      Vulkan0 model buffer size = 17297.64 MiB
[40957] load_tensors:      Vulkan1 model buffer size = 16765.81 MiB
[40957] load_tensors:  Vulkan_Host model buffer size =   166.92 MiB
[40957] load_all_data: device RPC0 does not support async, host buffers or events
[40957] .........................load_all_data: using async uploads for device Vulkan0, buffer type Vulkan0, backend Vulkan0
[40957] llama_model_load: error loading model: read error: Bad address
[40957] llama_model_load_from_file_impl: failed to load model
[40957] common_init_from_params: failed to load model './models/models--unsloth--Qwen3-Next-80B-A3B-Instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf'
[40957] srv    load_model: failed to load model, './models/models--unsloth--Qwen3-Next-80B-A3B-Instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf'

jhemmond · 2026-01-06T01:53:36Z

This is causing DeviceOutOfMemory for any model that can't strictly fit into VRAM. I can provide more details if needed. I have UMA with the iGPU set to 23 GB. This means Qwen3-Next, GLM Air, and other large models are out of commission.

jeffbolznv · 2026-01-06T02:32:34Z

@jhemmond please file an issue with more details.

jhemmond · 2026-01-06T14:42:11Z

@jeffbolznv Filed one here: #18642

jeffbolznv requested a review from 0cc4m as a code owner December 15, 2025 04:49

github-actions Bot added the labels Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) Dec 15, 2025

loci-dev mentioned this pull request Dec 15, 2025: UPSTREAM PR #18047: vulkan: Implement set_tensor_async and the event interfaces auroralabs-loci/llama.cpp#572 (Open)

lemmi mentioned this pull request Dec 15, 2025: Async DirectIO model loading on Linux #18012 (Merged)

jeffbolznv force-pushed the async_load branch from 6a31e8d to 7dddd68 December 21, 2025 17:24

0cc4m approved these changes Dec 21, 2025

0cc4m merged commit e1f15b4 into ggml-org:master Dec 21, 2025 (132 of 136 checks passed)

inforithmics mentioned this pull request Dec 22, 2025: Update Vulkan Sdk to 1.4.341.1 ollama/ollama#13546 (Open, 1 task)

engrtipusultan mentioned this pull request Dec 23, 2025: Misc. bug: Vulkan Backend llama-server and llama-bench Cannot Run Model with mmap = 0 #18317 (Closed)

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 28, 2025: Revert "Triage: revert ggml-org#18047 and ggml-org#18302" (a94d5ff). This reverts commit dfa1b72.

d-shehu mentioned this pull request Jan 4, 2026: Eval bug: b7574 error loading model: read error: Bad address #18473 (Closed)

wallentri88 mentioned this pull request Feb 24, 2026: Eval bug: qwen35 and qwen35moe graph split issues (Severe PP impact, crashes) #19864 (Closed)

Cath0deRay mentioned this pull request Feb 25, 2026: Eval bug: Qwen3-Coder-30B-A3B-Instruct crash in vulkan Intel ARL #19420 (Closed)

neilopet mentioned this pull request Mar 1, 2026: vulkan: add UMA zero-copy async transfers and fix event_record deferred memcpy handling #20018 (Open)
