vulkan: Implement set_tensor_async and the event interfaces #18047

Merged: 0cc4m merged 1 commit into ggml-org:master from jeffbolznv:async_load on Dec 21, 2025

@jeffbolznv

Contributor

The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works and the loads themselves are faster, but with host-visible vidmem I think the cost of allocating/mapping vidmem moves elsewhere and becomes more expensive, so I don't see a benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however, I do see a significant improvement in model loading time.

It would be interesting to test on Linux how this interacts with #18012.
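
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a staged asynchronous upload of the kind the description above refers to. It is not the code from llama_model_loader::load_all_data or from the Vulkan backend: FakeDevice, upload_async and staged_upload are hypothetical names, and a std::future stands in for a backend event. To keep the sketch deterministic, the simulated copy is deferred until its future is waited on, which models an upload that may still be reading the staging buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical stand-in for device memory; a real backend would own GPU memory.
struct FakeDevice {
    std::vector<unsigned char> memory;
    explicit FakeDevice(size_t size) : memory(size) {}

    // Simulated async upload. The copy is deferred until the returned future
    // is waited on, so the future plays the role of a backend event.
    std::future<void> upload_async(const unsigned char * src, size_t offset, size_t size) {
        assert(offset + size <= memory.size());
        return std::async(std::launch::deferred, [this, src, offset, size] {
            std::memcpy(memory.data() + offset, src, size);
        });
    }
};

// Copy `data` to the device through a small ring of staging buffers.
// A staging buffer is reused only after the upload that reads from it is done.
inline void staged_upload(FakeDevice & dev, const std::vector<unsigned char> & data,
                          size_t chunk_size, size_t n_staging) {
    assert(chunk_size > 0 && n_staging > 0);
    std::vector<std::vector<unsigned char>> staging(
        n_staging, std::vector<unsigned char>(chunk_size));
    std::vector<std::future<void>> events(n_staging);

    size_t slot = 0;
    for (size_t offset = 0; offset < data.size(); offset += chunk_size) {
        const size_t size = std::min(chunk_size, data.size() - offset);
        if (events[slot].valid()) {
            events[slot].wait();  // wait for the previous upload from this slot
        }
        // Stand-in for reading the next chunk of the model file.
        std::memcpy(staging[slot].data(), data.data() + offset, size);
        events[slot] = dev.upload_async(staging[slot].data(), offset, size);
        slot = (slot + 1) % n_staging;
    }
    for (auto & e : events) {
        if (e.valid()) {
            e.wait();  // drain uploads that are still outstanding
        }
    }
}
```

Calling staged_upload(dev, data, 64, 4) on a 1000-byte input leaves dev.memory equal to data. Skipping the wait before reusing a slot would let a later chunk overwrite staging memory that an earlier upload has not yet consumed, which is the hazard the events exist to prevent.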

@jeffbolznv requested a review from 0cc4m as a code owner on December 15, 2025 04:49
@github-actions (Bot) added the labels "Vulkan" (Issues specific to the Vulkan backend) and "ggml" (changes relating to the ggml tensor library for machine learning) on Dec 15, 2025
@0cc4m

0cc4m commented Dec 21, 2025

Contributor

I tested loading gpt-oss 120B mxfp4 on Linux, and this is what I got:

MMAP: 56s
No MMAP: 34s
PR: 25s

That's a decent improvement.
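
To put the three timings above in relative terms, a tiny helper like the following can be used. The function name speedup is only illustrative and is not part of llama.cpp.

```cpp
#include <cassert>

// Ratio of a baseline load time to an improved load time (both in seconds).
inline double speedup(double baseline_seconds, double improved_seconds) {
    assert(improved_seconds > 0.0);
    return baseline_seconds / improved_seconds;
}
```

With the figures reported here, speedup(56, 25) is 2.24 relative to the mmap run and speedup(34, 25) is 1.36 relative to the no-mmap run.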

@0cc4m

0cc4m commented Dec 21, 2025

Contributor

But I am getting a bunch of validation warnings:

Validation Warning: [ BestPractices-Event-SignalSignaledEvent ] | MessageID = 0x8302d873
vkQueueSubmit(): pSubmits[0].pCommandBuffers[0] VkCommandBuffer 0x555f5efaf9f0 sets event VkEvent 0xe900000000e9 which is already in the signaled state (set by previously submitted command buffers or from the host). If this is not the desired behavior, the event must be reset before it is set again.
Objects: 2
    [0] VkCommandBuffer 0x555f5efaf9f0
    [1] VkEvent 0xe900000000e9

@jeffbolznv

Contributor Author

I had tested with VVL before creating the PR, and still can't reproduce those warnings. What version of the Vulkan SDK are you using? I'm on 1.4.335. What command line?

MMAP: 56s
No MMAP: 34s
PR: 25s

This is a nice speedup. Had you cherry-picked #18012 for this testing? I've gone ahead and rebased so the PR now has that code in the baseline.

@0cc4m

0cc4m commented Dec 21, 2025

Contributor

I'm also on 1.4.335, yes, but not using the SDK directly, just the Arch Linux packages. The command line was this:

time build_vk/bin/llama-completion -c 16384 -n 1 --ignore-eos -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -no-cnv --no-mmap

It just loads the model, generates one token and exits, and times how long that process took.

I didn't cherry-pick #18012, but I always test on the master branch with the PR under test merged in, so #18012 was included.

@jeffbolznv

Contributor Author

I found that my Vulkan Configurator settings were filtering out warnings. I can reproduce it now, but it's a false positive: the BestPractices layer isn't tracking vkResetEvent calls. I'll file a VVL issue.
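
The reasoning behind the false positive can be illustrated with a toy model. This is not how the validation layers are implemented: EventTracker and its methods are hypothetical, and the only point is that a checker which records "set" operations but misses host-side resets will report an event as already signaled even though the application reset it correctly.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of a checker that warns when an event is set while signaled.
// `track_host_reset` controls whether host-side resets are observed at all.
class EventTracker {
public:
    explicit EventTracker(bool track_host_reset)
        : track_host_reset_(track_host_reset) {}

    // Record a "set" on the event; returns true if a warning would be raised
    // because the tracker believes the event is already signaled.
    bool on_set(uint64_t event) {
        bool & signaled = signaled_[event];
        const bool warn = signaled;
        signaled = true;
        return warn;
    }

    // Record a host-side reset; ignored when resets are not tracked.
    void on_host_reset(uint64_t event) {
        if (track_host_reset_) {
            signaled_[event] = false;
        }
    }

private:
    bool track_host_reset_;
    std::unordered_map<uint64_t, bool> signaled_;
};
```

With EventTracker(true), the sequence set, reset, set produces no warning. With EventTracker(false), the same correct sequence produces a warning on the second set, which matches the shape of the report quoted earlier in this thread.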

@jeffbolznv

Contributor Author

I added it to the other VVL issue I had already filed: KhronosGroup/Vulkan-ValidationLayers#11288 (comment)

@0cc4m 0cc4m merged commit e1f15b4 into ggml-org:master Dec 21, 2025
132 of 136 checks passed
@LostRuins

LostRuins commented Dec 25, 2025

Collaborator

Hello @jeffbolznv, after this PR I am unable to use Vulkan on my iGPU with GPT-OSS 20B. I get llama_model_load: error loading model: vk::CommandBuffer::begin: ErrorOutOfHostMemory. Before this commit it worked fine.

If I use mmap, loading works fine. To add on: this only applies to my Intel iGPU. My NVIDIA dGPU is okay.

C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64>llama-cli.exe -ngl 99 --model D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf -fa off --device Vulkan0 -c 4096 --no-mmap
load_backend: loaded RPC backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Intel(R) RaptorLake-S Mobile Graphics Controller (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4090 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\user\Desktop\llama-b7539-bin-win-vulkan-x64\ggml-cpu-haswell.dll

Loading model... |llama_model_load: error loading model: vk::CommandBuffer::begin: ErrorOutOfHostMemory
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf'
Failed to load the modeld to load model, 'D:\ExtDrive\models\test_models\gpt-oss-20b-Q2_K.gguf'

After reverting to llama-b7501-bin-win-vulkan-x64, it works fine.

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 25, 2025
Revert "vulkan: Implement set_tensor_async and the event interfaces (ggml-org#18047)"

This reverts commit e1f15b4. (+1 squashed commits)

Squashed commits:

[3cfbc7b1a] Revert "vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (ggml-org#18302)"

This reverts commit 2a9ea20.
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Dec 28, 2025
@d-shehu

d-shehu commented Jan 3, 2026

@jeffbolznv I believe this is also causing an issue for me with llama-server and twin Vulkan GPUs (Radeon R9700). As soon as I updated from b7501 to the release containing this PR, it always crashes while loading any of my test models, including Qwen 3 Next 80B and GPT-OSS 120B.

I disable mmap because mmap causes crashes when loading larger models like Qwen 3 Coder 480B, which otherwise load with RAM + VRAM.

I tried adding GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 but it still crashes.

[40957] load_tensors: offloading output layer to GPU
[40957] load_tensors: offloading 47 repeating layers to GPU
[40957] load_tensors: offloaded 49/49 layers to GPU
[40957] load_tensors: RPC0[192.168.5.2:26001] model buffer size = 11950.43 MiB
[40957] load_tensors:      Vulkan0 model buffer size = 17297.64 MiB
[40957] load_tensors:      Vulkan1 model buffer size = 16765.81 MiB
[40957] load_tensors:  Vulkan_Host model buffer size =   166.92 MiB
[40957] load_all_data: device RPC0 does not support async, host buffers or events
[40957] .........................load_all_data: using async uploads for device Vulkan0, buffer type Vulkan0, backend Vulkan0
[40957] llama_model_load: error loading model: read error: Bad address
[40957] llama_model_load_from_file_impl: failed to load model
[40957] common_init_from_params: failed to load model './models/models--unsloth--Qwen3-Next-80B-A3B-Instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf'
[40957] srv    load_model: failed to load model, './models/models--unsloth--Qwen3-Next-80B-A3B-Instruct/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf'
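
The log above shows the loader deciding per device whether to use async uploads: RPC0 is rejected because it lacks async, host buffers or events, while Vulkan0 is accepted. A minimal sketch of that kind of gating follows. DeviceCaps, UploadPath and choose_upload_path are hypothetical names modeled on the three features named in the log message, not the actual ggml structures or loader code.

```cpp
// Hypothetical capability flags, named after the features in the log line.
struct DeviceCaps {
    bool async;        // backend can perform asynchronous tensor uploads
    bool host_buffer;  // backend can provide host (pinned) staging buffers
    bool events;       // backend implements the event interfaces
};

enum class UploadPath { Async, Sync };

// Use the async path only when every required capability is present;
// otherwise fall back to synchronous uploads.
inline UploadPath choose_upload_path(const DeviceCaps & caps) {
    return (caps.async && caps.host_buffer && caps.events) ? UploadPath::Async
                                                           : UploadPath::Sync;
}
```

Under this sketch, a device reporting all three capabilities takes the async path, and a device missing any one of them takes the synchronous path.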

@jhemmond

jhemmond commented Jan 6, 2026

This is causing DeviceOutOfMemory for any model that can't fit entirely into VRAM. I can provide more details if needed. I have UMA with the iGPU set to 23 GB. This means Qwen3-Next, GLM Air, and other large models are out of commission.

@jeffbolznv

Contributor Author

@jhemmond please file an issue with more details.

@jhemmond

jhemmond commented Jan 6, 2026

@jeffbolznv Filed one here: #18642

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request May 29, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026