vulkan: Implement set_tensor_async and the event interfaces #18047
|
I tested loading gpt-oss 120B mxfp4 on Linux, and this is what I got: MMAP: 56s That's a decent improvement. |
|
But I am getting a bunch of validation warnings: |
The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from ggml-org#7896. This works and the loads themselves are faster, but with host-visible vidmem I think the cost shifts to allocating/mapping vidmem, which becomes more expensive, so I don't see a benefit by default. With GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1, however, I do see a significant improvement in model loading time.
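For readers unfamiliar with the async loading pattern, here is a minimal host-only sketch of the general idea: data is read into a small ring of staging buffers, each chunk is copied out asynchronously, and an event per staging buffer is waited on before that buffer is reused. This is not the llama.cpp or ggml-vulkan code. `upload_event`, `copy_async`, and `staged_upload` are hypothetical names, a `std::future` stands in for a backend event, a `memcpy` on another thread stands in for the device copy, and the chunk size and buffer count are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Hypothetical stand-in for a backend event: it tracks one in-flight copy,
// and synchronize() blocks until that copy has finished.
struct upload_event {
    std::future<void> done;
    void synchronize() { if (done.valid()) done.get(); }
};

// Hypothetical stand-in for an async tensor upload: copies n bytes from a
// staging buffer to "device" memory on another thread.
static std::future<void> copy_async(unsigned char * dst, const unsigned char * src, size_t n) {
    return std::async(std::launch::async, [=] { std::memcpy(dst, src, n); });
}

// Upload src into dst in chunks through a ring of n_buffers staging buffers.
// Returns the number of chunks submitted.
static size_t staged_upload(const std::vector<unsigned char> & src,
                            std::vector<unsigned char> & dst,
                            size_t chunk_size, size_t n_buffers) {
    assert(chunk_size > 0 && n_buffers > 0 && dst.size() >= src.size());
    std::vector<std::vector<unsigned char>> staging(n_buffers, std::vector<unsigned char>(chunk_size));
    std::vector<upload_event> events(n_buffers);

    size_t n_chunks = 0;
    for (size_t off = 0; off < src.size(); off += chunk_size, ++n_chunks) {
        const size_t slot = n_chunks % n_buffers;
        const size_t n    = std::min(chunk_size, src.size() - off);

        // Wait until the previous copy out of this staging buffer has finished.
        events[slot].synchronize();

        // Fill the staging buffer (stands in for reading from the model file).
        std::memcpy(staging[slot].data(), src.data() + off, n);

        // Submit the copy and "record" the event for this slot.
        events[slot].done = copy_async(dst.data() + off, staging[slot].data(), n);
    }

    // Drain all outstanding copies before returning.
    for (auto & e : events) e.synchronize();
    return n_chunks;
}
```

With more than one staging buffer, filling the next chunk can overlap with the copy of the previous one, which is where the faster loads would come from in a scheme like this.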
|
I had tested with VVL before creating the PR, and still can't reproduce those warnings. What version of the Vulkan SDK are you using? I'm on 1.4.335. What command line?
This is a nice speedup. Had you cherry-picked #18012 for this testing? I've gone ahead and rebased so the PR now has that code in the baseline. |
|
I'm also on 1.4.335, yes, but not using the SDK directly, just the Arch Linux packages. The command line was this: It just loads the model, generates one token and exits, and times how long that process took. I didn't cherry-pick #18012, but I always use a master branch and merge the PR I'm testing, so it was using it. |
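For anyone who wants to repeat this kind of wall-clock measurement, a minimal sketch of a timing helper follows. It is not the command used above (that command is not reproduced here). `time_command` is a hypothetical helper, and the command string passed to it is whatever load-and-generate-one-token invocation you want to time.

```cpp
#include <chrono>
#include <cstdlib>
#include <string>

// Run a shell command and return the elapsed wall-clock time in seconds.
// If status is non-null, it receives the raw return value of std::system.
static double time_command(const std::string & cmd, int * status = nullptr) {
    const auto t0 = std::chrono::steady_clock::now();
    const int  rc = std::system(cmd.c_str());
    const auto t1 = std::chrono::steady_clock::now();
    if (status) {
        *status = rc;
    }
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Running it a few times with and without mmap, and with and without GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 set in the environment, gives numbers that can be compared like the ones in this thread. Note that the OS page cache can affect repeated runs.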
|
I found that my Vulkan Configurator settings were filtering out warnings. I can reproduce it now. But it's a false positive: the BestPractices layer isn't tracking vkResetEvent calls. I'll file a VVL issue. |
|
I added it to the other VVL issue I had already filed: KhronosGroup/Vulkan-ValidationLayers#11288 (comment) |
|
Hello @jeffbolznv, after this PR I am unable to use Vulkan on my iGPU with GPT-OSS 20B, getting If I use mmap, then loading works fine. To add on: this only applies to my Intel iGPU. My NVIDIA dGPU is okay. Reverting to |
Revert "vulkan: Implement set_tensor_async and the event interfaces (ggml-org#18047)" This reverts commit e1f15b4. (+1 squashed commits) Squashed commits: [3cfbc7b1a] Revert "vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (ggml-org#18302)" This reverts commit 2a9ea20.
This reverts commit dfa1b72.
|
@jeffbolznv I believe this is also causing an issue for me with llama-server and twin Vulkan GPUs (Radeon R9700). As soon as I updated from b7501 to the release with this PR, it always crashes while loading any of my test models, including Qwen 3 Next 80B and GPT-OSS 120B. I disable mmap because mmap causes crashes when loading larger models like Qwen 3 Coder 480B, which otherwise load with RAM + VRAM. I tried adding GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 but it still crashes. |
|
This is causing DeviceOutOfMemory for any model that can't strictly fit into VRAM. I can provide more details if needed. I have UMA with the iGPU set to 23 GB. This means Qwen3-Next, GLM Air, and other large models are out of commission. |
|
@jhemmond please file an issue with more details. |
|
@jeffbolznv Filed one here: #18642 |
It would be interesting to test on Linux how this interacts with #18012.