Skip to content

vulkan: fix matmul integer pipeline selection#23005

Merged
0cc4m merged 2 commits into
masterfrom
0cc4m/vulkan-fix-matmul-pipeline-int-selection
May 14, 2026
Merged

vulkan: fix matmul integer pipeline selection#23005
0cc4m merged 2 commits into
masterfrom
0cc4m/vulkan-fix-matmul-pipeline-int-selection

Conversation

@0cc4m
Copy link
Copy Markdown
Contributor

@0cc4m 0cc4m commented May 13, 2026

Overview

Fix a pipeline selection issue we missed in #22693. The int pipeline values must match the device-specific non-int ones. Fixes #22992. Thank you @deepakhj for the report and correct suggestion.

Requirements

@0cc4m 0cc4m requested a review from a team as a code owner May 13, 2026 08:04
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 13, 2026
@jeffbolznv
Copy link
Copy Markdown
Contributor

What is the crash? It's not obvious to me why the allowed integer shapes must be a subset of the non-integer shapes.

@Nindaleth
Copy link
Copy Markdown
Contributor

Here's the backtrace:

gdb --args ./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 1 -d 8192 --device Vulkan1
...
Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x00007ffff3d37bca in ggml_vk_mul_mat_q_f16(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, bool) () from /home/black_fox/src/llama.cpp/build/bin/libggml-vulkan.so.0
#1  0x00007ffff3d4ce4e in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] ()
   from /home/black_fox/src/llama.cpp/build/bin/libggml-vulkan.so.0
#2  0x00007ffff3d4dc64 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/black_fox/src/llama.cpp/build/bin/libggml-vulkan.so.0
#3  0x00007ffff7f03d7b in ggml_backend_sched_graph_compute_async () from /home/black_fox/src/llama.cpp/build/bin/libggml-base.so.0
#4  0x00007ffff7641660 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/black_fox/src/llama.cpp/build/bin/libllama.so.0
#5  0x00007ffff7645205 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) ()
   from /home/black_fox/src/llama.cpp/build/bin/libllama.so.0
#6  0x00007ffff7648fb2 in llama_context::decode(llama_batch const&) () from /home/black_fox/src/llama.cpp/build/bin/libllama.so.0
#7  0x00007ffff764aeee in llama_decode () from /home/black_fox/src/llama.cpp/build/bin/libllama.so.0
#8  0x0000000000406f2b in test_prompt(llama_context*, int, int, int) ()
#9  0x0000000000403a4e in main ()

@bricss
Copy link
Copy Markdown

bricss commented May 13, 2026

Vulkan build crashing 💥 with 419 Segmentation fault

@jeffbolznv
Copy link
Copy Markdown
Contributor

Oh, I guess it's because we have the if (device->mul_mat ## ID ## ... checks when creating the pipelines. I had missed that in my search. Maybe a better fix would be to change CREATE_MMQ to look at the _int variables.

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented May 13, 2026

But also these bools are used for device-tuning and enabling all is only correct for Nvidia. I didn't look into it further than that.

@jeffbolznv
Copy link
Copy Markdown
Contributor

The device tuning may eventually need to be different for int vs float, and we shouldn't try to create the pipelines if mul_mat_*int is false, so I think CREATE_MMQ needs to change. For perf or bug workarounds, we could additionally set mul_mat*int on a per-device basis if needed.

@0cc4m
Copy link
Copy Markdown
Contributor Author

0cc4m commented May 14, 2026

@ggml-org/maintainers Another approval needed.

@0cc4m 0cc4m merged commit dbe7901 into master May 14, 2026
48 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-fix-matmul-pipeline-int-selection branch May 14, 2026 08:36
SharkEzz added a commit to SharkEzz/llama.cpp that referenced this pull request May 14, 2026
commit dbe7901
Author: Ruben Ortlam <rortlam@redhat.com>
Date:   Thu May 14 10:36:54 2026 +0200

    vulkan: fix matmul integer pipeline selection (ggml-org#23005)

    * vulkan: fix matmul integer pipeline selection

    * gate pipeline creation with the right bools

commit 320a6a4
Author: Aleksander Grygier <aleksander.grygier@gmail.com>
Date:   Thu May 14 08:09:29 2026 +0200

    fix: Autoscroll detection (ggml-org#23026)

commit 9ed6e19
Author: Katostrofik <georgiopapairo@gmail.com>
Date:   Thu May 14 01:39:14 2026 -0400

    SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (ggml-org#21597)

    * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

    Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
    in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
    DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
    zeMemAllocDevice uses the SVM/P2P path with no host staging.

    On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
    consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
    With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
    no performance regression.

    All Level Zero calls include automatic fallback to the original SYCL
    allocation path if Level Zero interop is unavailable.

    * SYCL: address review feedback - remove try/catch, check device types, deduplicate

    - Remove try/catch from malloc/free/memcpy helpers, check backend and
      device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
    - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
      and declare in common.hpp to eliminate code duplication
    - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
    - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
      host-staged path for iGPU-to-dGPU transfers
    - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
      in CMakeLists.txt (co-authored with @arthw)

    * SYCL: add build/runtime flags for Level Zero, address review feedback

    Implements the architecture suggested by @arthw: compile-time and runtime
    flags to cleanly separate Level Zero and SYCL memory API paths.

    - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
      Zero code is wrapped in #ifdef so the build works on systems without
      the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
      loader library and headers are checked before enabling.

    - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
      whether Level Zero or SYCL memory APIs are used. Only one API style is
      used per session, no mixing. If Level Zero is enabled but the devices
      don't support the Level Zero backend, it auto-disables with a warning.

    - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
      is not called anywhere in the backend) and used try/catch for flow control.

    - Update SYCL.md with documentation for both new parameters.

    Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
    GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
    (Claude). Code reviewed and tested on my hardware.

    * SYCL: unify Level Zero malloc/free call sites, address review feedback

    Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
    Both functions are now unconditionally available — Level Zero code is
    uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

    Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
    traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
    sites (-29 lines net).

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

    Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
    so the Level Zero code path is compiled and tested in CI.

    Fix two bugs found during extended dual-GPU testing (no
    ONEAPI_DEVICE_SELECTOR set):

    - The Level Zero backend check was iterating all SYCL devices
      including CPU. The OpenCL CPU device caused Level Zero to be
      disabled for the GPUs, defeating the fix on multi-GPU systems.
      Added is_gpu() filter so only GPU devices are checked.

    - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
      were still calling sycl::malloc/sycl::free directly, bypassing the
      Level Zero path. Routed through ggml_sycl_malloc_device/free_device
      for consistency with the other device memory call sites.

    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

    * SYCL: address arthw review feedback on Level Zero memory API structure

    - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
      only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
    - Switch both helpers to use g_ggml_sycl_enable_level_zero global
      instead of per-call queue backend checks
    - Remove #ifdef wrapper from global definition; always declare at 0,
      add #else branch in init block so it stays 0 when L0 not compiled in
    - Update init loop comment to explain GPU-only device check
    - CMakeLists: message(STATUS) before the if block; align option wording

    AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
    B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
    Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
    <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    * SYCL: remove unused cstdio/cstdlib includes from common.cpp

    Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

    * Apply suggestions from code review

    Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

    * SYCL: preserve Level Zero allocation path during early malloc

    * ci: fix Level Zero package conflict in Intel Docker build

    * ci: find Level Zero loader in oneAPI package step

    * ci: allow Windows SYCL package without Level Zero DLL

    ---------

    Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 14, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
@poetinha
Copy link
Copy Markdown

Can someone take a look on #23106? It seems related to #22992. Thank you!

dandm1 pushed a commit to dandm1/llama.cpp that referenced this pull request May 16, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
DrBearJew referenced this pull request in DrBearJew/RoxxY May 22, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools

(cherry picked from commit dbe7901ca652cc8fe314e6972e3ce9cb33edf33d)
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools

(cherry picked from commit dbe7901)
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: Vulkan: llama-bench segfault on AMD RDNA2 (RX 6800 XT)

6 participants