vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it #21572

Merged
0cc4m merged 6 commits into ggml-org:master from jeffbolznv:rte_spirv on Apr 14, 2026

Conversation

@jeffbolznv
Contributor

Overview

Replace the current RTE16 handling with something that applies to all shaders.

Additional information

I don't have easy access to a Turing system where the lack of RTE16 would cause a failure, but it should be immediately obvious in CI if this is broken.

I was disappointed to find that there doesn't seem to be a good way to include spirv.hpp. It's in a different location in Windows and Linux Vulkan SDK installs, and for Linux without the Vulkan SDK I don't think it's available with the current packages we assume are installed (though I'm not confident of this). Is there a better way?
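For illustration, the runtime patch the title describes can be sketched roughly like this. This is a hypothetical, self-contained Python sketch, not the PR's actual C++ code; the opcode and enum values are hardcoded from spirv.hpp, and `add_rte_fp16` and its structure are mine:

```python
import struct

# Opcode and enum values hardcoded from spirv.hpp so the sketch has no
# dependency on SPIRV-Headers.
SPIRV_MAGIC       = 0x07230203
OP_ENTRY_POINT    = 15
OP_EXECUTION_MODE = 16
OP_CAPABILITY     = 17
MODE_ROUNDING_MODE_RTE = 4462  # spv::ExecutionModeRoundingModeRTE
CAP_ROUNDING_MODE_RTE  = 4467  # spv::CapabilityRoundingModeRTE

def add_rte_fp16(spv: bytes) -> bytes:
    """Patch a SPIR-V binary so every entry point rounds fp16 RTE."""
    words = list(struct.unpack(f"<{len(spv) // 4}I", spv))
    assert words[0] == SPIRV_MAGIC
    cap_end = 5          # instructions start after the 5-word header
    entry_ids, entry_end = [], 5
    i = 5
    while i < len(words):
        word_count = words[i] >> 16
        opcode = words[i] & 0xFFFF
        if opcode == OP_CAPABILITY:
            cap_end = i + word_count
        elif opcode == OP_ENTRY_POINT:
            entry_ids.append(words[i + 2])  # word 2 holds the entry point <id>
            entry_end = i + word_count
        i += word_count
    # Execution modes belong immediately after the entry-point section;
    # splice in "OpExecutionMode %entry RoundingModeRTE 16" per entry point.
    modes = []
    for eid in entry_ids:
        modes += [(4 << 16) | OP_EXECUTION_MODE, eid, MODE_ROUNDING_MODE_RTE, 16]
    words[entry_end:entry_end] = modes
    # The capability must be declared in the (leading) capability section;
    # inserting at the earlier offset last keeps both offsets valid.
    words[cap_end:cap_end] = [(2 << 16) | OP_CAPABILITY, CAP_ROUNDING_MODE_RTE]
    return struct.pack(f"<{len(words)}I", *words)
```

On SPIR-V versions below 1.4 an OpExtension declaration for SPV_KHR_float_controls would also be needed, and a real implementation would gate all of this on the device reporting shaderRoundingModeRTEFloat16.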

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, I used Claude to implement most of this change, with some directed bugfix suggestions and manual edits.

@jeffbolznv jeffbolznv requested a review from a team as a code owner April 7, 2026 17:25
@jeffbolznv
Contributor Author

The second commit uses FetchContent to get SPIRV-Headers. I guess this should be OK; other ggml backends also use FetchContent, and it's a small header-only dependency.
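For reference, a minimal FetchContent sketch along these lines (the declaration below, including the tag, is assumed rather than copied from the PR; SPIRV-Headers does export the SPIRV-Headers::SPIRV-Headers interface target):

```cmake
include(FetchContent)
FetchContent_Declare(
    spirv-headers
    GIT_REPOSITORY https://github.com/KhronosGroup/SPIRV-Headers.git
    GIT_TAG        vulkan-sdk-1.4.341.0   # hypothetical pin
)
FetchContent_MakeAvailable(spirv-headers)

# The project exports an interface target carrying its include dirs.
target_link_libraries(ggml-vulkan PRIVATE SPIRV-Headers::SPIRV-Headers)
```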

@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Apr 7, 2026
@jeffbolznv
Contributor Author

I don't know what's going on with the CI failure. Last time I had a failure in llama-save-load-state it was due to different shaders being used in each run (due to fusion issues). I don't see how this change would have triggered a similar issue. I haven't been able to reproduce it locally on a 4070.

@0cc4m
Contributor

0cc4m commented Apr 10, 2026

I like it; it's cleaner. I can compile and run it successfully, but I don't know if I have hardware that would fail the tests without RTE. This also needs a rebase.

@masamaru-san

This PR works excellently for llama.cpp 👍, but it causes issues in projects that use GGML as a submodule, such as stable-diffusion.cpp, because CMake's FetchContent contaminates the parent work tree.

Adding only the necessary spv definitions to ggml-vulkan.cpp would avoid FetchContent and prevent an increase in external dependencies. Would you consider doing this?

@jeffbolznv
Contributor Author

I've rebased it.

I'm not sure what to do about the FetchContent concern. Manually defining the SPIR-V enums we need is possible, but there may be other dependencies we want to fetch at some point in the future, and I'd like to have a real solution.

I think the other main way of getting dependencies is via submodules, but I'm not sure how that interacts with ggml conditionally including ggml-vulkan, and it adds a manual step to clone/fetch the submodules.

@masamaru-san

To be honest, I find it too difficult to use CMake and Git submodules properly. When I first applied this PR to ggml under stable-diffusion.cpp, I couldn't figure out what was happening in the working tree.
After asking Copilot AI and going through some trial and error, I found that FetchContent was the cause, but I still don't understand anything beyond that.

However, I am concerned that the spirv.hpp file version obtained via FetchContent might not be compatible with the Vulkan SDK or glslc used in each user's environment -- at least for the LunarG Vulkan SDK for Windows.
(The AI also said that spirv.hpp is sometimes vendor-specific, e.g. for Android -- is that true?)

@jeffbolznv
Contributor Author

What was the actual error you saw? From what I can tell, the main concern with FetchContent is conflicting versions being fetched. But stable-diffusion.cpp doesn't fetch spirv-headers, so I don't see why it would cause a problem.

spirv.hpp is generally backward compatible (not sure if it's 100% guaranteed, but pretty close), and I'm not aware of anything else in llama.cpp/ggml/sd depending on it. So IMO, it's more of a theoretical concern for some other project that may include ggml.

@masamaru-san

Oh, thanks for the quick reply!

Here's the specific issue I encountered:
The deps directory and its contents, retrieved via FetchContent, are created in the build directory (i.e. out/build/x64-Release/).
That directory is supposed to be ignored by .gitignore, but for some reason it ends up being added to the working tree.

There might be some kind of mistake in the CMakeLists.txt file for the parent stable-diffusion.cpp, but I can't tell for sure.

@jeffbolznv
Contributor Author

Hmm, it is supposed to be fetched to the build directory. But I don't know why it's not getting ignored on your system.

@masamaru-san

If this is just an issue with my specific environment (as is often the case with Windows), I'll handle it myself within my local repo, so please don't worry about it.

@0cc4m
Contributor

0cc4m commented Apr 13, 2026

I was disappointed to find that there doesn't seem to be a good way to include spirv.hpp. It's in a different location in windows and linux Vulkan SDK installs, and for linux without the Vulkan SDK I don't think it's available with the current packages we assume are installed (though I'm not confident of this). Is there a better way?

It might be okay to patch the location for Windows and Linux if it's in different places in the SDK, but shouldn't CMake be able to resolve that? I can't easily look inside the Windows SDK since it's an exe.

On Fedora I can install spirv-headers-devel and get the file in /usr/include/spirv/unified1/spirv.hpp. On Arch and Ubuntu I can install spirv-headers and get the same file. On the Linux SDK I get it in x86_64/include/spirv/unified1/spirv.hpp and it also provides a cmake file for it.
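If CMake were to resolve the differing locations, a sketch might look like the following (hypothetical; the header subdirectory names are taken from the paths discussed in this thread, spirv/unified1 on Linux and spirv-headers in the Windows SDK's Include directory):

```cmake
# Search the Vulkan SDK (if set) and the usual system include dirs for
# either layout of spirv.hpp, then expose the result as an include path.
find_path(SPIRV_HEADERS_INCLUDE_DIR
    NAMES spirv/unified1/spirv.hpp spirv-headers/spirv.hpp
    HINTS $ENV{VULKAN_SDK}/include $ENV{VULKAN_SDK}/Include
)
if (SPIRV_HEADERS_INCLUDE_DIR)
    target_include_directories(ggml-vulkan PRIVATE ${SPIRV_HEADERS_INCLUDE_DIR})
else()
    message(FATAL_ERROR "spirv.hpp not found; install spirv-headers (or spirv-headers-devel)")
endif()
```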

@jeffbolznv
Contributor Author

Just to make sure I understand what you're suggesting: remove FetchContent, ifdef the include to use a different path for Windows vs. Linux, and update the build guide to tell folks to install spirv-headers or spirv-headers-devel?

@0cc4m
Contributor

0cc4m commented Apr 13, 2026

Yes, basically. Or is there a good reason not to? I can't check how it works on Windows, currently.

@jeffbolznv
Contributor Author

On Windows I have the file in C:\vulkansdk\1.4.341.1\Include\spirv-headers. OK, I'll try this change.

@jeffbolznv jeffbolznv requested a review from ngxson as a code owner April 13, 2026 10:29
@github-actions github-actions Bot added documentation Improvements or additions to documentation devops improvements to build systems and github actions labels Apr 13, 2026
@jeffbolznv jeffbolznv requested a review from a team as a code owner April 13, 2026 11:45
@jeffbolznv
Contributor Author

The CI-intel failures are on self-hosted systems, and I think those haven't installed the spirv-headers package. So this is what I was concerned about in the original description. We could continue down this path and just install it on that system, but I'm slightly worried about folks filing a ton of issues about build failures.

Contributor

@0cc4m 0cc4m left a comment

We could fetch it as a fallback, but building is already an "advanced" activity and I don't think it's a big problem to add a small dependency.

@jeffbolznv
Contributor Author

@rillomas I think you own the self-hosted Vulkan CI systems, can you install spirv-headers on them?

@rillomas
Contributor

@jeffbolznv I've installed spirv-headers on both Win/Linux instances. Can you try again?

@jeffbolznv
Contributor Author

Thanks, it's passing now.

Contributor

@0cc4m 0cc4m left a comment

LGTM. No CMake change needed at all?

@jeffbolznv
Contributor Author

Yeah, I didn't find a need for a CMake change. It seems the include paths are already there (from the Vulkan SDK on Windows, or /usr/include on Linux).

@0cc4m 0cc4m merged commit 1f30ac0 into ggml-org:master Apr 14, 2026
46 of 48 checks passed
gentoo-bot pushed a commit to gentoo/guru that referenced this pull request Apr 14, 2026
See: ggml-org/llama.cpp#21572

Signed-off-by: Craig Andrews <candrews@gentoo.org>
EZForever added a commit to EZForever/llama.cpp-builds that referenced this pull request Apr 14, 2026
apollosenvy pushed a commit to apollosenvy/llama-cpp-turboquant that referenced this pull request Apr 17, 2026
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR ggml-org#21572 (1f30ac0) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
#include "rte.glsl" in a shader whose header no longer exists, and
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
coutinhomarco pushed a commit to coutinhomarco/llama-cpp-turboquant that referenced this pull request Apr 18, 2026
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
…device supports it (ggml-org#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
TheTom pushed a commit to TheTom/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
vkhaitan added a commit to vkhaitan/vllama.cpp that referenced this pull request Apr 27, 2026
vkhaitan added a commit to vkhaitan/vllama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 9, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
Labels

devops improvements to build systems and github actions documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend
