vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it #21572

Merged
0cc4m merged 6 commits into ggml-org:master from jeffbolznv:rte_spirv on Apr 14, 2026

Conversation

@jeffbolznv
Contributor

Overview

Replace the current RTE16 handling with something that applies to all shaders.

Additional information

I don't have easy access to a Turing system where the lack of RTE16 would cause a failure, but it should be immediately obvious in CI if this is broken.

I was disappointed to find that there doesn't seem to be a good way to include spirv.hpp. It's in a different location in Windows and Linux Vulkan SDK installs, and for Linux without the Vulkan SDK I don't think it's available with the current packages we assume are installed (though I'm not confident of this). Is there a better way?
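For illustration, the runtime patch the title describes can be sketched roughly like this. This is a hypothetical, self-contained Python sketch, not the PR's actual C++ code; the opcode and enum values are hardcoded from spirv.hpp, and `add_rte_fp16` and its structure are mine:

```python
import struct

# Opcode and enum values hardcoded from spirv.hpp so the sketch has no
# dependency on SPIRV-Headers.
SPIRV_MAGIC       = 0x07230203
OP_ENTRY_POINT    = 15
OP_EXECUTION_MODE = 16
OP_CAPABILITY     = 17
MODE_ROUNDING_MODE_RTE = 4462  # spv::ExecutionModeRoundingModeRTE
CAP_ROUNDING_MODE_RTE  = 4467  # spv::CapabilityRoundingModeRTE

def add_rte_fp16(spv: bytes) -> bytes:
    """Patch a SPIR-V binary so every entry point rounds fp16 RTE."""
    words = list(struct.unpack(f"<{len(spv) // 4}I", spv))
    assert words[0] == SPIRV_MAGIC
    cap_end = 5          # instructions start after the 5-word header
    entry_ids, entry_end = [], 5
    i = 5
    while i < len(words):
        word_count = words[i] >> 16
        opcode = words[i] & 0xFFFF
        if opcode == OP_CAPABILITY:
            cap_end = i + word_count
        elif opcode == OP_ENTRY_POINT:
            entry_ids.append(words[i + 2])  # word 2 holds the entry point <id>
            entry_end = i + word_count
        i += word_count
    # Execution modes belong immediately after the entry-point section;
    # splice in "OpExecutionMode %entry RoundingModeRTE 16" per entry point.
    modes = []
    for eid in entry_ids:
        modes += [(4 << 16) | OP_EXECUTION_MODE, eid, MODE_ROUNDING_MODE_RTE, 16]
    words[entry_end:entry_end] = modes
    # The capability must be declared in the (leading) capability section;
    # inserting at the earlier offset last keeps both offsets valid.
    words[cap_end:cap_end] = [(2 << 16) | OP_CAPABILITY, CAP_ROUNDING_MODE_RTE]
    return struct.pack(f"<{len(words)}I", *words)
```

On SPIR-V versions below 1.4 an OpExtension declaration for SPV_KHR_float_controls would also be needed, and a real implementation would gate all of this on the device reporting shaderRoundingModeRTEFloat16.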

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, I used Claude to implement most of this change, with some directed bugfix suggestions and manual edits.

@jeffbolznv jeffbolznv requested a review from a team as a code owner April 7, 2026 17:25
@jeffbolznv
Contributor Author

The second commit uses FetchContent to get SPIRV-Headers. I guess this should be OK; other ggml backends also use FetchContent, and it's a small header-only dependency.
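For reference, a minimal FetchContent sketch along these lines (the declaration below, including the tag, is assumed rather than copied from the PR; SPIRV-Headers does export the SPIRV-Headers::SPIRV-Headers interface target):

```cmake
include(FetchContent)
FetchContent_Declare(
    spirv-headers
    GIT_REPOSITORY https://github.com/KhronosGroup/SPIRV-Headers.git
    GIT_TAG        vulkan-sdk-1.4.341.0   # hypothetical pin
)
FetchContent_MakeAvailable(spirv-headers)

# The project exports an interface target carrying its include dirs.
target_link_libraries(ggml-vulkan PRIVATE SPIRV-Headers::SPIRV-Headers)
```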

@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Apr 7, 2026
@jeffbolznv
Contributor Author

I don't know what's going on with the CI failure. Last time I had a failure in llama-save-load-state it was due to different shaders being used in each run (due to fusion issues). I don't see how this change would have triggered a similar issue. I haven't been able to reproduce it locally on a 4070.

@0cc4m
Contributor

0cc4m commented Apr 10, 2026

I like it; it's cleaner. I can compile and run it successfully, but I don't know if I have hardware that would fail the tests without RTE. This also needs a rebase.

@masamaru-san

This PR works excellently for llama.cpp 👍, but it causes issues in projects that use GGML as a submodule, such as stable-diffusion.cpp, because CMake's FetchContent contaminates the parent work tree.

Adding only the necessary spv definitions to ggml-vulkan.cpp would avoid FetchContent and prevent an increase in external dependencies. Would you consider doing this?

@jeffbolznv
Contributor Author

I've rebased it.

I'm not sure what to do about the FetchContent concern. Manually defining the SPIR-V enums we need is possible, but there may be other dependencies we want to fetch at some point in the future, and I'd like to have a real solution.

I think the other main way of getting dependencies is via submodules, but I'm not sure how that interacts with ggml conditionally including ggml-vulkan, and it adds a manual step to clone/fetch the submodules.

@masamaru-san

To be honest, I find it too difficult to use CMake and Git submodules properly. When I first applied this PR to ggml under stable-diffusion.cpp, I couldn't figure out what was happening in the working tree.
After asking Copilot AI and going through some trial and error, I found that FetchContent was the cause, but I still don't understand anything beyond that.

However, I am concerned that the spirv.hpp file version obtained via FetchContent might not be compatible with the Vulkan SDK or glslc used in each user's environment -- at least for the LunarG Vulkan SDK for Windows.
(The AI also said that spirv.hpp is sometimes vendor-specific, e.g. for Android -- is that true?)

@jeffbolznv
Contributor Author

What was the actual error you saw? From what I can tell, the main concern with FetchContent is conflicting versions being fetched. But stable-diffusion.cpp doesn't fetch spirv-headers, so I don't see why it would cause a problem.

spirv.hpp is generally backward compatible (not sure if it's 100% guaranteed, but pretty close), and I'm not aware of anything else in llama.cpp/ggml/sd depending on it. So IMO, it's more of a theoretical concern for some other project that may include ggml.

@masamaru-san

Oh, thanks for the quick reply!

Here's the specific issue I encountered:
The deps directory and its contents, retrieved via FetchContent, are created in the build directory (i.e. out/build/x64-Release/).
That directory is supposed to be ignored by .gitignore, but for some reason it ends up being added to the working tree.

There might be some kind of mistake in the CMakeLists.txt file for the parent stable-diffusion.cpp, but I can't tell for sure.

@jeffbolznv
Contributor Author

Hmm, it is supposed to be fetched to the build directory. But I don't know why it's not getting ignored on your system.

@masamaru-san

If this is just an issue with my specific environment (as is often the case with Windows), I'll handle it myself within my local repo, so please don't worry about it.

@0cc4m
Contributor

0cc4m commented Apr 13, 2026

I was disappointed to find that there doesn't seem to be a good way to include spirv.hpp. It's in a different location in windows and linux Vulkan SDK installs, and for linux without the Vulkan SDK I don't think it's available with the current packages we assume are installed (though I'm not confident of this). Is there a better way?

It might be okay to patch the location for Windows and Linux if it's in different places in the SDK, but shouldn't CMake be able to resolve that? I can't easily look inside the Windows SDK since it's an exe.

On Fedora I can install spirv-headers-devel and get the file in /usr/include/spirv/unified1/spirv.hpp. On Arch and Ubuntu I can install spirv-headers and get the same file. On the Linux SDK I get it in x86_64/include/spirv/unified1/spirv.hpp and it also provides a cmake file for it.
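If CMake were to resolve the differing locations, a sketch might look like the following (hypothetical; the header subdirectory names are taken from the paths discussed in this thread, spirv/unified1 on Linux and spirv-headers in the Windows SDK's Include directory):

```cmake
# Search the Vulkan SDK (if set) and the usual system include dirs for
# either layout of spirv.hpp, then expose the result as an include path.
find_path(SPIRV_HEADERS_INCLUDE_DIR
    NAMES spirv/unified1/spirv.hpp spirv-headers/spirv.hpp
    HINTS $ENV{VULKAN_SDK}/include $ENV{VULKAN_SDK}/Include
)
if (SPIRV_HEADERS_INCLUDE_DIR)
    target_include_directories(ggml-vulkan PRIVATE ${SPIRV_HEADERS_INCLUDE_DIR})
else()
    message(FATAL_ERROR "spirv.hpp not found; install spirv-headers (or spirv-headers-devel)")
endif()
```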

@jeffbolznv
Contributor Author

Just to make sure I understand what you're suggesting: remove FetchContent, ifdef the include to use a different path for Windows vs. Linux, and update the build guide to tell folks to install spirv-headers or spirv-headers-devel?

@0cc4m
Contributor

0cc4m commented Apr 13, 2026

Yes, basically. Or is there a good reason not to? I can't check how it works on Windows, currently.

@jeffbolznv
Contributor Author

On Windows I have the file in C:\vulkansdk\1.4.341.1\Include\spirv-headers. OK, I'll try this change.

@jeffbolznv jeffbolznv requested a review from ngxson as a code owner April 13, 2026 10:29
@github-actions github-actions Bot added documentation Improvements or additions to documentation devops improvements to build systems and github actions labels Apr 13, 2026
@jeffbolznv jeffbolznv requested a review from a team as a code owner April 13, 2026 11:45
@jeffbolznv
Contributor Author

The CI-intel failures are on self-hosted systems, and I think those haven't installed the spirv-headers package. So this is what I was concerned about in the original description. We could continue down this path and just install it on that system, but I'm slightly worried about folks filing a ton of issues about build failures.

Contributor

@0cc4m 0cc4m left a comment

We could fetch it as a fallback, but building is already an "advanced" activity and I don't think it's a big problem to add a small dependency.

@jeffbolznv
Contributor Author

@rillomas I think you own the self-hosted Vulkan CI systems, can you install spirv-headers on them?

@rillomas
Contributor

@jeffbolznv I've installed spirv-headers on both Win/Linux instances. Can you try again?

@jeffbolznv
Contributor Author

Thanks, it's passing now.

Contributor

@0cc4m 0cc4m left a comment

LGTM. No CMake change needed at all?

@jeffbolznv
Contributor Author

Yeah, I didn't find a need for a CMake change. It seems the include paths are already there (from the Vulkan SDK on Windows, or /usr/include on Linux).

@0cc4m 0cc4m merged commit 1f30ac0 into ggml-org:master Apr 14, 2026
46 of 48 checks passed
gentoo-bot pushed a commit to gentoo/guru that referenced this pull request Apr 14, 2026
See: ggml-org/llama.cpp#21572

Signed-off-by: Craig Andrews <candrews@gentoo.org>
EZForever added a commit to EZForever/llama.cpp-builds that referenced this pull request Apr 14, 2026
apollosenvy pushed a commit to apollosenvy/llama-cpp-turboquant that referenced this pull request Apr 17, 2026
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR ggml-org#21572 (1f30ac0) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
#include "rte.glsl" in a shader whose header no longer exists, and
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
coutinhomarco pushed a commit to coutinhomarco/llama-cpp-turboquant that referenced this pull request Apr 18, 2026
mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
…device supports it (ggml-org#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
TheTom pushed a commit to TheTom/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
vkhaitan added a commit to vkhaitan/vllama.cpp that referenced this pull request Apr 27, 2026
vkhaitan added a commit to vkhaitan/vllama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
Jcfunk pushed a commit to Jcfunk/llama.cpp that referenced this pull request May 9, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
Labels

devops improvements to build systems and github actions documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend
