vulkan: fix and complete turbo3 KV cache support #62

Merged
TheTom merged 2 commits into TheTom:feature/turboquant-kv-cache from Titaniumtown:pr/turboquant-vulkan-work on Apr 9, 2026

Conversation

@Titaniumtown

Overview

The PR for initial Vulkan support (#33) was not fully working for me: it produced gibberish output on Gemma 4. I put Claude in a closed loop with real hardware to solve the problem; it based its work on the existing CUDA kernel.

Things done:

  • Fix block size: 32 -> 128
  • Add WHT shader (a reference sketch of the transform follows below)
  • Add flash attention support
  • Fix dequantization for turbo3 on Vulkan
  • Rewrite the SET_ROWS path to be correct and 12.7x faster (on an A380)

This PR also adds some tests for vulkan + turbo3 that would've caught the issues in #33.
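
For context, the WHT here is the standard fast Walsh-Hadamard butterfly applied per head. A minimal scalar reference in C++ (an illustrative sketch only; the actual implementation is the GLSL compute shader added in this PR, and head_dim is assumed to be a power of two):

#include <cstddef>
#include <vector>

// In-place unnormalized fast Walsh-Hadamard transform of size n (power of two).
// Applying it twice and dividing by n recovers the input, which is what the
// forward/inverse round-trip tests check.
void wht_inplace(std::vector<float> &v) {
    const size_t n = v.size();
    for (size_t len = 1; len < n; len <<= 1) {
        for (size_t i = 0; i < n; i += len << 1) {
            for (size_t j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
}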

Additional information

#33

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Claude Opus 4.6 was used in a closed loop with real hardware.

@TheTom
Owner

TheTom commented Apr 9, 2026

Thanks for the Vulkan fix and the test coverage — built and ran locally on M5 Max (Metal), and the Vulkan-only code is properly isolated, with no impact on CUDA or Metal at runtime. The expanded FA test matrix (adding TURBO3_0 and hsk=128 for non-F16 types) is a win: all 6256 FA tests pass on Metal.

One blocker before merging: the 27 new TURBO_WHT and TURBO_WHT_ROUNDTRIP tests fail on Metal because the tolerance is too tight. Errors land in the 3.25e-7 to 1.015e-6 range, but the default NMSE threshold is 1e-7, which is tighter than f32 SIMD reduction precision on GPU.

[TURBO_WHT] ERR = 0.000000325 > 0.000000100   TURBO_WHT(head_dim=128,n_heads=1,direction=0): FAIL
[TURBO_WHT] ERR = 0.000000387 > 0.000000100   TURBO_WHT(head_dim=128,n_heads=4,direction=0): FAIL
...
  0/27 tests passed
  Backend MTL0: FAIL

The math is correct; this is just a tolerance problem. You already did the right thing on test_set_rows_turbo3 (max_nmse_err() = 0.05 because turbo3 is lossy); the pure WHT tests just need the same treatment:

struct test_turbo_wht : public test_case {
    // ...
    double max_nmse_err() override {
        return 1e-5;  // f32 SIMD reduction precision on GPU backends
    }
};

struct test_turbo_wht_roundtrip : public test_case {
    // ...
    double max_nmse_err() override {
        return 1e-5;
    }
};

1e-5 is conservative — you could probably go tighter if you want. Once that's in, happy to merge.
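
For reference, NMSE here is the squared error normalized by the reference signal's energy. A minimal sketch of the metric (not the exact test-backend-ops implementation):

#include <cstddef>

// Normalized mean squared error: sum((a - b)^2) / sum(a^2).
// A 1e-5 threshold on this metric admits the ~1e-6 worst-case Metal errors
// above while still rejecting genuinely wrong results.
double nmse(const float *a, const float *b, size_t n) {
    double err = 0.0, ref = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)a[i] - (double)b[i];
        err += d * d;
        ref += (double)a[i] * (double)a[i];
    }
    return ref > 0.0 ? err / ref : 0.0;
}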

Minor side note (not a blocker): the 22 new SET_ROWS_TURBO3 tests report "not supported" on Metal and get skipped (0/0 tests passed on that backend). Metal clearly supports the op in production via the kernel_set_rows_turbo3_i32/i64 templates, so the test leaf-tensor setup is probably mismatched against what Metal's supports_op check expects. Worth investigating separately so the tests actually exercise Metal, but it doesn't block this PR.

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes (see the sketch after this list). Large tensors exercise
  the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
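
Conceptually, each test_set_rows_turbo3 config is a quantize-then-dequantize round trip compared under a lossy tolerance. A self-contained sketch of that shape (the symmetric int8 quantizer below is a toy stand-in, not the actual turbo3 format):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round-trip check in the shape of test_set_rows_turbo3: encode, decode,
// then compare under a lossy NMSE tolerance (0.05 in the actual test).
bool roundtrip_ok(const std::vector<float> &src, double tol) {
    float amax = 0.0f;
    for (float x : src) amax = std::max(amax, std::fabs(x));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;

    std::vector<float> out(src.size());
    for (size_t i = 0; i < src.size(); ++i) {
        const int8_t q = (int8_t)std::lround(src[i] / scale);  // lossy encode
        out[i] = q * scale;                                    // dequantize
    }

    double err = 0.0, ref = 0.0;  // NMSE = sum((src-out)^2) / sum(src^2)
    for (size_t i = 0; i < src.size(); ++i) {
        const double d = (double)src[i] - (double)out[i];
        err += d * d;
        ref += (double)src[i] * (double)src[i];
    }
    return ref == 0.0 || err / ref <= tol;
}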
@Titaniumtown Titaniumtown force-pushed the pr/turboquant-vulkan-work branch from 81b7019 to 6a29b58 on April 9, 2026 at 16:39
@Titaniumtown
Author

Let me know if that works. I don't have any Apple devices to test on.

@TheTom
Owner

TheTom commented Apr 9, 2026

Verified on M5 Max (Metal):

  • TURBO_WHT: 27/27 pass
  • TURBO_WHT_ROUNDTRIP: pass ✅
  • FLASH_ATTN_EXT: 6256/6256 pass

The 1e-5 tolerance is the right call — actual errors land around 3-5e-7, so you've got plenty of headroom. Thanks for the quick turnaround, merging now.

Note: SET_ROWS_TURBO3 still reports "not supported" on Metal (skipped 0/0). Not a blocker since Metal clearly supports the op in production, but worth a separate look later if you want the tests to actually exercise Metal.

@TheTom TheTom merged commit 8590cbf into TheTom:feature/turboquant-kv-cache Apr 9, 2026
1 check passed
apollosenvy pushed a commit to apollosenvy/llama-cpp-turboquant that referenced this pull request Apr 17, 2026
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR ggml-org#21572 (1f30ac0) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR TheTom#62 (ff8bb73) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
#include "rte.glsl" in a shader whose header no longer exists, and
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues TheTom#50, TheTom#64, TheTom#81
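
The FA_COOPMAT1 failure described above comes down to looking up a pipeline that was never registered and asserting on its zeroed workgroup denominators. Schematically, in illustrative C++ (not the actual ggml-vulkan structures):

#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Pipelines are registered per (quant type, FA path). If turbo3 is wired up
// only for FA_SCALAR but the tuning heuristic picks FA_COOPMAT1, the lookup
// yields a value-initialized pipeline with wg_denoms {0,0,0} and the
// dispatch-time assertion fires.
enum fa_path { FA_SCALAR, FA_COOPMAT1, FA_COOPMAT2 };
struct pipeline { std::array<uint32_t, 3> wg_denoms {0, 0, 0}; };

int main() {
    std::map<std::pair<int, fa_path>, pipeline> pipelines;
    const int TYPE_TURBO3_0 = 0;  // stand-in type id
    pipelines[{TYPE_TURBO3_0, FA_SCALAR}] = pipeline{{32, 1, 1}};  // only path registered

    const pipeline &p = pipelines[{TYPE_TURBO3_0, FA_COOPMAT1}];   // default-constructed entry
    const uint32_t Br = 32;
    assert(Br == p.wg_denoms[0]);  // trips: 32 != 0
    return 0;
}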
coutinhomarco pushed a commit to coutinhomarco/llama-cpp-turboquant that referenced this pull request Apr 18, 2026
TheTom pushed a commit that referenced this pull request Apr 22, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026
sbaier1 pushed a commit to sbaier1/llama-cpp-turboquant that referenced this pull request May 8, 2026