
QVAC-18665: TurboQuant (Vulkan) rebased into 8189 #133

Open
jesusmb1995 wants to merge 6 commits into tetherto:temp-8189 from jesusmb1995:8189_tbq4

Conversation

@jesusmb1995 jesusmb1995 commented May 12, 2026

Brings the previously merged TurboQuant PR #115 to 8189.

Rebase & Fixes Summary

  • Critical Fixes: Fixes an f16 upstream P*V dropout regression and resolves divergent barriers in copy_to_quant.comp.
  • Testing: Added new test-backend-ops test leg (f16 FLASH_ATTN_EXT).
  • Performance & PPL Verified: Perplexity and throughput perfectly match PR 115 baselines across AMD Strix and NVIDIA 5090 tests. See numbers and details in the task comments: https://app.asana.com/1/45238840754660/project/1209610027650937/task/1214639981590135
  • The complete test-turboquant.sh suite was executed on both the 5090 and Strix.

AI Review

  • Score: Production ready (no functional regressions).
  • For the full detailed AI review, bug fix details, and performance test notes, please see this comment.

gianni-cor and others added 2 commits May 8, 2026 19:07
QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0)
fix TurboQuant Vulkan shaders after upstream rebase

Rename tq_utils.comp to tq_utils.glsl in every #include so it is treated as a header (not a standalone shader entry point) by the build, and update flash_attn.comp / flash_attn_base.glsl to use the new K_BLOCK_SIZE / V_BLOCK_SIZE split when the upstream BLOCK_SIZE macro is no longer set per-binding.

In flash_attn_cm1.comp restore the centroid-K gather and the V_BLOCK_SIZE>1 P*V loop to the cm1 thread layout (col_tid stride over Bc, d_local*threads_per_rowgroup+col_tid over HSV/4). In flash_attn_base.glsl guard the generic dequantize4 definitions by DATA_A_* type, introduce BINDING_IDX_K / BINDING_IDX_V, and gate the OV4 binding on D_TYPEV4. Drop the duplicated default argument on the ggml_vk_flash_attn_coopmat_shmem_support definition.

In vulkan-shaders-gen.cpp gate the _64 same-type and mixed K/V flash-attention emissions on the fp16 outer iteration so identical SPV blobs are not generated twice, thread D_TYPEV4=vec4 through every variant, and re-enable get_rows_<t> f16 emission. In test-backend-ops the V tensor in FA now uses type_V instead of type_KV.

share Psh staging between f16 and TurboQuant V in cm1 FA

In flash_attn_cm1.comp, fold the V_BLOCK_SIZE>1 path into the same Pf-staging-via-Psh layout used by f16 V. The fused dequant + scalar P*V loop now reads Pf back from Psh[c * psh_stride + tile_row(r) / 4] (one f16vec4 holding 4 owned rows) and iterates Bc / threads_per_rowgroup columns with col_tid stride and HSV/4 vec4 lanes via d = d_local*threads_per_rowgroup + col_tid. Lf is computed during Pf staging and Of needs no cross-thread reduction, so the existing subgroupAdd reduction at end of main() still applies.

Also fix QJL correction to index sfsh as a vec4 (r/4, r%4) since sfsh is ACC_TYPEV4.

make TBQ quantize early-exits uniform via subgroupBroadcast

Replace the early returns in copy_to_quant.comp quantize() (norm<1e-15, r_norm<1e-15) with if/else around uniform skip flags broadcast from lane 0, so all threads reach the subsequent barriers together. This keeps barrier() reachability uniform even on devices that synthesise multiple subgroups per workgroup.

Originally attempted as a fix for an RTX 5090 hang in the TBQ QJL pack, but that hang turned out to be caused by single-lane byte stores after subgroupBallot rather than divergent barriers (see follow-up patch). The control-flow restructure is kept on its own merits.

fix RTX 5090 hang in TBQ QJL byte pack

In copy_to_quant.comp, the SG_SIZE>=TQ_WG fast path had lid==0 alone emit all 16 uint8_t qjl[] SSBO stores across an [[unroll]] STRIDE loop following subgroupBallot. NVIDIA driver 595 on RTX 5090 (NV_coopmat2) hangs in this pattern at STRIDE>=3 (TBQ3_0 / TBQ4_0 with BK=128); STRIDE=2 (_64 variants) and the atomicOr stitch path were fine.

Unify both NSG==1 and NSG>1 paths to stage QJL bits into tq_sh_qjl[] and have lanes 0..QJL_WORDS-1 each emit one 32-bit word as four byte stores, matching the working stitch layout. Verified with tests/test-turboquant.sh --short and --full on the 5090.

Merge pull request ggml-org#115 from jesusmb1995/turboquant

QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0)

fix Vulkan FA cm1 dropping P*V on f16/f16 KV

The same-type f16 FLASH_ATTN_EXT variant is emitted without any DATA_A_* / DATA_K_* / DATA_V_* macro, so the per-type ladders in flash_attn_base.glsl never define K_BLOCK_SIZE / V_BLOCK_SIZE. The cm1 shader guards its f16 coopmat P*V section with `#if V_BLOCK_SIZE == 1`, which on an undefined macro evaluates to `0 == 1` and silently drops the entire P*V loop from the SPIR-V. Of is only rescaled by eM and never accumulates P*V, pushing Mistral-Q4_K_M perplexity to ~1609 on the f16/f16 row on Strix (gfx1151, Mesa 25.2, KHR_cooperative_matrix); scalar FA is unaffected because its f16 path is gated on BLOCK_SIZE, not V_BLOCK_SIZE, and nb=1 falls back to scalar anyway.

Add `#ifndef`-guarded defaults of 1 for K_BLOCK_SIZE / V_BLOCK_SIZE so the f16/f16 variant lands on the same numerical value the TBQ/PQ ladders already define for f16 K/V. Also add a tightly-filtered FLASH_ATTN_EXT leg (~8 cases, nb >= 2 to actually hit the coopmat1 path) to test-turboquant.sh, since the existing `-p "tbq|pq"` filter never matches the upstream f16/f16 case.
@jesusmb1995 changed the title from "turboquant rebase" to "QVAC-18665: TurboQuant rebased into 8189" on May 12, 2026
The Metal backend does not currently support TurboQuant (TBQ) or PolarQuant (PQ) types for standalone MUL_MAT and MUL_MAT_ID operations. Returning true in ggml_metal_device_supports_op caused test-backend-ops to assert and crash because the pipeline setup could not find the relevant kernels. Explicitly reject these types to fall back safely.
@jesusmb1995 (Author)

macOS-arm64 test-backend-ops failure is pre-existing on temp-8189 baseline, not caused by this PR.

The current failing job (run 25753678331, job 75636543172) reports:

36: [SOFT_MAX_BACK] ERR = 0.001221320 > 0.000000100   SOFT_MAX_BACK(type=f32,ne=[1023,15,1,1],scale=1.000000,max_bias=0.000000): FAIL
36: [SOFT_MAX_BACK] ERR = 0.000900075 > 0.000000100   SOFT_MAX_BACK(type=f32,ne=[1023,1023,1,1],scale=1.000000,max_bias=0.000000): FAIL
36: [SOFT_MAX_BACK] ERR = 0.002131379 > 0.000000100   SOFT_MAX_BACK(type=f32,ne=[1023,15,1,1],scale=0.100000,max_bias=0.000000): FAIL
36: [SOFT_MAX_BACK] ERR = 0.000943703 > 0.000000100   SOFT_MAX_BACK(type=f32,ne=[1023,1023,1,1],scale=0.100000,max_bias=0.000000): FAIL
36:   Backend MTL0: FAIL

The previous "Asserting on type 40" (TBQ/PQ on Metal mul_mat) abort that motivated the recent Metal exclusion patch is gone — the test now runs to completion and surfaces this separate, unrelated Metal SOFT_MAX_BACK numerical-accuracy regression.

The exact same four SOFT_MAX_BACK failures, with the same magnitudes, already appear on PR #131 (run 25588996013, job 75123047326), which targets temp-8189 and contains no TurboQuant changes. Likely root cause is one of the upstream Metal commits pulled in during the temp-8189 rebase (e34743a9b Add SOFT_MAX_BACK metal kernel / 5fa6d7787 metal: port OUT_PROD, SILU_BACK, SOFT_MAX_BACK, RMS_NORM_BACK ops to split architecture).

So: the Metal patch in this PR fixed the type-40 abort it was meant to fix; the remaining macOS-arm64 failure is a temp-8189 baseline issue and should be addressed in a separate PR (or by the temp-8189 maintainer) rather than in this rebase.

@jesusmb1995 (Author)

ubuntu-24-cmake-webgpu failure is a pre-existing flake on temp-8189, unrelated to this PR.

The current failure (run 25753678331, job 75636543611) is:

The following tests FAILED:
     32 - test-thread-safety (Timeout)

test-thread-safety runs the WebGPU backend through llvmpipe (Mesa software rasterizer) and hangs mid-generation, hitting the 3600s ctest timeout. No assertion, no error — just a hang.

The exact same failure (test-thread-safety Timeout on ubuntu-24-cmake-webgpu) reproduces on other recent PRs targeting temp-8189 that contain no TurboQuant changes.

Other PRs on the same base pass (#131, #128), confirming flake rather than deterministic regression. Likely root cause is upstream WebGPU/llvmpipe concurrency in the temp-8189 rebase, not anything introduced here.

Recommend a re-run of this job; not a TBQ-side blocker.

All flash_attn shader variants emitted by vulkan-shaders-gen.cpp pass D_TYPEV4=vec4, and every consumer of data_ov4 in flash_attn{,_cm1}.comp and flash_attn_base.glsl::gqaStore is unconditional. The #ifdef was decorative and could mislead future variants into thinking the OV4 alias is optional. Declare it unconditionally to match actual usage.

jesusmb1995 commented May 12, 2026

AI Review (multiple passes)

Critical

None

High

None

Medium

None

Low

  • [ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp ~ lines 935-947] Indentation inside if (!is_tq_fht) { ... } for the get_rows block is misleading: the if (tname == "f16") / else / final string_to_spv("get_rows_..._f32", ...) lines sit at the outer for-loop indentation, but the closing } on line 946 actually closes the if (!is_tq_fht) (because braces ignore indentation). Functionally correct today, but a strong footgun for future edits — adding a line "outside" the if (visually) would still land inside it. Worth re-indenting before merge. Acceptable as-is, though, since this edit prioritizes minimizing upstream conflicts over style.
  • [ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_cm2.comp lines 142-151] For the upstream same-type Q4_0/Q8_0/TBQ paths, both the legacy #if BLOCK_SIZE > 1 block and the new #if defined(K_BLOCK_SIZE) && K_BLOCK_SIZE > 1 block fire and call setTensorLayoutBlockSizeNV on the same tensor twice with the same value. Harmless (idempotent) but redundant. The legacy block is also dead code under PR 133 because K_BLOCK_SIZE/V_BLOCK_SIZE are always ≥ BLOCK_SIZE for any case where BLOCK_SIZE > 1. Could be cleaned up to a single source of truth, but keeping the original block as-is causes fewer conflicts with upstream.

Notes

  • [src/llama-graph.cpp] The upstream assert GGML_ASSERT(v_mla == nullptr); was relaxed to GGML_ASSERT(!(inp->self_rotk || inp->self_rotv) || v_mla == nullptr); (currently at line 2078) during the rebase to accommodate MLA. This is correct and handled properly. The other build_attn overload around line 2245 has no equivalent assert and unconditionally calls build_attn_mha(..., v_mla, ...), but this asymmetry is pre-existing and identical between PR 115 and PR 133, so not a regression.
  • [ggml/src/ggml-vulkan/vulkan-shaders/flash_attn.comp] The centroid trick implementation is identical to PR 115. It appeared in the diff comparison solely because surrounding variable names and macros (like K_BLOCK_SIZE) were updated during the rebase. The underlying mathematical behavior remains unchanged and correct.
  • [ggml/src/ggml-vulkan/ggml-vulkan.cpp] The previous pass claimed the is_coopmat1_fa_type(kv_type) type_ok filter was added in PR 133. That was incorrect — the filter is byte-identical between PR 115 (commit 3840e11d5, lines 3094-3110) and PR 133. No change here.
  • [ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp] The if (norm < 1e-15) return; early-out in PR 115 was restructured in PR 133 into an if (skip_quant) {} else {} plus if (!skip_quant) { /* TQ_NORM_CORRECTION + QJL */ } block, with skip_quant = subgroupBroadcast(norm < 1e-15, 0u). Because norm = sqrt(tq_wg_add(local_sq)) is already workgroup-uniform, the subgroupBroadcast is functionally redundant but provides a compiler hint that makes the inner barrier() calls provably subgroup-uniform on backends that lack good uniformity analysis. Same barrier()-inside-conditional pattern existed in PR 115 (after the early return); not a regression.
  • [ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp] PR 133 deliberately stops emitting get_rows_tbq3_0 / get_rows_tbq4_0 / get_rows_pq3_0 / get_rows_pq4_0 (the if (!is_tq_fht) guard at line 936). The comment explains: element-wise dequant doesn't apply for TBQ/PQ since they need an FHT pass. PR 115 emitted those shader variants but they were never referenced from ggml-vulkan.cpp (verified with grep — no get_rows_tbq* / get_rows_pq* references exist). Intentional and safe behavior change.
  • [ggml/src/ggml-vulkan/vulkan-shaders/flash_attn_base.glsl lines 95-97] PR 133 added #ifndef BLOCK_SIZE → #define BLOCK_SIZE 1 near the top of the file, in addition to the existing #ifndef K_BLOCK_SIZE → 1 and #ifndef V_BLOCK_SIZE → 1 fallbacks lower down. This is the actual fix for the upstream same-type f16 regression that motivated the new test-leg in test-turboquant.sh — without these defaults, #if BLOCK_SIZE == 1 / #if K_BLOCK_SIZE == 1 style guards in flash_attn_cm1.comp (e.g. the V_BLOCK_SIZE == 1 guard around the f16 PV coopmat path) silently evaluate as false when the macro is undefined and drop the entire PV section from the SPIR-V.
  • [ggml/src/ggml-vulkan/ggml-vulkan.cpp lines 425-447 and 10169] vk_fa_pipeline_state gained a v_type field (default-initialised to GGML_TYPE_F16 in the constructor and overwritten with v->type at dispatch time). It participates in operator< so that mixed-K/V variants live as separate map entries under the same K-keyed pipeline bucket. Verified: the CREATE_FA and CREATE_FA_MIXED macros correctly filter by fa.first.v_type, and same-type Q4_0/Q8_0/TBQ paths still resolve to their own v_type bucket since v->type is assigned before pipelines.find(...). No issues found here.

Summary

The rebased PR 133 introduces critical fixes over PR 115 (specifically the #ifndef BLOCK_SIZE/K_BLOCK_SIZE/V_BLOCK_SIZE → 1 fallbacks in flash_attn_base.glsl that fix the f16 upstream P*V dropout regression, and the subgroupBroadcast for divergent barriers in copy_to_quant.comp). Re-review found no Critical/High/Medium regressions. The two remaining Low-severity stylistic / code-hygiene issues (misleading indentation in vulkan-shaders-gen.cpp, redundant setTensorLayoutBlockSizeNV call in flash_attn_cm2.comp) are intentionally preserved, as fixing them would increase merge conflicts with upstream without providing functional benefits. The decorative #ifdef D_TYPEV4 guard in flash_attn_base.glsl was dropped in this PR (unconditional OV4 declaration), since every generator call site already passes D_TYPEV4=vec4 and every consumer is unconditional.

Score

4/4 perfect (no functional regressions; 2 minor Low-severity hygiene items left intentionally to minimize upstream conflicts)

@jesusmb1995 jesusmb1995 marked this pull request as ready for review May 12, 2026 20:15
@jesusmb1995 jesusmb1995 requested a review from a team as a code owner May 12, 2026 20:15
@jesusmb1995 jesusmb1995 self-assigned this May 12, 2026
@gianni-cor (comment minimized)

The homogeneous TBQ/PQ scalar FA pipelines (e.g. pq3_0/pq3_0) and all mixed K/V variants are only registered inside the device->fp16 branch, but supports_op advertised them regardless of fp16. On an fp16-less device (or with GGML_VK_DISABLE_F16=1) dispatch hit an uninitialized lazy pipeline and aborted with Br == pipeline->wg_denoms[0]. Gate the op on device->fp16 so the scheduler falls back to CPU instead.

Add a focused homogeneous-PQ cluster to test-backend-ops and a dedicated GGML_VK_DISABLE_F16=1 leg to test-turboquant.sh so the regression is caught without needing real fp16-less hardware.
copy_to_quant.comp declares SG_SIZE with constant_id=1 because id 0 is taken by norepeat in generic_binary_head.glsl. The host-side comment in ggml_vk_load_shaders still said constant_id=0, which would mislead anyone tracing the spec map. Update the comment to match the shader.
@jesusmb1995 (comment minimized)

@jesusmb1995 changed the title from "QVAC-18665: TurboQuant rebased into 8189" to "QVAC-18665: TurboQuant (Vulkan) rebased into 8189" on May 13, 2026