QVAC-14555: TurboQuant (Vulkan): KV cache quantization (TBQ3_0 / TBQ4_0 / PQ3_0 / PQ4_0) #115
Merged
Conversation
Are you planning to merge this before the rebase to the latest version of llama.cpp?
Author
Edit: @zoq Since it seems we want this merged in about 1-2 weeks, I would target this version for now. Yes, planning to merge this before the rebase.
gianni-cor reviewed Apr 24, 2026
Fix a latent correctness bug in the TurboQuant / PolarQuant copy_to_quant
cooperative shader that silently produces wrong bytes on any device whose
gl_SubgroupSize is less than the 32-thread workgroup (Intel Xe/Arc at 8/16,
ARM Mali 4/8/16, some Adreno configurations). Make the path cover every
supported subgroup size, plumb a runtime knob for testing, and add a
dedicated test suite with both real-hardware and software-Vulkan coverage.
Motivation
----------
The original copy_to_quant.comp TBQ/PQ path uses subgroupAdd() for the
per-block norm reductions and subgroupBallot() for the QJL sign-bit sketch,
assuming gl_SubgroupSize == 32 (= the workgroup size). On devices where the
native subgroup is smaller, those ops reduce only within a subgroup, not the
whole workgroup, so each subgroup sees its own partial sum and the output
bytes become whatever the first-subgroup partial happened to produce. The
SET_ROWS path has the same issue. The bug does not reproduce on most
production GPUs (NVIDIA fixed-32, AMD RDNA 32/64, Apple 32) but bites Intel
and several mobile GPUs.
Shader changes (copy_to_quant.comp)
-----------------------------------
* New specialization constant SG_SIZE at constant_id = 1 (slot 0 is already
used by generic_binary_head.glsl's `norepeat` in the SET_ROWS path).
Defaults to 32 so hosts that pass no spec info get the original shader.
* TQ_WG fixed at 32 (the workgroup size); NSG = TQ_WG / SG_SIZE is the
number of subgroups per workgroup.
* New helper tq_wg_add(x): if NSG == 1 (SG_SIZE >= TQ_WG) it returns
  subgroupAdd(x) -- identical to the original fast path and
  dead-code-eliminated by spec-constant folding; if NSG > 1 the
  per-subgroup subgroupAdd results are written to shared memory
  (tq_sh_red) and stitched with an [[unroll]]-ed sum (see the sketch
  after this list). Replaces every subgroupAdd() in the
  TBQ/PQ/norm-correction paths.
* QJL sign-bit pack: when SG_SIZE >= TQ_WG the original subgroupBallot
fast path runs; when SG_SIZE < TQ_WG it falls back to atomicOr into a
shared uint array and a serial write-out. Same fast-path guard lets
specialization fold the slow branch away when SG_SIZE == 32.
* SG_SIZE > TQ_WG (e.g. AMD wave64 with WG=32) is treated as NSG == 1
  by clamping SG_SIZE to TQ_WG inside tq_wg_add, so those devices take
  the fast path even though half the wave is masked off.
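A minimal host-side C++ model of the tq_wg_add stitch described above (a sketch only — tq_wg_add_model and the lane array are illustrative stand-ins, not the GLSL shader source):

```cpp
#include <algorithm>
#include <array>
#include <numeric>

constexpr unsigned TQ_WG = 32; // fixed workgroup size, as above

// One workgroup's reduction: each "subgroup" of sg_size lanes produces a
// partial sum (the per-subgroup subgroupAdd), partials land in shared
// memory (tq_sh_red), and an unrolled loop stitches them together.
float tq_wg_add_model(const std::array<float, TQ_WG> &lanes, unsigned sg_size) {
    const unsigned nsg = TQ_WG / std::min(sg_size, TQ_WG); // SG_SIZE > TQ_WG -> NSG == 1
    if (nsg == 1) {
        // Fast path: a single subgroupAdd already covers the whole workgroup.
        return std::accumulate(lanes.begin(), lanes.end(), 0.0f);
    }
    float tq_sh_red[TQ_WG / 4] = {}; // worst case SG_SIZE = 4 -> 8 partials
    for (unsigned sg = 0; sg < nsg; ++sg) {
        for (unsigned lane = 0; lane < sg_size; ++lane) {
            tq_sh_red[sg] += lanes[sg * sg_size + lane]; // per-subgroup subgroupAdd
        }
    }
    float sum = 0.0f;
    for (unsigned sg = 0; sg < nsg; ++sg) { // the [[unroll]]-ed stitch
        sum += tq_sh_red[sg];
    }
    return sum;
}
```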
Host plumbing (ggml-vulkan.cpp)
-------------------------------
* vk_device_struct grows a tbq_copy_sg_size field (0 = no override).
* Device init reads GGML_VK_TBQ_COPY_SG_SIZE from env, validates against
{4, 8, 16, 32, 64} intersected with the device's
[subgroup_min_size, subgroup_max_size], and emits a structured
"tbq_copy_sg_size_status requested=R applied=A reason=X" line so tests
can tell whether the override was applied or rejected (distinct from
success/failure of the run itself).
* ggml_vk_load_shaders picks the (SG_SIZE spec const, requiredSubgroupSize)
pair used for every CPY-to-quant and SET_ROWS-to-quant pipeline:
- if the env override is set: that value
- else if the device supports size control: mul_mat_subgroup_size
- else: 0 (shader default SG_SIZE=32, no required size) -- matches
pre-patch behaviour on drivers without VK_EXT_subgroup_size_control.
The two-element spec-const vector is {0, SG_SIZE} for the plain CPY
path (slot 0 is ignored by generic_unary_head.glsl) and {1, SG_SIZE}
for SET_ROWS (slot 0 is `norepeat`, always 1).
* Adds a device-selection opt-in GGML_VK_ALLOW_CPU_DEVICES=1 so tests can
pick up software Vulkan ICDs (lavapipe, SwiftShader) that ggml-vulkan
normally filters out. Production code never sets this env var and the
behaviour is unchanged when it isn't set.
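A hedged sketch of the env-override plumbing described above (the helper name and reason strings are illustrative; only the env var name and the shape of the status line come from this PR):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Returns the subgroup-size override to store in tbq_copy_sg_size, or 0 for
// "no override". sg_min/sg_max are the device's subgroup_min_size /
// subgroup_max_size.
static uint32_t tbq_copy_sg_size_from_env(uint32_t sg_min, uint32_t sg_max) {
    const char *env = getenv("GGML_VK_TBQ_COPY_SG_SIZE");
    if (env == nullptr) {
        return 0;
    }
    const uint32_t requested = (uint32_t) strtoul(env, nullptr, 10);
    const bool value_ok = requested == 4  || requested == 8  ||
                          requested == 16 || requested == 32 || requested == 64;
    const bool range_ok = requested >= sg_min && requested <= sg_max;
    const uint32_t applied = (value_ok && range_ok) ? requested : 0;
    // Structured status line the test suite parses:
    fprintf(stderr, "tbq_copy_sg_size_status requested=%u applied=%u reason=%s\n",
            requested, applied,
            applied   ? "applied" :
            !value_ok ? "invalid_value" : "outside_device_range");
    return applied;
}
```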
New test (tests/test-copy-tbq-subgroups.cpp + CMakeLists)
---------------------------------------------------------
Self-spawning C++ test that for each (SG in {0, 4, 8, 16, 32, 64}, type,
shape) triple runs GPU quantize, compares against a CPU
ggml_quantize_chunk reference, and reports byte-mismatch + dequant NMSE
+ throughput. Key design choices:
* Self-spawn (popen of --child N with a different
GGML_VK_TBQ_COPY_SG_SIZE value per child) because the env var is
consumed once at device init and can only be changed across processes.
* Parses the structured status line from the backend to distinguish
"applied" from "rejected" rows. Rejected rows are labelled
SKIP-<reason> in the per-case table and excluded from the
NMSE-spread assertion (they are duplicates of sg=0 and don't add
independent coverage). Prior phrasing that labelled them OK was
misleading.
* --types comma-separated filter keeps the default CI run fast by
iterating only a subset of TBQ/PQ types.
* Shared pass/fail rule: nmse(gpu vs cpu) <= 1e-6 for every applied
SG; the per-case table stays OK on the legs that couldn't exercise
the stitch path on the host GPU.
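The self-spawn pattern, sketched (the --child flag and status-line parsing mirror the description above; the exact details in the test source may differ):

```cpp
#include <cstdio>
#include <string>

// Run one child per override value: the env var is consumed once at the
// child's Vulkan device init, so each SG value needs a fresh process.
static int run_sg_child(const std::string &self_path, unsigned sg) {
    const std::string cmd =
        "GGML_VK_TBQ_COPY_SG_SIZE=" + std::to_string(sg) + " " +
        self_path + " --child " + std::to_string(sg) + " 2>&1";
    FILE *pipe = popen(cmd.c_str(), "r");
    if (pipe == nullptr) {
        return -1;
    }
    char line[512];
    unsigned requested = 0, applied = 0;
    while (fgets(line, sizeof line, pipe) != nullptr) {
        // The backend's structured line tells "applied" from "rejected".
        sscanf(line, "tbq_copy_sg_size_status requested=%u applied=%u",
               &requested, &applied);
    }
    printf("sg=%u -> %s\n", sg, applied == sg ? "applied" : "SKIP (rejected)");
    return pclose(pipe);
}
```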
Cross-subgroup-size coverage via lavapipe (tests/test-turboquant.sh)
--------------------------------------------------------------------
Real desktop GPUs (NVIDIA, AMD RDNA, Apple, most Adreno) have
minSubgroupSize >= 32, so VK_EXT_subgroup_size_control cannot request the
smaller subgroups the stitch path was written for. To actually exercise
NSG > 1 in CI, the script now also runs the test under lavapipe (Mesa's
CPU Vulkan driver) at LP_NATIVE_VECTOR_WIDTH in {128, 256, 512}, which
gives native subgroupSize {4, 8, 16} respectively and therefore covers
every distinct NSG branch the shader supports:
  LP_NATIVE_VECTOR_WIDTH | lavapipe SG | NSG (= TQ_WG / SG)
  -----------------------+-------------+-------------------
                     128 |           4 | 8  (8-way stitch)
                     256 |           8 | 4  (4-way stitch)
                     512 |          16 | 2  (2-way stitch)
Combined with the native-GPU leg (NSG=1, fast path), this gives full
coverage of the helper's {1, 2, 4, 8} NSG branches on any host.
Usage and modes
---------------
tests/test-turboquant.sh # short mode (default): CI-friendly
tests/test-turboquant.sh --full # all TBQ/PQ types, full matrix
Short mode restricts the SG-coverage legs to tbq3_0 / pq3_0 / *_64 to keep
default CI runtime bounded; full mode covers all 8 TBQ/PQ types. Both
modes render a Unicode-boxed summary table at the end covering every
subgroup-coverage leg that ran.
Keep omitted V-cache overrides on f16 when flash attention is disabled, and reject explicit quantized V sweeps early.
Do not advertise Vulkan MUL_MAT_ID for TBQ/PQ types because no ID pipelines exist for them. Plain MUL_MAT support remains enabled.
Centralize TBQ/PQ type checks so rotation and Vulkan support gates use the same type set. Keep Hadamard rotation limited to TBQ/PQ KV caches.
Cover small-n permuted TBQ/PQ MUL_MAT cases so standalone QJL and PQ controls are exercised by the TurboQuant test suite.
Run the standalone QJL correction when small-n TBQ is forced onto the matrix path, and index permuted TBQ batches with separate dim2/dim3 strides.
Exercise the head_dim=64 TBQ/PQ variants in standalone MUL_MAT and mixed FLASH_ATTN_EXT so CI catches regressions in the _64 Vulkan paths, not just copy_to_quant. Use each type's block size when choosing MUL_MAT k so _64 cases run with real 64-block geometry instead of inheriting the d=128 shape.
Keep the GGML type comments aligned with the actual TBQ/PQ block sizes so the enum documents the correct storage cost.
Drop placeholder QJL seed macros that were immediately undefined before use. The numeric seeds stay unchanged; this only removes confusing preprocessor noise around the real constants.
Use subgroup reductions with shared-memory stitching for the standalone TBQ QJL correction, matching the subgroup-size handling used by copy_to_quant. This removes the serial thread-0 reduction while keeping QUANT_K-wide workgroups correct across smaller hardware subgroups.
Exercise the standalone non-FA TBQ QJL correction under lavapipe subgroup sizes 4, 8, and 16. Record the legs in the existing subgroup summary so multi-subgroup reduction regressions are visible in the TurboQuant test run.
oneAPI 2026 removed syclcompat/math.hpp, which the current SYCL helper still includes. Install the versioned 2025.3 compiler and MKL packages in both Ubuntu SYCL jobs so CI keeps using the supported toolchain.
Summary
Implements TurboQuant KV cache quantization (Zandieh et al., ICLR 2026) for CPU and Vulkan backends with full Flash Attention support. Compresses KV cache to 3.25-4.25 bits per value, enabling ~4-5x larger context windows on the same hardware.
Paper: https://arxiv.org/pdf/2504.19874
Community discussion:
Related upstream PR: llama : rotate activations for better quantization ggml-org/llama.cpp#21038 (graph-level rotation for existing quant types)
Recommended configurations:
- K=pq3_0 V=pq3_0 — codebook-only, no QJL overhead. Minimal PPL/speed loss at 3.25 bpw with a small retrieval quality trade-off on long contexts.
- K=tbq3_0 V=pq3_0 — QJL-corrected keys with codebook-only values. Best retrieval accuracy at 3.75 avg bpw, with a moderate speed cost from QJL correction in the FA shader.
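Those bpw figures are consistent with a simple accounting, assuming 128-value blocks carrying 32 bits of per-block metadata (an assumption made here for illustration; the exact layout lives in the type definitions):

```math
\text{pq3\_0}: \; 3 + \tfrac{32}{128} = 3.25\ \text{bpw}, \qquad
\text{tbq3\_0}: \; (3 + 1_{\text{QJL sign}}) + \tfrac{32}{128} = 4.25\ \text{bpw}
```

so K=tbq3_0 / V=pq3_0 averages (4.25 + 3.25) / 2 = 3.75 bpw, and against 16-bpw f16 the compression is 16/3.25 ≈ 4.9x down to 16/4.25 ≈ 3.8x — the "~4-5x larger context" figure in the summary.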
Features
- tbq3_0, tbq4_0, pq3_0, pq4_0 (and _64 variants)
- Requesting e.g. pq3_0 auto-selects the matching internal _64 type for head_dim=64 models
- copy_to_quant Vulkan path for TBQ/PQ (faster KV writes)

How does TurboQuant work?
Random rotations spread values evenly across coordinates, preventing concentration on a few axes where zero-coordinates waste bits. In high dimensions, the marginal distribution of each coordinate of a unit-sphere vector follows a Beta distribution that converges to N(0, 1/d) as d grows. The algorithm exploits this by placing Lloyd-Max codebook centroids at optimal positions for this known distribution, minimizing MSE reconstruction error. Centroids are found by solving a continuous 1-dimensional k-means problem.
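Spelled out, this is the standard marginal-coordinate fact for the uniform sphere:

```math
u \sim \mathrm{Unif}(S^{d-1}) \;\Rightarrow\;
p_{u_1}(t) \propto (1 - t^2)^{\frac{d-3}{2}} \ \text{on} \ [-1, 1], \qquad
\tfrac{u_1 + 1}{2} \sim \mathrm{Beta}\!\left(\tfrac{d-1}{2}, \tfrac{d-1}{2}\right), \qquad
\sqrt{d}\, u_1 \xrightarrow{\; d \to \infty \;} \mathcal{N}(0, 1)
```

so each coordinate is approximately N(0, 1/d), and the Lloyd-Max centroids can be precomputed once for that single known distribution.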
An additional QJL correction step (Stage 2) reduces bias in dot-product estimation. It quantizes the residual error from Stage 1 to 1-bit by storing only the signs of the residual vector after applying a random rotation (Hadamard × sign diagonal). Since only signs are stored (no centroid rounding), the paper proves this yields an unbiased dot-product estimator. This step is important for maintaining retrieval quality on long contexts.
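The identity behind that claim, stated here for i.i.d. Gaussian projections g_i (the shader realizes them via the Hadamard × sign-diagonal rotation described below, with m sign bits per block):

```math
\mathbb{E}\big[\operatorname{sgn}(\langle g, r \rangle)\,\langle g, q \rangle\big]
  = \sqrt{\tfrac{2}{\pi}}\,\frac{\langle q, r \rangle}{\lVert r \rVert}
  \quad (g \sim \mathcal{N}(0, I_d)), \qquad
\widehat{\langle q, r \rangle}
  = \lVert r \rVert \sqrt{\tfrac{\pi}{2}}\,\frac{1}{m}
    \sum_{i=1}^{m} \operatorname{sgn}(\langle g_i, r \rangle)\,\langle g_i, q \rangle
```

With d_r = ‖r‖ and m = QUANT_K this is exactly the d_r · √(π/2) / QUANT_K correction factor that reappears in the Vulkan shaders below.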
Optimization details
Hadamard instead of dense rotation: Hadamard-based rotations use the butterfly pattern, costing O(d log d) instead of O(d²). The Hadamard transform itself is deterministic, but composing it with a random sign diagonal restores randomness while keeping the map orthogonal and invertible.
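A self-contained sketch of that rotation — random sign diagonal followed by an in-place Walsh-Hadamard butterfly (function and argument names are illustrative, not the shader's):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// In-place signed fast Hadamard transform; x.size() must be a power of two.
void fht_rotate(std::vector<float> &x, const std::vector<int8_t> &sign) {
    const size_t d = x.size();
    for (size_t i = 0; i < d; ++i) {
        x[i] *= sign[i];                       // random +/-1 diagonal first
    }
    for (size_t len = 1; len < d; len <<= 1) { // O(d log d) butterfly stages
        for (size_t i = 0; i < d; i += 2 * len) {
            for (size_t j = i; j < i + len; ++j) {
                const float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
        }
    }
    const float s = 1.0f / std::sqrt((float) d); // orthonormal scaling
    for (size_t i = 0; i < d; ++i) {
        x[i] *= s;
    }
}
```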
Dense rotation for K/V/Q at graph level, FHT in shader for QJL: At block sizes d=64/128 the O(d²) cost is negligible, and a dense matmul maps better onto GPU parallelism for the graph-level rotation. The butterfly FHT is used inside the Flash Attention shader for the QJL projection, avoiding the need to copy a dense matrix into the shader (which would add memory pressure). Since there is no Q cache, the QJL projection of Q must be recomputed every step to apply corrections against the 1-bit signs stored in K blocks.
(Comparison table of q4_0 / pq3_0 / pq4_0 / tbq3_0 / tbq4_0 omitted; the data did not survive extraction.)

Implementation overview
- vulkan-shaders-gen.cpp — orchestrates SPIR-V compilation of all variant combos
- ggml-vulkan.cpp — host-side: creates pipeline objects, dispatches compute

TurboQuant KV cache shader flow (TBQ/PQ is ONLY a KV cache type, never model weights):
STEP 1: Write to cache (same for all paths)
- copy_to_quant.comp: float K/V → TBQ/PQ quantized blocks (codebook indices plus, for TBQ, QJL sign bits qjl[] and residual norm d_r)

STEP 2: Read cache at attention time (paths diverge here)
PATH A: Scalar Flash Attention (broad HW support, baseline)
- flash_attn.comp
- types.glsl, tq_utils.comp (via flash_attn_base.glsl), dequant_funcs.glsl

PATH B: Cooperative matrix v1 Flash Attention (KHR, cross-vendor)
- flash_attn_cm1.comp
- coopMatMulAdd for K·Q^T (subgroup-scope 16×16 tiles)
- QJL correction applied in shared memory (sfsh[], after the coopmat store)

PATH C: Cooperative matrix v2 Flash Attention (NV only, most efficient)
- flash_attn_cm2.comp
- coopMatLoadTensorNV with decode callback (dequant-on-load, no shared memory staging)
- coopMatMulAdd (workgroup-scope matrices)
- the decode callback indexes data_k[] with hardcoded byte offsets per type

PATH D: No-FA fallback, small N (MUL_MAT with N ≤ 8, e.g. decode)
- mul_mat_vec_tbq3_0.comp / mul_mat_vec_tbq4_0.comp

PATH E: No-FA fallback, large N (K·Q MUL_MAT with N > 8, e.g. prefill)
- Triggered by -fa off with a TBQ/PQ K cache. Only the K·Q matmul is affected: V stays f16 under -fa off (upstream guard), so V·A stays on the existing f16 path.
- mul_mm.comp runs with a TBQ/PQ load_a_to_shmem — centroid dequant × d into shared memory, then the generic tiled matmul (scalar / cm1 pipelines; cm2 falls through to cm1/scalar since no _mat_f16cm2 shader exists for TBQ/PQ).
- mul_mm_tbq_qjl_correction.comp is dispatched after the main matmul as an additive pass — one workgroup per (row, col, batch), QUANT_K threads running the same Walsh–Hadamard + QJL dot product as the vec shader, accumulating d_r · √(π/2) / QUANT_K · sum_qjl(H(B)) into D (see the sketch below). PQ types carry no QJL data (qjl[] / d_r), so Stage 1 alone is exact and no correction pass runs.
- Previously this path was reported as not supported and fell back to CPU, while supports_op claimed TBQ/PQ MUL_MAT on cm2 devices (RTX 5090) with no pipeline behind it, so the correctness run segfaulted. tests/test-backend-ops.cpp now covers all 8 TBQ/PQ types × n ∈ {1,8,16,32} as a repro.
- Non-contiguous src0 (permuted layouts) is now routed to the matrix path as well, so TBQ/PQ MUL_MAT works regardless of src0 stride pattern.
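A host-side model of that additive correction term (bit-packing order and sign convention here are illustrative):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// h_b: the FHT-rotated query block H(B); qjl: packed 1-bit residual signs
// from the K block; d_r: the stored residual norm. Returns the term the
// correction pass accumulates into D for one (row, col, batch) element.
float qjl_correction_term(const std::vector<float> &h_b,
                          const std::vector<uint32_t> &qjl, float d_r) {
    const size_t quant_k = h_b.size();
    float sum = 0.0f;
    for (size_t i = 0; i < quant_k; ++i) {
        const bool positive = (qjl[i / 32] >> (i % 32)) & 1u; // stored sign bit
        sum += positive ? h_b[i] : -h_b[i];                   // sum_qjl(H(B))
    }
    return d_r * std::sqrt(3.14159265f / 2.0f) / (float) quant_k * sum;
}
```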
Example usage
Works transparently with both head_dim=128 (Llama-3.1, Qwen, Mistral) and head_dim=64 (Llama-3.2-1B/3B) — the right block size is auto-selected.
Results / testing
Automatic CI/CTest should already cover the relevant backend regressions:
- test-backend-ops includes the TBQ/PQ backend-op cases, and test-copy-tbq-subgroups covers the Vulkan subgroup copy path. For a normal rebase/regression check, those automatic tests should be enough.

The shell scripts below are more useful for manual TurboQuant-focused testing and analysis, especially when you want to skip unrelated tests and compare report numbers more directly. The quickest manual TurboQuant sanity run is:
That runs the TurboQuant correctness/sanity checks fairly quickly.
For a heavier but still manageable check against the simple report numbers, run the PPL and throughput scripts on the 5090 node from my checkout so the same in-place generated input files are reused:
Compare the resulting PPL vs F16 and TG% numbers against the simple report. These longer scripts are probably overkill for a simple backend regression, but they are useful for analysis/report validation.
The test scripts are in this PR, but the input text is downloaded or auto-generated the first time the scripts run. In theory,
test-kv-cache-quantization-perp.sh should use the same text offsets because the slice seed is fixed, but existing cached wiki.test.offset_<n_ctx>.raw files are reused. The report also includes a zip of the same generated input files for anyone who wants to reproduce the exact input texts. RULER is more sensitive because its data is generated from the NVIDIA RULER repo and depends on the generated validation.jsonl, tokenizer, dependency versions, and source data. PPL can also vary across hardware/backends due to numerical differences. For strict reproducibility, use the same hardware and the same already-generated input folders.

Please see Asana for the latest available data: https://app.asana.com/1/45238840754660/task/1214143691877486/comment/1214346089994897?focus=true
PR for testing integration on LLM Addon: tetherto/qvac#1564
Limitations
-fa off is not supported by this PR. Upstream llama_init_from_model rejects quantized V when flash attention is disabled ("V cache quantization requires flash_attn"), and that guard is intentionally left in place. The -fa off K·Q MUL_MAT fix in this PR would extend cleanly to A·V for a quantized V as well, but the v_trans V-cache layout used under -fa off is populated by ggml_set_rows with row_size=1, which corrupts any blck_size > 1 type at write time (reproducible on CPU as well, independent of backend). Fixing that is a KV-cache refactor out of scope here; the guard will be revisited once that lands.

TBQ / PQ Vulkan support matrix
What runs on the GPU vs. is refused by the context, across FA on/off on dense and MoE models. The MoE-KV-cache rows behave the same as dense because attention itself is plain MUL_MAT / FLASH_ATTN_EXT, not MUL_MAT_ID; MoE routing (MUL_MAT_ID) only applies to the FFN weights, which are never stored as TBQ/PQ.
(Support matrix table omitted; one recoverable row: the -fa off K·Q case runs mul_mm.comp + QJL correction, with V·A on the existing f16 path.)

Notes:
- _64 block variants (tbq*_0_64, pq*_0_64) have their own pipelines, codebooks, and sign tables.
- llama-quantize has no TBQ/PQ target (and no GGUF stores FFN experts in those types), so MUL_MAT_ID never receives TBQ/PQ src0. Attention in MoE models is a plain MUL_MAT / FLASH_ATTN_EXT and therefore falls under the "KV cache" rows above.

Remaining work