Refactoring: Use STL in input processing and fix off-by-one bug #115
Closed
thomasantony wants to merge 1 commit into
Conversation
force-pushed from faacdb5 to 68897eb
ggerganov added a commit that referenced this pull request on Mar 19, 2023

Member

Thanks for spotting this!
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request on Apr 17, 2026
Apply the loop-shape suggested in PR review (ggml-org#115 r3099918856) to the remaining TBQ/PQ vec_dot kernels in ggml/src/ggml-cpu/quants.c, mirroring the pattern already used by ggml_vec_dot_tbq3_0_64_q8_0:

- one fewer scratch buffer: drop tmp_y[QK8_0]; q8_0 dequantizes directly into the per-sub-block slice of tmp_y_full (TBQ), or into a single reused tmp_y (PQ, no QJL stage).
- one fewer copy loop: no more "tmp_y_full[j*QK8_0+k] = tmp_y[k]" next to the dot accumulation.
- one fewer induction variable: replace the running pos++ with the closed-form index &y[i*M + j], with M = QK_{TQ,TQ_64} / QK8_0.
- shorter dependency chain on sumf: accumulate per q8_0 sub-block into a local 'acc', sum sub-blocks into a 'base', then update sumf once per outer block (TBQ adds the QJL correction at the same time).
- two software prefetches per outer iteration for the next x and y blocks (__builtin_prefetch with rw=0, locality=0).
- GGML_RESTRICT on inner xb/yb pointers to make non-aliasing explicit.

Kernels touched:
- ggml_vec_dot_tbq3_0_q8_0
- ggml_vec_dot_tbq4_0_q8_0
- ggml_vec_dot_pq3_0_q8_0
- ggml_vec_dot_pq3_0_64_q8_0
- ggml_vec_dot_pq4_0_q8_0
- ggml_vec_dot_pq4_0_64_q8_0

The *_64 TBQ kernels already followed this shape and are unchanged. The *_generic wrappers just forward to the main kernels and inherit the changes automatically.

Numerics: the algorithm is unchanged; only the FP32 reduction order inside each block changes (associative-only reordering). The QJL correction term and all dequantization paths are bit-identical.

Validation on a CPU-only build (Ryzen AI 7 PRO 360, GGML_VULKAN=OFF):
- tests/test-turboquant.sh: all checks pass
  * test-quantize-fns: ok for all TBQ/PQ types
  * test-quantize-perf perf-sanity gates: ok
  * test-backend-ops -p "tbq|pq": ok
- test-quantize-perf vec_dot_q (4 KB L1-resident microbench, n=10): neutral within ~5% noise, as expected for a fully cache-hot loop where dequantize_row_* dominates and prefetch cannot help.
- llama-bench (Mistral-7B-Instruct-v0.3-Q4_K_S, pp128/tg32, fa=on, K=tbq3_0 V=pq3_0, 4 alternating BEFORE/AFTER trials with cooldown):
    pp: +3.0% mean, won 4/4 trials
    tg: +3.4% mean, won 4/4 trials
- llama-perplexity spot-check on 4 K/V configs x 2 n_ctx (8 datapoints, Mistral-7B-Q4_K_S, wikitext-2, 2 chunks each, fa=on):
    5/8 datapoints PPL <= baseline, 3/8 slightly above
    mean delta = -0.09%, range [-1.49%, +1.22%]
    all well inside per-chunk stdev (~+/-1.3 absolute, ~+/-11%)
    -> symmetric around zero, consistent with FP32 reduction-order rounding noise; no regression.
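For orientation, a minimal C++ sketch of the loop shape this commit message describes; the block types, sizes, dequantizer and the float-only x row are simplified stand-ins, not the real ggml definitions:

```cpp
// Minimal sketch (not the real ggml kernel) of the restructured loop shape.
#define QK8_0 32    // q8_0 sub-block size, as in ggml
#define QK_TQ 128   // outer block size, stand-in for QK_TQ / QK_TQ_64

// Stand-in for ggml's q8_0 sub-block: 8-bit codes plus a scale.
struct q8_block { float d; signed char qs[QK8_0]; };

// Stand-in dequantizer: writes one q8_0 sub-block into dst[0..QK8_0).
static void dequant_q8_0(const q8_block * __restrict__ b, float * __restrict__ dst) {
    for (int k = 0; k < QK8_0; ++k) dst[k] = b->d * (float) b->qs[k];
}

// Dequantize each q8_0 sub-block straight into its slice of tmp_y_full,
// accumulate into a local 'acc', fold sub-blocks into 'base', and touch
// 'sumf' only once per outer block.
static float vec_dot_sketch(int n, const float * __restrict__ xrow, const q8_block * __restrict__ y) {
    const int nb = n / QK_TQ;      // outer blocks
    const int M  = QK_TQ / QK8_0;  // q8_0 sub-blocks per outer block
    float sumf = 0.0f;

    for (int i = 0; i < nb; ++i) {
        const float    * __restrict__ xb = &xrow[i*QK_TQ];
        const q8_block * __restrict__ yb = &y[i*M];  // closed-form index, no running pos++

        // prefetch the next x and y blocks (read-only, low temporal locality)
        __builtin_prefetch(&xrow[(i + 1)*QK_TQ], 0, 0);
        __builtin_prefetch(&y[(i + 1)*M],        0, 0);

        float tmp_y_full[QK_TQ];
        float base = 0.0f;

        for (int j = 0; j < M; ++j) {
            dequant_q8_0(&yb[j], &tmp_y_full[j*QK8_0]);  // no separate tmp_y + copy loop

            float acc = 0.0f;
            for (int k = 0; k < QK8_0; ++k) {
                acc += xb[j*QK8_0 + k] * tmp_y_full[j*QK8_0 + k];
            }
            base += acc;
        }
        sumf += base;  // the TBQ kernels add their QJL correction term here as well
    }
    return sumf;
}
```

The single `sumf` update per outer block is what shortens the floating-point dependency chain the message refers to.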
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request on Apr 20, 2026
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR ggml-org#115 CI run 24463228694.
Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.
No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
fi
+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+ CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
echo ">>===== Enabling KleidiAI support"
@@ -236,11 +245,6 @@ function gg_run_ctest_release {
# Check cmake, make and ctest are installed
gg_check_build_requirements
- # Disable native CPU optimizations for low-perf builds to ensure compatibility
- if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
- CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
- fi
-
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
return res;
}
+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+ if (!can_rot) {
+ return nullptr;
+ }
+
+ int nrot = 64;
+ do { nrot *= 2; } while (n_embd_head % nrot == 0);
+ nrot /= 2;
+
+ ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+ ggml_set_input(rot);
+ return rot;
+}
+
void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
if (ubatch->token) {
const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
hparams.n_embd_head_k % 64 == 0 &&
ggml_is_quantized(mctx_cur->type_k());
- if (can_rotk) {
- int nrot = 64;
- do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
- nrot /= 2;
-
- inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotk);
- } else {
- inp->self_rotk = nullptr;
- }
-
const bool can_rotv =
!hparams.is_n_embd_v_gqa_variable() &&
hparams.n_embd_head_v % 64 == 0 &&
ggml_is_quantized(mctx_cur->type_v());
- if (can_rotv) {
- int nrot = 64;
-
- inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotv);
- } else {
- inp->self_rotv = nullptr;
- }
+ inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+ inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
}
return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
hparams.n_embd_head_k % 64 == 0 &&
ggml_is_quantized(mctx_cur->get_base()->type_k());
- if (can_rotk) {
- int nrot = 64;
- do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
- nrot /= 2;
-
- inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotk);
- } else {
- inp->self_rotk = nullptr;
- }
-
const bool can_rotv =
!hparams.is_n_embd_v_gqa_variable() &&
hparams.n_embd_head_v % 64 == 0 &&
ggml_is_quantized(mctx_cur->get_base()->type_v());
- if (can_rotv) {
- int nrot = 64;
-
- inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
- ggml_set_input(inp->self_rotv);
- } else {
- inp->self_rotv = nullptr;
- }
+ inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+ inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
}
return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request on Apr 20, 2026
Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:
- n=1, n=8 -> mul_mat_vec path (decode-like)
- n=16, n=32 -> dequant + f16 matmul path (prefill-like)
Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:
- On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
external reproduction from the PR ggml-org#115 review and making the bug
visible to CI.
- On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
garbage (err ~ 1.0), exposing the same missing-fallback issue in
a non-crashing form.
- The n=1/n=8 cases continue to pass via the existing
`mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
coverage cleanly isolates which dispatch path is broken.
No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
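Roughly what such registrations could look like in `tests/test-backend-ops.cpp`; the `test_mul_mat` constructor arguments, the `GGML_TYPE_TBQ*`/`GGML_TYPE_PQ*` enum names, and the m/k sizes are assumptions based on the message, not code verified against this branch:

```cpp
// Hypothetical sketch: register MUL_MAT cases for the TurboQuant/PolarQuant
// types so both dispatch paths are exercised. n = 1/8 hits the mul_mat_vec
// path (decode-like), n = 16/32 the dequant + f16 matmul path (prefill-like).
for (ggml_type type_a : {GGML_TYPE_TBQ3_0, GGML_TYPE_TBQ4_0,
                         GGML_TYPE_PQ3_0,  GGML_TYPE_PQ4_0}) {
    for (ggml_type type_b : {GGML_TYPE_F32, GGML_TYPE_F16}) {
        for (int64_t n : {1, 8, 16, 32}) {
            // m = src0 rows, k = shared dim; block-size-aligned (assumed values)
            test_cases.emplace_back(new test_mul_mat(type_a, type_b, /*m=*/128, n, /*k=*/256));
        }
    }
}
```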
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request on Apr 21, 2026
vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL correction pass, and d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0 (and their `_64` / head_dim=64 variants) as supported for `GGML_OP_MUL_MAT` on the Vulkan backend, but there was no working pipeline behind it. This is the state the external review flags as Issue 3 on PR ggml-org#115: on cm2 (RTX 5090) the support probe claims support, the correctness run then segfaults. The previous `test-backend-ops` commit adds the exact repro for this.

Root causes:
- On cm2 (NV coopmat2) devices the slot in `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was never generated - so dispatches crashed when flash attention was not used.
- On cm1 and scalar devices the slot in `pipeline_dequant_mul_mat_mat[]` was wired up, but `mul_mm_funcs.glsl` had no `load_a_to_shmem` implementation for TBQ/PQ. The generic `mul_mm.comp` ran with uninitialized shared memory and produced near-random output.
- Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a small bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2 correction that `mul_mat_vec_tbq*_0.comp` applies in the vec path has no equivalent in the generic matmul shader.
- For head_dim=64 models that use the `_64` block variants (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL correction shaders, and the `_64`-sized Lloyd-Max codebook / sign arrays were missing, and the vec path is intentionally skipped for `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with TBQ/PQ `src0` and f32 `src1` that the Issue 3 repro hits, i.e. the `-fa off` K matmul. Fused flash attention (scalar, cm1, cm2) already handles QJL correctly and is unchanged. MoE FFN weights are not affected either: TBQ/PQ are KV-cache quantizations (there is no `llama-quantize` target that produces TBQ/PQ model weights), so `MUL_MAT_ID` never sees them as `src0` - attention in MoE models is a plain `MUL_MAT` / `FLASH_ATTN_EXT` and reuses the same fix. The upstream "V cache quantization requires flash_attn" context-level guard in `src/llama-context.cpp` is intentionally left unchanged: the V matmul under `-fa off` uses a transposed quantized-V layout populated by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1` type at write time (reproducible on CPU as well), and that is a separate KV-cache issue out of scope here.

Changes:
- Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0, PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`. This makes `mul_mm.comp` correct for the centroid part of dequantization (`tbq*_dequant_raw(qs) * d`) on all eight types.
- `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and a wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at d=128); previously the shader hardcoded the d=128 constants, so the d=64 variants silently dequantized against the wrong codebook.
- New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the main matmul as an additive pass: one workgroup per `(row, col, batch)`, `QUANT_K` threads performing the same Walsh-Hadamard butterfly + `qjl[]` dot product as the vec shader, and accumulates `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into D. Parameterized over `QUANT_K` so the same source emits both `_128` and `_64` SPIR-V. Only TBQ3_0 and TBQ4_0 (and `_64`) have `d_r`/`qjl`, so only those four get a correction pipeline.
- `vulkan-shaders-gen.cpp`:
  * Register the eight correction variants (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
  * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for `mul_mm.comp`, in a dedicated block outside the main `type_names` loop so we don't cascade through FA / MUL_MAT_ID / get_rows / ... which either already have dedicated `_64` handling (FA) or don't apply to TBQ/PQ at all.
- `ggml-vulkan.cpp`:
  * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device and create pipelines at init time for all four TBQ types (128 and 64 block sizes).
  * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to the cm1/scalar pipeline when no cm2 `_mat_f16` shader exists for a given TBQ/PQ type, so cm2 devices stop segfaulting on these types.
  * Register TBQ/PQ `_64` in the `supports_op` `MUL_MAT` switch so d=64 models are actually routed to the new pipelines instead of falling back to CPU.
  * Force `split_k = 1` for TBQ `src0` - the QJL correction pass would otherwise be added once per split.
  * Dispatch the QJL correction pass after the main matmul for TBQ3_0 / TBQ4_0 (and `_64`). For `_128` it's gated on `n > mul_mat_vec_max_cols` (the vec path already corrects for smaller n); for `_64` it runs unconditionally because there is no vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with `test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x n in {1,8,16,32} cases pass against f32 B. f16 B for standalone MUL_MAT on TBQ/PQ still reports `not supported` on this device (the scalar/cm1 pipeline consumes f32 src1), which is consistent with the matmul path shipping f32 src1 on the `-fa off` decode/prefill paths used by these tests. cm2 verification is expected to run on the reviewer's RTX 5090 via the Issue 3 repro branch.

debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with non-contiguous quantized src0 can run on GPU

vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max codebook centroids and random sign diagonals to tq_utils.comp, so that the Vulkan encoder, FA decoder, and non-FA QJL correction pass all match the CPU reference's d=64 constants (ggml-quants.c: TQ{3,4}_CODEBOOK_64, TQ/QJL_SIGN_SEED_64). It missed the corresponding d=64 *decision boundaries* in copy_to_quant.comp, however. The boundaries are the midpoints between adjacent codebook centroids and determine which centroid an input coordinate is quantized to. CPU derives them at runtime from the selected codebook via tq_compute_boundaries(), so it automatically used the d=64 midpoints for _64 blocks. The Vulkan encoder hard-codes them as const float TBQ3_B[7] / TBQ4_B[15], and those constants remained at the d=128 midpoints even inside the #if block that accepts both DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):
- copy_to_quant bucketed each coordinate by the narrower d=128 boundaries (centroids spaced ~sigma=1/sqrt(128)).
- The resulting index was then dequantized with the wider d=64 centroids (spaced ~sigma=1/sqrt(64)), a completely different alphabet.
- Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was also d=128, so encoder and decoder were at least in agreement (just producing a rescaled quantization). When the codebook was fixed to d=64, the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated with scripts/compute_tq_codebooks.py, which now also emits the GLSL boundaries array alongside the C codebook array so future codebook updates keep the two in sync from one source of truth.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64 (--chunks 1), Vulkan AMD RADV gfx1150:
    tbq3_0 / f16 fa=off: 1659 -> 230 (7.2x better)
    tbq4_0 / f16 fa=off: ~ -> 300
    pq3_0 / f16 fa=off: ~ -> 230
    pq4_0 / f16 fa=off: ~ -> 300
    pq3_0 / f16 fa=on: ~ -> 225 (now matches fa=off)
    pq4_0 / f16 fa=on: ~ -> 299 (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue in the FA QJL Stage-2 correction on d=64 blocks, addressed in a follow-up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

    correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q). The scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used to derive the FHT input by reading from Qf -- the shared-memory buffer that already has the attention scale (1/sqrt(head_dim)) multiplied in for the main Q*K dot -- and then dividing that value by p.scale to recover the raw Q before multiplying by the QJL sign diagonal.

On cm1 Qf is f16, so the scale round-trip is lossy for the large-magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer massive activations. On the scalar path Qf is f32 and p.scale is usually a power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is bit-exact in principle -- but empirically the pre-scaled-then-un-scaled read still produced materially different FHT input than a raw-Q read.

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has always read Q directly from src1 and gets correct results. Match that pattern in the FA path: read Q straight from the data_qv4 SSBO into Qf_qjl_proj, bypassing Qf entirely. cm2 already reads raw Q from data_q directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test (Vulkan AMD RADV gfx1150):
    wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
        fa=off        : 531
        fa=on, before : ~2000 (broken)
        fa=on, after  : 154
    wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
        fa=off        : 207
        fa=on, after  : 77

K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other, confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on `!x_non_contig`, which silently skipped the correction on the no-FA attention path because `kq = mul_mat(k, q)` feeds in K after `ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the KV cache. With that gate the TBQ attention on the no-FA path was reduced to PQ Stage 1 (centroid-only), producing bit-identical output to `pq3_0` / `pq4_0` and regressing quality vs a CPU-reference TBQ run. It was masked on the MUL_MAT test-backend-ops coverage because those tests use contiguous src0.

Two changes:
* mul_mm_tbq_qjl_correction.comp: index the A matrix by a real in-memory block stride instead of the `num_blocks_per_row = K / QUANT_K` shortcut. `p.stride_a` and `p.batch_stride_a` are now interpreted as strides in BLOCK units (src0->nb[1] / sizeof(block) and src0->nb[2] / sizeof(block) respectively), matching the way `ggml_vk_flash_attn` already feeds `k_stride` / `k_offset` to the FA shader. For a contiguous src0 the new stride equals the old num_blocks_per_row, so existing tests are unaffected.
* ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` / `qjl_stride_batch_a` from src0->nb and drop the `!x_non_contig` exclusion from both the descriptor-set request and the dispatch site. The `qx_buf_offset` still points at the original (pre-permute) TBQ blocks in d_Qx, which is exactly what the correction pass wants -- it just needed the real strides to reach the right block for each (row_a, batch_id).

Both paths have to be gated identically -- if only one of them is changed, the dispatched pipeline ends up without a descriptor set and `vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4 -n 128 (Vulkan AMD RADV gfx1150):
    K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
        before fix: 561 == 561 (QJL silently skipped, TBQ==PQ)
        after fix : 634 != 561 (QJL running, TBQ distinct from PQ)
        CPU ref   : 546
    K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
        before fix: 173 == 173 (QJL silently skipped)
        after fix : 133 != 173 (QJL running, 23% lower PPL)
        CPU ref   : 154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ. FA is unaffected since it has its own inlined QJL epilogue.
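For reference, a CPU-side C++ sketch of the Stage-2 correction term as quoted above; the interpretation of `pos_sum` / `proj_q_sum` as sums over the FHT output split by the stored QJL sign, and the argument shapes, are assumptions rather than the shader's actual code:

```cpp
#include <cmath>

#define QUANT_K 128  // block size; the _64 variants use 64

// Unnormalized in-place Walsh-Hadamard transform (the "FHT butterfly").
static void fht(float * v, int n) {
    for (int len = 1; len < n; len <<= 1) {
        for (int i = 0; i < n; i += 2*len) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j], b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
}

// QJL Stage-2 correction for one (row, col) pair. Assumed inputs: the raw Q
// vector, the QJL sign diagonal as +/-1 values, the per-block stored QJL sign
// pattern (also +/-1), and the block's d_r scale.
static float qjl_correction(const float * q, const float * sign_diag,
                            const float * qjl_signs, float d_r) {
    float h[QUANT_K];
    for (int k = 0; k < QUANT_K; ++k) h[k] = sign_diag[k] * q[k];  // D_qjl * Q
    fht(h, QUANT_K);                                               // FHT(D_qjl * Q)

    float pos_sum = 0.0f, proj_q_sum = 0.0f;
    for (int k = 0; k < QUANT_K; ++k) {
        proj_q_sum += h[k];
        if (qjl_signs[k] > 0.0f) pos_sum += h[k];
    }

    const float pi = 3.14159265358979f;
    return d_r * std::sqrt(pi / 2.0f) / QUANT_K * (2.0f*pos_sum - proj_q_sum);
}
```

In the shaders the same reduction is performed cooperatively by `QUANT_K` threads in the one workgroup per `(row, col, batch)`.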
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request on Apr 28, 2026

* MMQ for Q6_0
* Add Q6_0 MMQ to template generator

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
LifesLight pushed a commit to LifesLight/custom.llama.cpp that referenced this pull request on May 2, 2026
Cherry-pick signalnine PR ggml-org#53: auto-asymmetric GQA + turbo VEC FA opts
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request on May 6, 2026
There was an off-by-one bug in the input-processing code: it would copy `n_batch + 1` elements rather than `n_batch`. This PR fixes that and also switches to the STL instead of a hand-written while loop.
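A minimal, self-contained C++ sketch of the kind of change described, assuming the batching code was a hand-rolled copy loop; the variable names and surrounding structure are illustrative, not the actual llama.cpp code:

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> embd_inp = {1, 2, 3, 4, 5, 6, 7};  // pending prompt tokens
    std::vector<int> embd;                              // tokens for this eval call
    const size_t n_batch  = 4;
    size_t       consumed = 0;

    // Off-by-one shape of a hand-rolled loop: "<=" admits n_batch + 1 tokens.
    // while (consumed < embd_inp.size() && embd.size() <= n_batch) {
    //     embd.push_back(embd_inp[consumed++]);
    // }

    // STL version: copy exactly min(n_batch, remaining) tokens.
    const size_t n_take = std::min(n_batch, embd_inp.size() - consumed);
    std::copy_n(embd_inp.begin() + consumed, n_take, std::back_inserter(embd));
    consumed += n_take;

    std::printf("copied %zu tokens (batch size %zu)\n", embd.size(), n_batch);
    return 0;
}
```

With `std::min` computing the exact count and `std::copy_n` doing the copy, there is no loop condition left to get off by one.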