
Refactoring: Use STL in input processing and fix off-by-one bug #115

Closed
thomasantony wants to merge 1 commit into ggml-org:master from thomasantony:refactor_input_processing


Conversation

@thomasantony

There was an off-by-one bug in the line

if (embd.size() > params.n_batch) {
    break;
}

It would copy n_batch + 1 elements rather than n_batch. This PR fixes that and also switches to using the STL instead of a while loop.
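Concretely (an illustrative sketch, not the actual llama.cpp code — `take_batch`, `input`, and `pos` are stand-in names): checking `>` only breaks after the vector has already grown to n_batch + 1, so the comparison has to be `>=`; an STL-style variant sidesteps the loop entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Loop-based form with the off-by-one fixed: stop once the batch is full,
// i.e. check >= n_batch (the buggy `>` broke one push_back too late).
std::vector<int> take_batch(const std::vector<int> & input, size_t pos, size_t n_batch) {
    std::vector<int> embd;
    while (pos < input.size()) {
        embd.push_back(input[pos++]);
        if (embd.size() >= n_batch) {
            break;
        }
    }
    return embd;
}

// STL-style equivalent in the spirit of this PR (the exact upstream code may
// differ): copy min(remaining, n_batch) elements in one shot.
std::vector<int> take_batch_stl(const std::vector<int> & input, size_t pos, size_t n_batch) {
    const size_t n = std::min(input.size() - pos, n_batch);
    return std::vector<int>(input.begin() + pos, input.begin() + pos + n);
}
```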

@thomasantony force-pushed the refactor_input_processing branch from faacdb5 to 68897eb on March 14, 2023 05:07
ggerganov added a commit that referenced this pull request Mar 19, 2023
@ggerganov
Member

Thanks for spotting this!
For now I prefer to keep the loop-based implementation - I find it easier to read.
Pushed the off-by-one fix to master.

@ggerganov ggerganov closed this Mar 19, 2023
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 17, 2026
Apply the loop-shape suggested in PR review (ggml-org#115 r3099918856) to the
remaining TBQ/PQ vec_dot kernels in ggml/src/ggml-cpu/quants.c, mirroring
the pattern already used by ggml_vec_dot_tbq3_0_64_q8_0:

  - one fewer scratch buffer: drop tmp_y[QK8_0]; q8_0 dequantizes
    directly into the per-sub-block slice of tmp_y_full (TBQ),
    or into a single reused tmp_y (PQ, no QJL stage).
  - one fewer copy loop: no more "tmp_y_full[j*QK8_0+k] = tmp_y[k]"
    next to the dot accumulation.
  - one fewer induction variable: replace the running pos++ with the
    closed-form index &y[i*M + j], with M = QK_{TQ,TQ_64} / QK8_0.
  - shorter dependency chain on sumf: accumulate per q8_0 sub-block
    into a local 'acc', sum sub-blocks into a 'base', then update sumf
    once per outer block (TBQ adds the QJL correction at the same time).
  - two software prefetches per outer iteration for the next x and y
    blocks (__builtin_prefetch with rw=0, locality=0).
  - GGML_RESTRICT on inner xb/yb pointers to make non-aliasing explicit.
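On a plain-float stand-in the resulting loop shape looks like this (illustrative only — the real kernels dequantize quantized blocks and add a QJL correction; `M`, `K`, `acc`, and `base` follow the names used above):

```cpp
#include <cassert>

// Blocked dot product in the refactored shape: nb outer blocks, each made of
// M sub-blocks of K elements. No running pos++ (closed-form index), a local
// per-sub-block `acc`, a per-outer-block `base`, and a single sumf update per
// outer block to shorten the dependency chain on sumf.
float dot_blocks(const float * x, const float * y, int nb, int M, int K) {
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        float base = 0.0f;
        for (int j = 0; j < M; ++j) {
            const float * xb = &x[(i*M + j)*K]; // closed-form, no pos++
            const float * yb = &y[(i*M + j)*K];
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += xb[k] * yb[k];
            }
            base += acc;
        }
        sumf += base; // one update per outer block
    }
    return sumf;
}
```

Only the FP32 association changes relative to a single flat accumulator, which is why the commit can claim the algorithm itself is untouched.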

Kernels touched:

  - ggml_vec_dot_tbq3_0_q8_0
  - ggml_vec_dot_tbq4_0_q8_0
  - ggml_vec_dot_pq3_0_q8_0
  - ggml_vec_dot_pq3_0_64_q8_0
  - ggml_vec_dot_pq4_0_q8_0
  - ggml_vec_dot_pq4_0_64_q8_0

The *_64 TBQ kernels already followed this shape and are unchanged.
The *_generic wrappers just forward to the main kernels and inherit
the changes automatically.

Numerics: the algorithm is unchanged; only the FP32 reduction order
inside each block changes (associative-only reordering). The QJL
correction term and all dequantization paths are bit-identical.

Validation on a CPU-only build (Ryzen AI 7 PRO 360, GGML_VULKAN=OFF):

  - tests/test-turboquant.sh: all checks pass
      * test-quantize-fns: ok for all TBQ/PQ types
      * test-quantize-perf perf-sanity gates: ok
      * test-backend-ops -p "tbq|pq": ok
  - test-quantize-perf vec_dot_q (4 KB L1-resident microbench, n=10):
    neutral within ~5% noise, as expected for a fully cache-hot loop
    where dequantize_row_* dominates and prefetch cannot help.
  - llama-bench (Mistral-7B-Instruct-v0.3-Q4_K_S, pp128/tg32, fa=on,
    K=tbq3_0 V=pq3_0, 4 alternating BEFORE/AFTER trials with cooldown):
      pp: +3.0% mean, won 4/4 trials
      tg: +3.4% mean, won 4/4 trials
  - llama-perplexity spot-check on 4 K/V configs x 2 n_ctx (8 datapoints,
    Mistral-7B-Q4_K_S, wikitext-2, 2 chunks each, fa=on):
      5/8 datapoints PPL <= baseline, 3/8 slightly above
      mean delta = -0.09%, range [-1.49%, +1.22%]
      all well inside per-chunk stdev (~+/-1.3 absolute, ~+/-11%)
    -> symmetric around zero, consistent with FP32 reduction-order
       rounding noise; no regression.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 17, 2026
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR ggml-org#115 CI run 24463228694.

Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.

No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
     CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
 fi

+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+    CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
 if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
     echo ">>===== Enabling KleidiAI support"

@@ -236,11 +245,6 @@ function gg_run_ctest_release {
     # Check cmake, make and ctest are installed
     gg_check_build_requirements

-    # Disable native CPU optimizations for low-perf builds to ensure compatibility
-    if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
-        CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
-    fi
-
     (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
     (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log

diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
     return res;
 }

+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+    if (!can_rot) {
+        return nullptr;
+    }
+
+    int nrot = 64;
+    do { nrot *= 2; } while (n_embd_head % nrot == 0);
+    nrot /= 2;
+
+    ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+    ggml_set_input(rot);
+    return rot;
+}
+
 void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
     if (ubatch->token) {
         const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
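The `nrot` computation the new helper factors out picks the largest power of two (at least 64) that divides `n_embd_head`. Isolated as a pure function (sketch; tensor allocation omitted, and the `can_rot` conditions guarantee `n_embd_head % 64 == 0`):

```cpp
#include <cassert>

// Largest power-of-two divisor of n_embd_head that is >= 64, exactly as the
// do/while loop in build_hadamard_rot() computes it. Assumes n_embd_head is
// a positive multiple of 64 (guaranteed by the can_rot checks in the caller).
int hadamard_nrot(int n_embd_head) {
    int nrot = 64;
    do { nrot *= 2; } while (n_embd_head % nrot == 0);
    return nrot / 2;
}
```

So a head dim of 128 gets a full 128-wide rotation, while 192 (= 64 * 3) falls back to a 64-wide one.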
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:

  - n=1, n=8   -> mul_mat_vec path (decode-like)
  - n=16, n=32 -> dequant + f16 matmul path (prefill-like)

Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:

  - On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
    external reproduction from the PR ggml-org#115 review and making the bug
    visible to CI.
  - On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
    garbage (err ~ 1.0), exposing the same missing-fallback issue in
    a non-crashing form.
  - The n=1/n=8 cases continue to pass via the existing
    `mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
    coverage cleanly isolates which dispatch path is broken.

No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
…d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0 (and
their `_64` / head_dim=64 variants) as supported for `GGML_OP_MUL_MAT` on
the Vulkan backend, but there was no working pipeline behind it. This is
the state the external review flags as Issue 3 on PR ggml-org#115: on cm2 (RTX
5090) the support probe claims support, the correctness run then
segfaults. The previous `test-backend-ops` commit adds the exact repro
for this.

Root causes:

  - On cm2 (NV coopmat2) devices the slot in
    `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was
    never generated - so dispatches crashed when flash attention was
    not used.

  - On cm1 and scalar devices the slot in
    `pipeline_dequant_mul_mat_mat[]` was wired up, but `mul_mm_funcs.glsl`
    had no `load_a_to_shmem` implementation for TBQ/PQ. The generic
    `mul_mm.comp` ran with uninitialized shared memory and produced
    near-random output.

  - Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a
    small bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2
    correction that `mul_mat_vec_tbq*_0.comp` applies in the vec path
    has no equivalent in the generic matmul shader.

  - For head_dim=64 models that use the `_64` block variants
    (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL
    correction shaders, and the `_64`-sized Lloyd-Max codebook / sign
    arrays were missing, and the vec path is intentionally skipped for
    `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with TBQ/PQ
`src0` and f32 `src1` that the Issue 3 repro hits, i.e. the `-fa off` K
matmul. Fused flash attention (scalar, cm1, cm2) already handles QJL
correctly and is unchanged. MoE FFN weights are not affected either:
TBQ/PQ are KV-cache quantizations (there is no `llama-quantize` target
that produces TBQ/PQ model weights), so `MUL_MAT_ID` never sees them as
`src0` - attention in MoE models is a plain `MUL_MAT` /
`FLASH_ATTN_EXT` and reuses the same fix.

The upstream "V cache quantization requires flash_attn" context-level
guard in `src/llama-context.cpp` is intentionally left unchanged: the V
matmul under `-fa off` uses a transposed quantized-V layout populated
by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1`
type at write time (reproducible on CPU as well), and that is a
separate KV-cache issue out of scope here.

Changes:

  - Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0,
    PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing
    `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`.
    This makes `mul_mm.comp` correct for the centroid part of
    dequantization (`tbq*_dequant_raw(qs) * d`) on all eight types.

  - `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the
    Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any
    `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and
    a wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at
    d=128); previously the shader hardcoded the d=128 constants, so
    the d=64 variants silently dequantized against the wrong codebook.

  - New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the
    main matmul as an additive pass: one workgroup per
    `(row, col, batch)`, `QUANT_K` threads performing the same
    Walsh-Hadamard butterfly + `qjl[]` dot product as the vec shader,
    and accumulates `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into
    D. Parameterized over `QUANT_K` so the same source emits both
    `_128` and `_64` SPIR-V. Only TBQ3_0 and TBQ4_0 (and `_64`) have
    `d_r`/`qjl`, so only those four get a correction pipeline.

  - `vulkan-shaders-gen.cpp`:
      * Register the eight correction variants
        (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
      * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for
        `mul_mm.comp`, in a dedicated block outside the main
        `type_names` loop so we don't cascade through FA / MUL_MAT_ID /
        get_rows / ... which either already have dedicated `_64`
        handling (FA) or don't apply to TBQ/PQ at all.

  - `ggml-vulkan.cpp`:
      * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device
        and create pipelines at init time for all four TBQ types (128
        and 64 block sizes).
      * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to
        the cm1/scalar pipeline when no cm2 `_mat_f16` shader exists
        for a given TBQ/PQ type, so cm2 devices stop segfaulting on
        these types.
      * Register TBQ/PQ `_64` in `supports_op` `MUL_MAT` switch so
        d=64 models are actually routed to the new pipelines instead
        of falling back to CPU.
      * Force `split_k = 1` for TBQ `src0` - the QJL correction pass
        would otherwise be added once per split.
      * Dispatch the QJL correction pass after the main matmul for
        TBQ3_0 / TBQ4_0 (and `_64`). For `_128` it's gated on
        `n > mul_mat_vec_max_cols` (the vec path already corrects for
        smaller n); for `_64` it runs unconditionally because there is
        no vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with
`test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x
n in {1,8,16,32} cases pass against f32 B. f16 B for standalone
MUL_MAT on TBQ/PQ still reports `not supported` on this device (the
scalar/cm1 pipeline consumes f32 src1), which is consistent with the
matmul path shipping f32 src1 on the `-fa off` decode/prefill paths
used by these tests. cm2 verification is expected to run on the
reviewer's RTX 5090 via the Issue 3 repro branch.

debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with non-contiguous quantized src0 can run on GPU
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 21, 2026
vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL
correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max
codebook centroids and random sign diagonals to tq_utils.comp, so that
the Vulkan encoder, FA decoder, and non-FA QJL correction pass all match
the CPU reference's d=64 constants (ggml-quants.c: TQ{3,4}_CODEBOOK_64,
TQ/QJL_SIGN_SEED_64). It missed the corresponding d=64 *decision
boundaries* in copy_to_quant.comp, however.

The boundaries are the midpoints between adjacent codebook centroids and
determine which centroid an input coordinate is quantized to. CPU derives
them at runtime from the selected codebook via tq_compute_boundaries(),
so it automatically used the d=64 midpoints for _64 blocks. The Vulkan
encoder hard-codes them as const float TBQ3_B[7] / TBQ4_B[15], and those
constants remained at the d=128 midpoints even inside the #if block that
accepts both DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):

  - copy_to_quant bucketed each coordinate by the narrower d=128
    boundaries (centroids spaced ~sigma=1/sqrt(128)).
  - The resulting index was then dequantized with the wider d=64
    centroids (spaced ~sigma=1/sqrt(64)), a completely different
    alphabet.
  - Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was
also d=128, so encoder and decoder were at least in agreement (just
producing a rescaled quantization). When the codebook was fixed to d=64,
the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the
d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated
with scripts/compute_tq_codebooks.py, which now also emits the GLSL
boundaries array alongside the C codebook array so future codebook
updates keep the two in sync from one source of truth.
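The centroid/boundary relationship is mechanical, which is what makes hard-coded boundary tables risky. A generic sketch of the runtime derivation (illustrative names only — the real tables are TBQ3_B/TBQ4_B and TQ{3,4}_CODEBOOK_64):

```cpp
#include <cassert>

// For a sorted codebook of n centroids there are n-1 decision boundaries,
// each the midpoint between adjacent centroids. Deriving them at runtime
// (as the CPU's tq_compute_boundaries() does) keeps encoder and decoder in
// sync; hard-coding them (as the GLSL did) lets them drift when the
// codebook changes.
void compute_boundaries(const float * cb, float * b, int n) {
    for (int i = 0; i + 1 < n; ++i) {
        b[i] = 0.5f * (cb[i] + cb[i + 1]);
    }
}

// Quantize x to the first centroid whose upper boundary it does not exceed.
int quantize_index(float x, const float * b, int n) {
    int idx = 0;
    while (idx < n - 1 && x > b[idx]) {
        ++idx;
    }
    return idx;
}
```

Encoding with one codebook's midpoints and decoding with another codebook's centroids is exactly the mismatch described above: every value near a boundary lands on the wrong centroid.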

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64
(--chunks 1), Vulkan AMD RADV gfx1150:

  tbq3_0 / f16  fa=off:   1659 -> 230   (7.2x better)
  tbq4_0 / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=off:   ~    -> 230
  pq4_0  / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=on:    ~    -> 225  (now matches fa=off)
  pq4_0  / f16  fa=on:    ~    -> 299  (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue
in the FA QJL Stage-2 correction on d=64 blocks, addressed in a follow-
up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

    correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q).  The
scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used
to derive the FHT input by reading from Qf -- the shared-memory buffer
that already has the attention scale (1/sqrt(head_dim)) multiplied in
for the main Q*K dot -- and then dividing that value by p.scale to
recover the raw Q before multiplying by the QJL sign diagonal.

On cm1 Qf is f16, so the scale round-trip is lossy for the large-
magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer massive
activations.  On the scalar path Qf is f32 and p.scale is usually a
power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is bit-exact
in principle -- but empirically the pre-scaled-then-un-scaled read
still produced materially different FHT input than a raw-Q read.

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has
always read Q directly from src1 and gets correct results.  Match that
pattern in the FA path: read Q straight from the data_qv4 SSBO into
Qf_qjl_proj, bypassing Qf entirely.  cm2 already reads raw Q from
data_q directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test
(Vulkan AMD RADV gfx1150):

    wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
        fa=off         :   531
        fa=on, before  :  ~2000  (broken)
        fa=on, after   :   154

    wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
        fa=off         :   207
        fa=on, after   :    77

    K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other,
    confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on
`!x_non_contig`, which silently skipped the correction on the no-FA
attention path because `kq = mul_mat(k, q)` feeds in K after
`ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the
KV cache.  With that gate the TBQ attention on the no-FA path was
reduced to PQ Stage 1 (centroid-only), producing bit-identical
output to `pq3_0` / `pq4_0` and regressing quality vs a CPU-reference
TBQ run.  It was masked on the MUL_MAT test-backend-ops coverage
because those tests use contiguous src0.

Two changes:

  * mul_mm_tbq_qjl_correction.comp: index the A matrix by a real
    in-memory block stride instead of the `num_blocks_per_row = K /
    QUANT_K` shortcut.  `p.stride_a` and `p.batch_stride_a` are now
    interpreted as strides in BLOCK units (src0->nb[1] /
    sizeof(block) and src0->nb[2] / sizeof(block) respectively),
    matching the way `ggml_vk_flash_attn` already feeds `k_stride` /
    `k_offset` to the FA shader.  For a contiguous src0 the new
    stride equals the old num_blocks_per_row, so existing tests are
    unaffected.

  * ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` /
    `qjl_stride_batch_a` from src0->nb and drop the
    `!x_non_contig` exclusion from both the descriptor-set request
    and the dispatch site.  The `qx_buf_offset` still points at the
    original (pre-permute) TBQ blocks in d_Qx, which is exactly what
    the correction pass wants -- it just needed the real strides to
    reach the right block for each (row_a, batch_id).

Both paths have to be gated identically -- if only one of them is
changed, the dispatched pipeline ends up without a descriptor set
and `vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4
-n 128 (Vulkan AMD RADV gfx1150):

    K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
        before fix:    561 == 561 (QJL silently skipped, TBQ==PQ)
        after fix :    634 != 561 (QJL running, TBQ distinct from PQ)
        CPU ref   :    546

    K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
        before fix:    173 == 173 (QJL silently skipped)
        after fix :    133 != 173 (QJL running, 23% lower PPL)
        CPU ref   :    154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ.
FA is unaffected since it has its own inlined QJL epilogue.
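The stride reinterpretation amounts to the following (sketch; names follow the commit message, not the actual push-constant layout):

```cpp
#include <cassert>
#include <cstddef>

// Index of block j in row `row_a` of batch `batch_id`, with stride_a and
// batch_stride_a expressed in BLOCK units (src0->nb[1] / sizeof(block) and
// src0->nb[2] / sizeof(block)). For a contiguous src0, stride_a equals the
// old num_blocks_per_row = K / QUANT_K shortcut, so the two forms agree
// there; for a permuted view only the real strides reach the right block.
size_t tbq_block_index(size_t batch_id, size_t row_a, size_t j,
                       size_t stride_a, size_t batch_stride_a) {
    return batch_id * batch_stride_a + row_a * stride_a + j;
}
```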
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* MMQ for Q6_0

* Add Q6_0 MMQ to template generator

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
LifesLight pushed a commit to LifesLight/custom.llama.cpp that referenced this pull request May 2, 2026
Cherry-pick signalnine PR ggml-org#53: auto-asymmetric GQA + turbo VEC FA opts
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
