
Refactoring: Use STL in input processing and fix off-by-one bug #115

Closed
thomasantony wants to merge 1 commit into ggml-org:master from thomasantony:refactor_input_processing


Conversation

@thomasantony

There was an off-by-one bug in the line

if (embd.size() > params.n_batch) {
    break;
}

It would copy n_batch + 1 elements rather than n_batch. This PR fixes that and also switches to using the STL instead of a while loop.
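Concretely (an illustrative sketch, not the actual llama.cpp code — `take_batch`, `input`, and `pos` are stand-in names): checking `>` only breaks after the vector has already grown to n_batch + 1, so the comparison has to be `>=`; an STL-style variant sidesteps the loop entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Loop-based form with the off-by-one fixed: stop once the batch is full,
// i.e. check >= n_batch (the buggy `>` broke one push_back too late).
std::vector<int> take_batch(const std::vector<int> & input, size_t pos, size_t n_batch) {
    std::vector<int> embd;
    while (pos < input.size()) {
        embd.push_back(input[pos++]);
        if (embd.size() >= n_batch) {
            break;
        }
    }
    return embd;
}

// STL-style equivalent in the spirit of this PR (the exact upstream code may
// differ): copy min(remaining, n_batch) elements in one shot.
std::vector<int> take_batch_stl(const std::vector<int> & input, size_t pos, size_t n_batch) {
    const size_t n = std::min(input.size() - pos, n_batch);
    return std::vector<int>(input.begin() + pos, input.begin() + pos + n);
}
```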

@thomasantony force-pushed the refactor_input_processing branch from faacdb5 to 68897eb on March 14, 2023 05:07
ggerganov added a commit that referenced this pull request Mar 19, 2023
@ggerganov
Member

Thanks for spotting this!
For now I prefer to keep the loop-based implementation - I find it easier to read.
Pushed the off-by-one fix to master.

@ggerganov ggerganov closed this Mar 19, 2023
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 17, 2026
Apply the loop-shape suggested in PR review (ggml-org#115 r3099918856) to the
remaining TBQ/PQ vec_dot kernels in ggml/src/ggml-cpu/quants.c, mirroring
the pattern already used by ggml_vec_dot_tbq3_0_64_q8_0:

  - one fewer scratch buffer: drop tmp_y[QK8_0]; q8_0 dequantizes
    directly into the per-sub-block slice of tmp_y_full (TBQ),
    or into a single reused tmp_y (PQ, no QJL stage).
  - one fewer copy loop: no more "tmp_y_full[j*QK8_0+k] = tmp_y[k]"
    next to the dot accumulation.
  - one fewer induction variable: replace the running pos++ with the
    closed-form index &y[i*M + j], with M = QK_{TQ,TQ_64} / QK8_0.
  - shorter dependency chain on sumf: accumulate per q8_0 sub-block
    into a local 'acc', sum sub-blocks into a 'base', then update sumf
    once per outer block (TBQ adds the QJL correction at the same time).
  - two software prefetches per outer iteration for the next x and y
    blocks (__builtin_prefetch with rw=0, locality=0).
  - GGML_RESTRICT on inner xb/yb pointers to make non-aliasing explicit.
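On a plain-float stand-in the resulting loop shape looks like this (illustrative only — the real kernels dequantize quantized blocks and add a QJL correction; `M`, `K`, `acc`, and `base` follow the names used above):

```cpp
#include <cassert>

// Blocked dot product in the refactored shape: nb outer blocks, each made of
// M sub-blocks of K elements. No running pos++ (closed-form index), a local
// per-sub-block `acc`, a per-outer-block `base`, and a single sumf update per
// outer block to shorten the dependency chain on sumf.
float dot_blocks(const float * x, const float * y, int nb, int M, int K) {
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        float base = 0.0f;
        for (int j = 0; j < M; ++j) {
            const float * xb = &x[(i*M + j)*K]; // closed-form, no pos++
            const float * yb = &y[(i*M + j)*K];
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += xb[k] * yb[k];
            }
            base += acc;
        }
        sumf += base; // one update per outer block
    }
    return sumf;
}
```

Only the FP32 association changes relative to a single flat accumulator, which is why the commit can claim the algorithm itself is untouched.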

Kernels touched:

  - ggml_vec_dot_tbq3_0_q8_0
  - ggml_vec_dot_tbq4_0_q8_0
  - ggml_vec_dot_pq3_0_q8_0
  - ggml_vec_dot_pq3_0_64_q8_0
  - ggml_vec_dot_pq4_0_q8_0
  - ggml_vec_dot_pq4_0_64_q8_0

The *_64 TBQ kernels already followed this shape and are unchanged.
The *_generic wrappers just forward to the main kernels and inherit
the changes automatically.

Numerics: the algorithm is unchanged; only the FP32 reduction order
inside each block changes (associative-only reordering). The QJL
correction term and all dequantization paths are bit-identical.

Validation on a CPU-only build (Ryzen AI 7 PRO 360, GGML_VULKAN=OFF):

  - tests/test-turboquant.sh: all checks pass
      * test-quantize-fns: ok for all TBQ/PQ types
      * test-quantize-perf perf-sanity gates: ok
      * test-backend-ops -p "tbq|pq": ok
  - test-quantize-perf vec_dot_q (4 KB L1-resident microbench, n=10):
    neutral within ~5% noise, as expected for a fully cache-hot loop
    where dequantize_row_* dominates and prefetch cannot help.
  - llama-bench (Mistral-7B-Instruct-v0.3-Q4_K_S, pp128/tg32, fa=on,
    K=tbq3_0 V=pq3_0, 4 alternating BEFORE/AFTER trials with cooldown):
      pp: +3.0% mean, won 4/4 trials
      tg: +3.4% mean, won 4/4 trials
  - llama-perplexity spot-check on 4 K/V configs x 2 n_ctx (8 datapoints,
    Mistral-7B-Q4_K_S, wikitext-2, 2 chunks each, fa=on):
      5/8 datapoints PPL <= baseline, 3/8 slightly above
      mean delta = -0.09%, range [-1.49%, +1.22%]
      all well inside per-chunk stdev (~+/-1.3 absolute, ~+/-11%)
    -> symmetric around zero, consistent with FP32 reduction-order
       rounding noise; no regression.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 17, 2026
The `GG_BUILD_LOW_PERF` → `-DGGML_NATIVE=OFF` append was placed inside
`gg_run_ctest_release`, but the top-level driver runs `gg_run
ctest_debug` first and `gg_run ctest_release` second. As a result the
debug build on the low-perf CI runners (`ggml-ci-x64-cpu-low-perf`
and `ggml-ci-arm64-cpu-low-perf`) was compiled with `-march=native`
against the build host's CPU and then executed on a different,
older-microarch runner in the pool, producing SIGILL during
ctest_debug. Ref PR ggml-org#115 CI run 24463228694.

Move the append into the top-level flag-handling block, right after
`GG_BUILD_NO_SVE`, so `CMAKE_EXTRA` gets `-DGGML_NATIVE=OFF` once,
before either ctest function is invoked, and both debug and release
builds pick it up. Remove the duplicate inside `gg_run_ctest_release`.

No workflow change required: `.github/workflows/build.yml` already
exports `GG_BUILD_LOW_PERF=1` for the two low-perf jobs, which was
correct; the bug was purely a scoping error in `ci/run.sh`. The other
`GG_BUILD_LOW_PERF` checks in the script (ctest label filter, and the
top-level branches that skip heavier test functions) are left
untouched — they were already at the correct scope.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
diff --git a/ci/run.sh b/ci/run.sh
index 7fa469b..5c2f7e7 100755
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -118,6 +118,15 @@ if [ ! -z ${GG_BUILD_NO_SVE} ]; then
     CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.5-a+fp16+i8mm"
 fi

+# Disable native CPU optimizations for low-perf builds to ensure binary
+# compatibility with the (often heterogeneous) CI runner pool. Must be applied
+# at the top level so BOTH gg_run_ctest_debug and gg_run_ctest_release pick it
+# up — otherwise the debug build (which runs first) compiles with -march=native
+# and can SIGILL on a runner whose microarch is older than the build host.
+if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
+    CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
+fi
+
 if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
     echo ">>===== Enabling KleidiAI support"

@@ -236,11 +245,6 @@ function gg_run_ctest_release {
     # Check cmake, make and ctest are installed
     gg_check_build_requirements

-    # Disable native CPU optimizations for low-perf builds to ensure compatibility
-    if [ ! -z ${GG_BUILD_LOW_PERF} ]; then
-        CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_NATIVE=OFF"
-    fi
-
     (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
     (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log

diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 247775b..bcc1759 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -49,6 +49,25 @@ static ggml_tensor * ggml_rotate_hadamard(
     return res;
 }

+// Allocate the Hadamard rotation input used by ggml_rotate_hadamard() for a
+// TurboQuant/PolarQuant K or V stream. Size is the largest power-of-two that
+// divides n_embd_head (>= 64), so the rotation matches the head dim and the
+// PQ/TBQ codebooks see the full d-wide rotated distribution they were fitted
+// to. Returns nullptr when can_rot is false.
+static ggml_tensor * build_hadamard_rot(ggml_context * ctx, bool can_rot, int n_embd_head) {
+    if (!can_rot) {
+        return nullptr;
+    }
+
+    int nrot = 64;
+    do { nrot *= 2; } while (n_embd_head % nrot == 0);
+    nrot /= 2;
+
+    ggml_tensor * rot = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nrot, nrot);
+    ggml_set_input(rot);
+    return rot;
+}
+
 void llm_graph_input_embd::set_input(const llama_ubatch * ubatch) {
     if (ubatch->token) {
         const int64_t n_tokens = ubatch->n_tokens;
@@ -1626,30 +1645,13 @@ static std::unique_ptr<llm_graph_input_attn_kv> build_attn_inp_kv_impl(
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return inp;
@@ -1947,30 +1949,13 @@ llm_graph_input_attn_kv_iswa * llm_graph_context::build_attn_inp_kv_iswa() const
             hparams.n_embd_head_k % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_k());

-        if (can_rotk) {
-            int nrot = 64;
-            do { nrot *= 2; } while (hparams.n_embd_head_k % nrot == 0);
-            nrot /= 2;
-
-            inp->self_rotk = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotk);
-        } else {
-            inp->self_rotk = nullptr;
-        }
-
         const bool can_rotv =
             !hparams.is_n_embd_v_gqa_variable() &&
             hparams.n_embd_head_v % 64 == 0 &&
             ggml_is_quantized(mctx_cur->get_base()->type_v());

-        if (can_rotv) {
-            int nrot = 64;
-
-            inp->self_rotv = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, nrot, nrot);
-            ggml_set_input(inp->self_rotv);
-        } else {
-            inp->self_rotv = nullptr;
-        }
+        inp->self_rotk = build_hadamard_rot(ctx0, can_rotk, hparams.n_embd_head_k);
+        inp->self_rotv = build_hadamard_rot(ctx0, can_rotv, hparams.n_embd_head_v);
     }

     return (llm_graph_input_attn_kv_iswa *) res->add_input(std::move(inp));
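The `nrot` computation the new helper factors out picks the largest power of two (at least 64) that divides `n_embd_head`. Isolated as a pure function (sketch; tensor allocation omitted, and the `can_rot` conditions guarantee `n_embd_head % 64 == 0`):

```cpp
#include <cassert>

// Largest power-of-two divisor of n_embd_head that is >= 64, exactly as the
// do/while loop in build_hadamard_rot() computes it. Assumes n_embd_head is
// a positive multiple of 64 (guaranteed by the can_rot checks in the caller).
int hadamard_nrot(int n_embd_head) {
    int nrot = 64;
    do { nrot *= 2; } while (n_embd_head % nrot == 0);
    return nrot / 2;
}
```

So a head dim of 128 gets a full 128-wide rotation, while 192 (= 64 * 3) falls back to a 64-wide one.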
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
Register explicit `MUL_MAT` test cases for the TurboQuant / PolarQuant
types (`tbq3_0`, `tbq4_0`, `pq3_0`, `pq4_0`) with `type_b ∈ {f32, f16}`
and sizes that span both dispatch paths:

  - n=1, n=8   -> mul_mat_vec path (decode-like)
  - n=16, n=32 -> dequant + f16 matmul path (prefill-like)

Motivation: the existing TBQ/PQ coverage in `test-backend-ops` only
registers `FLASH_ATTN_EXT` cases, so the `tests/test-turboquant.sh`
filter (`test-backend-ops test -p "tbq|pq"`) never exercises the
standalone `MUL_MAT` path. That path is the one reported as
`supports_op == yes` on NV `VK_NV_cooperative_matrix2` devices but has
no matching pipeline created in `pipeline_dequant_mul_mat_mat_f16[]`
(see `ggml-vulkan.cpp:3412` — "TBQ/PQ cm2 matmul shaders not yet
generated"). With these cases in place:

  - On NV coopmat2 (RTX 5090): the n>=16 cases segfault, matching the
    external reproduction from the PR ggml-org#115 review and making the bug
    visible to CI.
  - On KHR coopmat1 (AMD gfx1150): the n>=16 cases return numerical
    garbage (err ~ 1.0), exposing the same missing-fallback issue in
    a non-crashing form.
  - The n=1/n=8 cases continue to pass via the existing
    `mul_mat_vec_tbq*_0` / `mul_mat_vec_pq*_0` shaders, so the new
    coverage cleanly isolates which dispatch path is broken.

No source changes to the Vulkan backend; this commit only adds the
test cases needed so the pre-existing bug is caught by
`tests/test-turboquant.sh`.
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 20, 2026
…d=64 variants

Before this patch, `supports_op` reported TBQ3_0/TBQ4_0/PQ3_0/PQ4_0 (and
their `_64` / head_dim=64 variants) as supported for `GGML_OP_MUL_MAT` on
the Vulkan backend, but there was no working pipeline behind it. This is
the state the external review flags as Issue 3 on PR ggml-org#115: on cm2 (RTX
5090) the support probe claims support, the correctness run then
segfaults. The previous `test-backend-ops` commit adds the exact repro
for this.

Root causes:

  - On cm2 (NV coopmat2) devices the slot in
    `pipeline_dequant_mul_mat_mat_f16[]` was empty - the shader was
    never generated - so dispatches crashed when flash attention was
    not used.

  - On cm1 and scalar devices the slot in
    `pipeline_dequant_mul_mat_mat[]` was wired up, but `mul_mm_funcs.glsl`
    had no `load_a_to_shmem` implementation for TBQ/PQ. The generic
    `mul_mm.comp` ran with uninitialized shared memory and produced
    near-random output.

  - Even once data loading was fixed, TBQ3_0/TBQ4_0 still produced a
    small bias for `n > mul_mat_vec_max_cols` because the QJL Stage 2
    correction that `mul_mat_vec_tbq*_0.comp` applies in the vec path
    has no equivalent in the generic matmul shader.

  - For head_dim=64 models that use the `_64` block variants
    (TBQ*_0_64 / PQ*_0_64) the `_64` mul_mm pipelines, the `_64` QJL
    correction shaders, and the `_64`-sized Lloyd-Max codebook / sign
    arrays were missing, and the vec path is intentionally skipped for
    `_64` (so every `n` needs the full matmul + QJL correction pair).

Scope of this patch is strictly the standalone `MUL_MAT` path with TBQ/PQ
`src0` and f32 `src1` that the Issue 3 repro hits, i.e. the `-fa off` K
matmul. Fused flash attention (scalar, cm1, cm2) already handles QJL
correctly and is unchanged. MoE FFN weights are not affected either:
TBQ/PQ are KV-cache quantizations (there is no `llama-quantize` target
that produces TBQ/PQ model weights), so `MUL_MAT_ID` never sees them as
`src0` - attention in MoE models is a plain `MUL_MAT` /
`FLASH_ATTN_EXT` and reuses the same fix.

The upstream "V cache quantization requires flash_attn" context-level
guard in `src/llama-context.cpp` is intentionally left unchanged: the V
matmul under `-fa off` uses a transposed quantized-V layout populated
by `ggml_set_rows` with row_size=1, which corrupts any `blck_size > 1`
type at write time (reproducible on CPU as well), and that is a
separate KV-cache issue out of scope here.

Changes:

  - Add `load_a_to_shmem` implementations for TBQ3_0, TBQ4_0, PQ3_0,
    PQ4_0 (and their `_64` variants) in `mul_mm_funcs.glsl`, reusing
    `tbq3_dequant_raw` / `tbq4_dequant_raw` from `tq_utils.comp`.
    This makes `mul_mm.comp` correct for the centroid part of
    dequantization (`tbq*_dequant_raw(qs) * d`) on all eight types.

  - `tq_utils.comp`: pick Stage-1 / QJL-Stage-2 sign bitmasks and the
    Lloyd-Max codebook (TBQ3_CB / TBQ4_CB) based on whether any
    `DATA_{A,K,V}_*_0_64` is defined. d=64 blocks use seeds 43/139 and
    a wider codebook (sigma = 1/sqrt(d) is larger at d=64 than at
    d=128); previously the shader hardcoded the d=128 constants, so
    the d=64 variants silently dequantized against the wrong codebook.

  - New shader `mul_mm_tbq_qjl_correction.comp`. It runs after the
    main matmul as an additive pass: one workgroup per
    `(row, col, batch)`, `QUANT_K` threads performing the same
    Walsh-Hadamard butterfly + `qjl[]` dot product as the vec shader,
    and accumulates `d_r * sqrt(pi/2) / QUANT_K * sum_qjl(H(B))` into
    D. Parameterized over `QUANT_K` so the same source emits both
    `_128` and `_64` SPIR-V. Only TBQ3_0 and TBQ4_0 (and `_64`) have
    `d_r`/`qjl`, so only those four get a correction pipeline.

  - `vulkan-shaders-gen.cpp`:
      * Register the eight correction variants
        (`mul_mm_qjl_{tbq3_0,tbq4_0}{,_64}_{f32,f16}`).
      * Emit `matmul_{tbq,pq}{3,4}_0_64_{f32,f16}[_aligned]` for
        `mul_mm.comp`, in a dedicated block outside the main
        `type_names` loop so we don't cascade through FA / MUL_MAT_ID /
        get_rows / ... which either already have dedicated `_64`
        handling (FA) or don't apply to TBQ/PQ at all.

  - `ggml-vulkan.cpp`:
      * Add `pipeline_mul_mm_tbq_qjl[GGML_TYPE_COUNT][2]` on the device
        and create pipelines at init time for all four TBQ types (128
        and 64 block sizes).
      * In `ggml_vk_get_mul_mat_mat_pipeline`, let cm2 fall through to
        the cm1/scalar pipeline when no cm2 `_mat_f16` shader exists
        for a given TBQ/PQ type, so cm2 devices stop segfaulting on
        these types.
      * Register TBQ/PQ `_64` in `supports_op` `MUL_MAT` switch so
        d=64 models are actually routed to the new pipelines instead
        of falling back to CPU.
      * Force `split_k = 1` for TBQ `src0` - the QJL correction pass
        would otherwise be added once per split.
      * Dispatch the QJL correction pass after the main matmul for
        TBQ3_0 / TBQ4_0 (and `_64`). For `_128` it's gated on
        `n > mul_mat_vec_max_cols` (the vec path already corrects for
        smaller n); for `_64` it runs unconditionally because there is
        no vec path on this block size.

Verified on AMD RDNA3.5 (RADV gfx1150, no cm1/cm2) with
`test-backend-ops -o MUL_MAT -b Vulkan0`: all 32 TBQ/PQ x {128, 64} x
n in {1,8,16,32} cases pass against f32 B. f16 B for standalone
MUL_MAT on TBQ/PQ still reports `not supported` on this device (the
scalar/cm1 pipeline consumes f32 src1), which is consistent with the
matmul path shipping f32 src1 on the `-fa off` decode/prefill paths
used by these tests. cm2 verification is expected to run on the
reviewer's RTX 5090 via the Issue 3 repro branch.

debug: sentinel in QJL correction

vulkan: add dequantize-to-f16 cpy shaders so MUL_MAT with non-contiguous quantized src0 can run on GPU
jesusmb1995 added a commit to jesusmb1995/llama.cpp that referenced this pull request Apr 21, 2026
vulkan: add d=64 decision boundaries for TBQ/PQ copy_to_quant

Commit 6e26e8b ("vulkan: fix TBQ/PQ standalone MUL_MAT path, QJL
correction pass, and d=64 variants") added TQ_D64-gated d=64 Lloyd-Max
codebook centroids and random sign diagonals to tq_utils.comp, so that
the Vulkan encoder, FA decoder, and non-FA QJL correction pass all match
the CPU reference's d=64 constants (ggml-quants.c: TQ{3,4}_CODEBOOK_64,
TQ/QJL_SIGN_SEED_64). It missed the corresponding d=64 *decision
boundaries* in copy_to_quant.comp, however.

The boundaries are the midpoints between adjacent codebook centroids and
determine which centroid an input coordinate is quantized to. CPU derives
them at runtime from the selected codebook via tq_compute_boundaries(),
so it automatically used the d=64 midpoints for _64 blocks. The Vulkan
encoder hard-codes them as const float TBQ3_B[7] / TBQ4_B[15], and those
constants remained at the d=128 midpoints even inside the #if block that
accepts both DATA_A_TBQ*_0 and DATA_A_TBQ*_0_64.

Net effect on a head_dim=64 model (Qwen2.5-0.5B):

  - copy_to_quant bucketed each coordinate by the narrower d=128
    boundaries (centroids spaced ~sigma=1/sqrt(128)).
  - The resulting index was then dequantized with the wider d=64
    centroids (spaced ~sigma=1/sqrt(64)), a completely different
    alphabet.
  - Every value near a boundary landed on the wrong centroid.

Before commit 6e26e8b this was silently consistent: the codebook was
also d=128, so encoder and decoder were at least in agreement (just
producing a rescaled quantization). When the codebook was fixed to d=64,
the boundaries had to move with it.

Add a #if defined(TQ_D64) branch for TBQ3_B and TBQ4_B that uses the
d=64 midpoints computed from TQ{3,4}_CODEBOOK_64. Values regenerated
with scripts/compute_tq_codebooks.py, which now also emits the GLSL
boundaries array alongside the C codebook array so future codebook
updates keep the two in sync from one source of truth.
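The centroid/boundary relationship is mechanical, which is what makes hard-coded boundary tables risky. A generic sketch of the runtime derivation (illustrative names only — the real tables are TBQ3_B/TBQ4_B and TQ{3,4}_CODEBOOK_64):

```cpp
#include <cassert>

// For a sorted codebook of n centroids there are n-1 decision boundaries,
// each the midpoint between adjacent centroids. Deriving them at runtime
// (as the CPU's tq_compute_boundaries() does) keeps encoder and decoder in
// sync; hard-coding them (as the GLSL did) lets them drift when the
// codebook changes.
void compute_boundaries(const float * cb, float * b, int n) {
    for (int i = 0; i + 1 < n; ++i) {
        b[i] = 0.5f * (cb[i] + cb[i + 1]);
    }
}

// Quantize x to the first centroid whose upper boundary it does not exceed.
int quantize_index(float x, const float * b, int n) {
    int idx = 0;
    while (idx < n - 1 && x > b[idx]) {
        ++idx;
    }
    return idx;
}
```

Encoding with one codebook's midpoints and decoding with another codebook's centroids is exactly the mismatch described above: every value near a boundary lands on the wrong centroid.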

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test offset_64
(--chunks 1), Vulkan AMD RADV gfx1150:

  tbq3_0 / f16  fa=off:   1659 -> 230   (7.2x better)
  tbq4_0 / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=off:   ~    -> 230
  pq4_0  / f16  fa=off:   ~    -> 300
  pq3_0  / f16  fa=on:    ~    -> 225  (now matches fa=off)
  pq4_0  / f16  fa=on:    ~    -> 299  (now matches fa=off)

tbq3_0/tbq4_0 fa=on still diverges from fa=off due to a separate issue
in the FA QJL Stage-2 correction on d=64 blocks, addressed in a follow-
up patch.

vulkan: read raw Q from the input SSBO for the FA QJL projection

The flash-attention shaders compute the QJL Stage-2 correction as

    correction = d_r * sqrt(pi/2) / QUANT_K * (2*pos_sum - proj_q_sum)

where (pos_sum, proj_q_sum) are reductions over FHT(D_qjl * Q).  The
scalar (flash_attn.comp) and coopmat1 (flash_attn_cm1.comp) paths used
to derive the FHT input by reading from Qf -- the shared-memory buffer
that already has the attention scale (1/sqrt(head_dim)) multiplied in
for the main Q*K dot -- and then dividing that value by p.scale to
recover the raw Q before multiplying by the QJL sign diagonal.

On cm1 Qf is f16, so the scale round-trip is lossy for the large-
magnitude activations seen in e.g. Qwen2.5-0.5B's first-layer massive
activations.  On the scalar path Qf is f32 and p.scale is usually a
power of two (1/sqrt(head_dim)), so x * p.scale / p.scale is bit-exact
in principle -- but empirically the pre-scaled-then-un-scaled read
still produced materially different FHT input than a raw-Q read.

The standalone non-FA QJL shader (mul_mm_tbq_qjl_correction.comp) has
always read Q directly from src1 and gets correct results.  Match that
pattern in the FA path: read Q straight from the data_qv4 SSBO into
Qf_qjl_proj, bypassing Qf entirely.  cm2 already reads raw Q from
data_q directly, so it does not need the change.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wikitext-2 test
(Vulkan AMD RADV gfx1150):

    wiki.test --chunks 4 -n 128, K=tbq3_0/f16:
        fa=off         :   531
        fa=on, before  :  ~2000  (broken)
        fa=on, after   :   154

    wiki.test --chunks 4 -n 128, K=tbq4_0/f16:
        fa=off         :   207
        fa=on, after   :    77

    K=f16/f16 and K=pq*/V=f16 fa=on/off stay within 1% of each other,
    confirming the change is confined to the TBQ QJL path.

vulkan: run non-FA TBQ QJL correction on permuted src0 too

The standalone MUL_MAT QJL (Stage 2) correction pass was gated on
`!x_non_contig`, which silently skipped the correction on the no-FA
attention path because `kq = mul_mat(k, q)` feeds in K after
`ggml_permute(k, 0, 2, 1, 3)` -- a non-dim01-contiguous view of the
KV cache.  With that gate the TBQ attention on the no-FA path was
reduced to PQ Stage 1 (centroid-only), producing bit-identical
output to `pq3_0` / `pq4_0` and regressing quality vs a CPU-reference
TBQ run.  It was masked on the MUL_MAT test-backend-ops coverage
because those tests use contiguous src0.

Two changes:

  * mul_mm_tbq_qjl_correction.comp: index the A matrix by a real
    in-memory block stride instead of the `num_blocks_per_row = K /
    QUANT_K` shortcut.  `p.stride_a` and `p.batch_stride_a` are now
    interpreted as strides in BLOCK units (src0->nb[1] /
    sizeof(block) and src0->nb[2] / sizeof(block) respectively),
    matching the way `ggml_vk_flash_attn` already feeds `k_stride` /
    `k_offset` to the FA shader.  For a contiguous src0 the new
    stride equals the old num_blocks_per_row, so existing tests are
    unaffected.

  * ggml_vk_mul_mat_q_f16: compute `qjl_stride_a` /
    `qjl_stride_batch_a` from src0->nb and drop the
    `!x_non_contig` exclusion from both the descriptor-set request
    and the dispatch site.  The `qx_buf_offset` still points at the
    original (pre-permute) TBQ blocks in d_Qx, which is exactly what
    the correction pass wants -- it just needed the real strides to
    reach the right block for each (row_a, batch_id).

Both paths have to be gated identically -- if only one of them is
changed, the dispatched pipeline ends up without a descriptor set
and `vkCmdPushConstants` crashes with VK_NULL_HANDLE layout.

Measured impact on Qwen2.5-0.5B-Instruct-Q8_0, wiki.test --chunks 4
-n 128 (Vulkan AMD RADV gfx1150):

    K=tbq3_0 V=f16 fa=off (vs fa=off pq3_0):
        before fix:    561 == 561 (QJL silently skipped, TBQ==PQ)
        after fix :    634 != 561 (QJL running, TBQ distinct from PQ)
        CPU ref   :    546

    K=tbq4_0 V=f16 fa=off (vs fa=off pq4_0):
        before fix:    173 == 173 (QJL silently skipped)
        after fix :    133 != 173 (QJL running, 23% lower PPL)
        CPU ref   :    154

f16/f16 and pq* paths are unchanged: the gate only opens for TBQ.
FA is unaffected since it has its own inlined QJL epilogue.
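The stride reinterpretation amounts to the following (sketch; names follow the commit message, not the actual push-constant layout):

```cpp
#include <cassert>
#include <cstddef>

// Index of block j in row `row_a` of batch `batch_id`, with stride_a and
// batch_stride_a expressed in BLOCK units (src0->nb[1] / sizeof(block) and
// src0->nb[2] / sizeof(block)). For a contiguous src0, stride_a equals the
// old num_blocks_per_row = K / QUANT_K shortcut, so the two forms agree
// there; for a permuted view only the real strides reach the right block.
size_t tbq_block_index(size_t batch_id, size_t row_a, size_t j,
                       size_t stride_a, size_t batch_stride_a) {
    return batch_id * batch_stride_a + row_a * stride_a + j;
}
```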
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* MMQ for Q6_0

* Add Q6_0 MMQ to template generator

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
LifesLight pushed a commit to LifesLight/custom.llama.cpp that referenced this pull request May 2, 2026
Cherry-pick signalnine PR ggml-org#53: auto-asymmetric GQA + turbo VEC FA opts
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
