Skip to content

ci: fix turbo build + test failures on feature/turboquant-kv-cache#66

Merged
TheTom merged 1 commit into
TheTom:feature/turboquant-kv-cachefrom
apollosenvy:fix/ci-turbo-build
Apr 14, 2026
Merged

ci: fix turbo build + test failures on feature/turboquant-kv-cache#66
TheTom merged 1 commit into
TheTom:feature/turboquant-kv-cachefrom
apollosenvy:fix/ci-turbo-build

Conversation

@apollosenvy
Copy link
Copy Markdown

Picks the CI back up to green on top of feature/turboquant-kv-cache after the recent turbo merges. Every fix is the minimum needed to unblock the corresponding job.

Changes

1. ggml-cpu/ops.cppggml_compute_forward_clamp

The abort branch is guarded by -Werror=switch but is missing the three TurboQuant KV-cache types, which breaks every GCC build after the turbo types were introduced. Added TURBO2_0 / TURBO3_0 / TURBO4_0 alongside the existing TQ* cases so they fall through to GGML_ABORT.

2. ggml/include/ggml-rpc.h — static_assert

GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 → 97 but the RPC static_assert wasn't updated. Bumped RPC_PROTO_PATCH_VERSION 1 → 2 and the assert to 97. This unblocks ubuntu-latest-rpc, windows-latest-hip, and every other RPC-linked job.

3. tests/test-quantize-fns.cpp

Three related problems:

a. Heap-buffer-overflow (ASan) in dot_product_error / total_quantization_error
std::vector<uint8_t> tmp_q(2*test_size) is too small when vec_dot_type == GGML_TYPE_F32 (the case for turbo quants), because from_float writes test_size * sizeof(float) bytes into it. ASan trace:

==ERROR: heap-buffer-overflow ... WRITE of size 16384 at 0x... (8-byte region)
    #1 ggml_cpu_fp32_to_fp32 ggml/src/ggml-cpu/ggml-cpu.c:3487
    #2 dot_product_error     tests/test-quantize-fns.cpp:93

Fix: size the scratch buffers to std::max(2*test_size, test_size*sizeof(float)).

b. Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness
Their dequantize_row_turbo*_0 intentionally stays in the WHT-rotated domain — the inverse WHT is applied separately via GGML_OP_TURBO_WHT in the attention graph, so a round-trip against the original float buffer is not meaningful. (This is documented right in the turbo4 dequant: "No inverse WHT, dequant stays in the rotated domain … The inverse WHT is applied to the attention output via GGML_OP_TURBO_WHT (direction=1) in the graph.") The existing threshold table can't describe that, so we skip these types with an explanatory message, same pattern as deprecated types skipped via blck_size == 0.

c. TQ3_1S error budget
TQ3_1S is a 3-bit weight quant but currently pays the default 4-bit budgets. Added it to the 3-bit buckets (MAX_QUANTIZATION_TOTAL_ERROR_3BITS / MAX_DOT_PRODUCT_ERROR_LOWBIT), matching Q3_K / IQ3_S.

4. ggml-cuda/turbo-quant.cuh

cudaMemcpyToSymbol, cudaMemcpyFromSymbol, and cudaDeviceSynchronize are wrapped in CUDA_CHECK(...) so the HIP build (where hipError_t is [[nodiscard]]) no longer trips -Werror on unused-result. Fourteen call sites touched.

Verification

Local, on Arch Linux (CPU build only):

ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs" --timeout 300
...
100% tests passed, 0 tests failed out of 41

test-tokenizers-ggml-vocabs is excluded only because test-tokenizer-0 wasn't built in the partial target — CI builds it.

test-quantize-fns was also run under AddressSanitizer (build-sanitize-style config) and is now clean. Before this PR it aborted with malloc(): corrupted top size.

Related

  • Mirrors the RPC_PROTO_PATCH_VERSION bump approach used in Fix GGML_OP_COUNT assertion for RPC #65 (cpburnz).
  • Does not alter any turbo kernels or quant math — only switch coverage, buffer sizing, test thresholds, and CUDA_CHECK wrappers.

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TheTom
Copy link
Copy Markdown
Owner

TheTom commented Apr 14, 2026

in test queue

@TheTom
Copy link
Copy Markdown
Owner

TheTom commented Apr 14, 2026

tested on M2 Pro (macOS):

build: clean (Metal), clean (Metal + RPC)

test-quantize-fns: all pass (exit 0). turbo2/3/4 correctly skipped with "rotated-domain KV quant" message. tq3_1s/tq4_1s pass.

RPC static_assert: confirmed fixed, builds clean with -DGGML_RPC=ON

bench (1.5B Q4_K_M, turbo4 KV, p512/tg128): 1477 t/s prefill, 71.9 t/s decode — no regression

note: this PR overlaps with #65 on the RPC assert fix (both bump to 97). #65 is already merged so there may be a trivial merge conflict on that line. the rest of the changes (cpu ops, CUDA HIP, test harness) are independent and clean.

nice work on the ASan catch and the turbo type skip logic. thanks!

@TheTom TheTom merged commit 45f8a06 into TheTom:feature/turboquant-kv-cache Apr 14, 2026
22 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants