ci: fix turbo build + test failures on feature/turboquant-kv-cache by apollosenvy · Pull Request #66 · TheTom/llama-cpp-turboquant

apollosenvy · 2026-04-10T07:32:00Z

Picks the CI back up to green on top of feature/turboquant-kv-cache after the recent turbo merges. Every fix is the minimum needed to unblock the corresponding job.

Changes

1. `ggml-cpu/ops.cpp` — `ggml_compute_forward_clamp`

The abort branch is guarded by -Werror=switch but is missing the three TurboQuant KV-cache types, which breaks every GCC build after the turbo types were introduced. Added TURBO2_0 / TURBO3_0 / TURBO4_0 alongside the existing TQ* cases so they fall through to GGML_ABORT.

2. `ggml/include/ggml-rpc.h` — static_assert

GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 → 97 but the RPC static_assert wasn't updated. Bumped RPC_PROTO_PATCH_VERSION 1 → 2 and the assert to 97. This unblocks ubuntu-latest-rpc, windows-latest-hip, and every other RPC-linked job.

3. `tests/test-quantize-fns.cpp`

Three related problems:

a. Heap-buffer-overflow (ASan) in dot_product_error / total_quantization_error
std::vector<uint8_t> tmp_q(2*test_size) is too small when vec_dot_type == GGML_TYPE_F32 (the case for turbo quants), because from_float writes test_size * sizeof(float) bytes into it. ASan trace:

==ERROR: heap-buffer-overflow ... WRITE of size 16384 at 0x... (8-byte region)
    #1 ggml_cpu_fp32_to_fp32 ggml/src/ggml-cpu/ggml-cpu.c:3487
    #2 dot_product_error     tests/test-quantize-fns.cpp:93

Fix: size the scratch buffers to std::max(2*test_size, test_size*sizeof(float)).

b. Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness
Their dequantize_row_turbo*_0 intentionally stays in the WHT-rotated domain — the inverse WHT is applied separately via GGML_OP_TURBO_WHT in the attention graph, so a round-trip against the original float buffer is not meaningful. (This is documented right in the turbo4 dequant: "No inverse WHT, dequant stays in the rotated domain … The inverse WHT is applied to the attention output via GGML_OP_TURBO_WHT (direction=1) in the graph.") The existing threshold table can't describe that, so we skip these types with an explanatory message, same pattern as deprecated types skipped via blck_size == 0.

c. TQ3_1S error budget
TQ3_1S is a 3-bit weight quant but currently pays the default 4-bit budgets. Added it to the 3-bit buckets (MAX_QUANTIZATION_TOTAL_ERROR_3BITS / MAX_DOT_PRODUCT_ERROR_LOWBIT), matching Q3_K / IQ3_S.

4. `ggml-cuda/turbo-quant.cuh`

cudaMemcpyToSymbol, cudaMemcpyFromSymbol, and cudaDeviceSynchronize are wrapped in CUDA_CHECK(...) so the HIP build (where hipError_t is [[nodiscard]]) no longer trips -Werror on unused-result. Fourteen call sites touched.

Verification

Local, on Arch Linux (CPU build only):

ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs" --timeout 300
...
100% tests passed, 0 tests failed out of 41

test-tokenizers-ggml-vocabs is excluded only because test-tokenizer-0 wasn't built in the partial target — CI builds it.

test-quantize-fns was also run under AddressSanitizer (build-sanitize-style config) and is now clean. Before this PR it aborted with malloc(): corrupted top size.

Four changes that together make the TheTom/llama-cpp-turboquant CI green on top of feature/turboquant-kv-cache: 1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp The abort branch's -Werror=switch was missing the three TurboQuant KV cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build after the turbo types were added. Added them next to the TQ* cases so the fall-through to GGML_ABORT stays consistent. 2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC static_assert was not updated. Bumped the patch version 1->2 and the assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build. 3. tests/test-quantize-fns.cpp a) Heap-buffer-overflow (ASan trace): the scratch buffers std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants), because from_float then writes test_size*sizeof(float) bytes into them. Size buffers to std::max(2*test_size, test_size*sizeof(float)) so every quant's from_float fits. b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their dequantize_row_turbo*_0 intentionally stays in the WHT-rotated domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT in the attention graph, so a round-trip against the original float buffer is not meaningful and the existing threshold table does not apply. c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to 3-bit error budgets instead of the default 4-bit ones. 4. ggml-cuda/turbo-quant.cuh Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]]) no longer trips -Werror on unused-result. Verified locally on Arch Linux (CPU build): ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs" => 100% tests passed, 0 tests failed out of 41 (the excluded tokenizer test needs the model-fetch fixture) test-quantize-fns runs clean under AddressSanitizer as well. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TheTom · 2026-04-14T18:36:55Z

in test queue

TheTom · 2026-04-14T18:41:46Z

tested on M2 Pro (macOS):

build: clean (Metal), clean (Metal + RPC)

test-quantize-fns: all pass (exit 0). turbo2/3/4 correctly skipped with "rotated-domain KV quant" message. tq3_1s/tq4_1s pass.

RPC static_assert: confirmed fixed, builds clean with -DGGML_RPC=ON

bench (1.5B Q4_K_M, turbo4 KV, p512/tg128): 1477 t/s prefill, 71.9 t/s decode — no regression

note: this PR overlaps with #65 on the RPC assert fix (both bump to 97). #65 is already merged so there may be a trivial merge conflict on that line. the rest of the changes (cpu ops, CUDA HIP, test harness) are independent and clean.

nice work on the ASan catch and the turbo type skip logic. thanks!

github-actions Bot added Nvidia GPU ggml testing labels Apr 10, 2026

Titaniumtown approved these changes Apr 10, 2026

View reviewed changes

TheTom merged commit 45f8a06 into TheTom:feature/turboquant-kv-cache Apr 14, 2026
22 of 47 checks passed

apollosenvy mentioned this pull request Apr 17, 2026

feat: ROCm/HIP support for turbo3 KV cache (gfx1100/RDNA3) #5

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: fix turbo build + test failures on feature/turboquant-kv-cache#66

ci: fix turbo build + test failures on feature/turboquant-kv-cache#66
TheTom merged 1 commit into
TheTom:feature/turboquant-kv-cachefrom
apollosenvy:fix/ci-turbo-build

apollosenvy commented Apr 10, 2026

Uh oh!

TheTom commented Apr 14, 2026

Uh oh!

TheTom commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

apollosenvy commented Apr 10, 2026

Changes

1. ggml-cpu/ops.cpp — ggml_compute_forward_clamp

2. ggml/include/ggml-rpc.h — static_assert

3. tests/test-quantize-fns.cpp

4. ggml-cuda/turbo-quant.cuh

Verification

Related

Uh oh!

TheTom commented Apr 14, 2026

Uh oh!

TheTom commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. `ggml-cpu/ops.cpp` — `ggml_compute_forward_clamp`

2. `ggml/include/ggml-rpc.h` — static_assert

3. `tests/test-quantize-fns.cpp`

4. `ggml-cuda/turbo-quant.cuh`