ci: fix turbo build + test failures on feature/turboquant-kv-cache#66
Merged
TheTom merged 1 commit into TheTom:feature/turboquant-kv-cache on Apr 14, 2026
Conversation
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The switch feeding the abort branch is compiled with -Werror=switch but
was missing the three TurboQuant KV cache types
(TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build after the
turbo types were added. Added them next to the TQ* cases so the
fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Sized the buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Titaniumtown approved these changes on Apr 10, 2026
Owner

in test queue
Owner

tested on M2 Pro (macOS):
- build: clean (Metal), clean (Metal + RPC)
- test-quantize-fns: all pass (exit 0). turbo2/3/4 correctly skipped with "rotated-domain KV quant" message. tq3_1s / tq4_1s pass.
- RPC static_assert: confirmed fixed, builds clean
- bench (1.5B Q4_K_M, turbo4 KV, p512/tg128): 1477 t/s prefill, 71.9 t/s decode (no regression)

note: this PR overlaps with #65 on the RPC assert fix (both bump to 97). #65 is already merged, so there may be a trivial merge conflict on that line. The rest of the changes (CPU ops, CUDA/HIP, test harness) are independent and clean. Nice work on the ASan catch and the turbo type skip logic. Thanks!
Merged commit 45f8a06 into TheTom:feature/turboquant-kv-cache
22 of 47 checks passed
Picks the CI back up to green on top of `feature/turboquant-kv-cache` after the recent turbo merges. Every fix is the minimum needed to unblock the corresponding job.

Changes
1. `ggml-cpu/ops.cpp`: `ggml_compute_forward_clamp`

The abort branch is guarded by `-Werror=switch` but is missing the three TurboQuant KV-cache types, which breaks every GCC build after the turbo types were introduced. Added `TURBO2_0` / `TURBO3_0` / `TURBO4_0` alongside the existing `TQ*` cases so they fall through to `GGML_ABORT`.

2.
`ggml/include/ggml-rpc.h`: static_assert

`GGML_OP_TURBO_WHT` bumped `GGML_OP_COUNT` from 96 → 97 but the RPC `static_assert` wasn't updated. Bumped `RPC_PROTO_PATCH_VERSION` 1 → 2 and the assert to 97. This unblocks `ubuntu-latest-rpc`, `windows-latest-hip`, and every other RPC-linked job.

3.
`tests/test-quantize-fns.cpp`

Three related problems:
a. Heap-buffer-overflow (ASan) in `dot_product_error` / `total_quantization_error`

`std::vector<uint8_t> tmp_q(2*test_size)` is too small when `vec_dot_type == GGML_TYPE_F32` (the case for turbo quants), because `from_float` writes `test_size * sizeof(float)` bytes into it. ASan trace:

Fix: size the scratch buffers to `std::max(2*test_size, test_size*sizeof(float))`.

b. Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness
Their `dequantize_row_turbo*_0` intentionally stays in the WHT-rotated domain; the inverse WHT is applied separately via `GGML_OP_TURBO_WHT` in the attention graph, so a round-trip against the original float buffer is not meaningful. (This is documented right in the turbo4 dequant: "No inverse WHT, dequant stays in the rotated domain … The inverse WHT is applied to the attention output via GGML_OP_TURBO_WHT (direction=1) in the graph.") The existing threshold table can't describe that, so we skip these types with an explanatory message, the same pattern as deprecated types skipped via `blck_size == 0`.

c. TQ3_1S error budget
TQ3_1S is a 3-bit weight quant but currently pays the default 4-bit budgets. Added it to the 3-bit buckets (`MAX_QUANTIZATION_TOTAL_ERROR_3BITS` / `MAX_DOT_PRODUCT_ERROR_LOWBIT`), matching Q3_K / IQ3_S.

4.
`ggml-cuda/turbo-quant.cuh`

`cudaMemcpyToSymbol`, `cudaMemcpyFromSymbol`, and `cudaDeviceSynchronize` are wrapped in `CUDA_CHECK(...)` so the HIP build (where `hipError_t` is `[[nodiscard]]`) no longer trips `-Werror` on unused-result. Fourteen call sites touched.

Verification
Local, on Arch Linux (CPU build only):
`test-tokenizers-ggml-vocabs` is excluded only because `test-tokenizer-0` wasn't built in the partial target; CI builds it. `test-quantize-fns` was also run under AddressSanitizer (build-sanitize-style config) and is now clean. Before this PR it aborted with `malloc(): corrupted top size`.

Related
`RPC_PROTO_PATCH_VERSION` bump approach used in "Fix GGML_OP_COUNT assertion for RPC" #65 (cpburnz).