From cda9fffedd76154077c7d3d725e31f7505eb1557 Mon Sep 17 00:00:00 2001
From: marksverdhei
Date: Thu, 4 Jun 2026 23:21:07 +0200
Subject: [PATCH] chore(args): mark tbq3_0 / tbq4_0 KV-cache as experimental (closes #70)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512:

| cache-type | PPL     | vs f16    |
|------------|---------|-----------|
| f16        | 19.08   | baseline  |
| q8_0       | 19.08   | lossless  |
| tbq3_0     | 1252.30 | 65x worse |
| tbq4_0     | 1393.00 | 73x worse |

TBQ KV-cache produces near-random output. The likely root cause is
statistical: TBQ's rotated-domain codebook was calibrated for weight
distributions, not the K/V tensor distributions seen during inference.
If so, the encoding scheme itself cannot faithfully represent KV values.

Snoop-kube's cluster audit confirms zero deployments use tbq* KV-cache
(every host uses q8_0 or q4_0). DFlash also defaults to q8_0 (PR #65).
No production consumer exists.

This PR adds a one-line experimental note to the --cache-type-k/v and
--cache-type-k-draft/v-draft help text, referencing issue #70 for the
full data and recommendation. The code path stays in place, since Markus
may have roadmap intent I'm not aware of; this just stops anyone reading
--help from assuming tbq* is a usable choice without checking.

Follow-ups if Markus prefers full removal:
* drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list
* keep the ftypes (TBQ weight quantization is separate from KV use)
* close issues #124 + #125 as wont-fix
---
 common/arg.cpp | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/common/arg.cpp b/common/arg.cpp
index 04ddf32fde67..e3efa6562dbe 100644
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -2049,7 +2049,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for K\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.cache_type_k)
         ),
@@ -2062,7 +2063,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for V\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.cache_type_v)
         ),
@@ -3525,7 +3527,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for K for the draft model\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.speculative.draft.cache_type_k)
         ),
@@ -3538,7 +3541,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for V for the draft model\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.speculative.draft.cache_type_v)
         ),