From cda9fffedd76154077c7d3d725e31f7505eb1557 Mon Sep 17 00:00:00 2001
From: marksverdhei <marksverdhei@hotmail.com>
Date: Thu, 4 Jun 2026 23:21:07 +0200
Subject: [PATCH] chore(args): mark tbq3_0 / tbq4_0 KV-cache as experimental
 (closes #70)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512:

| cache-type | PPL     | vs f16    |
|------------|---------|-----------|
| f16        | 19.08   | baseline  |
| q8_0       | 19.08   | unchanged |
| tbq3_0     | 1252.30 | 65x worse |
| tbq4_0     | 1393.00 | 73x worse |
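
The last column is just each PPL divided by the f16 baseline. A minimal
C++ sketch of that arithmetic, using only the numbers in the table;
`ppl_ratio` and `print_ratios` are illustrative names and not part of
llama.cpp:

```cpp
#include <cstdio>

// Ratio of a cache type's perplexity to the baseline (higher is worse).
double ppl_ratio(double ppl, double baseline) {
    return ppl / baseline;
}

// Prints the "vs f16" ratio for each non-baseline row of the table.
void print_ratios() {
    const double f16_ppl = 19.08; // baseline PPL from the table
    struct row { const char * name; double ppl; };
    const row rows[] = {
        {"q8_0",   19.08},
        {"tbq3_0", 1252.30},
        {"tbq4_0", 1393.00},
    };
    for (const row & r : rows) {
        std::printf("%-7s %.1fx\n", r.name, ppl_ratio(r.ppl, f16_ppl));
    }
}
```

Calling `print_ratios()` from a `main` reproduces the column; the tbq3_0
ratio lands between 65x and 66x, so the table's "65x" is truncated
rather than rounded.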

A tbq* KV cache produces near-random output. The likely root cause is
statistical: TBQ's rotated-domain codebook was calibrated for weight
distributions, not for the K/V tensor distributions seen during
inference, so the encoding probably cannot represent KV values
faithfully.

Snoop-kube's cluster audit confirms that zero deployments use a tbq*
KV cache (every host uses q8_0 or q4_0). DFlash also defaults to q8_0
(PR #65). No known production consumer exists.

This PR adds a one-line experimental note to the --cache-type-k/v and
--cache-type-k-draft/v-draft help text, referencing issue #70 for the
full data and recommendation. The code path stays in place: Markus may
have roadmap intent I'm not aware of. The note just stops anyone
reading --help from assuming tbq* is a usable choice without checking.

Follow-ups if Markus prefers full removal:
* drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list
* keep the ftypes (TBQ weight quantization is separate from KV use)
* close issues #124 and #125 as wont-fix
---
 common/arg.cpp | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/common/arg.cpp b/common/arg.cpp
index 04ddf32fde67..e3efa6562dbe 100644
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -2049,7 +2049,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for K\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.cache_type_k)
         ),
@@ -2062,7 +2063,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for V\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.cache_type_v)
         ),
@@ -3525,7 +3527,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for K for the draft model\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.speculative.draft.cache_type_k)
         ),
@@ -3538,7 +3541,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         string_format(
             "KV cache data type for V for the draft model\n"
             "allowed values: %s\n"
-            "(default: %s)",
+            "(default: %s)\n"
+            "note: tbq3_0 / tbq4_0 are experimental — measured ~65-73x worse perplexity vs q8_0 on Qwen3.5-0.8B (issue #70)",
             get_all_kv_cache_types().c_str(),
             ggml_type_name(params.speculative.draft.cache_type_v)
         ),