QVAC-19557 tts-cpp: stream chatterbox GGUF tensor data instead of staging the full file by ogad-tether · Pull Request #43 · tetherto/qvac-ext-lib-whisper.cpp

ogad-tether · 2026-06-10T17:12:47Z

Context

The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557 — the tts-chatterbox-* variants are currently skipped on iOS Device Farm). Part of that peak is transient: every chatterbox GGUF load used gguf_init_from_file(no_alloc=false), which materialises the entire tensor-data section in host memory before a single byte reaches the backend buffer:

- T3 load: +0.5–1.1 GB transient while the staging blob and the freshly allocated backend weight buffer coexist.
- voice_encoder_load / campplus_load / s3tokv2_load and the two mel-filterbank reads in main.cpp re-staged the whole T3 / S3Gen file again just to memcpy a few MB of F32 tensors out of it (the voice-encoder one runs right after the T3 weights are already resident).
- load_s3gen_gguf runs on the s3gen_preload background thread while T3 is resident, so its ~1 GB staging blob landed exactly on the process peak.

Change

New src/gguf_stream.h provides gguf_stream_reader: open the GGUF with no_alloc=true (metadata-only tensors), allocate the destination, then stream each tensor's payload from the file via fseek/fread — ggml_backend_tensor_set in 8 MiB chunks for backend weights (to_backend), or a single read into host vectors (to_host). Peak host overhead per load drops from sizeof(data section) to 8 MiB. Tensor size is validated against the destination before any byte is copied, so metadata/file drift fails loudly instead of corrupting weights.
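The chunked copy described above can be sketched roughly as follows. This is not the code in src/gguf_stream.h: `stream_payload` and `chunk_sink` are hypothetical names, and the sink callback merely stands in for a call such as ggml_backend_tensor_set(tensor, data, offset, size). The sketch assumes the payload offset and size have already been read from the GGUF metadata.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sink: receives one chunk plus its byte offset inside the destination.
using chunk_sink = std::function<void(const void * data, size_t dst_offset, size_t size)>;

// Streams `size` bytes starting at `file_offset` into `sink`, at most `chunk` bytes
// per read. `expected_size` is the destination's byte size; a mismatch throws
// before any byte is copied. Simplification: std::fseek takes a long, so a real
// loader for multi-GB files would need a 64-bit seek.
inline void stream_payload(std::FILE * f, long file_offset, size_t size,
                           size_t expected_size, const chunk_sink & sink,
                           size_t chunk = 8u * 1024u * 1024u) {
    assert(f != nullptr && chunk > 0);
    if (size != expected_size) {
        throw std::runtime_error("tensor size mismatch: file " + std::to_string(size) +
                                 " vs destination " + std::to_string(expected_size));
    }
    if (std::fseek(f, file_offset, SEEK_SET) != 0) {
        throw std::runtime_error("fseek failed");
    }
    // The scratch buffer is the only host-side staging: never larger than one chunk.
    std::vector<uint8_t> scratch(std::min(size, chunk));
    for (size_t done = 0; done < size; ) {
        const size_t n = std::min(chunk, size - done);
        if (std::fread(scratch.data(), 1, n, f) != n) {
            throw std::runtime_error("short read");
        }
        sink(scratch.data(), done, n);
        done += n;
    }
}
```

Peak host overhead here is bounded by the scratch vector, which is the property the PR relies on. The size check runs before the first read, so a mismatched destination is never partially written.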

Converted call sites: load_model_gguf (turbo), load_model_gguf_mtl, load_s3gen_gguf, voice_encoder_load, campplus_load, s3tokv2_load, and the s3gen/mel_fb/24k_80 + campplus/mel_fb_kaldi_80 single-tensor reads.

Left as-is (follow-ups): mel2wav.cpp (standalone demo tool, not in the SDK path) and supertonic_gguf.cpp (same staging pattern, separate scope).

Uses only gguf_* / ggml-backend registry-safe APIs — no direct ggml-cpu symbols, so the Android GGML_BACKEND_DL=ON constraint (qvac PR ggml-org#2502 revert) is unaffected.

Testing

- New test/test_gguf_stream.cpp (ctest label unit, no model fixture needed): writes a synthetic GGUF, including a >8 MiB tensor so the chunked copy crosses a chunk boundary, and asserts byte-exact parity between the streaming and legacy staging loads for both to_backend and to_host, plus loud failure on unknown names and size mismatches.
- ctest -L unit: 26/26 pass.
- Not run: a real chatterbox GGUF end-to-end load (no local fixtures on this machine; test-t3-caches-mtl / test-cpu-caches-{turbo,mtl} will exercise the new path wherever the GGUF fixtures are staged).

🤖 Generated with Claude Code

GustavoA1604 · 2026-06-10T20:35:46Z

+    void release_scratch() {
+        scratch_.clear();
+        scratch_.shrink_to_fit();
+    }


Seems like this function is dead code

Good catch — it was speculative API with no call site (every loader destroys the reader right after the copy loop, so the scratch never outlives the load). Removed in 054ca71.

GustavoA1604

Wrongly approved previously

Review feedback on #43: no call site ever needed to drop the 8 MiB chunk scratch before the reader goes out of scope (every loader destroys the reader as soon as the copy loop finishes), so the method was dead code. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Review follow-up on #43: quantising the T3 KV cache cuts its up-front allocation from f32's 4 B/elem to f16's 2 B or q8_0's ~1.06 B (one fp16 scale per 32 values). q8_0 is ~27% of f32, so the same memory budget buys ~3.7x the context length.

New EngineOptions::kv_cache_type ("f32" default / "f16" / "q8_0"), plumbed through load_model_gguf{,_mtl} into hparams.kv_type, plus a --kv-cache-type CLI flag. Unknown strings warn and fall back to f32 so a typo can't silently change numerics.

The enabling change is a KV slab layout swap from head-major [HD, n_ctx, n_heads] to token-major [HD*n_heads, n_ctx] (one ggml_row_size(kv_type, HD*n_heads) row per cached position, heads packed inside the row; llama.cpp's layout):

- the per-step append at position n_past becomes a CONTIGUOUS span, which is what a quantised dtype requires (ggml-cpu's dup→quantized path GGML_ABORTs on a non-contiguous dst);
- the append consumes the pre-permute K (rope output) / V (projection output) directly, dropping two per-layer ggml_cont(permute(...)) on the MTL path;
- flash_attn_ext reads the [HD, L, n_heads] slice with plain strides (pos stride = one token row, head stride = one HD-row inside it) and consumes f16/q8_0 K/V natively on CPU and Metal (kernel_flash_attn_ext_q8_0_dk64_dv64 matches head_dim=64);
- all offsets land on whole HD=64-element rows = two q8_0 blocks, so quantised views stay block-aligned.

The MTL B=2 batched write splits into one cpy per batch half (a single ne[3]=2 view would have a batch gap and stop being contiguous).

Validated on real GGUFs (turbo Q4_0 + mtl Q4_0, CPU + Metal on M2):

- f32 old-vs-new BYTE-IDENTICAL greedy token sequences on turbo CPU, mtl CPU, and mtl Metal (B=2): the layout swap changes nothing at the default dtype;
- turbo greedy: f32 == f16 == q8_0, identical across CPU and Metal (90/90 tokens, zero argmax flips);
- mtl greedy: f16/q8_0 each diverge from f32 by a single near-tie argmax flip (CFG's cond-uncond mixing amplifies quantisation epsilon); whisper-cli transcribes the q8_0 output to the exact input text on both variants;
- Metal T3 decode gets FASTER from the bandwidth saving: 1139 ms (f32) -> 832 ms (f16) / 935 ms (q8_0) on turbo;
- ctest -L unit 26/26; test-t3-caches + test-cpu-caches fixture runs green against the downloaded GGUFs.

The tts-ggml addon knob (kvCacheType + q8_0 default for chatterbox) follows in qvac once the registry port picks this up, the same flow as the supertonic EngineOptions additions in 0.2.1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ogad-tether · 2026-06-12T09:31:59Z

Pushed 3939db19 — adds the suggested quantised KV cache (EngineOptions::kv_cache_type = f32 default / f16 / q8_0, plus a --kv-cache-type CLI flag). q8_0 stores the cache at ~27% of f32, so the same memory budget buys ~3.7× the context length.
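The ~27% and ~3.7× figures follow from the q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block). A minimal sketch of that arithmetic, using a hypothetical `kv_dtype` enum and helper names rather than ggml's own types:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enum for illustration; the real code uses ggml_type values.
enum class kv_dtype { f32, f16, q8_0 };

// Bytes needed to store n_elems cache values in the given dtype.
// q8_0 packs 32 int8 values plus one 2-byte fp16 scale into a 34-byte block,
// so n_elems must be a multiple of 32.
inline size_t kv_bytes(kv_dtype t, size_t n_elems) {
    switch (t) {
        case kv_dtype::f32: return n_elems * 4;
        case kv_dtype::f16: return n_elems * 2;
        case kv_dtype::q8_0:
            assert(n_elems % 32 == 0);
            return (n_elems / 32) * 34;
    }
    return 0;
}

// How many times more cached positions fit in the same budget versus f32.
inline double context_gain_vs_f32(kv_dtype t) {
    const size_t n = 32; // one q8_0 block is enough to get the ratio
    return static_cast<double>(kv_bytes(kv_dtype::f32, n)) /
           static_cast<double>(kv_bytes(t, n));
}
```

Per element, q8_0 costs 34/32 = 1.0625 bytes, which is 26.6% of f32's 4 bytes, and 4 / 1.0625 ≈ 3.76, consistent with the rounded figures quoted above.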

Why it's in this PR: the enabling change is a KV slab layout swap (head-major → token-major, llama.cpp-style) so the per-step append is a contiguous span — ggml-cpu's dup→quantized path hard-aborts on non-contiguous destinations. Same QVAC-19557 memory arc as the streaming loads. Bonus: the MTL path drops two per-layer ggml_cont(permute(...)) since the append now consumes the pre-permute K/V directly.
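To illustrate why the layout swap makes the append contiguous, the sketch below computes flat element indices for both layouts and counts how many separate runs one appended position touches. The struct and function names are hypothetical and the indices are in elements, not bytes; this is not the PR's graph code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor: head_dim, number of heads, context length.
struct kv_shape { size_t HD; size_t n_heads; size_t n_ctx; };

// Head-major [HD, n_ctx, n_heads]: each head owns a separate n_ctx-long slab.
inline size_t head_major_index(const kv_shape & s, size_t d, size_t pos, size_t h) {
    assert(d < s.HD && pos < s.n_ctx && h < s.n_heads);
    return d + s.HD * (pos + s.n_ctx * h);
}

// Token-major [HD*n_heads, n_ctx]: one row per position, heads packed inside it.
inline size_t token_major_index(const kv_shape & s, size_t d, size_t pos, size_t h) {
    assert(d < s.HD && pos < s.n_ctx && h < s.n_heads);
    return d + s.HD * (h + s.n_heads * pos);
}

using index_fn = size_t (*)(const kv_shape &, size_t, size_t, size_t);

// Counts the contiguous runs written when position `pos` is appended for all heads.
inline size_t append_runs(const kv_shape & s, size_t pos, index_fn index_of) {
    std::vector<size_t> idx;
    idx.reserve(s.HD * s.n_heads);
    for (size_t h = 0; h < s.n_heads; ++h) {
        for (size_t d = 0; d < s.HD; ++d) {
            idx.push_back(index_of(s, d, pos, h));
        }
    }
    if (idx.empty()) return 0;
    std::sort(idx.begin(), idx.end());
    size_t runs = 1;
    for (size_t i = 1; i < idx.size(); ++i) {
        if (idx[i] != idx[i - 1] + 1) ++runs;
    }
    return runs;
}
```

With HD=64, 4 heads and n_ctx=16, appending one position touches four separate 64-element runs in the head-major layout but a single 256-element run in the token-major layout, which is the contiguity the quantised copy needs.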

Validation on real GGUFs (turbo + mtl, CPU + Metal, M2):

- f32 old-vs-new: byte-identical greedy token sequences on all three paths (turbo CPU, mtl CPU, mtl Metal B=2); the layout swap is a pure refactor at the default dtype.
- turbo greedy: f32 == f16 == q8_0 identical (90/90 tokens, zero argmax flips), CPU and Metal.
- mtl greedy: f16/q8_0 diverge from f32 by a single near-tie argmax flip (CFG mixing amplifies quantisation epsilon; the sequence then forks as any take does); whisper-cli transcribes the q8_0 audio to the exact input text on both variants.
- Metal T3 decode gets faster from the bandwidth saving: 1139 → 832 ms (f16) / 935 ms (q8_0).
- ctest -L unit 26/26; test-t3-caches / test-cpu-caches fixture runs green.

Default stays f32 (bit-exact). The tts-ggml addon knob (kvCacheType, with q8_0 + a larger default ctx for chatterbox) follows in tetherto/qvac once the registry port picks this revision up — same flow as the supertonic EngineOptions additions in 0.2.1.

… unit test

Addresses two review comments on #43:

1. (main.cpp:326) "validating the string isn't enough — nothing checks the active backend's ggml_flash_attn_ext supports the requested quantized/f16 K/V; on an unsupported backend this aborts at graph compute rather than degrading gracefully."

   New chatterbox_resolve_kv_type(backend, requested, head_dim, n_head, n_kv_head): after the backend is initialised, build a throwaway no_alloc flash_attn_ext node shaped like the real T3 attention (Q=F32 [HD,1,n_head], K/V=requested [HD,8,n_kv_head], null mask = the N=1 step path) and ask ggml_backend_supports_op. Unsupported -> fall back to GGML_TYPE_F32 with a stderr warning instead of asserting deep in ggml. Wired into load_model_gguf (turbo) and load_model_gguf_mtl, replacing the raw `hp.kv_type = kv_type`. F32 short-circuits (always supported); null backend -> F32.

   Caveat documented in-code: a backend that ADVERTISES support via supports_op but faults at compute is not caught here. ggml-vulkan reports q8_0 K/V FA as supported on both scalar and coopmat2 paths, so this is the guard for honest backends / future ports, not a substitute for an upstream ggml-vulkan fix.

2. (main.cpp:447) "add tests that exercise the quantized KV cache paths."

   New fixture-free test-kv-cache-type (ctest label "unit"): covers chatterbox_kv_type_from_str (incl. unknown -> f32 guard) and chatterbox_resolve_kv_type against a CPU backend (retains f16/q8_0, F32 + null-backend short-circuits).

Verified: ctest -L unit 27/27; CPU greedy tokens still byte-identical f32 vs q8_0 (probe doesn't perturb numerics); q8_0 retained on CPU and on Vulkan/MoltenVK (scalar FA path). Load logs show KV=408 MB (q8_0) vs ~1.6 GB (f32) at the turbo GGUF's native n_ctx.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
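The two-stage resolution described in this commit (parse the string, then probe the backend) can be sketched without ggml as below. All names are hypothetical stand-ins: `cache_type` replaces ggml_type, and the `supports` predicate stands in for building the throwaway flash_attn_ext node and calling ggml_backend_supports_op on it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical enum for illustration only.
enum class cache_type { f32, f16, q8_0 };

// Unknown strings warn and fall back to f32, so a typo cannot change numerics silently.
inline cache_type cache_type_from_str(const std::string & s) {
    if (s == "f32")  return cache_type::f32;
    if (s == "f16")  return cache_type::f16;
    if (s == "q8_0") return cache_type::q8_0;
    std::fprintf(stderr, "warning: unknown KV cache type '%s', using f32\n", s.c_str());
    return cache_type::f32;
}

// `supports` is an assumed predicate answering "can this backend run flash
// attention with K/V of this type?". f32 short-circuits; no backend means f32.
template <typename SupportsFn>
cache_type resolve_cache_type(cache_type requested, bool have_backend, SupportsFn supports) {
    if (requested == cache_type::f32 || !have_backend) {
        return cache_type::f32;
    }
    if (supports(requested)) {
        return requested;
    }
    std::fprintf(stderr, "warning: backend lacks support for requested KV type, using f32\n");
    return cache_type::f32;
}
```

As the caveat in the commit notes, a probe of this kind only helps when the backend's answer is truthful; a backend that reports support and then faults at compute passes straight through it.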

…ging the full file

Every chatterbox GGUF load used gguf_init_from_file(no_alloc=false), which materialises the ENTIRE tensor-data section in host memory before a single byte reaches the backend buffer. On iOS that staging blob is what pushed the QVAC SDK test process to a ~3.1 GB peak footprint (task_vm_info.physFootprint) and into jetsam:

- T3 load: +0.5-1.1 GB transient (staging + backend buffer coexist)
- voice_encoder_load / campplus_load / s3tokv2_load and the two mel-fb reads in main.cpp re-staged the whole T3 / S3Gen file again just to memcpy a few MB of F32 tensors out of it
- load_s3gen_gguf runs on the s3gen_preload background thread while the T3 weights are already resident, so its ~1 GB staging blob landed exactly on the process peak

New src/gguf_stream.h provides gguf_stream_reader: open the GGUF with no_alloc=true (metadata-only tensors), allocate the destination, then stream each tensor's payload from the file via fseek/fread, using ggml_backend_tensor_set in 8 MiB chunks for backend weights (to_backend) or a single read into host vectors (to_host). Peak host overhead per load drops from sizeof(data section) to 8 MiB. Tensor size is validated against the destination before any byte is copied, so metadata/file drift fails loudly instead of corrupting weights.

Converted call sites: load_model_gguf (turbo, main.cpp), load_model_gguf_mtl (t3_mtl.cpp), load_s3gen_gguf (chatterbox_tts.cpp), voice_encoder_load, campplus_load, s3tokv2_load, and the s3gen/mel_fb/24k_80 + campplus/mel_fb_kaldi_80 single-tensor reads.

Left as-is: mel2wav.cpp (standalone demo tool, not in the SDK path) and supertonic_gguf.cpp (same pattern, separate follow-up).

test/test_gguf_stream.cpp (ctest label "unit", no model fixture needed) writes a synthetic GGUF, including a >8 MiB tensor so the chunked copy crosses a chunk boundary, and asserts byte-exact parity between the streaming and legacy staging loads for both to_backend and to_host, plus loud failure on unknown names and size mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


…at2 FA fault)

ggml-vulkan's supports_op advertises quantized K/V flash-attention as supported, but the NV_coopmat2 kernel faults at compute on a q8_0 K/V cache. Toggle-confirmed in the downstream GPU CI A/B (same NVIDIA RTX 4000 coopmat2 runners, ubuntu-22.04 + ubuntu-24.04):

    q8_0 KV default -> SIGSEGV (139) on both
    f32 KV default  -> pass on both

Only the chatterbox default KV dtype differed; rules out the token-major KV layout and the pre-existing chatterbox-Vulkan graph. MoltenVK (scalar FA, no coopmat) runs q8_0 fine and byte-identical to f32, so it's specific to the coopmat2 dequant-in-shader path.

The load-time capability probe (chatterbox_resolve_kv_type) can't catch this, because supports_op returns true, so add a targeted guard: quantized K/V on a Vulkan backend falls back to f32 with a stderr warning. f16 (the native FA input type, not dequantized in-shader) is left intact; Metal / CPU keep quantized K/V (validated byte-identical greedy decode).

Net: the tts-ggml addon's q8_0 chatterbox default transparently downgrades to f32 on Vulkan (Linux/Windows/Android) at load, with no addon change, while iOS/Metal and CPU keep the q8_0 memory win. Re-enabling q8_0 on Vulkan is a one-line revert once the upstream coopmat2 FA kernel handles quantized K/V.

Verified on MoltenVK: q8_0 -> warns + f32 KV (1536 MB), f16 -> 768 MB, f32 -> 1536 MB; CPU q8_0 unaffected (408 MB). ctest -L unit 27/27.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
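A minimal sketch of the targeted guard, under stated assumptions: how the PR actually identifies a Vulkan backend is not shown in this thread, so the backend-name prefix check below is purely illustrative, and `kv_kind` and `apply_vulkan_guard` are hypothetical names.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical enum for illustration only.
enum class kv_kind { f32, f16, q8_0 };

inline bool is_quantized(kv_kind t) {
    return t == kv_kind::q8_0;
}

// Assumption: the backend is recognised by a name starting with "Vulkan".
// Quantized K/V falls back to f32 on Vulkan; f16 and all other backends pass through.
inline kv_kind apply_vulkan_guard(kv_kind resolved, const std::string & backend_name) {
    const bool is_vulkan = backend_name.rfind("Vulkan", 0) == 0; // prefix match
    if (is_vulkan && is_quantized(resolved)) {
        std::fprintf(stderr, "warning: quantized KV cache disabled on Vulkan, using f32\n");
        return kv_kind::f32;
    }
    return resolved;
}
```

The guard runs after the capability probe, so it only overrides the one combination the probe cannot detect.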

ogad-tether · 2026-06-15T21:55:54Z

Rebased onto current master (52d09d0, post-#42 supertonic3 + #47 QVAC-20484 S3Gen streaming).

Dropped the Android symbol-fix commit (8b012789) — that change (swapping the 3 raw ggml_backend_is_cpu calls in supertonic_gguf.cpp for tts_cpp::detail::backend_is_cpu) is the same edit as f7d4d6c (QVAC-19254), which already ships via the qvac packages/tts-ggml overlay and is the canonical owner of that fix. Keeping it here would have duplicated/collided. This PR is now chatterbox memory work only:

adbe6ac6 stream GGUF tensor data instead of staging the full file
49293b99 drop unused gguf_stream_reader::release_scratch
026cfe24 selectable KV-cache dtype (f32|f16|q8_0), token-major slab
e562109b KV-dtype capability probe + F32 fallback + unit test
05770ccc force f32 KV on Vulkan for quantized cache (coopmat2 FA fault)

All cherry-picks auto-merged cleanly over #47's segmentation changes — no conflicts. Verified locally on Metal: full libtts-cpp.a compiles, and both chatterbox unit tests (test-gguf-stream, test-kv-cache-type) pass. supertonic_gguf.cpp is intentionally left untouched (still f7d4d6c's domain).

Landing still gated on the QVAC-19254 / registry-publish ownership being sorted so we point at a single tts-cpp revision rather than forking the registry.

ogad-tether requested review from a team as code owners June 10, 2026 17:12

ogad-tether mentioned this pull request Jun 10, 2026

QVAC-19557 tts-ggml: cap Chatterbox T3 context (KV cache) at 2048 by default tetherto/qvac#2527

Open

GustavoA1604 requested changes Jun 10, 2026


GustavoA1604 approved these changes Jun 10, 2026


GustavoA1604 requested changes Jun 10, 2026


ogad-tether requested a review from GustavoA1604 June 11, 2026 23:20

ogad-tether mentioned this pull request Jun 12, 2026

tts-cpp: publish 2026-06-12 — QVAC-19557 chatterbox memory (PR #43) + Android-safe symbols tetherto/qvac-registry-vcpkg#188

Draft

GustavoA1604 requested changes Jun 12, 2026


Comment thread tts-cpp/src/main.cpp

Comment thread tts-cpp/src/main.cpp

ogad-tether requested a review from GustavoA1604 June 15, 2026 12:11

ogad-tether self-assigned this Jun 15, 2026

GustavoA1604 previously approved these changes Jun 15, 2026


ogad-tether and others added 5 commits June 15, 2026 22:52

ogad-tether dismissed GustavoA1604’s stale review via 05770cc June 15, 2026 21:55

ogad-tether force-pushed the feat/chatterbox-load-streaming branch from c8620cf to 05770cc Compare June 15, 2026 21:55

ogad-tether wants to merge 5 commits into master from feat/chatterbox-load-streaming
