QVAC-19557 tts-cpp: stream chatterbox GGUF tensor data instead of staging the full file #43

Open
ogad-tether wants to merge 5 commits into master from feat/chatterbox-load-streaming

@ogad-tether

Context

The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557 — the tts-chatterbox-* variants are currently skipped on iOS Device Farm). Part of that peak is transient: every chatterbox GGUF load used gguf_init_from_file(no_alloc=false), which materialises the entire tensor-data section in host memory before a single byte reaches the backend buffer:

  • T3 load: +0.5–1.1 GB transient while the staging blob and the freshly-allocated backend weight buffer coexist.
  • voice_encoder_load / campplus_load / s3tokv2_load and the two mel-filterbank reads in main.cpp re-staged the whole T3 / S3Gen file again just to memcpy a few MB of F32 tensors out of it (the voice-encoder one runs right after the T3 weights are already resident).
  • load_s3gen_gguf runs on the s3gen_preload background thread while T3 is resident, so its ~1 GB staging blob landed exactly on the process peak.

Change

New src/gguf_stream.h provides gguf_stream_reader: open the GGUF with no_alloc=true (metadata-only tensors), allocate the destination, then stream each tensor's payload from the file via fseek/fread → ggml_backend_tensor_set in 8 MiB chunks for backend weights (to_backend), or a single read into host vectors (to_host). Peak host overhead per load drops from sizeof(data section) to 8 MiB. Tensor size is validated against the destination before any byte is copied, so metadata/file drift fails loudly instead of corrupting weights.
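
The chunked copy described above can be sketched as follows. This is a minimal illustration, not the code in src/gguf_stream.h: `stream_payload` and `chunk_sink` are hypothetical names, and the sink callback stands in for the role `ggml_backend_tensor_set` plays in the real loader. The size check runs before any byte is read, and the scratch buffer never exceeds one chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical sink: receives (chunk pointer, destination offset, chunk size).
// In the real loader this role is played by ggml_backend_tensor_set.
using chunk_sink = std::function<void(const void * data, std::size_t offset, std::size_t size)>;

// Stream `size` bytes starting at `file_offset` into `sink`, holding at most
// `chunk_size` bytes of the payload in host memory at any time.
// Note: std::fseek takes a long, so a production reader would need a 64-bit
// seek on platforms where long is 32 bits.
inline void stream_payload(std::FILE * f, long file_offset, std::size_t size,
                           std::size_t dst_size, const chunk_sink & sink,
                           std::size_t chunk_size = 8u * 1024u * 1024u) {
    if (chunk_size == 0) {
        throw std::invalid_argument("chunk_size must be non-zero");
    }
    // Validate against the destination before copying anything.
    if (size != dst_size) {
        throw std::runtime_error("tensor size does not match destination");
    }
    if (std::fseek(f, file_offset, SEEK_SET) != 0) {
        throw std::runtime_error("fseek failed");
    }
    std::vector<std::uint8_t> scratch(std::min(size, chunk_size));
    std::size_t done = 0;
    while (done < size) {
        const std::size_t n = std::min(chunk_size, size - done);
        if (std::fread(scratch.data(), 1, n, f) != n) {
            throw std::runtime_error("short read while streaming tensor data");
        }
        sink(scratch.data(), done, n);  // hand one chunk to the destination
        done += n;
    }
}
```

With the default chunk size, a tensor larger than 8 MiB reaches the sink in several calls, which is the chunk-boundary case the synthetic test fixture is built to cover.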

Converted call sites: load_model_gguf (turbo), load_model_gguf_mtl, load_s3gen_gguf, voice_encoder_load, campplus_load, s3tokv2_load, and the s3gen/mel_fb/24k_80 + campplus/mel_fb_kaldi_80 single-tensor reads.

Left as-is (follow-ups): mel2wav.cpp (standalone demo tool, not in the SDK path) and supertonic_gguf.cpp (same staging pattern, separate scope).

Uses only gguf_* / ggml-backend registry-safe APIs — no direct ggml-cpu symbols, so the Android GGML_BACKEND_DL=ON constraint (qvac PR ggml-org#2502 revert) is unaffected.

Testing

  • New test/test_gguf_stream.cpp (ctest label unit, no model fixture needed): writes a synthetic GGUF — including a >8 MiB tensor so the chunked copy crosses a chunk boundary — and asserts byte-exact parity between the streaming and legacy staging loads for both to_backend and to_host, plus loud failure on unknown names and size mismatches.
  • ctest -L unit: 26/26 pass.
  • Not run: a real chatterbox GGUF end-to-end load (no local fixtures on this machine — test-t3-caches-mtl / test-cpu-caches-{turbo,mtl} will exercise the new path wherever the GGUF fixtures are staged).

🤖 Generated with Claude Code

Comment thread tts-cpp/src/gguf_stream.h Outdated
void release_scratch() {
scratch_.clear();
scratch_.shrink_to_fit();
}

Seems like this function is dead code

Author

Good catch — it was speculative API with no call site (every loader destroys the reader right after the copy loop, so the scratch never outlives the load). Removed in 054ca71.

Comment thread tts-cpp/src/chatterbox_tts.cpp

@GustavoA1604 GustavoA1604 left a comment

Wrongly approved previously

ogad-tether added a commit that referenced this pull request Jun 11, 2026
Review feedback on #43: no call site ever needed to drop the 8 MiB
chunk scratch before the reader goes out of scope (every loader
destroys the reader as soon as the copy loop finishes), so the method
was dead code.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ogad-tether ogad-tether requested a review from GustavoA1604 June 11, 2026 23:20
ogad-tether added a commit that referenced this pull request Jun 12, 2026
Review follow-up on #43: quantising the T3 KV cache cuts its
up-front allocation from f32's 4 B/elem to f16's 2 B or q8_0's
~1.06 B (one fp16 scale per 32 values) — q8_0 is ~27% of f32, so the
same memory budget buys ~3.7x the context length.

New EngineOptions::kv_cache_type ("f32" default / "f16" / "q8_0"),
plumbed through load_model_gguf{,_mtl} into hparams.kv_type, plus a
--kv-cache-type CLI flag.  Unknown strings warn and fall back to f32
so a typo can't silently change numerics.
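
The warn-and-fall-back behaviour for unknown strings could look roughly like this. It is a sketch only: `kv_cache_type` and `parse_kv_cache_type` are hypothetical names chosen for illustration, and the real mapping targets a ggml type rather than this local enum.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical local enum; the real code maps onto ggml tensor types.
enum class kv_cache_type { f32, f16, q8_0 };

// Map the option string to a cache dtype. Anything unrecognised warns on
// stderr and returns f32, so a typo keeps the default numerics.
inline kv_cache_type parse_kv_cache_type(const std::string & s) {
    if (s == "f32")  return kv_cache_type::f32;
    if (s == "f16")  return kv_cache_type::f16;
    if (s == "q8_0") return kv_cache_type::q8_0;
    std::fprintf(stderr, "warning: unknown KV cache type '%s', using f32\n", s.c_str());
    return kv_cache_type::f32;
}
```

Matching is exact in this sketch, so a string such as "Q8" takes the fallback path instead of being guessed at.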

The enabling change is a KV slab layout swap from head-major
[HD, n_ctx, n_heads] to token-major [HD*n_heads, n_ctx] (one
ggml_row_size(kv_type, HD*n_heads) row per cached position, heads
packed inside the row — llama.cpp's layout):

- the per-step append at position n_past becomes a CONTIGUOUS span,
  which is what a quantised dtype requires — ggml-cpu's
  dup→quantized path GGML_ABORTs on a non-contiguous dst;
- the append consumes the pre-permute K (rope output) / V
  (projection output) directly, dropping two per-layer
  ggml_cont(permute(...)) on the MTL path;
- flash_attn_ext reads the [HD, L, n_heads] slice with plain strides
  (pos stride = one token row, head stride = one HD-row inside it)
  and consumes f16/q8_0 K/V natively on CPU and Metal
  (kernel_flash_attn_ext_q8_0_dk64_dv64 matches head_dim=64);
- all offsets land on whole HD=64-element rows = two q8_0 blocks, so
  quantised views stay block-aligned.
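
The index arithmetic behind the two layouts can be checked with a small sketch. The helper names and the n_heads / n_ctx values used in any example call are hypothetical; only HD=64 and the 34-byte q8_0 block (32 int8 values plus one fp16 scale) come from the text above. Indices follow the ggml convention that the first dimension varies fastest.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative dimensions; callers choose the values.
struct kv_dims {
    std::size_t head_dim;
    std::size_t n_heads;
    std::size_t n_ctx;
};

// Old head-major slab [HD, n_ctx, n_heads].
inline std::size_t head_major_index(const kv_dims & s, std::size_t d,
                                    std::size_t pos, std::size_t head) {
    return d + s.head_dim * (pos + s.n_ctx * head);
}

// New token-major slab [HD*n_heads, n_ctx]: heads packed inside one row per position.
inline std::size_t token_major_index(const kv_dims & s, std::size_t d,
                                     std::size_t pos, std::size_t head) {
    return d + s.head_dim * (head + s.n_heads * pos);
}

using index_fn = std::size_t (*)(const kv_dims &, std::size_t, std::size_t, std::size_t);

// True when all elements appended for one position form a single gap-free span.
inline bool append_is_contiguous(const kv_dims & s, std::size_t pos, index_fn index) {
    std::size_t expected = index(s, 0, pos, 0);
    for (std::size_t head = 0; head < s.n_heads; ++head) {
        for (std::size_t d = 0; d < s.head_dim; ++d) {
            if (index(s, d, pos, head) != expected) return false;
            ++expected;
        }
    }
    return true;
}

// Byte offset of an element index inside a q8_0 buffer (34 bytes per 32 values).
inline std::size_t q8_0_byte_offset(std::size_t elem_index) {
    assert(elem_index % 32 == 0 && "quantised views must start on a block boundary");
    return elem_index / 32 * 34;
}
```

For any n_heads greater than one, the token-major append at a given position passes the contiguity check while the head-major one does not. Because every head starts on a multiple of HD=64 elements, each head offset is a whole number of q8_0 blocks.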

The MTL B=2 batched write splits into one cpy per batch half (a
single ne[3]=2 view would have a batch gap and stop being
contiguous).

Validated on real GGUFs (turbo Q4_0 + mtl Q4_0, CPU + Metal on M2):

- f32 old-vs-new BYTE-IDENTICAL greedy token sequences on turbo CPU,
  mtl CPU, and mtl Metal (B=2) — the layout swap changes nothing at
  the default dtype;
- turbo greedy: f32 == f16 == q8_0, identical across CPU and Metal
  (90/90 tokens, zero argmax flips);
- mtl greedy: f16/q8_0 each diverge from f32 by a single near-tie
  argmax flip (CFG's cond-uncond mixing amplifies quantisation
  epsilon); whisper-cli transcribes the q8_0 output to the exact
  input text on both variants;
- Metal T3 decode gets FASTER from the bandwidth saving: 1139 ms
  (f32) -> 832 ms (f16) / 935 ms (q8_0) on turbo;
- ctest -L unit 26/26; test-t3-caches + test-cpu-caches fixture runs
  green against the downloaded GGUFs.

The tts-ggml addon knob (kvCacheType + q8_0 default for chatterbox)
follows in qvac once the registry port picks this up — same flow as
the supertonic EngineOptions additions in 0.2.1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ogad-tether

Author

Pushed 3939db19 — adds the suggested quantised KV cache (EngineOptions::kv_cache_type = f32 default / f16 / q8_0, plus a --kv-cache-type CLI flag). q8_0 stores the cache at ~27% of f32, so the same memory budget buys ~3.7× the context length.
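
The ~27% and ~3.7× figures follow from the q8_0 block layout given in the commit message (one fp16 scale per 32 int8 values, so 34 bytes per 32 elements). A short sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>

// Block layout from the commit message: 32 int8 values plus one fp16 scale.
constexpr std::size_t q8_0_block_values = 32;
constexpr std::size_t q8_0_block_bytes  = q8_0_block_values * 1 + 2;  // 34 bytes

// Bytes needed to cache n_elems K or V elements at each dtype.
constexpr std::size_t kv_bytes_f32(std::size_t n_elems) { return n_elems * 4; }
constexpr std::size_t kv_bytes_f16(std::size_t n_elems) { return n_elems * 2; }
constexpr std::size_t kv_bytes_q8_0(std::size_t n_elems) {
    // Assumes n_elems is a multiple of 32, i.e. whole blocks only.
    return n_elems / q8_0_block_values * q8_0_block_bytes;
}

// 34 / 128 = 0.265625, i.e. roughly 27% of f32.
constexpr double q8_0_fraction_of_f32 =
    static_cast<double>(q8_0_block_bytes) / (4.0 * q8_0_block_values);

// 128 / 34, about 3.76: the context-length multiplier at a fixed memory budget.
constexpr double q8_0_context_gain = 1.0 / q8_0_fraction_of_f32;
```

Applied to the 1536 MB f32 cache size reported in the load logs later in this thread, the same ratio gives 768 MB for f16 and 408 MB for q8_0, matching the logged values.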

Why it's in this PR: the enabling change is a KV slab layout swap (head-major → token-major, llama.cpp-style) so the per-step append is a contiguous span — ggml-cpu's dup→quantized path hard-aborts on non-contiguous destinations. Same QVAC-19557 memory arc as the streaming loads. Bonus: the MTL path drops two per-layer ggml_cont(permute(...)) since the append now consumes the pre-permute K/V directly.

Validation on real GGUFs (turbo + mtl, CPU + Metal, M2):

  • f32 old-vs-new: byte-identical greedy token sequences on all three paths (turbo CPU, mtl CPU, mtl Metal B=2) — the layout swap is a pure refactor at the default dtype.
  • turbo greedy: f32 == f16 == q8_0 identical (90/90 tokens, zero argmax flips), CPU and Metal.
  • mtl greedy: f16/q8_0 diverge from f32 by a single near-tie argmax flip (CFG mixing amplifies quantisation epsilon — the sequence then forks as any take does); whisper-cli transcribes the q8_0 audio to the exact input text on both variants.
  • Metal T3 decode gets faster from the bandwidth saving: 1139 → 832 ms (f16) / 935 ms (q8_0).
  • ctest -L unit 26/26; test-t3-caches / test-cpu-caches fixture runs green.

Default stays f32 (bit-exact). The tts-ggml addon knob (kvCacheType, with q8_0 + a larger default ctx for chatterbox) follows in tetherto/qvac once the registry port picks this revision up — same flow as the supertonic EngineOptions additions in 0.2.1.

Comment thread tts-cpp/src/main.cpp
Comment thread tts-cpp/src/main.cpp
ogad-tether added a commit that referenced this pull request Jun 15, 2026
… unit test

Addresses two review comments on #43:

1. (main.cpp:326) "validating the string isn't enough — nothing checks
   the active backend's ggml_flash_attn_ext supports the requested
   quantized/f16 K/V; on an unsupported backend this aborts at graph
   compute rather than degrading gracefully."

   New chatterbox_resolve_kv_type(backend, requested, head_dim, n_head,
   n_kv_head): after the backend is initialised, build a throwaway
   no_alloc flash_attn_ext node shaped like the real T3 attention
   (Q=F32 [HD,1,n_head], K/V=requested [HD,8,n_kv_head], null mask =
   the N=1 step path) and ask ggml_backend_supports_op.  Unsupported ->
   fall back to GGML_TYPE_F32 with a stderr warning instead of
   asserting deep in ggml.  Wired into load_model_gguf (turbo) and
   load_model_gguf_mtl, replacing the raw `hp.kv_type = kv_type`.
   F32 short-circuits (always supported); null backend -> F32.

   Caveat documented in-code: a backend that ADVERTISES support via
   supports_op but faults at compute is not caught here — ggml-vulkan
   reports q8_0 K/V FA as supported on both scalar and coopmat2 paths,
   so this is the guard for honest backends / future ports, not a
   substitute for an upstream ggml-vulkan fix.

2. (main.cpp:447) "add tests that exercise the quantized KV cache
   paths."  New fixture-free test-kv-cache-type (ctest label "unit"):
   covers chatterbox_kv_type_from_str (incl. unknown -> f32 guard) and
   chatterbox_resolve_kv_type against a CPU backend (retains f16/q8_0,
   F32 + null-backend short-circuits).
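
The resolution order described in item 1 can be sketched without ggml. Everything here is hypothetical naming: `resolve_kv_type` is not the real `chatterbox_resolve_kv_type` signature, and the `fa_probe` callback stands in for building the throwaway flash_attn_ext node and asking `ggml_backend_supports_op`.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>

// Hypothetical local enum; the real code uses ggml tensor types.
enum class kv_type { f32, f16, q8_0 };

// Stand-in for the capability probe: returns true when the backend
// claims it can run flash attention with K/V of the given dtype.
using fa_probe = std::function<bool(kv_type)>;

inline kv_type resolve_kv_type(bool have_backend, kv_type requested, const fa_probe & probe) {
    if (requested == kv_type::f32) return kv_type::f32;  // always supported, no probe needed
    if (!have_backend)             return kv_type::f32;  // null backend short-circuits to f32
    if (probe && probe(requested)) return requested;     // backend advertises support
    std::fprintf(stderr, "warning: requested KV cache type unsupported by backend, using f32\n");
    return kv_type::f32;
}
```

As the caveat in item 1 notes, a probe of this shape only reflects what the backend advertises; a backend that reports support and then faults at compute passes straight through it.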

Verified: ctest -L unit 27/27; CPU greedy tokens still byte-identical
f32 vs q8_0 (probe doesn't perturb numerics); q8_0 retained on CPU and
on Vulkan/MoltenVK (scalar FA path) — load logs show KV=408 MB (q8_0)
vs ~1.6 GB (f32) at the turbo GGUF's native n_ctx.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ogad-tether ogad-tether requested a review from GustavoA1604 June 15, 2026 12:11
@ogad-tether ogad-tether self-assigned this Jun 15, 2026
GustavoA1604 previously approved these changes Jun 15, 2026
ogad-tether and others added 5 commits June 15, 2026 22:52
…ging the full file

Every chatterbox GGUF load used gguf_init_from_file(no_alloc=false),
which materialises the ENTIRE tensor-data section in host memory before
a single byte reaches the backend buffer.  On iOS that staging blob is
what pushed the QVAC SDK test process to a ~3.1 GB peak footprint
(task_vm_info.physFootprint) and into jetsam:

- T3 load:    +0.5-1.1 GB transient (staging + backend buffer coexist)
- voice_encoder_load / campplus_load / s3tokv2_load and the two mel-fb
  reads in main.cpp re-staged the whole T3 / S3Gen file again just to
  memcpy a few MB of F32 tensors out of it
- load_s3gen_gguf runs on the s3gen_preload background thread while the
  T3 weights are already resident, so its ~1 GB staging blob landed
  exactly on the process peak

New src/gguf_stream.h provides gguf_stream_reader: open the GGUF with
no_alloc=true (metadata-only tensors), allocate the destination, then
stream each tensor's payload from the file via fseek/fread —
ggml_backend_tensor_set in 8 MiB chunks for backend weights
(to_backend), or a single read into host vectors (to_host).  Peak host
overhead per load drops from sizeof(data section) to 8 MiB.  Tensor
size is validated against the destination before any byte is copied, so
metadata/file drift fails loudly instead of corrupting weights.

Converted call sites: load_model_gguf (turbo, main.cpp),
load_model_gguf_mtl (t3_mtl.cpp), load_s3gen_gguf (chatterbox_tts.cpp),
voice_encoder_load, campplus_load, s3tokv2_load, and the
s3gen/mel_fb/24k_80 + campplus/mel_fb_kaldi_80 single-tensor reads.
Left as-is: mel2wav.cpp (standalone demo tool, not in the SDK path) and
supertonic_gguf.cpp (same pattern, separate follow-up).

test/test_gguf_stream.cpp (ctest label "unit", no model fixture needed)
writes a synthetic GGUF — including a >8 MiB tensor so the chunked copy
crosses a chunk boundary — and asserts byte-exact parity between the
streaming and legacy staging loads for both to_backend and to_host,
plus loud failure on unknown names and size mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…at2 FA fault)

ggml-vulkan's supports_op advertises quantized K/V flash-attention as
supported, but the NV_coopmat2 kernel faults at compute on a q8_0 K/V
cache.  Toggle-confirmed in the downstream GPU CI A/B (same NVIDIA
RTX 4000 coopmat2 runners, ubuntu-22.04 + ubuntu-24.04):

  q8_0 KV default -> SIGSEGV (139) on both
  f32  KV default -> pass on both

Only the chatterbox default KV dtype differed; rules out the
token-major KV layout and the pre-existing chatterbox-Vulkan graph.
MoltenVK (scalar FA, no coopmat) runs q8_0 fine and byte-identical to
f32, so it's specific to the coopmat2 dequant-in-shader path.

The load-time capability probe (chatterbox_resolve_kv_type) can't catch
this — supports_op returns true — so add a targeted guard: quantized
K/V on a Vulkan backend falls back to f32 with a stderr warning. f16
(the native FA input type, not dequantized in-shader) is left intact;
Metal / CPU keep quantized K/V (validated byte-identical greedy decode).
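
The targeted guard amounts to a small decision table applied after the capability probe. A minimal sketch, assuming a hypothetical `apply_vulkan_guard` helper and a local enum in place of ggml types:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical local enum; the real code uses ggml tensor types.
enum class kv_type { f32, f16, q8_0 };

// Only q8_0 counts as quantized among the three selectable cache dtypes.
inline bool is_quantized(kv_type t) { return t == kv_type::q8_0; }

// Applied after the probe: quantized K/V on Vulkan downgrades to f32 with a
// warning; f16 on Vulkan and every dtype on other backends pass through.
inline kv_type apply_vulkan_guard(kv_type resolved, bool backend_is_vulkan) {
    if (backend_is_vulkan && is_quantized(resolved)) {
        std::fprintf(stderr, "warning: quantized KV cache disabled on Vulkan, using f32\n");
        return kv_type::f32;
    }
    return resolved;
}
```

Keeping the guard separate from the probe means re-enabling q8_0 on Vulkan only requires removing this one check.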

Net: the tts-ggml addon's q8_0 chatterbox default transparently
downgrades to f32 on Vulkan (Linux/Windows/Android) at load — no addon
change — while iOS/Metal and CPU keep the q8_0 memory win.  Re-enabling
q8_0 on Vulkan is a one-line revert once the upstream coopmat2 FA kernel
handles quantized K/V.

Verified on MoltenVK: q8_0 -> warns + f32 KV (1536 MB), f16 -> 768 MB,
f32 -> 1536 MB; CPU q8_0 unaffected (408 MB).  ctest -L unit 27/27.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ogad-tether

Author

Rebased onto current master (52d09d0, post-#42 supertonic3 + #47 QVAC-20484 S3Gen streaming).

Dropped the Android symbol-fix commit (8b012789) — that change (swapping the 3 raw ggml_backend_is_cpu calls in supertonic_gguf.cpp for tts_cpp::detail::backend_is_cpu) is the same edit as f7d4d6c (QVAC-19254), which already ships via the qvac packages/tts-ggml overlay and is the canonical owner of that fix. Keeping it here would have duplicated/collided. This PR is now chatterbox memory work only:

  1. adbe6ac6 stream GGUF tensor data instead of staging the full file
  2. 49293b99 drop unused gguf_stream_reader::release_scratch
  3. 026cfe24 selectable KV-cache dtype (f32|f16|q8_0), token-major slab
  4. e562109b KV-dtype capability probe + F32 fallback + unit test
  5. 05770ccc force f32 KV on Vulkan for quantized cache (coopmat2 FA fault)

All cherry-picks auto-merged cleanly over #47's segmentation changes — no conflicts. Verified locally on Metal: full libtts-cpp.a compiles, and both chatterbox unit tests (test-gguf-stream, test-kv-cache-type) pass. supertonic_gguf.cpp is intentionally left untouched (still f7d4d6c's domain).

Landing still gated on the QVAC-19254 / registry-publish ownership being sorted so we point at a single tts-cpp revision rather than forking the registry.

2 participants