QVAC-19557 tts-cpp: stream chatterbox GGUF tensor data instead of staging the full file #43
ogad-tether wants to merge 5 commits into
Conversation
    void release_scratch() {
        scratch_.clear();
        scratch_.shrink_to_fit();
    }
Seems like this function is dead code
Good catch — it was speculative API with no call site (every loader destroys the reader right after the copy loop, so the scratch never outlives the load). Removed in 054ca71.
Review feedback on #43: no call site ever needed to drop the 8 MiB chunk scratch before the reader goes out of scope (every loader destroys the reader as soon as the copy loop finishes), so the method was dead code. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review follow-up on #43: quantising the T3 KV cache cuts its up-front allocation from f32's 4 B/elem to f16's 2 B or q8_0's ~1.06 B (one fp16 scale per 32 values). q8_0 is ~27% of f32, so the same memory budget buys ~3.7x the context length.

New EngineOptions::kv_cache_type ("f32" default / "f16" / "q8_0"), plumbed through load_model_gguf{,_mtl} into hparams.kv_type, plus a --kv-cache-type CLI flag. Unknown strings warn and fall back to f32 so a typo can't silently change numerics.

The enabling change is a KV slab layout swap from head-major [HD, n_ctx, n_heads] to token-major [HD*n_heads, n_ctx] (one ggml_row_size(kv_type, HD*n_heads) row per cached position, heads packed inside the row; this is llama.cpp's layout):

- the per-step append at position n_past becomes a CONTIGUOUS span, which is what a quantised dtype requires: ggml-cpu's dup→quantized path GGML_ABORTs on a non-contiguous dst;
- the append consumes the pre-permute K (rope output) / V (projection output) directly, dropping two per-layer ggml_cont(permute(...)) on the MTL path;
- flash_attn_ext reads the [HD, L, n_heads] slice with plain strides (pos stride = one token row, head stride = one HD-row inside it) and consumes f16/q8_0 K/V natively on CPU and Metal (kernel_flash_attn_ext_q8_0_dk64_dv64 matches head_dim=64);
- all offsets land on whole HD=64-element rows = two q8_0 blocks, so quantised views stay block-aligned.

The MTL B=2 batched write splits into one cpy per batch half (a single ne[3]=2 view would have a batch gap and stop being contiguous).

Validated on real GGUFs (turbo Q4_0 + mtl Q4_0, CPU + Metal on M2):

- f32 old-vs-new BYTE-IDENTICAL greedy token sequences on turbo CPU, mtl CPU, and mtl Metal (B=2): the layout swap changes nothing at the default dtype;
- turbo greedy: f32 == f16 == q8_0, identical across CPU and Metal (90/90 tokens, zero argmax flips);
- mtl greedy: f16/q8_0 each diverge from f32 by a single near-tie argmax flip (CFG's cond-uncond mixing amplifies quantisation epsilon); whisper-cli transcribes the q8_0 output to the exact input text on both variants;
- Metal T3 decode gets FASTER from the bandwidth saving: 1139 ms (f32) -> 832 ms (f16) / 935 ms (q8_0) on turbo;
- ctest -L unit 26/26; test-t3-caches + test-cpu-caches fixture runs green against the downloaded GGUFs.

The tts-ggml addon knob (kvCacheType + q8_0 default for chatterbox) follows in qvac once the registry port picks this up, the same flow as the supertonic EngineOptions additions in 0.2.1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
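The size arithmetic and the token-major addressing described in this commit message can be sketched in a few lines. This is an illustration only, not the PR's code: `KvType`, `row_size_bytes`, and `KvLayout` are hypothetical names standing in for ggml's type enum and `ggml_row_size`, and the head count used in the usage note is an assumption. The only format fact relied on is the one stated above: q8_0 stores one fp16 scale per 32 int8 values, so 34 bytes per 32 elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the ggml tensor type enum (illustrative names).
enum class KvType { F32, F16, Q8_0 };

// Bytes needed to store one row of n elements.
// q8_0: blocks of 32 int8 values plus one 2-byte fp16 scale = 34 bytes per block.
inline std::size_t row_size_bytes(KvType t, std::size_t n) {
    switch (t) {
        case KvType::F32:  return n * 4;
        case KvType::F16:  return n * 2;
        case KvType::Q8_0: assert(n % 32 == 0); return (n / 32) * 34;
    }
    return 0;
}

// Token-major slab: one row of head_dim * n_heads elements per cached position,
// with the heads packed next to each other inside that row.
struct KvLayout {
    KvType      type;
    std::size_t head_dim;
    std::size_t n_heads;
    std::size_t n_ctx;

    std::size_t token_row_bytes() const { return row_size_bytes(type, head_dim * n_heads); }
    std::size_t slab_bytes()      const { return token_row_bytes() * n_ctx; }

    // Byte offset of head h at cached position pos.
    std::size_t offset(std::size_t pos, std::size_t h) const {
        return pos * token_row_bytes() + h * row_size_bytes(type, head_dim);
    }
};
```

With this addressing, the append at position `n_past` writes the byte range from `offset(n_past, 0)` up to `offset(n_past + 1, 0)`, which is a single contiguous span regardless of the head count. With `head_dim = 64`, every head boundary falls on a multiple of 68 bytes, that is, on two whole q8_0 blocks.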
Pushed Why it's in this PR: the enabling change is a KV slab layout swap (head-major → token-major, llama.cpp-style) so the per-step append is a contiguous span — ggml-cpu's dup→quantized path hard-aborts on non-contiguous destinations. Same QVAC-19557 memory arc as the streaming loads. Bonus: the MTL path drops two per-layer Validation on real GGUFs (turbo + mtl, CPU + Metal, M2):
Default stays f32 (bit-exact). The tts-ggml addon knob ( |
… unit test

Addresses two review comments on #43:

1. (main.cpp:326) "validating the string isn't enough: nothing checks the active backend's ggml_flash_attn_ext supports the requested quantized/f16 K/V; on an unsupported backend this aborts at graph compute rather than degrading gracefully."

   New chatterbox_resolve_kv_type(backend, requested, head_dim, n_head, n_kv_head): after the backend is initialised, build a throwaway no_alloc flash_attn_ext node shaped like the real T3 attention (Q=F32 [HD,1,n_head], K/V=requested [HD,8,n_kv_head], null mask = the N=1 step path) and ask ggml_backend_supports_op. Unsupported -> fall back to GGML_TYPE_F32 with a stderr warning instead of asserting deep in ggml. Wired into load_model_gguf (turbo) and load_model_gguf_mtl, replacing the raw `hp.kv_type = kv_type`. F32 short-circuits (always supported); null backend -> F32.

   Caveat documented in-code: a backend that ADVERTISES support via supports_op but faults at compute is not caught here. ggml-vulkan reports q8_0 K/V FA as supported on both scalar and coopmat2 paths, so this is the guard for honest backends / future ports, not a substitute for an upstream ggml-vulkan fix.

2. (main.cpp:447) "add tests that exercise the quantized KV cache paths."

   New fixture-free test-kv-cache-type (ctest label "unit"): covers chatterbox_kv_type_from_str (incl. unknown -> f32 guard) and chatterbox_resolve_kv_type against a CPU backend (retains f16/q8_0, F32 + null-backend short-circuits).

Verified: ctest -L unit 27/27; CPU greedy tokens still byte-identical f32 vs q8_0 (probe doesn't perturb numerics); q8_0 retained on CPU and on Vulkan/MoltenVK (scalar FA path). Load logs show KV=408 MB (q8_0) vs ~1.6 GB (f32) at the turbo GGUF's native n_ctx.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
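The two-stage resolution described here (parse the string, then confirm the backend can actually run the requested type) can be sketched without ggml. Everything below is hypothetical: `KvType`, `kv_type_from_str`, and `resolve_kv_type` are illustrative names, and the `supports` callable stands in for the throwaway flash_attn_ext node plus the `ggml_backend_supports_op` query. It only mirrors the fallback order the commit message states.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical stand-in for the ggml tensor type enum (illustrative names).
enum class KvType { F32, F16, Q8_0 };

// Unknown strings warn and fall back to f32, so a typo cannot change numerics.
inline KvType kv_type_from_str(const std::string & s) {
    if (s == "f32")  return KvType::F32;
    if (s == "f16")  return KvType::F16;
    if (s == "q8_0") return KvType::Q8_0;
    std::fprintf(stderr, "warning: unknown KV cache type '%s', using f32\n", s.c_str());
    return KvType::F32;
}

// `supports` is an assumption: it represents "build a probe attention node with
// this K/V type and ask the backend whether it supports the op".
inline KvType resolve_kv_type(bool have_backend,
                              KvType requested,
                              const std::function<bool(KvType)> & supports) {
    assert(supports && "probe callable must be set");
    if (requested == KvType::F32) return KvType::F32;  // always supported, skip the probe
    if (!have_backend)            return KvType::F32;  // nothing to ask
    if (!supports(requested)) {
        std::fprintf(stderr, "warning: backend lacks this KV cache type, using f32\n");
        return KvType::F32;
    }
    return requested;
}
```

A caller would chain the two, for example `resolve_kv_type(true, kv_type_from_str("q8_0"), probe)`. As the caveat above notes, a probe of this kind cannot catch a backend that reports support and then faults at compute time.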
…ging the full file

Every chatterbox GGUF load used gguf_init_from_file(no_alloc=false), which materialises the ENTIRE tensor-data section in host memory before a single byte reaches the backend buffer. On iOS that staging blob is what pushed the QVAC SDK test process to a ~3.1 GB peak footprint (task_vm_info.physFootprint) and into jetsam:

- T3 load: +0.5-1.1 GB transient (staging + backend buffer coexist)
- voice_encoder_load / campplus_load / s3tokv2_load and the two mel-fb reads in main.cpp re-staged the whole T3 / S3Gen file again just to memcpy a few MB of F32 tensors out of it
- load_s3gen_gguf runs on the s3gen_preload background thread while the T3 weights are already resident, so its ~1 GB staging blob landed exactly on the process peak

New src/gguf_stream.h provides gguf_stream_reader: open the GGUF with no_alloc=true (metadata-only tensors), allocate the destination, then stream each tensor's payload from the file via fseek/fread, using ggml_backend_tensor_set in 8 MiB chunks for backend weights (to_backend) or a single read into host vectors (to_host). Peak host overhead per load drops from sizeof(data section) to 8 MiB. Tensor size is validated against the destination before any byte is copied, so metadata/file drift fails loudly instead of corrupting weights.

Converted call sites: load_model_gguf (turbo, main.cpp), load_model_gguf_mtl (t3_mtl.cpp), load_s3gen_gguf (chatterbox_tts.cpp), voice_encoder_load, campplus_load, s3tokv2_load, and the s3gen/mel_fb/24k_80 + campplus/mel_fb_kaldi_80 single-tensor reads.

Left as-is: mel2wav.cpp (standalone demo tool, not in the SDK path) and supertonic_gguf.cpp (same pattern, separate follow-up).

test/test_gguf_stream.cpp (ctest label "unit", no model fixture needed) writes a synthetic GGUF, including a >8 MiB tensor so the chunked copy crosses a chunk boundary, and asserts byte-exact parity between the streaming and legacy staging loads for both to_backend and to_host, plus loud failure on unknown names and size mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
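The chunked copy at the heart of this change can be sketched with plain stdio. This is a minimal illustration under stated assumptions, not the contents of src/gguf_stream.h: `stream_payload` and `ChunkSink` are hypothetical names, the sink callable stands in for `ggml_backend_tensor_set(tensor, data, offset, size)`, and the source and destination sizes are passed in directly rather than read from GGUF metadata.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// sink(chunk, dst_offset, n_bytes): hypothetical stand-in for the backend upload call.
using ChunkSink = std::function<void(const void *, std::size_t, std::size_t)>;

// Streams src_bytes of payload starting at file_offset into the sink, at most
// chunk_bytes at a time, so host memory never holds more than one chunk.
// Note: a real loader for multi-GB files would want a 64-bit seek; plain
// fseek with a long offset is used here only to keep the sketch portable.
inline bool stream_payload(std::FILE * f, long file_offset,
                           std::size_t src_bytes, std::size_t dst_bytes,
                           std::size_t chunk_bytes, const ChunkSink & sink) {
    if (f == nullptr || chunk_bytes == 0) return false;
    // Validate before copying anything: a size mismatch must not touch the destination.
    if (src_bytes != dst_bytes) return false;
    if (std::fseek(f, file_offset, SEEK_SET) != 0) return false;

    std::vector<std::uint8_t> scratch(std::min(chunk_bytes, src_bytes));
    std::size_t done = 0;
    while (done < src_bytes) {
        const std::size_t n = std::min(chunk_bytes, src_bytes - done);
        if (std::fread(scratch.data(), 1, n, f) != n) return false;  // short read: fail loudly
        sink(scratch.data(), done, n);
        done += n;
    }
    assert(done == src_bytes);
    return true;
}
```

In the PR's terms, `chunk_bytes` would be 8 MiB and the sink would forward each chunk to the backend buffer. Because the scratch vector lives inside the call, it is released as soon as the copy finishes, which matches the review observation that no separate release step is needed.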
…at2 FA fault)

ggml-vulkan's supports_op advertises quantized K/V flash-attention as supported, but the NV_coopmat2 kernel faults at compute on a q8_0 K/V cache. Toggle-confirmed in the downstream GPU CI A/B (same NVIDIA RTX 4000 coopmat2 runners, ubuntu-22.04 + ubuntu-24.04):

- q8_0 KV default -> SIGSEGV (139) on both
- f32 KV default -> pass on both

Only the chatterbox default KV dtype differed; this rules out the token-major KV layout and the pre-existing chatterbox-Vulkan graph. MoltenVK (scalar FA, no coopmat) runs q8_0 fine and byte-identical to f32, so the fault is specific to the coopmat2 dequant-in-shader path.

The load-time capability probe (chatterbox_resolve_kv_type) can't catch this because supports_op returns true, so add a targeted guard: quantized K/V on a Vulkan backend falls back to f32 with a stderr warning. f16 (the native FA input type, not dequantized in-shader) is left intact; Metal / CPU keep quantized K/V (validated byte-identical greedy decode).

Net: the tts-ggml addon's q8_0 chatterbox default transparently downgrades to f32 on Vulkan (Linux/Windows/Android) at load, with no addon change, while iOS/Metal and CPU keep the q8_0 memory win. Re-enabling q8_0 on Vulkan is a one-line revert once the upstream coopmat2 FA kernel handles quantized K/V.

Verified on MoltenVK: q8_0 -> warns + f32 KV (1536 MB), f16 -> 768 MB, f32 -> 1536 MB; CPU q8_0 unaffected (408 MB). ctest -L unit 27/27.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
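The targeted guard described above amounts to a small decision table. The sketch below is an assumption-laden illustration: how the PR actually identifies a Vulkan backend is not shown in this page, so the prefix match on a backend name string is hypothetical, as are the names `KvType`, `is_quantized`, and `guard_vulkan_quantized_kv`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the ggml tensor type enum (illustrative names).
enum class KvType { F32, F16, Q8_0 };

// f16 is not a block-quantized type, so the guard leaves it alone.
inline bool is_quantized(KvType t) { return t == KvType::Q8_0; }

// Assumption: the backend is recognised by a name that starts with "Vulkan".
inline KvType guard_vulkan_quantized_kv(const std::string & backend_name, KvType requested) {
    const bool is_vulkan = backend_name.rfind("Vulkan", 0) == 0;  // prefix match
    if (is_vulkan && is_quantized(requested)) {
        std::fprintf(stderr, "warning: quantized KV cache disabled on Vulkan, using f32\n");
        return KvType::F32;
    }
    return requested;
}
```

Keeping the guard as one separate function is what makes the re-enable a small revert: removing the call restores the capability probe as the only gate.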
c8620cf to 05770cc
Rebased onto current Dropped the Android symbol-fix commit (
All cherry-picks auto-merged cleanly over #47's segmentation changes — no conflicts. Verified locally on Metal: full Landing still gated on the QVAC-19254 / registry-publish ownership being sorted so we point at a single tts-cpp revision rather than forking the registry. |
Context

The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557; the `tts-chatterbox-*` variants are currently skipped on iOS Device Farm). Part of that peak is transient: every chatterbox GGUF load used `gguf_init_from_file(no_alloc=false)`, which materialises the entire tensor-data section in host memory before a single byte reaches the backend buffer:

- `voice_encoder_load` / `campplus_load` / `s3tokv2_load` and the two mel-filterbank reads in `main.cpp` re-staged the whole T3 / S3Gen file again just to memcpy a few MB of F32 tensors out of it (the voice-encoder one runs right after the T3 weights are already resident).
- `load_s3gen_gguf` runs on the `s3gen_preload` background thread while T3 is resident, so its ~1 GB staging blob landed exactly on the process peak.

Change

New `src/gguf_stream.h` provides `gguf_stream_reader`: open the GGUF with `no_alloc=true` (metadata-only tensors), allocate the destination, then stream each tensor's payload from the file via `fseek`/`fread`, using `ggml_backend_tensor_set` in 8 MiB chunks for backend weights (`to_backend`) or a single read into host vectors (`to_host`). Peak host overhead per load drops from `sizeof(data section)` to 8 MiB. Tensor size is validated against the destination before any byte is copied, so metadata/file drift fails loudly instead of corrupting weights.

Converted call sites: `load_model_gguf` (turbo), `load_model_gguf_mtl`, `load_s3gen_gguf`, `voice_encoder_load`, `campplus_load`, `s3tokv2_load`, and the `s3gen/mel_fb/24k_80` + `campplus/mel_fb_kaldi_80` single-tensor reads.

Left as-is (follow-ups): `mel2wav.cpp` (standalone demo tool, not in the SDK path) and `supertonic_gguf.cpp` (same staging pattern, separate scope).

Uses only `gguf_*` / `ggml-backend` registry-safe APIs, with no direct ggml-cpu symbols, so the Android `GGML_BACKEND_DL=ON` constraint (qvac PR ggml-org#2502 revert) is unaffected.

Testing

`test/test_gguf_stream.cpp` (ctest label `unit`, no model fixture needed): writes a synthetic GGUF, including a >8 MiB tensor so the chunked copy crosses a chunk boundary, and asserts byte-exact parity between the streaming and legacy staging loads for both `to_backend` and `to_host`, plus loud failure on unknown names and size mismatches.

`ctest -L unit`: 26/26 pass.

`test-t3-caches-mtl` / `test-cpu-caches-{turbo,mtl}` will exercise the new path wherever the GGUF fixtures are staged.

🤖 Generated with Claude Code