QVAC-19557 tts-ggml: cap Chatterbox T3 context (KV cache) at 2048 by default#2527
ogad-tether wants to merge 4 commits into
…default

tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full n_ctx. The Turbo GGUF ships n_ctx=8196, which costs ~1.6 GB of KV (n_embd(1024) x n_layer(24) x 8196 x 4 B x K+V) for synthesis that rarely needs more than a few hundred tokens, and is the single largest contributor to the ~3.1 GB peak footprint that gets the iOS QVAC SDK chatterbox e2e tests jetsam-killed (the tts-chatterbox-* variants are currently skipped on iOS Device Farm for exactly this).

tts-cpp already supports capping via EngineOptions::n_ctx (the engine clamps the GGUF's n_ctx to it, never raises it), but the addon never set it. Now:

- ChatterboxModel::toEngineOptions passes n_ctx = nCtx ?? 2048 (kDefaultNCtx). 2048 tokens keep ~80 s of generated audio per synthesize() call (T3 speech tokens run at 25 Hz) for ~400 MB of KV.
- New `nCtx` constructor option (JS -> JSAdapter -> ChatterboxConfig) for hosts that need a different cap; nCtx=0 is the documented escape hatch back to the GGUF's full context, negative values are rejected at construction (validateConfig).
- engineOptionsForTests() exposes the config -> EngineOptions mapping to the gtest suite; covers the 2048 default, explicit forwarding, the 0 escape hatch, and negative rejection. JS unit test covers ttsParams forwarding (set, 0, and unset/omitted).

Chatterbox-only: Supertonic has no autoregressive KV cache, so there is nothing to cap there.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Re: the "quantize KV cache to q8 and raise ctx" suggestion: implemented engine-side in tetherto/qvac-ext-lib-whisper.cpp#43 (commit The addon knob can't land in this PR; it builds against the pinned vcpkg
…x default 4096
Builds on the upstream kv_cache_type support
(qvac-ext-lib-whisper.cpp#43): the T3 KV cache is allocated up-front
at nCtx, and q8_0 stores it at ~27% of f32 — so the new defaults
(nCtx=4096 + kvCacheType="q8_0", ~210 MB of KV for ~160 s of audio
per synthesize() call) use HALF the memory of the previous
f32@2048 plan while doubling the usable context.
- New `kvCacheType` constructor option ('f32'|'f16'|'q8_0'), plumbed
JS -> JSAdapter -> ChatterboxConfig -> EngineOptions. Unknown
values are rejected at construction (tts-cpp's own fallback would
silently revert to f32 and change the memory profile the caller
asked for). kvCacheType:"f32" restores bit-exact pre-quantisation
behaviour.
- nCtx default 2048 -> 4096 (cheaper than the old default AND longer,
per the review suggestion to raise ctx alongside q8 KV).
- vcpkg tts-cpp pin -> 2026-06-12. This pin is Android-safe: the
revision removes the last direct ggml_backend_is_cpu /
ggml_get_type_traits_cpu references from tts-cpp (the
unresolvable-UND dlopen crash behind the 0.2.2 revert), routing
them through the backend registry + ggml_quantize_chunk (ggml-base).
Upstream validation on real GGUFs (see tetherto#43): Turbo greedy token
sequences byte-identical across f32/f16/q8_0 on CPU and Metal; MTL
CFG can flip a near-tie argmax (same class of variation as a seed
change; whisper transcribes the q8_0 output to the exact input
text); Metal decode 20-30% faster from the KV bandwidth saving.
Tests: gtest covers the q8_0 default, explicit forwarding, the f32
escape hatch, and unknown-value rejection (42/42 against tts-cpp
2026-06-12); JS unit suite 63/63; lint clean.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
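
The ~27% and ~210 MB figures in this commit message follow from ggml's q8_0 block layout, which stores 32 values in 34 bytes (32 int8 quants plus one f16 scale). The following is a minimal Python sketch for checking the arithmetic, not code from tts-cpp or the addon: `BYTES_PER_ELEM` and `kv_cache_mb` are hypothetical names, and the model dimensions (n_embd=1024, n_layer=24) are the ones quoted in the earlier commit message.

```python
# Bytes per stored element for each KV cache type.
BYTES_PER_ELEM = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one f16 scale per 34-byte block
}

SPEECH_TOKEN_HZ = 25  # T3 speech tokens per second of audio (from the PR)


def kv_cache_mb(n_ctx, kv_type, n_embd=1024, n_layer=24):
    """Size in MB of an up-front KV cache (one K and one V tensor per layer)."""
    if kv_type not in BYTES_PER_ELEM:
        # Mirrors the PR's choice to reject unknown types instead of
        # silently falling back to f32.
        raise ValueError(f"unknown kvCacheType: {kv_type!r}")
    return n_embd * n_layer * n_ctx * BYTES_PER_ELEM[kv_type] * 2 / 1e6


old = kv_cache_mb(2048, "f32")
new = kv_cache_mb(4096, "q8_0")
ratio = BYTES_PER_ELEM["q8_0"] / BYTES_PER_ELEM["f32"]

print(f"f32  @ 2048: {old:.0f} MB, {2048 / SPEECH_TOKEN_HZ:.0f} s of audio")
print(f"q8_0 @ 4096: {new:.0f} MB, {4096 / SPEECH_TOKEN_HZ:.0f} s of audio")
print(f"q8_0 / f32 per element: {ratio:.1%}")
```

Under these assumptions the q8_0 cache at 4096 tokens comes to roughly 214 MB against roughly 403 MB for f32 at 2048, which matches the "half the memory, double the context" claim.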
Bumped this PR with the full KV-quantisation work (commit
Re: the Android landmine that forced the 0.2.2 revert: defused upstream rather than worked around. tts-cpp PR #43 now removes the last direct

Validated locally against the new tts-cpp via an overlay port: addon gtests 42/42 (incl. new kvCacheType coverage), JS unit suite 63/63, lint clean.

Merge order / CI note:
Temporary wiring so CI can resolve tts-cpp>=2026-06-12 before qvac-registry-vcpkg#188 merges: point the default registry at the tetherto#188 branch via the vcpkg-configuration 'reference' field and move the baseline to its tip.

Verified locally: vcpkg resolves tts-cpp@2026-06-12 + ggml-speech@2026-06-04 from the remote branch and the addon configures cleanly.

Once tetherto#188 lands on the registry main, drop 'reference' and move 'baseline' to the merged commit (tracked in the PR checklist).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…2 guard)

Re-points the temporary tts-cpp-2026-06-12 registry-branch baseline to the updated port (qvac-registry-vcpkg#188 @ 4ace796) so CI builds against tts-cpp c8620cf9, which forces quantized KV to f32 on Vulkan (NV_coopmat2 FA fault) while keeping the q8_0 chatterbox default on Metal/CPU. Resolves the linux-x64 GPU-integration SIGSEGV.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Context
The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557; the `tts-chatterbox-*` variants are skipped on iOS Device Farm for exactly this). The single largest contributor: tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full `n_ctx`. The Turbo GGUF ships `n_ctx=8196`, which costs ~1.6 GB of KV (n_embd (1024) × n_layer (24) × 8196 × 4 B × K+V) for synthesis that rarely needs more than a few hundred tokens.
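
The KV cache cost can be reproduced with a quick back-of-envelope calculation. The sketch below is illustrative Python, not code from tts-cpp or the addon; the function name `kv_cache_bytes` is hypothetical, and the dimensions (n_embd=1024, n_layer=24, 4-byte F32 elements, separate K and V tensors) are the ones quoted in this PR.

```python
def kv_cache_bytes(n_ctx: int, n_embd: int = 1024, n_layer: int = 24,
                   bytes_per_elem: int = 4) -> int:
    """Up-front KV cache size in bytes: one K and one V tensor per layer."""
    kv_tensors = 2  # K + V
    return n_embd * n_layer * n_ctx * bytes_per_elem * kv_tensors


full = kv_cache_bytes(8196)    # the Turbo GGUF's shipped n_ctx
capped = kv_cache_bytes(2048)  # the 2048-token cap this PR introduces

print(f"n_ctx=8196: {full / 1e9:.2f} GB")
print(f"n_ctx=2048: {capped / 1e6:.0f} MB")
print(f"saving:     {(full - capped) / 1e9:.2f} GB")
```

With these inputs the full-context cache is about 1.61 GB and the capped one about 403 MB, a difference of roughly 1.2 GB.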
Change
tts-cpp already supports capping via `EngineOptions::n_ctx` (the engine clamps the GGUF's `n_ctx` to it, never raises it), but the addon never set it. Now:

- `ChatterboxModel::toEngineOptions` passes `n_ctx = nCtx ?? 2048` (`kDefaultNCtx`). 2048 tokens keep ≈80 s of generated audio per `synthesize()` call (T3 speech tokens run at 25 Hz) for ~400 MB of KV, a ~1.2 GB steady-state saving on the Turbo GGUF.
- New `nCtx` constructor option (JS → JSAdapter → ChatterboxConfig) for hosts that need a different cap. `nCtx: 0` is the documented escape hatch back to the GGUF's full context; negative values are rejected at construction (`validateConfig`).

A companion tts-cpp PR (tetherto/qvac-ext-lib-whisper.cpp#43) removes the GGUF load-time host-staging spikes (+0.5–1 GB transient) on the same ticket; the two land independently.
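
The `nCtx` semantics described in this section can be sketched as follows. This is a hypothetical Python model of the behaviour stated in the PR, not the addon's C++ or JS code: the names `resolve_n_ctx_cap` and `effective_n_ctx` are invented for illustration, and the sketch assumes that a cap of 0 reaches the engine unchanged and means "no cap".

```python
from typing import Optional

K_DEFAULT_N_CTX = 2048  # mirrors kDefaultNCtx from the PR description


def resolve_n_ctx_cap(n_ctx: Optional[int]) -> int:
    """Map the nCtx constructor option to the cap passed to the engine."""
    if n_ctx is None:
        return K_DEFAULT_N_CTX  # option omitted: the addon default applies
    if n_ctx < 0:
        raise ValueError("nCtx must be >= 0")  # rejected at construction
    return n_ctx  # 0 is forwarded as-is (assumed to mean "no cap")


def effective_n_ctx(gguf_n_ctx: int, cap: int) -> int:
    """Engine-side clamp: the cap can lower the GGUF's n_ctx, never raise it."""
    if cap == 0:
        return gguf_n_ctx
    return min(gguf_n_ctx, cap)


# Turbo GGUF ships n_ctx=8196; try an omitted, a small, a zero, and an
# oversized option value.
for option in (None, 512, 0, 16384):
    cap = resolve_n_ctx_cap(option)
    print(option, "->", effective_n_ctx(8196, cap))
```

Keeping the default in the addon and the clamp in the engine means a host that asks for more context than the GGUF provides still gets the GGUF's own limit, while a host that omits the option gets the 2048-token cap.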
Testing
- `engineOptionsForTests()` hook exposes the config → EngineOptions mapping; covers the 2048 default, explicit forwarding, the 0 escape hatch, and negative rejection. `tts_ggml_tests`: 39/39 pass.
- JS unit test covers `ttsParams` forwarding (set, 0, and unset/omitted → addon default applies). `npm run test:unit`: 62/62 pass; `npm run lint` clean.

🤖 Generated with Claude Code