QVAC-19557 tts-ggml: cap Chatterbox T3 context (KV cache) at 2048 by default by ogad-tether · Pull Request #2527 · tetherto/qvac

ogad-tether · 2026-06-10T17:13:11Z

Context

The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557 — the tts-chatterbox-* variants are skipped on iOS Device Farm for exactly this). The single largest contributor: tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full n_ctx. The Turbo GGUF ships n_ctx=8196, which costs

n_embd(1024) × n_layer(24) × 8196 × 4 B × 2 (K+V) ≈ 1.6 GB

for synthesis that rarely needs more than a few hundred tokens.
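The arithmetic above can be checked with a small helper. This is a minimal sketch, not code from tts-cpp: the function name `kv_cache_bytes_f32` is hypothetical, and it assumes one K and one V vector of `n_embd` floats per layer per token, as the formula implies.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of tts-cpp): bytes needed for an F32 KV cache
// that is allocated up-front for n_ctx tokens.
// Assumes one K and one V vector of n_embd floats per layer per token.
inline std::uint64_t kv_cache_bytes_f32(std::uint64_t n_embd,
                                        std::uint64_t n_layer,
                                        std::uint64_t n_ctx) {
    const std::uint64_t bytes_per_f32 = 4;
    const std::uint64_t tensors_per_layer = 2;  // K and V
    return n_embd * n_layer * n_ctx * bytes_per_f32 * tensors_per_layer;
}
```

With the Turbo figures (1024, 24, 8196) this gives 1,611,399,168 bytes, and with a 2048-token cap it gives 402,653,184 bytes, which is where the ~1.6 GB and ~400 MB figures in this description come from.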

Change

tts-cpp already supports capping via EngineOptions::n_ctx (the engine clamps the GGUF's n_ctx to it, never raises it), but the addon never set it. Now:

- ChatterboxModel::toEngineOptions passes n_ctx = nCtx ?? 2048 (kDefaultNCtx). 2048 tokens keep ≈80 s of generated audio per synthesize() call (T3 speech tokens run at 25 Hz) for ~400 MB of KV, a ~1.2 GB steady-state saving on the Turbo GGUF.
- New nCtx constructor option (JS → JSAdapter → ChatterboxConfig) for hosts that need a different cap. nCtx: 0 is the documented escape hatch back to the GGUF's full context; negative values are rejected at construction (validateConfig).
- Chatterbox-only: Supertonic has no autoregressive KV cache, so there is nothing to cap there.
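The option semantics described in this list can be sketched as follows. This is an illustration only: `resolve_n_ctx_cap` and `effective_n_ctx` are hypothetical names, not the addon's or tts-cpp's real functions, and the sketch assumes that a cap of 0 means "no cap" on the engine side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>

constexpr std::int32_t kDefaultNCtx = 2048;  // default named in this PR

// Hypothetical addon-side mapping: unset -> default, negative -> rejected.
inline std::int32_t resolve_n_ctx_cap(std::optional<std::int32_t> nCtx) {
    if (nCtx && *nCtx < 0) {
        throw std::invalid_argument("nCtx must be >= 0");
    }
    return nCtx.value_or(kDefaultNCtx);
}

// Hypothetical engine-side clamp: the cap can lower the GGUF's n_ctx but
// never raise it. Assumption: cap == 0 keeps the GGUF's full context.
inline std::int32_t effective_n_ctx(std::int32_t gguf_n_ctx, std::int32_t cap) {
    if (cap == 0) {
        return gguf_n_ctx;
    }
    return std::min(gguf_n_ctx, cap);
}
```

Under these assumptions, an omitted nCtx on the Turbo GGUF resolves to 2048, nCtx: 0 resolves to 8196, and a cap larger than the GGUF's n_ctx leaves the GGUF value unchanged.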

A companion tts-cpp PR (tetherto/qvac-ext-lib-whisper.cpp#43) removes the GGUF load-time host-staging spikes (+0.5–1 GB transient) on the same ticket; the two land independently.

Testing

- gtest: new engineOptionsForTests() hook exposes the config → EngineOptions mapping; covers the 2048 default, explicit forwarding, the 0 escape hatch, and negative rejection. tts_ggml_tests: 39/39 pass.
- JS: new unit test covers ttsParams forwarding (set, 0, and unset/omitted → addon default applies). npm run test:unit: 62/62 pass; npm run lint clean.
- README / index.d.ts / CHANGELOG updated.

🤖 Generated with Claude Code

…default

tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full n_ctx. The Turbo GGUF ships n_ctx=8196, which costs ~1.6 GB of KV (n_embd(1024) x n_layer(24) x 8196 x 4 B x 2 (K+V)) for synthesis that rarely needs more than a few hundred tokens, and is the single largest contributor to the ~3.1 GB peak footprint that gets the iOS QVAC SDK chatterbox e2e tests jetsam-killed (the tts-chatterbox-* variants are currently skipped on iOS Device Farm for exactly this).

tts-cpp already supports capping via EngineOptions::n_ctx (the engine clamps the GGUF's n_ctx to it, never raises it), but the addon never set it. Now:

- ChatterboxModel::toEngineOptions passes n_ctx = nCtx ?? 2048 (kDefaultNCtx). 2048 tokens keep ~80 s of generated audio per synthesize() call (T3 speech tokens run at 25 Hz) for ~400 MB of KV.
- New `nCtx` constructor option (JS -> JSAdapter -> ChatterboxConfig) for hosts that need a different cap; nCtx=0 is the documented escape hatch back to the GGUF's full context, negative values are rejected at construction (validateConfig).
- engineOptionsForTests() exposes the config -> EngineOptions mapping to the gtest suite; covers the 2048 default, explicit forwarding, the 0 escape hatch, and negative rejection. JS unit test covers ttsParams forwarding (set, 0, and unset/omitted).

Chatterbox-only: Supertonic has no autoregressive KV cache, so there is nothing to cap there.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ogad-tether · 2026-06-12T09:32:26Z

Re: the "quantize KV cache to q8 and raise ctx" suggestion — implemented engine-side in tetherto/qvac-ext-lib-whisper.cpp#43 (commit 3939db19): EngineOptions::kv_cache_type (f32 default / f16 / q8_0) on a token-major KV slab. Validated on real GGUFs: f32 is byte-identical to the old code, turbo greedy tokens are identical across all three dtypes on CPU and Metal, and Metal decode gets ~20-30% faster from the bandwidth saving. q8_0 stores the cache at ~27% of f32.

The addon knob can't land in this PR — it builds against the pinned vcpkg tts-cpp port, which won't expose the new field until #43 merges and the registry publishes a new port version (same flow as the supertonic EngineOptions additions in 0.2.1). Follow-up addon PR will: add kvCacheType, default chatterbox to q8_0, and raise the default nCtx 2048 → 4096 — that combination is ~204 MB of KV (vs ~400 MB for f32@2048 in this PR, vs 1.6 GB before) while doubling the usable context. This PR stays as-is and is safe to land first.
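The memory figures in this comment can be reproduced with a rough sizing sketch. The names `KvDtype`, `kv_slab_bytes`, and `max_audio_seconds` are hypothetical and not taken from tts-cpp. The sketch assumes ggml's q8_0 layout (blocks of 32 int8 values plus one 2-byte fp16 scale, 34 bytes per block) and rows whose length is a multiple of 32, which holds for n_embd = 1024.

```cpp
#include <cassert>
#include <cstdint>

enum class KvDtype { F32, F16, Q8_0 };

// Hypothetical sizing helper: total bytes for K and V across all layers.
// Assumes q8_0 stores 32 elements in 34 bytes (32 x int8 + one fp16 scale).
inline std::uint64_t kv_slab_bytes(std::uint64_t n_embd, std::uint64_t n_layer,
                                   std::uint64_t n_ctx, KvDtype dtype) {
    const std::uint64_t elements = n_embd * n_layer * n_ctx * 2;  // K and V
    switch (dtype) {
        case KvDtype::F32:  return elements * 4;
        case KvDtype::F16:  return elements * 2;
        case KvDtype::Q8_0: return elements / 32 * 34;
    }
    return 0;  // not reached for valid enum values
}

// T3 speech tokens run at 25 Hz, so n_ctx tokens bound the audio length.
inline double max_audio_seconds(std::uint64_t n_ctx) {
    return static_cast<double>(n_ctx) / 25.0;
}
```

Under these assumptions, q8_0 at 4096 tokens comes to 213,909,504 bytes, which is about 214 MB or 204 MiB, and 34/128 ≈ 26.6% of the f32 size. That matches both the "~204 MB" quoted here and the "~210 MB" quoted later in the thread to within rounding and unit choice, and 4096 tokens correspond to roughly 164 s of audio.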

github-actions · 2026-06-12T09:32:53Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

…x default 4096

Builds on the upstream kv_cache_type support (qvac-ext-lib-whisper.cpp#43): the T3 KV cache is allocated up-front at nCtx, and q8_0 stores it at ~27% of f32, so the new defaults (nCtx=4096 + kvCacheType="q8_0", ~210 MB of KV for ~160 s of audio per synthesize() call) use HALF the memory of the previous f32@2048 plan while doubling the usable context.

- New `kvCacheType` constructor option ('f32'|'f16'|'q8_0'), plumbed JS -> JSAdapter -> ChatterboxConfig -> EngineOptions. Unknown values are rejected at construction (tts-cpp's own fallback would silently revert to f32 and change the memory profile the caller asked for). kvCacheType:"f32" restores bit-exact pre-quantisation behaviour.
- nCtx default 2048 -> 4096 (cheaper than the old default AND longer, per the review suggestion to raise ctx alongside q8 KV).
- vcpkg tts-cpp pin -> 2026-06-12. This pin is Android-safe: the revision removes the last direct ggml_backend_is_cpu / ggml_get_type_traits_cpu references from tts-cpp (the unresolvable-UND dlopen crash behind the 0.2.2 revert), routing them through the backend registry + ggml_quantize_chunk (ggml-base).

Upstream validation on real GGUFs (see tetherto#43): Turbo greedy token sequences byte-identical across f32/f16/q8_0 on CPU and Metal; MTL CFG can flip a near-tie argmax (same class of variation as a seed change; whisper transcribes the q8_0 output to the exact input text); Metal decode 20-30% faster from the KV bandwidth saving.

Tests: gtest covers the q8_0 default, explicit forwarding, the f32 escape hatch, and unknown-value rejection (42/42 against tts-cpp 2026-06-12); JS unit suite 63/63; lint clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ogad-tether · 2026-06-12T09:52:25Z

Bumped this PR with the full KV-quantisation work (commit 70ea6cd28), per the review suggestion:

- kvCacheType option (f32|f16|q8_0), default q8_0: ~27% of f32's KV memory; unknown values rejected at construction; "f32" = bit-exact escape hatch.
- nCtx default 2048 → 4096: with q8_0 that's ~210 MB of KV for ~160 s of audio per call, which is less memory than the previous f32@2048 plan and double the context.
- vcpkg tts-cpp pin → 2026-06-12.
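The kvCacheType validation described above (q8_0 by default, unknown strings rejected rather than silently falling back to f32) might look roughly like this. It is a sketch under assumptions: `KvCacheType` and `parse_kv_cache_type` are hypothetical names, not the addon's actual validateConfig code.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class KvCacheType { F32, F16, Q8_0 };

// Hypothetical parser: an omitted option selects the q8_0 default; any string
// other than "f32", "f16", or "q8_0" is rejected at construction time.
inline KvCacheType parse_kv_cache_type(const std::optional<std::string>& value) {
    if (!value) {
        return KvCacheType::Q8_0;
    }
    if (*value == "f32")  return KvCacheType::F32;
    if (*value == "f16")  return KvCacheType::F16;
    if (*value == "q8_0") return KvCacheType::Q8_0;
    throw std::invalid_argument("unknown kvCacheType: " + *value);
}
```

Failing fast keeps a typo such as "q8" from quietly producing an f32 cache with roughly four times the memory the caller asked for.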

Re: the Android landmine that forced the 0.2.2 revert — defused upstream rather than worked around: tts-cpp PR #43 now removes the last direct ggml_backend_is_cpu / ggml_get_type_traits_cpu references (backend registry + ggml_quantize_chunk instead); nm -u libtts-cpp.a is clean of both symbols, so the Android GGML_BACKEND_DL=ON addon link can't end up with the unresolvable UND symbols again.

Validated locally against the new tts-cpp via an overlay port: addon gtests 42/42 (incl. new kvCacheType coverage), JS unit suite 63/63, lint clean.

Merge order / CI note:

1. qvac-ext-lib-whisper.cpp#43 (tts-cpp): "QVAC-19557 tts-cpp: stream chatterbox GGUF tensor data instead of staging the full file"
2. qvac-registry-vcpkg#188: "tts-cpp: publish 2026-06-12 - QVAC-19557 chatterbox memory (PR #43) + Android-safe symbols" (publishes tts-cpp@2026-06-12, draft until 1 merges)
3. this PR: needs one final commit bumping vcpkg-configuration.json's registry baseline to the registry commit from step 2 (until then the C++ CI here can't resolve tts-cpp>=2026-06-12). I'll push that bump as soon as qvac-registry-vcpkg#188 lands.

Temporary wiring so CI can resolve tts-cpp>=2026-06-12 before qvac-registry-vcpkg#188 merges: point the default registry at the tetherto#188 branch via the vcpkg-configuration 'reference' field and move the baseline to its tip.

Verified locally: vcpkg resolves tts-cpp@2026-06-12 + ggml-speech@2026-06-04 from the remote branch and the addon configures cleanly. Once tetherto#188 lands on the registry main, drop 'reference' and move 'baseline' to the merged commit (tracked in the PR checklist).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…2 guard)

Re-points the temporary tts-cpp-2026-06-12 registry-branch baseline to the updated port (qvac-registry-vcpkg#188 @ 4ace796) so CI builds against tts-cpp c8620cf9, which forces quantized KV to f32 on Vulkan (NV_coopmat2 FA fault) while keeping the q8_0 chatterbox default on Metal/CPU. Resolves the linux-x64 GPU-integration SIGSEGV.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
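The backend guard mentioned in this commit can be pictured as a small selection function. This is only a sketch: `KvStorage`, `ComputeBackend`, and `effective_kv_storage` are hypothetical names, and treating q8_0 as the only "quantized" type (leaving f16 untouched) is an assumption, since the commit message does not say how f16 is handled.

```cpp
#include <cassert>

enum class KvStorage { F32, F16, Q8_0 };
enum class ComputeBackend { Cpu, Metal, Vulkan };

// Hypothetical guard: on Vulkan a quantized KV request falls back to f32;
// other backends keep whatever was requested.
// Assumption: only q8_0 counts as "quantized" here.
inline KvStorage effective_kv_storage(KvStorage requested, ComputeBackend backend) {
    const bool quantized = (requested == KvStorage::Q8_0);
    if (backend == ComputeBackend::Vulkan && quantized) {
        return KvStorage::F32;
    }
    return requested;
}
```

With this shape the q8_0 default still applies on Metal and CPU, and only the Vulkan path pays the f32 memory cost.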

ogad-tether requested review from a team as code owners June 10, 2026 17:13

GustavoA1604 previously approved these changes Jun 10, 2026

ogad-tether dismissed GustavoA1604’s stale review via 70ea6cd June 12, 2026 09:48

ogad-tether mentioned this pull request Jun 12, 2026

tts-cpp: publish 2026-06-12 — QVAC-19557 chatterbox memory (PR #43) + Android-safe symbols tetherto/qvac-registry-vcpkg#188

Draft

ogad-tether added the "verified" label (Authorize secrets / label-gate in PR workflows) Jun 12, 2026

ogad-tether had a problem deploying to release June 12, 2026 09:59 with GitHub Actions (four attempts: three Error, one Failure); each matching temporary deployment to release is now Inactive.

GustavoA1604 previously approved these changes Jun 12, 2026

Zbig9000 previously approved these changes Jun 12, 2026

ogad-tether dismissed stale reviews from Zbig9000 and GustavoA1604 via 70a7502 June 15, 2026 09:51

ogad-tether wants to merge 4 commits into tetherto:main from ogad-tether:feat/chatterbox-nctx-cap


3 participants
