
QVAC-19557 tts-ggml: cap Chatterbox T3 context (KV cache) at 2048 by default#2527

Open
ogad-tether wants to merge 4 commits into tetherto:main from ogad-tether:feat/chatterbox-nctx-cap

@ogad-tether

Contributor

Context

The iOS QVAC SDK chatterbox e2e tests peak at ~3.1 GB physFootprint and get jetsam-killed (QVAC-19557 — the tts-chatterbox-* variants are skipped on iOS Device Farm for exactly this). The single largest contributor: tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full n_ctx. The Turbo GGUF ships n_ctx=8196, which costs

n_embd(1024) × n_layer(24) × 8196 × 4 B × 2 (K+V) ≈ 1.6 GB

for synthesis that rarely needs more than a few hundred tokens.
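The arithmetic above can be reproduced with a few lines of TypeScript. This is only a sketch of the sizing formula quoted in this description; `kvCacheBytes` is a hypothetical helper name, not a function in tts-cpp or the addon.

```typescript
// Size of an up-front KV cache: one K and one V tensor per layer,
// each holding nEmbd values for every context position.
// Hypothetical helper for illustration; not part of tts-cpp.
function kvCacheBytes(
  nEmbd: number,
  nLayer: number,
  nCtx: number,
  bytesPerElem: number,
): number {
  const kAndV = 2; // keys + values
  return nEmbd * nLayer * nCtx * bytesPerElem * kAndV;
}

const F32_BYTES = 4;

// Turbo GGUF figures quoted above: n_embd=1024, n_layer=24, n_ctx=8196.
const turboFullBytes = kvCacheBytes(1024, 24, 8196, F32_BYTES);
console.log(`${(turboFullBytes / 1e9).toFixed(2)} GB`); // prints "1.61 GB"
```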

Change

tts-cpp already supports capping via EngineOptions::n_ctx (the engine clamps the GGUF's n_ctx to it, never raises it), but the addon never set it. Now:

  • ChatterboxModel::toEngineOptions passes n_ctx = nCtx ?? 2048 (kDefaultNCtx). 2048 tokens keep ≈80 s of generated audio per synthesize() call (T3 speech tokens run at 25 Hz) for ~400 MB of KV — a ~1.2 GB steady-state saving on the Turbo GGUF.
  • New nCtx constructor option (JS → JSAdapter → ChatterboxConfig) for hosts that need a different cap. nCtx: 0 is the documented escape hatch back to the GGUF's full context; negative values are rejected at construction (validateConfig).
  • Chatterbox-only: Supertonic has no autoregressive KV cache, so there is nothing to cap there.
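The capping rules listed above can be sketched as follows. This is an illustration of the described behaviour, not the addon's actual code: `resolveNCtx` and `maxAudioSeconds` are hypothetical names, and the sketch assumes the only rejected inputs are negative values.

```typescript
const K_DEFAULT_N_CTX = 2048;    // default cap introduced by this PR (kDefaultNCtx)
const T3_TOKENS_PER_SECOND = 25; // T3 speech-token rate stated above

// Hypothetical helper: combines the addon default, the 0 escape hatch,
// negative rejection, and the engine-side clamp into one function.
function resolveNCtx(ggufNCtx: number, nCtx?: number): number {
  const requested = nCtx ?? K_DEFAULT_N_CTX;
  if (requested < 0) {
    throw new RangeError(`nCtx must be >= 0, got ${requested}`);
  }
  if (requested === 0) {
    return ggufNCtx; // escape hatch: use the GGUF's full context
  }
  return Math.min(ggufNCtx, requested); // clamp down, never raise
}

// Upper bound on generated audio if the whole context held speech tokens.
function maxAudioSeconds(nCtx: number): number {
  return nCtx / T3_TOKENS_PER_SECOND;
}

console.log(resolveNCtx(8196));                  // default cap applies
console.log(resolveNCtx(8196, 0));               // full GGUF context
console.log(maxAudioSeconds(resolveNCtx(8196))); // roughly 80 s
```

Because the cap is applied with `Math.min`, asking for more context than the GGUF provides leaves the GGUF value in place, which matches the "clamps, never raises" behaviour described for `EngineOptions::n_ctx`.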

A companion tts-cpp PR (tetherto/qvac-ext-lib-whisper.cpp#43) removes the GGUF load-time host-staging spikes (+0.5–1 GB transient) on the same ticket; the two land independently.

Testing

  • gtest: new engineOptionsForTests() hook exposes the config → EngineOptions mapping; covers the 2048 default, explicit forwarding, the 0 escape hatch, and negative rejection. tts_ggml_tests: 39/39 pass.
  • JS: new unit test covers ttsParams forwarding (set, 0, and unset/omitted → addon default applies). npm run test:unit: 62/62 pass; npm run lint clean.
  • README / index.d.ts / CHANGELOG updated.

🤖 Generated with Claude Code

…default

tts-cpp allocates the T3 KV cache up-front, in F32, at the GGUF's full
n_ctx.  The Turbo GGUF ships n_ctx=8196, which costs ~1.6 GB of KV
(n_embd(1024) x n_layer(24) x 8196 x 4 B x K+V) for synthesis that
rarely needs more than a few hundred tokens, and is the single largest
contributor to the ~3.1 GB peak footprint that gets the iOS QVAC SDK
chatterbox e2e tests jetsam-killed (the tts-chatterbox-* variants are
currently skipped on iOS Device Farm for exactly this).

tts-cpp already supports capping via EngineOptions::n_ctx (the engine
clamps the GGUF's n_ctx to it, never raises it), but the addon never
set it.  Now:

- ChatterboxModel::toEngineOptions passes n_ctx = nCtx ?? 2048
  (kDefaultNCtx).  2048 tokens keep ~80 s of generated audio per
  synthesize() call (T3 speech tokens run at 25 Hz) for ~400 MB of KV.
- New `nCtx` constructor option (JS -> JSAdapter -> ChatterboxConfig)
  for hosts that need a different cap; nCtx=0 is the documented escape
  hatch back to the GGUF's full context, negative values are rejected
  at construction (validateConfig).
- engineOptionsForTests() exposes the config -> EngineOptions mapping
  to the gtest suite; covers the 2048 default, explicit forwarding,
  the 0 escape hatch, and negative rejection.  JS unit test covers
  ttsParams forwarding (set, 0, and unset/omitted).

Chatterbox-only: Supertonic has no autoregressive KV cache, so there
is nothing to cap there.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ogad-tether requested review from a team as code owners on June 10, 2026, 17:13
GustavoA1604 previously approved these changes on Jun 10, 2026
@ogad-tether

Contributor Author

Re: the "quantize KV cache to q8 and raise ctx" suggestion — implemented engine-side in tetherto/qvac-ext-lib-whisper.cpp#43 (commit 3939db19): EngineOptions::kv_cache_type (f32 default / f16 / q8_0) on a token-major KV slab. Validated on real GGUFs: f32 is byte-identical to the old code, turbo greedy tokens are identical across all three dtypes on CPU and Metal, and Metal decode gets ~20-30% faster from the bandwidth saving. q8_0 stores the cache at ~27% of f32.

The addon knob can't land in this PR — it builds against the pinned vcpkg tts-cpp port, which won't expose the new field until #43 merges and the registry publishes a new port version (same flow as the supertonic EngineOptions additions in 0.2.1). Follow-up addon PR will: add kvCacheType, default chatterbox to q8_0, and raise the default nCtx 2048 → 4096 — that combination is ~204 MB of KV (vs ~400 MB for f32@2048 in this PR, vs 1.6 GB before) while doubling the usable context. This PR stays as-is and is safe to land first.
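The memory figures in this comment can be checked with a small sketch. It assumes ggml's q8_0 block layout (32 int8 values plus one fp16 scale, so 34 bytes per 32 values) and the Turbo dimensions quoted in the PR description; `kvBytes` is a hypothetical helper, not an API of tts-cpp.

```typescript
type KvCacheType = "f32" | "f16" | "q8_0";

// Bytes per stored value. The q8_0 entry assumes ggml's block layout:
// 32 int8 values + one fp16 scale = 34 bytes per 32 values.
const BYTES_PER_VALUE: Record<KvCacheType, number> = {
  f32: 4,
  f16: 2,
  q8_0: 34 / 32,
};

// Hypothetical helper: K and V caches for every layer and context position.
function kvBytes(type: KvCacheType, nCtx: number, nEmbd = 1024, nLayer = 24): number {
  return nEmbd * nLayer * nCtx * 2 * BYTES_PER_VALUE[type];
}

const MIB = 1024 * 1024;
const cases = [["f32", 8196], ["f32", 2048], ["q8_0", 4096]] as const;
for (const [type, nCtx] of cases) {
  console.log(`${type}@${nCtx}: ${(kvBytes(type, nCtx) / MIB).toFixed(0)} MiB`);
}
```

Under these assumptions q8_0 costs 34/128 ≈ 26.6% of f32 per value, and q8_0@4096 comes to 213,909,504 bytes: 204 MiB, or about 214 MB in decimal units. The ~204 MB and ~210 MB figures quoted in this thread are both consistent with that number.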

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

…x default 4096

Builds on the upstream kv_cache_type support
(qvac-ext-lib-whisper.cpp#43): the T3 KV cache is allocated up-front
at nCtx, and q8_0 stores it at ~27% of f32 — so the new defaults
(nCtx=4096 + kvCacheType="q8_0", ~210 MB of KV for ~160 s of audio
per synthesize() call) use HALF the memory of the previous
f32@2048 plan while doubling the usable context.

- New `kvCacheType` constructor option ('f32'|'f16'|'q8_0'), plumbed
  JS -> JSAdapter -> ChatterboxConfig -> EngineOptions.  Unknown
  values are rejected at construction (tts-cpp's own fallback would
  silently revert to f32 and change the memory profile the caller
  asked for).  kvCacheType:"f32" restores bit-exact pre-quantisation
  behaviour.
- nCtx default 2048 -> 4096 (cheaper than the old default AND longer,
  per the review suggestion to raise ctx alongside q8 KV).
- vcpkg tts-cpp pin -> 2026-06-12.  This pin is Android-safe: the
  revision removes the last direct ggml_backend_is_cpu /
  ggml_get_type_traits_cpu references from tts-cpp (the
  unresolvable-UND dlopen crash behind the 0.2.2 revert), routing
  them through the backend registry + ggml_quantize_chunk (ggml-base).

Upstream validation on real GGUFs (see tetherto#43): Turbo greedy token
sequences byte-identical across f32/f16/q8_0 on CPU and Metal; MTL
CFG can flip a near-tie argmax (same class of variation as a seed
change; whisper transcribes the q8_0 output to the exact input
text); Metal decode 20-30% faster from the KV bandwidth saving.

Tests: gtest covers the q8_0 default, explicit forwarding, the f32
escape hatch, and unknown-value rejection (42/42 against tts-cpp
2026-06-12); JS unit suite 63/63; lint clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ogad-tether

Contributor Author

Bumped this PR with the full KV-quantisation work (commit 70ea6cd28), per the review suggestion:

  • kvCacheType option (f32|f16|q8_0), default q8_0 — ~27% of f32's KV memory; unknown values rejected at construction; "f32" = bit-exact escape hatch.
  • nCtx default 2048 → 4096 — with q8_0 that's ~210 MB of KV for ~160 s of audio per call: less memory than the previous f32@2048 plan and double the context.
  • vcpkg tts-cpp pin → 2026-06-12.
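The "reject unknown values at construction" rule for kvCacheType can be sketched like this. It is a minimal illustration under assumed names (`parseKvCacheType`, `DEFAULT_KV_CACHE_TYPE`), not the addon's validateConfig implementation.

```typescript
const KV_CACHE_TYPES = ["f32", "f16", "q8_0"] as const;
type KvCacheType = (typeof KV_CACHE_TYPES)[number];

const DEFAULT_KV_CACHE_TYPE: KvCacheType = "q8_0"; // default described above

// Hypothetical validator: fail loudly instead of silently falling back to
// f32, which would change the memory profile the caller asked for.
function parseKvCacheType(value?: string): KvCacheType {
  if (value === undefined) {
    return DEFAULT_KV_CACHE_TYPE;
  }
  const match = KV_CACHE_TYPES.find((t) => t === value);
  if (match === undefined) {
    throw new TypeError(
      `kvCacheType must be one of ${KV_CACHE_TYPES.join(", ")}; got "${value}"`,
    );
  }
  return match;
}

console.log(parseKvCacheType());      // default applies
console.log(parseKvCacheType("f32")); // bit-exact escape hatch
```

Validating before the value reaches the engine keeps a typo such as "q8" from quietly quadrupling the KV allocation.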

Re: the Android landmine that forced the 0.2.2 revert — defused upstream rather than worked around: tts-cpp PR #43 now removes the last direct ggml_backend_is_cpu / ggml_get_type_traits_cpu references (backend registry + ggml_quantize_chunk instead); nm -u libtts-cpp.a is clean of both symbols, so the Android GGML_BACKEND_DL=ON addon link can't end up with the unresolvable UND symbols again.

Validated locally against the new tts-cpp via an overlay port: addon gtests 42/42 (incl. new kvCacheType coverage), JS unit suite 63/63, lint clean.

Merge order / CI note:

  1. tetherto/qvac-ext-lib-whisper.cpp#43 (tts-cpp): "QVAC-19557 tts-cpp: stream chatterbox GGUF tensor data instead of staging the full file".
  2. qvac-registry-vcpkg#188: "tts-cpp: publish 2026-06-12 — QVAC-19557 chatterbox memory (PR #43) + Android-safe symbols" (publishes tts-cpp@2026-06-12; draft until step 1 merges).
  3. This PR, which needs one final commit bumping the registry baseline in vcpkg-configuration.json to the registry commit from step 2 (until then the C++ CI here can't resolve tts-cpp>=2026-06-12). I'll push that bump as soon as qvac-registry-vcpkg#188 lands.

Temporary wiring so CI can resolve tts-cpp>=2026-06-12 before
qvac-registry-vcpkg#188 merges: point the default registry at the
tetherto#188 branch via the vcpkg-configuration 'reference' field and move
the baseline to its tip.  Verified locally: vcpkg resolves
tts-cpp@2026-06-12 + ggml-speech@2026-06-04 from the remote branch
and the addon configures cleanly.

Once tetherto#188 lands on the registry main, drop 'reference' and move
'baseline' to the merged commit (tracked in the PR checklist).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GustavoA1604 previously approved these changes on Jun 12, 2026
Zbig9000 previously approved these changes on Jun 12, 2026
…2 guard)

Re-points the temporary tts-cpp-2026-06-12 registry-branch baseline to
the updated port (qvac-registry-vcpkg#188 @ 4ace796) so CI builds
against tts-cpp c8620cf9 — which forces quantized KV to f32 on Vulkan
(NV_coopmat2 FA fault) while keeping the q8_0 chatterbox default on
Metal/CPU.  Resolves the linux-x64 GPU-integration SIGSEGV.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
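The backend guard described in this commit could look roughly like the sketch below. It is purely illustrative: the function and type names are hypothetical, and it assumes q8_0 is the only type treated as "quantized" (f16 passes through unchanged), which the commit message does not spell out.

```typescript
type KvCacheType = "f32" | "f16" | "q8_0";
type Backend = "cpu" | "metal" | "vulkan";

// Hypothetical guard: on Vulkan, a quantized KV request is downgraded to f32;
// every other backend keeps what the caller (or the default) asked for.
// Assumption: only q8_0 counts as quantized here.
function effectiveKvCacheType(requested: KvCacheType, backend: Backend): KvCacheType {
  const isQuantized = requested === "q8_0";
  return backend === "vulkan" && isQuantized ? "f32" : requested;
}

console.log(effectiveKvCacheType("q8_0", "vulkan")); // forced to f32
console.log(effectiveKvCacheType("q8_0", "metal"));  // q8_0 default kept
```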
ogad-tether dismissed stale reviews from Zbig9000 and GustavoA1604 via 70a7502 on June 15, 2026, 09:51
