Qvac 17489 io binding for kv cache chaining#1686

Merged
Zbig9000 merged 5 commits into tetherto:main from Zbig9000:QVAC-17489-IO-binding-for-KV-cache-chaining
Apr 23, 2026

@Zbig9000 Zbig9000 commented Apr 21, 2026

Contributor

What problem does this PR solve?

  • Chatterbox's autoregressive LM loop copied every present.* tensor back into its matching past_key_values.* input on every step. Each copy is O(past_len × heads × head_dim), so the per-step cost grew linearly with sequence length and the total cost of an N-step generation was quadratic.
  • That per-step copy was a measurable share of LM generation RTF, particularly on the English (non-CFG) path where LM dominates total time.
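The quadratic total follows from summing the per-step copy size. As a rough sketch, assume past_len at step t is about t (ignoring any prompt prefix), and write H for heads, d for head_dim, and L for the number of K/V tensors copied per step (L is a symbol introduced here for illustration, not taken from the PR):

$$
\sum_{t=1}^{N} L \cdot t \cdot H \cdot d \;=\; L\,H\,d\,\frac{N(N+1)}{2} \;=\; O\!\left(N^{2} H d L\right)
$$

Under these assumptions, removing the copy removes this user-side term only; whatever the model does internally to produce present.* is unchanged.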

How does it solve it?

  • Add setOutputToInputChain / clearChainedInputs / isInputChained to IOnnxInferSession. After each run() the session std::moves output Ort::Values directly into the matching input slots — no copy, no intermediate user-space vector.
  • OnnxInferSession::initInputTensors() preserves chained slots across re-init so the moved tensors survive.
  • ChatterboxEngine::enableKvCacheChaining() builds the {present.i → past_key_values.i} mapping from session I/O names (inputs [keyValueOffset_..] map 1:1 onto outputs [1..], since outputs[0] is logits). Wired into both generateSpeechTokens (non-CFG) and generateSpeechTokensWithCfg (CFG) paths. writeKvToTensors and cachePastKeyValues now skip chained inputs.
  • Add ChatterboxConfig.kvCacheChaining (default true) wired through TTSModel, AddonJs, and index.js so the optimization can be toggled for A/B benchmarking or an emergency disable.
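The chaining step described above can be illustrated with a small mock. This is a minimal sketch, not the PR's implementation: `MockTensor` is a hypothetical move-only stand-in for `Ort::Value`, `MockChainedSession` and `moveChainedOutputs` are invented for illustration, and only the names `setOutputToInputChain`, `clearChainedInputs`, and `isInputChained` come from the description (their signatures here are assumed).

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for Ort::Value: move-only, owns its buffer.
struct MockTensor {
  std::vector<float> data;

  MockTensor() = default;
  explicit MockTensor(std::vector<float> d) : data(std::move(d)) {}
  MockTensor(const MockTensor&) = delete;
  MockTensor& operator=(const MockTensor&) = delete;
  MockTensor(MockTensor&&) = default;
  MockTensor& operator=(MockTensor&&) = default;
};

// Hypothetical session. Only the three chain-method names come from the PR
// description; their signatures and everything else here are assumptions.
class MockChainedSession {
 public:
  std::unordered_map<std::string, MockTensor> inputs;
  std::unordered_map<std::string, MockTensor> outputs;

  // Register {output name -> input name} pairs to chain after each run.
  void setOutputToInputChain(
      std::unordered_map<std::string, std::string> outputToInput) {
    chain_ = std::move(outputToInput);
  }

  void clearChainedInputs() { chain_.clear(); }

  bool isInputChained(const std::string& inputName) const {
    for (const auto& pair : chain_) {
      if (pair.second == inputName) return true;
    }
    return false;
  }

  // Stand-in for the step the real session performs after run():
  // move each chained output into its input slot instead of copying it.
  void moveChainedOutputs() {
    for (const auto& pair : chain_) {
      auto it = outputs.find(pair.first);
      if (it == outputs.end()) continue;  // output not produced this step
      inputs[pair.second] = std::move(it->second);
    }
  }

 private:
  std::unordered_map<std::string, std::string> chain_;
};
```

Because `MockTensor` deletes its copy operations, any accidental copy in the chaining path would fail to compile, which is one way to keep the "no copy" property honest in a sketch like this.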

How was it tested?

  • Full unit suite (qvac-lib-inference-tts-unit-test, integration filters excluded) green: 214 passing / 0 failing, with 2 unrelated LavaSR benchmark tests skipped.
  • New coverage:
    • 5 KvCacheChainingTest cases: English offset=3 mapping, multilingual offset=2 mapping, truncated-outputs edge case, writeKvToTensors skip, cachePastKeyValues skip.
    • 3 OnnxInferSessionMockTest cases for the new mock methods.
  • A/B benchmark on device (Linux, 4 cores, q4 models, jfk.wav reference, kvCacheChaining toggled):
    • English (non-CFG): mean RTF 1.245 → 1.102 (−11.5%), totalTime −12.1% over 3 runs.
    • Multilingual ES (CFG): mean RTF 4.145 → 3.987 (−3.8%), totalTime −3.8% over 2 runs (a smaller gain because the conditional decoder, not the LM, dominates multilingual run time).
  • Listened to audio output with chaining ON on both paths — no artifacts vs the pre-change baseline.
  • JS lint (standard) clean; C++ lints clean.
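The offset mappings exercised by the KvCacheChainingTest cases could look roughly like the helper below. It is a sketch under stated assumptions: `buildKvChainMap` is a hypothetical name, the input names in the usage note are illustrative, and the handling of truncated outputs (map only the pairs that exist) is a guess at reasonable behavior rather than something the PR description confirms. What does come from the description is that inputs from `keyValueOffset` onward pair 1:1 with outputs from index 1 onward, because outputs[0] is logits.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: pairs outputNames[1 + i] with
// inputNames[keyValueOffset + i]. outputNames[0] is assumed to be logits.
std::unordered_map<std::string, std::string> buildKvChainMap(
    const std::vector<std::string>& inputNames,
    const std::vector<std::string>& outputNames,
    std::size_t keyValueOffset) {
  std::unordered_map<std::string, std::string> outputToInput;

  // Nothing to chain if there are no KV inputs or no outputs past logits.
  if (inputNames.size() <= keyValueOffset || outputNames.size() <= 1) {
    return outputToInput;
  }

  const std::size_t kvInputs = inputNames.size() - keyValueOffset;
  const std::size_t kvOutputs = outputNames.size() - 1;

  // Assumption: when outputs are truncated, map only the pairs that exist.
  const std::size_t pairs = std::min(kvInputs, kvOutputs);

  for (std::size_t i = 0; i < pairs; ++i) {
    outputToInput[outputNames[1 + i]] = inputNames[keyValueOffset + i];
  }
  return outputToInput;
}
```

With three leading non-KV inputs (the English case, offset 3) the first KV input sits at index 3; with two (the multilingual case, offset 2) it sits at index 2. In both cases the first `present.*` output is at index 1.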

API Changes
New non-breaking kvCacheChaining option on the ONNXTTS constructor (default true, i.e. optimization on).

const model = new ONNXTTS({
  engine: 'chatterbox',
  files: { /* ... */ },
  referenceAudio,
  config: { language: 'en' },
  kvCacheChaining: false // opt-out, e.g. for benchmarking or emergency disable
})

@Zbig9000 Zbig9000 requested review from a team as code owners April 21, 2026 11:17
mario-rei previously approved these changes Apr 23, 2026

@GustavoA1604 GustavoA1604 left a comment

Contributor

No need to add kvCacheChaining as a JS option; we can just keep it always true. We can compare against the previous version by running an old version of the addon.

@GustavoA1604

Contributor

/review

@github-actions

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*
