QVAC-17236 [Chatterbox] Investigate possibilities of reducing RTF #1674

Merged
GustavoA1604 merged 7 commits into tetherto:main from Zbig9000:QVAC-17236-Investigate-possibilities-of-reducing-RTF on Apr 22, 2026

@Zbig9000

Previous Session Summary: Chatterbox RTF Reduction
Goal: Reduce the Real-Time Factor (RTF) for the Chatterbox TTS model in packages/qvac-lib-infer-onnx-tts.

Bottleneck Analysis
The inference pipeline was profiled and broken down:

| Phase | English (q4) | Multilingual |
|---|---|---|
| Speech Encoder | ~2.7s | ~2.7s |
| LM Generation | ~25s (62%) | ~23s (17%) |
| Conditional Decoder | ~16s (37%) | ~109s (82%) |

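
A per-phase breakdown like the one above can be collected with a small RAII timer built on `std::chrono::steady_clock`. The following is only a sketch of that idea: `PhaseTimings`, `ScopedPhase`, and `printBreakdown` are hypothetical names chosen for illustration and are not taken from the PR's code.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Collected (phase name, seconds) pairs for one synthesis call.
// All names in this sketch are hypothetical, not the PR's identifiers.
struct PhaseTimings {
    std::vector<std::pair<std::string, double>> seconds;

    // Sum of all recorded phase durations, in seconds.
    double total() const {
        double sum = 0.0;
        for (const auto& entry : seconds) sum += entry.second;
        return sum;
    }
};

// RAII timer: records the wall-clock time spent in the enclosing scope.
class ScopedPhase {
    using Clock = std::chrono::steady_clock;

public:
    ScopedPhase(PhaseTimings& out, std::string name)
        : out_(out), name_(std::move(name)), start_(Clock::now()) {}

    ~ScopedPhase() {
        const std::chrono::duration<double> elapsed = Clock::now() - start_;
        out_.seconds.emplace_back(std::move(name_), elapsed.count());
    }

    ScopedPhase(const ScopedPhase&) = delete;
    ScopedPhase& operator=(const ScopedPhase&) = delete;

private:
    PhaseTimings& out_;
    std::string name_;
    Clock::time_point start_;
};

// Prints each phase with its duration and its share of the total.
void printBreakdown(const PhaseTimings& timings) {
    const double total = timings.total();
    for (const auto& entry : timings.seconds) {
        const double share = total > 0.0 ? 100.0 * entry.second / total : 0.0;
        std::printf("%-20s %8.3fs (%.0f%%)\n", entry.first.c_str(), entry.second, share);
    }
}
```

Wrapping each phase in its own block, for example `{ ScopedPhase p(timings, "lm_generation"); /* run the LM */ }`, appends one entry per phase; `steady_clock` is used because it is monotonic and unaffected by system clock adjustments.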
Optimizations Implemented

  1. Speech Encoder Output Caching (HIGH IMPACT)
    • Added SpeechEncoderCache struct to store audio features, prompt tokens, speaker embeddings, and speaker features
    • On first synthesize() call, the speech encoder runs and results are cached
    • Subsequent calls (e.g., in runStream() multi-chunk mode) skip the encoder entirely
    • Saves ~2.7s per subsequent call
  2. Optimized Vector Operations
    • Replaced expensive insert(begin, ...) prepends in prepareCfgEmbeddings with reserve+append pattern using std::move
    • Replaced wav.erase(begin, ...) in trimPromptFromWaveform with std::move + resize
  3. Configurable Thread Count for ONNX Sessions
    • Added numThreads config parameter flowing from JS through to OnnxInferSession
    • Default remains 1 thread for backward compatibility; users can set higher values
    • Benchmark result: 25.1% faster with 4 threads (RTF 20.92 -> 15.67 for English)
  4. Per-Phase Timing Instrumentation
    • Added std::chrono timing around speech encoder, LM generation, and conditional decoder phases
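
The two vector changes in item 2 can be illustrated in isolation. This is a minimal sketch under assumptions: `prependByAppend` and `dropFront` are hypothetical helpers, not the PR's `prepareCfgEmbeddings` or `trimPromptFromWaveform`, and the element types used in the PR are not stated in the description.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Builds prefix + data in a fresh buffer with a single reserve(), instead of
// calling data.insert(data.begin(), ...), which shifts every existing element
// and may reallocate on top of that.
template <typename T>
std::vector<T> prependByAppend(std::vector<T> prefix, std::vector<T> data) {
    std::vector<T> out;
    out.reserve(prefix.size() + data.size());
    out.insert(out.end(),
               std::make_move_iterator(prefix.begin()),
               std::make_move_iterator(prefix.end()));
    out.insert(out.end(),
               std::make_move_iterator(data.begin()),
               std::make_move_iterator(data.end()));
    return out;
}

// Removes the first `count` elements by moving the tail to the front and
// shrinking the vector, the std::move + resize pattern named in item 2.
// Requires T to be default-constructible because of resize().
template <typename T>
void dropFront(std::vector<T>& v, std::size_t count) {
    if (count >= v.size()) {
        v.clear();
        return;
    }
    // The destination starts before the source range, so a forward
    // std::move over the overlapping ranges is well defined.
    std::move(v.begin() + static_cast<std::ptrdiff_t>(count), v.end(), v.begin());
    v.resize(v.size() - count);
}
```

For trivially copyable elements such as `float`, moving is the same as copying, so any gain comes from the single allocation and from not shifting existing elements; the actual effect should be confirmed with the per-phase timings.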

Test Results
    • All 160 C++ unit tests pass (156 original + 4 new SpeechEncoderCacheTest tests)
    • Benchmark scripts confirmed caching and threading improvements
    • Integration test failure was pre-existing (model files not at expected path)
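
The caching behaviour from item 1, which the new SpeechEncoderCacheTest tests cover, can be sketched as follows. Only the struct name `SpeechEncoderCache` and the four kinds of stored data come from the description; the field types, the `CachedSpeechEncoder` wrapper, and the `invalidate()` method are assumptions made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Field names and types are assumed; the PR description only lists what is stored.
struct SpeechEncoderCache {
    std::vector<float> audioFeatures;
    std::vector<std::int64_t> promptTokens;
    std::vector<float> speakerEmbeddings;
    std::vector<float> speakerFeatures;
};

// Hypothetical wrapper: runs the (expensive) encoder once and reuses its outputs.
class CachedSpeechEncoder {
public:
    using Encoder = std::function<SpeechEncoderCache()>;

    explicit CachedSpeechEncoder(Encoder encoder)
        : encoder_(std::move(encoder)), valid_(false) {}

    // Returns the encoder outputs, invoking the encoder only when nothing is cached.
    const SpeechEncoderCache& outputs() {
        if (!valid_) {
            cache_ = encoder_();
            valid_ = true;
        }
        return cache_;
    }

    // Drops the cached outputs, e.g. if the reference audio changes.
    // Assumption: the description does not say how invalidation is handled.
    void invalidate() { valid_ = false; }

private:
    Encoder encoder_;
    SpeechEncoderCache cache_;
    bool valid_;
};
```

In a multi-chunk flow, every chunk after the first would call `outputs()` and get the stored result back without re-running the encoder, which is where the ~2.7s saving per subsequent call would come from.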

Remaining Opportunities Identified (not yet implemented)
- KV cache optimization: Avoiding redundant tensor copies between ORT and CPU vectors (major effort)
- ORT IO binding: Directly chaining output tensors as inputs to next step
- Conditional decoder: Dominates multilingual RTF (82%) but is a single large model — limited optimization without model-level changes

Pls take a look: @freddy311082, @GustavoA1604

@Zbig9000 requested review from a team as code owners on April 20, 2026 11:23
GustavoA1604 previously approved these changes Apr 21, 2026
ogad-tether previously approved these changes Apr 22, 2026
mario-rei previously approved these changes Apr 22, 2026
@GustavoA1604


/review

@github-actions


Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*
