[Feat][FishSpeech] Cache DAC-encoded ref audio for voice cloning #2609
@Sy0307 @JuanPZuluaga -- This PR reuses the VoiceEmbeddingCache from Qwen3-TTS for Fish Speech voice cloning.

Benchmarked on H20 with a 3s ref audio: the cache works (11/12 cache hits confirmed via logs), but the TTFP improvement is only ~36ms because the ref audio is tiny. It would be good to test with a real 10-30s voice sample for more representative numbers.

Would appreciate your review, especially on whether the cache key/value pattern aligns with the Qwen3-TTS design intent.
Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the expensive DAC codec encoding is performed once and cached; subsequent requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice → ref_audio + ref_text), pass voice_name/voice_created_at to model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding, store on miss, reuse on hit, clean up temp files on cache hit
- Add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Reuses fish_bench_utils from PR vllm-project#2515 to compare:
A) Inline ref_audio (no cache, DAC encode every request)
B) Uploaded voice (cache hits after 1st request)

Reports TTFP/E2E/RTF comparison table.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
…m-project#2609) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
…m-project#2609) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Summary
- Reuse `VoiceEmbeddingCache` (from Qwen3-TTS, PR #2108 "[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager" by @JuanPZuluaga) for Fish Speech S2 Pro voice cloning
- Cache the DAC-encoded ref audio (`ref_codes_fq`)
- `voice` parameter for Fish Speech: auto-resolve uploaded voices -> `ref_audio` + `ref_text`

Closes #2561
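The check-before-encode flow summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SimpleVoiceCache`, its `get`/`put` methods, and `get_ref_codes` are hypothetical stand-ins, and keying on `(voice_name, created_at)` is an assumption based on the two fields the PR passes to the model. The real `VoiceEmbeddingCache` interface may differ.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for the DAC code tensor (ref_codes_fq); the real value is a tensor.
Codes = List[List[int]]


class SimpleVoiceCache:
    """Hypothetical in-memory cache; NOT the real VoiceEmbeddingCache API."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, int], Codes] = {}

    def get(self, voice_name: Optional[str], created_at: int) -> Optional[Codes]:
        # Inline ref_audio has no voice name, and created_at == 0 disables caching.
        if not voice_name or created_at == 0:
            return None
        return self._store.get((voice_name, created_at))

    def put(self, voice_name: Optional[str], created_at: int, codes: Codes) -> None:
        if not voice_name or created_at == 0:
            return
        self._store[(voice_name, created_at)] = codes


def get_ref_codes(
    cache: SimpleVoiceCache,
    encode: Callable[[bytes], Codes],
    ref_audio: bytes,
    voice_name: Optional[str] = None,
    created_at: int = 0,
) -> Codes:
    """Return cached codes on a hit; otherwise encode once and store the result."""
    cached = cache.get(voice_name, created_at)
    if cached is not None:
        return cached
    codes = encode(ref_audio)  # the expensive DAC encode
    cache.put(voice_name, created_at, codes)
    return codes
```

Under this sketch, 12 requests with the same uploaded voice trigger one encode, while 12 inline `ref_audio` requests trigger 12, for 13 encodes in total. Including `created_at` in the key means a voice re-uploaded under the same name gets a fresh entry instead of stale codes.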
Changes
`serving_speech.py`
- `_validate_fish_tts_request`: auto-resolve `request.voice` -> uploaded speaker audio + `ref_text`
- `_build_fish_speech_prompt`: pass `voice_name` + `voice_created_at` in `additional_information`

`fish_speech_slow_ar.py`
- `VoiceEmbeddingCache` instance (same pattern as Qwen3-TTS talker)
- `_build_structured_voice_clone_prefill_embeds`: cache check before DAC encode, store on miss
- `_apply_codebook_embeddings`: shares embedding logic between cache-hit and cache-miss paths
- Clean up the temp `.npy` file on cache hit to prevent leaks

`tests/model_executor/models/test_fish_speech_voice_cache.py`
- `created_at=0` disables cache

Benchmark (H20, Fish Speech S2 Pro, 3s ref audio)
Cache is functional -- confirmed via server logs (13 DAC encodes instead of 24):
The improvement is small here because the test ref audio is only 3s / 65 DAC frames. In production with 10-30s reference audio (hundreds of DAC frames), the DAC encoding cost is proportionally higher and the cache saves more.
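As a rough check on the "hundreds of DAC frames" estimate, the sketch below extrapolates from the one measured point (3s -> 65 frames). It assumes the frame count grows linearly with audio duration, which is not confirmed in this PR; `estimate_dac_frames` is a hypothetical helper.

```python
def estimate_dac_frames(
    duration_s: float,
    ref_duration_s: float = 3.0,  # measured reference clip length from the benchmark
    ref_frames: int = 65,         # DAC frames reported for that clip
) -> int:
    """Linear extrapolation of DAC frame count (assumes a constant frame rate)."""
    return round(ref_frames * duration_s / ref_duration_s)


# Frame counts for the reference lengths mentioned above.
for seconds in (3, 10, 30):
    print(seconds, "s ->", estimate_dac_frames(seconds), "frames")
```

Under that assumption, a 10s clip is about 217 frames and a 30s clip about 650.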
The bigger win is reduced request size -- uploaded voice requests don't need to send the full base64 audio blob every time (e.g. a 30s WAV = ~1.4MB per request saved).
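The per-request saving can be estimated with a short calculation. The sample format below (24 kHz, 16-bit, mono PCM with a 44-byte WAV header) is an assumption chosen for illustration; the PR does not state the format behind the ~1.4MB figure. The helper names are hypothetical.

```python
import math

WAV_HEADER_BYTES = 44  # canonical PCM WAV header size


def pcm_wav_bytes(
    duration_s: float,
    sample_rate: int = 24_000,  # assumed; not stated in the PR
    sample_width: int = 2,      # bytes per sample (16-bit), assumed
    channels: int = 1,          # mono, assumed
) -> int:
    """Size of an uncompressed PCM WAV file of the given duration."""
    return WAV_HEADER_BYTES + int(duration_s * sample_rate) * sample_width * channels


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of data."""
    return 4 * math.ceil(n_bytes / 3)


raw = pcm_wav_bytes(30)
print(f"raw WAV: {raw / 1e6:.2f} MB, base64: {base64_len(raw) / 1e6:.2f} MB")
```

With these assumed parameters a 30s clip is about 1.44 MB raw, in line with the figure above, and base64 encoding adds roughly a third on top of that in the request body.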
Test plan
- Upload a voice via `/v1/audio/voices` with `ref_text`, then use it with Fish Speech
- Confirm cache behavior via server logs (`Encoded reference audio codes`)
- Inline `ref_audio` (no voice name) still works unchanged
- `pytest tests/model_executor/models/test_fish_speech_voice_cache.py`

cc @Sy0307 @JuanPZuluaga