[Feat][FishSpeech] Cache DAC-encoded ref audio for voice cloning by linyueqian · Pull Request #2609 · vllm-project/vllm-omni

linyueqian · 2026-04-08T20:02:41Z

Summary

Reuse VoiceEmbeddingCache (from Qwen3-TTS, PR #2108 "[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager" by @JuanPZuluaga) for Fish Speech S2 Pro voice cloning
On the first request with an uploaded voice, the DAC codec encodes the reference audio and the result (ref_codes_fq) is cached
On subsequent requests with the same voice, skip DAC encoding entirely and use the cached codes
Support the voice parameter for Fish Speech: auto-resolve uploaded voices -> ref_audio + ref_text
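
The encode-once flow described in this summary can be sketched roughly as follows. This is only an illustration: `ToyVoiceCache`, `fake_dac_encode`, and `ref_codes_for` are hypothetical names, not the real `VoiceEmbeddingCache` API, and comparing the upload timestamp to detect a re-uploaded voice is an assumption based on the `voice_created_at` field mentioned under Changes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyVoiceCache:
    """Hypothetical stand-in for VoiceEmbeddingCache (not the real class)."""

    _entries: Dict[str, Tuple[float, List[int]]] = field(default_factory=dict)

    def get(self, voice_name: str, created_at: float) -> Optional[List[int]]:
        entry = self._entries.get(voice_name)
        if entry is None:
            return None
        cached_at, codes = entry
        # Assumption: a different upload timestamp means the voice was
        # re-uploaded, so the cached codes are treated as stale.
        if cached_at != created_at:
            return None
        return codes

    def put(self, voice_name: str, created_at: float, codes: List[int]) -> None:
        self._entries[voice_name] = (created_at, codes)


def fake_dac_encode(audio: bytes) -> List[int]:
    """Placeholder for the DAC codec: one fake code per input byte."""
    return list(audio)


def ref_codes_for(
    cache: ToyVoiceCache,
    voice_name: str,
    created_at: float,
    audio: bytes,
    stats: Dict[str, int],
) -> List[int]:
    codes = cache.get(voice_name, created_at)
    if codes is None:
        # Cache miss: encode once and remember the result.
        stats["encodes"] += 1
        codes = fake_dac_encode(audio)
        cache.put(voice_name, created_at, codes)
    return codes


if __name__ == "__main__":
    cache, stats = ToyVoiceCache(), {"encodes": 0}
    for _ in range(12):
        ref_codes_for(cache, "alice", 100.0, b"\x01\x02\x03", stats)
    print(stats["encodes"])
```

Running twelve requests for the same voice through this sketch triggers a single encode, which is the pattern the benchmark below reports for the uploaded-voice run.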

Closes #2561

Changes

`serving_speech.py`

_validate_fish_tts_request: auto-resolve request.voice -> uploaded speaker audio + ref_text
_build_fish_speech_prompt: pass voice_name + voice_created_at in additional_information
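
A minimal sketch of the voice resolution step, under stated assumptions: `UploadedVoice`, `SpeechRequest`, and `resolve_voice` are hypothetical and much reduced compared with the real request types in `serving_speech.py`. How the PR handles a request that sets both `voice` and `ref_audio`, or a voice name that was never uploaded, is not stated; this sketch simply leaves such requests unchanged.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UploadedVoice:
    """Hypothetical record for a voice uploaded via /v1/audio/voices."""

    audio_path: str
    ref_text: str
    created_at: float


@dataclass
class SpeechRequest:
    """Hypothetical, reduced request model."""

    input: str
    voice: Optional[str] = None
    ref_audio: Optional[str] = None
    ref_text: Optional[str] = None


def resolve_voice(
    request: SpeechRequest, uploaded: Dict[str, UploadedVoice]
) -> Dict[str, object]:
    """Fill ref_audio/ref_text from an uploaded voice and return the extra
    fields that would travel in additional_information."""
    info: Dict[str, object] = {}
    # Assumption: inline ref_audio takes precedence and stays uncached.
    if request.ref_audio is not None or request.voice is None:
        return info
    voice = uploaded.get(request.voice)
    if voice is None:
        # Assumption: unknown voice names are left for other code to handle.
        return info
    request.ref_audio = voice.audio_path
    request.ref_text = voice.ref_text
    info["voice_name"] = request.voice
    info["voice_created_at"] = voice.created_at
    return info
```

Passing the name and upload timestamp alongside the resolved audio is what lets the model side decide whether a cached encoding can be reused.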

`fish_speech_slow_ar.py`

Add VoiceEmbeddingCache instance (same pattern as Qwen3-TTS talker)
_build_structured_voice_clone_prefill_embeds: cache check before DAC encode, store on miss
Extract _apply_codebook_embeddings to share embedding logic between cache-hit and cache-miss paths
Clean up temp .npy file on cache hit to prevent leaks
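
The cache check and temp-file cleanup listed above might look roughly like this. It is a sketch, not the code in `fish_speech_slow_ar.py`: `resolve_ref_codes` and the `encode` callback are hypothetical, a plain dict stands in for the cache, and the sketch leaves the temp file alone on a miss because the PR description does not say what the miss path does with it. The "no voice name" and "created_at == 0" bypasses follow the test list in this PR.

```python
import os
import tempfile
from typing import Callable, Dict, List, Optional, Tuple

CacheKey = Tuple[str, float]


def resolve_ref_codes(
    cache: Dict[CacheKey, List[int]],
    voice_name: Optional[str],
    created_at: float,
    npy_path: str,
    encode: Callable[[str], List[int]],
) -> List[int]:
    # No voice name, or created_at == 0, means "do not cache".
    cacheable = bool(voice_name) and created_at > 0
    key = (voice_name, created_at) if cacheable else None

    if key is not None and key in cache:
        # Cache hit: the temp file is never read, so delete it here;
        # otherwise one file would leak per request.
        if os.path.exists(npy_path):
            os.remove(npy_path)
        return cache[key]

    # Cache miss (or caching disabled): run the expensive encode.
    codes = encode(npy_path)
    if key is not None:
        cache[key] = codes
    return codes


if __name__ == "__main__":
    def fake_encode(path: str) -> List[int]:
        # Placeholder for "load the waveform and run the DAC encoder".
        with open(path, "rb") as f:
            return list(f.read())

    cache: Dict[CacheKey, List[int]] = {}
    for _ in range(2):
        tmp = tempfile.NamedTemporaryFile(suffix=".npy", delete=False)
        tmp.write(b"\x01\x02")
        tmp.close()
        resolve_ref_codes(cache, "alice", 100.0, tmp.name, fake_encode)
        print(os.path.exists(tmp.name))  # file survives the miss, not the hit
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```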

`tests/model_executor/models/test_fish_speech_voice_cache.py`

Cache miss -> store, cache hit -> reuse, no-voice-name -> no caching
Stale-cache protection, temp file cleanup, created_at=0 disables cache

Benchmark (H20, Fish Speech S2 Pro, 3s ref audio)

Cache is functional -- confirmed via server logs (13 DAC encodes across both runs instead of 24: 12 for the inline requests, 1 for the uploaded voice):

| Metric | Inline (no cache) | Uploaded (cached) | Delta |
| --- | --- | --- | --- |
| Mean TTFP | 1003 ms | 967 ms | -36 ms |
| DAC encodes | 12/12 | 1/12 | 11 cache hits |

The improvement is small here because the test ref audio is only 3s / 65 DAC frames. In production with 10-30s reference audio (hundreds of DAC frames), the DAC encoding cost is proportionally higher and the cache saves more.
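
As a rough check on the "hundreds of DAC frames" figure, the frame count can be extrapolated from the 3s / 65-frame sample. The sketch assumes the frame count scales linearly with audio duration (a constant codec frame rate), which the PR does not state explicitly; `estimated_frames` is a hypothetical helper, and it estimates frame counts only, not encode time.

```python
REF_SECONDS = 3.0  # reference audio length used in the benchmark
REF_FRAMES = 65    # DAC frames reported for that audio

# Assumption: constant frame rate, so frames grow linearly with duration.
FRAMES_PER_SECOND = REF_FRAMES / REF_SECONDS


def estimated_frames(seconds: float) -> int:
    """Estimate the number of DAC frames for a reference clip."""
    return round(seconds * FRAMES_PER_SECOND)


if __name__ == "__main__":
    for seconds in (3, 10, 30):
        print(seconds, estimated_frames(seconds))
```

Under that assumption a 10s clip is a little over 200 frames and a 30s clip is about 650, ten times the benchmark input.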

The bigger win is reduced request size -- uploaded voice requests don't need to send the full base64 audio blob every time (e.g. a 30s WAV = ~1.4MB per request saved).
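
The ~1.4MB figure can be reproduced with a short calculation. The audio format here is an assumption (24 kHz, mono, 16-bit PCM), chosen because it matches the quoted size; the PR does not specify the format. `wav_bytes` is a hypothetical helper built on the standard `wave` module.

```python
import base64
import io
import wave


def wav_bytes(
    seconds: float,
    sample_rate: int = 24_000,  # assumed, not stated in the PR
    channels: int = 1,          # assumed mono
    sample_width: int = 2,      # assumed 16-bit PCM
) -> bytes:
    """Build an in-memory WAV file of silence with the given duration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        n_frames = int(seconds * sample_rate)
        w.writeframes(b"\x00" * (n_frames * channels * sample_width))
    return buf.getvalue()


if __name__ == "__main__":
    raw = wav_bytes(30)
    encoded = base64.b64encode(raw)
    print(f"raw WAV: {len(raw) / 1e6:.2f} MB")
    print(f"base64:  {len(encoded) / 1e6:.2f} MB")
```

Note that base64 expands the payload by about a third, so under these assumptions the per-request saving on the wire is somewhat larger than the raw file size.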

Test plan

Upload a voice via /v1/audio/voices with ref_text, then use it with Fish Speech
Verify first request encodes DAC (server log: Encoded reference audio codes)
Verify subsequent requests skip DAC (no additional encode logs)
Verify inline ref_audio (no voice name) still works unchanged
Test with longer reference audio (10-30s) for larger TTFP improvement
Run pytest tests/model_executor/models/test_fish_speech_voice_cache.py
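
The two log-based checks in this plan could be automated with a small helper along these lines. Only the marker string `Encoded reference audio codes` comes from this PR; `count_dac_encodes` is hypothetical, and the sample log lines are made up for illustration and do not reflect the server's real log format.

```python
ENCODE_MARKER = "Encoded reference audio codes"  # log text quoted in the test plan


def count_dac_encodes(log_text: str) -> int:
    """Count log lines that report a DAC encode of reference audio."""
    return sum(1 for line in log_text.splitlines() if ENCODE_MARKER in line)


if __name__ == "__main__":
    # Made-up sample: one encode for the first request, none afterwards.
    sample_log = "\n".join(
        [
            "request 1 received",
            "Encoded reference audio codes",
            "request 2 received",
            "request 3 received",
        ]
    )
    print(count_dac_encodes(sample_log))
```

For an uploaded voice the expected count after N requests is 1; for inline `ref_audio` it is N.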

cc @Sy0307 @JuanPZuluaga


linyueqian · 2026-04-08T20:32:34Z

@Sy0307 @JuanPZuluaga -- This PR reuses the VoiceEmbeddingCache from PR #2108 to cache DAC-encoded reference audio codes for Fish Speech voice cloning, as suggested in #2561 and the PR #2515 comment.

Benchmarked on H20 with a 3s ref audio -- cache works (11/12 cache hits confirmed via logs), but the TTFP improvement is only ~36ms because the ref audio is tiny. Would be good to test with a real 10-30s voice sample for more representative numbers.

Would appreciate your review -- especially on whether the cache key/value pattern aligns with the Qwen3-TTS design intent.

Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the expensive DAC codec encoding is performed once and cached; subsequent requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice → ref_audio + ref_text), pass voice_name/voice_created_at to model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding, store on miss, reuse on hit, clean up temp files on cache hit
- Add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Reuses fish_bench_utils from PR vllm-project#2515 to compare:
A) Inline ref_audio (no cache, DAC encode every request)
B) Uploaded voice (cache hits after 1st request)

Reports TTFP/E2E/RTF comparison table.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

…m-project#2609) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

linyueqian requested a review from hsliuustc0106 as a code owner April 8, 2026 20:02

linyueqian force-pushed the feat/fish-speech-voice-cache branch from dd66408 to 364dbdf Compare April 8, 2026 20:17

linyueqian mentioned this pull request Apr 8, 2026

[Bugfix] Fix Fish Speech voice clone FileNotFoundError on multi-GPU #2606

Merged

3 tasks

JuanPZuluaga mentioned this pull request Apr 9, 2026

[TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning #2630

Open

10 tasks

linyueqian added the `ready` label (label to trigger buildkite CI) Apr 10, 2026

linyueqian added 5 commits April 9, 2026 21:18

style: fix ruff formatting

8f0f4f7

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

refactor: move Fish Speech voice cache test to model_executor/models/

060ee7f

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

fix(bench): use correct 'audio_sample' field name for voice upload

53d8e2f

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

linyueqian force-pushed the feat/fish-speech-voice-cache branch from 19fe103 to 53d8e2f Compare April 10, 2026 01:18

hsliuustc0106 approved these changes Apr 10, 2026


hsliuustc0106 merged commit 9423243 into vllm-project:main Apr 10, 2026
8 checks passed

Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026

[Feat][FishSpeech] Cache DAC-encoded ref audio for voice cloning (vll…

902869a

…m-project#2609) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026

[Feat][FishSpeech] Cache DAC-encoded ref audio for voice cloning (vll…

4bc1585

…m-project#2609) Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

hsliuustc0106 merged 5 commits into vllm-project:main from linyueqian:feat/fish-speech-voice-cache
