[Feat][FishSpeech] Cache DAC-encoded ref audio for voice cloning #2609

Merged
hsliuustc0106 merged 5 commits into vllm-project:main from linyueqian:feat/fish-speech-voice-cache on Apr 10, 2026

@linyueqian (Collaborator) commented Apr 8, 2026

Summary

  • Reuse VoiceEmbeddingCache (from Qwen3-TTS, PR #2108 "[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager" by @JuanPZuluaga) for Fish Speech S2 Pro voice cloning
  • On first request with an uploaded voice, DAC codec encodes the reference audio and caches the result (ref_codes_fq)
  • On subsequent requests with the same voice, skip DAC encoding entirely -- use cached codes
  • Support the voice parameter for Fish Speech: auto-resolve uploaded voices -> ref_audio + ref_text

Closes #2561

Changes

serving_speech.py

  • _validate_fish_tts_request: auto-resolve request.voice -> uploaded speaker audio + ref_text
  • _build_fish_speech_prompt: pass voice_name + voice_created_at in additional_information
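
The voice resolution described above could look roughly like the sketch below. It is illustrative only, not the code in serving_speech.py: the function name `resolve_uploaded_voice`, the dict-based request, the `uploaded_voices` registry, and its `audio_path` / `ref_text` / `created_at` fields are all hypothetical. The precedence given to an inline ref_audio over a voice name is also an assumption.

```python
def resolve_uploaded_voice(request, uploaded_voices):
    """Map request["voice"] to ref_audio + ref_text (hypothetical sketch).

    request:         dict with optional "voice", "ref_audio", "ref_text" keys
    uploaded_voices: dict of voice name -> {"audio_path", "ref_text", "created_at"}
    """
    resolved = dict(request)
    info = {}

    voice = request.get("voice")
    entry = uploaded_voices.get(voice) if voice else None

    # Assumption: an inline ref_audio wins over an uploaded voice name.
    if entry is not None and not request.get("ref_audio"):
        resolved["ref_audio"] = entry["audio_path"]
        resolved["ref_text"] = request.get("ref_text") or entry["ref_text"]
        # These two fields let the model side key its cache on the voice.
        info = {"voice_name": voice, "voice_created_at": entry["created_at"]}

    resolved["additional_information"] = info
    return resolved
```

With this shape, an inline request carries an empty additional_information, so the model side has no voice name and does no caching.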

fish_speech_slow_ar.py

  • Add VoiceEmbeddingCache instance (same pattern as Qwen3-TTS talker)
  • _build_structured_voice_clone_prefill_embeds: cache check before DAC encode, store on miss
  • Extract _apply_codebook_embeddings to share embedding logic between cache-hit and cache-miss paths
  • Clean up temp .npy file on cache hit to prevent leaks
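
A minimal sketch of the check-then-encode flow described above, under stated assumptions. `RefCodeCache` is a hypothetical stand-in and does not reproduce the real VoiceEmbeddingCache API; `resolve_ref_codes` and `encode_fn` are likewise invented names. The sketch assumes the cache is keyed on the voice name and validated against its creation timestamp, and that the temp file is the serialized reference audio passed to the model.

```python
import os


class RefCodeCache:
    """Hypothetical stand-in for VoiceEmbeddingCache (not the real API)."""

    def __init__(self):
        self._entries = {}  # voice_name -> (created_at, codes)

    def get(self, voice_name, created_at):
        # No voice name or created_at == 0 means caching is disabled.
        if not voice_name or not created_at:
            return None
        entry = self._entries.get(voice_name)
        # A different timestamp means the voice was re-uploaded: treat as stale.
        if entry is None or entry[0] != created_at:
            return None
        return entry[1]

    def put(self, voice_name, created_at, codes):
        if voice_name and created_at:
            self._entries[voice_name] = (created_at, codes)


def resolve_ref_codes(cache, encode_fn, ref_audio_path, voice_name=None, created_at=0):
    """Return reference codes, calling encode_fn only on a cache miss."""
    cached = cache.get(voice_name, created_at)
    if cached is not None:
        # Cache hit: the temp file is never read, so remove it to avoid leaks.
        if ref_audio_path and os.path.exists(ref_audio_path):
            os.remove(ref_audio_path)
        return cached

    codes = encode_fn(ref_audio_path)  # the expensive DAC encode
    cache.put(voice_name, created_at, codes)
    return codes
```

Both paths return the same codes object, which is what lets a shared helper such as _apply_codebook_embeddings run unchanged after either a hit or a miss.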

tests/model_executor/models/test_fish_speech_voice_cache.py

  • Cache miss -> store, cache hit -> reuse, no-voice-name -> no caching
  • Stale-cache protection, temp file cleanup, created_at=0 disables cache

Benchmark (H20, Fish Speech S2 Pro, 3s ref audio)

The cache works as intended, confirmed via server logs: 13 DAC encodes in total across both runs (12 inline + 1 uploaded) instead of 24:

| Metric | Inline (no cache) | Uploaded (cached) | Delta |
|---|---|---|---|
| Mean TTFP | 1003 ms | 967 ms | -36 ms |
| DAC encodes | 12/12 | 1/12 | 11 cache hits |

The improvement is small here because the test ref audio is only 3s / 65 DAC frames. In production with 10-30s reference audio (hundreds of DAC frames), the DAC encoding cost is proportionally higher and the cache saves more.

The bigger win is reduced request size -- uploaded voice requests don't need to send the full base64 audio blob every time (e.g. a 30s WAV = ~1.4MB per request saved).
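
As a rough check on the request-size figure, the sketch below computes the size of a PCM WAV file and of its base64 encoding. The audio parameters are assumptions, not taken from the PR: 24 kHz, mono, 16-bit samples, and a canonical 44-byte header, which is one combination that lands near 1.4MB for 30s of audio. The helper name `wav_payload_sizes` is made up for this example.

```python
WAV_HEADER_BYTES = 44  # canonical PCM WAV header size (assumed layout)


def wav_payload_sizes(seconds, sample_rate=24_000, channels=1, sample_width=2):
    """Return (raw_wav_bytes, base64_bytes) for uncompressed PCM audio.

    All defaults are illustrative assumptions, not values from the PR.
    """
    pcm = int(seconds * sample_rate) * channels * sample_width
    raw = WAV_HEADER_BYTES + pcm
    # Base64 emits 4 output characters per 3 input bytes, padded up.
    b64 = 4 * ((raw + 2) // 3)
    return raw, b64


raw, b64 = wav_payload_sizes(30)
print(f"raw WAV: {raw / 1e6:.2f} MB, base64: {b64 / 1e6:.2f} MB")
```

Under these assumptions the raw file is about 1.44 MB and the base64 form about 1.92 MB, so the per-request saving on the wire would be somewhat larger than the raw file size.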

Test plan

  • Upload a voice via /v1/audio/voices with ref_text, then use it with Fish Speech
  • Verify first request encodes DAC (server log: Encoded reference audio codes)
  • Verify subsequent requests skip DAC (no additional encode logs)
  • Verify inline ref_audio (no voice name) still works unchanged
  • Test with longer reference audio (10-30s) for larger TTFP improvement
  • Run pytest tests/model_executor/models/test_fish_speech_voice_cache.py

cc @Sy0307 @JuanPZuluaga

@linyueqian linyueqian force-pushed the feat/fish-speech-voice-cache branch from dd66408 to 364dbdf Compare April 8, 2026 20:17
@linyueqian (Collaborator, Author) commented

@Sy0307 @JuanPZuluaga -- This PR reuses the VoiceEmbeddingCache from PR #2108 to cache DAC-encoded reference audio codes for Fish Speech voice cloning, as suggested in #2561 and the PR #2515 comment.

Benchmarked on H20 with a 3s ref audio -- cache works (11/12 cache hits confirmed via logs), but the TTFP improvement is only ~36ms because the ref audio is tiny. Would be good to test with a real 10-30s voice sample for more representative numbers.

Would appreciate your review -- especially on whether the cache key/value pattern aligns with the Qwen3-TTS design intent.

Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for
Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the
expensive DAC codec encoding is performed once and cached; subsequent
requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice →
  ref_audio + ref_text), pass voice_name/voice_created_at to model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding,
  store on miss, reuse on hit, clean up temp files on cache hit
- Add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Reuses fish_bench_utils from PR vllm-project#2515 to compare:
  A) Inline ref_audio (no cache, DAC encode every request)
  B) Uploaded voice (cache hits after 1st request)

Reports TTFP/E2E/RTF comparison table.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian force-pushed the feat/fish-speech-voice-cache branch from 19fe103 to 53d8e2f Compare April 10, 2026 01:18
@hsliuustc0106 hsliuustc0106 merged commit 9423243 into vllm-project:main Apr 10, 2026
8 checks passed
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
…m-project#2609)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
…m-project#2609)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Feature]: Fish Speech S2 Pro: Is there a way to register a voice once, and use it multiple times for cloning?

2 participants