
[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager#2108

Merged
linyueqian merged 100 commits intovllm-project:mainfrom
JuanPZuluaga:feat/refactor-voice-cache-manager
Apr 3, 2026

Conversation

@JuanPZuluaga
Contributor

@JuanPZuluaga JuanPZuluaga commented Mar 23, 2026


Purpose

We add an in-memory cache for voice extraction artifacts in Qwen3-TTS Base voice clone requests. The system is simple enough to be reused by other models (a follow-up PR will adopt it in the other models).

When a user sends a voice clone request with ref_audio, the server runs two GPU ops: speaker embedding extraction (ECAPA-TDNN) and ref_audio encoding (SpeechTokenizer). These cost 60-250ms per request. If the same voice is used again, these results are identical, but today they're recomputed every time.

This PR caches the extraction results keyed by a SHA-256 hash of the audio content. On repeat requests with the same reference audio, the cached tensors are served from CPU memory instead of re-running GPU extraction. This gives a ~25% TTFP reduction on warm (repeated voice) requests.
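The caching scheme described above can be sketched as a small content-hash keyed LRU. The class and method names here are illustrative, not the PR's actual API:

```python
import hashlib
from collections import OrderedDict


class VoiceArtifactCache:
    """Illustrative in-memory LRU keyed by a SHA-256 of the audio bytes."""

    def __init__(self, max_entries: int = 64) -> None:
        self._store: "OrderedDict[str, dict]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def key_for(audio_bytes: bytes) -> str:
        # Identical reference audio always maps to the same cache key.
        return hashlib.sha256(audio_bytes).hexdigest()

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is not None:
            self._store.move_to_end(key)  # mark as most recently used
        return entry

    def put(self, key: str, artifacts: dict) -> None:
        self._store[key] = artifacts
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

On a hit, the cached tensors (speaker embedding, ref codes) would be served from CPU memory instead of re-running the GPU extraction.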

Files deleted

  • vllm_omni/entrypoints/openai/metadata_manager.py: disk persistence replaced by an in-memory dict (something more robust will be added later)
  • vllm_omni/model_executor/models/qwen3_tts/voice_cache_manager.py: replaced by a simpler voice_cache.py
     

Known Issue: torch.compile crash on startup (vllm 0.18.0 + torch 2.10)

When using enforce_eager: false (required for stage 0), the server crashes during startup with:

AttributeError: <function standalone_compile at ...> does not have the attribute 'FakeTensorMode'

Root cause: torch._inductor.__init__ defines a wrapper function standalone_compile that shadows the submodule of the same name. unittest.mock.patch() resolves via getattr and finds the function instead of the module.

Fix: Apply a small patch to vllm/compilation/compiler_interface.py (around line 376). This is fixed upstream in vllm-project/vllm#37158 but not included in vllm 0.18.0.

-            fake_mode_ctx: Any = patch(
-                "torch._inductor.standalone_compile.FakeTensorMode",
+            import sys
+            fake_mode_ctx: Any = patch.object(
+                sys.modules["torch._inductor.standalone_compile"],
+                "FakeTensorMode",
                 lambda *a, **kw: input_fake_mode,
             )

This was already merged to main, with the bug, in vllm-project/vllm#37858.

(We might need to pin to a version; please, someone test this as well.)
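The getattr shadowing described above can be reproduced without torch, using a throwaway package in place of torch._inductor (all names below are stand-ins):

```python
import sys
import tempfile
from pathlib import Path
from unittest.mock import patch

# Build a package whose __init__ imports the submodule first, then defines a
# same-named function that shadows it for getattr-based lookups -- the same
# layout as torch._inductor / torch._inductor.standalone_compile.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_inductor"
pkg.mkdir()
(pkg / "standalone_compile.py").write_text("FakeTensorMode = object\n")
(pkg / "__init__.py").write_text(
    "from . import standalone_compile as _sub\n"
    "def standalone_compile():\n"
    "    pass  # shadows the submodule of the same name\n"
)

sys.path.insert(0, str(root))
import demo_inductor  # getattr(demo_inductor, 'standalone_compile') is the function

# String-target patch() resolves the target via getattr, finds the function,
# and raises AttributeError when it looks for FakeTensorMode on it:
try:
    with patch("demo_inductor.standalone_compile.FakeTensorMode", new=None):
        pass
    string_patch_failed = False
except AttributeError:
    string_patch_failed = True

# patch.object() on the real module object (the upstream fix) sidesteps this:
mod = sys.modules["demo_inductor.standalone_compile"]
with patch.object(mod, "FakeTensorMode", new=None):
    patched_ok = mod.FakeTensorMode is None
```

This is why the one-word change from patch(...) to patch.object(sys.modules[...], ...) is enough to avoid the crash.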

Test Plan

1. Start the server

vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni \
    --port 8000 \
    --trust-remote-code

2. Upload a voice (one can pass "ref_text" as well)

curl -X POST http://localhost:8000/v1/audio/voices \
  -F "audio_sample=@/path/to/your/reference.wav" \
  -F "name=my_voice" \
  -F "consent=test_consent"

3. First request (warm-up: extracts the speaker embedding, then caches it)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, this is a voice cloning test.", "voice": "my_voice", "response_format": "wav"}' \
  --output cold.wav

4. Second request (embedding already cached, skips extraction)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "This request reuses the cached speaker embedding.", "voice": "my_voice", "response_format": "wav"}' \
   --output warm.wav

5. Batch endpoint (all items share the cached embedding)

curl -X POST http://localhost:8000/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{
    "voice": "my_voice",
    "items": [
      {"input": "First sentence in the batch."},
      {"input": "Second sentence, same cached voice."},
      {"input": "Third sentence, still one extraction."}
    ],
    "response_format": "wav"
  }'

6. Delete voice (cache key removed, next upload gets fresh extraction)

curl -X DELETE http://localhost:8000/v1/audio/voices/my_voice

Test Result

Metric                          main        voice-cache     delta
─────────────────────────────────────────────────────────────────
Cold e2e mean (1st req/voice)   738 ms      734 ms          -4 ms
Warm con=1  median              812 ms      799 ms          -13 ms
Warm con=1  P95                 933 ms      928 ms          -6 ms
Warm con=4  median              1480 ms     1455 ms         -25 ms
Warm con=4  P95                 1688 ms     1835 ms         +147 ms
Warm con=16 median              3722 ms     3666 ms         -56 ms
Warm con=16 P95                 4576 ms     4544 ms         -32 ms
Throughput  con=16              4.0 req/s   4.0 req/s       0.0
Errors                          0           0               0
─────────────────────────────────────────────────────────────────


JuanPZuluaga added 4 commits March 23, 2026 09:54
@JuanPZuluaga JuanPZuluaga changed the title Feat/refactor voice cache manager [Qwen3TTS] [TTS] [Feat] Refactor voice cache manager Mar 23, 2026
@JuanPZuluaga
Contributor Author

Separately, I was thinking of a unified VoiceManager where everything related to voices is managed in one place.

e.g., a single class in vllm_omni/utils/voice_manager.py that owns the full lifecycle:

VoiceManager
├── Storage (upload, delete, list, get audio path)
│   └── MetadataManager
├── Artifacts Cache
│   ├── L1: In-memory LRU (current PR)
│   └── L2: Disk safetensors (already in main)
├── Voice Data Access
│   ├── get_ref_audio(name) → base64 data URI
│   ├── get_ref_text(name) → str | None
│   └── get_cached_artifacts(name) → {ref_code, ref_spk_embedding, icl_mode}
└── Cache Invalidation
    └── on delete → evict from L1 + L2

@linyueqian
Collaborator

The cache approach LGTM for the inline ref_audio path. A few thoughts:

  1. The voice upload API (POST /v1/audio/voices) already exists on main and stores voice profiles by name. Does this cache integrate with uploaded voices? Ideally, when a voice is uploaded via the API, the extracted artifacts should also populate the VoiceEmbeddingCache so that POST /v1/audio/speech with voice="my_voice" benefits from the same cache hit path.

  2. The cache is per-process and volatile (lost on restart). For production use cases where users pre-register voices via the upload API, the uploaded voice's artifacts should persist across restarts. Is that already handled by the upload flow, or does re-extraction happen on restart?

  3. The _normalize_ref_audio call on cache miss (line 1341) still runs before the cache check completes. Could you move it inside the miss branch to avoid unnecessary audio normalization when hitting the cache via voice_clone_prompt?

@Sy0307
Contributor

Sy0307 commented Mar 24, 2026

I agree with your design for the Unified VoiceManager, but I think it needs more detailed configuration. For example, some settings for LRU, because users may not want to cache all audio. So should we synchronize the design of the configuration as well? This would also facilitate the adaptation of other TTS models. @JuanPZuluaga

JuanPZuluaga added 10 commits March 24, 2026 14:53
@JuanPZuluaga
Contributor Author

Thanks for the comments @linyueqian.

The cache approach LGTM for the inline ref_audio path. A few thoughts:

  1. The voice upload API (POST /v1/audio/voices) already exists on main and stores voice profiles by name. Does this cache integrate with uploaded voices? Ideally, when a voice is uploaded via the API, the extracted artifacts should also populate the VoiceEmbeddingCache so that POST /v1/audio/speech with voice="my_voice" benefits from the same cache hit path.

I'm thinking of doing a full refactor of the caching while keeping the same endpoints for upload/delete/list. This refactor will let us integrate new models easily (most TTS/omni models support voice cloning). It would also be good to discuss how to make the caching generic enough that new models can integrate it easily, since we need to inject the speaker embedding into the forward pass dynamically, etc.

  2. The cache is per-process and volatile (lost on restart). For production use cases where users pre-register voices via the upload API, the uploaded voice's artifacts should persist across restarts. Is that already handled by the upload flow, or does re-extraction happen on restart?

Re-extraction does happen on restarts. I'm a bit defensive about changing that, since it adds complexity, and if someone wants a clean deployment they can simply restart and start from scratch. On the other hand, #1227 will introduce uploading speaker embeddings, which is cool and could solve that issue.

  3. The _normalize_ref_audio call on cache miss (line 1341) still runs before the cache check completes. Could you move it inside the miss branch to avoid unnecessary audio normalization when hitting the cache via voice_clone_prompt?

Already fixed in the latest revision: the cache lookup uses the voice name directly (simpler than computing a hash of the audio file), so _normalize_ref_audio only runs on a cache miss.
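The miss-branch ordering can be sketched as follows; all names here (fetch_voice_clone_artifacts, the normalize/extract callables) are illustrative stand-ins, not the PR's actual code:

```python
def fetch_voice_clone_artifacts(voice_name, cache, load_ref_audio,
                                normalize, extract, stats):
    """Cache is keyed by voice name; normalization and GPU extraction
    only run when the lookup misses."""
    cached = cache.get(voice_name)
    if cached is not None:
        stats["hits"] += 1
        return cached                              # hit: no normalization at all
    stats["misses"] += 1
    audio = normalize(load_ref_audio(voice_name))  # runs only on a miss
    artifacts = extract(audio)                     # GPU ops: spk embedding + ref codes
    cache[voice_name] = artifacts
    return artifacts
```

On the second request for the same voice, the normalize step is skipped entirely.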

@JuanPZuluaga
Contributor Author

I agree with your design for the Unified VoiceManager, but I think it needs more detailed configuration. For example, some settings for LRU, because users may not want to cache all audio. So should we synchronize the design of the configuration as well? This would also facilitate the adaptation of other TTS models. @JuanPZuluaga

Thanks for the comments @Sy0307. That's true; I'm still working on the design and testing. Ideally, the voice manager would be generic enough that adding another TTS/omni model is as simple as wiring a couple of pieces into its Talker (after this PR lands, I'll onboard all the other TTS/omni models that support voice cloning).

@homepy

homepy commented Mar 25, 2026

The cache approach LGTM for the inline ref_audio path. A few thoughts:

  1. The cache is per-process and volatile (lost on restart). For production use cases where users pre-register voices via the upload API, the uploaded voice's artifacts should persist across restarts. Is that already handled by the upload flow, or does re-extraction happen on restart?

Maybe you could refer to the implementation of CosyVoice. See the CosyVoice example, lines 46 to 50.

assert cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', './asset/zero_shot_prompt.wav', 'my_zero_shot_spk') is True
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '', '', zero_shot_spk_id='my_zero_shot_spk')):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
cosyvoice.save_spkinfo()

In the voice clone example, cosyvoice.add_zero_shot_spk() adds a new voice, and cosyvoice.save_spkinfo() saves all voices to the file spk2info.pt in the model directory.
When the server restarts, it loads spk2info.pt. And it is easy to delete this file.

JuanPZuluaga added 4 commits March 25, 2026 06:22
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 25, 2026

I'll keep working on this, once: #1227 and #2046 are merged. @linyueqian @Sy0307

JuanPZuluaga added 4 commits March 25, 2026 22:26
Lidang-Jiang and others added 2 commits April 3, 2026 23:59
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 3, 2026
@linyueqian
Collaborator

Hey @JuanPZuluaga — while reviewing PR #2475 (rebase to v0.19.0), I noticed that Qwen3 TTS handles reference audio through its own custom voice_cache_manager.py / tokenizer pipeline, whereas other TTS models like CosyVoice3 and Voxtral TTS route their reference audio through vLLM's multimodal input pipeline (SupportsMultiModal, MULTIMODAL_REGISTRY, MultiModalFieldConfig, etc.).

Since you're already refactoring the voice cache here, do you think it would make sense to also align Qwen3 TTS with vLLM's multimodal input processing for reference audio? That way all TTS models would follow the same pattern and benefit from future upstream multimodal improvements automatically.

Curious if this is something you've already considered or if there's a reason to keep the current approach. Thanks!

JuanPZuluaga added 2 commits April 3, 2026 17:39
@JuanPZuluaga
Contributor Author

@linyueqian good point, actually. I looked at both CosyVoice3 (ONNX processors) and Voxtral (raw audio goes through embed_multimodal). The Voxtral pattern is the viable approach for Qwen3 TTS, since the speaker encoder and codec are GPU model sub-components, not standalone processors.

I think TTS voice cloning has a different access pattern: voices are uploaded once and reused. A voice-name-keyed LRU cache is more appropriate than the multimodal pipeline's content-hash cache, so we'd still layer the voice cache on top. I was thinking of keeping this PR focused on cache correctness and opening a follow-up ASAP for multimodal alignment across all the other models that support voice cloning. What is your opinion: should we merge this and then I align all the models, or should I already do that here?

@linyueqian linyueqian closed this Apr 3, 2026
@linyueqian linyueqian reopened this Apr 3, 2026
@linyueqian linyueqian force-pushed the feat/refactor-voice-cache-manager branch 2 times, most recently from 00c5fbb to 8fd5d49 Compare April 3, 2026 19:55
@linyueqian
Collaborator

The CI failures (ImportError: cannot import name 'hardware_test' from 'tests.utils') are caused by this PR adding tests/utils/__init__.py and tests/utils/test_voice_cache.py. This creates a tests/utils/ package that shadows the existing tests/utils.py module, breaking all imports like from tests.utils import hardware_test.

Please move tests/utils/test_voice_cache.py to a different location (e.g., tests/test_voice_cache.py) and remove the tests/utils/__init__.py file.
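The shadowing can be reproduced in isolation; the throwaway package below (demo_tests is a stand-in for the repo's tests/ directory) shows that a package directory takes import precedence over a same-named sibling module:

```python
import sys
import tempfile
from pathlib import Path

# Build a layout mirroring the broken state: tests/utils.py next to a
# tests/utils/ package. `demo_tests` stands in for `tests`.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_tests"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "utils.py").write_text("def hardware_test():\n    return 'ok'\n")
(pkg / "utils").mkdir()                         # the PR's tests/utils/ package...
(pkg / "utils" / "__init__.py").write_text("")  # ...shadows tests/utils.py

sys.path.insert(0, str(root))
import demo_tests.utils as u

# The package wins, so nothing defined in utils.py is importable anymore:
shadowed = not hasattr(u, "hardware_test")
```

This is exactly why every `from tests.utils import hardware_test` in CI broke once the package was added.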

JuanPZuluaga added 3 commits April 3, 2026 20:27
@JuanPZuluaga
Contributor Author

Thanks for the catch, @linyueqian! It's fixed now :-)


@linyueqian linyueqian left a comment

LGTM

@linyueqian linyueqian merged commit f50c5a4 into vllm-project:main Apr 3, 2026
8 checks passed
reidliu41 added a commit to reidliu41/vllm-omni that referenced this pull request Apr 4, 2026
  - drop the old metadata/cache-manager path after rebasing onto main
  - keep voice_created_at-based stale-cache protection on raw-audio fallback
  - memoize direct speaker embeddings to avoid repeated safetensors reads
  - preserve request-level Base overrides when uploaded voices fall back to raw audio
  - keep cached ref_code handling intact for Base in-context prompt construction
  - update voice cache tests to match the rebased serving implementation

Signed-off-by: reidliu41 <reid201711@gmail.com>
skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 8, 2026
Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for
Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the
expensive DAC codec encoding is performed once and cached; subsequent
requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice →
  ref_audio + ref_text), pass voice_name/voice_created_at to model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding,
  store on miss, reuse on hit, clean up temp files on cache hit
- Add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561
@JuanPZuluaga JuanPZuluaga deleted the feat/refactor-voice-cache-manager branch April 9, 2026 07:05
vraiti added a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
