[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager #2108

linyueqian merged 100 commits into vllm-project:main
Conversation
Though, I was thinking of something better, e.g. a single VoiceManager class:
├── Storage (upload, delete, list, get audio path)
│ └── MetadataManager
├── Artifacts Cache
│ ├── L1: In-memory LRU (current PR)
│ └── L2: Disk safetensors (already in main)
├── Voice Data Access
│ ├── get_ref_audio(name) → base64 data URI
│ ├── get_ref_text(name) → str | None
│ └── get_cached_artifacts(name) → {ref_code, ref_spk_embedding, icl_mode}
└── Cache Invalidation
    └── on delete → evict from L1 + L2
The cache approach LGTM for the inline
I agree with your design for the unified VoiceManager, but I think it needs more detailed configuration: for example, settings for the LRU, since users may not want to cache all audio. Should we also align on the configuration design? That would make it easier for other TTS models to adopt it. @JuanPZuluaga
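For concreteness, a config section along these lines could capture the knobs mentioned above. The field names here are invented for illustration, not the project's real configuration schema:

```python
from dataclasses import dataclass

@dataclass
class VoiceCacheConfig:
    """Hypothetical voice-cache settings (illustrative names only)."""
    enabled: bool = True              # allow disabling caching entirely
    max_entries: int = 64             # LRU capacity for in-memory artifacts
    max_audio_seconds: float = 30.0   # skip caching very long reference clips
    persist_to_disk: bool = False     # opt-in L2 safetensors persistence

def should_cache(cfg: VoiceCacheConfig, audio_seconds: float) -> bool:
    # Users who don't want to cache all audio can bound it by clip length.
    return cfg.enabled and audio_seconds <= cfg.max_audio_seconds
```

A shared config like this is also what would let other TTS models adopt the same cache without model-specific flags.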
Thanks for the comments @linyueqian.
I'm thinking of doing a full refactor of the caching, while keeping the same endpoints for upload/delete/list. This refactor will let us integrate new models easily (most TTS/omni models support voice cloning). It'd also be good to discuss how to make the caching generic enough that new models can integrate it easily, since we need to inject the speaker embedding into the forward pass dynamically, etc.
That happens on restarts, actually. I am a bit defensive against this, as it introduces more complexity; if one wants a clean deployment, one can simply restart and start from scratch. On the other hand, #1227 will introduce uploading speaker embeddings, which is cool and could solve that issue.
Already fixed in the latest revision: the cache lookup uses the voice name directly (it's simpler than computing a hash of the audio file).
Thanks for the comments @Sy0307. This is true; I'm still working a bit on design and testing. Ideally, it'd be nice to have a voice manager generic enough that adding another TTS/omni model is as simple as adding a couple of parts in their
Maybe you could refer to the implementation of CosyVoice; see the CosyVoice example, lines 46 to 50.
In the voice clone example, cosyvoice.add_zero_shot_spk() adds a new voice, and cosyvoice.save_spkinfo() saves all the voices to the spk2info.pt file in the model directory.
I'll keep working on this once #1227 and #2046 are merged. @linyueqian @Sy0307
Hey @JuanPZuluaga, while reviewing PR #2475 (the rebase to v0.19.0), I noticed that Qwen3 TTS handles reference audio through its own custom

Since you're already refactoring the voice cache here, do you think it would make sense to also align Qwen3 TTS with vLLM's multimodal input processing for reference audio? That way all TTS models would follow the same pattern and automatically benefit from future upstream multimodal improvements.

Curious whether this is something you've already considered, or whether there's a reason to keep the current approach. Thanks!
@linyueqian good point actually. I looked at both CosyVoice3 (ONNX processors) and Voxtral (from raw audio goes to

I think TTS voice cloning has a different access pattern: voices are uploaded once and reused. A voice-name-keyed LRU cache is more appropriate than the multimodal pipeline's content-hash cache, so we'd still layer the voice cache on top.

I was thinking of keeping this PR focused on cache correctness, and I'll open a follow-up ASAP for multimodal alignment of all the other models that support voice cloning, etc. What's your opinion: should we merge this and then I align all the models, or should I already do that here?
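The key-choice tradeoff can be made concrete with two tiny key functions (illustrative only, not the PR's code): a content-hash key must re-read and hash the audio bytes on every lookup, while a voice-name key is constant-time on the request's voice field, with the upload timestamp guarding against stale entries after a re-upload.

```python
import hashlib

def content_hash_key(audio_bytes: bytes) -> str:
    # Multimodal-pipeline style: cost grows with audio size on every request.
    return hashlib.sha256(audio_bytes).hexdigest()

def voice_name_key(voice: str, created_at: float) -> tuple[str, float]:
    # Uploaded-voice style: O(1); created_at invalidates entries from
    # a voice that was deleted and re-uploaded under the same name.
    return (voice, created_at)
```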
The CI failures (

Please move
Thanks for the catch @linyueqian, it's fixed now :-)
- drop the old metadata/cache-manager path after rebasing onto main
- keep voice_created_at-based stale-cache protection on raw-audio fallback
- memoize direct speaker embeddings to avoid repeated safetensors reads
- preserve request-level Base overrides when uploaded voices fall back to raw audio
- keep cached ref_code handling intact for Base in-context prompt construction
- update voice cache tests to match the rebased serving implementation

Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com> Signed-off-by: yiliu30 <yi4.liu@intel.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Binh Tang <tangbinhna@gmail.com> Signed-off-by: Binh Tang <binht@netflix.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: Rein Yang <ruiruyang2@gmail.com> Signed-off-by: CHEN <116010019@link.cuhk.edu.cn> Signed-off-by: vraiti <vraiti@redhat.com> Signed-off-by: Songrui625 <songrui625@gmail.com> Signed-off-by: Lidang Jiang <lidangjiang@gmail.com> Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com> Signed-off-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: Yi Liu <yi4.liu@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Canlin Guo <canlinguosdu@gmail.com> Co-authored-by: Binh Tang <tangbinhna@gmail.com> Co-authored-by: Binh Tang <binht@netflix.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com> Co-authored-by: zhumingjue138 <zhumingjue@huawei.com> Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com> Co-authored-by: vraiti <vraiti@redhat.com> Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com> Co-authored-by: Sy03 <1370724210@qq.com> Co-authored-by: chickeyton <ngton2014@gmail.com> Co-authored-by: Jerry Song <46962917+Songrui625@users.noreply.github.com> Co-authored-by: Lidang Jiang <119769478+Lidang-Jiang@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: linyueqian <linyueqian@outlook.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the expensive DAC codec encoding is performed once and cached; subsequent requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice → ref_audio + ref_text), pass voice_name/voice_created_at to the model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding, store on miss, reuse on hit, clean up temp files on cache hit
- add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561
Purpose
We add an in-memory cache for voice-extraction artifacts in Qwen3-TTS Base voice clone requests. The system is simple enough to be reused by other models (we will wire it into the other models in a follow-up PR).
When a user sends a voice clone request with ref_audio, the server runs two GPU ops: speaker embedding extraction (ECAPA-TDNN) and ref_audio encoding (SpeechTokenizer). These cost 60-250 ms per request. If the same voice is used again, the results are identical, but today they're recomputed every time.

This PR caches the extraction results, keyed by a SHA-256 hash of the audio content. On repeat requests with the same reference audio, the cached tensors are served from CPU memory instead of re-running GPU extraction. This gives a ~25% TTFP reduction on warm (repeated-voice) requests.
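The caching idea reduces to a few lines. This is a minimal sketch, not the PR's implementation: `get_or_extract` and the `extract_artifacts` callback are hypothetical stand-ins for the real GPU ops (speaker embedding extraction plus ref_audio encoding):

```python
import hashlib

# Module-level cache: SHA-256 of the raw audio bytes -> extraction artifacts.
_cache: dict[str, dict] = {}

def get_or_extract(audio: bytes, extract_artifacts) -> dict:
    key = hashlib.sha256(audio).hexdigest()
    if key not in _cache:
        # Cold path: run the expensive extraction once per unique audio.
        _cache[key] = extract_artifacts(audio)
    # Warm path: identical reference audio is served from CPU memory.
    return _cache[key]
```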
Files deleted

- vllm_omni/entrypoints/openai/metadata_manager.py: disk persistence replaced by an in-memory dict (will add something more robust later)
- vllm_omni/model_executor/models/qwen3_tts/voice_cache_manager.py: replaced by a simpler voice_cache.py

Known issue: torch.compile crash on startup (vllm 0.18.0 + torch 2.10)
When using enforce_eager: false (required for stage 0), the server crashes during startup with:

Root cause: torch._inductor.__init__ defines a wrapper function standalone_compile that shadows the submodule of the same name. unittest.mock.patch() resolves via getattr and finds the function instead of the module.
Fix: apply a one-line patch to vllm/compilation/compiler_interface.py (line ~376). This is fixed upstream in vllm-project/vllm#37158 but not included in vllm 0.18.0. It was already merged to main with the bug, in vllm-project/vllm#37858.
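The shadowing can be reproduced outside vllm with two toy in-memory modules (the `demo_pkg` names are made up; only the mechanism matches the root cause above): mock.patch() splits the target string, resolves the module part via import plus getattr, and the package attribute wins over the submodule.

```python
import sys
import types
from unittest import mock

# Build a tiny package in memory: a package and a real submodule.
pkg = types.ModuleType("demo_pkg")
sub = types.ModuleType("demo_pkg.standalone_compile")
sub.run = lambda: "real submodule function"
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.standalone_compile"] = sub

# The package defines a wrapper *function* with the same name,
# shadowing the submodule on attribute access:
pkg.standalone_compile = lambda: "wrapper function"

try:
    with mock.patch("demo_pkg.standalone_compile.run"):
        pass
    shadowed = False
except AttributeError:
    # mock.patch resolved demo_pkg.standalone_compile via getattr, got the
    # wrapper function, and a function has no attribute "run" to patch.
    shadowed = True
```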
(We might need to pin to a version; please test this as well.)
Test Plan
1. Start the server
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni \
    --port 8000 \
    --trust-remote-code

2. Upload a voice (one can pass "ref_text" as well)
3. First request (warmup, it extracts speaker embedding, and then caches it)
4. Second request (embedding already cached, skips extraction)
5. Batch endpoint (all items share the cached embedding)
6. Delete voice (cache key removed, next upload gets fresh extraction)
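Steps 2-6 above can be simulated in memory with a counter standing in for the GPU extraction (everything here is illustrative, not the serving code):

```python
extractions = 0
voices: dict[str, bytes] = {}   # uploaded voices
cache: dict[str, dict] = {}     # voice name -> cached extraction artifacts

def speech_request(voice: str) -> dict:
    global extractions
    if voice not in cache:
        extractions += 1                      # step 3: warmup extracts once
        cache[voice] = {"spk_embedding": [0.1, 0.2]}
    return cache[voice]                       # steps 4-5: cache hits

def delete_voice(voice: str) -> None:
    voices.pop(voice, None)
    cache.pop(voice, None)                    # step 6: key removed; a fresh
                                              # upload re-extracts from scratch

voices["demo"] = b"<wav bytes>"               # step 2: upload
speech_request("demo")                        # step 3: first request
speech_request("demo")                        # step 4: second request, cached
for _ in range(4):                            # step 5: batch shares the entry
    speech_request("demo")
delete_voice("demo")                          # step 6: delete
```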
Test Result