[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager #2108

linyueqian merged 100 commits into vllm-project:main
Conversation
Though, I was thinking of something better, e.g. a single VoiceManager class:
├── Storage (upload, delete, list, get audio path)
│ └── MetadataManager
├── Artifacts Cache
│ ├── L1: In-memory LRU (current PR)
│ └── L2: Disk safetensors (already in main)
├── Voice Data Access
│ ├── get_ref_audio(name) → base64 data URI
│ ├── get_ref_text(name) → str | None
│ └── get_cached_artifacts(name) → {ref_code, ref_spk_embedding, icl_mode}
└── Cache Invalidation
    └── on delete → evict from L1 + L2
The cache approach LGTM for the inline
I agree with your design for the unified VoiceManager, but I think it needs more detailed configuration: for example, settings for the LRU, since users may not want to cache all audio. Should we also align on the configuration design? That would make it easier for other TTS models to adopt it. @JuanPZuluaga
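For concreteness, a config section along these lines could capture the knobs mentioned above. The field names here are invented for illustration, not the project's real configuration schema:

```python
from dataclasses import dataclass

@dataclass
class VoiceCacheConfig:
    """Hypothetical voice-cache settings (illustrative names only)."""
    enabled: bool = True              # allow disabling caching entirely
    max_entries: int = 64             # LRU capacity for in-memory artifacts
    max_audio_seconds: float = 30.0   # skip caching very long reference clips
    persist_to_disk: bool = False     # opt-in L2 safetensors persistence

def should_cache(cfg: VoiceCacheConfig, audio_seconds: float) -> bool:
    # Users who don't want to cache all audio can bound it by clip length.
    return cfg.enabled and audio_seconds <= cfg.max_audio_seconds
```

A shared config like this is also what would let other TTS models adopt the same cache without model-specific flags.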
Thanks for the comments @linyueqian.
I'm thinking of doing a full refactor of the caching, while keeping the same endpoints for upload/delete/list. This refactor will let us integrate new models easily (most TTS/omni models support voice cloning). It'd also be good to discuss how to make the caching generic enough that new models can integrate it easily, since we need to inject the speaker embedding into the forward pass dynamically, etc.
That happens on restarts, actually. I am a bit defensive against this, as it introduces more complexity; if one wants a clean deployment, one can simply restart and start from scratch. On the other hand, #1227 will introduce uploading speaker embeddings, which is cool and could solve that issue.
Already fixed in the latest revision: the cache lookup uses the voice name directly (it's simpler than computing a hash of the audio file).
Thanks for the comments @Sy0307. This is true; I'm still working a bit on design and testing. Ideally, it'd be nice to have a voice manager generic enough that adding another TTS/omni model is as simple as adding a couple of parts in their
Maybe you could refer to the implementation of CosyVoice; see the CosyVoice example, lines 46 to 50.
In the voice clone example, cosyvoice.add_zero_shot_spk() adds a new voice, and cosyvoice.save_spkinfo() saves all the voices to the spk2info.pt file in the model directory.
I'll keep working on this once #1227 and #2046 are merged. @linyueqian @Sy0307
Hey @JuanPZuluaga, while reviewing PR #2475 (the rebase to v0.19.0), I noticed that Qwen3 TTS handles reference audio through its own custom

Since you're already refactoring the voice cache here, do you think it would make sense to also align Qwen3 TTS with vLLM's multimodal input processing for reference audio? That way all TTS models would follow the same pattern and automatically benefit from future upstream multimodal improvements.

Curious whether this is something you've already considered, or whether there's a reason to keep the current approach. Thanks!
@linyueqian good point actually. I looked at both CosyVoice3 (ONNX processors) and Voxtral (from raw audio goes to

I think TTS voice cloning has a different access pattern: voices are uploaded once and reused. A voice-name-keyed LRU cache is more appropriate than the multimodal pipeline's content-hash cache, so we'd still layer the voice cache on top.

I was thinking of keeping this PR focused on cache correctness, and I'll open a follow-up ASAP for multimodal alignment of all the other models that support voice cloning, etc. What's your opinion: should we merge this and then I align all the models, or should I already do that here?
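The key-choice tradeoff can be made concrete with two tiny key functions (illustrative only, not the PR's code): a content-hash key must re-read and hash the audio bytes on every lookup, while a voice-name key is constant-time on the request's voice field, with the upload timestamp guarding against stale entries after a re-upload.

```python
import hashlib

def content_hash_key(audio_bytes: bytes) -> str:
    # Multimodal-pipeline style: cost grows with audio size on every request.
    return hashlib.sha256(audio_bytes).hexdigest()

def voice_name_key(voice: str, created_at: float) -> tuple[str, float]:
    # Uploaded-voice style: O(1); created_at invalidates entries from
    # a voice that was deleted and re-uploaded under the same name.
    return (voice, created_at)
```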
The CI failures (

Please move
Thanks for the catch @linyueqian, it's fixed now :-)
- drop the old metadata/cache-manager path after rebasing onto main
- keep voice_created_at-based stale-cache protection on raw-audio fallback
- memoize direct speaker embeddings to avoid repeated safetensors reads
- preserve request-level Base overrides when uploaded voices fall back to raw audio
- keep cached ref_code handling intact for Base in-context prompt construction
- update voice cache tests to match the rebased serving implementation

Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com> Signed-off-by: yiliu30 <yi4.liu@intel.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: Binh Tang <tangbinhna@gmail.com> Signed-off-by: Binh Tang <binht@netflix.com> Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: Rein Yang <ruiruyang2@gmail.com> Signed-off-by: CHEN <116010019@link.cuhk.edu.cn> Signed-off-by: vraiti <vraiti@redhat.com> Signed-off-by: Songrui625 <songrui625@gmail.com> Signed-off-by: Lidang Jiang <lidangjiang@gmail.com> Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com> Signed-off-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: Yi Liu <yi4.liu@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Canlin Guo <canlinguosdu@gmail.com> Co-authored-by: Binh Tang <tangbinhna@gmail.com> Co-authored-by: Binh Tang <binht@netflix.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Co-authored-by: rein yang <73573651+R2-Y@users.noreply.github.com> Co-authored-by: zhumingjue138 <zhumingjue@huawei.com> Co-authored-by: ChenWenjing <54166744+Shirley125@users.noreply.github.com> Co-authored-by: vraiti <vraiti@redhat.com> Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com> Co-authored-by: Sy03 <1370724210@qq.com> Co-authored-by: chickeyton <ngton2014@gmail.com> Co-authored-by: Jerry Song <46962917+Songrui625@users.noreply.github.com> Co-authored-by: Lidang Jiang <119769478+Lidang-Jiang@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Alex Brooks <albrooks@redhat.com> Co-authored-by: linyueqian <linyueqian@outlook.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Reuse the existing VoiceEmbeddingCache (from Qwen3-TTS, PR vllm-project#2108) for Fish Speech S2 Pro voice cloning. When an uploaded voice is used, the expensive DAC codec encoding is performed once and cached; subsequent requests with the same voice skip encoding entirely.

Changes:
- serving_speech: auto-resolve uploaded voices for Fish Speech (voice → ref_audio + ref_text), pass voice_name/voice_created_at to the model
- fish_speech_slow_ar: check VoiceEmbeddingCache before DAC encoding, store on miss, reuse on hit, clean up temp files on cache hit
- add tests for cache integration and uploaded voice resolution

Closes vllm-project#2561
Purpose
We add an in-memory cache for voice-extraction artifacts in Qwen3-TTS Base voice clone requests. The system is simple enough to be reused by other models (we will wire it into the other models in a follow-up PR).
When a user sends a voice clone request with ref_audio, the server runs two GPU ops: speaker embedding extraction (ECAPA-TDNN) and ref_audio encoding (SpeechTokenizer). These cost 60-250 ms per request. If the same voice is used again, the results are identical, but today they're recomputed every time.

This PR caches the extraction results, keyed by a SHA-256 hash of the audio content. On repeat requests with the same reference audio, the cached tensors are served from CPU memory instead of re-running GPU extraction. This gives a ~25% TTFP reduction on warm (repeated-voice) requests.
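The caching idea reduces to a few lines. This is a minimal sketch, not the PR's implementation: `get_or_extract` and the `extract_artifacts` callback are hypothetical stand-ins for the real GPU ops (speaker embedding extraction plus ref_audio encoding):

```python
import hashlib

# Module-level cache: SHA-256 of the raw audio bytes -> extraction artifacts.
_cache: dict[str, dict] = {}

def get_or_extract(audio: bytes, extract_artifacts) -> dict:
    key = hashlib.sha256(audio).hexdigest()
    if key not in _cache:
        # Cold path: run the expensive extraction once per unique audio.
        _cache[key] = extract_artifacts(audio)
    # Warm path: identical reference audio is served from CPU memory.
    return _cache[key]
```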
Files deleted

- vllm_omni/entrypoints/openai/metadata_manager.py: disk persistence replaced by an in-memory dict (will add something more robust later)
- vllm_omni/model_executor/models/qwen3_tts/voice_cache_manager.py: replaced by a simpler voice_cache.py

Known issue: torch.compile crash on startup (vllm 0.18.0 + torch 2.10)
When using enforce_eager: false (required for stage 0), the server crashes during startup with:

Root cause: torch._inductor.__init__ defines a wrapper function standalone_compile that shadows the submodule of the same name. unittest.mock.patch() resolves via getattr and finds the function instead of the module.
Fix: apply a one-line patch to vllm/compilation/compiler_interface.py (line ~376). This is fixed upstream in vllm-project/vllm#37158 but not included in vllm 0.18.0. It was already merged to main with the bug, in vllm-project/vllm#37858.
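The shadowing can be reproduced outside vllm with two toy in-memory modules (the `demo_pkg` names are made up; only the mechanism matches the root cause above): mock.patch() splits the target string, resolves the module part via import plus getattr, and the package attribute wins over the submodule.

```python
import sys
import types
from unittest import mock

# Build a tiny package in memory: a package and a real submodule.
pkg = types.ModuleType("demo_pkg")
sub = types.ModuleType("demo_pkg.standalone_compile")
sub.run = lambda: "real submodule function"
sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg.standalone_compile"] = sub

# The package defines a wrapper *function* with the same name,
# shadowing the submodule on attribute access:
pkg.standalone_compile = lambda: "wrapper function"

try:
    with mock.patch("demo_pkg.standalone_compile.run"):
        pass
    shadowed = False
except AttributeError:
    # mock.patch resolved demo_pkg.standalone_compile via getattr, got the
    # wrapper function, and a function has no attribute "run" to patch.
    shadowed = True
```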
(We might need to pin to a version; please test this as well.)
Test Plan
1. Start the server
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni \
    --port 8000 \
    --trust-remote-code

2. Upload a voice (one can pass "ref_text" as well)
3. First request (warmup, it extracts speaker embedding, and then caches it)
4. Second request (embedding already cached, skips extraction)
5. Batch endpoint (all items share the cached embedding)
6. Delete voice (cache key removed, next upload gets fresh extraction)
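Steps 2-6 above can be simulated in memory with a counter standing in for the GPU extraction (everything here is illustrative, not the serving code):

```python
extractions = 0
voices: dict[str, bytes] = {}   # uploaded voices
cache: dict[str, dict] = {}     # voice name -> cached extraction artifacts

def speech_request(voice: str) -> dict:
    global extractions
    if voice not in cache:
        extractions += 1                      # step 3: warmup extracts once
        cache[voice] = {"spk_embedding": [0.1, 0.2]}
    return cache[voice]                       # steps 4-5: cache hits

def delete_voice(voice: str) -> None:
    voices.pop(voice, None)
    cache.pop(voice, None)                    # step 6: key removed; a fresh
                                              # upload re-extracts from scratch

voices["demo"] = b"<wav bytes>"               # step 2: upload
speech_request("demo")                        # step 3: first request
speech_request("demo")                        # step 4: second request, cached
for _ in range(4):                            # step 5: batch shares the entry
    speech_request("demo")
delete_voice("demo")                          # step 6: delete
```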
Test Result