[Frontend] Add voice clone prompt cache endpoint for Qwen3-TTS (#1760)#2457
[Frontend] Add voice clone prompt cache endpoint for Qwen3-TTS (#1760)#2457reidliu41 wants to merge 11 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d4e94cd0a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if not self.metadata_manager.update_speaker( | ||
| voice_key, | ||
| { | ||
| "cache_status": "processing", | ||
| "cache_generated_at": now, | ||
| }, | ||
| ): |
There was a problem hiding this comment.
Prevent cache rebuild from recreating deleted voice metadata
This path marks a voice as processing via metadata_manager.update_speaker(...) after only an in-memory existence check, but update_speaker creates missing entries when the key is absent. If another worker deletes the voice between those two operations, this call resurrects a partial metadata record and can leave a zombie voice after rollback/failure. In multi-process deployments, a DELETE racing with /cache can therefore corrupt metadata.json instead of failing cleanly.
Useful? React with 👍 / 👎.
| # ── No cache (pending/failed/processing): raw audio path ── | ||
| self._fallback_to_raw_audio(request.voice, speaker_info, params) | ||
| _uploaded_voice_resolved = True |
There was a problem hiding this comment.
Keep request-level Base overrides for uncached uploaded voices
In the raw-audio fallback branch for uploaded voices without a ready cache, _uploaded_voice_resolved is set to True, which skips the later merge of request-level ref_text/x_vector_only_mode/speaker_embedding. As a result, requests for pending/failed/processing uploaded voices now silently ignore valid per-request Base cloning overrides and always use upload-time defaults from _fallback_to_raw_audio, which regresses prior behavior. The override suppression should be limited to cases where a cached/direct prompt was actually used.
Useful? React with 👍 / 👎.
6619916 to
c6e9e19
Compare
lishunyang12
left a comment
There was a problem hiding this comment.
left a couple comments, mostly around hot-path perf
| emb_path = Path(cache_file_str) | ||
| if not _validate_path_within_directory(emb_path, self.uploaded_speakers_dir): | ||
| raise ValueError("Illegal cache path outside voice samples directory") | ||
| if not emb_path.is_file() or emb_path.suffix != ".safetensors": |
There was a problem hiding this comment.
This loads and deserializes the safetensors file on every single TTS request for direct-embedding voices. That's synchronous disk I/O on the request hot path — kind of defeats the purpose of caching.
Could you load this once (e.g. at upload time or first access) and keep the embedding list in memory, similar to how the audio-cache path uses load_cached_voice_prompt?
| icl_mode = ref_text is not None and ref_text.strip() != "" | ||
|
|
||
| if icl_mode and not hasattr(model, "_encode_ref_audio_to_code"): | ||
| raise NotImplementedError(f"{type(model).__name__} does not support ref audio codec encoding") |
There was a problem hiding this comment.
wav_np.tolist() converts the entire waveform to a Python list of floats before sending over RPC. For a 10s clip at 24kHz that's 240k Python float objects — roughly 10x the memory of the numpy array.
Worth checking if the RPC layer can handle numpy arrays or bytes directly. If not, at least document why this is necessary.
| updates["cache_generated_at"] = cache_generated_at | ||
| if not self.metadata_manager.update_speaker(voice_key, updates): | ||
| logger.error("Failed to rollback cache state for voice %s to disk", voice_key) | ||
| if voice_key in self.uploaded_speakers: |
There was a problem hiding this comment.
Nit: except Exception is fine for the rollback, but the docstring says "returns plain Python types only (must survive msgspec IPC)" over in the worker — same constraint applies to wav_samples arg. Might be worth a brief comment here explaining why tolist() is needed for the audio data too (msgspec can't handle numpy).
|
@JuanPZuluaga ptal |
|
Hi @linyueqian @lishunyang12 @reidliu41, i think this PR overlaps a bit with: #2108 in that pr, we use an in-memory LRU cache with a Ideally, the voice cache manager should handle all model types that support voice cloning, and I'll work on that as soon as #2108 is merged into main. |
|
@reidliu41 now that #2108 is merged. please rebase on main. thanks! |
52bca00 to
e62bdfb
Compare
linyueqian
left a comment
There was a problem hiding this comment.
Feature is well-designed with solid state management and good test coverage. A few issues to address:
[P1] _load_cached_voice_prompt reads safetensors from disk on every TTS request
For audio-uploaded voices with cache_status="ready", _build_tts_params calls _load_cached_voice_prompt which does safe_open + tensor deserialization on every single request. This undermines the latency benefit of caching.
Direct-embedding voices already have _direct_embedding_cache for in-memory caching. Audio-cached prompts should get the same treatment:
self._audio_prompt_cache: dict[str, dict[str, Any]] = {}Populate on first load, invalidate on force-rebuild or delete. The payload is small (1024-dim embedding + codec codes), so memory is not a concern.
[Minor] Non-atomic save
_save_voice_cache writes safetensors directly to the final path. A tmp+rename pattern would prevent corrupted cache files if the process dies mid-write. Fine as a follow-up.
[Minor] Step numbering
serving_speech.py has steps 1, 2, 3, then 5 (no step 4).
Positive notes:
- State machine (pending/processing/ready/failed) with timeout, rollback, and idempotency is solid
- Path traversal checks via
_validate_path_within_directoryin all save/load paths - The
ref_codefix inqwen3_tts_talker.py:1356-1365is important --_as_singletonwas silently dropping all but the first frame of cached ref_code _uploaded_voice_resolvedflag correctly prevents request params from overriding cached ref_text/x_vector_only_mode
|
Please unify naming in the PR, please use Can the cache endpoint delegate to the shared Do you agree here? @linyueqian |
|
Agree, let's extract it to the VoiceCacheManager. @reidliu41 please refactor before merge according to @JuanPZuluaga 's instruction. thanks! |
…project#1760) Avoid repeated GPU preprocessing for uploaded audio voices by caching the generated voice clone prompt and reusing it in later TTS requests. - add worker RPC to pre-compute speaker embedding and ref_code - add POST /v1/audio/voices/{name}/cache with processing/ready/failed handling - reuse cached voice_clone_prompt in uploaded-voice TTS requests - prevent request ref_text/x_vector_only_mode from overriding cached semantics - fix direct-embedding uploaded voice handling in the TTS path - add rollback for pre-save rebuild failures and clearer validation errors - add unit and handler-contract coverage for cache generation and error paths Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
- avoid recreating deleted voice metadata during cache generation races - preserve request-level Base overrides when uploaded voices fall back to raw audio - memoize direct speaker embeddings to remove repeated safetensors disk reads - document why waveform RPC args must be converted to plain Python types - keep cached ref_code intact when building Qwen3-TTS Base prompts Signed-off-by: reidliu41 <reid201711@gmail.com>
- drop the old metadata/cache-manager path after rebasing onto main - keep voice_created_at-based stale-cache protection on raw-audio fallback - memoize direct speaker embeddings to avoid repeated safetensors reads - preserve request-level Base overrides when uploaded voices fall back to raw audio - keep cached ref_code handling intact for Base in-context prompt construction - update voice cache tests to match the rebased serving implementation Signed-off-by: reidliu41 <reid201711@gmail.com>
- memoize cached audio voice_clone_prompt payloads to avoid repeated
safetensors reads on the TTS hot path
- invalidate the in-memory audio prompt cache on rebuild and delete
- warm the in-memory cache immediately after a successful cache save
- write safetensors through a temp file and os.replace for atomic updates
- fix the create_voice_cache step numbering comments
- add tests for audio prompt memoization, atomic save, and cache invalidation
Signed-off-by: reidliu41 <reid201711@gmail.com>
- move uploaded speaker cache state and safetensors persistence out of
serving_speech into a shared VoiceCacheManager
- keep serving_speech focused on API boundary, model checks, and worker RPC
- align new internal naming with speaker-oriented terminology while
keeping voice at the HTTP boundary
- update voice cache tests to match the shared manager refactor
Signed-off-by: reidliu41 <reid201711@gmail.com>
- resolve the serving_speech conflict after vllm-project#2424 merged into main - keep the speaker-to-voice request alias from vllm-project#2424 - preserve the uploaded voice cache endpoint and shared speaker cache flow - drop the stale direct-embedding helper left behind by the rebase Signed-off-by: reidliu41 <reid201711@gmail.com>
2c1f285 to
d1f8816
Compare
|
@JuanPZuluaga Could you take another look? The author addressed the P1 (in-memory caching + atomic save in |
|
thanks for the contribution @reidliu41, few comments below:
Also, overall i have the following question: we already have We can even have 2 dictionaries:
please let me know what you think. |
|
@JuanPZuluaga Thanks, this is a good direction. I agree that reducing the number of cache layers would be cleaner. My hesitation is that the current Because of that, my preference is to keep this PR scoped to the explicit |
- rename the cache endpoint exceptions to SpeakerNotFoundError and
SpeakerCacheUnsupportedError for internal naming consistency
- update the API layer and tests to use the speaker-named exceptions
- add a per-speaker asyncio.Lock in VoiceCacheManager to prevent
duplicate GPU work when concurrent /cache requests target the same speaker
Signed-off-by: reidliu41 <reid201711@gmail.com>
lishunyang12
left a comment
There was a problem hiding this comment.
Review: Voice Clone Prompt Cache Endpoint for Qwen3-TTS
Overall this is a well-structured PR that fills an important gap — pre-computing speaker embeddings and codec codes so repeated TTS requests skip redundant GPU work. The state machine (pending → processing → ready/failed), rollback logic, atomic file writes, and idempotency handling are all solid. The test coverage is thorough. A few issues worth addressing:
Concurrency: _speaker_locks dict is not thread-safe
VoiceCacheManager._get_speaker_lock() does a check-then-create on a plain dict without synchronization. If two coroutines race on the same speaker key for the first time, they could both create separate asyncio.Lock instances, defeating the serialization:
def _get_speaker_lock(self, speaker_key: str) -> asyncio.Lock:
lock = self._speaker_locks.get(speaker_key)
if lock is None:
lock = asyncio.Lock()
self._speaker_locks[speaker_key] = lock
return lockIn practice this is unlikely with asyncio's single-threaded event loop (no preemption between the get and set), but dict.setdefault would make the intent explicit and be future-proof:
return self._speaker_locks.setdefault(speaker_key, asyncio.Lock())_speaker_locks grows unboundedly
Every voice that ever gets cached adds a lock that is never removed — even after delete_voice. Over time on a long-running server with many uploaded voices this leaks memory. Consider removing the lock in invalidate_speaker_prompt_cache() or delete_voice, or switching to a bounded structure.
save_speaker_cache mutates speaker_info dict in place
save_speaker_cache() writes cache_status, cache_file, and cache_generated_at directly into the speaker_info dict that lives in uploaded_speakers. This works because the caller holds the async lock, but it couples persistence logic to in-memory state mutation. If save_file() succeeds but os.replace() fails (e.g., cross-device rename), the metadata is never updated — good. But if os.replace succeeds and then speaker_info["cache_status"] = "ready" raises (impossible for dict, but fragile pattern), the file is orphaned. Consider returning the metadata updates and letting the caller apply them, or at minimum documenting that this method has side effects on speaker_info.
wav_samples sent as list[float] over IPC could be large
In _build_speaker_cache_payload, the entire waveform is converted to a Python list of floats (wav_np.tolist()) for the collective_rpc call. For a 10-second clip at 24kHz, that's 240,000 Python float objects serialized through msgspec. This works, but could be a latency/memory bottleneck for longer reference clips. Worth a comment noting this limitation, or consider base64-encoding the raw bytes if msgspec supports bytes.
Unused _speaker_embedding_cache field on OmniOpenAIServingSpeech
The new _speaker_embedding_cache: dict[str, list[float]] on the serving class is only used for direct-embedding voices. Meanwhile VoiceCacheManager has its own _speaker_prompt_cache for audio-cached voices. Having two separate caches for similar purposes is confusing. Consider consolidating, or at minimum add a comment explaining why they are separate.
Minor: _as_singleton removal in talker could break non-cached paths
In qwen3_tts_talker.py, the change removes the _as_singleton() call on ref_code:
- ref_code = _as_singleton(voice_clone_prompt.get("ref_code"))
+ ref_code = voice_clone_prompt.get("ref_code")If any existing code path (e.g., inline ref_audio without caching) previously relied on _as_singleton to unwrap a batch dimension, this change could silently break it. Please verify that all non-cached voice_clone_prompt payloads either don't set ref_code or already provide it in the expected shape.
Minor: f-string in logger.exception
logger.exception(f"Failed to create voice cache for '{name}': {e}")Using f-string with logger.exception means the string is always formatted even if the log level is disabled. Use logger.exception("Failed to create voice cache for '%s': %s", name, e) for lazy formatting.
Minor: test line lengths
Several test lines exceed 120 characters (e.g., _make_server, test_create_cache_processing_active). Not a blocker but worth a formatting pass for consistency with the rest of the codebase.
Positive notes
- The rollback logic (restoring
previous_statuson pre-save failures vs. setting"failed"on post-save failures) is well thought out. - Atomic file replacement via
tempfile+os.replaceis the right pattern. - Idempotency check (ready + valid cache → early return) prevents wasted GPU work.
- The
forceparameter for debugging/maintenance is a nice touch. - Test coverage is comprehensive: error branches, rollback, handler contract, direct-embedding edge cases.
- Use setdefault for per-speaker cache lock creation - Remove per-speaker cache locks when uploaded voices are deleted - Document VoiceCacheManager side effects and cache ownership boundaries - Note waveform list IPC cost for long reference audio - Preserve ref_code compatibility without unwrapping cached list payloads - Use lazy formatting for voice cache exception logging - Extend delete-voice cache invalidation test coverage Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
…dpoint-1760 Signed-off-by: reidliu41 <reid201711@gmail.com>
|
I reworked it based on the latest shared speaker-cache implementation. The |
|
@JuanPZuluaga @lishunyang12 Please take another look when you have time. |
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Add
POST /v1/audio/voices/{name}/cachefor uploaded Qwen3-TTS voices.This change pre-computes speaker embedding and reference audio codec codes on the
TTS worker through
collective_rpc, persists them as safetensors, and letssubsequent TTS requests reuse the cached
voice_clone_promptinstead ofreprocessing reference audio on every request.
Value:
Closes [Feature]: Add a separate endpoint for create_voice_clone_prompt for qwen3-TTS model #1760
Test Plan
Manual end-to-end validation on a local Omni server:
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)