[TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning#2630
JuanPZuluaga wants to merge 19 commits into vllm-project:main from JuanPZuluaga:feat/general-speaker-cache-manager
is it ready to be reviewed?

I need to merge the voxcpm2 perf optimization #2690 first, so this one may need to wait for that.
True, thanks for the heads up |
lishunyang12 left a comment:
Review: [TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning
Overall this is a solid improvement: consolidating voice caching across all TTS backends (Fish Speech, CosyVoice3, OmniVoice, VoxCPM2, Qwen3 TTS) behind a shared VoiceEmbeddingCache is the right direction. The LRU eviction, thread safety via a lock, and per-voice clear() on delete are all welcome. However, I see several issues that should be addressed before merging.
Critical Issues
1. Stale cache on voice re-upload (regression)
The PR removes the created_at-based cache invalidation that previously prevented stale cache hits when a voice is deleted and re-uploaded with different audio. The new approach relies on clear(voice_name) being called on delete. However, there is no guarantee the cache is cleared on the model-side instances (CosyVoice3, Fish Speech, VoxCPM2, OmniVoice each create their own VoiceEmbeddingCache() in __init__). The serving_speech.py delete_voice() only calls self._voice_cache.clear(voice_name_lower) on its own cache instance -- it has no reference to the per-model caches. This means:
- `serving_speech._voice_cache` gets cleared on delete (good)
- `cosyvoice3._voice_cache`, `fish_speech._voice_cache`, `voxcpm2._voice_cache`, `pipeline_omnivoice._voice_cache` all retain stale entries (bug)
The PR title says "global" speaker cache manager, but the implementation creates 5+ independent instances. Either make it truly global (singleton or injected reference), or propagate invalidation to model-level caches. This is a correctness bug.
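One way to make the cache truly global is a module-level accessor with double-checked locking, as a minimal sketch. The `get_voice_cache` name and the stub class here are illustrative, not the PR's actual API:

```python
import threading
from typing import Optional


class VoiceEmbeddingCache:
    """Minimal stand-in; the real class holds LRU state and its own lock."""

    def __init__(self, max_entries: int = 128) -> None:
        self.max_entries = max_entries


_cache_instance: Optional[VoiceEmbeddingCache] = None
_cache_init_lock = threading.Lock()


def get_voice_cache() -> VoiceEmbeddingCache:
    """Return the process-wide cache, creating it once (double-checked locking)."""
    global _cache_instance
    if _cache_instance is None:          # fast path: no lock on the hot path
        with _cache_init_lock:
            if _cache_instance is None:  # re-check under the lock
                _cache_instance = VoiceEmbeddingCache()
    return _cache_instance
```

With every model path and the serving layer calling `get_voice_cache()` instead of constructing `VoiceEmbeddingCache()`, a single `clear(voice_name)` on delete reaches all consumers.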
2. Each VoiceEmbeddingCache() defaults to 128 entries -- unbounded aggregate memory
With 5 independent caches (serving_speech, cosyvoice3, fish_speech, voxcpm2, omnivoice), the system can hold up to 640 cached voice entries. CosyVoice3 caches 4 tensors per voice (speech_feat, speech_token, speech_token_len, embedding). For long reference audio, speech_feat alone can be large. There is no aggregate memory limit, only an entry count limit. The memory_bytes() method exists but is never consulted for eviction decisions.
Consider adding a max_memory_bytes threshold that triggers eviction, or at minimum document the expected memory footprint per model type.
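A byte budget on top of the entry count could be enforced at `put()` time, along these lines. The class name, the 512 MiB default, and the caller-supplied size estimate are assumptions for illustration:

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """LRU cache bounded by both entry count and total estimated bytes."""

    def __init__(self, max_entries: int = 128,
                 max_bytes: int = 512 * 1024 * 1024) -> None:
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self._cache: "OrderedDict[str, tuple[object, int]]" = OrderedDict()
        self._total_bytes = 0

    def put(self, key: str, value: object, size_bytes: int) -> None:
        if size_bytes > self.max_bytes:
            return  # oversize entries are skipped rather than thrashing the cache
        if key in self._cache:
            self._total_bytes -= self._cache.pop(key)[1]
        self._cache[key] = (value, size_bytes)
        self._total_bytes += size_bytes
        # Evict from the LRU end until both budgets are satisfied.
        while (len(self._cache) > self.max_entries
               or self._total_bytes > self.max_bytes):
            _, (_, evicted_size) = self._cache.popitem(last=False)
            self._total_bytes -= evicted_size
```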
3. _resolve_uploaded_voice mutates request in-place -- surprising side effect
_resolve_uploaded_voice() modifies request.ref_audio and request.ref_text in place. This pattern is fragile: if the method is accidentally called twice (it is called in multiple code paths for omnivoice), the request state could become inconsistent. The guard request.ref_audio is not None at the top prevents double-injection, but it would be cleaner to return the resolved data rather than mutating.
Design Concerns
4. clear() with prefix matching is fragile
```python
keys_to_remove = [k for k in self._cache if k.startswith(f"{voice_name}:")]
```
If a voice is named "alice" and another is "alice_v2", calling clear("alice") will NOT remove "alice_v2:default" (the `:` prevents it), so this is actually fine. But if voice names ever contain `:`, it would break. Consider validating voice names at upload time to reject colons.
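Upload-time validation along these lines would rule out delimiter collisions entirely; `validate_voice_name` and the allowed character set are hypothetical, not code from the PR:

```python
import re

# Conservative whitelist: no ':' or '|' can ever enter a cache key.
_VOICE_NAME_RE = re.compile(r"[A-Za-z0-9_\-]{1,64}")


def validate_voice_name(name: str) -> str:
    """Reject names that could collide inside delimiter-based cache keys."""
    if not _VOICE_NAME_RE.fullmatch(name):
        raise ValueError(
            f"Invalid voice name {name!r}: use 1-64 chars from [A-Za-z0-9_-]"
        )
    return name.lower()
```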
5. serving_speech.py has duplicated VoxCPM2 handling
The _prepare_speech_generation method now has two separate VoxCPM2 branches:
- One `elif self._tts_model_type == "voxcpm2":` branch in the non-diffusion path
- A second `elif self._tts_model_type == "voxcpm2":` block further down, in what appears to be the diffusion/fallback path
This duplication is confusing and error-prone. It's unclear which branch executes for a given request. Please consolidate or add clear comments explaining when each branch is reached.
6. Voxtral voice cloning support removed silently
The _build_voxtral_prompt change removes support for ref_audio entirely and now raises ValueError("Voxtral requires a voice name (preset voice).") if no voice is provided. This is a breaking change for users who were using inline ref_audio with Voxtral. Should be documented in the PR description.
Minor Issues
7. _init_voice_storage() uses /tmp/voice_samples default
This is fine for development but concerning for production. The default path should be documented, and ideally the directory should be configurable without environment variables (e.g., via server config).
8. stats() method calls memory_bytes() outside the lock, then acquires lock again
```python
def stats(self) -> dict[str, Any]:
    memory = self.memory_bytes()  # acquires lock, releases
    with self._lock:              # acquires lock again
        return { ... }
```
This is not atomic -- entries could be added or removed between the two lock acquisitions, making memory_bytes inconsistent with entries. Consider computing everything under a single lock acquisition.
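A version that takes the lock exactly once might look like this sketch; the class, counters, and size estimate are simplified stand-ins for the real cache:

```python
import threading
from typing import Any


class CacheStats:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: dict[str, Any] = {}
        self._hits = 0
        self._misses = 0

    def _memory_bytes_locked(self) -> int:
        # Caller must hold self._lock; crude payload-size estimate.
        return sum(len(str(v)) for v in self._cache.values())

    def stats(self) -> dict[str, Any]:
        # Single lock acquisition: entries and memory_bytes come from
        # the same snapshot, so they can never disagree.
        with self._lock:
            return {
                "entries": len(self._cache),
                "memory_bytes": self._memory_bytes_locked(),
                "hits": self._hits,
                "misses": self._misses,
            }
```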
9. Tests removed for stale-cache protection
test_stale_cache_on_reupload, test_stale_cache_protection, test_make_cache_key_created_at_isolation, and test_created_at_zero_disables_cache are all removed. Since the new approach relies on explicit clear() on delete, there should be an integration test verifying that deleting and re-uploading a voice actually invalidates the model-side caches (not just the serving_speech cache). The new test_voice_cache_integration.py only tests a single cache instance.
10. _cosyvoice3_tokenizer attribute removal
The line `self._cosyvoice3_tokenizer = None` was removed from `__init__`. Verify this attribute isn't referenced elsewhere, as it would cause an AttributeError at runtime.
Summary
The core idea is good, but the "global" cache is not actually global -- each model creates its own instance, and invalidation on voice delete only reaches the serving layer's instance. This is the primary blocker. Please either make the cache a true singleton/shared instance, or add a mechanism to propagate invalidation to all model-level caches. The other issues (memory bounds, Voxtral breaking change, duplicated VoxCPM2 branches) should also be addressed.
Thanks for the nice review @lishunyang12

Q: should we leave only one?

On the other changes: everything is now computed under a single `with self._lock:`, tests are updated, and there are no remaining references to the removed attribute.
hsliuustc0106 left a comment:
Code Review: Global Speaker Cache Manager for Voice Cloning
CI Note: pre-commit fails (ruff check — import ordering in tests/conftest.py). Run pre-commit run --all-files locally to fix.
Critical
1. Loss of stale-cache protection (created_at) — regression risk
The old VoiceEmbeddingCache.make_cache_key() included created_at in the key so that deleting and re-uploading a voice with the same name but different audio would not hit stale cached embeddings. The new SpeakerEmbeddingCache.make_cache_key(speaker_name, model_type) drops this entirely.
The old tests explicitly tested this (test_stale_cache_on_reupload, test_stale_cache_protection, test_created_at_zero_disables_cache) — all removed in this PR. While delete_voice() now calls self._speaker_cache.clear(voice_name_lower), there's a gap: if a user deletes and re-uploads a voice between two concurrent requests, a race condition can serve the OLD cached artifacts for the NEW upload.
Recommendation: Either (a) add created_at back into the cache key, or (b) use a version counter atomically incremented on re-upload, or (c) document the known race and the trade-off.
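Option (b) could be sketched as follows; `VoiceVersions` and its method names are hypothetical, and a real implementation would bump the version under the same lock that clears the cache:

```python
import itertools


class VoiceVersions:
    """Track a per-voice version that is bumped on every (re-)upload."""

    def __init__(self) -> None:
        self._versions: dict[str, int] = {}
        self._counter = itertools.count(1)  # process-wide monotonic counter

    def bump(self, voice_name: str) -> int:
        """Call on upload/re-upload; old cache keys become unreachable."""
        v = next(self._counter)
        self._versions[voice_name] = v
        return v

    def make_cache_key(self, voice_name: str,
                       model_type: str) -> tuple[str, str, int]:
        # A re-uploaded voice gets a new version, so stale artifacts can
        # never be served for the new audio, even under concurrent requests.
        return (model_type, voice_name, self._versions.get(voice_name, 0))
```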
Warnings
2. _get_uploaded_audio_data() re-encodes to WAV on every cache miss
_get_uploaded_audio_data() reads the safetensors, decodes to numpy, then re-encodes as WAV via sf.write(buf, ...) just to get a base64 data URL. Previously the raw audio bytes were stored and base64-encoded directly — much cheaper. Consider caching the data URL string or benchmarking the overhead for large audio files.
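Caching the data-URL string could be as simple as memoizing the encode step. This sketch fabricates the audio bytes so it is self-contained; `data_url_for` and `_encode_wav_bytes` are hypothetical helpers, not the PR's functions:

```python
import base64
import functools


def _encode_wav_bytes(raw: bytes) -> str:
    """Stand-in for the expensive decode -> WAV re-encode -> base64 step."""
    return "data:audio/wav;base64," + base64.b64encode(raw).decode("ascii")


@functools.lru_cache(maxsize=64)
def data_url_for(voice_name: str) -> str:
    # In the real server this would read the voice's stored audio; here we
    # derive bytes from the name so the sketch runs standalone. The cache
    # means the re-encode happens at most once per voice name.
    raw = voice_name.encode("utf-8")
    return _encode_wav_bytes(raw)
```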
3. Voice name validation is inconsistent across paths
_resolve_uploaded_speaker() returns an error string but doesn't reject unknown voices for non-CosyVoice3/Fish/OmniVoice models. Meanwhile _prepare_speech_generation() for voxcpm2 raises directly, and _create_diffusion_speech() returns a 400 Response. Three different error-handling patterns for the same logical validation. Recommend extracting a _validate_voice_name() helper.
4. CosyVoice3 dynamic token length change mixed into cache refactoring
The change from character-based (len(request.input)) to token-based (extract_text_token(...)) is a behavioral change that could significantly affect generated audio length for multilingual text. This should ideally be a separate PR. At minimum, add a test verifying token count differs from char count for CJK text.
5. _resolve_uploaded_speaker() called with near-duplicate code in 3 places
Called in _prepare_speech_generation() for fish_tts/cosyvoice3, again for omnivoice, and again in _create_diffusion_speech(). Each call site manually applies results. Consider having the method mutate the request directly (with clear docs) or use a shared helper.
6. f-string in logger.warning/error calls (multiple locations)
e.g., logger.warning(f"Failed to delete audio file for '{name}': {e}") — should use lazy %s formatting.
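For reference, the difference between the eager f-string and lazy %s forms (message text taken from the example above):

```python
import logging

logger = logging.getLogger("speaker_cache")
name, e = "alice", OSError("disk full")

# Eager: the f-string is formatted even when WARNING is filtered out.
logger.warning(f"Failed to delete audio file for '{name}': {e}")

# Lazy: formatting is deferred until a handler actually emits the record.
logger.warning("Failed to delete audio file for '%s': %s", name, e)

# Both render to the same final message text:
assert ("Failed to delete audio file for '%s': %s" % (name, e)
        ) == f"Failed to delete audio file for '{name}': {e}"
```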
Suggestions
7. The | delimiter for cache keys is fragile. Consider using tuple keys (model_type, speaker_name) — no delimiter collision risk.
8. PR description checklist is all unchecked. The PR does add tests and docs, but the description should be filled in.
9. 42 commits is excessive. Consider squashing into 5-10 logical commits.
10. for_diffusion() sets _is_tts = False / _is_fish_speech = False but no _tts_stage. Could lead to AttributeError on diffusion-only instances.
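The tuple-key suggestion (item 7 above) can be demonstrated directly; the names here are made up:

```python
# Delimiter-based keys can collide when a name contains the delimiter:
flat_a = "cosyvoice3" + "|" + "alice|v2"
flat_b = "cosyvoice3|alice" + "|" + "v2"
assert flat_a == flat_b  # two different (model, speaker) pairs, same key

# Tuple keys hash each component separately, so no collision is possible:
tup_a = ("cosyvoice3", "alice|v2")
tup_b = ("cosyvoice3|alice", "v2")
assert tup_a != tup_b
```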
Looks Good
- Singleton pattern with double-checked locking is correct
- `fresh_speaker_cache` fixture for test isolation is well-designed
- Thread safety properly handled
- Byte-budget + entry-count dual eviction strategy is sensible
- Safetensors metadata round-trip for persistence is clean
- Voice name `|` validation prevents cache key collisions
- `clear(speaker_name)` cross-model-type invalidation solves the real stale-entry bug
- Good test coverage
Reviewed by Hermes Agent
```python
return f"{model_type}|{speaker_name}"
```

```python
def get(self, key: str) -> dict[str, Any] | None:
    """Return cached artifacts on hit. Promotes to MRU."""
```
🔴 Critical: The old `make_cache_key()` included `created_at` to prevent stale cache hits after a voice is deleted and re-uploaded with the same name but different audio. That protection is removed here. While `delete_voice()` now calls `clear(voice_name_lower)`, there is a race window between delete and re-upload in concurrent scenarios.
Consider adding `created_at` (or a version counter) back into the cache key.
@hsliuustc0106 thanks for the review: some replies to your comments:
linyueqian left a comment:
LGTM. Re-reviewed at 20e90d7 and the blockers from the earlier rounds are resolved:
- 🟢 True singleton via `get_speaker_cache()` in `vllm_omni/utils/speaker_cache.py`, used by `serving_speech.py` and all five model paths (Qwen3-TTS, Fish Speech, CosyVoice3, VoxCPM2, OmniVoice). `test_singleton_shared_across_call_sites` locks it in.
- 🟢 Byte budget: single 512 MiB cap, LRU eviction on `put()`, oversize entries skipped. Covered by `test_byte_budget_evicts` / `test_oversize_entry_skipped`.
- 🟢 Stale-cache protection: tuple key `(model_type, speaker_name, created_at)` plus `clear(speaker_name)` scanning by position `k[1]` invalidates every model-type slot on delete. `test_stale_cache_protection_delete_then_reupload` and `test_clear_matches_speaker_across_model_types` cover both axes.
- 🟢 Voxtral inline `ref_audio` restored in `_build_voxtral_prompt`.
- 🟢 Duplicated VoxCPM2 branch collapsed; `_apply_uploaded_speaker` consolidates the three prior call sites with consistent `raise ValueError(err)` handling.
- 🟢 Safetensors round-trip via `_speaker_metadata_to_header` / `_speaker_metadata_from_header` has unit coverage for ints, strings, None-stripping, malformed ints, and re-injected `file_path`.
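The metadata round-trip referenced above can be sketched as follows. The function names mirror those mentioned in the review, but the bodies and the `int_keys` set are assumptions; safetensors headers genuinely only carry string-to-string metadata, which is why the stringify/parse pair is needed:

```python
from typing import Any

_INT_KEYS = frozenset({"created_at", "sample_rate"})  # assumed field names


def speaker_metadata_to_header(meta: dict[str, Any]) -> dict[str, str]:
    """Safetensors metadata must be str -> str; stringify values, drop Nones."""
    return {k: str(v) for k, v in meta.items() if v is not None}


def speaker_metadata_from_header(header: dict[str, str]) -> dict[str, Any]:
    """Inverse mapping; malformed ints are dropped rather than crashing load."""
    out: dict[str, Any] = {}
    for k, v in header.items():
        if k in _INT_KEYS:
            try:
                out[k] = int(v)
            except ValueError:
                continue  # skip corrupt numeric fields
        else:
            out[k] = v
    return out
```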
Non-blocking nits for a follow-up if you feel like it:
- 🟢 [nit] `_apply_uploaded_speaker` still mutates the request in place. The idempotency guard makes it safe, but a name like `_apply_uploaded_speaker_in_place` would advertise the side effect.
- 🟢 [nit] For CosyVoice3 / Fish Speech, uploaded audio round-trips samples → WAV-base64 data URL → numpy. Memoized at the data-URL level so impact is bounded, but a direct `_load_uploaded_audio` shortcut that skips the re-encode would be cleaner; perf only.
- 🟢 [nit] `shutdown()` calls `self._speaker_cache.clear()`, which resets singleton hit/miss counters along with entries. Only matters if the serving instance is ever re-created in the same process.
- 🟢 [nit] `_estimate_tensor_bytes` ignoring non-tensor metadata is intentional (big tensors dominate the budget) but worth a one-line comment for future readers.
- 🟢 [nit] PR description checklist still unchecked (benchmark re-run).
Nice consolidation overall.
Resolve conflict and fix CI.
Thanks for the comments, guys! It's been a long way with this PR :) Please let me know if you'd like something else to get added.
Purpose
Consolidate speaker-embedding caching for all TTS backends (Qwen3TTS, FishSpeech, CosyVoice3, VoxCPM2, OmniVoice) behind one shared LRU cache, and make uploaded voices survive server restarts.
Changes
- `SpeakerEmbeddingCache` (LRU, byte + entry-count caps) replaces 5 per-model caches. Deleting a voice invalidates every model's cache at once.
- Uploaded voices are persisted as `.safetensors` in `~/.cache/vllm-omni/speakers/` (metadata in the header) and restored on server start.
- Voxtral inline `ref_audio` path restored.

New env vars added here:
- `SPEAKER_SAMPLES_DIR` (default `~/.cache/vllm-omni/speakers`)
- `SPEAKER_MAX_UPLOADED` (default `1000`)
- `SPEAKER_CACHE_MAX_BYTES`
- `SPEAKER_CACHE_MAX_ENTRIES`

Test Plan
- `tests/test_speaker_cache.py` (cache module, tuple keys, created_at isolation)
- `tests/test_speaker_cache_integration.py` (end-to-end upload/cache/delete, stale-cache race)
- `tests/test_fish_speech_cache.py`
- `pre-commit run --all-files`
- `benchmarks/fish-speech/bench_speaker_cache.py` pending

Test Result