
[TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning #2630

Open
JuanPZuluaga wants to merge 19 commits into vllm-project:main from JuanPZuluaga:feat/general-speaker-cache-manager

Conversation


@JuanPZuluaga JuanPZuluaga commented Apr 9, 2026


Purpose

Consolidate speaker-embedding caching for all TTS backends (Qwen3TTS, FishSpeech, CosyVoice3, VoxCPM2, OmniVoice) behind one shared LRU cache, and make uploaded voices survive server restarts.

Changes

  • Single process-wide SpeakerEmbeddingCache (LRU, with byte and entry-count caps to bound aggregate memory) replaces 5 per-model caches. Deleting a voice invalidates every model's cache at once.
  • Uploaded voices persist as .safetensors in ~/.cache/vllm-omni/speakers/ (metadata in the header). Restored on server start.
  • Fish Speech / CosyVoice3 reject unknown voice names with 400.
  • Voxtral: inline ref_audio path restored.

New environment variables:

| Variable | Default |
| --- | --- |
| SPEAKER_SAMPLES_DIR | ~/.cache/vllm-omni/speakers |
| SPEAKER_MAX_UPLOADED | 1000 |
| SPEAKER_CACHE_MAX_BYTES | 512 MiB |
| SPEAKER_CACHE_MAX_ENTRIES | 1024 |
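A hypothetical helper showing how these variables might be read with the defaults above; the actual parsing in the PR may differ.

```python
import os


def speaker_cache_config() -> dict:
    """Read the speaker-cache environment variables with their documented defaults.

    Illustrative only; the helper name and return shape are assumptions.
    """
    return {
        "samples_dir": os.environ.get(
            "SPEAKER_SAMPLES_DIR",
            os.path.expanduser("~/.cache/vllm-omni/speakers"),
        ),
        "max_uploaded": int(os.environ.get("SPEAKER_MAX_UPLOADED", "1000")),
        "max_bytes": int(
            os.environ.get("SPEAKER_CACHE_MAX_BYTES", str(512 * 1024 * 1024))
        ),
        "max_entries": int(os.environ.get("SPEAKER_CACHE_MAX_ENTRIES", "1024")),
    }
```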

Test Plan

  • Unit tests — tests/test_speaker_cache.py (cache module, tuple keys, created_at isolation)
  • Integration tests — tests/test_speaker_cache_integration.py (end-to-end upload/cache/delete, stale-cache race)
  • Per-model cache tests — Fish Speech (tests/test_fish_speech_cache.py)
  • Pre-commit passes locally (pre-commit run --all-files)
  • Benchmark re-run on benchmarks/fish-speech/bench_speaker_cache.py pending

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@JuanPZuluaga changed the title from "[TTS][SpeakerCacheManager] Feat/general speaker cache manager" to "[TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning" on Apr 9, 2026
@JuanPZuluaga
Contributor Author

things to do:

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@linyueqian
Collaborator

is it ready to be reviewed? I need to merge the voxcpm2 perf optimization (#2690) first, so this may need to wait for that one.

@JuanPZuluaga
Contributor Author

> is it ready to be reviewed? I need to merge the voxcpm2 perf optimization (#2690) first, so this may need to wait for that one.

True, thanks for the heads up

Collaborator

@lishunyang12 lishunyang12 left a comment


Review: [TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning

Overall this is a solid improvement -- consolidating voice caching across all TTS backends (Fish Speech, CosyVoice3, OmniVoice, VoxCPM2, Qwen3 TTS) with a shared VoiceEmbeddingCache is the right direction. The LRU eviction, thread safety via lock, and per-voice clear() on delete are all welcome. However, I see several issues that should be addressed before merging.


Critical Issues

1. Stale cache on voice re-upload (regression)

The PR removes the created_at-based cache invalidation that previously prevented stale cache hits when a voice is deleted and re-uploaded with different audio. The new approach relies on clear(voice_name) being called on delete. However, there is no guarantee the cache is cleared on the model-side instances (CosyVoice3, Fish Speech, VoxCPM2, OmniVoice each create their own VoiceEmbeddingCache() in __init__). The serving_speech.py delete_voice() only calls self._voice_cache.clear(voice_name_lower) on its own cache instance -- it has no reference to the per-model caches. This means:

  • serving_speech._voice_cache gets cleared on delete (good)
  • cosyvoice3._voice_cache, fish_speech._voice_cache, voxcpm2._voice_cache, pipeline_omnivoice._voice_cache all retain stale entries (bug)

The PR title says "global" speaker cache manager, but the implementation creates 5+ independent instances. Either make it truly global (singleton or injected reference), or propagate invalidation to model-level caches. This is a correctness bug.

2. Each VoiceEmbeddingCache() defaults to 128 entries -- unbounded aggregate memory

With 5 independent caches (serving_speech, cosyvoice3, fish_speech, voxcpm2, omnivoice), the system can hold up to 640 cached voice entries. CosyVoice3 caches 4 tensors per voice (speech_feat, speech_token, speech_token_len, embedding). For long reference audio, speech_feat alone can be large. There is no aggregate memory limit, only an entry count limit. The memory_bytes() method exists but is never consulted for eviction decisions.

Consider adding a max_memory_bytes threshold that triggers eviction, or at minimum document the expected memory footprint per model type.

3. _resolve_uploaded_voice mutates request in-place -- surprising side effect

_resolve_uploaded_voice() modifies request.ref_audio and request.ref_text in place. This pattern is fragile: if the method is accidentally called twice (it is called in multiple code paths for omnivoice), the request state could become inconsistent. The guard request.ref_audio is not None at the top prevents double-injection, but it would be cleaner to return the resolved data rather than mutating.


Design Concerns

4. clear() with prefix matching is fragile

keys_to_remove = [k for k in self._cache if k.startswith(f"{voice_name}:")]

If a voice is named "alice" and another is "alice_v2", calling clear("alice") will NOT remove "alice_v2:default" (the : prevents it), so this is actually fine. But if voice names ever contain :, it would break. Consider validating voice names at upload time to reject colons.
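A small demonstration of this concern (cache contents here are hypothetical):

```python
# Hypothetical cache keyed as "{voice_name}:{suffix}", as in the review snippet.
cache = {"alice:default": 1, "alice_v2:default": 2, "a:b:default": 3}


def clear(voice_name: str) -> None:
    # Prefix match with a trailing ":" guard, mirroring the reviewed code.
    for k in [k for k in cache if k.startswith(f"{voice_name}:")]:
        del cache[k]


clear("alice")  # "alice_v2:default" survives: the ":" guards the prefix
assert "alice_v2:default" in cache and "alice:default" not in cache

clear("a")      # but a voice literally named "a" wrongly sweeps up "a:b:default"

# Tuple keys avoid the delimiter problem entirely: no string can make
# ("fish", "a:b") collide with ("fish:a", "b").
assert ("fish", "a:b") != ("fish:a", "b")
```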

5. serving_speech.py has duplicated VoxCPM2 handling

The _prepare_speech_generation method now has two separate VoxCPM2 branches:

  • Lines around the elif self._tts_model_type == "voxcpm2": in the non-diffusion path
  • A second elif self._tts_model_type == "voxcpm2": block further down in what appears to be the diffusion/fallback path

This duplication is confusing and error-prone. It's unclear which branch executes for a given request. Please consolidate or add clear comments explaining when each branch is reached.

6. Voxtral voice cloning support removed silently

The _build_voxtral_prompt change removes support for ref_audio entirely and now raises ValueError("Voxtral requires a voice name (preset voice).") if no voice is provided. This is a breaking change for users who were using inline ref_audio with Voxtral. Should be documented in the PR description.


Minor Issues

7. _init_voice_storage() uses /tmp/voice_samples default

This is fine for development but concerning for production. The default path should be documented, and ideally the directory should be configurable without environment variables (e.g., via server config).

8. stats() method calls memory_bytes() outside the lock, then acquires lock again

def stats(self) -> dict[str, Any]:
    memory = self.memory_bytes()  # acquires lock, releases
    with self._lock:              # acquires lock again
        return { ... }

This is not atomic -- entries could be added/removed between the two lock acquisitions, making memory_bytes inconsistent with entries. Consider computing everything under a single lock acquisition.
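A sketch of the suggested fix, computing everything under a single lock acquisition. The class and field names here are illustrative, not the PR's code.

```python
import threading
from typing import Any


class CacheStatsSketch:
    """Hypothetical fragment showing stats computed atomically under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache: dict[str, dict[str, Any]] = {}
        self._hits = 0
        self._misses = 0

    @staticmethod
    def _entry_bytes(entry: dict[str, Any]) -> int:
        # Assumption: cached artifacts expose nbytes (numpy arrays / tensors).
        return sum(getattr(v, "nbytes", 0) for v in entry.values())

    def stats(self) -> dict[str, Any]:
        # One lock acquisition: entry count and byte total cannot drift apart.
        with self._lock:
            return {
                "entries": len(self._cache),
                "memory_bytes": sum(
                    self._entry_bytes(e) for e in self._cache.values()
                ),
                "hits": self._hits,
                "misses": self._misses,
            }
```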

9. Tests removed for stale-cache protection

test_stale_cache_on_reupload, test_stale_cache_protection, test_make_cache_key_created_at_isolation, and test_created_at_zero_disables_cache are all removed. Since the new approach relies on explicit clear() on delete, there should be an integration test verifying that deleting and re-uploading a voice actually invalidates the model-side caches (not just the serving_speech cache). The new test_voice_cache_integration.py only tests a single cache instance.

10. _cosyvoice3_tokenizer attribute removal

The line self._cosyvoice3_tokenizer = None was removed from __init__. Verify this attribute isn't referenced elsewhere, as any lingering reference would raise AttributeError at runtime.


Summary

The core idea is good, but the "global" cache is not actually global -- each model creates its own instance, and invalidation on voice delete only reaches the serving layer's instance. This is the primary blocker. Please either make the cache a true singleton/shared instance, or add a mechanism to propagate invalidation to all model-level caches. The other issues (memory bounds, Voxtral breaking change, duplicated VoxCPM2 branches) should also be addressed.

@lishunyang12 lishunyang12 dismissed their stale review April 16, 2026 14:56

Replacing with inline comments

@JuanPZuluaga
Contributor Author

Thanks for the nice review @lishunyang12

#1 Stale cache on re-upload (regression): made VoiceEmbeddingCache a true singleton accessed via get_voice_cache(); all 5 model backends and the serving layer share one instance, and keys are namespaced as {model_type}|{voice_name} so clear(voice_name) on delete reaches every model slot.

#2 Aggregate memory unbounded: fixed as well. There is now a single global cache with both max_entries (default 1024) and max_bytes (default 512 MiB) eviction, configurable via VOICE_CACHE_MAX_ENTRIES / VOICE_CACHE_MAX_BYTES.

Q: should we keep only one of the two limits?

#3 _resolve_uploaded_voice mutation: refactored to a pure function returning (error, ref_audio, ref_text); the 3 call sites apply the values explicitly.

#4 Fragile prefix matching in clear(): it now uses exact second-segment comparison after split("|", 1), and voice names containing | are rejected at upload time.

#5 Duplicated VoxCPM2 branches: actually, there is only one VoxCPM2 branch; the second one is OmniVoice in the diffusion path.

#6 Voxtral ref_audio removal: restored. _build_voxtral_prompt now accepts both a preset voice and inline ref_audio, and the "not yet released" notes were removed from the two Voxtral docs.

#7 /tmp/voice_samples default: the default is now ~/.cache/vllm-omni/voices (survives reboots), configurable via SPEECH_VOICE_SAMPLES, with a 1000-voice cap via SPEECH_MAX_UPLOADED_VOICES.

Other changes: stats() now computes everything under a single with self._lock block; tests updated; no remaining references to _cosyvoice3_tokenizer.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Code Review: Global Speaker Cache Manager for Voice Cloning

CI Note: pre-commit fails (ruff check — import ordering in tests/conftest.py). Run pre-commit run --all-files locally to fix.


Critical

1. Loss of stale-cache protection (created_at) — regression risk

The old VoiceEmbeddingCache.make_cache_key() included created_at in the key so that deleting and re-uploading a voice with the same name but different audio would not hit stale cached embeddings. The new SpeakerEmbeddingCache.make_cache_key(speaker_name, model_type) drops this entirely.

The old tests explicitly tested this (test_stale_cache_on_reupload, test_stale_cache_protection, test_created_at_zero_disables_cache) — all removed in this PR. While delete_voice() now calls self._speaker_cache.clear(voice_name_lower), there's a gap: if a user deletes and re-uploads a voice between two concurrent requests, a race condition can serve the OLD cached artifacts for the NEW upload.

Recommendation: Either (a) add created_at back into the cache key, or (b) use a version counter atomically incremented on re-upload, or (c) document the known race and the trade-off.
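A sketch of option (a): fold created_at into the key so a re-uploaded voice can never hit entries cached for its deleted predecessor. The signature here is an assumption for illustration.

```python
def make_cache_key(
    speaker_name: str, model_type: str, created_at: float = 0
) -> tuple[str, str, int]:
    # created_at distinguishes a re-uploaded voice from a deleted one with the
    # same name, so stale artifacts keyed on the old timestamp are unreachable.
    return (model_type, speaker_name, int(created_at))


old = make_cache_key("alice", "fish_speech", created_at=1700000000)
new = make_cache_key("alice", "fish_speech", created_at=1700000042)
assert old != new  # re-upload gets a fresh key; old cached entries go cold
```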


Warnings

2. _get_uploaded_audio_data() re-encodes to WAV on every cache miss

_get_uploaded_audio_data() reads the safetensors, decodes to numpy, then re-encodes as WAV via sf.write(buf, ...) just to get a base64 data URL. Previously the raw audio bytes were stored and base64-encoded directly — much cheaper. Consider caching the data URL string or benchmarking the overhead for large audio files.
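A sketch of the suggested memoization. Here `encode_wav` stands in for the safetensors-decode plus sf.write path; the function name and the `_ref_audio_data_url` slot are hypothetical.

```python
import base64
from typing import Callable


def get_uploaded_audio_data_url(info: dict, encode_wav: Callable[[], bytes]) -> str:
    """Memoize the base64 WAV data URL on the speaker's info dict.

    The first call pays the decode/re-encode cost; later calls return the
    cached string. Illustrative only; not the PR's actual implementation.
    """
    url = info.get("_ref_audio_data_url")
    if url is None:
        wav_bytes = encode_wav()  # expensive: decode safetensors, re-encode WAV
        url = "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode()
        info["_ref_audio_data_url"] = url
    return url
```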

3. Voice name validation is inconsistent across paths

_resolve_uploaded_speaker() returns an error string but doesn't reject unknown voices for non-CosyVoice3/Fish/OmniVoice models. Meanwhile _prepare_speech_generation() for voxcpm2 raises directly, and _create_diffusion_speech() returns a 400 Response. Three different error-handling patterns for the same logical validation. Recommend extracting a _validate_voice_name() helper.

4. CosyVoice3 dynamic token length change mixed into cache refactoring

The change from character-based (len(request.input)) to token-based (extract_text_token(...)) is a behavioral change that could significantly affect generated audio length for multilingual text. This should ideally be a separate PR. At minimum, add a test verifying token count differs from char count for CJK text.

5. _resolve_uploaded_speaker() called with near-duplicate code in 3 places

Called in _prepare_speech_generation() for fish_tts/cosyvoice3, again for omnivoice, and again in _create_diffusion_speech(). Each call site manually applies results. Consider having the method mutate the request directly (with clear docs) or use a shared helper.

6. f-string in logger.warning/error calls (multiple locations)

e.g., logger.warning(f"Failed to delete audio file for '{name}': {e}") — should use lazy %s formatting.
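For example (logger name and message are illustrative):

```python
import logging

logger = logging.getLogger("speaker_cache_demo")

name, err = "alice", OSError("disk full")

# Eager: the f-string is formatted even when WARNING is filtered out.
#   logger.warning(f"Failed to delete audio file for '{name}': {err}")

# Lazy: formatting is deferred until a handler actually emits the record.
logger.warning("Failed to delete audio file for '%s': %s", name, err)
```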


Suggestions

7. The | delimiter for cache keys is fragile. Consider using tuple keys (model_type, speaker_name) — no delimiter collision risk.

8. PR description checklist is all unchecked. The PR does add tests and docs, but the description should be filled in.

9. 42 commits is excessive. Consider squashing into 5-10 logical commits.

10. for_diffusion() sets _is_tts = False / _is_fish_speech = False but no _tts_stage. Could lead to AttributeError on diffusion-only instances.


Looks Good

  • Singleton pattern with double-checked locking is correct
  • fresh_speaker_cache fixture for test isolation is well-designed
  • Thread safety properly handled
  • Byte-budget + entry-count dual eviction strategy is sensible
  • Safetensors metadata round-trip for persistence is clean
  • Voice name | validation prevents cache key collisions
  • clear(speaker_name) cross-model-type invalidation solves the real stale-entry bug
  • Good test coverage
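The double-checked-locking singleton pattern noted above, in miniature. A plain dict stands in for the real cache class, and the module-level names are hypothetical; the actual details in vllm_omni/utils/speaker_cache.py may differ.

```python
import threading

_speaker_cache = None
_singleton_lock = threading.Lock()


def get_speaker_cache():
    """Return the process-wide cache, creating it at most once."""
    global _speaker_cache
    if _speaker_cache is None:            # first check: lock-free fast path
        with _singleton_lock:
            if _speaker_cache is None:    # second check: under the lock
                _speaker_cache = {}       # stand-in for SpeakerEmbeddingCache()
    return _speaker_cache
```

The second check matters: two threads can both pass the first check before either holds the lock, and without it the loser would overwrite the winner's instance.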

Reviewed by Hermes Agent

Comment thread on vllm_omni/utils/speaker_cache.py (Outdated):
return f"{model_type}|{speaker_name}"

def get(self, key: str) -> dict[str, Any] | None:
"""Return cached artifacts on hit. Promotes to MRU."""
Collaborator


🔴 Critical: The old make_cache_key() included created_at to prevent stale cache hits after a voice is deleted and re-uploaded with the same name but different audio. That component is removed here. While delete_voice() now calls clear() on the cache, there is a race window between delete and re-upload in concurrent scenarios.

Consider adding created_at (or a version counter) back into the cache key.

@JuanPZuluaga JuanPZuluaga force-pushed the feat/general-speaker-cache-manager branch from 1bee529 to 9571741 Compare April 17, 2026 13:08
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Apr 17, 2026

@hsliuustc0106 thanks for the review:

some replies to your comments:

  1. created_at in cache key is fixed. make_cache_key(speaker_name, model_type, created_at=0) now returns (model_type, speaker_name, int(created_at)). All 5 model backends pass created_at=int(info_dict.get("voice_created_at") or 0).

  2. WAV re-encode on every request is fixed. _get_uploaded_audio_data() now memoizes the base64 data URL under uploaded_speakers[name]["_ref_audio_data_url"]; first request pays the encode cost, subsequent requests return the cached string.

  3. Inconsistent voice-name validation: I agree it is a bit messy, but the three paths (fish/cosyvoice3, voxcpm2, diffusion) have different request shapes and error contracts. Refactoring to a shared _validate_voice_name() belongs in a follow-up to avoid bloating this PR.

  4. CosyVoice3 extract_text_token: this is already in main (pre-existing). No behavioral change introduced here.

  5. _resolve_uploaded_speaker duplication: same as point 3 above; the three call sites consume the result differently. I can address this in a follow-up PR.

  6. f-string loggers: fixed now; all 10 occurrences converted to lazy %s formatting.

  7. | delimiter vs. tuple keys: fixed. Keys are now tuple[str, str, int]. The | name-validation check and the related doc line were removed since the collision risk is gone.

  8. PR description — Updated. Checklist filled, summary reflects final scope.

  9. Squashed the commits down to a few logical ones.

  10. for_diffusion() missing _tts_stage: fixed now.

@linyueqian added the "ready label to trigger buildkite CI" label on Apr 21, 2026
Collaborator

@linyueqian linyueqian left a comment


LGTM. Re-reviewed at 20e90d7 and the blockers from the earlier rounds are resolved:

  • 🟢 True singleton via get_speaker_cache() in vllm_omni/utils/speaker_cache.py, used by serving_speech.py and all five model paths (Qwen3-TTS, Fish Speech, CosyVoice3, VoxCPM2, OmniVoice). test_singleton_shared_across_call_sites locks it in.
  • 🟢 Byte budget: single 512 MiB cap, LRU eviction on put(), oversize entries skipped. Covered by test_byte_budget_evicts / test_oversize_entry_skipped.
  • 🟢 Stale-cache protection: tuple key (model_type, speaker_name, created_at) + clear(speaker_name) scanning by position k[1] invalidates every model-type slot on delete. test_stale_cache_protection_delete_then_reupload and test_clear_matches_speaker_across_model_types cover both axes.
  • 🟢 Voxtral inline ref_audio restored in _build_voxtral_prompt.
  • 🟢 Duplicated VoxCPM2 branch collapsed; _apply_uploaded_speaker consolidates the three prior call sites with consistent raise ValueError(err) handling.
  • 🟢 Safetensors round-trip via _speaker_metadata_to_header / _speaker_metadata_from_header has unit coverage for ints, strings, None-stripping, malformed ints, and re-injected file_path.

Non-blocking nits for a follow-up if you feel like it:

  • 🟢 [nit] _apply_uploaded_speaker still mutates the request in place. Idempotency guard makes it safe, but a name like _apply_uploaded_speaker_in_place would advertise the side effect.
  • 🟢 [nit] For CosyVoice3 / Fish Speech, uploaded audio round-trips samples → WAV-base64 data URL → numpy. Memoized at the data-URL level so impact is bounded, but a direct _load_uploaded_audio shortcut that skips the re-encode would be cleaner; perf only.
  • 🟢 [nit] shutdown() calls self._speaker_cache.clear() which resets singleton hit/miss counters along with entries. Only matters if the serving instance is ever re-created in the same process.
  • 🟢 [nit] _estimate_tensor_bytes ignoring non-tensor metadata is intentional (big tensors dominate the budget) but worth a one-line comment for future readers.
  • 🟢 [nit] PR description checklist still unchecked (benchmark re-run).

Nice consolidation overall.

@lishunyang12
Collaborator

Resolve conflict and fix CI.

@JuanPZuluaga
Contributor Author

Thanks for the comments, everyone! It's been a long journey with this PR :) Please let me know if you'd like anything else added.
