[Frontend] Add voice clone prompt cache endpoint for Qwen3-TTS (#1760) #2457

Open
reidliu41 wants to merge 11 commits into vllm-project:main from reidliu41:feat/voice-cache-endpoint-1760

Conversation

@reidliu41
Contributor

@reidliu41 reidliu41 commented Apr 2, 2026

Purpose

Add POST /v1/audio/voices/{name}/cache for uploaded Qwen3-TTS voices.

This change pre-computes speaker embedding and reference audio codec codes on the
TTS worker through collective_rpc, persists them as safetensors, and lets
subsequent TTS requests reuse the cached voice_clone_prompt instead of
reprocessing reference audio on every request.

Value:

Test Plan

Manual end-to-end validation on a local Omni server:

# Start server
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
SPEECH_VOICE_SAMPLES=/tmp/voice_samples_1760 \
./.venv/bin/vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --omni \
  --port 8091 \
  --gpu-memory-utilization 0.15

# Generate a local reference clip
ffmpeg -y -f lavfi \
  -i "flite=text='This is a cache validation reference clip.':voice=slt" \
  -ar 24000 -ac 1 /tmp/voice-cache-e2e/ref.wav

# Upload an audio voice with ref_text
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices \
  -F audio_sample=@/tmp/voice-cache-e2e/ref.wav \
  -F consent=consent_001 \
  -F name=voicecachee2e \
  -F ref_text='This is a cache validation reference clip.'

# Generate cache
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices/voicecachee2e/cache

# Re-run cache generation to verify idempotent ready behavior
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices/voicecachee2e/cache

# Inspect metadata
jq '.uploaded_speakers.voicecachee2e' /tmp/voice_samples_1760/metadata.json

# Inspect safetensors cache contents
./.venv/bin/python - <<'PY'
import json
from safetensors import safe_open

with open('/tmp/voice_samples_1760/metadata.json', 'r', encoding='utf-8') as f:
    meta = json.load(f)

cache_file = meta['uploaded_speakers']['voicecachee2e']['cache_file']
print("CACHE_FILE", cache_file)

with safe_open(cache_file, framework='pt', device='cpu') as f:
    print("KEYS", list(f.keys()))
    print("META", f.metadata())
PY

# Run cached TTS
curl -sS -D /tmp/voice-cache-e2e/speech-fixed.headers \
  -o /tmp/voice-cache-e2e/speech-fixed.out \
  http://127.0.0.1:8091/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"This is the cached prompt synthesis check after the fix.","voice":"voicecachee2e","response_format":"wav"}'

# Move the original uploaded audio away and run cached-only TTS
mv /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav \
   /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav.bak

curl -sS -D /tmp/voice-cache-e2e/speech-cached-only.headers \
  -o /tmp/voice-cache-e2e/speech-cached-only.out \
  http://127.0.0.1:8091/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"This request runs after removing the original uploaded audio.","voice":"voicecachee2e","response_format":"wav"}'

# Restore the original uploaded audio after validation
mv /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav.bak \
   /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav

Test Result

  Manual validation passed.

  Observed results:

  - Voice upload returned `200`:

    {"success":true,"voice":{"name":"voicecachee2e","consent":"consent_001","created_at":1775133885,"mime_type":"audio/wav","file_size":132798,"ref_text":"This is a cache validation reference clip."}}

  - Cache generation returned:

    {"voice":"voicecachee2e","cache_status":"ready"}
  - Repeated cache generation returned the expected idempotent response:

    {"voice":"voicecachee2e","cache_status":"ready","message":"Cache already exists and is valid"}
  - Metadata showed a ready audio-backed cache:

    {
      "name": "voicecachee2e",
      "consent": "consent_001",
      "file_path": "/tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav",
      "ref_text": "This is a cache validation reference clip.",
      "cache_status": "ready",
      "cache_file": "/tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.safetensors",
      "cache_generated_at": 1775133893.501347,
      "embedding_source": "audio"
    }
  - The generated safetensors cache contained both cached speaker embedding and cached ref_code:

    KEYS ['__len__', 'item_0_has_ref_code', 'item_0_icl_mode', 'item_0_ref_code', 'item_0_ref_spk_embedding', 'item_0_x_vector_only_mode']
    META {'item_0_ref_text': 'This is a cache validation reference clip.'}
  - Cached TTS returned 200 OK with audio/wav:

    HTTP/1.1 200 OK
    content-type: audio/wav
    /tmp/voice-cache-e2e/speech-fixed.out: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 24000 Hz
    -rw-rw-r-- 1 xx xx 196K /tmp/voice-cache-e2e/speech-fixed.out
  - Server logs confirmed the cached path was used:

    Using cached voice_clone_prompt for: voicecachee2e (icl=True)
  - Cached-only TTS still returned 200 OK after temporarily removing the original uploaded audio:

    HTTP/1.1 200 OK
    content-type: audio/wav
    /tmp/voice-cache-e2e/speech-cached-only.out: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 24000 Hz
    -rw-rw-r-- 1 xx xx 218K /tmp/voice-cache-e2e/speech-cached-only.out
  - Server logs again confirmed cached prompt reuse, with no raw-audio fallback:

    Using cached voice_clone_prompt for: voicecachee2e (icl=True)

@reidliu41 reidliu41 requested a review from hsliuustc0106 as a code owner April 2, 2026 12:57

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d4e94cd0a

Comment on lines +791 to +797
if not self.metadata_manager.update_speaker(
    voice_key,
    {
        "cache_status": "processing",
        "cache_generated_at": now,
    },
):

P1 Badge Prevent cache rebuild from recreating deleted voice metadata

This path marks a voice as processing via metadata_manager.update_speaker(...) after only an in-memory existence check, but update_speaker creates missing entries when the key is absent. If another worker deletes the voice between those two operations, this call resurrects a partial metadata record and can leave a zombie voice after rollback/failure. In multi-process deployments, a DELETE racing with /cache can therefore corrupt metadata.json instead of failing cleanly.
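
One way to close this race is an update that refuses to create missing entries. The sketch below is a minimal illustration, assuming a single process where a lock can serialize metadata access; `MetadataStore` and `update_existing` are hypothetical names, not the PR's `metadata_manager` API, and a multi-process deployment would need a file lock or similar around the same check.

```python
from __future__ import annotations

import threading


class MetadataStore:
    """Hypothetical in-memory stand-in for the speaker metadata manager."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._speakers: dict[str, dict] = {}

    def add(self, key: str, info: dict) -> None:
        with self._lock:
            self._speakers[key] = dict(info)

    def delete(self, key: str) -> bool:
        with self._lock:
            return self._speakers.pop(key, None) is not None

    def update_existing(self, key: str, updates: dict) -> bool:
        """Apply updates only if the key still exists; never create it."""
        with self._lock:
            entry = self._speakers.get(key)
            if entry is None:
                return False  # deleted concurrently: the caller should fail cleanly
            entry.update(updates)
            return True

    def get(self, key: str) -> dict | None:
        with self._lock:
            entry = self._speakers.get(key)
            return dict(entry) if entry is not None else None


store = MetadataStore()
store.add("voicecachee2e", {"cache_status": "pending"})
store.delete("voicecachee2e")  # simulates a DELETE racing with /cache
resurrected = store.update_existing("voicecachee2e", {"cache_status": "processing"})
```

Because the existence check and the write happen under the same lock, the deleted voice stays deleted and the cache request can return an error instead of leaving a partial record.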

Comment on lines +1354 to +1356
# ── No cache (pending/failed/processing): raw audio path ──
self._fallback_to_raw_audio(request.voice, speaker_info, params)
_uploaded_voice_resolved = True

P2 Badge Keep request-level Base overrides for uncached uploaded voices

In the raw-audio fallback branch for uploaded voices without a ready cache, _uploaded_voice_resolved is set to True, which skips the later merge of request-level ref_text/x_vector_only_mode/speaker_embedding. As a result, requests for pending/failed/processing uploaded voices now silently ignore valid per-request Base cloning overrides and always use upload-time defaults from _fallback_to_raw_audio, which regresses prior behavior. The override suppression should be limited to cases where a cached/direct prompt was actually used.
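
The intended rule can be sketched as a small merge function. This is an illustration only: `merge_request_overrides` and the `used_cached_prompt` flag are hypothetical names, and the three override keys are the ones listed in the comment above.

```python
def merge_request_overrides(params: dict, request_overrides: dict, used_cached_prompt: bool) -> dict:
    """Return a copy of params with request-level overrides applied,
    unless a cached prompt was used (its stored semantics then win)."""
    merged = dict(params)
    if used_cached_prompt:
        return merged
    for key in ("ref_text", "x_vector_only_mode", "speaker_embedding"):
        value = request_overrides.get(key)
        if value is not None:
            merged[key] = value
    return merged


upload_defaults = {"ref_text": "upload-time text", "x_vector_only_mode": False}
overrides = {"ref_text": "per-request text"}

# Raw-audio fallback: the per-request ref_text is honored.
fallback = merge_request_overrides(upload_defaults, overrides, used_cached_prompt=False)
# Cached prompt: the upload-time values are kept.
cached = merge_request_overrides(upload_defaults, overrides, used_cached_prompt=True)
```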

@reidliu41 reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 6619916 to c6e9e19 Compare April 2, 2026 13:09
Collaborator

@lishunyang12 lishunyang12 left a comment

Left a couple of comments, mostly around hot-path performance.

emb_path = Path(cache_file_str)
if not _validate_path_within_directory(emb_path, self.uploaded_speakers_dir):
    raise ValueError("Illegal cache path outside voice samples directory")
if not emb_path.is_file() or emb_path.suffix != ".safetensors":
Collaborator

This loads and deserializes the safetensors file on every single TTS request for direct-embedding voices. That's synchronous disk I/O on the request hot path — kind of defeats the purpose of caching.

Could you load this once (e.g. at upload time or first access) and keep the embedding list in memory, similar to how the audio-cache path uses load_cached_voice_prompt?

Comment thread vllm_omni/worker/base.py
icl_mode = ref_text is not None and ref_text.strip() != ""

if icl_mode and not hasattr(model, "_encode_ref_audio_to_code"):
    raise NotImplementedError(f"{type(model).__name__} does not support ref audio codec encoding")
Collaborator

wav_np.tolist() converts the entire waveform to a Python list of floats before sending over RPC. For a 10s clip at 24kHz that's 240k Python float objects — roughly 10x the memory of the numpy array.

Worth checking if the RPC layer can handle numpy arrays or bytes directly. If not, at least document why this is necessary.
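
The size gap can be measured with the standard library alone. The sketch below uses `array.array("f", ...)` as a stand-in for a float32 NumPy waveform (an assumption made to keep the example dependency-free); the exact ratio depends on the interpreter and the sample dtype.

```python
import sys
from array import array

SAMPLE_RATE = 24_000
DURATION_S = 10
NUM_SAMPLES = SAMPLE_RATE * DURATION_S

# Stand-in waveform: 240,000 packed float32 samples.
samples = array("f", (i / NUM_SAMPLES for i in range(NUM_SAMPLES)))

# tolist() creates one Python float object per sample plus a list of pointers.
as_list = samples.tolist()

packed_bytes = len(samples) * samples.itemsize
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
ratio = list_bytes / packed_bytes

print(f"packed: {packed_bytes} bytes, list: {list_bytes} bytes, ratio: {ratio:.1f}x")
```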

updates["cache_generated_at"] = cache_generated_at
if not self.metadata_manager.update_speaker(voice_key, updates):
    logger.error("Failed to rollback cache state for voice %s to disk", voice_key)
if voice_key in self.uploaded_speakers:
Collaborator

Nit: except Exception is fine for the rollback, but the docstring says "returns plain Python types only (must survive msgspec IPC)" over in the worker — same constraint applies to wav_samples arg. Might be worth a brief comment here explaining why tolist() is needed for the audio data too (msgspec can't handle numpy).

@linyueqian
Collaborator

@JuanPZuluaga ptal

@JuanPZuluaga
Contributor

Hi @linyueqian @lishunyang12 @reidliu41, I think this PR overlaps a bit with #2108.

In that PR, we use an in-memory LRU cache with a voice_name:created_at:mode key, which prevents stale cache hits after delete + re-upload. It has no safetensors, no metadata.json, and no file locks, just a thread-safe dict. The API surface (upload/list/delete) is the same. It would be great to coordinate so we don't duplicate effort.

Ideally, the voice cache manager should handle all model types that support voice cloning, and I'll work on that as soon as #2108 is merged into main.
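
For readers unfamiliar with the keying scheme mentioned above, here is a minimal sketch of an LRU cache keyed by voice_name:created_at:mode. It is not the code from #2108; `SpeakerLRU`, `make_key`, and the "icl" mode label are assumptions made for illustration.

```python
import threading
from collections import OrderedDict


class SpeakerLRU:
    """Hypothetical thread-safe LRU keyed by name:created_at:mode."""

    def __init__(self, max_entries: int = 2) -> None:
        self._max = max_entries
        self._lock = threading.Lock()
        self._items = OrderedDict()

    @staticmethod
    def make_key(name: str, created_at: int, mode: str) -> str:
        return f"{name}:{created_at}:{mode}"

    def get(self, key: str):
        with self._lock:
            if key not in self._items:
                return None
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]

    def put(self, key: str, value) -> None:
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            while len(self._items) > self._max:
                self._items.popitem(last=False)  # evict least recently used


cache = SpeakerLRU(max_entries=2)
old_key = SpeakerLRU.make_key("alice", 1775133885, "icl")
cache.put(old_key, {"embedding": [0.1, 0.2]})

# After delete + re-upload, created_at changes, so the old entry cannot be hit.
new_key = SpeakerLRU.make_key("alice", 1775140000, "icl")
stale_hit = cache.get(new_key)
```

Including created_at in the key means a re-uploaded voice with the same name never resolves to the previous upload's embedding; the old entry simply ages out of the LRU.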

@linyueqian
Collaborator

@reidliu41 Now that #2108 is merged, please rebase on main. Thanks!

@reidliu41 reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 52bca00 to e62bdfb Compare April 4, 2026 04:13
Collaborator

@linyueqian linyueqian left a comment

Feature is well-designed with solid state management and good test coverage. A few issues to address:

[P1] _load_cached_voice_prompt reads safetensors from disk on every TTS request

For audio-uploaded voices with cache_status="ready", _build_tts_params calls _load_cached_voice_prompt which does safe_open + tensor deserialization on every single request. This undermines the latency benefit of caching.

Direct-embedding voices already have _direct_embedding_cache for in-memory caching. Audio-cached prompts should get the same treatment:

self._audio_prompt_cache: dict[str, dict[str, Any]] = {}

Populate on first load, invalidate on force-rebuild or delete. The payload is small (1024-dim embedding + codec codes), so memory is not a concern.
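
A minimal sketch of that populate-on-first-load pattern, assuming the disk read is wrapped in a loader callable. `PromptMemo` and `fake_loader` are hypothetical; the real loader would be the safetensors read.

```python
from __future__ import annotations

from typing import Any, Callable


class PromptMemo:
    """Hypothetical in-memory memo in front of an expensive prompt loader."""

    def __init__(self, loader: Callable[[str], dict[str, Any]]) -> None:
        self._loader = loader
        self._cache: dict[str, dict[str, Any]] = {}

    def get(self, speaker_key: str) -> dict[str, Any]:
        prompt = self._cache.get(speaker_key)
        if prompt is None:
            prompt = self._loader(speaker_key)  # disk read happens only here
            self._cache[speaker_key] = prompt
        return prompt

    def invalidate(self, speaker_key: str) -> None:
        # Call on force-rebuild or delete so the next get() reloads.
        self._cache.pop(speaker_key, None)


load_count = 0


def fake_loader(speaker_key: str) -> dict[str, Any]:
    """Stand-in for reading the cached prompt from disk."""
    global load_count
    load_count += 1
    return {"speaker": speaker_key, "ref_text": "example"}


memo = PromptMemo(fake_loader)
memo.get("voicecachee2e")
memo.get("voicecachee2e")
loads_after_two_gets = load_count

memo.invalidate("voicecachee2e")
memo.get("voicecachee2e")
loads_after_invalidate = load_count
```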

[Minor] Non-atomic save

_save_voice_cache writes safetensors directly to the final path. A tmp+rename pattern would prevent corrupted cache files if the process dies mid-write. Fine as a follow-up.
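
The tmp+rename pattern can be sketched with the standard library. `atomic_write_bytes` is a hypothetical helper and the byte payload stands in for the serialized cache; the temp file is created in the destination directory so that `os.replace` stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(final_path: Path, payload: bytes) -> None:
    """Write payload to a temp file next to final_path, then swap it in."""
    final_path = Path(final_path)
    fd, tmp_name = tempfile.mkstemp(
        dir=final_path.parent, prefix=final_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before the swap
        os.replace(tmp_name, final_path)  # readers see the old or the new file, never a partial one
    except BaseException:
        try:
            os.unlink(tmp_name)  # do not leave a stray temp file behind
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "speaker.safetensors"
    atomic_write_bytes(target, b"first")
    atomic_write_bytes(target, b"second")
    final_content = target.read_bytes()
    leftovers = sorted(p.name for p in Path(workdir).iterdir())
```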

[Minor] Step numbering

serving_speech.py has steps 1, 2, 3, then 5 (no step 4).

Positive notes:

  • State machine (pending/processing/ready/failed) with timeout, rollback, and idempotency is solid
  • Path traversal checks via _validate_path_within_directory in all save/load paths
  • The ref_code fix in qwen3_tts_talker.py:1356-1365 is important -- _as_singleton was silently dropping all but the first frame of cached ref_code
  • _uploaded_voice_resolved flag correctly prevents request params from overriding cached ref_text/x_vector_only_mode

@JuanPZuluaga
Contributor

JuanPZuluaga commented Apr 6, 2026

Please unify naming in the PR: use "speaker" for all internal names (exceptions, methods, caches) and keep "voice" only at the HTTP API boundary.

Can the cache endpoint delegate to the shared VoiceCacheManager rather than implementing its own state machine in serving_speech.py? This would make it easier to extend to other TTS models later. @reidliu41

Do you agree here? @linyueqian

@linyueqian
Collaborator

Agree, let's extract it to the VoiceCacheManager. @reidliu41 please refactor before merge according to @JuanPZuluaga's instructions. Thanks!

…project#1760)

  Avoid repeated GPU preprocessing for uploaded audio voices by caching
  the generated voice clone prompt and reusing it in later TTS requests.

  - add worker RPC to pre-compute speaker embedding and ref_code
  - add POST /v1/audio/voices/{name}/cache with processing/ready/failed handling
  - reuse cached voice_clone_prompt in uploaded-voice TTS requests
  - prevent request ref_text/x_vector_only_mode from overriding cached semantics
  - fix direct-embedding uploaded voice handling in the TTS path
  - add rollback for pre-save rebuild failures and clearer validation errors
  - add unit and handler-contract coverage for cache generation and error paths

Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
  - avoid recreating deleted voice metadata during cache generation races
  - preserve request-level Base overrides when uploaded voices fall back to raw audio
  - memoize direct speaker embeddings to remove repeated safetensors disk reads
  - document why waveform RPC args must be converted to plain Python types
  - keep cached ref_code intact when building Qwen3-TTS Base prompts

Signed-off-by: reidliu41 <reid201711@gmail.com>
  - drop the old metadata/cache-manager path after rebasing onto main
  - keep voice_created_at-based stale-cache protection on raw-audio fallback
  - memoize direct speaker embeddings to avoid repeated safetensors reads
  - preserve request-level Base overrides when uploaded voices fall back to raw audio
  - keep cached ref_code handling intact for Base in-context prompt construction
  - update voice cache tests to match the rebased serving implementation

Signed-off-by: reidliu41 <reid201711@gmail.com>
  - memoize cached audio voice_clone_prompt payloads to avoid repeated
    safetensors reads on the TTS hot path
  - invalidate the in-memory audio prompt cache on rebuild and delete
  - warm the in-memory cache immediately after a successful cache save
  - write safetensors through a temp file and os.replace for atomic updates
  - fix the create_voice_cache step numbering comments
  - add tests for audio prompt memoization, atomic save, and cache invalidation

Signed-off-by: reidliu41 <reid201711@gmail.com>
  - move uploaded speaker cache state and safetensors persistence out of
    serving_speech into a shared VoiceCacheManager
  - keep serving_speech focused on API boundary, model checks, and worker RPC
  - align new internal naming with speaker-oriented terminology while
    keeping voice at the HTTP boundary
  - update voice cache tests to match the shared manager refactor

Signed-off-by: reidliu41 <reid201711@gmail.com>
  - resolve the serving_speech conflict after vllm-project#2424 merged into main
  - keep the speaker-to-voice request alias from vllm-project#2424
  - preserve the uploaded voice cache endpoint and shared speaker cache flow
  - drop the stale direct-embedding helper left behind by the rebase

Signed-off-by: reidliu41 <reid201711@gmail.com>
@reidliu41 reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 2c1f285 to d1f8816 Compare April 7, 2026 16:05
@linyueqian
Collaborator

@JuanPZuluaga Could you take another look? The author addressed the P1 (in-memory caching + atomic save in b4217d8d) and rebased on top of #2424. The changeset is large (1378 additions across 6 files) so would appreciate your review since this builds directly on your VoiceEmbeddingCache from #2108.

@JuanPZuluaga
Contributor

JuanPZuluaga commented Apr 9, 2026

Thanks for the contribution @reidliu41, a few comments below:

  • Let's pick "speaker" as the naming convention for the exception classes. Having both SpeakerNotFoundError and VoiceNotFoundError might add confusion, and we are leaning towards "Speaker" because most of the models use this naming convention.
  • VoiceCacheUnsupportedError to SpeakerCacheUnsupportedError
  • VoiceNotFoundError to SpeakerNotFoundError
  • Also, could you add an asyncio.Lock per speaker in create_speaker_cache to prevent redundant GPU work when two concurrent POST /cache requests hit the same voice/speaker during the await build_speaker_prompt(...) step? It is not a correctness issue, but it wastes GPU cycles.

Also, overall I have the following question:

We already have VoiceEmbeddingCache (GPU-side LRU) and the embedding_source: "direct" path (safetensors + voice_clone_prompt). Rather than a new endpoint + state machine + VoiceCacheManager + 3 cache layers, what do you think about running the extraction as a background task on upload and reusing the existing direct-embedding code path? This would reduce the PR's code considerably while still preserving the key improvements proposed in the PR: IPC savings and a warm first request, when not following the "voice upload" path.

We can even have 2 dictionaries:

  • one for the speakers that we upload directly (so if we send multiple voice clone requests with different voices, they would not overwrite/delete/evict the speakers we have already uploaded)
  • one for the speakers to be cached when sending clone requests

please let me know what you think.
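
The per-speaker asyncio.Lock requested in this comment can be sketched as follows. The function names mirror the ones mentioned above, but the bodies are stand-ins: the sleep simulates the GPU extraction, and the module-level dicts are assumptions made to keep the example self-contained.

```python
import asyncio

_locks: dict[str, asyncio.Lock] = {}
_ready: dict[str, dict] = {}
build_calls = 0


async def build_speaker_prompt(speaker: str) -> dict:
    """Stand-in for the expensive GPU extraction step."""
    global build_calls
    build_calls += 1
    await asyncio.sleep(0.01)
    return {"speaker": speaker, "cache_status": "ready"}


async def create_speaker_cache(speaker: str) -> dict:
    lock = _locks.setdefault(speaker, asyncio.Lock())
    async with lock:
        # Re-check under the lock: an earlier caller may have finished the build.
        cached = _ready.get(speaker)
        if cached is not None:
            return cached
        prompt = await build_speaker_prompt(speaker)
        _ready[speaker] = prompt
        return prompt


async def main() -> list:
    # Three concurrent requests for the same speaker.
    return await asyncio.gather(*(create_speaker_cache("alice") for _ in range(3)))


results = asyncio.run(main())
```

The check inside the lock is what removes the duplicate work: later callers wait for the first build and then reuse its result instead of starting their own.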

@reidliu41
Contributor Author

@JuanPZuluaga Thanks, this is a good direction.

I agree that reducing the number of cache layers would be cleaner. My hesitation is that the current embedding_source="direct" path is still x-vector-only, while this PR's cached audio path also stores ref_code, icl_mode, and ref_text, so the two paths are not fully equivalent yet.

Because of that, my preference is to keep this PR scoped to the explicit /cache endpoint and persisted cached prompt reuse, and treat “background extraction on upload + unifying with the direct path” as a follow-up refactor.

  - rename the cache endpoint exceptions to SpeakerNotFoundError and
    SpeakerCacheUnsupportedError for internal naming consistency
  - update the API layer and tests to use the speaker-named exceptions
  - add a per-speaker asyncio.Lock in VoiceCacheManager to prevent
    duplicate GPU work when concurrent /cache requests target the same speaker

Signed-off-by: reidliu41 <reid201711@gmail.com>
Collaborator

@lishunyang12 lishunyang12 left a comment

Review: Voice Clone Prompt Cache Endpoint for Qwen3-TTS

Overall this is a well-structured PR that fills an important gap — pre-computing speaker embeddings and codec codes so repeated TTS requests skip redundant GPU work. The state machine (pending → processing → ready/failed), rollback logic, atomic file writes, and idempotency handling are all solid. The test coverage is thorough. A few issues worth addressing:


Concurrency: _speaker_locks dict is not thread-safe

VoiceCacheManager._get_speaker_lock() does a check-then-create on a plain dict without synchronization. If two coroutines race on the same speaker key for the first time, they could both create separate asyncio.Lock instances, defeating the serialization:

def _get_speaker_lock(self, speaker_key: str) -> asyncio.Lock:
    lock = self._speaker_locks.get(speaker_key)
    if lock is None:
        lock = asyncio.Lock()
        self._speaker_locks[speaker_key] = lock
    return lock

In practice this is unlikely with asyncio's single-threaded event loop (no preemption between the get and set), but dict.setdefault would make the intent explicit and be future-proof:

return self._speaker_locks.setdefault(speaker_key, asyncio.Lock())

_speaker_locks grows unboundedly

Every voice that ever gets cached adds a lock that is never removed — even after delete_voice. Over time on a long-running server with many uploaded voices this leaks memory. Consider removing the lock in invalidate_speaker_prompt_cache() or delete_voice, or switching to a bounded structure.
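
A sketch of a lock registry that combines setdefault with cleanup on delete. `SpeakerLockRegistry` and `discard` are hypothetical names; refusing to drop a lock that is currently held is one possible policy, not necessarily the right one for the delete path in this PR.

```python
import asyncio


class SpeakerLockRegistry:
    """Hypothetical per-speaker lock registry with explicit cleanup."""

    def __init__(self) -> None:
        self._locks = {}

    def get(self, speaker_key: str) -> asyncio.Lock:
        # setdefault keeps the get-or-create step a single dict operation.
        return self._locks.setdefault(speaker_key, asyncio.Lock())

    def discard(self, speaker_key: str) -> bool:
        """Drop the lock for a deleted speaker; keep it if it is still held."""
        lock = self._locks.get(speaker_key)
        if lock is not None and lock.locked():
            return False
        self._locks.pop(speaker_key, None)
        return True

    def __len__(self) -> int:
        return len(self._locks)


async def demo() -> tuple:
    registry = SpeakerLockRegistry()
    first = registry.get("alice")
    same = registry.get("alice")
    async with first:
        removed_while_held = registry.discard("alice")
    removed_after = registry.discard("alice")
    return first is same, removed_while_held, removed_after, len(registry)


demo_result = asyncio.run(demo())
```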


save_speaker_cache mutates speaker_info dict in place

save_speaker_cache() writes cache_status, cache_file, and cache_generated_at directly into the speaker_info dict that lives in uploaded_speakers. This works because the caller holds the async lock, but it couples persistence logic to in-memory state mutation. If save_file() succeeds but os.replace() fails (e.g., cross-device rename), the metadata is never updated — good. But if os.replace succeeds and then speaker_info["cache_status"] = "ready" raises (impossible for dict, but fragile pattern), the file is orphaned. Consider returning the metadata updates and letting the caller apply them, or at minimum documenting that this method has side effects on speaker_info.


wav_samples sent as list[float] over IPC could be large

In _build_speaker_cache_payload, the entire waveform is converted to a Python list of floats (wav_np.tolist()) for the collective_rpc call. For a 10-second clip at 24kHz, that's 240,000 Python float objects serialized through msgspec. This works, but could be a latency/memory bottleneck for longer reference clips. Worth a comment noting this limitation, or consider base64-encoding the raw bytes if msgspec supports bytes.
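
The base64 alternative could look like the sketch below. It uses `array.array` as a stand-in for the NumPy waveform and assumes the RPC layer accepts a plain dict of strings; whether msgspec can carry raw bytes directly is not settled in this thread, so the example sticks to a base64 string. `encode_waveform` and `decode_waveform` are hypothetical helpers.

```python
import base64
from array import array


def encode_waveform(samples: array) -> dict:
    """Pack samples into a dict of plain strings that any IPC layer can carry."""
    return {
        "dtype": samples.typecode,
        "data": base64.b64encode(samples.tobytes()).decode("ascii"),
    }


def decode_waveform(payload: dict) -> array:
    """Rebuild the packed sample array on the receiving side."""
    out = array(payload["dtype"])
    out.frombytes(base64.b64decode(payload["data"]))
    return out


wav = array("f", [0.0, 0.25, -0.5, 1.0])
payload = encode_waveform(wav)
restored = decode_waveform(payload)
```

Base64 adds about a third to the packed size, which is still far smaller than one Python float object per sample. This sketch assumes sender and receiver share the same byte order, which holds for processes on one machine.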


Unused _speaker_embedding_cache field on OmniOpenAIServingSpeech

The new _speaker_embedding_cache: dict[str, list[float]] on the serving class is only used for direct-embedding voices. Meanwhile VoiceCacheManager has its own _speaker_prompt_cache for audio-cached voices. Having two separate caches for similar purposes is confusing. Consider consolidating, or at minimum add a comment explaining why they are separate.


Minor: _as_singleton removal in talker could break non-cached paths

In qwen3_tts_talker.py, the change removes the _as_singleton() call on ref_code:

-                ref_code = _as_singleton(voice_clone_prompt.get("ref_code"))
+                ref_code = voice_clone_prompt.get("ref_code")

If any existing code path (e.g., inline ref_audio without caching) previously relied on _as_singleton to unwrap a batch dimension, this change could silently break it. Please verify that all non-cached voice_clone_prompt payloads either don't set ref_code or already provide it in the expected shape.


Minor: f-string in logger.exception

logger.exception(f"Failed to create voice cache for '{name}': {e}")

Using f-string with logger.exception means the string is always formatted even if the log level is disabled. Use logger.exception("Failed to create voice cache for '%s': %s", name, e) for lazy formatting.
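
The difference is easy to observe with a counter in `__str__`. The sketch uses `logger.debug` on a logger set to ERROR so that the call is actually filtered out; `CountingStr` and the logger name are made up for the demonstration.

```python
import logging


class CountingStr:
    """Counts how often it is converted to a string."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return self.text


logger = logging.getLogger("voice_cache_demo")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.ERROR)  # DEBUG records are dropped
logger.propagate = False

eager = CountingStr("alice")
lazy = CountingStr("alice")

# The f-string is built before logging checks the level.
logger.debug(f"Failed to create voice cache for '{eager}'")
# With %-style arguments, formatting is skipped for a disabled level.
logger.debug("Failed to create voice cache for '%s'", lazy)

print(eager.calls, lazy.calls)
```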


Minor: test line lengths

Several test lines exceed 120 characters (e.g., _make_server, test_create_cache_processing_active). Not a blocker but worth a formatting pass for consistency with the rest of the codebase.


Positive notes

  • The rollback logic (restoring previous_status on pre-save failures vs. setting "failed" on post-save failures) is well thought out.
  • Atomic file replacement via tempfile + os.replace is the right pattern.
  • Idempotency check (ready + valid cache → early return) prevents wasted GPU work.
  • The force parameter for debugging/maintenance is a nice touch.
  • Test coverage is comprehensive: error branches, rollback, handler contract, direct-embedding edge cases.

  - Use setdefault for per-speaker cache lock creation
  - Remove per-speaker cache locks when uploaded voices are deleted
  - Document VoiceCacheManager side effects and cache ownership boundaries
  - Note waveform list IPC cost for long reference audio
  - Preserve ref_code compatibility without unwrapping cached list payloads
  - Use lazy formatting for voice cache exception logging
  - Extend delete-voice cache invalidation test coverage

Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
…dpoint-1760

Signed-off-by: reidliu41 <reid201711@gmail.com>
@reidliu41
Contributor Author

I reworked it based on the latest shared speaker-cache implementation.

The /v1/audio/voices/{name}/cache endpoint is still the public API, but internally it now pre-computes the Qwen3-TTS speaker embedding and optional ref_code, then stores them in the shared SpeakerEmbeddingCache. This removes the separate cache manager, metadata state machine, and persisted cache-file layer.

@reidliu41
Contributor Author

@JuanPZuluaga @lishunyang12 Please take another look when you have time.

Development

Successfully merging this pull request may close these issues.

[Feature]: Add a separate endpoint for create_voice_clone_prompt for qwen3-TTS model

4 participants