[Frontend] Add voice clone prompt cache endpoint for Qwen3-TTS (#1760) by reidliu41 · Pull Request #2457 · vllm-project/vllm-omni

reidliu41 · 2026-04-02T12:57:38Z

Purpose

Add POST /v1/audio/voices/{name}/cache for uploaded Qwen3-TTS voices.

This change pre-computes speaker embedding and reference audio codec codes on the
TTS worker through collective_rpc, persists them as safetensors, and lets
subsequent TTS requests reuse the cached voice_clone_prompt instead of
reprocessing reference audio on every request.

Value:

- reduces redundant GPU work for uploaded audio voices
- improves repeated-request latency for voice cloning
- wires up the voice cache infrastructure introduced earlier but not yet exposed
- fixes the uploaded direct-embedding path in TTS request building
- hardens cache rebuild behavior with rollback and clearer error handling

Closes #1760 ([Feature]: Add a separate endpoint for create_voice_clone_prompt for qwen3-TTS model).

Test Plan

Manual end-to-end validation on a local Omni server:

# Start server
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
SPEECH_VOICE_SAMPLES=/tmp/voice_samples_1760 \
./.venv/bin/vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --omni \
  --port 8091 \
  --gpu-memory-utilization 0.15

# Generate a local reference clip
ffmpeg -y -f lavfi \
  -i "flite=text='This is a cache validation reference clip.':voice=slt" \
  -ar 24000 -ac 1 /tmp/voice-cache-e2e/ref.wav

# Upload an audio voice with ref_text
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices \
  -F audio_sample=@/tmp/voice-cache-e2e/ref.wav \
  -F consent=consent_001 \
  -F name=voicecachee2e \
  -F ref_text='This is a cache validation reference clip.'

# Generate cache
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices/voicecachee2e/cache

# Re-run cache generation to verify idempotent ready behavior
curl -sS -X POST http://127.0.0.1:8091/v1/audio/voices/voicecachee2e/cache

# Inspect metadata
jq '.uploaded_speakers.voicecachee2e' /tmp/voice_samples_1760/metadata.json

# Inspect safetensors cache contents
./.venv/bin/python - <<'PY'
import json
from safetensors import safe_open

with open('/tmp/voice_samples_1760/metadata.json', 'r', encoding='utf-8') as f:
    meta = json.load(f)

cache_file = meta['uploaded_speakers']['voicecachee2e']['cache_file']
print("CACHE_FILE", cache_file)

with safe_open(cache_file, framework='pt', device='cpu') as f:
    print("KEYS", list(f.keys()))
    print("META", f.metadata())
PY

# Run cached TTS
curl -sS -D /tmp/voice-cache-e2e/speech-fixed.headers \
  -o /tmp/voice-cache-e2e/speech-fixed.out \
  http://127.0.0.1:8091/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"This is the cached prompt synthesis check after the fix.","voice":"voicecachee2e","response_format":"wav"}'

# Move the original uploaded audio away and run cached-only TTS
mv /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav \
   /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav.bak

curl -sS -D /tmp/voice-cache-e2e/speech-cached-only.headers \
  -o /tmp/voice-cache-e2e/speech-cached-only.out \
  http://127.0.0.1:8091/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"This request runs after removing the original uploaded audio.","voice":"voicecachee2e","response_format":"wav"}'

# Restore the original uploaded audio after validation
mv /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav.bak \
   /tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav

Test Result

  Manual validation passed.

  Observed results:

  - Voice upload returned `200`:

    {"success":true,"voice":{"name":"voicecachee2e","consent":"consent_001","created_at":1775133885,"mime_type":"audio/wav","file_size":132798,"ref_text":"This is a cache validation reference clip."}}

  - Cache generation returned:

    {"voice":"voicecachee2e","cache_status":"ready"}
  - Repeated cache generation returned the expected idempotent response:

    {"voice":"voicecachee2e","cache_status":"ready","message":"Cache already exists and is valid"}
  - Metadata showed a ready audio-backed cache:

    {
      "name": "voicecachee2e",
      "consent": "consent_001",
      "file_path": "/tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.wav",
      "ref_text": "This is a cache validation reference clip.",
      "cache_status": "ready",
      "cache_file": "/tmp/voice_samples_1760/voicecachee2e_consent_001_1775133885.safetensors",
      "cache_generated_at": 1775133893.501347,
      "embedding_source": "audio"
    }
  - The generated safetensors cache contained both cached speaker embedding and cached ref_code:

    KEYS ['__len__', 'item_0_has_ref_code', 'item_0_icl_mode', 'item_0_ref_code', 'item_0_ref_spk_embedding', 'item_0_x_vector_only_mode']
    META {'item_0_ref_text': 'This is a cache validation reference clip.'}
  - Cached TTS returned 200 OK with audio/wav:

    HTTP/1.1 200 OK
    content-type: audio/wav
    /tmp/voice-cache-e2e/speech-fixed.out: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 24000 Hz
    -rw-rw-r-- 1 xx xx 196K /tmp/voice-cache-e2e/speech-fixed.out
  - Server logs confirmed the cached path was used:

    Using cached voice_clone_prompt for: voicecachee2e (icl=True)
  - Cached-only TTS still returned 200 OK after temporarily removing the original uploaded audio:

    HTTP/1.1 200 OK
    content-type: audio/wav
    /tmp/voice-cache-e2e/speech-cached-only.out: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 24000 Hz
    -rw-rw-r-- 1 xx xx 218K /tmp/voice-cache-e2e/speech-cached-only.out
  - Server logs again confirmed cached prompt reuse, with no raw-audio fallback:

    Using cached voice_clone_prompt for: voicecachee2e (icl=True)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d4e94cd0a

chatgpt-codex-connector · 2026-04-02T13:04:56Z

+        if not self.metadata_manager.update_speaker(
+            voice_key,
+            {
+                "cache_status": "processing",
+                "cache_generated_at": now,
+            },
+        ):


Prevent cache rebuild from recreating deleted voice metadata

This path marks a voice as processing via metadata_manager.update_speaker(...) after only an in-memory existence check, but update_speaker creates missing entries when the key is absent. If another worker deletes the voice between those two operations, this call resurrects a partial metadata record and can leave a zombie voice after rollback/failure. In multi-process deployments, a DELETE racing with /cache can therefore corrupt metadata.json instead of failing cleanly.
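
A minimal sketch of the fail-cleanly behavior described above, assuming the existence check and the write can share one lock. `SpeakerMetadataStore` and `update_existing` are hypothetical names, not the PR's metadata manager. An in-process `threading.Lock` only illustrates the idea; a multi-process deployment would need a file lock or a similar cross-process guard.

```python
import threading
from typing import Any, Optional


class SpeakerMetadataStore:
    """Hypothetical in-memory stand-in for a speaker metadata manager."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._speakers: dict[str, dict[str, Any]] = {}

    def create(self, key: str, info: dict[str, Any]) -> None:
        with self._lock:
            self._speakers[key] = dict(info)

    def delete(self, key: str) -> bool:
        with self._lock:
            return self._speakers.pop(key, None) is not None

    def get(self, key: str) -> Optional[dict[str, Any]]:
        with self._lock:
            entry = self._speakers.get(key)
            return dict(entry) if entry is not None else None

    def update_existing(self, key: str, updates: dict[str, Any]) -> bool:
        """Apply updates only if the key still exists; never create it."""
        with self._lock:
            entry = self._speakers.get(key)
            if entry is None:
                # Deleted concurrently: report failure instead of resurrecting.
                return False
            entry.update(updates)
            return True
```

With this shape, a cache request that loses the race with DELETE gets `False` back and can return "not found" without leaving a partial record behind.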

chatgpt-codex-connector · 2026-04-02T13:04:56Z

+                    # ── No cache (pending/failed/processing): raw audio path ──
+                    self._fallback_to_raw_audio(request.voice, speaker_info, params)
+                    _uploaded_voice_resolved = True


Keep request-level Base overrides for uncached uploaded voices

In the raw-audio fallback branch for uploaded voices without a ready cache, _uploaded_voice_resolved is set to True, which skips the later merge of request-level ref_text/x_vector_only_mode/speaker_embedding. As a result, requests for pending/failed/processing uploaded voices now silently ignore valid per-request Base cloning overrides and always use upload-time defaults from _fallback_to_raw_audio, which regresses prior behavior. The override suppression should be limited to cases where a cached/direct prompt was actually used.
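
The intended precedence can be sketched as below. This is an illustration under assumptions, not the PR's code: `resolve_uploaded_voice_params` is a hypothetical helper, and only the field names `ref_text` and `x_vector_only_mode` come from the discussion above.

```python
from typing import Any, Optional


def resolve_uploaded_voice_params(
    cached_prompt: Optional[dict[str, Any]],
    upload_defaults: dict[str, Any],
    request_overrides: dict[str, Any],
) -> tuple[dict[str, Any], bool]:
    """Return (params, used_cached_prompt)."""
    if cached_prompt is not None:
        # A ready cache defines the prompt semantics, so request-level
        # overrides are intentionally ignored on this branch only.
        return dict(cached_prompt), True

    # Raw-audio fallback: start from upload-time defaults, then let fields
    # the request explicitly set (non-None) take precedence.
    params = dict(upload_defaults)
    params.update({k: v for k, v in request_overrides.items() if v is not None})
    return params, False
```

The second return value plays the role of the resolved flag: it is only true when a cached prompt was actually used.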

lishunyang12

Left a couple of comments, mostly around hot-path performance.

lishunyang12 · 2026-04-02T15:28:13Z

+                    emb_path = Path(cache_file_str)
+                    if not _validate_path_within_directory(emb_path, self.uploaded_speakers_dir):
+                        raise ValueError("Illegal cache path outside voice samples directory")
+                    if not emb_path.is_file() or emb_path.suffix != ".safetensors":


This loads and deserializes the safetensors file on every single TTS request for direct-embedding voices. That's synchronous disk I/O on the request hot path — kind of defeats the purpose of caching.

Could you load this once (e.g. at upload time or first access) and keep the embedding list in memory, similar to how the audio-cache path uses load_cached_voice_prompt?

lishunyang12 · 2026-04-02T15:28:13Z

+        icl_mode = ref_text is not None and ref_text.strip() != ""
+
+        if icl_mode and not hasattr(model, "_encode_ref_audio_to_code"):
+            raise NotImplementedError(f"{type(model).__name__} does not support ref audio codec encoding")


wav_np.tolist() converts the entire waveform to a Python list of floats before sending over RPC. For a 10s clip at 24kHz that's 240k Python float objects — roughly 10x the memory of the numpy array.

Worth checking if the RPC layer can handle numpy arrays or bytes directly. If not, at least document why this is necessary.
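
A rough way to check the overhead, using the stdlib `array` module as a stand-in for a float32 numpy array (an assumption; the dtype of `wav_np` is not shown in this thread). Exact numbers depend on the interpreter, so the script prints them rather than hard-coding a ratio.

```python
import sys
from array import array

SAMPLE_RATE = 24_000
SECONDS = 10

# Packed 32-bit floats: a stdlib stand-in for a float32 numpy array.
packed = array("f", (i / 1000.0 for i in range(SAMPLE_RATE * SECONDS)))
as_list = packed.tolist()

packed_bytes = packed.itemsize * len(packed)
# A list holds one pointer per element, and each element is a separate
# Python float object with its own header.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)

print(f"samples:        {len(as_list)}")
print(f"packed buffer:  {packed_bytes / 1e6:.2f} MB")
print(f"list of floats: {list_bytes / 1e6:.2f} MB")
print(f"ratio:          {list_bytes / packed_bytes:.1f}x")
```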

lishunyang12 · 2026-04-02T15:28:13Z

+            updates["cache_generated_at"] = cache_generated_at
+        if not self.metadata_manager.update_speaker(voice_key, updates):
+            logger.error("Failed to rollback cache state for voice %s to disk", voice_key)
+        if voice_key in self.uploaded_speakers:


Nit: except Exception is fine for the rollback, but the docstring says "returns plain Python types only (must survive msgspec IPC)" over in the worker — same constraint applies to wav_samples arg. Might be worth a brief comment here explaining why tolist() is needed for the audio data too (msgspec can't handle numpy).

linyueqian · 2026-04-03T16:36:51Z

@JuanPZuluaga ptal

JuanPZuluaga · 2026-04-03T18:02:08Z

Hi @linyueqian @lishunyang12 @reidliu41, I think this PR overlaps a bit with #2108.

In that PR, we use an in-memory LRU cache keyed by voice_name:created_at:mode, which prevents stale cache hits after delete + re-upload. It has no safetensors, no metadata.json, and no file locks, just a thread-safe Dict. It exposes the same API surface (upload/list/delete). It would be great to coordinate so we don't duplicate effort.
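
For readers comparing the two designs, here is a minimal sketch of how a composite key avoids stale hits. It is not the code from #2108: `SpeakerLRU`, its methods, and the `"icl"` mode value are hypothetical, and only the `voice_name:created_at:mode` key layout comes from the comment above.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class SpeakerLRU:
    """Hypothetical thread-safe LRU keyed by name, upload time, and mode."""

    def __init__(self, max_entries: int = 128) -> None:
        self._max = max_entries
        self._lock = threading.Lock()
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    @staticmethod
    def make_key(name: str, created_at: int, mode: str) -> str:
        # created_at changes on re-upload, so old entries can never match.
        return f"{name}:{created_at}:{mode}"

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used
```

After a delete and re-upload of the same voice name, the new `created_at` produces a different key, so a lookup misses and the stale entry simply ages out.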

Ideally, the voice cache manager should handle all model types that support voice cloning, and I'll work on that as soon as #2108 is merged into main.

linyueqian · 2026-04-04T03:14:15Z

@reidliu41 now that #2108 is merged, please rebase on main. Thanks!

linyueqian

Feature is well-designed with solid state management and good test coverage. A few issues to address:

[P1] _load_cached_voice_prompt reads safetensors from disk on every TTS request

For audio-uploaded voices with cache_status="ready", _build_tts_params calls _load_cached_voice_prompt which does safe_open + tensor deserialization on every single request. This undermines the latency benefit of caching.

Direct-embedding voices already have _direct_embedding_cache for in-memory caching. Audio-cached prompts should get the same treatment:

self._audio_prompt_cache: dict[str, dict[str, Any]] = {}

Populate on first load, invalidate on force-rebuild or delete. The payload is small (1024-dim embedding + codec codes), so memory is not a concern.
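
A minimal sketch of that populate-on-first-load pattern, assuming the disk read can be passed in as a callable. `PromptMemo` and its `loader` argument are hypothetical; in the PR the loader role would be played by the safetensors read.

```python
from typing import Any, Callable


class PromptMemo:
    """Hypothetical in-memory memo in front of an expensive prompt loader."""

    def __init__(self, loader: Callable[[str], dict[str, Any]]) -> None:
        self._loader = loader
        self._cache: dict[str, dict[str, Any]] = {}

    def get(self, speaker_key: str) -> dict[str, Any]:
        prompt = self._cache.get(speaker_key)
        if prompt is None:
            # First access: pay the disk/deserialization cost once.
            prompt = self._loader(speaker_key)
            self._cache[speaker_key] = prompt
        return prompt

    def invalidate(self, speaker_key: str) -> None:
        # Call on force-rebuild or delete so the next get() reloads.
        self._cache.pop(speaker_key, None)
```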

[Minor] Non-atomic save

_save_voice_cache writes safetensors directly to the final path. A tmp+rename pattern would prevent corrupted cache files if the process dies mid-write. Fine as a follow-up.
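
The tmp+rename pattern can be sketched as follows. `atomic_write_bytes` is a hypothetical helper that writes raw bytes; the real `_save_voice_cache` would write safetensors instead. The temp file is created in the destination directory so that `os.replace` stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(final_path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old file or the full new one."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=final_path.parent, prefix=final_path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push data to disk before the rename
        os.replace(tmp_name, final_path)  # atomic swap into the final path
    except BaseException:
        # Remove the partial temp file; the final path is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```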

[Minor] Step numbering

serving_speech.py has steps 1, 2, 3, then 5 (no step 4).

Positive notes:

- State machine (pending/processing/ready/failed) with timeout, rollback, and idempotency is solid
- Path traversal checks via _validate_path_within_directory in all save/load paths
- The ref_code fix in qwen3_tts_talker.py:1356-1365 is important -- _as_singleton was silently dropping all but the first frame of cached ref_code
- _uploaded_voice_resolved flag correctly prevents request params from overriding cached ref_text/x_vector_only_mode
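
For reference, the admission rule behind such a state machine might look like the sketch below. Only the four state names come from this PR; the function name, the timeout handling, and the exact transitions are assumptions made for illustration.

```python
from typing import Optional

VALID_STATES = ("pending", "processing", "ready", "failed")


def can_start_build(
    status: str,
    started_at: Optional[float],
    now: float,
    timeout_s: float,
    force: bool = False,
) -> bool:
    """Decide whether a new cache build may start (hypothetical rule set)."""
    if status not in VALID_STATES:
        raise ValueError(f"unknown cache status: {status!r}")
    if status in ("pending", "failed"):
        return True
    if status == "processing":
        # An active build blocks new ones; a build older than the timeout
        # is treated as stale and may be taken over.
        return started_at is None or (now - started_at) > timeout_s
    # status == "ready": idempotent no-op unless a rebuild is forced.
    return force
```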

JuanPZuluaga · 2026-04-06T15:44:11Z

Please unify naming in the PR: use "speaker" for all internal names (exceptions, methods, caches) and keep "voice" only at the HTTP API boundary.

Can the cache endpoint delegate to the shared VoiceCacheManager rather than implementing its own state machine in serving_speech.py? This would make it easier to extend to other TTS models later. @reidliu41

Do you agree here? @linyueqian

linyueqian · 2026-04-06T17:11:51Z

Agree, let's extract it to the VoiceCacheManager. @reidliu41 please refactor before merge according to @JuanPZuluaga's instructions. Thanks!

…project#1760)

Avoid repeated GPU preprocessing for uploaded audio voices by caching the generated voice clone prompt and reusing it in later TTS requests.

- add worker RPC to pre-compute speaker embedding and ref_code
- add POST /v1/audio/voices/{name}/cache with processing/ready/failed handling
- reuse cached voice_clone_prompt in uploaded-voice TTS requests
- prevent request ref_text/x_vector_only_mode from overriding cached semantics
- fix direct-embedding uploaded voice handling in the TTS path
- add rollback for pre-save rebuild failures and clearer validation errors
- add unit and handler-contract coverage for cache generation and error paths

Signed-off-by: reidliu41 <reid201711@gmail.com>

Signed-off-by: reidliu41 <reid201711@gmail.com>

- avoid recreating deleted voice metadata during cache generation races
- preserve request-level Base overrides when uploaded voices fall back to raw audio
- memoize direct speaker embeddings to remove repeated safetensors disk reads
- document why waveform RPC args must be converted to plain Python types
- keep cached ref_code intact when building Qwen3-TTS Base prompts

Signed-off-by: reidliu41 <reid201711@gmail.com>

- drop the old metadata/cache-manager path after rebasing onto main
- keep voice_created_at-based stale-cache protection on raw-audio fallback
- memoize direct speaker embeddings to avoid repeated safetensors reads
- preserve request-level Base overrides when uploaded voices fall back to raw audio
- keep cached ref_code handling intact for Base in-context prompt construction
- update voice cache tests to match the rebased serving implementation

Signed-off-by: reidliu41 <reid201711@gmail.com>

- memoize cached audio voice_clone_prompt payloads to avoid repeated safetensors reads on the TTS hot path
- invalidate the in-memory audio prompt cache on rebuild and delete
- warm the in-memory cache immediately after a successful cache save
- write safetensors through a temp file and os.replace for atomic updates
- fix the create_voice_cache step numbering comments
- add tests for audio prompt memoization, atomic save, and cache invalidation

Signed-off-by: reidliu41 <reid201711@gmail.com>

- move uploaded speaker cache state and safetensors persistence out of serving_speech into a shared VoiceCacheManager
- keep serving_speech focused on API boundary, model checks, and worker RPC
- align new internal naming with speaker-oriented terminology while keeping voice at the HTTP boundary
- update voice cache tests to match the shared manager refactor

Signed-off-by: reidliu41 <reid201711@gmail.com>

- resolve the serving_speech conflict after vllm-project#2424 merged into main
- keep the speaker-to-voice request alias from vllm-project#2424
- preserve the uploaded voice cache endpoint and shared speaker cache flow
- drop the stale direct-embedding helper left behind by the rebase

Signed-off-by: reidliu41 <reid201711@gmail.com>

linyueqian · 2026-04-09T00:57:17Z

@JuanPZuluaga Could you take another look? The author addressed the P1 (in-memory caching + atomic save in b4217d8d) and rebased on top of #2424. The changeset is large (1378 additions across 6 files) so would appreciate your review since this builds directly on your VoiceEmbeddingCache from #2108.

JuanPZuluaga · 2026-04-09T04:45:25Z

Thanks for the contribution @reidliu41, a few comments below:

- Let's pick "speaker" as the naming convention for the exception classes. Having both SpeakerNotFoundError and VoiceNotFoundError might add confusion, and we are leaning towards "Speaker" since most of the models use this naming convention:
  - VoiceCacheUnsupportedError to SpeakerCacheUnsupportedError
  - VoiceNotFoundError to SpeakerNotFoundError
- Also, could you add an asyncio.Lock per speaker in create_speaker_cache to prevent redundant GPU work when two concurrent POST /cache requests hit the same voice/speaker during the await build_speaker_prompt(...) step? It is not a correctness issue, but it wastes GPU cycles.
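
A minimal sketch of the per-speaker lock idea, with a sleep standing in for the GPU step. `SpeakerCacheBuilder` and its members are hypothetical names, not the PR's `create_speaker_cache`; the point is the re-check after acquiring the lock, which lets the second request reuse the first one's result.

```python
import asyncio


class SpeakerCacheBuilder:
    """Hypothetical builder that serializes cache creation per speaker."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = {}
        self._ready: dict[str, str] = {}
        self.build_calls = 0

    async def _build_prompt(self, speaker: str) -> str:
        # Stand-in for the expensive GPU-side prompt extraction.
        self.build_calls += 1
        await asyncio.sleep(0.01)
        return f"prompt-for-{speaker}"

    async def create_cache(self, speaker: str) -> str:
        lock = self._locks.setdefault(speaker, asyncio.Lock())
        async with lock:
            # Re-check under the lock: a concurrent request may have finished.
            if speaker in self._ready:
                return "reused"
            self._ready[speaker] = await self._build_prompt(speaker)
            return "built"


async def main() -> tuple[int, list[str]]:
    builder = SpeakerCacheBuilder()
    results = await asyncio.gather(
        builder.create_cache("alice"), builder.create_cache("alice")
    )
    return builder.build_calls, list(results)


if __name__ == "__main__":
    print(asyncio.run(main()))
```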

Also, overall I have the following question:

We already have VoiceEmbeddingCache (GPU-side LRU) and the embedding_source: "direct" path (safetensors + voice_clone_prompt). Rather than a new endpoint + state machine + VoiceCacheManager + 3 cache layers, what do you think about running the extraction as a background task on upload and reusing the existing direct-embedding code path? This would reduce the PR code a lot while still preserving the key improvements proposed in the PR: IPC savings + warm first request, when not following the "voice upload" path.

We can even have 2 dictionaries:

- one for the speakers that we upload directly (so sending multiple voice clone requests with different voices would not overwrite/delete/evict the speakers we have already uploaded)
- one for the speakers to be cached when sending clone requests
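
A minimal sketch of the two-dictionary idea, assuming uploaded speakers are pinned until explicitly deleted while per-request speakers live in a small LRU. `TwoTierSpeakerCache` and its method names are hypothetical and do not reflect the existing VoiceEmbeddingCache API.

```python
from collections import OrderedDict
from typing import Any, Optional


class TwoTierSpeakerCache:
    """Hypothetical cache: pinned uploads plus a bounded LRU for ad hoc clones."""

    def __init__(self, max_transient: int = 2) -> None:
        self._uploaded: dict[str, Any] = {}
        self._transient: "OrderedDict[str, Any]" = OrderedDict()
        self._max_transient = max_transient

    def pin_uploaded(self, name: str, prompt: Any) -> None:
        self._uploaded[name] = prompt  # never evicted by request traffic

    def remove_uploaded(self, name: str) -> None:
        self._uploaded.pop(name, None)

    def put_transient(self, key: str, prompt: Any) -> None:
        self._transient[key] = prompt
        self._transient.move_to_end(key)
        while len(self._transient) > self._max_transient:
            self._transient.popitem(last=False)  # evict oldest transient entry

    def get(self, key: str) -> Optional[Any]:
        if key in self._uploaded:
            return self._uploaded[key]
        if key in self._transient:
            self._transient.move_to_end(key)
            return self._transient[key]
        return None
```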

Please let me know what you think.

reidliu41 · 2026-04-09T15:32:14Z

@JuanPZuluaga Thanks, this is a good direction.

I agree that reducing the number of cache layers would be cleaner. My hesitation is that the current embedding_source="direct" path is still x-vector-only, while this PR's cached audio path also stores ref_code, icl_mode, and ref_text, so the two paths are not fully equivalent yet.

Because of that, my preference is to keep this PR scoped to the explicit /cache endpoint and persisted cached prompt reuse, and treat “background extraction on upload + unifying with the direct path” as a follow-up refactor.

- rename the cache endpoint exceptions to SpeakerNotFoundError and SpeakerCacheUnsupportedError for internal naming consistency
- update the API layer and tests to use the speaker-named exceptions
- add a per-speaker asyncio.Lock in VoiceCacheManager to prevent duplicate GPU work when concurrent /cache requests target the same speaker

Signed-off-by: reidliu41 <reid201711@gmail.com>

lishunyang12

Review: Voice Clone Prompt Cache Endpoint for Qwen3-TTS

Overall this is a well-structured PR that fills an important gap — pre-computing speaker embeddings and codec codes so repeated TTS requests skip redundant GPU work. The state machine (pending → processing → ready/failed), rollback logic, atomic file writes, and idempotency handling are all solid. The test coverage is thorough. A few issues worth addressing:

Concurrency: `_speaker_locks` dict is not thread-safe

VoiceCacheManager._get_speaker_lock() does a check-then-create on a plain dict without synchronization. If two coroutines race on the same speaker key for the first time, they could both create separate asyncio.Lock instances, defeating the serialization:

def _get_speaker_lock(self, speaker_key: str) -> asyncio.Lock:
    lock = self._speaker_locks.get(speaker_key)
    if lock is None:
        lock = asyncio.Lock()
        self._speaker_locks[speaker_key] = lock
    return lock

In practice this is unlikely with asyncio's single-threaded event loop (no preemption between the get and set), but dict.setdefault would make the intent explicit and be future-proof:

return self._speaker_locks.setdefault(speaker_key, asyncio.Lock())

`_speaker_locks` grows unboundedly

Every voice that ever gets cached adds a lock that is never removed — even after delete_voice. Over time on a long-running server with many uploaded voices this leaks memory. Consider removing the lock in invalidate_speaker_prompt_cache() or delete_voice, or switching to a bounded structure.
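
One possible shape for the cleanup, as a sketch under assumptions: `SpeakerLockRegistry` and `discard` are hypothetical names, and the rule "only drop a lock that nobody currently holds" is a design choice made here, not something the PR specifies.

```python
import asyncio


class SpeakerLockRegistry:
    """Hypothetical registry that creates locks lazily and frees them on delete."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = {}

    def get(self, speaker_key: str) -> asyncio.Lock:
        lock = self._locks.get(speaker_key)
        if lock is None:
            # No await between the lookup and the store, so coroutines on
            # one event loop cannot interleave here.
            lock = self._locks[speaker_key] = asyncio.Lock()
        return lock

    def discard(self, speaker_key: str) -> bool:
        """Drop the lock for a deleted speaker unless it is currently held."""
        lock = self._locks.get(speaker_key)
        if lock is not None and lock.locked():
            return False
        self._locks.pop(speaker_key, None)
        return True

    def __len__(self) -> int:
        return len(self._locks)
```

Calling `discard` from the delete path keeps the dict proportional to the number of live speakers rather than every speaker ever cached.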

`save_speaker_cache` mutates `speaker_info` dict in place

save_speaker_cache() writes cache_status, cache_file, and cache_generated_at directly into the speaker_info dict that lives in uploaded_speakers. This works because the caller holds the async lock, but it couples persistence logic to in-memory state mutation. If save_file() succeeds but os.replace() fails (e.g., cross-device rename), the metadata is never updated — good. But if os.replace succeeds and then speaker_info["cache_status"] = "ready" raises (impossible for dict, but fragile pattern), the file is orphaned. Consider returning the metadata updates and letting the caller apply them, or at minimum documenting that this method has side effects on speaker_info.

`wav_samples` sent as `list[float]` over IPC could be large

In _build_speaker_cache_payload, the entire waveform is converted to a Python list of floats (wav_np.tolist()) for the collective_rpc call. For a 10-second clip at 24kHz, that's 240,000 Python float objects serialized through msgspec. This works, but could be a latency/memory bottleneck for longer reference clips. Worth a comment noting this limitation, or consider base64-encoding the raw bytes if msgspec supports bytes.
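
The base64 alternative could look roughly like this. It is a sketch using the stdlib `array` module as a stand-in for the numpy waveform, with hypothetical helper names; whether the RPC layer accepts such a payload is not confirmed in this thread. It packs samples in native byte order, so both ends are assumed to run on the same machine.

```python
import base64
from array import array
from typing import Iterable


def encode_samples(samples: Iterable[float]) -> str:
    """Pack samples as 32-bit floats and return them as an ASCII string."""
    packed = array("f", samples)
    return base64.b64encode(packed.tobytes()).decode("ascii")


def decode_samples(encoded: str) -> array:
    """Inverse of encode_samples: rebuild the packed float32 buffer."""
    out = array("f")
    out.frombytes(base64.b64decode(encoded))
    return out


if __name__ == "__main__":
    # 10 s at 24 kHz: 240,000 samples * 4 bytes = 960,000 raw bytes.
    clip = array("f", bytes(4 * 240_000))
    payload = encode_samples(clip)
    print(len(payload), "base64 characters")
```

The payload is a single string object of about 1.28 million characters instead of 240,000 separate float objects, at the cost of an explicit decode step on the worker.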

Unused `_speaker_embedding_cache` field on `OmniOpenAIServingSpeech`

The new _speaker_embedding_cache: dict[str, list[float]] on the serving class is only used for direct-embedding voices. Meanwhile VoiceCacheManager has its own _speaker_prompt_cache for audio-cached voices. Having two separate caches for similar purposes is confusing. Consider consolidating, or at minimum add a comment explaining why they are separate.

Minor: `_as_singleton` removal in talker could break non-cached paths

In qwen3_tts_talker.py, the change removes the _as_singleton() call on ref_code:

-                ref_code = _as_singleton(voice_clone_prompt.get("ref_code"))
+                ref_code = voice_clone_prompt.get("ref_code")

If any existing code path (e.g., inline ref_audio without caching) previously relied on _as_singleton to unwrap a batch dimension, this change could silently break it. Please verify that all non-cached voice_clone_prompt payloads either don't set ref_code or already provide it in the expected shape.

Minor: f-string in `logger.exception`

logger.exception(f"Failed to create voice cache for '{name}': {e}")

Using f-string with logger.exception means the string is always formatted even if the log level is disabled. Use logger.exception("Failed to create voice cache for '%s': %s", name, e) for lazy formatting.
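
The difference is easy to observe with an object that counts how often it is rendered. This is a small standalone demo, not code from the PR; it uses `logger.debug` on a logger set to ERROR because that makes the disabled-level case visible, and the `Expensive` class is purely illustrative.

```python
import logging


class Expensive:
    """Counts how many times it is converted to a string."""

    def __init__(self) -> None:
        self.str_calls = 0

    def __str__(self) -> str:
        self.str_calls += 1
        return "expensive"


logger = logging.getLogger("lazy_logging_demo")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.ERROR)  # DEBUG records are dropped
logger.propagate = False

eager = Expensive()
lazy = Expensive()

# The f-string is rendered before logging even sees the call.
logger.debug(f"value: {eager}")
# With %-style arguments, formatting is deferred and never happens here.
logger.debug("value: %s", lazy)

print("eager str() calls:", eager.str_calls)
print("lazy str() calls: ", lazy.str_calls)
```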

Minor: test line lengths

Several test lines exceed 120 characters (e.g., _make_server, test_create_cache_processing_active). Not a blocker but worth a formatting pass for consistency with the rest of the codebase.

Positive notes

- The rollback logic (restoring previous_status on pre-save failures vs. setting "failed" on post-save failures) is well thought out.
- Atomic file replacement via tempfile + os.replace is the right pattern.
- Idempotency check (ready + valid cache → early return) prevents wasted GPU work.
- The force parameter for debugging/maintenance is a nice touch.
- Test coverage is comprehensive: error branches, rollback, handler contract, direct-embedding edge cases.

- Use setdefault for per-speaker cache lock creation
- Remove per-speaker cache locks when uploaded voices are deleted
- Document VoiceCacheManager side effects and cache ownership boundaries
- Note waveform list IPC cost for long reference audio
- Preserve ref_code compatibility without unwrapping cached list payloads
- Use lazy formatting for voice cache exception logging
- Extend delete-voice cache invalidation test coverage

Signed-off-by: reidliu41 <reid201711@gmail.com>

Signed-off-by: reidliu41 <reid201711@gmail.com>

…dpoint-1760

Signed-off-by: reidliu41 <reid201711@gmail.com>

reidliu41 · 2026-05-14T01:32:57Z

I reworked it based on the latest shared speaker-cache implementation.

The /v1/audio/voices/{name}/cache endpoint is still the public API, but internally it now pre-computes the Qwen3-TTS speaker embedding and optional ref_code, then stores them in the shared SpeakerEmbeddingCache. This removes the separate cache manager, metadata state machine, and persisted cache-file layer.

reidliu41 · 2026-05-14T01:35:18Z

@JuanPZuluaga @lishunyang12 Please take another look when you have time.

reidliu41 requested a review from hsliuustc0106 as a code owner April 2, 2026 12:57

chatgpt-codex-connector Bot reviewed Apr 2, 2026

reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 6619916 to c6e9e19 Compare April 2, 2026 13:09

lishunyang12 reviewed Apr 2, 2026

reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 52bca00 to e62bdfb Compare April 4, 2026 04:13

linyueqian reviewed Apr 5, 2026

linyueqian mentioned this pull request Apr 5, 2026

[Bugfix] Accept 'speaker' as alias for 'voice' in TTS speech API #2424

Merged

reidliu41 added 7 commits April 7, 2026 23:57

[Frontend] Fix Qwen3-TTS cached ref_code handling

4d8cc6f

Signed-off-by: reidliu41 <reid201711@gmail.com>

reidliu41 force-pushed the feat/voice-cache-endpoint-1760 branch from 2c1f285 to d1f8816 Compare April 7, 2026 16:05

reidliu41 requested review from linyueqian and lishunyang12 April 9, 2026 00:02

JuanPZuluaga mentioned this pull request Apr 9, 2026

[TTS][SpeakerCacheManager] A global speaker cache manager for Voice Cloning #2630

Merged

10 tasks

lishunyang12 reviewed Apr 16, 2026

reidliu41 added 2 commits April 17, 2026 08:29

fix conflicts

44b2bee

Signed-off-by: reidliu41 <reid201711@gmail.com>

Merge remote-tracking branch 'upstream/main' into feat/voice-cache-en…

23e6b29

…dpoint-1760

Signed-off-by: reidliu41 <reid201711@gmail.com>

reidliu41 requested review from ZeldaHuang, gcanlin, princepride, tzhouam, yenuo26 and yuanheng-zhao as code owners May 14, 2026 01:00


4 participants
