[Bugfix] Accept 'speaker' as alias for 'voice' in TTS speech API#2424
Conversation
ac7b3d1 to
d9d3f88
Compare
lishunyang12
left a comment
There was a problem hiding this comment.
Looks good, left a couple of comments.
|
any regression test? |
There was a problem hiding this comment.
Pull request overview
This PR fixes interoperability issues in the OpenAI-compatible TTS Speech API by accepting speaker as an input alias for the canonical voice field, and by correctly handling uploaded voices that were created from pre-computed speaker embeddings (safetensors) during TTS parameter construction.
Changes:
- Accept
speakeras a validation alias forvoiceinOpenAICreateSpeechRequest. - Add logic to load uploaded safetensors embeddings and build
voice_clone_promptinstead of treating them like audio files. - Update the example client to send
voice(canonical key) and add regression/unit tests for both aliasing and embedding-uploaded voices.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
vllm_omni/entrypoints/openai/serving_speech.py |
Adds embedding-loading helper and updates _build_tts_params to support embedding-uploaded voices. |
vllm_omni/entrypoints/openai/protocol/audio.py |
Adds Pydantic alias support so speaker maps to voice on input. |
tests/entrypoints/openai_api/test_serving_speech.py |
Adds unit/regression tests covering speaker aliasing and embedding-uploaded voice handling. |
examples/online_serving/qwen3_tts/openai_speech_client.py |
Updates example payload to use voice instead of speaker. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
linyueqian
left a comment
There was a problem hiding this comment.
Thanks for tracking down both failure modes. The aliased key and the safetensors mis-routing are clearly described, and the regression tests are a nice addition. A few things need addressing before this is safe to merge.
Root cause note: the "speaker" drift came from #1963, which renamed the CLI flag and payload key in the example client without updating the protocol model. The fix here is correct, but the underlying fragility is that OpenAICreateSpeechRequest uses plain BaseModel with no extra policy, so Pydantic silently drops any unrecognized key. Other protocol models in this repo (protocol/videos.py) already use model_config = ConfigDict(extra="forbid"). Adding that here would surface this class of bug as a ValidationError at test time rather than a silent runtime failure. Worth doing in a follow-up if not here.
|
Thanks for the good reviews! 🙏 Will try to address asap |
02dff32 to
aeb2fad
Compare
|
Thanks for the thorough reviews @linyueqian @lishunyang12! All feedback has been addressed in the latest push: Blocking fixes:
Important fixes:
New tests:
All 104 tests pass, pre-commit clean. |
aeb2fad to
158eff1
Compare
158eff1 to
11c5ad4
Compare
…uploads Cherry-pick of upstream vllm-project#2424: - Add validation_alias=AliasChoices("voice", "speaker") to voice field - Handle safetensors-uploaded voices correctly in _build_tts_params - Add _get_uploaded_speaker_embedding method for embedding-based voices - Validate embedding cache readiness for uploaded voices - Fix request_id undefined in create_speech Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
linyueqian
left a comment
There was a problem hiding this comment.
LGTM. All blocking and important feedback from the previous round has been addressed.
The global speaker alias is fine -- it's input-only (validation_alias), consistent across all TTS models, and the alternative (model-specific scoping) adds complexity for no real benefit.
One minor ask: consider adding a brief inline comment on the voice field noting the speaker alias is intentional across all TTS models, so future contributors don't second-guess it.
|
@JuanPZuluaga Quick heads-up: this PR adds Could we unify these? For example, the uploaded-embedding branch could populate Tagging you since you built the original embedding support -- would appreciate your thoughts. |
|
Correction on my previous comment: the original embedding upload/cache flow was built by @JuanPZuluaga in #2108 and #2046, not #1227. @JuanPZuluaga -- this PR adds a |
linyueqian
left a comment
There was a problem hiding this comment.
Follow-up on my earlier review -- blocking issues are resolved, two remaining items:
[important] Merge conflict with #2457
Both this PR and #2457 modify _build_tts_params with overlapping uploaded-voice embedding logic. Whichever lands second will need a non-trivial rebase. Worth coordinating merge order with @reidliu41.
[important] Missing path traversal check in _get_uploaded_speaker_embedding
cache_file from speaker_info is passed directly to load_file() without validating it resolves within the voice samples directory. A crafted metadata entry with cache_file="/etc/passwd" could read arbitrary files. #2457 includes this check via _validate_path_within_directory -- same pattern should be applied here.
|
I am wondering whether it makes sense to move all the naming conventions to "speaker" instead of "voice". As it aligns better to how the model works, which is via speaker embeddings. What do you think? @linyueqian Otherwise, i think this is a good addition, and i'd say would be good to keep this PR very minimal so the alias fix is enough to make embedding voices loadable by name. But please add |
|
OpenAI use voice for API so i think we should keep voice at the API boundary (OpenAI convention) and speaker internally (which is mostly what we have already). |
5c7ff27 to
58857f8
Compare
|
@linyueqian Great suggestion — done in the latest push. The uploaded-embedding branch now populates @JuanPZuluaga Added |
6a698fa to
e7178e3
Compare
…m-project#1603) Fix uploaded custom voices not working when referenced by name in TTS requests. The example client and users were sending 'speaker' in the JSON payload, but the API only recognized 'voice', silently dropping the voice name and failing with "Base task requires ref_audio or speaker_embedding". Also fix embedding-uploaded voices (via speaker_embedding) being incorrectly treated as audio files in _build_tts_params. Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
- Guard x_vector_only_mode from being overwritten by request field when uploaded embedding already set it (blocking) - Validate cache_file readiness for embedding-uploaded voices to prevent falling through to audio branch on pending cache (blocking) - Catch ImportError separately for missing safetensors package - Guard for missing speaker_embedding key in safetensors file - Add .squeeze() to handle [1, dim] tensor shapes - Add tests for pending cache rejection and x_vector_only_mode guard Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
- Populate request.speaker_embedding instead of manually wiring voice_clone_prompt, letting the existing code path handle it - Add _validate_path_within_directory() check on cache_file to prevent path traversal (per review feedback) - Revert the "voice_clone_prompt not in params" guard (no longer needed with unified path) Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
e7178e3 to
9a1c62d
Compare
All blockers cleared
@reidliu41 my PR has been approved and I have cleared all the blockers so it should be good to merge. Cheers, and again thanks to all revieweres! 🙏 |
- resolve the serving_speech conflict after vllm-project#2424 merged into main - keep the speaker-to-voice request alias from vllm-project#2424 - preserve the uploaded voice cache endpoint and shared speaker cache flow - drop the stale direct-embedding helper left behind by the rebase Signed-off-by: reidliu41 <reid201711@gmail.com>
…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Signed-off-by: bob-021206 <binyan_github@163.com>
…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Purpose
Fix uploaded custom voices not working when referenced by name in TTS requests (resolves #1603).
The
OpenAICreateSpeechRequestmodel only recognized thevoiceJSON key, but the example client (openai_speech_client.py) and users were sendingspeaker— which Pydantic silently dropped, causing the request to fail with:Additionally, voices uploaded via
speaker_embedding(stored as safetensors) were incorrectly handled as audio files in_build_tts_params, which would fail when trying to base64-encode a safetensors binary.Changes
protocol/audio.py: Addvalidation_alias=AliasChoices("voice", "speaker")to thevoicefield so the API accepts both JSON keys. The alias is global across all TTS models (Qwen3, Voxtral, Fish Speech) since they all usevoicefor the speaker name.serving_speech.py:_get_uploaded_speaker_embedding()helper withImportErrorhandling, missing-key guard,.squeeze()for[1, dim]tensor shapes, and_validate_path_within_directory()on cache_file.request.speaker_embeddingand lets the existing code path handlevoice_clone_prompt+x_vector_only_mode(unified, no duplication).cache_filereadiness for embedding-uploaded voices to prevent fallthrough to audio branch oncache_status="pending".openai_speech_client.py: Update example client to use the canonicalvoicefield name.Test Plan
8 new unit tests added to
tests/entrypoints/openai_api/test_serving_speech.py:test_speaker_alias_accepted_as_voice— verifiesspeakerJSON key maps tovoicetest_voice_field_still_accepted— verifies canonicalvoicekey still workstest_speaker_alias_in_base_task_with_uploaded_voice—speakerkey + uploaded voice + Base task validationtest_build_tts_params_with_uploaded_voice_embedding— embedding-uploaded voices producevoice_clone_prompttest_regression_1603_speaker_key_with_uploaded_audio_voice— full validate+build flow for audio voicestest_regression_1603_speaker_key_with_uploaded_embedding_voice— full validate+build flow for embedding voicestest_validate_rejects_embedding_voice_with_pending_cache— pending cache correctly rejectedtest_x_vector_only_mode_not_overwritten_for_uploaded_embedding— embedding mode not clobbered by requestpython -m pytest tests/entrypoints/openai_api/test_serving_speech.py -v # All 120 tests pass