[Bugfix] Accept 'speaker' as alias for 'voice' in TTS speech API by marksverdhei · Pull Request #2424 · vllm-project/vllm-omni

marksverdhei · 2026-04-01T17:19:58Z

Purpose

Fix uploaded custom voices not working when referenced by name in TTS requests (resolves #1603).

The OpenAICreateSpeechRequest model only recognized the voice JSON key, but the example client (openai_speech_client.py) and users were sending speaker — which Pydantic silently dropped, causing the request to fail with:

Base task requires 'ref_audio' or 'speaker_embedding' for voice cloning

Additionally, voices uploaded via speaker_embedding (stored as safetensors) were incorrectly handled as audio files in _build_tts_params, which would fail when trying to base64-encode a safetensors binary.

Changes

protocol/audio.py: Add validation_alias=AliasChoices("voice", "speaker") to the voice field so the API accepts both JSON keys. The alias is global across all TTS models (Qwen3, Voxtral, Fish Speech) since they all use voice for the speaker name.
serving_speech.py:
- Add _get_uploaded_speaker_embedding() helper with ImportError handling, missing-key guard, .squeeze() for [1, dim] tensor shapes, and _validate_path_within_directory() on cache_file.
- Uploaded-embedding branch populates request.speaker_embedding and lets the existing code path handle voice_clone_prompt + x_vector_only_mode (unified, no duplication).
- Validate cache_file readiness for embedding-uploaded voices to prevent fallthrough to audio branch on cache_status="pending".
openai_speech_client.py: Update example client to use the canonical voice field name.
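
The alias change in protocol/audio.py can be illustrated with a minimal sketch. It assumes Pydantic v2; SpeechRequestSketch is a hypothetical, trimmed stand-in for OpenAICreateSpeechRequest, not the real model, and only the validation_alias line mirrors what the PR describes.

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class SpeechRequestSketch(BaseModel):
    """Hypothetical stand-in for OpenAICreateSpeechRequest (fields trimmed)."""

    input: str
    # Accept either JSON key on input; "voice" stays the canonical field name.
    voice: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("voice", "speaker"),
    )


canonical = SpeechRequestSketch.model_validate({"input": "hi", "voice": "my_voice"})
aliased = SpeechRequestSketch.model_validate({"input": "hi", "speaker": "my_voice"})

# Both payloads populate the same field, and serialization still emits "voice".
print(canonical.voice, aliased.voice)
print(aliased.model_dump())
```

Because validation_alias only affects input parsing, the serialized form keeps the canonical voice key.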

Test Plan

8 new unit tests added to tests/entrypoints/openai_api/test_serving_speech.py:

test_speaker_alias_accepted_as_voice — verifies speaker JSON key maps to voice
test_voice_field_still_accepted — verifies canonical voice key still works
test_speaker_alias_in_base_task_with_uploaded_voice — speaker key + uploaded voice + Base task validation
test_build_tts_params_with_uploaded_voice_embedding — embedding-uploaded voices produce voice_clone_prompt
test_regression_1603_speaker_key_with_uploaded_audio_voice — full validate+build flow for audio voices
test_regression_1603_speaker_key_with_uploaded_embedding_voice — full validate+build flow for embedding voices
test_validate_rejects_embedding_voice_with_pending_cache — pending cache correctly rejected
test_x_vector_only_mode_not_overwritten_for_uploaded_embedding — embedding mode not clobbered by request

python -m pytest tests/entrypoints/openai_api/test_serving_speech.py -v
# All 120 tests pass

lishunyang12

Looks good, left a couple of comments.

hsliuustc0106 · 2026-04-02T23:45:24Z

any regression test?

Copilot

Pull request overview

This PR fixes interoperability issues in the OpenAI-compatible TTS Speech API by accepting speaker as an input alias for the canonical voice field, and by correctly handling uploaded voices that were created from pre-computed speaker embeddings (safetensors) during TTS parameter construction.

Changes:

Accept speaker as a validation alias for voice in OpenAICreateSpeechRequest.
Add logic to load uploaded safetensors embeddings and build voice_clone_prompt instead of treating them like audio files.
Update the example client to send voice (canonical key) and add regression/unit tests for both aliasing and embedding-uploaded voices.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`vllm_omni/entrypoints/openai/serving_speech.py`	Adds embedding-loading helper and updates `_build_tts_params` to support embedding-uploaded voices.
`vllm_omni/entrypoints/openai/protocol/audio.py`	Adds Pydantic alias support so `speaker` maps to `voice` on input.
`tests/entrypoints/openai_api/test_serving_speech.py`	Adds unit/regression tests covering `speaker` aliasing and embedding-uploaded voice handling.
`examples/online_serving/qwen3_tts/openai_speech_client.py`	Updates example payload to use `voice` instead of `speaker`.

linyueqian

Thanks for tracking down both failure modes. The aliased key and the safetensors mis-routing are clearly described, and the regression tests are a nice addition. A few things need addressing before this is safe to merge.

Root cause note: the "speaker" drift came from #1963, which renamed the CLI flag and payload key in the example client without updating the protocol model. The fix here is correct, but the underlying fragility is that OpenAICreateSpeechRequest uses plain BaseModel with no extra policy, so Pydantic silently drops any unrecognized key. Other protocol models in this repo (protocol/videos.py) already use model_config = ConfigDict(extra="forbid"). Adding that here would surface this class of bug as a ValidationError at test time rather than a silent runtime failure. Worth doing in a follow-up if not here.
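
The difference between the default extra policy and extra="forbid" described above can be sketched as follows. This assumes Pydantic v2; LenientRequest and StrictRequest are hypothetical models used only for illustration and do not reflect the actual fields of OpenAICreateSpeechRequest.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class LenientRequest(BaseModel):
    """Default policy: unknown keys are ignored without any error."""

    voice: Optional[str] = None


class StrictRequest(BaseModel):
    """extra="forbid": unknown keys raise a ValidationError."""

    model_config = ConfigDict(extra="forbid")
    voice: Optional[str] = None


payload = {"speaker": "my_voice"}  # key the model does not declare

# The unknown key is dropped, so voice silently stays at its default.
lenient = LenientRequest.model_validate(payload)
print(lenient.voice)

# The same payload is rejected up front by the strict model.
try:
    StrictRequest.model_validate(payload)
    strict_rejected = False
except ValidationError as exc:
    strict_rejected = True
    print(exc.error_count(), "validation error(s)")
```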

marksverdhei · 2026-04-04T07:04:21Z

Thanks for the good reviews! 🙏 Will try to address them ASAP.

marksverdhai · 2026-04-04T08:01:36Z

Thanks for the thorough reviews @linyueqian @lishunyang12! All feedback has been addressed in the latest push:

Blocking fixes:

Validation now checks cache_file readiness for embedding-uploaded voices, preventing fallthrough to the audio branch on cache_status="pending"
x_vector_only_mode set by uploaded embeddings is now guarded from being overwritten by later request-level parameter merging
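
As a rough illustration of the cache-readiness check, the sketch below rejects an embedding-uploaded voice whose cache is not usable yet. The function name, the error messages, and the assumption that speaker_info is a plain mapping are all hypothetical; only the cache_status="pending" value and the cache_file key come from this thread.

```python
from typing import Any, Mapping, Optional


def embedding_voice_error(speaker_info: Mapping[str, Any]) -> Optional[str]:
    """Return an error message if the cached embedding is not usable, else None.

    Hypothetical helper: the real validation lives in serving_speech.py and
    may be structured differently.
    """
    if speaker_info.get("cache_status") == "pending":
        # Reject here instead of falling through to the audio branch.
        return "voice embedding is still being cached; retry later"
    if not speaker_info.get("cache_file"):
        return "voice has no cache_file; re-upload the embedding"
    return None


print(embedding_voice_error({"cache_status": "pending", "cache_file": None}))
```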

Important fixes:

ImportError for missing safetensors is now caught separately with a clear install message
Added guard for missing speaker_embedding key in safetensors files
Added .squeeze() to handle [1, dim] tensor shapes
Re: alias scope — the alias is intentionally global since all TTS models (Qwen3, Voxtral, Fish Speech) use request.voice for the speaker name

New tests:

test_validate_rejects_embedding_voice_with_pending_cache
test_x_vector_only_mode_not_overwritten_for_uploaded_embedding

All 104 tests pass, pre-commit clean.
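
The three loader guards listed above (separate ImportError handling, a missing-key check, and .squeeze()) might look roughly like the sketch below. It is not the PR's _get_uploaded_speaker_embedding(); the function name, the injectable loader parameter, and the error messages are assumptions made so the sketch runs without safetensors installed.

```python
from typing import Any, Callable, Dict, Optional


def load_speaker_embedding(
    cache_file: str,
    loader: Optional[Callable[[str], Dict[str, Any]]] = None,
) -> Any:
    """Load a speaker embedding from a safetensors file, with defensive guards.

    `loader` is a hypothetical injection point for testing; by default the
    function uses safetensors.torch.load_file.
    """
    if loader is None:
        try:
            from safetensors.torch import load_file
        except ImportError as exc:
            # Surface a clear install hint instead of a generic failure.
            raise RuntimeError(
                "safetensors is required for embedding voices: pip install safetensors"
            ) from exc
        loader = load_file

    tensors = loader(cache_file)
    if "speaker_embedding" not in tensors:
        raise ValueError(f"{cache_file} has no 'speaker_embedding' tensor")

    # Collapse a [1, dim] tensor to [dim].
    return tensors["speaker_embedding"].squeeze()
```

Passing a fake loader in tests exercises the key guard and the squeeze step without touching the filesystem.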

…uploads

Cherry-pick of upstream vllm-project#2424:
- Add validation_alias=AliasChoices("voice", "speaker") to voice field
- Handle safetensors-uploaded voices correctly in _build_tts_params
- Add _get_uploaded_speaker_embedding method for embedding-based voices
- Validate embedding cache readiness for uploaded voices
- Fix request_id undefined in create_speech

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

linyueqian

LGTM. All blocking and important feedback from the previous round has been addressed.

The global speaker alias is fine -- it's input-only (validation_alias), consistent across all TTS models, and the alternative (model-specific scoping) adds complexity for no real benefit.

One minor ask: consider adding a brief inline comment on the voice field noting the speaker alias is intentional across all TTS models, so future contributors don't second-guess it.

linyueqian · 2026-04-05T19:03:23Z

@JuanPZuluaga Quick heads-up: this PR adds _get_uploaded_speaker_embedding() which loads a safetensors embedding and sets voice_clone_prompt + x_vector_only_mode in _build_tts_params. This duplicates the existing request.speaker_embedding path from #1227 (lines 1120-1130 on main), which does the same thing but from an inline request field.

Could we unify these? For example, the uploaded-embedding branch could populate request.speaker_embedding with the loaded values and let the existing code handle the rest, rather than wiring voice_clone_prompt a second time. Would simplify the logic and reduce the surface for bugs like the x_vector_only_mode overwrite issue that was already caught in review.

Tagging you since you built the original embedding support -- would appreciate your thoughts.

linyueqian · 2026-04-05T19:04:37Z

Correction on my previous comment: the original embedding upload/cache flow was built by @JuanPZuluaga in #2108 and #2046, not #1227. @JuanPZuluaga -- this PR adds a _get_uploaded_speaker_embedding() helper that loads safetensors and sets voice_clone_prompt + x_vector_only_mode, which overlaps with the existing request.speaker_embedding path. Would be good to get your input on whether these should be unified.

linyueqian

Follow-up on my earlier review -- blocking issues are resolved, two remaining items:

[important] Merge conflict with #2457

Both this PR and #2457 modify _build_tts_params with overlapping uploaded-voice embedding logic. Whichever lands second will need a non-trivial rebase. Worth coordinating merge order with @reidliu41.

[important] Missing path traversal check in _get_uploaded_speaker_embedding

cache_file from speaker_info is passed directly to load_file() without validating it resolves within the voice samples directory. A crafted metadata entry with cache_file="/etc/passwd" could read arbitrary files. #2457 includes this check via _validate_path_within_directory -- same pattern should be applied here.
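
A containment check of the kind requested here could be sketched as below. This is a hypothetical stand-in, not the repository's _validate_path_within_directory (whose signature and behavior are not shown in this thread); it resolves the candidate against the base directory and rejects anything that ends up outside it.

```python
from pathlib import Path


def validate_path_within_directory(candidate: str, base_dir: str) -> Path:
    """Resolve `candidate` under `base_dir` and reject paths that escape it.

    Hypothetical sketch; the real helper may differ.
    """
    base = Path(base_dir).resolve()
    # Joining with an absolute candidate discards `base`, so "/etc/passwd"
    # resolves outside the directory and is caught below.
    resolved = (base / candidate).resolve()
    try:
        resolved.relative_to(base)
    except ValueError:
        raise ValueError(f"{candidate!r} resolves outside {base}") from None
    return resolved
```

The check runs before the file is opened, so a crafted cache_file such as "/etc/passwd" or "../outside.safetensors" fails with a ValueError rather than reaching load_file().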

JuanPZuluaga · 2026-04-06T15:42:00Z

I am wondering whether it makes sense to move all the naming conventions to "speaker" instead of "voice", since that aligns better with how the model works, which is via speaker embeddings. What do you think, @linyueqian?

Otherwise, I think this is a good addition, and I'd say it would be good to keep this PR very minimal, so the alias fix is enough to make embedding voices loadable by name. But please add _validate_path_within_directory() on the cache_file path before merge (same pattern we use in upload_voice_embedding()).

linyueqian · 2026-04-06T17:10:54Z

OpenAI uses voice in its API, so I think we should keep voice at the API boundary (OpenAI convention) and speaker internally (which is mostly what we already have).

marksverdhai · 2026-04-07T09:26:07Z

@linyueqian Great suggestion — done in the latest push. The uploaded-embedding branch now populates request.speaker_embedding and lets the existing code path handle voice_clone_prompt + x_vector_only_mode, instead of duplicating that wiring.

@JuanPZuluaga Added _validate_path_within_directory() on the cache_file path as requested. The latest push keeps the PR minimal: alias fix + the reviewed embedding improvements.

…m-project#1603)

Fix uploaded custom voices not working when referenced by name in TTS requests. The example client and users were sending 'speaker' in the JSON payload, but the API only recognized 'voice', silently dropping the voice name and failing with "Base task requires ref_audio or speaker_embedding".

Also fix embedding-uploaded voices (via speaker_embedding) being incorrectly treated as audio files in _build_tts_params.

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

- Guard x_vector_only_mode from being overwritten by request field when uploaded embedding already set it (blocking)
- Validate cache_file readiness for embedding-uploaded voices to prevent falling through to audio branch on pending cache (blocking)
- Catch ImportError separately for missing safetensors package
- Guard for missing speaker_embedding key in safetensors file
- Add .squeeze() to handle [1, dim] tensor shapes
- Add tests for pending cache rejection and x_vector_only_mode guard

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

- Populate request.speaker_embedding instead of manually wiring voice_clone_prompt, letting the existing code path handle it
- Add _validate_path_within_directory() check on cache_file to prevent path traversal (per review feedback)
- Revert the "voice_clone_prompt not in params" guard (no longer needed with unified path)

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

marksverdhei · 2026-04-07T11:26:49Z

All blockers cleared

> Follow-up on my earlier review -- blocking issues are resolved, two remaining items:
>
> [important] Merge conflict with #2457
>
> Both this PR and #2457 modify _build_tts_params with overlapping uploaded-voice embedding logic. Whichever lands second will need a non-trivial rebase. Worth coordinating merge order with @reidliu41.

@reidliu41 my PR has been approved and I have cleared all the blockers, so it should be good to merge.
AFAIK this PR is more mature than #2457, so it makes more sense to merge this one first.
I recommend rebasing your branch onto this branch if you're blocked, unless it has already made its way to main by then. Please @ me for anything that requires my attention. Also, feel free to review my PR.

Cheers, and again thanks to all reviewers! 🙏

- resolve the serving_speech conflict after vllm-project#2424 merged into main
- keep the speaker-to-voice request alias from vllm-project#2424
- preserve the uploaded voice cache endpoint and shared speaker cache flow
- drop the stale direct-embedding helper left behind by the rebase

Signed-off-by: reidliu41 <reid201711@gmail.com>

…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

…m-project#2424) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com> Signed-off-by: bob-021206 <binyan_github@163.com>

marksverdhei requested a review from hsliuustc0106 as a code owner April 1, 2026 17:20

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from ac7b3d1 to d9d3f88 Compare April 1, 2026 17:23

lishunyang12 reviewed Apr 2, 2026

Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated

Comment thread vllm_omni/entrypoints/openai/serving_speech.py

Comment thread vllm_omni/entrypoints/openai/protocol/audio.py

hsliuustc0106 requested review from ZeldaHuang, Copilot and linyueqian April 2, 2026 23:45

Copilot started reviewing on behalf of hsliuustc0106 April 2, 2026 23:45

Copilot AI reviewed Apr 2, 2026

Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated

Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated

linyueqian requested changes Apr 3, 2026

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from 02dff32 to aeb2fad Compare April 4, 2026 08:00

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from aeb2fad to 158eff1 Compare April 4, 2026 08:16

marksverdhei requested a review from linyueqian April 4, 2026 19:09

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from 158eff1 to 11c5ad4 Compare April 4, 2026 19:09

marksverdhei mentioned this pull request Apr 5, 2026

fix(qwen3-tts): accept 'speaker' as alias for 'voice', fix embedding uploads heiervang-technologies/ht-vllm-omni#37

Merged

3 tasks

linyueqian approved these changes Apr 5, 2026

linyueqian reviewed Apr 5, 2026

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from 5c7ff27 to 58857f8 Compare April 7, 2026 09:25

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from 6a698fa to e7178e3 Compare April 7, 2026 11:15

marksverdhai added 4 commits April 7, 2026 13:17

Add inline comment clarifying global scope of speaker alias

9a1c62d

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>

marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from e7178e3 to 9a1c62d Compare April 7, 2026 11:18

linyueqian added the ready (label to trigger buildkite CI) label Apr 7, 2026

linyueqian enabled auto-merge (squash) April 7, 2026 14:37

linyueqian merged commit feefdae into vllm-project:main Apr 7, 2026
8 checks passed

linyueqian mentioned this pull request Apr 9, 2026

[Frontend] Add voice clone prompt cache endpoint for Qwen3-TTS (#1760) #2457

Open

5 tasks
