[Bugfix] Accept 'speaker' as alias for 'voice' in TTS speech API#2424

Merged
linyueqian merged 4 commits into vllm-project:main from marksverdhei:fix/voice-speaker-alias-1603 on Apr 7, 2026

Conversation

@marksverdhei
Contributor

@marksverdhei marksverdhei commented Apr 1, 2026

Purpose

Fix uploaded custom voices not working when referenced by name in TTS requests (resolves #1603).

The OpenAICreateSpeechRequest model only recognized the voice JSON key, but the example client (openai_speech_client.py) and users were sending speaker — which Pydantic silently dropped, causing the request to fail with:

Base task requires 'ref_audio' or 'speaker_embedding' for voice cloning

Additionally, voices uploaded via speaker_embedding (stored as safetensors) were incorrectly handled as audio files in _build_tts_params, which would fail when trying to base64-encode a safetensors binary.

Changes

  1. protocol/audio.py: Add validation_alias=AliasChoices("voice", "speaker") to the voice field so the API accepts both JSON keys. The alias is global across all TTS models (Qwen3, Voxtral, Fish Speech) since they all use voice for the speaker name.
  2. serving_speech.py:
    • Add _get_uploaded_speaker_embedding() helper with ImportError handling, missing-key guard, .squeeze() for [1, dim] tensor shapes, and _validate_path_within_directory() on cache_file.
    • Uploaded-embedding branch populates request.speaker_embedding and lets the existing code path handle voice_clone_prompt + x_vector_only_mode (unified, no duplication).
    • Validate cache_file readiness for embedding-uploaded voices to prevent fallthrough to audio branch on cache_status="pending".
  3. openai_speech_client.py: Update example client to use the canonical voice field name.
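
A minimal sketch of change 1, assuming Pydantic v2. `SpeechRequestSketch` and its `input` field are hypothetical stand-ins, not the real `OpenAICreateSpeechRequest` definition; only the `validation_alias=AliasChoices("voice", "speaker")` argument comes from the description above.

```python
from typing import Optional

from pydantic import AliasChoices, BaseModel, Field


class SpeechRequestSketch(BaseModel):
    """Hypothetical stand-in for the request model; not the real class."""

    input: str
    # Accept either JSON key on input; the Python attribute is always `voice`.
    voice: Optional[str] = Field(
        default=None,
        validation_alias=AliasChoices("voice", "speaker"),
    )


# Both payload spellings populate the same attribute.
from_alias = SpeechRequestSketch.model_validate({"input": "hi", "speaker": "alice"})
from_canonical = SpeechRequestSketch.model_validate({"input": "hi", "voice": "alice"})
```

Because `validation_alias` only affects parsing, dumping the model still emits the canonical `voice` key.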

Test Plan

8 new unit tests added to tests/entrypoints/openai_api/test_serving_speech.py:

  • test_speaker_alias_accepted_as_voice — verifies speaker JSON key maps to voice
  • test_voice_field_still_accepted — verifies canonical voice key still works
  • test_speaker_alias_in_base_task_with_uploaded_voice - speaker key + uploaded voice + Base task validation
  • test_build_tts_params_with_uploaded_voice_embedding — embedding-uploaded voices produce voice_clone_prompt
  • test_regression_1603_speaker_key_with_uploaded_audio_voice — full validate+build flow for audio voices
  • test_regression_1603_speaker_key_with_uploaded_embedding_voice — full validate+build flow for embedding voices
  • test_validate_rejects_embedding_voice_with_pending_cache — pending cache correctly rejected
  • test_x_vector_only_mode_not_overwritten_for_uploaded_embedding — embedding mode not clobbered by request
python -m pytest tests/entrypoints/openai_api/test_serving_speech.py -v
# All 120 tests pass

@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from ac7b3d1 to d9d3f88 Compare April 1, 2026 17:23
Collaborator

@lishunyang12 lishunyang12 left a comment

Looks good, left a couple of comments.

Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Comment thread vllm_omni/entrypoints/openai/serving_speech.py
Comment thread vllm_omni/entrypoints/openai/protocol/audio.py
@hsliuustc0106
Collaborator

any regression test?

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes interoperability issues in the OpenAI-compatible TTS Speech API by accepting speaker as an input alias for the canonical voice field, and by correctly handling uploaded voices that were created from pre-computed speaker embeddings (safetensors) during TTS parameter construction.

Changes:

  • Accept speaker as a validation alias for voice in OpenAICreateSpeechRequest.
  • Add logic to load uploaded safetensors embeddings and build voice_clone_prompt instead of treating them like audio files.
  • Update the example client to send voice (canonical key) and add regression/unit tests for both aliasing and embedding-uploaded voices.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vllm_omni/entrypoints/openai/serving_speech.py | Adds embedding-loading helper and updates _build_tts_params to support embedding-uploaded voices. |
| vllm_omni/entrypoints/openai/protocol/audio.py | Adds Pydantic alias support so speaker maps to voice on input. |
| tests/entrypoints/openai_api/test_serving_speech.py | Adds unit/regression tests covering speaker aliasing and embedding-uploaded voice handling. |
| examples/online_serving/qwen3_tts/openai_speech_client.py | Updates example payload to use voice instead of speaker. |

Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Collaborator

@linyueqian linyueqian left a comment

Thanks for tracking down both failure modes. The aliased key and the safetensors mis-routing are clearly described, and the regression tests are a nice addition. A few things need addressing before this is safe to merge.

Root cause note: the "speaker" drift came from #1963, which renamed the CLI flag and payload key in the example client without updating the protocol model. The fix here is correct, but the underlying fragility is that OpenAICreateSpeechRequest uses plain BaseModel with no extra policy, so Pydantic silently drops any unrecognized key. Other protocol models in this repo (protocol/videos.py) already use model_config = ConfigDict(extra="forbid"). Adding that here would surface this class of bug as a ValidationError at test time rather than a silent runtime failure. Worth doing in a follow-up if not here.
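
To illustrate the root cause note, here is a small sketch assuming Pydantic v2. `LenientRequest` and `StrictRequest` are hypothetical models written for this example, not classes from the repository; they only contrast the default extra-field policy with `ConfigDict(extra="forbid")`.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class LenientRequest(BaseModel):
    """Default policy: unknown keys are ignored without any error."""

    input: str
    voice: Optional[str] = None


class StrictRequest(BaseModel):
    """Same fields, but unknown keys raise a ValidationError."""

    model_config = ConfigDict(extra="forbid")

    input: str
    voice: Optional[str] = None


payload = {"input": "hi", "speaker": "alice"}

# The misspelled key disappears silently, so `voice` stays unset.
lenient = LenientRequest.model_validate(payload)

# With extra="forbid" the same payload fails loudly at validation time.
try:
    StrictRequest.model_validate(payload)
    strict_rejected = False
except ValidationError:
    strict_rejected = True
```

With the strict policy, a client that sends `speaker` to a model that only declares `voice` fails at validation rather than later in request handling.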

Comment thread vllm_omni/entrypoints/openai/protocol/audio.py
Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Comment thread vllm_omni/entrypoints/openai/serving_speech.py
Comment thread vllm_omni/entrypoints/openai/serving_speech.py
Comment thread vllm_omni/entrypoints/openai/serving_speech.py
Comment thread tests/entrypoints/openai_api/test_serving_speech.py
Comment thread tests/entrypoints/openai_api/test_serving_speech.py
@marksverdhei
Contributor Author

Thanks for the good reviews! 🙏 Will try to address asap

@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from 02dff32 to aeb2fad Compare April 4, 2026 08:00
@marksverdhai
Contributor

Thanks for the thorough reviews @linyueqian @lishunyang12! All feedback has been addressed in the latest push:

Blocking fixes:

  • Validation now checks cache_file readiness for embedding-uploaded voices, preventing fallthrough to the audio branch on cache_status="pending"
  • x_vector_only_mode set by uploaded embeddings is now guarded from being overwritten by later request-level parameter merging
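
The readiness check in the first bullet could look roughly like the sketch below. It is illustrative only: `check_embedding_voice_ready` is a hypothetical helper, the `speaker_info` mapping is assumed to carry `cache_status` and `cache_file` keys, and `"ready"` is an assumed status value (the PR text only names `"pending"`).

```python
from typing import Any, Mapping


def check_embedding_voice_ready(name: str, speaker_info: Mapping[str, Any]) -> str:
    """Return the cache file of an uploaded embedding voice, or raise if unusable."""
    status = speaker_info.get("cache_status")
    cache_file = speaker_info.get("cache_file")
    # "ready" is an assumed status value; the PR text only names "pending".
    if status != "ready" or not cache_file:
        # Reject here so the request never falls through to the audio branch.
        raise ValueError(
            f"Voice {name!r} is not ready (cache_status={status!r}); "
            "retry once its cache has been built."
        )
    return cache_file
```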

Important fixes:

  • ImportError for missing safetensors is now caught separately with a clear install message
  • Added guard for missing speaker_embedding key in safetensors files
  • Added .squeeze() to handle [1, dim] tensor shapes
  • Re: alias scope — the alias is intentionally global since all TTS models (Qwen3, Voxtral, Fish Speech) use request.voice for the speaker name
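
The three loading guards listed above (ImportError handling, missing-key check, `.squeeze()`) can be sketched as follows. This is not the PR's code: it uses NumPy and `safetensors.numpy.load_file` as a stand-in for whatever tensor backend the server uses, and the function names are hypothetical.

```python
import numpy as np


def load_safetensors(path: str) -> dict:
    """Load a safetensors file, with a clear message if the package is missing."""
    try:
        from safetensors.numpy import load_file
    except ImportError as exc:
        raise RuntimeError(
            "safetensors is required to load uploaded speaker embeddings; "
            "install it with `pip install safetensors`"
        ) from exc
    return load_file(path)


def extract_speaker_embedding(tensors: dict) -> np.ndarray:
    """Pull the embedding out of a loaded tensor dict and flatten [1, dim] to [dim]."""
    if "speaker_embedding" not in tensors:
        raise ValueError("safetensors file has no 'speaker_embedding' key")
    # squeeze() drops size-1 axes, so a [1, dim] tensor becomes [dim].
    embedding = np.asarray(tensors["speaker_embedding"]).squeeze()
    if embedding.ndim != 1:
        raise ValueError(f"expected a 1-D embedding, got shape {embedding.shape}")
    return embedding
```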

New tests:

  • test_validate_rejects_embedding_voice_with_pending_cache
  • test_x_vector_only_mode_not_overwritten_for_uploaded_embedding

All 104 tests pass, pre-commit clean.

@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from aeb2fad to 158eff1 Compare April 4, 2026 08:16
@marksverdhei marksverdhei requested a review from linyueqian April 4, 2026 19:09
@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from 158eff1 to 11c5ad4 Compare April 4, 2026 19:09
marksverdhei added a commit to heiervang-technologies/ht-vllm-omni that referenced this pull request Apr 5, 2026
…uploads

Cherry-pick of upstream vllm-project#2424:
- Add validation_alias=AliasChoices("voice", "speaker") to voice field
- Handle safetensors-uploaded voices correctly in _build_tts_params
- Add _get_uploaded_speaker_embedding method for embedding-based voices
- Validate embedding cache readiness for uploaded voices
- Fix request_id undefined in create_speech

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@linyueqian linyueqian left a comment

LGTM. All blocking and important feedback from the previous round has been addressed.

The global speaker alias is fine -- it's input-only (validation_alias), consistent across all TTS models, and the alternative (model-specific scoping) adds complexity for no real benefit.

One minor ask: consider adding a brief inline comment on the voice field noting the speaker alias is intentional across all TTS models, so future contributors don't second-guess it.

@linyueqian
Collaborator

linyueqian commented Apr 5, 2026

@JuanPZuluaga Quick heads-up: this PR adds _get_uploaded_speaker_embedding() which loads a safetensors embedding and sets voice_clone_prompt + x_vector_only_mode in _build_tts_params. This duplicates the existing request.speaker_embedding path from #1227 (lines 1120-1130 on main), which does the same thing but from an inline request field.

Could we unify these? For example, the uploaded-embedding branch could populate request.speaker_embedding with the loaded values and let the existing code handle the rest, rather than wiring voice_clone_prompt a second time. Would simplify the logic and reduce the surface for bugs like the x_vector_only_mode overwrite issue that was already caught in review.

Tagging you since you built the original embedding support -- would appreciate your thoughts.

@linyueqian
Collaborator

Correction on my previous comment: the original embedding upload/cache flow was built by @JuanPZuluaga in #2108 and #2046, not #1227. @JuanPZuluaga -- this PR adds a _get_uploaded_speaker_embedding() helper that loads safetensors and sets voice_clone_prompt + x_vector_only_mode, which overlaps with the existing request.speaker_embedding path. Would be good to get your input on whether these should be unified.

Collaborator

@linyueqian linyueqian left a comment

Follow-up on my earlier review -- blocking issues are resolved, two remaining items:

[important] Merge conflict with #2457

Both this PR and #2457 modify _build_tts_params with overlapping uploaded-voice embedding logic. Whichever lands second will need a non-trivial rebase. Worth coordinating merge order with @reidliu41.

[important] Missing path traversal check in _get_uploaded_speaker_embedding

cache_file from speaker_info is passed directly to load_file() without validating it resolves within the voice samples directory. A crafted metadata entry with cache_file="/etc/passwd" could read arbitrary files. #2457 includes this check via _validate_path_within_directory -- same pattern should be applied here.
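
For readers unfamiliar with this kind of check, a containment test can be sketched with the standard library alone. `resolve_within_directory` below is a hypothetical helper written for illustration; it is not the repository's `_validate_path_within_directory`, whose actual signature and behavior are not shown in this thread. It assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_within_directory(candidate: str, base_dir: str) -> Path:
    """Resolve `candidate` against `base_dir` and reject anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute candidate discards `base`, so "/etc/passwd"
    # resolves to itself and fails the containment test below.
    resolved = (base / candidate).resolve()
    if not resolved.is_relative_to(base):
        raise ValueError(f"{candidate!r} resolves outside {str(base)!r}")
    return resolved
```

Resolving before comparing matters: a plain string-prefix check would accept paths such as `voices/../../etc/passwd`.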

@JuanPZuluaga
Contributor

JuanPZuluaga commented Apr 6, 2026

I am wondering whether it makes sense to move all the naming conventions to "speaker" instead of "voice", since that aligns better with how the model works, which is via speaker embeddings. What do you think? @linyueqian

Otherwise, I think this is a good addition. I'd say it would be good to keep this PR very minimal, so that the alias fix is enough to make embedding voices loadable by name. But please add _validate_path_within_directory() on the cache_file path before merge (same pattern we use in upload_voice_embedding()).

@linyueqian
Collaborator

OpenAI uses voice in its API, so I think we should keep voice at the API boundary (OpenAI convention) and speaker internally (which is mostly what we have already).

@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from 5c7ff27 to 58857f8 Compare April 7, 2026 09:25
@marksverdhai
Contributor

@linyueqian Great suggestion — done in the latest push. The uploaded-embedding branch now populates request.speaker_embedding and lets the existing code path handle voice_clone_prompt + x_vector_only_mode, instead of duplicating that wiring.

@JuanPZuluaga Added _validate_path_within_directory() on the cache_file path as requested. The latest push keeps the PR minimal: alias fix + the reviewed embedding improvements.
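
The unification described in this comment can be pictured with the sketch below. Everything in it is hypothetical: `RequestSketch`, `build_tts_params_sketch`, the in-memory `uploaded_embeddings` lookup, and the shape of the `voice_clone_prompt` value are placeholders, not the real `_build_tts_params` logic. The point is only that the uploaded-voice branch fills in `speaker_embedding`, and a single downstream path wires the prompt.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RequestSketch:
    """Hypothetical request with only the fields this sketch needs."""

    voice: Optional[str] = None
    speaker_embedding: Optional[List[float]] = None


def build_tts_params_sketch(
    request: RequestSketch,
    uploaded_embeddings: Dict[str, List[float]],
) -> Dict[str, Any]:
    params: Dict[str, Any] = {}
    # Uploaded-embedding branch: only populate the request field.
    if request.speaker_embedding is None and request.voice in uploaded_embeddings:
        request.speaker_embedding = uploaded_embeddings[request.voice]
    # Single existing path: the one place that wires the prompt and the mode flag.
    if request.speaker_embedding is not None:
        # Placeholder prompt shape, chosen for illustration only.
        params["voice_clone_prompt"] = {"speaker_embedding": request.speaker_embedding}
        params["x_vector_only_mode"] = True
    return params
```

With one code path setting `x_vector_only_mode`, an inline embedding and a named uploaded voice yield the same parameters.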

@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch 2 times, most recently from 6a698fa to e7178e3 Compare April 7, 2026 11:15
…m-project#1603)

Fix uploaded custom voices not working when referenced by name in TTS
requests. The example client and users were sending 'speaker' in the
JSON payload, but the API only recognized 'voice', silently dropping
the voice name and failing with "Base task requires ref_audio or
speaker_embedding".

Also fix embedding-uploaded voices (via speaker_embedding) being
incorrectly treated as audio files in _build_tts_params.

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
- Guard x_vector_only_mode from being overwritten by request field
  when uploaded embedding already set it (blocking)
- Validate cache_file readiness for embedding-uploaded voices to
  prevent falling through to audio branch on pending cache (blocking)
- Catch ImportError separately for missing safetensors package
- Guard for missing speaker_embedding key in safetensors file
- Add .squeeze() to handle [1, dim] tensor shapes
- Add tests for pending cache rejection and x_vector_only_mode guard

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
- Populate request.speaker_embedding instead of manually wiring
  voice_clone_prompt, letting the existing code path handle it
- Add _validate_path_within_directory() check on cache_file to
  prevent path traversal (per review feedback)
- Revert the "voice_clone_prompt not in params" guard (no longer
  needed with unified path)

Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
@marksverdhei marksverdhei force-pushed the fix/voice-speaker-alias-1603 branch from e7178e3 to 9a1c62d Compare April 7, 2026 11:18
@marksverdhei
Contributor Author

All blockers cleared

> Follow-up on my earlier review -- blocking issues are resolved, two remaining items:
>
> [important] Merge conflict with #2457
>
> Both this PR and #2457 modify _build_tts_params with overlapping uploaded-voice embedding logic. Whichever lands second will need a non-trivial rebase. Worth coordinating merge order with @reidliu41.

@reidliu41 my PR has been approved and I have cleared all the blockers, so it should be good to merge.
As far as I know, this PR is more mature than #2457, so it makes more sense to merge this one first.
I recommend rebasing your branch onto this branch if you're blocked, unless it has already made its way to main by then. Please @ me for anything that requires my attention. Also, feel free to give my PR a review.

Cheers, and again thanks to all reviewers! 🙏

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 7, 2026
@linyueqian linyueqian enabled auto-merge (squash) April 7, 2026 14:37
@linyueqian linyueqian merged commit feefdae into vllm-project:main Apr 7, 2026
8 checks passed
reidliu41 added a commit to reidliu41/vllm-omni that referenced this pull request Apr 7, 2026
  - resolve the serving_speech conflict after vllm-project#2424 merged into main
  - keep the speaker-to-voice request alias from vllm-project#2424
  - preserve the uploaded voice cache endpoint and shared speaker cache flow
  - drop the stale direct-embedding helper left behind by the rebase

Signed-off-by: reidliu41 <reid201711@gmail.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…m-project#2424)

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
bob-021206 pushed a commit to jasonlee-1024/vllm-omni that referenced this pull request Apr 21, 2026
…m-project#2424)

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Signed-off-by: bob-021206 <binyan_github@163.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…m-project#2424)

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…m-project#2424)

Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Co-authored-by: marksverdhai <249650165+marksverdhai@users.noreply.github.com>
Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[RFC]: voice_clone for Qwen3-TTS Base model

7 participants