Skip to content

[Refactor] Normalize speech request handling in serving_speech#2757

Open
reidliu41 wants to merge 3 commits intovllm-project:mainfrom
reidliu41:feat/speech-request-normalization
Open

[Refactor] Normalize speech request handling in serving_speech#2757
reidliu41 wants to merge 3 commits intovllm-project:mainfrom
reidliu41:feat/speech-request-normalization

Conversation

@reidliu41
Copy link
Copy Markdown
Contributor

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Refactor vllm_omni/entrypoints/openai/serving_speech.py to normalize shared speech request
handling before validation and model-specific generation setup.

This change introduces a small normalization layer for:

  • case-insensitive voice lookup
  • uploaded voice resolution
  • uploaded embedding resolution
  • Base-task inference for Qwen-style voice cloning inputs
  • preserving the distinction between explicit clone inputs and auto-resolved uploaded-voice inputs

The goal is to reduce repeated request-shaping logic across validation and generation paths
without changing the external API contract.

Test Plan

  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech.py
  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech_stream.py
  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api

Test Result

 ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech.py

............................................................................................ [ 71%]
....................................                                                         [100%]
128 passed, 17 warnings in 0.90s


 ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech_stream.py
...............                                                                              [100%]
15 passed, 17 warnings in 0.50s


./.venv/bin/python -m pytest -q tests/entrypoints/openai_api
............................................................................................ [ 32%]
............................................................................................ [ 65%]
............................................................................................ [ 98%]
....                                                                                         [100%]
280 passed, 19 warnings in 3.35s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

  - add a shared SpeechRequestNormalized helper for speech request canonicalization
  - centralize uploaded voice resolution for audio and embedding-backed voices
  - preserve explicit-vs-auto-resolved clone input semantics across validation and generation
  - route TTS validation, param building, and prepare_speech_generation through the normalized path
  - add regression tests for uploaded voice normalization and prepare_speech_generation

Signed-off-by: reidliu41 <reid201711@gmail.com>
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@hsliuustc0106
Copy link
Copy Markdown
Collaborator

BLOCKER scan:

  • Correctness: PASS
  • Reliability/Safety: PASS
  • Breaking Changes: PASS (internal refactoring)
  • Test Coverage: PASS (comprehensive new tests added)
  • Documentation: PASS (no user-facing changes)
  • Security: PASS

OVERALL: NO BLOCKERS

VERDICT: COMMENT

Solid refactoring. The SpeechRequestNormalized dataclass provides a clean single source of truth for all normalized fields. The separation of normalization from application is good design.

The test coverage is excellent - the 4 new tests cover all the important normalization cases.

@tzhouam tzhouam self-requested a review April 15, 2026 03:17
Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [Refactor] Normalize speech request handling in serving_speech

Good direction -- centralizing the repeated uploaded-voice resolution, case-insensitive lookup, and task_type inference into a single _normalize_speech_request is a worthwhile cleanup. The new SpeechRequestNormalized dataclass is clean and the test coverage for the normalizer itself is solid.

However, I have a few concerns that should be addressed before merging:


1. (Bug) Speaker name no longer lowercased for non-uploaded voices -- behavioral regression

Previously, both _validate_qwen_tts_request and _validate_voxtral_tts_request mutated request.voice = request.voice.lower() before _build_tts_params ran, so params["speaker"] always received the lowercased voice name.

After this PR, _apply_normalized_speech_request does not write back the lowercased voice to request.voice, and _build_tts_params now sends normalized.voice (original case) for non-uploaded voices:

speaker_value = (
    normalized.voice_lookup if normalized.uploaded_speaker_info is not None else normalized.voice
)
params["speaker"] = [speaker_value]

For a request with voice="Ryan", the old code sent speaker="ryan", the new code sends speaker="Ryan". If the model's speaker map uses lowercase keys, this is a silent regression. The fix is straightforward: use normalized.voice_lookup unconditionally (or at least for the speaker param), or have _apply_normalized_speech_request write back the lowercased voice.


2. (Performance / clarity) _normalize_speech_request is called up to 3 times per request

A single request through _prepare_speech_generation triggers:

  1. _prepare_speech_generation itself calls _normalize_speech_request + _apply_normalized_speech_request
  2. The validation method (e.g., _validate_qwen_tts_request) calls it again
  3. _build_tts_params calls it a third time

Each call re-instantiates the dataclass and potentially re-runs _get_uploaded_audio_data / _get_uploaded_speaker_embedding (though the second/third calls avoid the worst of it via the _auto_resolved_upload_* sentinel attributes). This is not a correctness issue today thanks to the sentinel guards, but it is fragile and wasteful:

  • The _auto_resolved_upload_* sentinel is set via object.__setattr__ on a likely Pydantic/msgspec model, which is an unusual pattern that could break if the request model gains __slots__ or stricter validation.
  • Reading audio files or safetensors on repeated calls is wasted I/O if the sentinel check ever regresses.

Suggestion: Normalize once at the top of _prepare_speech_generation and thread the SpeechRequestNormalized object through validation and build methods instead of re-deriving it in each function. This would make the flow easier to reason about and eliminate the sentinel hack.


3. (Nit) _validate_fish_tts_request error path changed subtly

Old code checked whether the uploaded audio file existed on disk via Path(...).exists() before attempting to load it, giving a specific "not found on disk" error. The new code relies on _normalize_speech_request calling _get_uploaded_audio_data, and if that returns None, the validation check is:

if normalized.uploaded_speaker_info is not None and request.ref_audio is None and normalized.ref_audio is None:
    return f"Could not load audio for uploaded voice '{request.voice}'"

This loses the distinction between "file not found" and "file found but failed to load." Not critical, but worth noting for debuggability.


4. (Nit) Test for double-normalization idempotency is missing

Given that _normalize_speech_request is called multiple times per request, it would be valuable to have a test that explicitly asserts the second normalization (after _apply_normalized_speech_request has mutated the request) produces identical results to the first. This would catch regressions if the sentinel attribute approach breaks.


Summary

The core refactoring idea is sound, but issue #1 is a correctness regression that needs fixing before merge. Issue #2 is a design concern worth addressing now to avoid future bugs -- if not refactored now, at minimum add a comment explaining the multi-call pattern and sentinel attributes.

@lishunyang12 lishunyang12 dismissed their stale review April 16, 2026 14:56

Replacing with inline comments

  Reuse the normalized speech request through validation and TTS parameter
  building to avoid repeated uploaded-voice resolution.

  Preserve canonical lowercase speaker IDs, restore the specific Fish Speech
  missing-uploaded-audio error, and keep the latest VoxCPM TTS parameter path
  compatible with the normalized builder.

Signed-off-by: reidliu41 <reid201711@gmail.com>
…-normalization

Signed-off-by: reidliu41 <reid201711@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants