[Refactor] Normalize speech request handling in serving_speech by reidliu41 · Pull Request #2757 · vllm-project/vllm-omni

reidliu41 · 2026-04-14T01:22:18Z

Purpose

Refactor vllm_omni/entrypoints/openai/serving_speech.py to normalize shared speech request
handling before validation and model-specific generation setup.

This change introduces a small normalization layer for:

- case-insensitive voice lookup
- uploaded voice resolution
- uploaded embedding resolution
- Base-task inference for Qwen-style voice cloning inputs
- preserving the distinction between explicit clone inputs and auto-resolved uploaded-voice inputs

The goal is to reduce repeated request-shaping logic across validation and generation paths
without changing the external API contract.
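
For readers without the diff open, the normalization layer described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: the fields `voice`, `voice_lookup`, `uploaded_speaker_info`, and `ref_audio` are named in the review comments further down, while `ref_audio_auto_resolved`, the `normalize_speech_request` signature, and the `uploaded_voices` dict are hypothetical stand-ins.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SpeechRequestNormalized:
    """Hypothetical shape; only some field names are confirmed by the review."""

    voice: Optional[str]  # voice exactly as the caller supplied it
    voice_lookup: Optional[str]  # lowercased key used for lookups
    uploaded_speaker_info: Optional[dict[str, Any]]  # set when the voice is an upload
    ref_audio: Optional[str]  # explicit or auto-resolved reference audio
    ref_audio_auto_resolved: bool  # assumed flag: True if ref_audio came from an upload


def normalize_speech_request(
    voice: Optional[str],
    ref_audio: Optional[str],
    uploaded_voices: dict[str, dict[str, Any]],
) -> SpeechRequestNormalized:
    # Case-insensitive lookup: uploaded voices are assumed to be keyed in lowercase.
    voice_lookup = voice.lower() if voice else None
    info = uploaded_voices.get(voice_lookup) if voice_lookup else None

    # Only fall back to the uploaded audio when the caller gave no explicit clone input.
    auto_resolved = False
    if ref_audio is None and info is not None and info.get("audio") is not None:
        ref_audio = info["audio"]
        auto_resolved = True

    return SpeechRequestNormalized(voice, voice_lookup, info, ref_audio, auto_resolved)
```

Keeping the auto-resolved flag on the normalized object is one way to preserve the explicit-vs-auto distinction without mutating the incoming request.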

Test Plan

  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech.py
  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech_stream.py
  ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api

Test Result

 ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech.py

............................................................................................ [ 71%]
....................................                                                         [100%]
128 passed, 17 warnings in 0.90s


 ./.venv/bin/python -m pytest -q tests/entrypoints/openai_api/test_serving_speech_stream.py
...............                                                                              [100%]
15 passed, 17 warnings in 0.50s


./.venv/bin/python -m pytest -q tests/entrypoints/openai_api
............................................................................................ [ 32%]
............................................................................................ [ 65%]
............................................................................................ [ 98%]
....                                                                                         [100%]
280 passed, 19 warnings in 3.35s

- add a shared SpeechRequestNormalized helper for speech request canonicalization
- centralize uploaded voice resolution for audio and embedding-backed voices
- preserve explicit-vs-auto-resolved clone input semantics across validation and generation
- route TTS validation, param building, and prepare_speech_generation through the normalized path
- add regression tests for uploaded voice normalization and prepare_speech_generation

Signed-off-by: reidliu41 <reid201711@gmail.com>

hsliuustc0106 · 2026-04-14T09:36:42Z

BLOCKER scan:

Correctness: PASS
Reliability/Safety: PASS
Breaking Changes: PASS (internal refactoring)
Test Coverage: PASS (comprehensive new tests added)
Documentation: PASS (no user-facing changes)
Security: PASS

OVERALL: NO BLOCKERS

VERDICT: COMMENT

Solid refactoring. The SpeechRequestNormalized dataclass provides a clean single source of truth for all normalized fields. The separation of normalization from application is good design.

The test coverage is excellent - the 4 new tests cover all the important normalization cases.

lishunyang12

Review: [Refactor] Normalize speech request handling in serving_speech

Good direction -- centralizing the repeated uploaded-voice resolution, case-insensitive lookup, and task_type inference into a single _normalize_speech_request is a worthwhile cleanup. The new SpeechRequestNormalized dataclass is clean and the test coverage for the normalizer itself is solid.

However, I have a few concerns that should be addressed before merging:

1. (Bug) Speaker name no longer lowercased for non-uploaded voices -- behavioral regression

Previously, both _validate_qwen_tts_request and _validate_voxtral_tts_request mutated request.voice = request.voice.lower() before _build_tts_params ran, so params["speaker"] always received the lowercased voice name.

After this PR, _apply_normalized_speech_request does not write back the lowercased voice to request.voice, and _build_tts_params now sends normalized.voice (original case) for non-uploaded voices:

speaker_value = (
    normalized.voice_lookup if normalized.uploaded_speaker_info is not None else normalized.voice
)
params["speaker"] = [speaker_value]

For a request with voice="Ryan", the old code sent speaker="ryan", the new code sends speaker="Ryan". If the model's speaker map uses lowercase keys, this is a silent regression. The fix is straightforward: use normalized.voice_lookup unconditionally (or at least for the speaker param), or have _apply_normalized_speech_request write back the lowercased voice.
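
A minimal sketch of the first option (using `voice_lookup` unconditionally) is below. The `Normalized` class is a stand-in for `SpeechRequestNormalized` and `build_speaker_param` is a hypothetical helper, not the real `_build_tts_params`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Normalized:
    """Stand-in for SpeechRequestNormalized with only the fields needed here."""

    voice: str
    voice_lookup: str
    uploaded_speaker_info: Optional[dict] = None


def build_speaker_param(normalized: Normalized) -> dict:
    # Send the canonical lowercase key whether or not the voice was uploaded,
    # which matches the old behavior of lowercasing request.voice up front.
    return {"speaker": [normalized.voice_lookup]}
```

With this shape, `voice="Ryan"` yields `speaker=["ryan"]` again, and the uploaded-voice branch needs no special case.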

2. (Performance / clarity) `_normalize_speech_request` is called up to 3 times per request

A single request through _prepare_speech_generation triggers:

- _prepare_speech_generation itself calls _normalize_speech_request + _apply_normalized_speech_request
- The validation method (e.g., _validate_qwen_tts_request) calls it again
- _build_tts_params calls it a third time

Each call re-instantiates the dataclass and potentially re-runs _get_uploaded_audio_data / _get_uploaded_speaker_embedding (though the second/third calls avoid the worst of it via the _auto_resolved_upload_* sentinel attributes). This is not a correctness issue today thanks to the sentinel guards, but it is fragile and wasteful:

- The _auto_resolved_upload_* sentinel is set via object.__setattr__ on a likely Pydantic/msgspec model, which is an unusual pattern that could break if the request model gains __slots__ or stricter validation.
- Reading audio files or safetensors on repeated calls is wasted I/O if the sentinel check ever regresses.
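
To make the `__slots__` concern concrete, here is a small stdlib-only illustration. `SlottedRequest` is a hypothetical stand-in for the request model (the real one is likely Pydantic or msgspec, which this does not reproduce), and the sentinel name `_auto_resolved_upload_example` is made up for the demo.

```python
class SlottedRequest:
    """Hypothetical request class that declares __slots__ and so has no __dict__."""

    __slots__ = ("voice",)

    def __init__(self, voice: str) -> None:
        self.voice = voice


class PlainRequest:
    """Ordinary class with a __dict__, where ad hoc attributes are allowed."""

    def __init__(self, voice: str) -> None:
        self.voice = voice


def try_set_sentinel(request: object) -> bool:
    # object.__setattr__ bypasses any custom __setattr__, but it cannot create
    # an attribute that has no slot and no instance __dict__ to live in.
    try:
        object.__setattr__(request, "_auto_resolved_upload_example", True)
    except AttributeError:
        return False
    return True
```

On the plain class the sentinel sticks; on the slotted one the write raises `AttributeError`, so the guard would silently stop working unless the failure is handled.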

Suggestion: Normalize once at the top of _prepare_speech_generation and thread the SpeechRequestNormalized object through validation and build methods instead of re-deriving it in each function. This would make the flow easier to reason about and eliminate the sentinel hack.
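
Roughly what I have in mind, as a sketch rather than a drop-in patch. All function names and signatures here are simplified stand-ins for the private methods in serving_speech.py, and `known_speakers` is an assumed input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Normalized:
    """Stand-in for SpeechRequestNormalized."""

    voice: Optional[str]
    voice_lookup: Optional[str]


def normalize(voice: Optional[str]) -> Normalized:
    return Normalized(voice, voice.lower() if voice else None)


def validate(normalized: Normalized, known_speakers: set) -> Optional[str]:
    # Validation reads the already-normalized object instead of re-deriving it.
    if normalized.voice_lookup not in known_speakers:
        return f"Unknown voice '{normalized.voice}'"
    return None


def build_params(normalized: Normalized) -> dict:
    return {"speaker": [normalized.voice_lookup]}


def prepare_speech_generation(voice: Optional[str], known_speakers: set) -> dict:
    normalized = normalize(voice)  # the only normalization call per request
    error = validate(normalized, known_speakers)
    if error is not None:
        raise ValueError(error)
    return build_params(normalized)
```

Because the normalized object is passed explicitly, nothing has to be stashed on the request and the sentinel attributes become unnecessary.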

3. (Nit) `_validate_fish_tts_request` error path changed subtly

Old code checked whether the uploaded audio file existed on disk via Path(...).exists() before attempting to load it, giving a specific "not found on disk" error. The new code relies on _normalize_speech_request calling _get_uploaded_audio_data, and if that returns None, the validation check is:

if normalized.uploaded_speaker_info is not None and request.ref_audio is None and normalized.ref_audio is None:
    return f"Could not load audio for uploaded voice '{request.voice}'"

This loses the distinction between "file not found" and "file found but failed to load." Not critical, but worth noting for debuggability.
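
If you want to keep both messages, a check along these lines would do it. This is a sketch: `check_uploaded_audio`, its parameters, and the exact "not found on disk" wording are assumptions, and `load` stands in for whatever `_get_uploaded_audio_data` does internally.

```python
from pathlib import Path
from typing import Callable, Optional


def check_uploaded_audio(
    voice: str,
    audio_path: Path,
    load: Callable[[Path], Optional[bytes]],
) -> Optional[str]:
    """Return an error message, or None when the audio is present and loadable."""
    # Case 1: the file is missing entirely.
    if not audio_path.exists():
        return f"Audio file for uploaded voice '{voice}' not found on disk"
    # Case 2: the file exists but the loader could not read or decode it.
    if load(audio_path) is None:
        return f"Could not load audio for uploaded voice '{voice}'"
    return None
```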

4. (Nit) Test for double-normalization idempotency is missing

Given that _normalize_speech_request is called multiple times per request, it would be valuable to have a test that explicitly asserts the second normalization (after _apply_normalized_speech_request has mutated the request) produces identical results to the first. This would catch regressions if the sentinel attribute approach breaks.
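
Something like the following shape would be enough. It is a toy model, not the real test: `normalize`, `apply_normalized`, the `UPLOADS` dict, and the `_auto_resolved` attribute are hypothetical stand-ins for `_normalize_speech_request`, `_apply_normalized_speech_request`, the upload store, and the `_auto_resolved_upload_*` sentinels.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional

UPLOADS = {"myvoice": "uploaded.wav"}  # hypothetical upload store


@dataclass(frozen=True)
class Normalized:
    voice_lookup: Optional[str]
    ref_audio: Optional[str]
    auto_resolved: bool


def normalize(request) -> Normalized:
    lookup = request.voice.lower() if request.voice else None
    # Honor a sentinel left by a previous apply step, if there is one.
    auto = getattr(request, "_auto_resolved", False)
    ref_audio = request.ref_audio
    if ref_audio is None and lookup in UPLOADS:
        ref_audio = UPLOADS[lookup]
        auto = True
    return Normalized(lookup, ref_audio, auto)


def apply_normalized(request, normalized: Normalized) -> None:
    # Mutates the request, which is what makes the second call interesting.
    request.ref_audio = normalized.ref_audio
    request._auto_resolved = normalized.auto_resolved


def test_normalization_is_idempotent() -> None:
    request = SimpleNamespace(voice="MyVoice", ref_audio=None)
    first = normalize(request)
    apply_normalized(request, first)
    second = normalize(request)
    assert second == first
```

If the sentinel write were dropped from `apply_normalized`, the second pass would report `auto_resolved=False` and the assertion would fail, which is exactly the regression this test is meant to catch.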

Summary

The core refactoring idea is sound, but issue #1 is a correctness regression that needs fixing before merge. Issue #2 is a design concern worth addressing now to avoid future bugs -- if not refactored now, at minimum add a comment explaining the multi-call pattern and sentinel attributes.

Replacing with inline comments

Reuse the normalized speech request through validation and TTS parameter building to avoid repeated uploaded-voice resolution. Preserve canonical lowercase speaker IDs, restore the specific Fish Speech missing-uploaded-audio error, and keep the latest VoxCPM TTS parameter path compatible with the normalized builder. Signed-off-by: reidliu41 <reid201711@gmail.com>

reidliu41 requested a review from hsliuustc0106 as a code owner April 14, 2026 01:22

tzhouam self-requested a review April 15, 2026 03:17

lishunyang12 previously requested changes Apr 16, 2026

reidliu41 added 2 commits April 17, 2026 10:09

Merge remote-tracking branch 'upstream/main' into feat/speech-request-normalization

dd3fb4a

Signed-off-by: reidliu41 <reid201711@gmail.com>

reidliu41 wants to merge 3 commits into vllm-project:main from reidliu41:feat/speech-request-normalization
