[Refactor] Normalize speech request handling in serving_speech#2757
[Refactor] Normalize speech request handling in serving_speech#2757reidliu41 wants to merge 3 commits intovllm-project:mainfrom
Conversation
- add a shared SpeechRequestNormalized helper for speech request canonicalization - centralize uploaded voice resolution for audio and embedding-backed voices - preserve explicit-vs-auto-resolved clone input semantics across validation and generation - route TTS validation, param building, and prepare_speech_generation through the normalized path - add regression tests for uploaded voice normalization and prepare_speech_generation Signed-off-by: reidliu41 <reid201711@gmail.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
BLOCKER scan:
OVERALL: NO BLOCKERS VERDICT: COMMENT Solid refactoring. The SpeechRequestNormalized dataclass provides a clean single source of truth for all normalized fields. The separation of normalization from application is good design. The test coverage is excellent - the 4 new tests cover all the important normalization cases. |
lishunyang12
left a comment
There was a problem hiding this comment.
Review: [Refactor] Normalize speech request handling in serving_speech
Good direction -- centralizing the repeated uploaded-voice resolution, case-insensitive lookup, and task_type inference into a single _normalize_speech_request is a worthwhile cleanup. The new SpeechRequestNormalized dataclass is clean and the test coverage for the normalizer itself is solid.
However, I have a few concerns that should be addressed before merging:
1. (Bug) Speaker name no longer lowercased for non-uploaded voices -- behavioral regression
Previously, both _validate_qwen_tts_request and _validate_voxtral_tts_request mutated request.voice = request.voice.lower() before _build_tts_params ran, so params["speaker"] always received the lowercased voice name.
After this PR, _apply_normalized_speech_request does not write back the lowercased voice to request.voice, and _build_tts_params now sends normalized.voice (original case) for non-uploaded voices:
speaker_value = (
normalized.voice_lookup if normalized.uploaded_speaker_info is not None else normalized.voice
)
params["speaker"] = [speaker_value]For a request with voice="Ryan", the old code sent speaker="ryan", the new code sends speaker="Ryan". If the model's speaker map uses lowercase keys, this is a silent regression. The fix is straightforward: use normalized.voice_lookup unconditionally (or at least for the speaker param), or have _apply_normalized_speech_request write back the lowercased voice.
2. (Performance / clarity) _normalize_speech_request is called up to 3 times per request
A single request through _prepare_speech_generation triggers:
_prepare_speech_generationitself calls_normalize_speech_request+_apply_normalized_speech_request- The validation method (e.g.,
_validate_qwen_tts_request) calls it again _build_tts_paramscalls it a third time
Each call re-instantiates the dataclass and potentially re-runs _get_uploaded_audio_data / _get_uploaded_speaker_embedding (though the second/third calls avoid the worst of it via the _auto_resolved_upload_* sentinel attributes). This is not a correctness issue today thanks to the sentinel guards, but it is fragile and wasteful:
- The
_auto_resolved_upload_*sentinel is set viaobject.__setattr__on a likely Pydantic/msgspec model, which is an unusual pattern that could break if the request model gains__slots__or stricter validation. - Reading audio files or safetensors on repeated calls is wasted I/O if the sentinel check ever regresses.
Suggestion: Normalize once at the top of _prepare_speech_generation and thread the SpeechRequestNormalized object through validation and build methods instead of re-deriving it in each function. This would make the flow easier to reason about and eliminate the sentinel hack.
3. (Nit) _validate_fish_tts_request error path changed subtly
Old code checked whether the uploaded audio file existed on disk via Path(...).exists() before attempting to load it, giving a specific "not found on disk" error. The new code relies on _normalize_speech_request calling _get_uploaded_audio_data, and if that returns None, the validation check is:
if normalized.uploaded_speaker_info is not None and request.ref_audio is None and normalized.ref_audio is None:
return f"Could not load audio for uploaded voice '{request.voice}'"This loses the distinction between "file not found" and "file found but failed to load." Not critical, but worth noting for debuggability.
4. (Nit) Test for double-normalization idempotency is missing
Given that _normalize_speech_request is called multiple times per request, it would be valuable to have a test that explicitly asserts the second normalization (after _apply_normalized_speech_request has mutated the request) produces identical results to the first. This would catch regressions if the sentinel attribute approach breaks.
Summary
The core refactoring idea is sound, but issue #1 is a correctness regression that needs fixing before merge. Issue #2 is a design concern worth addressing now to avoid future bugs -- if not refactored now, at minimum add a comment explaining the multi-call pattern and sentinel attributes.
Reuse the normalized speech request through validation and TTS parameter building to avoid repeated uploaded-voice resolution. Preserve canonical lowercase speaker IDs, restore the specific Fish Speech missing-uploaded-audio error, and keep the latest VoxCPM TTS parameter path compatible with the normalized builder. Signed-off-by: reidliu41 <reid201711@gmail.com>
…-normalization Signed-off-by: reidliu41 <reid201711@gmail.com>
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Refactor
vllm_omni/entrypoints/openai/serving_speech.pyto normalize shared speech requesthandling before validation and model-specific generation setup.
This change introduces a small normalization layer for:
The goal is to reduce repeated request-shaping logic across validation and generation paths
without changing the external API contract.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)