[Qwen3-Omni] Support per-request voice_type selection #2071
princepride wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: princepride <wangzhipeng628@gmail.com>
👌, I will close it, I've "vibe coded" a demo to showcase our streaming input and output capabilities: https://github.com/princepride/qwen-omni-voice-assistant. I think this feature is very interesting. Aside from the fact that requests cannot yet freely pass in custom voice timbres, I believe we should prioritize supporting RL specifically for Qwen3-omni. I also have a question: I'm not sure where to obtain ready-made audio data for training; crawling it from the web is just too much of a hassle. Do we have a WeChat group dedicated to the Qwen-omni model? |
I've invited you into the TTS discussion group, where it should be okay to raise the question :)
We have a WeChat group. If you are interested in the Qwen-omni model, I would like to invite you to join it.
Summary
- Support per-request `voice_type` (speaker/timbre) selection for Qwen3-Omni via `extra_body` in chat completion requests.
- Replace the previous `voice_type` hack (which locked the first request's voice for all subsequent requests) with a proper per-request flow through `additional_information`.
- The voice selection flows from the API layer (`serving_chat.py`) → stage input processor → talker model, with fallback to the model default when not specified.

Changes
- `serving_chat.py`: Extract `voice_type`/`voice` from `extra_body` and inject it into `additional_information`.
- `qwen3_omni.py`: Read `voice_type` from `info_dict` per request instead of caching it on the model instance. Remove two `TODO` hacks.
- `stage_input_processors/qwen3_omni.py`: Add an `_extract_voice_type()` helper and propagate `voice_type` through the thinker → talker stage transition for both online and offline paths.

Test Plan & Results
Server Launch Command
Test 1: Default voice_type (no voice_type in extra_body)
Result: audio_default.wav
Test 2: voice_type='chelsie' (female)
Result: audio_chelsie.wav
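A request like Test 2 can be sketched as below. This is a hypothetical client-side helper, not code from this PR: the model name is a placeholder, and per the PR description `voice_type` is read from the request's extra fields (with the OpenAI Python client, such fields would be passed via its `extra_body` argument).

```python
def build_chat_request(messages, voice_type=None):
    """Build a chat-completion request body; voice_type rides alongside
    the standard fields as an extra (vLLM-specific) field."""
    body = {
        "model": "Qwen/Qwen3-Omni",  # placeholder model name
        "messages": messages,
        "modalities": ["text", "audio"],
    }
    if voice_type is not None:
        # Per this PR, serving_chat.py extracts voice_type from extra_body
        # and injects it into additional_information for the talker stage.
        body["voice_type"] = voice_type
    return body

# Equivalent of Test 2: request the 'chelsie' voice for this request only.
req = build_chat_request(
    [{"role": "user", "content": "Say hello."}],
    voice_type="chelsie",
)
```

Omitting `voice_type` reproduces Test 1: the server falls back to the model default.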
Test 3: voice_type='aiden' (male)
Result: audio_aiden.wav