
[Qwen3-Omni] Support per-request voice_type selection #2071

Closed
princepride wants to merge 1 commit into vllm-project:main from princepride:qwen3-omni-support-tone-request

Conversation

@princepride
Collaborator

@princepride princepride commented Mar 22, 2026

Summary

  • Support per-request voice_type (speaker/timbre) selection for Qwen3-Omni via extra_body in chat completion requests.
  • Replace the instance-level voice_type hack (which locked the first request's voice for all subsequent requests) with a proper per-request flow through additional_information.
  • The voice_type now flows end-to-end: API layer (serving_chat.py) → stage input processor → talker model, with fallback to the model default when not specified (see the client-side sketch below).
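
For illustration, a minimal client-side sketch (not part of this PR's diff) of the per-request flow, assuming the standard openai Python SDK. The SDK's extra_body merges these keys into the top level of the request JSON, which is where serving_chat.py picks them up; port and model name follow the test plan below.

from openai import OpenAI

# Sketch only: select the voice per request via extra_body.
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Count from one to five."}],
    extra_body={
        "modalities": ["audio"],
        "voice_type": "chelsie",  # omit to fall back to the model default
    },
)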

Changes

  • serving_chat.py: Extract voice_type / voice from extra_body and inject into additional_information.
  • qwen3_omni.py: Read voice_type from info_dict per request instead of caching it on the model instance. Remove two TODO hacks.
  • stage_input_processors/qwen3_omni.py: Add _extract_voice_type() helper and propagate voice_type through the thinker → talker stage transition for both online and offline paths (a minimal sketch of this flow follows below).
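
A minimal sketch of the two halves of that flow (not the actual diff), assuming dict-shaped payloads. Only voice_type, voice, additional_information, info_dict, and _extract_voice_type are names from this PR; everything else, including the default voice value, is a placeholder.

DEFAULT_VOICE_TYPE = "ethan"  # placeholder default, not necessarily the real one

def inject_voice_type(extra_body: dict, additional_information: dict) -> dict:
    # API layer (serving_chat.py side): copy voice_type / voice from the
    # request's extra_body into the per-request additional_information dict.
    voice = extra_body.get("voice_type") or extra_body.get("voice")
    if voice is not None:
        additional_information["voice_type"] = voice
    return additional_information

def _extract_voice_type(info_dict: dict | None) -> str:
    # Stage input processor side: read the per-request voice_type from
    # info_dict, falling back to the model default when it is not specified.
    if not info_dict:
        return DEFAULT_VOICE_TYPE
    return info_dict.get("voice_type", DEFAULT_VOICE_TYPE)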

Test Plan & Results

Server Launch Command

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --omni \
  --port 8091 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml

Test 1: Default voice_type (no extra_body voice_type)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_default.wav

Test 2: voice_type='chelsie' (female)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "voice_type": "chelsie",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_chelsie.wav

Test 3: voice_type='aiden' (male)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "voice_type": "aiden",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_aiden.wav
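
The Result files above were written from the JSON responses. A hedged sketch for saving them, assuming an OpenAI-style audio chat completion where the generated audio arrives base64-encoded under choices[0].message.audio.data (the exact response field may differ in vllm_omni):

import base64
import json
import sys

# Decode the base64 audio from a saved chat-completion response and write it
# to a .wav file (usage: python save_audio.py response.json audio_default.wav).
with open(sys.argv[1]) as f:
    resp = json.load(f)

audio_b64 = resp["choices"][0]["message"]["audio"]["data"]
with open(sys.argv[2], "wb") as f:
    f.write(base64.b64decode(audio_b64))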

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Collaborator

@gcanlin gcanlin left a comment


Looks similar to #1963.

@princepride
Collaborator Author

Looks similar to #1963.

👌, I will close it. I've "vibe coded" a demo to showcase our streaming input and output capabilities: https://github.com/princepride/qwen-omni-voice-assistant. I think this feature is very interesting. Aside from the fact that requests cannot yet freely pass in custom voice timbres, I believe we should prioritize supporting RL specifically for Qwen3-Omni. I also have a question: I'm not sure where to obtain ready-made audio data for training; crawling it from the web is too much of a hassle. Do we have a WeChat group dedicated to the Qwen-Omni model?

@gcanlin
Collaborator

gcanlin commented Mar 22, 2026

Do we have a WeChat group dedicated to the Qwen-omni model?

I've invited you to the TTS discussion group, where it should be fine to raise the question :)

@amy-why-3459
Contributor

Do we have a WeChat group dedicated to the Qwen-Omni model?

We have a WeChat group. If you are interested in the Qwen-omni model, I would like to invite you to join the group.
