
[Qwen3-Omni] Support per-request voice_type selection #2071

Closed
princepride wants to merge 1 commit into vllm-project:main from princepride:qwen3-omni-support-tone-request

Conversation

@princepride
Collaborator

@princepride princepride commented Mar 22, 2026

Summary

  • Support per-request voice_type (speaker/timbre) selection for Qwen3-Omni via extra_body in chat completion requests.
  • Replace the instance-level voice_type hack (which locked the first request's voice for all subsequent requests) with a proper per-request flow through additional_information.
  • The voice_type now flows end-to-end: API layer (serving_chat.py) → stage input processor → talker model, with fallback to the model default when not specified (see the client-side sketch below).
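
For illustration, a minimal client-side sketch (not part of this PR's diff) of the per-request flow, assuming the standard openai Python SDK. The SDK's extra_body merges these keys into the top level of the request JSON, which is where serving_chat.py picks them up; port and model name follow the test plan below.

from openai import OpenAI

# Sketch only: select the voice per request via extra_body.
client = OpenAI(base_url="http://localhost:8091/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Count from one to five."}],
    extra_body={
        "modalities": ["audio"],
        "voice_type": "chelsie",  # omit to fall back to the model default
    },
)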

Changes

  • serving_chat.py: Extract voice_type / voice from extra_body and inject into additional_information.
  • qwen3_omni.py: Read voice_type from info_dict per request instead of caching it on the model instance. Remove two TODO hacks.
  • stage_input_processors/qwen3_omni.py: Add _extract_voice_type() helper and propagate voice_type through the thinker → talker stage transition for both online and offline paths (a minimal sketch of this flow follows below).
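
A minimal sketch of the two halves of that flow (not the actual diff), assuming dict-shaped payloads. Only voice_type, voice, additional_information, info_dict, and _extract_voice_type are names from this PR; everything else, including the default voice value, is a placeholder.

DEFAULT_VOICE_TYPE = "ethan"  # placeholder default, not necessarily the real one

def inject_voice_type(extra_body: dict, additional_information: dict) -> dict:
    # API layer (serving_chat.py side): copy voice_type / voice from the
    # request's extra_body into the per-request additional_information dict.
    voice = extra_body.get("voice_type") or extra_body.get("voice")
    if voice is not None:
        additional_information["voice_type"] = voice
    return additional_information

def _extract_voice_type(info_dict: dict | None) -> str:
    # Stage input processor side: read the per-request voice_type from
    # info_dict, falling back to the model default when it is not specified.
    if not info_dict:
        return DEFAULT_VOICE_TYPE
    return info_dict.get("voice_type", DEFAULT_VOICE_TYPE)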

Test Plan & Results

Server Launch Command

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --omni \
  --port 8091 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml

Test 1: Default voice_type (no extra_body voice_type)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_default.wav

Test 2: voice_type='chelsie' (female)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "voice_type": "chelsie",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_chelsie.wav

Test 3: voice_type='aiden' (male)

curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "modalities": ["audio"],
    "voice_type": "aiden",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}]},
      {"role": "user", "content": [{"type": "text", "text": "Count from one to five."}]}
    ],
    "sampling_params_list": [
      {"temperature": 0.4, "top_p": 0.9, "top_k": 1, "max_tokens": 2048, "repetition_penalty": 1.05, "stop_token_ids": [151645], "seed": 42},
      {"temperature": 0.9, "top_k": 50, "max_tokens": 4096, "seed": 42, "detokenize": false, "repetition_penalty": 1.05, "stop_token_ids": [2150]},
      {"temperature": 0.0, "top_p": 1.0, "top_k": -1, "max_tokens": 65536, "seed": 42, "detokenize": true, "repetition_penalty": 1.1}
    ]
  }'

Result: audio_aiden.wav
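
The Result files above were written from the JSON responses. A hedged sketch for saving them, assuming an OpenAI-style audio chat completion where the generated audio arrives base64-encoded under choices[0].message.audio.data (the exact response field may differ in vllm_omni):

import base64
import json
import sys

# Decode the base64 audio from a saved chat-completion response and write it
# to a .wav file (usage: python save_audio.py response.json audio_default.wav).
with open(sys.argv[1]) as f:
    resp = json.load(f)

audio_b64 = resp["choices"][0]["message"]["audio"]["data"]
with open(sys.argv[2], "wb") as f:
    f.write(base64.b64decode(audio_b64))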

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Collaborator

@gcanlin gcanlin left a comment


Looks similar to #1963.

@princepride
Collaborator Author

Looks similar to #1963.

👌, I will close it. I've "vibe coded" a demo to showcase our streaming input and output capabilities: https://github.com/princepride/qwen-omni-voice-assistant. I think this feature is very interesting. Aside from the fact that requests cannot yet freely pass in custom voice timbres, I believe we should prioritize supporting RL specifically for Qwen3-Omni. I also have a question: I'm not sure where to obtain ready-made audio data for training; crawling it from the web is too much of a hassle. Do we have a WeChat group dedicated to the Qwen-Omni model?

@gcanlin
Collaborator

gcanlin commented Mar 22, 2026

Do we have a WeChat group dedicated to the Qwen-omni model?

I've invited you to the TTS discussion group, where it should be fine to raise the question :)

@amy-why-3459
Contributor

Do we have a WeChat group dedicated to the Qwen-Omni model?

We have a WeChat group. If you are interested in the Qwen-omni model, I would like to invite you to join the group.
