
[Feature] support to change the speaker of qwen3-omni#1963

Merged
hsliuustc0106 merged 7 commits into vllm-project:main from R2-Y:config_speaker_id
Mar 24, 2026

Conversation

@R2-Y (Contributor) commented Mar 18, 2026


Purpose

Add voice & language configuration so users can customize the speaker and language of qwen3-omni, qwen2.5-omni, and qwen3-tts.
Fixes #1845

Test Plan

Switch between different voice types and check that the generated output.wav uses the correct voice.

Test Result

(screenshots omitted)
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@R2-Y R2-Y requested a review from hsliuustc0106 as a code owner March 18, 2026 02:40

@R2-Y R2-Y force-pushed the config_speaker_id branch 2 times, most recently from 42b0a88 to b4e36bc on March 18, 2026 02:46
@chatgpt-codex-connector (Bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae1c610ee2


p = prompt[index] if isinstance(prompt, list) and index < len(prompt) else prompt
if p is None:
return None
print(f"_extract_voice_type_from_prompt p: {p}")
P2: Remove raw prompt printing from voice extraction path

_extract_voice_type_from_prompt currently prints the full prompt object for every thinker output, and this code path runs during normal request forwarding. In production, that can leak user prompt contents into stdout/log sinks and also hurt throughput because synchronous console I/O is on the hot path; please remove this print (or replace it with gated, redacted debug logging).
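A minimal sketch of what "gated, redacted debug logging" could look like here. The helper name is hypothetical; the point is that formatting only happens when DEBUG is enabled, and only a length and digest of the prompt reach the log, never its contents.

```python
import hashlib
import logging

logger = logging.getLogger("voice_debug_example")

def log_prompt_debug(p):
    # Gated, redacted alternative to print(): only pay the formatting cost
    # when DEBUG is enabled, and log a length + short digest instead of the
    # raw prompt contents.
    if logger.isEnabledFor(logging.DEBUG):
        text = str(p)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        logger.debug(
            "_extract_voice_type_from_prompt: len=%d sha256=%s", len(text), digest
        )
```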


@R2-Y R2-Y force-pushed the config_speaker_id branch from b4e36bc to 1b5f428 on March 18, 2026 02:49
Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py Outdated
@amy-why-3459 (Contributor) commented:

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

else:
logger.debug("No additional_information provided to code2wav stage.")
audio_tensors = self.generate_audio(codes, voice_type, left_context_size, seq_token_counts)
audio_tensors = self.generate_audio(codes, self.voice_type, left_context_size, seq_token_counts)
A contributor commented:

In the multi-concurrency scenario, is there any problem if different users use different voice_type values when self.voice_type is used?

R2-Y (Author) replied:

May lead to some problems; will look into it.

logger = init_logger(__name__)


def _voice_type_from_runtime_info(
A collaborator commented:

Are _voice_type_from_runtime_info, _extract_voice_type_from_request and _extract_voice_type_from_prompt general for all models? The voice_type field seems to be defined by our framework rather than being model-specific. If so, could we make them utility functions?

R2-Y (Author) replied:

Yeah, they're general for all models; will make them utility functions. Good idea, thanks.
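One way the shared utility could look. This is a sketch under assumptions: the key name `voice_type` comes from the discussion above, but the real container shapes vllm-omni passes between stages may differ.

```python
def extract_voice_type(additional_information, default=None):
    """Model-agnostic lookup of the framework-level `voice_type` field.

    Hypothetical utility: accepts either a single dict or a per-request
    list of dicts, and falls back to `default` when nothing is set.
    """
    info = additional_information
    if isinstance(info, list):  # per-request list of dicts
        info = info[0] if info else None
    if not isinstance(info, dict):
        return default
    return info.get("voice_type", default)
```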

print(response.choices[1].message.audio)

Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
A collaborator commented:

I checked the Qwen3-Omni config and found one more speaker_id. I think we can add it to the docs so users don't have to look up the config manually.

Suggested change
Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
Supported voice names depend on the model (e.g. `ethan`, `chelsie`, `aiden`). Omit `voice` to use the default.

R2-Y (Author) replied:

sure

R2-Y (Author) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

@yenuo26 (Collaborator) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

Does this mean that after this PR is merged, this test case(#1844) will need to be re-adapted, and the same scenario for Qwen2.5 will be added?

@linyueqian (Collaborator) commented:

Should the voice speaker names use an uppercase initial letter? Currently they're lowercase (e.g. ethan, chelsie), but since they're proper names, capitalizing them (Ethan, Chelsie) might be more appropriate. (Although I know Qwen3-TTS is all lowercase right now, I think capitalizing is better. Any thoughts?)

R2-Y (Author) commented Mar 18, 2026

Should the voice speaker names use an uppercase initial letter? Currently they're lowercase (e.g. ethan, chelsie), but since they're proper names, capitalizing them (Ethan, Chelsie) might be more appropriate. (Although I know Qwen3-TTS is all lowercase right now, I think capitalizing is better. Any thoughts?)

I can let users enter names with an uppercase initial letter (or however they like), but I still need to convert them to lowercase in the model so that the model can tokenize them correctly.
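The normalization described above can be sketched in a few lines. The default value here is an assumption for illustration, not necessarily the model's real default speaker.

```python
def normalize_speaker(name, default="ethan"):
    # Accept "Ethan", "ETHAN", " Chelsie " etc. from users, but hand the
    # model a lowercase name so it tokenizes correctly. `default` is an
    # assumed fallback used when the user omits the field.
    if not name:
        return default
    return name.strip().lower()
```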

R2-Y (Author) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

Does this mean that after this PR is merged, this test case(#1844) will need to be re-adapted, and the same scenario for Qwen2.5 will be added?

We'll need to configure two more parameters in the test.

@hsliuustc0106 (Collaborator) left a comment:

PR Review: [Feature] support to change the speaker of qwen3-omni

Gate Status: PASSING ✓

All CI checks pass. Proceeding with full review.


Routing

  • Primary: Audio/TTS (vllm-omni-audio-tts) - voice/speaker selection
  • Secondary: API (vllm-omni-api) - new voice parameter

Critical Findings

1. State Mutation on Model Instance [HIGH]

In qwen3_omni.py:forward():

self.voice_type = effective_voice_type

Setting voice_type on self (the model instance) could cause race conditions in concurrent requests. If two requests with different voices are processed simultaneously, they could overwrite each other's voice setting.

Recommendation: Pass voice_type through the request pipeline (via additional_information) rather than storing on the model instance.
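A minimal sketch of the recommended pattern, with stand-in names (`render_audio`, `handle_request` are hypothetical): the voice travels with the request payload, so concurrent requests with different voices stay independent and no shared attribute on the model instance is mutated.

```python
import concurrent.futures

def render_audio(codes, voice_type):
    # Stand-in for generate_audio: just echoes which voice was used.
    return f"wav({voice_type})"

def handle_request(request):
    # voice_type is read from the request itself, never from instance state,
    # so two simultaneous requests cannot clobber each other's setting.
    voice = request.get("voice_type", "ethan")
    return render_audio(request["codes"], voice)

# Concurrent requests with different voices remain independent:
requests = [{"codes": i, "voice_type": v}
            for i, v in enumerate(["ethan", "chelsie"] * 4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(handle_request, requests))
```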

2. Debug Statements Left in Code

Multiple debug statements should be removed or converted to appropriate log levels:

# qwen3_omni.py (stage_input_processors)
print(f"_extract_voice_type_from_prompt p: {p}")  # Remove or use logger

# serving_chat.py
logger.info(f"preprocess_chat voice: {voice}")  # Use logger.debug()

# qwen3_omni.py (model)
logger.info(f"voice_type: {voice_type}")  # Use logger.debug()

3. Missing Input Validation

The voice parameter is not validated against supported voice types (e.g., ethan, chelsie). Invalid voice names will be silently passed to the model, which may cause unexpected behavior or errors deep in the TTS pipeline.

Recommendation: Add validation at the API layer with a clear error message for unsupported voices.
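The API-layer validation could look like the following sketch. The voice set here is an assumption for illustration; the real list is model-dependent.

```python
SUPPORTED_VOICES = {"ethan", "chelsie", "aiden"}  # assumed; real set is per-model

def validate_voice(voice):
    # Reject unsupported names at the API layer with a clear error,
    # instead of letting them fail deep inside the TTS pipeline.
    if voice is None:
        return None  # omitted -> use the model default
    v = voice.strip().lower()
    if v not in SUPPORTED_VOICES:
        raise ValueError(
            f"Unsupported voice {voice!r}. Supported voices: {sorted(SUPPORTED_VOICES)}"
        )
    return v
```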


Missing Tests

This PR needs automated tests:

| Test Type | Description |
| --- | --- |
| Regression test | Verify voice selection works (issue #1845 test was failing) |
| Edge cases | Empty string, None, invalid voice name |
| Concurrent requests | Multiple requests with different voices should not interfere |

The linked issue references tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_voice_001. This PR should add/update that test to verify the fix works.


Summary

| Validated | Lacks Evidence | Must Change |
| --- | --- | --- |
| Voice parameter flows through API | List of supported voices per model | Fix state mutation issue |
| Documentation updated | Test commands/scripts | Remove debug statements |
| Fixes linked issue #1845 | Validation for invalid voices | Add automated tests |

🤖 Generated with Claude Code

- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type is `use_image`, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type is `use_audio`, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
- `--voice`: TTS speaker/voice for audio output when requesting audio (e.g. `ethan`, `chelsie`). Omit to use the model default. Example: `--voice "chelsie"`
A contributor commented:

Is the voice mismatched with the speaker?

R2-Y (Author) replied:

yeah fixed

@amy-why-3459 (Contributor) commented:

@princepride @gcanlin PTAL

@princepride (Collaborator) left a comment:

Review Summary

Thanks for the work on speaker/language configurability! I found several issues that need to be addressed before merging.

1. 🔴 Runtime Bug: args.voice references non-existent attribute

In examples/online_serving/qwen3_tts/openai_speech_client.py, the argparse argument was renamed from --voice to --speaker, which means args.voice no longer exists (it's now args.speaker). However, the code still references args.voice in two places:

"speaker": args.voice,          # ❌ AttributeError at runtime
print(f"Speaker: {args.voice}") # ❌ AttributeError at runtime

Both should be args.speaker.

2. 🔴 Breaking OpenAI API Compatibility

Renaming `voice` to `speaker` in OpenAICreateSpeechRequest and StreamingSpeechSessionConfig breaks OpenAI API compatibility. The OpenAI Speech API uses voice as the standard field name. All existing clients using the OpenAI SDK:

client.audio.speech.create(voice="alloy", ...)

will silently fail or error because the SDK sends voice, not speaker.

Suggestion: Keep voice at the API protocol layer for backward compatibility. You can add speaker as an internal alias or handle the mapping internally. For the chat completions endpoint, using extra_body={"speaker": ...} is fine since it's a vllm-omni extension field.
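The suggested mapping could be sketched as below. The helper name and the precedence rule (extension field `speaker` wins over protocol field `voice` when both appear) are assumptions, not the merged behavior.

```python
def resolve_speaker(payload):
    # Keep the OpenAI-standard `voice` working at the protocol boundary
    # while preferring the vllm-omni extension field `speaker` when both
    # are present, then normalize for the tokenizer.
    value = payload.get("speaker") or payload.get("voice")
    return value.lower() if value else None
```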

3. 🟡 Documentation Inconsistencies

  • examples/online_serving/qwen3_omni/README.md line mentions --voice but the actual CLI arg is now --speaker.
  • docs/user_guide/examples/online_serving/qwen3_tts.md line 298: says "Requires a valid voice" but the field was renamed to speaker.

4. 🟡 Noisy Production Logging

In vllm_omni/entrypoints/openai/serving_chat.py, logger.info is used for per-request debug output:

logger.info(f"preprocess_chat speaker: {speaker}")
logger.info(f"preprocess_chat language: {language}")

These fire on every single request. Should be logger.debug.

5. 🟡 _build_prompt_for_query_type Loses Arguments for Mixed Modalities

The fallback branch for use_mixed_modalities / use_multi_audios:

# use_mixed_modalities / use_multi_audios
return query_func(custom_prompt=custom_prompt)

drops video_path, image_path, audio_path that the original mixed_modalities handler passed. This is a regression from the deleted qwen2.5-omni client.

6. 🟡 No Backward Compatibility for Existing Clients

The WebSocket streaming session.config also changed from voice to speaker. Existing WebSocket clients sending {"type": "session.config", "voice": "Vivian"} will have their voice setting silently ignored. Consider adding a fallback that reads voice if speaker is not present.
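The fallback suggested in point 6 could look like this sketch; the message shape is assumed from the example above, and the function name is hypothetical.

```python
def session_speaker(message, current=None):
    # Backward-compatible session.config handling: read the legacy `voice`
    # key when `speaker` is absent, so older WebSocket clients keep working.
    if message.get("type") != "session.config":
        return current
    return message.get("speaker") or message.get("voice") or current
```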

7. 🟢 Minor: Redundant .lower() Call

In qwen3_tts_talker.py, spk_id_map[speaker.lower()] is called but speaker was already normalized to lowercase a few lines earlier. Harmless but redundant.


Overall: The core feature (per-request speaker selection) is great and fills a real gap. The main blockers are the args.voice runtime bug (#1) and the OpenAI API compatibility break (#2). I'd suggest keeping voice at the API boundary and mapping to speaker internally.

@R2-Y R2-Y force-pushed the config_speaker_id branch from 58dcd1d to f3b0456 on March 23, 2026 15:56
R2-Y added 3 commits March 24, 2026 03:06
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
@R2-Y R2-Y force-pushed the config_speaker_id branch 3 times, most recently from 6249969 to cbd5547 on March 24, 2026 03:17
@R2-Y R2-Y force-pushed the config_speaker_id branch 3 times, most recently from 4f169b7 to 6cff43f on March 24, 2026 04:09
R2-Y (Author) commented Mar 24, 2026

(quoting @princepride's review summary above)

fixed

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Mar 24, 2026
@R2-Y R2-Y force-pushed the config_speaker_id branch from ba58027 to 8a9f0ae on March 24, 2026 05:29
@hsliuustc0106 hsliuustc0106 merged commit 3cdf288 into vllm-project:main Mar 24, 2026
7 of 8 checks passed
@R2-Y R2-Y deleted the config_speaker_id branch March 24, 2026 05:56
zhangj1an pushed commits to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Qwen3-omni, when I specify the voice as female in the system prompt, the output audio is still male.

8 participants