
[Feature] support to change the speaker of qwen3-omni#1963

Merged
hsliuustc0106 merged 7 commits into vllm-project:main from R2-Y:config_speaker_id
Mar 24, 2026

Conversation

@R2-Y (Contributor) commented Mar 18, 2026


Purpose

Add voice & language configuration so users can customize the speaker and language of qwen3-omni, qwen2.5-omni, and qwen3-tts.
Fixes #1845

Test Plan

Switch between different voice types and check that the generated output.wav uses the correct voice.

Test Result

(screenshots omitted)
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@R2-Y R2-Y requested a review from hsliuustc0106 as a code owner March 18, 2026 02:40

@R2-Y R2-Y force-pushed the config_speaker_id branch 2 times, most recently from 42b0a88 to b4e36bc on March 18, 2026 02:46
@chatgpt-codex-connector (Bot) left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae1c610ee2


p = prompt[index] if isinstance(prompt, list) and index < len(prompt) else prompt
if p is None:
return None
print(f"_extract_voice_type_from_prompt p: {p}")
P2: Remove raw prompt printing from voice extraction path

_extract_voice_type_from_prompt currently prints the full prompt object for every thinker output, and this code path runs during normal request forwarding. In production, that can leak user prompt contents into stdout/log sinks and also hurt throughput because synchronous console I/O is on the hot path; please remove this print (or replace it with gated, redacted debug logging).
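A minimal sketch of what "gated, redacted debug logging" could look like here. The helper name is hypothetical; the point is that formatting only happens when DEBUG is enabled, and only a length and digest of the prompt reach the log, never its contents.

```python
import hashlib
import logging

logger = logging.getLogger("voice_debug_example")

def log_prompt_debug(p):
    # Gated, redacted alternative to print(): only pay the formatting cost
    # when DEBUG is enabled, and log a length + short digest instead of the
    # raw prompt contents.
    if logger.isEnabledFor(logging.DEBUG):
        text = str(p)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        logger.debug(
            "_extract_voice_type_from_prompt: len=%d sha256=%s", len(text), digest
        )
```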


@R2-Y R2-Y force-pushed the config_speaker_id branch from b4e36bc to 1b5f428 on March 18, 2026 02:49
Comment thread vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py Outdated
@amy-why-3459 (Contributor) commented:

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

else:
logger.debug("No additional_information provided to code2wav stage.")
audio_tensors = self.generate_audio(codes, voice_type, left_context_size, seq_token_counts)
audio_tensors = self.generate_audio(codes, self.voice_type, left_context_size, seq_token_counts)
A contributor commented:

In the multi-concurrency scenario, is there any problem if different users use different voice_type values when self.voice_type is used?

R2-Y (Author) replied:

May lead to some problems; will look into it.

logger = init_logger(__name__)


def _voice_type_from_runtime_info(
A collaborator commented:

Are _voice_type_from_runtime_info, _extract_voice_type_from_request and _extract_voice_type_from_prompt general for all models? The voice_type field seems to be defined by our framework rather than being model-specific. If so, could we make them utility functions?

R2-Y (Author) replied:

Yeah, they're general for all models; will make them utility functions. Good idea, thanks.
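One way the shared utility could look. This is a sketch under assumptions: the key name `voice_type` comes from the discussion above, but the real container shapes vllm-omni passes between stages may differ.

```python
def extract_voice_type(additional_information, default=None):
    """Model-agnostic lookup of the framework-level `voice_type` field.

    Hypothetical utility: accepts either a single dict or a per-request
    list of dicts, and falls back to `default` when nothing is set.
    """
    info = additional_information
    if isinstance(info, list):  # per-request list of dicts
        info = info[0] if info else None
    if not isinstance(info, dict):
        return default
    return info.get("voice_type", default)
```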

print(response.choices[1].message.audio)

Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
A collaborator commented:

I checked the Qwen3-Omni config and found one more speaker_id. I think we can add it to the docs so users don't have to look up the config manually.

Suggested change
Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
Supported voice names depend on the model (e.g. `ethan`, `chelsie`, `aiden`). Omit `voice` to use the default.

R2-Y (Author) replied:

sure

R2-Y (Author) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

@yenuo26 (Collaborator) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

Does this mean that after this PR is merged, this test case(#1844) will need to be re-adapted, and the same scenario for Qwen2.5 will be added?

@linyueqian (Collaborator) commented:

Should the voice speaker names use an uppercase initial letter? Currently they're lowercase (e.g. ethan, chelsie), but since they're proper names, capitalizing them (Ethan, Chelsie) might be more appropriate. (Although I know Qwen3-TTS is all lowercase right now, I think capitalizing is better. Any thoughts?)

R2-Y (Author) commented Mar 18, 2026

Should the voice speaker names use an uppercase initial letter? Currently they're lowercase (e.g. ethan, chelsie), but since they're proper names, capitalizing them (Ethan, Chelsie) might be more appropriate. (Although I know Qwen3-TTS is all lowercase right now, I think capitalizing is better. Any thoughts?)

I can let users enter names with an uppercase initial letter (or however they like), but I still need to convert them to lowercase in the model so that the model can tokenize them correctly.
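The normalization described above can be sketched in a few lines. The default value here is an assumption for illustration, not necessarily the model's real default speaker.

```python
def normalize_speaker(name, default="ethan"):
    # Accept "Ethan", "ETHAN", " Chelsie " etc. from users, but hand the
    # model a lowercase name so it tokenizes correctly. `default` is an
    # assumed fallback used when the user omits the field.
    if not name:
        return default
    return name.strip().lower()
```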

R2-Y (Author) commented Mar 18, 2026

Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt language_id accordingly?

can do

Does this mean that after this PR is merged, this test case(#1844) will need to be re-adapted, and the same scenario for Qwen2.5 will be added?

We'll need to configure two more parameters in the test.

@hsliuustc0106 (Collaborator) left a comment:

PR Review: [Feature] support to change the speaker of qwen3-omni

Gate Status: PASSING ✓

All CI checks pass. Proceeding with full review.


Routing

  • Primary: Audio/TTS (vllm-omni-audio-tts) - voice/speaker selection
  • Secondary: API (vllm-omni-api) - new voice parameter

Critical Findings

1. State Mutation on Model Instance [HIGH]

In qwen3_omni.py:forward():

self.voice_type = effective_voice_type

Setting voice_type on self (the model instance) could cause race conditions in concurrent requests. If two requests with different voices are processed simultaneously, they could overwrite each other's voice setting.

Recommendation: Pass voice_type through the request pipeline (via additional_information) rather than storing on the model instance.
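A minimal sketch of the recommended pattern, with stand-in names (`render_audio`, `handle_request` are hypothetical): the voice travels with the request payload, so concurrent requests with different voices stay independent and no shared attribute on the model instance is mutated.

```python
import concurrent.futures

def render_audio(codes, voice_type):
    # Stand-in for generate_audio: just echoes which voice was used.
    return f"wav({voice_type})"

def handle_request(request):
    # voice_type is read from the request itself, never from instance state,
    # so two simultaneous requests cannot clobber each other's setting.
    voice = request.get("voice_type", "ethan")
    return render_audio(request["codes"], voice)

# Concurrent requests with different voices remain independent:
requests = [{"codes": i, "voice_type": v}
            for i, v in enumerate(["ethan", "chelsie"] * 4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(handle_request, requests))
```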

2. Debug Statements Left in Code

Multiple debug statements should be removed or converted to appropriate log levels:

# qwen3_omni.py (stage_input_processors)
print(f"_extract_voice_type_from_prompt p: {p}")  # Remove or use logger

# serving_chat.py
logger.info(f"preprocess_chat voice: {voice}")  # Use logger.debug()

# qwen3_omni.py (model)
logger.info(f"voice_type: {voice_type}")  # Use logger.debug()

3. Missing Input Validation

The voice parameter is not validated against supported voice types (e.g., ethan, chelsie). Invalid voice names will be silently passed to the model, which may cause unexpected behavior or errors deep in the TTS pipeline.

Recommendation: Add validation at the API layer with a clear error message for unsupported voices.
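The API-layer validation could look like the following sketch. The voice set here is an assumption for illustration; the real list is model-dependent.

```python
SUPPORTED_VOICES = {"ethan", "chelsie", "aiden"}  # assumed; real set is per-model

def validate_voice(voice):
    # Reject unsupported names at the API layer with a clear error,
    # instead of letting them fail deep inside the TTS pipeline.
    if voice is None:
        return None  # omitted -> use the model default
    v = voice.strip().lower()
    if v not in SUPPORTED_VOICES:
        raise ValueError(
            f"Unsupported voice {voice!r}. Supported voices: {sorted(SUPPORTED_VOICES)}"
        )
    return v
```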


Missing Tests

This PR needs automated tests:

| Test Type | Description |
| --- | --- |
| Regression test | Verify voice selection works (issue #1845 test was failing) |
| Edge cases | Empty string, None, invalid voice name |
| Concurrent requests | Multiple requests with different voices should not interfere |

The linked issue references tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_voice_001. This PR should add/update that test to verify the fix works.


Summary

| Validated | Lacks Evidence | Must Change |
| --- | --- | --- |
| Voice parameter flows through API | List of supported voices per model | Fix state mutation issue |
| Documentation updated | Test commands/scripts | Remove debug statements |
| Fixes linked issue #1845 | Validation for invalid voices | Add automated tests |

🤖 Generated with Claude Code

- `--image-path` (or `-i`): Path to local image file or URL. If not provided and query-type is `use_image`, uses default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to local audio file or URL. If not provided and query-type is `use_audio`, uses default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
- `--voice`: TTS speaker/voice for audio output when requesting audio (e.g. `ethan`, `chelsie`). Omit to use the model default. Example: `--voice "chelsie"`
A contributor commented:

Is the voice mismatched with the speaker?

R2-Y (Author) replied:

yeah fixed

@amy-why-3459 (Contributor) commented:

@princepride @gcanlin PTAL

@princepride (Collaborator) left a comment:

Review Summary

Thanks for the work on speaker/language configurability! I found several issues that need to be addressed before merging.

1. 🔴 Runtime Bug: args.voice references non-existent attribute

In examples/online_serving/qwen3_tts/openai_speech_client.py, the argparse argument was renamed from --voice to --speaker, which means args.voice no longer exists (it's now args.speaker). However, the code still references args.voice in two places:

"speaker": args.voice,          # ❌ AttributeError at runtime
print(f"Speaker: {args.voice}") # ❌ AttributeError at runtime

Both should be args.speaker.

2. 🔴 Breaking OpenAI API Compatibility

Renaming `voice` to `speaker` in OpenAICreateSpeechRequest and StreamingSpeechSessionConfig breaks OpenAI API compatibility. The OpenAI Speech API uses voice as the standard field name. All existing clients using the OpenAI SDK:

client.audio.speech.create(voice="alloy", ...)

will silently fail or error because the SDK sends voice, not speaker.

Suggestion: Keep voice at the API protocol layer for backward compatibility. You can add speaker as an internal alias or handle the mapping internally. For the chat completions endpoint, using extra_body={"speaker": ...} is fine since it's a vllm-omni extension field.
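The suggested mapping could be sketched as below. The helper name and the precedence rule (extension field `speaker` wins over protocol field `voice` when both appear) are assumptions, not the merged behavior.

```python
def resolve_speaker(payload):
    # Keep the OpenAI-standard `voice` working at the protocol boundary
    # while preferring the vllm-omni extension field `speaker` when both
    # are present, then normalize for the tokenizer.
    value = payload.get("speaker") or payload.get("voice")
    return value.lower() if value else None
```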

3. 🟡 Documentation Inconsistencies

  • examples/online_serving/qwen3_omni/README.md line mentions --voice but the actual CLI arg is now --speaker.
  • docs/user_guide/examples/online_serving/qwen3_tts.md line 298: says "Requires a valid voice" but the field was renamed to speaker.

4. 🟡 Noisy Production Logging

In vllm_omni/entrypoints/openai/serving_chat.py, logger.info is used for per-request debug output:

logger.info(f"preprocess_chat speaker: {speaker}")
logger.info(f"preprocess_chat language: {language}")

These fire on every single request. Should be logger.debug.

5. 🟡 _build_prompt_for_query_type Loses Arguments for Mixed Modalities

The fallback branch for use_mixed_modalities / use_multi_audios:

# use_mixed_modalities / use_multi_audios
return query_func(custom_prompt=custom_prompt)

drops video_path, image_path, audio_path that the original mixed_modalities handler passed. This is a regression from the deleted qwen2.5-omni client.

6. 🟡 No Backward Compatibility for Existing Clients

The WebSocket streaming session.config also changed from voice to speaker. Existing WebSocket clients sending {"type": "session.config", "voice": "Vivian"} will have their voice setting silently ignored. Consider adding a fallback that reads voice if speaker is not present.
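The fallback suggested in point 6 could look like this sketch; the message shape is assumed from the example above, and the function name is hypothetical.

```python
def session_speaker(message, current=None):
    # Backward-compatible session.config handling: read the legacy `voice`
    # key when `speaker` is absent, so older WebSocket clients keep working.
    if message.get("type") != "session.config":
        return current
    return message.get("speaker") or message.get("voice") or current
```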

7. 🟢 Minor: Redundant .lower() Call

In qwen3_tts_talker.py, spk_id_map[speaker.lower()] is called but speaker was already normalized to lowercase a few lines earlier. Harmless but redundant.


Overall: The core feature (per-request speaker selection) is great and fills a real gap. The main blockers are the args.voice runtime bug (#1) and the OpenAI API compatibility break (#2). I'd suggest keeping voice at the API boundary and mapping to speaker internally.

@R2-Y R2-Y force-pushed the config_speaker_id branch from 58dcd1d to f3b0456 on March 23, 2026 15:56
R2-Y added 3 commits March 24, 2026 03:06
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
@R2-Y R2-Y force-pushed the config_speaker_id branch 3 times, most recently from 6249969 to cbd5547 on March 24, 2026 03:17
@R2-Y R2-Y force-pushed the config_speaker_id branch 3 times, most recently from 4f169b7 to 6cff43f on March 24, 2026 04:09
R2-Y (Author) commented Mar 24, 2026

(quoting @princepride's review summary above)

fixed

@Gaohan123 Gaohan123 added the ready label to trigger buildkite CI label Mar 24, 2026
@R2-Y R2-Y force-pushed the config_speaker_id branch from ba58027 to 8a9f0ae on March 24, 2026 05:29
@hsliuustc0106 hsliuustc0106 merged commit 3cdf288 into vllm-project:main Mar 24, 2026
7 of 8 checks passed
@R2-Y R2-Y deleted the config_speaker_id branch March 24, 2026 05:56
zhangj1an pushed commits to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Qwen3-omni, when I specify the voice as female in the system prompt, the output audio is still male.

8 participants