[Feature] support to change the speaker of qwen3-omni #1963
hsliuustc0106 merged 7 commits into vllm-project:main
Force-pushed 42b0a88 to b4e36bc
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae1c610ee2
```python
p = prompt[index] if isinstance(prompt, list) and index < len(prompt) else prompt
if p is None:
    return None
print(f"_extract_voice_type_from_prompt p: {p}")
```
Remove raw prompt printing from voice extraction path
`_extract_voice_type_from_prompt` currently prints the full prompt object for every thinker output, and this code path runs during normal request forwarding. In production this can leak user prompt contents into stdout/log sinks, and the synchronous console I/O on the hot path also hurts throughput; please remove this print (or replace it with gated, redacted debug logging).
Could you also adapt qwen2.5 and qwen3-tts accordingly? Could you also adapt `language_id` accordingly?
```diff
 else:
     logger.debug("No additional_information provided to code2wav stage.")
-audio_tensors = self.generate_audio(codes, voice_type, left_context_size, seq_token_counts)
+audio_tensors = self.generate_audio(codes, self.voice_type, left_context_size, seq_token_counts)
```
In the multi-concurrency scenario, is there any problem if different users use different `voice_type` values while `self.voice_type` is read here?

It may cause problems; I'll look into it.
```python
logger = init_logger(__name__)


def _voice_type_from_runtime_info(
```
Are `_voice_type_from_runtime_info`, `_extract_voice_type_from_request`, and `_extract_voice_type_from_prompt` general for all models? The `voice_type` field seems to be defined by our framework rather than being model-specific. If so, could we make them utility functions?
Yeah, they're general for all models; I'll make them utility functions. Good idea, thanks.
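A rough sketch of what such a shared utility module could look like (the helper names come from this thread; the signatures and payload shapes are assumptions):

```python
# Hypothetical common module (e.g. voice_utils.py): since voice_type is a
# framework-level field rather than a model-specific one, these helpers can
# live in one place and be imported by qwen3-omni, qwen2.5-omni, and
# qwen3-tts alike.
from typing import Optional

def extract_voice_type_from_request(request: dict) -> Optional[str]:
    """Read the framework-level voice field from a request payload."""
    voice = request.get("voice") or request.get("speaker")
    return voice.lower() if isinstance(voice, str) else None

def voice_type_from_runtime_info(runtime_info: Optional[dict],
                                 default: Optional[str] = None) -> Optional[str]:
    """Prefer per-request runtime info; fall back to a model default."""
    if runtime_info and isinstance(runtime_info.get("voice_type"), str):
        return runtime_info["voice_type"].lower()
    return default
```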
```python
print(response.choices[1].message.audio)
```

Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
I checked the Qwen3-Omni config and found one more speaker id. I think we can add it to the docs so users don't have to look up the config manually.
```diff
-Supported voice names depend on the model (e.g. `ethan`, `chelsie`). Omit `voice` to use the default.
+Supported voice names depend on the model (e.g. `ethan`, `chelsie`, `aiden`). Omit `voice` to use the default.
```
can do
Does this mean that after this PR is merged, this test case (#1844) will need to be re-adapted, and the same scenario will be added for Qwen2.5?
Should the speaker names use an uppercase initial letter? Currently they're lowercase (e.g.
I can let users enter names with an uppercase initial (or whatever casing they want), but I still need to convert them to lowercase in the model so the model can tokenize them correctly.
Need to configure two more parameters in the test.
hsliuustc0106 left a comment
PR Review: [Feature] support to change the speaker of qwen3-omni
Gate Status: PASSING ✓
All CI checks pass. Proceeding with full review.
Routing
- Primary: Audio/TTS (`vllm-omni-audio-tts`) - voice/speaker selection
- Secondary: API (`vllm-omni-api`) - new `voice` parameter
Critical Findings
1. State Mutation on Model Instance [HIGH]
In `qwen3_omni.py:forward()`:

```python
self.voice_type = effective_voice_type
```

Setting `voice_type` on `self` (the model instance) could cause race conditions in concurrent requests. If two requests with different voices are processed simultaneously, they could overwrite each other's voice setting.
Recommendation: Pass voice_type through the request pipeline (via additional_information) rather than storing on the model instance.
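A sketch of that recommendation (method and field names follow the snippets quoted in this thread; the exact signatures are assumptions): resolve the voice per call from `additional_information` instead of assigning `self.voice_type`, so concurrent requests cannot clobber each other.

```python
def code2wav_forward(self, codes, left_context_size, seq_token_counts,
                     additional_information=None):
    # Per-request value first, model default second; no instance mutation.
    voice_type = None
    if additional_information is not None:
        voice_type = additional_information.get("voice_type")
    if voice_type is None:
        # Immutable default, set once in __init__ and never reassigned.
        voice_type = self.default_voice_type
    return self.generate_audio(codes, voice_type,
                               left_context_size, seq_token_counts)
```

Because nothing is written to `self`, two simultaneous requests with different voices each see only their own `voice_type`.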
2. Debug Statements Left in Code
Multiple debug statements should be removed or converted to appropriate log levels:

```python
# qwen3_omni.py (stage_input_processors)
print(f"_extract_voice_type_from_prompt p: {p}")  # Remove or use logger

# serving_chat.py
logger.info(f"preprocess_chat voice: {voice}")  # Use logger.debug()

# qwen3_omni.py (model)
logger.info(f"voice_type: {voice_type}")  # Use logger.debug()
```

3. Missing Input Validation
The `voice` parameter is not validated against supported voice types (e.g. `ethan`, `chelsie`). Invalid voice names will be silently passed to the model, which may cause unexpected behavior or errors deep in the TTS pipeline.
Recommendation: Add validation at the API layer with a clear error message for unsupported voices.
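A minimal sketch of such API-layer validation; the actual supported set is model-dependent, so the voice list and the error type here are assumptions:

```python
SUPPORTED_VOICES = {"ethan", "chelsie", "aiden"}  # assumed; read from model config in practice

def validate_voice(voice):
    """Fail fast with a clear message instead of erroring deep in TTS."""
    if voice is None:
        return None  # omitted -> model default
    normalized = voice.strip().lower()
    if normalized not in SUPPORTED_VOICES:
        raise ValueError(
            f"unsupported voice {voice!r}; expected one of "
            f"{sorted(SUPPORTED_VOICES)}")
    return normalized
```

Normalizing to lowercase here also addresses the casing question raised earlier in the thread: users can type any casing at the boundary, and the model only ever sees the tokenizable lowercase form.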
Missing Tests
This PR needs automated tests:
| Test Type | Description |
|---|---|
| Regression test | Verify voice selection works (issue #1845 test was failing) |
| Edge cases | Empty string, None, invalid voice name |
| Concurrent requests | Multiple requests with different voices should not interfere |
The linked issue references tests/e2e/online_serving/test_qwen3_omni_expansion.py::test_voice_001. This PR should add/update that test to verify the fix works.
Summary
| Validated | Lacks Evidence | Must Change |
|---|---|---|
| Voice parameter flows through API | List of supported voices per model | Fix state mutation issue |
| Documentation updated | Test commands/scripts | Remove debug statements |
| Fixes linked issue #1845 | Validation for invalid voices | Add automated tests |
🤖 Generated with Claude Code
Force-pushed 5f295a3 to 68ba3b3
- `--image-path` (or `-i`): Path to a local image file or URL. If not provided and query-type is `use_image`, uses the default image URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common image formats: JPEG, PNG, GIF, WebP. Example: `--image-path /path/to/image.jpg` or `--image-path https://example.com/image.png`
- `--audio-path` (or `-a`): Path to a local audio file or URL. If not provided and query-type is `use_audio`, uses the default audio URL. Supports local file paths (automatically encoded to base64) or HTTP/HTTPS URLs and common audio formats: MP3, WAV, OGG, FLAC, M4A. Example: `--audio-path /path/to/audio.wav` or `--audio-path https://example.com/audio.mp3`
- `--prompt` (or `-p`): Custom text prompt/question. If not provided, uses the default prompt for the selected query type. Example: `--prompt "What are the main activities shown in this video?"`
- `--voice`: TTS speaker/voice for audio output when requesting audio (e.g. `ethan`, `chelsie`). Omit to use the model default. Example: `--voice "chelsie"`
Is `voice` mismatched with `speaker` here?
@princepride @gcanlin PTAL
princepride left a comment
Review Summary
Thanks for the work on speaker/language configurability! I found several issues that need to be addressed before merging.
1. 🔴 Runtime Bug: args.voice references non-existent attribute
In `examples/online_serving/qwen3_tts/openai_speech_client.py`, the argparse argument was renamed from `--voice` to `--speaker`, which means `args.voice` no longer exists (it's now `args.speaker`). However, the code still references `args.voice` in two places:

```python
"speaker": args.voice,            # ❌ AttributeError at runtime
print(f"Speaker: {args.voice}")   # ❌ AttributeError at runtime
```

Both should be `args.speaker`.
2. 🔴 Breaking OpenAI API Compatibility
Renaming `voice` → `speaker` in `OpenAICreateSpeechRequest` and `StreamingSpeechSessionConfig` breaks OpenAI API compatibility. The OpenAI Speech API uses `voice` as the standard field name. All existing clients using the OpenAI SDK:

```python
client.audio.speech.create(voice="alloy", ...)
```

will silently fail or error because the SDK sends `voice`, not `speaker`.
Suggestion: Keep voice at the API protocol layer for backward compatibility. You can add speaker as an internal alias or handle the mapping internally. For the chat completions endpoint, using extra_body={"speaker": ...} is fine since it's a vllm-omni extension field.
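A sketch of that mapping (a dataclass stand-in for the actual request model; field names follow the review, everything else is an assumption): keep `voice` as the OpenAI-compatible wire field and fill the internal `speaker` alias from it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechRequest:
    """Stand-in for OpenAICreateSpeechRequest (hypothetical shape)."""
    input: str
    voice: Optional[str] = None    # OpenAI-standard field, kept for SDK compat
    speaker: Optional[str] = None  # vllm-omni extension / internal alias

    def __post_init__(self):
        # Existing OpenAI SDK clients send `voice`; map it onto `speaker`
        # so they keep working unchanged. An explicit `speaker` wins.
        if self.speaker is None:
            self.speaker = self.voice
```

Downstream code then reads only `request.speaker`, regardless of which field the client sent.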
3. 🟡 Documentation Inconsistencies
- `examples/online_serving/qwen3_omni/README.md` mentions `--voice`, but the actual CLI arg is now `--speaker`.
- `docs/user_guide/examples/online_serving/qwen3_tts.md` line 298 says "Requires a valid `voice`", but the field was renamed to `speaker`.
4. 🟡 Noisy Production Logging
In `vllm_omni/entrypoints/openai/serving_chat.py`, `logger.info` is used for per-request debug output:

```python
logger.info(f"preprocess_chat speaker: {speaker}")
logger.info(f"preprocess_chat language: {language}")
```

These fire on every single request. Should be `logger.debug`.
5. 🟡 _build_prompt_for_query_type Loses Arguments for Mixed Modalities
The fallback branch for `use_mixed_modalities` / `use_multi_audios`:

```python
# use_mixed_modalities / use_multi_audios
return query_func(custom_prompt=custom_prompt)
```

drops the `video_path`, `image_path`, and `audio_path` that the original `mixed_modalities` handler passed. This is a regression from the deleted qwen2.5-omni client.
6. 🟡 No Backward Compatibility for Existing Clients
The WebSocket streaming session.config also changed from voice to speaker. Existing WebSocket clients sending {"type": "session.config", "voice": "Vivian"} will have their voice setting silently ignored. Consider adding a fallback that reads voice if speaker is not present.
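A sketch of such a fallback (the message shape comes from the example above; the function name and session structure are hypothetical):

```python
def apply_session_config(config_msg: dict, session: dict) -> dict:
    speaker = config_msg.get("speaker")
    if speaker is None:
        # Legacy WebSocket clients still send "voice"; honor it rather
        # than silently ignoring their setting.
        speaker = config_msg.get("voice")
    if speaker is not None:
        # Normalize casing at the boundary (e.g. "Vivian" -> "vivian").
        session["speaker"] = speaker.lower()
    return session
```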
7. 🟢 Minor: Redundant .lower() Call
In `qwen3_tts_talker.py`, `spk_id_map[speaker.lower()]` is called, but `speaker` was already normalized to lowercase a few lines earlier. Harmless but redundant.
Overall: The core feature (per-request speaker selection) is great and fills a real gap. The main blockers are the `args.voice` runtime bug (item 1) and the OpenAI API compatibility break (item 2). I'd suggest keeping `voice` at the API boundary and mapping to `speaker` internally.
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Force-pushed 6249969 to cbd5547
Force-pushed 4f169b7 to 6cff43f
fixed
Purpose
Add voice & language config for users; support customizing the speaker & language of qwen3-omni, qwen2.5-omni, and qwen3-tts.
fix #1845
Test Plan
Change between different voice types and check whether the voice in `output.wav` is correct.
Test Result
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.