[Qwen3TTS][ServingSpeech] Bugfix/voice upload and add optional ref_text #2046
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
… bugfix/voice-upload-and-ref-text
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 746ba821b0
    if duration > 20.0:
        raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 20s supported — use a shorter clip.")
Remove the shared 20s cap from reference-audio loading
_resolve_ref_audio() is used by all speech models, not just Qwen TTS, so this new duration > 20s rejection breaks Fish Speech voice cloning. In _prepare_speech_generation() the Fish Speech path always calls this helper before prompt construction, while our own Fish Speech UI/docs still recommend 10–30s reference clips (examples/online_serving/fish_speech/gradio_demo.py:193-237). Any documented 20–30s sample will now fail with a 400 even though that flow previously worked.
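One way to address this is to look the cap up per model instead of hard-coding it in the shared helper. The sketch below is only an illustration: the `MAX_REF_SECONDS` table, its model keys, and the `check_ref_duration` name are hypothetical, and the 20s and 30s values are taken from the diff and the Fish Speech guidance quoted above.

```python
# Hypothetical per-model limits in seconds; the real keys and values would
# live in each model's own configuration, not in the shared helper.
MAX_REF_SECONDS = {"qwen3_tts": 20.0, "fish_speech": 30.0}
DEFAULT_MAX_REF_SECONDS = 30.0  # assumed fallback for models without an entry


def check_ref_duration(duration: float, model_key: str) -> None:
    """Raise ValueError if the reference clip exceeds the cap for this model."""
    limit = MAX_REF_SECONDS.get(model_key, DEFAULT_MAX_REF_SECONDS)
    if duration > limit:
        raise ValueError(
            f"Reference audio too long ({duration:.1f}s). "
            f"Maximum {limit:.0f}s supported for {model_key}."
        )
```

With this shape, a 25s clip would still be accepted on the Fish Speech path while being rejected for the model that needs the tighter cap.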
    if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
        supported_tasks.add("speech")
        # TTS stage-0 is an AR model, so we need to add generate
        supported_tasks.add("generate")
Avoid advertising generate for speech-only stage graphs
Adding "generate" whenever any stage returns final_output_type == "audio" makes the API server instantiate all text-generation handlers behind api_server.py:611-766, even for pipelines with no comprehension stage. I checked vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml and cosyvoice3.yaml: neither marks a stage as is_comprehension, so before this change those servers only exposed speech APIs; after it they will also advertise /chat/completions and /completions on graphs that are not wired for text generation.
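A possible gating, sketched under the assumption that each stage's metadata dict exposes an `is_comprehension` flag (the comment above only confirms that flag exists in the stage YAML files, not in `stage_metadata`). The function name `supported_tasks_for` is hypothetical.

```python
from typing import Dict, List, Set


def supported_tasks_for(stage_metadata: List[Dict]) -> Set[str]:
    """Derive the advertised tasks from per-stage metadata dicts."""
    tasks: Set[str] = set()
    if any(m.get("final_output_type") == "audio" for m in stage_metadata):
        tasks.add("speech")
    # Assumption: only a stage flagged as a comprehension stage makes the
    # text-generation endpoints meaningful, so "generate" is gated on it.
    if any(m.get("is_comprehension") for m in stage_metadata):
        tasks.add("generate")
    return tasks
```

Whether Qwen3 TTS needs "generate" for another reason (as the code comment in the diff suggests) is not settled by this sketch; it only shows how speech-only graphs could avoid advertising it.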
    if stored_ref_text:
        params["ref_text"] = [stored_ref_text]
        params["x_vector_only_mode"] = [False]
Normalize blank uploaded transcripts before enabling ICL
upload_voice() now persists ref_text, and this branch treats any truthy stored value as a signal to force in-context cloning. A transcript like ' ' therefore uploads successfully, but every later synthesis for that voice will fail when qwen3_tts_talker.py:1389-1391 rejects the same value because ref_text.strip() is empty. Stripping or rejecting blank transcripts before setting x_vector_only_mode=False would avoid creating voices that can never be used.
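The normalization could look roughly like this. `build_clone_params` is a hypothetical stand-in for the branch in the diff; only the `ref_text` and `x_vector_only_mode` keys come from the quoted code.

```python
from typing import Dict, List, Optional


def build_clone_params(stored_ref_text: Optional[str]) -> Dict[str, List]:
    """Enable in-context cloning only when the transcript has real content."""
    params: Dict[str, List] = {}
    # Strip first so that a whitespace-only transcript is treated as absent.
    ref_text = (stored_ref_text or "").strip()
    if ref_text:
        params["ref_text"] = [ref_text]
        params["x_vector_only_mode"] = [False]
    return params
```

Applying the same `.strip()` check in `upload_voice()` would additionally keep blank transcripts from being persisted in the first place.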
lishunyang12
left a comment
Left a few comments. The ref_text plumbing looks correct overall, nice fix for the upload+generate flow.
    wav_np = np.mean(wav_np, axis=-1)
    return wav_np.tolist(), int(sr)
    sr = int(sr)
    duration = len(wav_np) / sr if sr > 0 else 0.0
Duration validation only runs at generation time (_resolve_ref_audio), not at upload time. If someone uploads a 45s clip, upload_voice succeeds but every subsequent generation request will fail with a confusing error. Should validate duration bounds in upload_voice as well (or instead).
Perfect. I added a check there as well.
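A minimal sketch of a single validator that both the upload path and the generation path could call, so the two checks cannot drift apart. The function name and keyword defaults are hypothetical; the 1s and 30s bounds and the message wording follow the diff quoted in this thread.

```python
def validate_ref_duration(
    num_samples: int,
    sample_rate: int,
    min_s: float = 1.0,
    max_s: float = 30.0,
) -> float:
    """Return the clip duration in seconds, or raise ValueError if out of bounds."""
    # Mirrors the diff: a non-positive sample rate is treated as zero duration.
    duration = num_samples / sample_rate if sample_rate > 0 else 0.0
    if duration < min_s:
        raise ValueError(
            f"Reference audio too short ({duration:.1f}s). "
            f"At least {min_s:.0f}s of clear speech is required for speaker embedding."
        )
    if duration > max_s:
        raise ValueError(
            f"Reference audio too long ({duration:.1f}s). "
            f"Maximum {max_s:.0f}s supported."
        )
    return duration
```

Calling this at upload time means a 45s clip is rejected immediately rather than on the first synthesis request.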
            "At least 1s of clear speech is required for speaker embedding."
        )
    if duration > 30.0:
        raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")
This line is ~110 chars. Split the f-string to stay within the line-length limit.
Suggested change:
    - raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")
    + raise ValueError(
    +     f"Reference audio too long ({duration:.1f}s). "
    +     "Maximum 30s supported — use a shorter clip."
    + )
I tried to change it, but the pre-commit run changes it back. I think the line length is correct.
      n = flat.numel()
      if n == 0 or n % q != 0:
    -     if n > 0:
    +     if n > 1:
Why n > 1 instead of n > 0? When q > 1, a single-element input is still malformed (not divisible by q). Changing this silently swallows the warning for n == 1.
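To make the concern concrete, here is a small stand-alone version of the check that keeps the `n > 0` guard. `is_well_formed` is a hypothetical name, and plain integers stand in for `flat.numel()` and `q`.

```python
import warnings


def is_well_formed(n: int, q: int) -> bool:
    """Return True when n elements split evenly into groups of q."""
    if n == 0 or n % q != 0:
        # Empty input is silently skipped; any other remainder is malformed,
        # including a single element when q > 1.
        if n > 0:
            warnings.warn(f"Expected a multiple of {q} elements, got {n}.")
        return False
    return True
```

With `n > 1` as the guard, the `n == 1` case would return False without any warning, which is the behaviour the question above is about.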
    assert response.status_code == 200
    result = response.json()
    assert result["success"] is True
    assert result["voice"]["name"] == "test_voice_rt"
This test doesn't assert that ref_text was actually stored. At minimum check result["voice"].get("ref_text") == "Hello world transcript" — otherwise the test passes even if ref_text is silently dropped.
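The extra assertion could be folded into a small helper like the one below. It assumes the upload response echoes `ref_text` inside the `voice` object, as the suggestion above implies; `check_upload_result` is a hypothetical name.

```python
from typing import Any, Dict


def check_upload_result(result: Dict[str, Any], name: str, ref_text: str) -> None:
    """Assert that an upload response reports success and kept the transcript."""
    assert result["success"] is True
    voice = result["voice"]
    assert voice["name"] == name
    # Fails if ref_text was silently dropped on the way to storage.
    assert voice.get("ref_text") == ref_text
```

In the test under review this would be called as `check_upload_result(response.json(), "test_voice_rt", "Hello world transcript")`.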
@lishunyang12 changes done. By the way, I'm currently working on a smarter VoiceManager for voice caching, shared across all TTS models, so that the speaker embedding is computed only once (perhaps at voice-upload time) instead of on every voice-cloning request. This should bring a noticeable speedup in batched voice-cloning decoding (#1701).
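As a rough illustration of the compute-once idea, the sketch below caches one embedding per voice at registration time. Everything here is hypothetical: `SpeakerEmbeddingCache`, its methods, and the `embed_fn` callable are placeholders, not the VoiceManager design.

```python
from typing import Callable, Dict, List, Sequence


class SpeakerEmbeddingCache:
    """Compute a speaker embedding once per voice and reuse it afterwards."""

    def __init__(self, embed_fn: Callable[[Sequence[float]], List[float]]) -> None:
        self._embed_fn = embed_fn  # placeholder for the model's embedding extractor
        self._embeddings: Dict[str, List[float]] = {}

    def register(self, voice_name: str, samples: Sequence[float]) -> None:
        """Run the expensive embedding step at upload time."""
        self._embeddings[voice_name] = self._embed_fn(samples)

    def get(self, voice_name: str) -> List[float]:
        """Return the cached embedding; raises KeyError for unknown voices."""
        return self._embeddings[voice_name]
```

Each request in a batch would then call `get()` rather than recomputing the embedding from the reference audio.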
I suggest you propose a new RFC to implement it, along with some related work. I think multi-turn real-time voice chat is an important scenario and we can do more work on it. Feel free to reach out anytime.
This makes sense. Do you mean a full RFC on a voice caching manager? @Sy0307
Yes, you can propose such an RFC. Additionally, I believe that multi-turn conversations and similar caching features are helpful for any model with real-time interactive multi-turn requirements. We can start with voice models for now, but I hope the design of this RFC can be extended to more real-time models, such as world models or real-time diffusion models. Reference: #1987 (comment) Feel free to leave your thoughts.
@Sy0307 this is super cool btw. In the future we need to work on a way to keep a "chat/conversation" style, but for voice. Maybe even something with
Resolve conflicts please, also fix CI.
Head branch was pushed to by a user without write access
@linyueqian done. Thanks!
Fix CI please.
…/JuanPZuluaga/vllm-omni into bugfix/voice-upload-and-ref-text
@linyueqian thanks. I'm looking into it.
…xt (vllm-project#2046) Signed-off-by: JuanPZuluaga <juanz9312@gmail.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Purpose
Test Plan
1. Launch server (same for both branches)
Test Result
outputs after the fix:
baseline.wav
gen_icl.wav
gen_xvec.wav