[Bugfix] Fix CosyVoice3 online serving via /v1/audio/speech #2121
divyanshsinghvi wants to merge 2 commits into
…ject#2043)

CosyVoice3 model_stage values (talker/code2wav) were not recognized by the online speech serving path, causing requests to fall through to the generic text-only prompt builder and crash with CUDA index out-of-bounds.

- Namespace CosyVoice3 stage types to cosyvoice3_talker/cosyvoice3_code2wav to avoid collision with other models using the same generic names
- Register cosyvoice3_talker in _TTS_MODEL_STAGES so the model is recognized as TTS in the serving layer
- Add cosyvoice3 branch in _prepare_speech_generation to build the correct multimodal prompt (audio data + prompt_text) matching the offline inference format

Closes vllm-project#2043

Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
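The stage-name problem described in this commit can be illustrated with a minimal sketch. Only `_TTS_MODEL_STAGES`, `cosyvoice3_talker`, and `cosyvoice3_code2wav` come from the PR; the helper `is_tts_model` and the exact shape of the registry are assumptions made for illustration and may not match the project's actual code.

```python
# Hypothetical sketch: apart from _TTS_MODEL_STAGES and the two cosyvoice3_*
# stage strings, every name here is illustrative, not the project's real code.
_TTS_MODEL_STAGES = frozenset({"cosyvoice3_talker"})


def is_tts_model(stage_names):
    """Return True if any configured stage is a registered TTS stage."""
    return any(name in _TTS_MODEL_STAGES for name in stage_names)


# Before the fix: the generic names are not registered, so a request would
# fall through to the text-only prompt builder.
old_stages = ["talker", "code2wav"]
# After the fix: the namespaced talker stage is recognized as TTS.
new_stages = ["cosyvoice3_talker", "cosyvoice3_code2wav"]

print(is_tts_model(old_stages))
print(is_tts_model(new_stages))
```

Namespacing the stage strings keeps the lookup unambiguous when several models would otherwise share a generic name such as talker.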
Gaohan123 left a comment
Thanks for your contribution! Please post test results and add unit tests (UT) to protect the changed functions.
@divyanshsinghvi are there any updates? Thanks!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 056ac58036
        "sample_rate": sr,
    },
}
tts_params = {}
Respect max_new_tokens for CosyVoice3 requests
The new CosyVoice3 path drops all per-request generation controls by setting tts_params = {} and never mapping request.max_new_tokens into sampling params, so /v1/audio/speech callers who set max_new_tokens for latency/cost control will have that limit silently ignored. This is observable whenever max_new_tokens is provided with a CosyVoice3 model and can lead to much longer-than-requested decoding runs.
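One way to address this review comment is sketched below. It assumes the request object exposes `max_new_tokens` as an optional integer, as the comment implies; the `SpeechRequest` class, the `build_tts_params` helper, and the `max_tokens` key are hypothetical stand-ins and not the project's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRequest:
    """Hypothetical stand-in for the serving request object."""
    input: str
    max_new_tokens: Optional[int] = None


def build_tts_params(request):
    """Carry per-request generation limits over instead of dropping them."""
    tts_params = {}
    if request.max_new_tokens is not None:
        # Assumed key name; the real sampling-params field may differ.
        tts_params["max_tokens"] = request.max_new_tokens
    return tts_params


print(build_tts_params(SpeechRequest("hello", max_new_tokens=256)))
```

Leaving the key out when the caller sets no limit keeps the model's default decoding budget untouched.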
Cover the changes in this PR:

- model_stage rename (cosyvoice3_talker/cosyvoice3_code2wav)
- TTS model type detection for cosyvoice3
- Validation: ref_audio, ref_text, and input text required
- Prompt building with audio data and processor kwargs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
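The validation rule listed in this commit (ref_audio, ref_text, and input text are all required) could be exercised roughly as follows. The function name, error strings, and the sample ref_audio value are hypothetical; only the three required fields come from the commit message.

```python
def validate_cosyvoice3_request(text, ref_audio, ref_text):
    """Return an error message, or None when the request is usable.

    Hypothetical helper: the real serving code may raise or return a
    structured error response instead.
    """
    if not text or not text.strip():
        return "input text is required"
    if not ref_audio:
        return "ref_audio is required"
    if not ref_text or not ref_text.strip():
        return "ref_text is required"
    return None


# A complete request passes; a request without reference audio is rejected.
print(validate_cosyvoice3_request("Hello there.", "ref.wav", "Reference transcript."))
print(validate_cosyvoice3_request("Hello there.", None, "Reference transcript."))
```

Rejecting incomplete requests before prompt building avoids sending a malformed multimodal prompt to the model.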
Added unit tests (8 tests, all passing) covering:
Could you also add e2e benchmark results (TTFP, RTF, latency) from a working run? That would help get this merged.
Will do over the next few days.
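For reference, the requested metrics could be derived from a single timed run along these lines. This is a sketch under assumptions: it takes TTFP to mean the time from request start to the first audio packet and RTF to mean total synthesis time divided by the duration of the generated audio, which are common conventions but are not spelled out in this thread. The function name and arguments are hypothetical.

```python
def summarize_run(request_start, first_packet_time, request_end,
                  num_samples, sample_rate):
    """Compute TTFP, end-to-end latency, and RTF for one request.

    Timestamps are in seconds; num_samples / sample_rate gives the
    duration of the generated audio.
    """
    latency = request_end - request_start
    audio_seconds = num_samples / sample_rate
    return {
        "ttfp_s": first_packet_time - request_start,
        "latency_s": latency,
        # RTF below 1.0 means audio is generated faster than real time.
        "rtf": latency / audio_seconds,
    }


# Illustrative numbers only, not measured results.
print(summarize_run(0.0, 0.25, 2.0, num_samples=96000, sample_rate=24000))
```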
Hi @divyanshsinghvi, since #2431 is merged, I will close this PR for now.
CosyVoice3 model_stage values (talker/code2wav) were not recognized by the online speech serving path, causing requests to fall through to the generic text-only prompt builder and crash with CUDA index out-of-bounds.
Closes #2043
Purpose
Fix CosyVoice3 online serving via the /v1/audio/speech endpoint, which was crashing with CUDA index out-of-bounds because the model was not recognized as TTS.
Test Plan
verify_e2e_cosyvoice.py /v1/audio/speech with CosyVoice3 model (requires ref_audio and ref_text)
Test Result
Pending