[CosyVoice3] Add online serving support, fix stage config, and add CI tests #2431
557f2a1 to 6d2a5e0
@yenuo26 PTAL
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 557f2a1910
… tests
- Namespace stage names to cosyvoice3_talker/cosyvoice3_code2wav to avoid collision with other models using generic talker/code2wav names
- Register CosyVoice3 in the TTS serving layer with validation and prompt building for /v1/audio/speech endpoint (voice cloning with ref_audio)
- Fix cuDNN crash in code2wav by setting enforce_eager=true (Conv1d dynamic shapes are incompatible with CUDA graphs)
- Add sr=22050 to code2wav multimodal output for correct audio playback
- Tune gpu_memory_utilization (0.2/0.1) for the 0.5B model
- Auto-inject model_type into hf_overrides so models with empty config.json (like CosyVoice3) can be loaded directly from HuggingFace
- Register omni model configs in vLLM _CONFIG_REGISTRY for config resolution
- Auto-detect tokenizer in subdirectories for models that don't store it at the root (CosyVoice-BlankEN/)
- Auto-download mel_filters.npz asset from Whisper repo when missing
- Add unit tests for CosyVoice3 serving (validation, detection, prompt)
- Add e2e test with official CosyVoice zero-shot reference audio
- Add CI steps in merge (core_model) and nightly (advanced_model) pipelines
Signed-off-by: linyueqian <linyueqian@outlook.com>
6d2a5e0 to 78f2d65
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
- |
  timeout 20m bash -c '
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
    pytest -s -v tests/e2e/online_serving/test_cosyvoice3_tts.py -m "core_model" --run-level "core_model"
We don't need to put the same test cases in ready, merge, and nightly simultaneously. I think you may need to modify it here to pytest -s -v tests/e2e/online_serving/test_cosyvoice3_tts.py -m "advanced_model" --run-level "advanced_model"
    pytest -s -v tests/e2e/offline_inference/test_voxtral_tts.py -m "advanced_model" --run-level "advanced_model"
    EXIT4=$$?
    exit $$((EXIT1 | EXIT2 | EXIT3 | EXIT4))
    pytest -s -v tests/e2e/online_serving/test_cosyvoice3_tts.py -m "advanced_model" --run-level "advanced_model"
i think maybe this can be deleted
@@ -0,0 +1,172 @@
# SPDX-License-Identifier: Apache-2.0
To unify the code style, maybe we can rewrite this test case to follow tests/e2e/online_serving/test_qwen3_tts_base.py? If there are validation points that it cannot cover, we can add them to assert_audio_speech_response in tests/conftest.py.
- Use advanced_model marker in merge pipeline
- Remove duplicate CosyVoice3 step from nightly pipeline
- Rewrite test to follow test_qwen3_tts_base.py style with function-based tests and openai_client.send_audio_speech_request; basic zh test marked core_model+advanced_model, others advanced_model only
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Thanks for the feedback @yenuo26! I've made the following changes:
lishunyang12 left a comment:
Left a few comments on the arg_utils changes.
_TOKENIZER_SUBFOLDER_MAP = {
    "CosyVoice3Model": "CosyVoice-BlankEN",
}
subfolder = _TOKENIZER_SUBFOLDER_MAP.get(self.model_arch)
This downloads the full snapshot (all files matching the subfolder prefix), not just tokenizer files. For a model repo with large checkpoint files in nested dirs this could be slow or wasteful.
subfolder = _TOKENIZER_SUBFOLDER_MAP.get(self.model_arch)
local_dir = snapshot_download(
    model_path,
    allow_patterns=[
        f"{subfolder}/tokenizer*",
        f"{subfolder}/special_tokens*",
        f"{subfolder}/vocab*",
        f"{subfolder}/merges*",
        f"{subfolder}/added_tokens*",
    ],
)
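A minimal sketch of the narrowing idea in the suggestion above, assuming the file stems listed there are the only tokenizer files needed. The helper names (`tokenizer_allow_patterns`, `download_tokenizer_only`, `is_selected`) are hypothetical and not part of this PR; `is_selected` only approximates the Hub's glob filtering with `fnmatch` so the patterns can be checked without a network call, and `model.safetensors` is an invented example of a large file.

```python
from fnmatch import fnmatch

# Tokenizer-related file stems, copied from the review suggestion.
_TOKENIZER_FILE_STEMS = ("tokenizer", "special_tokens", "vocab", "merges", "added_tokens")


def tokenizer_allow_patterns(subfolder: str) -> list:
    """Build glob patterns that match only tokenizer files under `subfolder`."""
    return [f"{subfolder}/{stem}*" for stem in _TOKENIZER_FILE_STEMS]


def download_tokenizer_only(repo_id: str, subfolder: str) -> str:
    """Fetch just the tokenizer files and return the local snapshot directory."""
    # Imported lazily so the pattern helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id, allow_patterns=tokenizer_allow_patterns(subfolder))


def is_selected(path: str, patterns: list) -> bool:
    """Approximate the glob filtering locally (an assumption for illustration)."""
    return any(fnmatch(path, pattern) for pattern in patterns)


patterns = tokenizer_allow_patterns("CosyVoice-BlankEN")
print(is_selected("CosyVoice-BlankEN/tokenizer_config.json", patterns))
print(is_selected("CosyVoice-BlankEN/model.safetensors", patterns))
print(is_selected("llm.pt", patterns))
```

Checking the patterns locally like this is a cheap way to confirm that checkpoint files stay out of the download before touching the network.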
    self.hf_overrides = {}
if isinstance(self.hf_overrides, dict):
    self.hf_overrides.setdefault("architectures", [self.model_arch])
    # Derive model_type from known arch→model_type mappings.
Nit: defining this dict inside create_model_config means it gets rebuilt every call. Move it to module level.
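A rough sketch of what the nit asks for, assuming `hf_overrides` is a plain dict or `None`. The function name `apply_arch_overrides` is hypothetical, and the `"CosyVoice3Model"` to `"cosyvoice3"` entry is an assumed example value rather than the PR's actual table.

```python
from typing import Optional

# Module-level table: built once at import time, not on every call.
# The single entry below is an assumed example value.
_ARCH_TO_MODEL_TYPE = {
    "CosyVoice3Model": "cosyvoice3",
}


def apply_arch_overrides(hf_overrides: Optional[dict], model_arch: str) -> dict:
    """Fill in architectures/model_type without clobbering user-supplied values."""
    overrides = {} if hf_overrides is None else hf_overrides
    overrides.setdefault("architectures", [model_arch])
    model_type = _ARCH_TO_MODEL_TYPE.get(model_arch)
    if model_type is not None:
        # setdefault keeps an explicit model_type passed by the user.
        overrides.setdefault("model_type", model_type)
    return overrides


print(apply_arch_overrides(None, "CosyVoice3Model"))
print(apply_arch_overrides({"model_type": "custom"}, "CosyVoice3Model"))
```

Using `setdefault` for both keys means an explicit user override always wins over the derived value.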
"Missing CosyVoice3 mel filter asset:\n"
f" {filters_path}\n"
"Auto-download failed. Download it manually from:\n"
f" {source_url}\n"
urlretrieve has no timeout — this can hang indefinitely in CI. Consider urllib.request.urlopen with a timeout + manual write, or just requests.get(timeout=30).
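One way the `urlopen` variant could look, as a sketch rather than the code that landed in the PR. The function name `download_with_timeout` is hypothetical. Note that the timeout applies to each blocking socket operation, so it bounds stalls rather than total download time.

```python
import os
import shutil
import tempfile
import urllib.request


def download_with_timeout(url: str, dest: str, timeout: float = 30.0) -> str:
    """Download `url` to `dest`, raising instead of hanging if the server stalls."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)
    # Write to a temporary file first so an interrupted download never
    # leaves a truncated file at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url, timeout=timeout) as resp:
            shutil.copyfileobj(resp, out)
        os.replace(tmp_path, dest)
    except BaseException:
        # Clean up the partial file, then let the caller see the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest
```

A caller can still wrap this in the existing error message so that a timeout falls through to the manual-download instructions.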
…row download, add timeout
- Move _ARCH_TO_MODEL_TYPE and _TOKENIZER_SUBFOLDER_MAP to module level
- Narrow snapshot_download allow_patterns to tokenizer files only
- Replace urlretrieve with urlopen(timeout=30) to prevent CI hangs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
When model_dir is an HF repo ID (e.g. FunAudioLLM/Fun-CosyVoice3-0.5B-2512), os.path.join with qwen_pretrain_path produces an invalid 3-part repo ID that AutoTokenizer.from_pretrained rejects. Use snapshot_download to resolve to the local cache directory first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
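The failure mode and the fix can be sketched as follows. The helper name `resolve_model_dir` is hypothetical, and using `CosyVoice-BlankEN` as the joined subfolder is an assumption based on the tokenizer directory mentioned elsewhere in this PR.

```python
import os


def resolve_model_dir(model_dir: str) -> str:
    """Return a local directory for `model_dir`, downloading from the Hub if needed."""
    if os.path.isdir(model_dir):
        return model_dir
    # Anything that is not a local directory is treated as a Hub repo ID.
    # Imported lazily so local paths work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(model_dir)


repo_id = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"
# Joining onto the raw repo ID yields "org/name/subfolder", which is a path
# with three parts and not a valid two-part repo ID.
bad = os.path.join(repo_id, "CosyVoice-BlankEN")
print(bad)
```

Resolving once and joining onto the returned local directory keeps every later `os.path.join` call valid for both local and Hub-hosted models.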
The previous fix only covered the processor path. CosyVoice3Model.load_weights also uses self.model_dir with os.path.join for flow.pt, llm.pt, hift.pt etc. Resolve the HF repo ID to local cache in __init__ so all downstream code works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
… generation_config
Models like CosyVoice3 have an empty config.json ({}) without model_type,
which causes AutoConfig.from_pretrained to fail. This commit:
1. Registers omni config classes with vLLM's internal _CONFIG_REGISTRY
(not just transformers AutoConfig) so HFConfigParser can resolve them
2. Injects model_type into hf_overrides when model_arch is specified
3. Patches try_get_generation_config in _attach_llm_stage to avoid
crashes for models without generation_config.json
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
…ine-serving-ci # Conflicts: # vllm_omni/entrypoints/openai/serving_speech.py Signed-off-by: linyueqian <linyueqian@outlook.com>
The previous patch returned None, but get_diff_sampling_param() calls .update() on the result, causing AttributeError on NoneType. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
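A toy reproduction of the bug described in this commit. `ToyModelConfig` and `patch_generation_config` are hypothetical stand-ins written for illustration; only the method names `try_get_generation_config` and `get_diff_sampling_param` come from the commit message, and the real vLLM methods do more than this.

```python
from typing import Callable


class ToyModelConfig:
    """Hypothetical stand-in for the config object; not vLLM's real class."""

    def __init__(self, overrides: dict):
        self.overrides = overrides

    def try_get_generation_config(self) -> dict:
        # Simulates a model repo that ships no generation_config.json.
        raise FileNotFoundError("generation_config.json not found")

    def get_diff_sampling_param(self) -> dict:
        config = self.try_get_generation_config()
        config.update(self.overrides)  # AttributeError if config is None
        return config


def patch_generation_config(cfg: ToyModelConfig, factory: Callable) -> None:
    """Replace the loader on this instance with one that returns factory()."""
    cfg.try_get_generation_config = lambda: factory()


cfg = ToyModelConfig({"temperature": 0.7})
patch_generation_config(cfg, dict)  # return a fresh empty dict, not None
print(cfg.get_diff_sampling_param())
```

Returning a fresh dict on each call also avoids sharing one mutable fallback between callers.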
vllm-project#2431 inserted the CosyVoice3 step between Voxtral's commands block and its agents/plugins block, causing the agents/plugins to attach to CosyVoice3 instead of Voxtral. This left the Voxtral step without a queue, causing `buildkite-agent pipeline upload` to reject the entire pipeline with "No queue specified". Add back the kubernetes agents/plugins block for the Voxtral step. Signed-off-by: linyueqian <linyueqian@outlook.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests (vllm-project#2431) Signed-off-by: linyueqian <linyueqian@outlook.com>
Summary
CosyVoice3 online serving via `/v1/audio/speech` was broken due to generic stage names colliding with other models and missing model type registration. This PR fixes the serving path end-to-end and adds CI coverage.

Bug Fixes
- Rename stages `talker`/`code2wav` → `cosyvoice3_talker`/`cosyvoice3_code2wav` to avoid conflicts with other models
- Set `enforce_eager=true` for the code2wav stage: Conv1d in HiFiGAN has dynamic shapes incompatible with CUDA graphs
- Add `sr=22050` to code2wav multimodal output (was defaulting to 24000)
- Auto-inject `model_type` via `hf_overrides` for models with empty `config.json` (like CosyVoice3)
- Auto-detect the tokenizer in subdirectories (`CosyVoice-BlankEN/`) for both local and HF-hosted models
- Auto-download `mel_filters.npz` from the Whisper repo when missing instead of raising an error

Online Serving
- Register `cosyvoice3_talker` in TTS model stages
- Detect the model type as `"cosyvoice3"`
- `_validate_cosyvoice3_request()`: requires `input`, `ref_audio`, and `ref_text`
- `_build_cosyvoice3_prompt()`: multimodal prompt with reference audio for voice cloning

Config Tuning
- `gpu_memory_utilization`: 0.4→0.2 (stage 0), 0.2→0.1 (stage 1); the 0.5B model doesn't need high utilization
- `enforce_eager: true` on both stages, which avoids a 20+ min CUDA graph capture on startup

Tests
- Unit tests for CosyVoice3 serving: validation, detection, prompt (`core_model` and `cpu`)
- E2E test in the merge pipeline (`core_model`, H100) and nightly pipeline (`advanced_model`)

Test plan
Related: Supersedes #2121
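For reference, the validation rule listed under Online Serving (requests must carry `input`, `ref_audio`, and `ref_text`) can be sketched as below. This is an illustration only, not the PR's `_validate_cosyvoice3_request()`: the function name, the dict-shaped request, the treatment of all three fields as non-empty strings, and the error wording are assumptions.

```python
from typing import Optional

# Fields the PR description says a CosyVoice3 request must provide.
_REQUIRED_FIELDS = ("input", "ref_audio", "ref_text")


def validate_cosyvoice3_request(request: dict) -> Optional[str]:
    """Return an error message if a required field is missing or empty, else None."""
    missing = [
        name
        for name in _REQUIRED_FIELDS
        if not isinstance(request.get(name), str) or not request[name].strip()
    ]
    if missing:
        return "CosyVoice3 requests require: " + ", ".join(missing)
    return None


print(validate_cosyvoice3_request({"input": "hello", "ref_text": "reference"}))
```

Returning a message instead of raising lets the serving layer decide how to map the failure to an HTTP error response.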