[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1 #3748
CosyVoice3 already works behind `/v1/audio/speech` (see the `tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py` suite and `vllm_omni/deploy/cosyvoice3.yaml`), but the online-serving docs still claimed it was unsupported and the model was missing from the speech-API intro and Quick Start. This change closes that gap.
Changes:
- `docs/serving/speech_api.md`: add CosyVoice3 to the intro bullet list and a serve command to Quick Start (Supported Models table already carried the entry).
- `docs/user_guide/examples/online_serving/text_to_speech.md`: add a CosyVoice3 row to the Supported Models table, drop the stale "intentionally absent" line, add a full `## CosyVoice3` section (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice pattern, include 22.05 kHz in the sample-rate hint, and add `--8<--` example-materials includes.
- `examples/online_serving/text_to_speech/cosyvoice3/run_server.sh`: launch helper honoring `MODEL`, `PORT`, and `NO_ASYNC_CHUNK`.
- `examples/online_serving/text_to_speech/cosyvoice3/speech_client.py`: voice-cloning client that always sends `ref_audio` + `ref_text` (required by `_validate_cosyvoice3_request`), with the official upstream zero-shot prompt as the default so it runs out-of-box. Supports `--stream`, `--ref-audio`, `--ref-text`, `--response-format`, `--output`.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
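For readers who want to see the shape of the request the client has to build, here is a minimal sketch, not the actual `speech_client.py`. The `ref_audio` and `ref_text` fields come from the commit message; the `model`, `input`, and `response_format` keys follow the usual OpenAI-style speech request and are assumptions here, as are the helper names and the choice to inline local files as a base64 data URL.

```python
import base64
from pathlib import Path

# Checkpoint name taken from the PR; everything else in this sketch is illustrative.
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"


def encode_ref_audio(ref_audio: str) -> str:
    """Pass URLs through unchanged; inline a local file as a base64 data URL (assumed format)."""
    if ref_audio.startswith(("http://", "https://", "data:")):
        return ref_audio
    raw = Path(ref_audio).read_bytes()
    return "data:audio/wav;base64," + base64.b64encode(raw).decode("ascii")


def build_speech_payload(text, ref_audio, ref_text, response_format="wav", model=DEFAULT_MODEL):
    """Build a JSON body that always carries both voice-cloning fields."""
    if not ref_audio or not ref_text:
        # Mirrors the requirement described in the PR: both fields are mandatory.
        raise ValueError("CosyVoice3 requests need both ref_audio and ref_text")
    return {
        "model": model,                      # assumed key name
        "input": text,                       # assumed key name
        "response_format": response_format,  # assumed key name
        "ref_audio": encode_ref_audio(ref_audio),
        "ref_text": ref_text,
    }
```

Failing fast on the client side keeps the error close to the caller instead of waiting for the server-side validation to reject the request.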
Bundled `cosyvoice3.yaml` on the Fun-CosyVoice3-0.5B-2512 checkpoint sets `sample_rate: 24000`, and the WAV header on actual `/v1/audio/speech` output reads "mono 24000 Hz" — not 22.05 kHz as I originally wrote (propagated from a stale claim in the offline TTS README). Fix the intro bullet and the per-model section header; drop the now-redundant CosyVoice3 entry from the player-sample-rate hint since 24 kHz is the default for "the others". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
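The header check described above can be reproduced with the standard library alone. This is a small sketch, assuming a complete (non-streamed) WAV response saved as bytes; `wav_format` and `make_silent_wav` are hypothetical helper names, and the silent file only stands in for real server output.

```python
import io
import wave


def wav_format(data: bytes) -> tuple[int, int, int]:
    """Return (channels, sample_width_bytes, sample_rate_hz) read from a WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Create a short 16-bit mono WAV of silence, used here as a stand-in for real output."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buf.getvalue()


if __name__ == "__main__":
    channels, width, rate = wav_format(make_silent_wav(24000))
    print(f"{channels} channel(s), {width * 8}-bit, {rate} Hz")
```

Against a real response, passing the saved file's bytes to `wav_format` should report the rate the server actually used, which is the value the docs need to match.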
The offline TTS README listed Fun-CosyVoice3-0.5B-2512 as 22.05 kHz in both the Supported Models table and the per-model section header. The bundled `cosyvoice3.yaml` on the checkpoint sets `sample_rate: 24000` and the actual WAV header on `/v1/audio/speech` output reads "mono 24000 Hz" — this was the source the online docs propagated from. Bring the offline README in line with reality. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…yamls to deploy/
vllm-project#3338 migrated CosyVoice3 / OmniVoice / VoxCPM TTS configs into `vllm_omni/deploy/` but left two TTS variants behind in `vllm_omni/model_executor/stage_configs/`:
- `voxcpm_async_chunk.yaml`: VoxCPM streaming mode (default `async` path in `examples/online_serving/text_to_speech/voxcpm/run_server.sh` and `DEFAULT_STAGE_ASYNC` in `benchmarks/tts/bench_voxcpm_offline.py`).
- `qwen3_tts_uniproc.yaml`: Qwen3-TTS Base uniproc-executor variant documented in the online TTS hub (issues vllm-project#2603 / vllm-project#2604).
Both are still actively referenced and serve real user-facing modes, so they are not stale; the upstream refactor simply hadn't finished relocating them. Move the files into `vllm_omni/deploy/` (same legacy `stage_args` schema, same `--stage-configs-path` invocation) and sweep the 7 call sites (`run_server.sh`, the bench script, two README files, and three doc pages) to point at the new location. No behavioral change; this is a relocation + reference update.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
After PR vllm-project#2604 (`[Qwen3-TTS] Remove hardcoded distributed_executor_backend`) and the pipeline.py absorption of `model_arch` / `model_stage` / `engine_output_type` / `sampling_constraints`, the uniproc variant diffs to nothing meaningful against `vllm_omni/deploy/qwen3_tts.yaml` on single-GPU: same `max_num_seqs`, same memory budgets, same connector config, same `max_model_len`. Single-GPU serves now default to the uniproc executor — that was the entire purpose of the variant. Delete the redundant yaml and rewrite the two "Choosing an executor backend (uniproc vs mp)" doc sections (online TTS hub doc + online README) to explain that uniproc is the single-GPU default since vllm-project#2604; `qwen3_tts.yaml` is now the only Qwen3-TTS deploy config. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Removes VoxCPM v1 model code, deploy configs (sync + async_chunk, GPU + NPU), config / arch / arg-utils registrations, request-handling paths in serving_speech, examples, bench script, tests, and the TTS hub docs for v1. VoxCPM2 (`openbmb/VoxCPM2`) is unchanged. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Resolve one modify/delete conflict: this branch removes VoxCPM v1 integration entirely (commit 81c7308 deletes 10 voxcpm files), so upstream's tweak to tests/entrypoints/openai_api/test_serving_speech_voxcpm.py is moot. Keeping the delete. vllm_omni/entrypoints/openai/serving_speech.py auto-merged cleanly; all remaining voxcpm references in the file are VoxCPM2 (v2), which this PR does not touch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…e VoxCPM v1 (vllm-project#3748) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
VoxCPM v1 was removed in vllm-project#3748 but _register_omni_hf_configs still imported voxcpm.VoxCPMConfig. Since all imports share one try block, this failure skipped ALL config registrations (including Qwen3-TTS, CosyVoice3, GLM-TTS), breaking any model relying on AutoConfig resolution. Signed-off-by: BeatSeat <wendavid552@gmail.com>
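The failure mode described above (one stale import inside a shared `try` block silently skipping every registration) can be avoided by isolating each import. The sketch below is a generic illustration of that pattern, not the code in `_register_omni_hf_configs` and not necessarily how this commit fixed it; `register_configs`, the spec tuples, and the `register` callback are all hypothetical.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def register_configs(specs, register):
    """Register each (model_type, module_name, class_name) spec independently.

    A missing module or class only skips its own entry instead of aborting
    the registrations that follow it.
    """
    registered = []
    for model_type, module_name, class_name in specs:
        try:
            module = importlib.import_module(module_name)
            config_cls = getattr(module, class_name)
        except (ImportError, AttributeError) as exc:
            logger.warning("Skipping config %r: %s", model_type, exc)
            continue
        register(model_type, config_cls)
        registered.append(model_type)
    return registered
```

With a single shared `try`, the first `ImportError` jumps straight to the handler and the remaining entries never run; the per-entry `try` keeps the blast radius to one model.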
hsliuustc0106 left a comment
Review: [Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1
Verdict: LGTM. No blocking issues.
Section-by-section
1. CosyVoice3 online serving docs: Adds the missing model to the `speech_api.md` intro + Quick Start, a full `## CosyVoice3` section in the online TTS guide, and a launch helper + voice-cloning client. The client defaults to the official upstream zero-shot prompt so it works out-of-box, handles both local paths (base64) and URLs for `ref_audio`, and supports `--stream`. Verified by a real-hardware smoke test on H20 per the PR body. Correct.
The sample rate correction (22.05 kHz → 24 kHz) in both offline and online docs matches the actual `cosyvoice3.yaml` config and the real WAV output. Correct.
2. VoxCPM v1 removal: 30 files deleted: model code (`voxcpm.py`, `voxcpm_loader.py`, etc.), both GPU and NPU deploy/stage configs, model registry entries, config resolution, `arg_utils` injection, `serving_speech.py` request handling, examples, benchmark, and tests. VoxCPM2 is untouched (only the detection/dispatch logic in `serving_speech.py` is simplified since the v1 vs v2 branch is now dead). The `voxcpm>=2.0.2` dev dependency is kept, with the comment updated to reflect v2-only use. Clean and complete.
The `_detect_tts_model_type` simplification in `serving_speech.py` is correct: after removing `_VOXCPM_TTS_MODEL_STAGES = {"latent_generator", "vae"}`, the remaining `_VOXCPM2_TTS_MODEL_STAGES = {"latent_generator"}` no longer needs the VAE-stage-check to disambiguate.
3. `qwen3_tts_uniproc.yaml` removal: Since single-GPU serves now default to the uniproc executor, this variant config is redundant. The "Choosing an executor backend" doc sections are rewritten to explain the new default and reference the single remaining deploy config. Correct.
4. `voxcpm_async_chunk.yaml` removal: Both the GPU and NPU variants are deleted along with the rest of VoxCPM v1, resolving the leftover schema-migration question from the prior round. Correct.
Verification checklist
- All VoxCPM v1 references removed from code paths (model code, configs, registry, serving)
- VoxCPM2 preserved (`voxcpm2` model code, deploy configs, docs, tests untouched)
- CosyVoice3 sample rate corrected in both offline and online docs
- No orphaned tests (all v1-specific tests deleted; CosyVoice3 e2e test unchanged)
- `pyproject.toml` dependency kept (`voxcpm>=2.0.2`) with updated comment
- Pre-commit green
- E2E smoke test on hardware passes (documented in PR body)
Minor observations (non-blocking)
- `cosyvoice3/speech_client.py` uses `httpx.Client` (sync) rather than `httpx.AsyncClient`. For a client example this is fine: simplicity over concurrency. Just noting that the other TTS clients in the repo (`qwen3_tts/streaming_speech_client.py`, etc.) tend to use async patterns, so there's a stylistic inconsistency if anyone cares about that. Not worth holding up the PR.
- `docs/api/README.md` has a 1-line deletion (removing a VoxCPM v1 reference), confirmed clean.
Summary
Four TTS doc / refactor follow-ups to #3338 ("Migrate and clean up TTS configs"):
1. CosyVoice3 online serving: document what already works
CosyVoice3 (`FunAudioLLM/Fun-CosyVoice3-0.5B-2512`) has worked behind `/v1/audio/speech` since #2431 and is exercised end-to-end by `tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py`, but the docs still claimed it was unsupported and the model was missing from the speech-API intro / Quick Start.
- `docs/serving/speech_api.md`: CosyVoice3 added to the intro bullet list + a serve command added to Quick Start.
- `docs/user_guide/examples/online_serving/text_to_speech.md`: CosyVoice3 row added to the Supported Models table; stale "intentionally absent" line removed; full `## CosyVoice3` section added (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice pattern; `--8<--` example-materials includes added.
- `examples/online_serving/text_to_speech/cosyvoice3/run_server.sh`: launch helper honoring `MODEL`, `PORT`, `NO_ASYNC_CHUNK`.
- `examples/online_serving/text_to_speech/cosyvoice3/speech_client.py`: voice-cloning client that always sends `ref_audio` + `ref_text` (required by `_validate_cosyvoice3_request` in `vllm_omni/entrypoints/openai/serving_speech.py`), defaulting to the official upstream zero-shot prompt so it runs out-of-box. Supports `--stream`, `--ref-audio`, `--ref-text`, `--response-format`, `--output`.
Sample rate correction (24 kHz, not 22.05 kHz): the bundled `cosyvoice3.yaml` on the Fun-CosyVoice3-0.5B-2512 checkpoint sets `sample_rate: 24000`, and the WAV header on actual `/v1/audio/speech` output reads `mono 24000 Hz`. The "22.05 kHz" claim that lived in the offline README + offline supported_models table was wrong; corrected in both online and offline docs.
2. Relocate `voxcpm_async_chunk.yaml` into `vllm_omni/deploy/`
(Superseded by section 4 below: both v1 yamls are now removed entirely.)
3. Remove redundant `qwen3_tts_uniproc.yaml`
The `qwen3_tts_uniproc.yaml` variant that was left behind in `stage_configs/` is redundant after PR #2604 ("Remove hardcoded `distributed_executor_backend`") + the pipeline.py absorption of `model_arch` / `model_stage` / `engine_output_type` / `sampling_constraints`. The runtime tuning fields (`max_num_seqs`, `gpu_memory_utilization`, `max_num_batched_tokens`, `max_model_len`, connector config) match `vllm_omni/deploy/qwen3_tts.yaml` exactly, and single-GPU serves now default to the uniproc executor, which was the entire purpose of the variant. Deleted, and the two "Choosing an executor backend (uniproc vs mp)" doc sections rewritten to explain the new default.
4. Remove VoxCPM v1 integration
Removes VoxCPM v1 model code, deploy configs (sync + async_chunk, GPU + NPU), config / arch / arg-utils registrations, request-handling paths in `serving_speech.py`, examples, bench script, tests, and the TTS hub docs for v1. VoxCPM2 (`openbmb/VoxCPM2`) is unchanged. This also resolves the residual schema-migration question on `voxcpm_async_chunk.yaml` raised in the previous round: there's nothing left to migrate.
Test plan
- […] `DEFAULT_REF_AUDIO` literal on the first push; fixed; green since).
- […] `vllm serve FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --omni --trust-remote-code`):
  - `RIFF WAVE 16-bit mono 24000 Hz`, ~28 s audio.
  - `--stream` mode (English) → 384 KB PCM, 8.0 s @ 24 kHz.
  - `pkill` leaves no `StageEngineCoreProc` orphans, GPU mem back to 4 MiB.
- `tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py` unchanged (no code paths touched).
- `grep -rE "VoxCPMForConditionalGeneration|VoxCPMConfig|openbmb/VoxCPM-0\.5B|VLLM_OMNI_VOXCPM_HF_CONFIG_PATH|bench_voxcpm_offline"` returns empty (`voxcpm2` and the shared `voxcpm` PyPI package used by VoxCPM2 are preserved).
- `pyproject.toml`: `voxcpm>=2.0.2` kept (VoxCPM2 path-locating in `tests/e2e/offline_inference/test_voxcpm2.py`); only the comment updated to reflect v2-only use.
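The streaming figure in the test plan doubles as a sanity check on the sample rate. A quick sketch of the arithmetic, assuming the stream is raw 16-bit mono PCM and that "384 KB" means 384,000 bytes (both are assumptions; the helper name is hypothetical):

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int, channels: int = 1,
                         sample_width_bytes: int = 2) -> float:
    """Duration of raw PCM audio: total bytes divided by bytes per second."""
    bytes_per_second = sample_rate * channels * sample_width_bytes
    return num_bytes / bytes_per_second


if __name__ == "__main__":
    # 24,000 samples/s * 2 bytes/sample = 48,000 bytes/s
    print(pcm_duration_seconds(384_000, 24_000))
```

At 24 kHz the 384,000 bytes work out to the reported 8.0 s; at 22.05 kHz the same payload would be roughly 8.7 s, so the measured duration is consistent with the corrected rate.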