[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1 by linyueqian · Pull Request #3748 · vllm-project/vllm-omni

linyueqian · 2026-05-20T00:11:38Z

Summary

Four TTS doc / refactor follow-ups to #3338 ("Migrate and clean up TTS configs"):

1. CosyVoice3 online serving — document what already works

CosyVoice3 (FunAudioLLM/Fun-CosyVoice3-0.5B-2512) has worked behind /v1/audio/speech since #2431 and is exercised end-to-end by tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py, but the docs still claimed it was unsupported and the model was missing from the speech-API intro / Quick Start.

- docs/serving/speech_api.md: CosyVoice3 added to the intro bullet list + a serve command added to Quick Start.
- docs/user_guide/examples/online_serving/text_to_speech.md: CosyVoice3 row added to the Supported Models table; stale "intentionally absent" line removed; full ## CosyVoice3 section added (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice pattern; --8<-- example-materials includes added.
- examples/online_serving/text_to_speech/cosyvoice3/run_server.sh: launch helper honoring MODEL, PORT, NO_ASYNC_CHUNK.
- examples/online_serving/text_to_speech/cosyvoice3/speech_client.py: voice-cloning client that always sends ref_audio + ref_text (required by _validate_cosyvoice3_request in vllm_omni/entrypoints/openai/serving_speech.py), defaulting to the official upstream zero-shot prompt so it runs out-of-box. Supports --stream, --ref-audio, --ref-text, --response-format, --output.
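For readers who want to see the request shape without opening the example client, here is a minimal standard-library sketch of a voice-cloning call. It is not the `speech_client.py` added by this PR. The `ref_audio`, `ref_text`, and `response_format` fields come from the description above; the `model` and `input` keys, the data-URL encoding for local files, and the helper names are assumptions made for illustration.

```python
import base64
import json
import urllib.request

# Checkpoint name taken from the PR description.
DEFAULT_MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"


def encode_local_audio(path):
    """Return a base64 data URL for a local WAV file.

    Assumption: the server accepts a data URL; the real client may
    encode local reference audio differently.
    """
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:audio/wav;base64,{encoded}"


def build_speech_payload(text, ref_audio, ref_text,
                         model=DEFAULT_MODEL, response_format="wav"):
    """Build the JSON body, refusing requests without a reference prompt."""
    if not ref_audio or not ref_text:
        # Mirrors the rule stated in the PR: both fields are required.
        raise ValueError("CosyVoice3 requests need both ref_audio and ref_text")
    return {
        "model": model,
        "input": text,  # assumed OpenAI-style field name
        "ref_audio": ref_audio,
        "ref_text": ref_text,
        "response_format": response_format,
    }


def synthesize(base_url, payload, timeout=300):
    """POST the payload to /v1/audio/speech and return the raw audio bytes."""
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With a server running, `synthesize("http://localhost:8091", build_speech_payload(...))` would return the audio bytes; port 8091 is the one used in the smoke test below.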

Sample rate correction (24 kHz, not 22.05 kHz): the bundled cosyvoice3.yaml on the Fun-CosyVoice3-0.5B-2512 checkpoint sets sample_rate: 24000, and the WAV header on actual /v1/audio/speech output reads mono 24000 Hz. The "22.05 kHz" claim that lived in the offline README + offline supported_models table was wrong; corrected in both online and offline docs.
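The WAV-header check mentioned above can be reproduced with Python's standard `wave` module. The sketch below is an illustration, not code from this PR: `wav_format` reads channels, bit depth, and sample rate from WAV bytes, and `make_silent_wav` is a hypothetical helper that only fabricates a short silent file so the check can be exercised without a server.

```python
import io
import wave


def wav_format(data):
    """Return (channels, bits_per_sample, sample_rate) from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def make_silent_wav(sample_rate, seconds=0.1):
    """Create a short 16-bit mono WAV of silence (test fixture only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


# A complete WAV response saved from /v1/audio/speech could be passed to
# wav_format() the same way; here the fixture stands in for it.
print(wav_format(make_silent_wav(24000)))
```

This applies to complete WAV files; a raw PCM stream has no header to inspect.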

2. Relocate `voxcpm_async_chunk.yaml` into `vllm_omni/deploy/`

(Superseded by section 4 below — both v1 yamls are now removed entirely.)

3. Remove redundant `qwen3_tts_uniproc.yaml`

The qwen3_tts_uniproc.yaml variant that was left behind in stage_configs/ is redundant after PR #2604 ("Remove hardcoded distributed_executor_backend") + the pipeline.py absorption of model_arch / model_stage / engine_output_type / sampling_constraints. The runtime tuning fields (max_num_seqs, gpu_memory_utilization, max_num_batched_tokens, max_model_len, connector config) match vllm_omni/deploy/qwen3_tts.yaml exactly, and single-GPU serves now default to the uniproc executor — that was the entire purpose of the variant. Deleted, and the two "Choosing an executor backend (uniproc vs mp)" doc sections rewritten to explain the new default.

4. Remove VoxCPM v1 integration

Removes VoxCPM v1 model code, deploy configs (sync + async_chunk, GPU + NPU), config / arch / arg-utils registrations, request-handling paths in serving_speech.py, examples, bench script, tests, and the TTS hub docs for v1. VoxCPM2 (openbmb/VoxCPM2) is unchanged. This also resolves the residual schema-migration question on voxcpm_async_chunk.yaml raised in the previous round — there's nothing left to migrate.

Test plan

- Pre-commit green locally and in CI on every push (ruff-format hit one DEFAULT_REF_AUDIO literal on the first push; fixed, green since).
- End-to-end CosyVoice3 client smoke test on H20 (GPU 1, port 8091, real vllm serve FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --omni --trust-remote-code):
  - default mode (Chinese voice-cloning) → 1.34 MB RIFF WAVE 16-bit mono 24000 Hz, ~28 s audio.
  - --stream mode (English) → 384 KB PCM, 8.0 s @ 24 kHz.
  - Both modes complete cleanly; server pkill leaves no StageEngineCoreProc orphans, GPU mem back to 4 MiB.
- Existing tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py unchanged (no code paths touched).
- No remaining v1 references after the deletion: grep -rE "VoxCPMForConditionalGeneration|VoxCPMConfig|openbmb/VoxCPM-0\.5B|VLLM_OMNI_VOXCPM_HF_CONFIG_PATH|bench_voxcpm_offline" returns empty (voxcpm2 and the shared voxcpm PyPI package used by VoxCPM2 are preserved).
- pyproject.toml: voxcpm>=2.0.2 kept (VoxCPM2 path-locating in tests/e2e/offline_inference/test_voxcpm2.py); only the comment updated to reflect v2-only use.
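The sizes and durations reported in the smoke test are consistent with 24 kHz output, which can be checked with a one-line calculation. This sketch assumes 16-bit mono samples for the streamed PCM (the description states this only for the WAV case), decimal units (1 KB = 1000 bytes), and ignores the small WAV header.

```python
def pcm_duration_seconds(num_bytes, sample_rate=24000, channels=1, bytes_per_sample=2):
    """Duration of raw PCM audio: bytes divided by bytes-per-second."""
    return num_bytes / (sample_rate * channels * bytes_per_sample)


# --stream mode: 384 KB of PCM at 24 kHz.
print(pcm_duration_seconds(384_000))

# Default mode: 1.34 MB WAV, header ignored, so this is a slight overestimate.
print(round(pcm_duration_seconds(1_340_000), 1))
```

At 22.05 kHz the same 384 KB would correspond to roughly 8.7 s rather than 8.0 s, so the reported figures support the 24 kHz correction.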

CosyVoice3 already works behind `/v1/audio/speech` (see the `tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py` suite and `vllm_omni/deploy/cosyvoice3.yaml`), but the online-serving docs still claimed it was unsupported and the model was missing from the speech-API intro and Quick Start. This change closes that gap.

Changes:

- `docs/serving/speech_api.md`: add CosyVoice3 to the intro bullet list and a serve command to Quick Start (Supported Models table already carried the entry).
- `docs/user_guide/examples/online_serving/text_to_speech.md`: add a CosyVoice3 row to the Supported Models table, drop the stale "intentionally absent" line, add a full `## CosyVoice3` section (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice pattern, include 22.05 kHz in the sample-rate hint, and add `--8<--` example-materials includes.
- `examples/online_serving/text_to_speech/cosyvoice3/run_server.sh`: launch helper honoring `MODEL`, `PORT`, and `NO_ASYNC_CHUNK`.
- `examples/online_serving/text_to_speech/cosyvoice3/speech_client.py`: voice-cloning client that always sends `ref_audio` + `ref_text` (required by `_validate_cosyvoice3_request`), with the official upstream zero-shot prompt as the default so it runs out-of-box. Supports `--stream`, `--ref-audio`, `--ref-text`, `--response-format`, `--output`.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Bundled `cosyvoice3.yaml` on the Fun-CosyVoice3-0.5B-2512 checkpoint sets `sample_rate: 24000`, and the WAV header on actual `/v1/audio/speech` output reads "mono 24000 Hz" — not 22.05 kHz as I originally wrote (propagated from a stale claim in the offline TTS README). Fix the intro bullet and the per-model section header; drop the now-redundant CosyVoice3 entry from the player-sample-rate hint since 24 kHz is the default for "the others". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

The offline TTS README listed Fun-CosyVoice3-0.5B-2512 as 22.05 kHz in both the Supported Models table and the per-model section header. The bundled `cosyvoice3.yaml` on the checkpoint sets `sample_rate: 24000` and the actual WAV header on `/v1/audio/speech` output reads "mono 24000 Hz" — this was the source the online docs propagated from. Bring the offline README in line with reality. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…yamls to deploy/

vllm-project#3338 migrated CosyVoice3 / OmniVoice / VoxCPM TTS configs into `vllm_omni/deploy/` but left two TTS variants behind in `vllm_omni/model_executor/stage_configs/`:

- `voxcpm_async_chunk.yaml`: VoxCPM streaming mode (default `async` path in `examples/online_serving/text_to_speech/voxcpm/run_server.sh` and `DEFAULT_STAGE_ASYNC` in `benchmarks/tts/bench_voxcpm_offline.py`).
- `qwen3_tts_uniproc.yaml`: Qwen3-TTS Base uniproc-executor variant documented in the online TTS hub (issues vllm-project#2603 / vllm-project#2604).

Both are still actively referenced and serve real user-facing modes, so they are not stale; the upstream refactor simply hadn't finished relocating them. Move the files into `vllm_omni/deploy/` (same legacy `stage_args` schema, same `--stage-configs-path` invocation) and sweep the 7 call sites (`run_server.sh`, the bench script, two README files, and three doc pages) to point at the new location. No behavioral change; this is a relocation + reference update.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

After PR vllm-project#2604 (`[Qwen3-TTS] Remove hardcoded distributed_executor_backend`) and the pipeline.py absorption of `model_arch` / `model_stage` / `engine_output_type` / `sampling_constraints`, the uniproc variant diffs to nothing meaningful against `vllm_omni/deploy/qwen3_tts.yaml` on single-GPU: same `max_num_seqs`, same memory budgets, same connector config, same `max_model_len`. Single-GPU serves now default to the uniproc executor — that was the entire purpose of the variant. Delete the redundant yaml and rewrite the two "Choosing an executor backend (uniproc vs mp)" doc sections (online TTS hub doc + online README) to explain that uniproc is the single-GPU default since vllm-project#2604; `qwen3_tts.yaml` is now the only Qwen3-TTS deploy config. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Removes VoxCPM v1 model code, deploy configs (sync + async_chunk, GPU + NPU), config / arch / arg-utils registrations, request-handling paths in serving_speech, examples, bench script, tests, and the TTS hub docs for v1. VoxCPM2 (`openbmb/VoxCPM2`) is unchanged. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

Resolve one modify/delete conflict: this branch removes VoxCPM v1 integration entirely (commit 81c7308 deletes 10 voxcpm files), so upstream's tweak to tests/entrypoints/openai_api/test_serving_speech_voxcpm.py is moot. Keeping the delete. vllm_omni/entrypoints/openai/serving_speech.py auto-merged cleanly; all remaining voxcpm references in the file are VoxCPM2 (v2), which this PR does not touch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

…e VoxCPM v1 (vllm-project#3748) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

VoxCPM v1 was removed in vllm-project#3748 but _register_omni_hf_configs still imported voxcpm.VoxCPMConfig. Since all imports share one try block, this failure skipped ALL config registrations (including Qwen3-TTS, CosyVoice3, GLM-TTS), breaking any model relying on AutoConfig resolution. Signed-off-by: BeatSeat <wendavid552@gmail.com>
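The failure mode described in this commit message is easy to reproduce in isolation. The sketch below is an illustration only, not the code of `_register_omni_hf_configs` and not necessarily how the fix was implemented (dropping the stale import would also work). It contrasts a single `try` block around every import with per-entry isolation; the function names and the stand-in module names are hypothetical.

```python
import importlib


def register_all_or_nothing(specs, registry):
    """Fragile pattern: one try block wraps every import."""
    try:
        loaded = {name: importlib.import_module(module)
                  for name, module in specs.items()}
    except ImportError:
        return  # a single missing module skips every registration
    registry.update(loaded)


def register_independently(specs, registry):
    """Isolated pattern: one failing import only skips its own entry."""
    for name, module in specs.items():
        try:
            registry[name] = importlib.import_module(module)
        except ImportError:
            continue


# Stand-in modules: two that exist and one that was "removed".
specs = {"kept_a": "json", "removed": "no_such_module_for_demo", "kept_b": "math"}

fragile, isolated = {}, {}
register_all_or_nothing(specs, fragile)
register_independently(specs, isolated)
print(sorted(fragile), sorted(isolated))
```

The first registry stays empty because of the one missing module, while the second still contains the two modules that import cleanly.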

hsliuustc0106

Review: `[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1`

Verdict: LGTM. No blocking issues.

Section-by-section

1. CosyVoice3 online serving docs — Adds the missing model to speech_api.md intro + Quick Start, a full ## CosyVoice3 section in the online TTS guide, and a launch helper + voice-cloning client. The client defaults to the official upstream zero-shot prompt so it works out-of-box, handles both local paths (base64) and URLs for ref_audio, and supports --stream. Verified by a real-hardware smoke test on H20 per the PR body. Correct.

The sample rate correction (22.05 kHz → 24 kHz) in both offline and online docs matches the actual cosyvoice3.yaml config and the real WAV output. Correct.

2. VoxCPM v1 removal — 30 files deleted: model code (voxcpm.py, voxcpm_loader.py, etc.), both GPU and NPU deploy/stage configs, model registry entries, config resolution, arg_utils injection, serving_speech.py request handling, examples, benchmark, and tests. VoxCPM2 is untouched (only the detection/dispatch logic in serving_speech.py is simplified since the v1 vs v2 branch is now dead). The voxcpm>=2.0.2 dev dependency is kept — comment updated to reflect v2-only use. Clean and complete.

The _detect_tts_model_type simplification in serving_speech.py is correct: after removing _VOXCPM_TTS_MODEL_STAGES = {"latent_generator", "vae"}, the remaining _VOXCPM2_TTS_MODEL_STAGES = {"latent_generator"} no longer needs the VAE-stage-check to disambiguate.
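To make the disambiguation argument concrete, here is a rough sketch of the two situations. It is not the body of `_detect_tts_model_type`; the function names, return labels, and matching logic are assumptions for illustration. Only the two stage sets come from the review text above.

```python
# Stage sets quoted in the review; the v1 set is the one removed by this PR.
VOXCPM_V1_STAGES = {"latent_generator", "vae"}
VOXCPM2_STAGES = {"latent_generator"}


def detect_with_v1(stage_names):
    """Before removal: both families share latent_generator, so vae decides."""
    stages = set(stage_names)
    if "latent_generator" not in stages:
        return None
    return "voxcpm" if "vae" in stages else "voxcpm2"


def detect_after_removal(stage_names):
    """After removal: a latent_generator stage can only mean VoxCPM2."""
    return "voxcpm2" if VOXCPM2_STAGES & set(stage_names) else None


print(detect_with_v1(["latent_generator", "vae"]))
print(detect_with_v1(["latent_generator"]))
print(detect_after_removal(["latent_generator"]))
```

Once the v1 set is gone, the extra `vae` branch has nothing left to distinguish, which is the simplification the review describes.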

3. qwen3_tts_uniproc.yaml removal — Since single-GPU serves now default to the uniproc executor, this variant config is redundant. The "Choosing an executor backend" doc sections are rewritten to explain the new default and reference the single remaining deploy config. Correct.

4. voxcpm_async_chunk.yaml removal — Both the GPU and NPU variants are deleted along with the rest of VoxCPM v1, resolving the leftover schema-migration question from the prior round. Correct.

Verification checklist

- All VoxCPM v1 references removed from code paths (model code, configs, registry, serving)
- VoxCPM2 preserved (voxcpm2 model code, deploy configs, docs, tests untouched)
- CosyVoice3 sample rate corrected in both offline and online docs
- No orphaned tests (all v1-specific tests deleted; CosyVoice3 e2e test unchanged)
- pyproject.toml dependency kept (voxcpm>=2.0.2) with updated comment
- Pre-commit green
- E2E smoke test on hardware passes (documented in PR body)

Minor observations (non-blocking)

- cosyvoice3/speech_client.py uses httpx.Client (sync) rather than httpx.AsyncClient. For a client example this is fine: simplicity over concurrency. Just noting that the other TTS clients in the repo (qwen3_tts/streaming_speech_client.py, etc.) tend to use async patterns, so there's a stylistic inconsistency if anyone cares about that. Not worth holding up the PR.
- docs/api/README.md has a 1-line deletion (removing a VoxCPM v1 reference), confirmed clean.

linyueqian requested review from Gaohan123, david6666666, hsliuustc0106 and ywang96 as code owners May 20, 2026 00:11

linyueqian added 4 commits May 19, 2026 20:14

[Doc][TTS] Apply ruff-format to cosyvoice3 speech_client

783ff2d

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian requested review from ZeldaHuang, gcanlin, lishunyang12, princepride and tzhouam as code owners May 20, 2026 00:33

linyueqian changed the title ~~[Doc][TTS] Document CosyVoice3 online serving + add example client~~ [Doc][TTS] Document CosyVoice3 online serving + relocate residual TTS deploy yamls May 20, 2026

linyueqian added 2 commits May 19, 2026 20:49

linyueqian requested review from yenuo26 and yuanheng-zhao as code owners May 20, 2026 01:09

linyueqian changed the title ~~[Doc][TTS] Document CosyVoice3 online serving + relocate residual TTS deploy yamls~~ [Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1 May 20, 2026

linyueqian added the ready label (label to trigger buildkite CI) May 20, 2026

hsliuustc0106 merged commit fc8486c into vllm-project:main May 20, 2026
7 of 9 checks passed

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request May 20, 2026

[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remov…

00f9d02

…e VoxCPM v1 (vllm-project#3748) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

hsliuustc0106 reviewed May 21, 2026

zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026

[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remov…

beaad7f

…e VoxCPM v1 (vllm-project#3748) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

hsliuustc0106 merged 8 commits into vllm-project:main from linyueqian:docs/cosyvoice3-online-serving
