
[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1 #3748

Merged
hsliuustc0106 merged 8 commits into vllm-project:main from linyueqian:docs/cosyvoice3-online-serving on May 20, 2026

linyueqian commented on May 20, 2026

Collaborator

Summary

Four TTS doc / refactor follow-ups to #3338 ("Migrate and clean up TTS configs"):

1. CosyVoice3 online serving — document what already works

CosyVoice3 (FunAudioLLM/Fun-CosyVoice3-0.5B-2512) has worked behind /v1/audio/speech since #2431 and is exercised end-to-end by tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py, but the docs still claimed it was unsupported and the model was missing from the speech-API intro / Quick Start.

  • docs/serving/speech_api.md: CosyVoice3 added to the intro bullet list + a serve command added to Quick Start.
  • docs/user_guide/examples/online_serving/text_to_speech.md: CosyVoice3 row added to the Supported Models table; stale "intentionally absent" line removed; full ## CosyVoice3 section added (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice pattern; --8<-- example-materials includes added.
  • examples/online_serving/text_to_speech/cosyvoice3/run_server.sh: launch helper honoring MODEL, PORT, NO_ASYNC_CHUNK.
  • examples/online_serving/text_to_speech/cosyvoice3/speech_client.py: voice-cloning client that always sends ref_audio + ref_text (required by _validate_cosyvoice3_request in vllm_omni/entrypoints/openai/serving_speech.py), defaulting to the official upstream zero-shot prompt so it runs out-of-box. Supports --stream, --ref-audio, --ref-text, --response-format, --output.
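
The request shape described in the last bullet can be sketched as follows. This is a minimal illustration, not the code in `speech_client.py`: `ref_audio`, `ref_text`, and the model ID come from this PR, while `model`, `input`, and `response_format` are assumed to follow the OpenAI-style speech API, and the base64 data-URL encoding for local files is an assumption. The helper names are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path

# Model ID taken from the PR text; the body below targets POST /v1/audio/speech.
MODEL = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512"


def encode_ref_audio(ref_audio: str) -> str:
    """Pass URLs through unchanged; inline local files as base64 data URLs.

    The data-URL form is an assumption; check speech_client.py for the
    encoding the server actually expects.
    """
    if ref_audio.startswith(("http://", "https://")):
        return ref_audio
    path = Path(ref_audio)
    mime = mimetypes.guess_type(path.name)[0] or "audio/wav"
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_speech_request(text, ref_audio, ref_text, response_format="wav"):
    """Build a JSON body for the speech endpoint (field names partly assumed)."""
    # Mirrors the rule that CosyVoice3 requests must carry both fields.
    if not ref_audio or not ref_text:
        raise ValueError("CosyVoice3 requests need both ref_audio and ref_text")
    return {
        "model": MODEL,
        "input": text,
        "ref_audio": encode_ref_audio(ref_audio),
        "ref_text": ref_text,
        "response_format": response_format,
    }
```

Checking the two reference fields on the client side gives an early local error instead of a rejected request from the server.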

Sample rate correction (24 kHz, not 22.05 kHz): the bundled cosyvoice3.yaml on the Fun-CosyVoice3-0.5B-2512 checkpoint sets sample_rate: 24000, and the WAV header on actual /v1/audio/speech output reads mono 24000 Hz. The "22.05 kHz" claim that lived in the offline README + offline supported_models table was wrong; corrected in both online and offline docs.
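
One way to confirm the rate from the audio itself is to read the WAV header with Python's standard `wave` module. The sketch below is illustrative only: the helper names are hypothetical, and it synthesizes a short silent clip so it runs without a server. With a real response, pass the bytes returned by `/v1/audio/speech` to `describe_wav` instead.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read channel count, bit depth, sample rate, and duration from a WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }


def pcm_seconds(num_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM, which has no header to read the rate from."""
    return num_bytes / (sample_rate * channels * sample_width)


def make_silence(seconds, sample_rate=24000):
    """Synthesize a 16-bit mono WAV clip as stand-in data for this sketch."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


info = describe_wav(make_silence(0.5))
print(info["sample_rate"], info["channels"], info["bits"])
print(pcm_seconds(384_000))
```

At 16-bit mono 24 kHz, audio takes 48,000 bytes per second, so the sizes reported in the test plan are consistent with this rate: 384 KB of PCM is 8.0 s, and 1.34 MB of WAV is roughly 28 s.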

2. Relocate voxcpm_async_chunk.yaml into vllm_omni/deploy/

(Superseded by section 4 below — both v1 yamls are now removed entirely.)

3. Remove redundant qwen3_tts_uniproc.yaml

The qwen3_tts_uniproc.yaml variant that was left behind in stage_configs/ is redundant after PR #2604 ("Remove hardcoded distributed_executor_backend") + the pipeline.py absorption of model_arch / model_stage / engine_output_type / sampling_constraints. The runtime tuning fields (max_num_seqs, gpu_memory_utilization, max_num_batched_tokens, max_model_len, connector config) match vllm_omni/deploy/qwen3_tts.yaml exactly, and single-GPU serves now default to the uniproc executor — that was the entire purpose of the variant. Deleted, and the two "Choosing an executor backend (uniproc vs mp)" doc sections rewritten to explain the new default.
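
The "match exactly" check can be sketched as a field-by-field comparison. The values below are placeholders for illustration, not the real contents of either yaml. The flat dict layout is also an assumption, since the real files would first need to be loaded and reduced to these fields. The `"uni"` backend value is assumed as well.

```python
# Runtime tuning fields named in the PR description.
TUNING_FIELDS = (
    "max_num_seqs",
    "gpu_memory_utilization",
    "max_num_batched_tokens",
    "max_model_len",
)


def differing_fields(a, b, fields=TUNING_FIELDS):
    """Map each field whose value differs to its (a, b) pair; absent keys read as None."""
    return {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}


# Placeholder values only, NOT the real settings from qwen3_tts.yaml.
deploy_cfg = {
    "max_num_seqs": 8,
    "gpu_memory_utilization": 0.5,
    "max_num_batched_tokens": 4096,
    "max_model_len": 4096,
}
# The variant differs only in a field outside the tuning set (value assumed).
uniproc_cfg = dict(deploy_cfg, distributed_executor_backend="uni")

# An empty dict means the variant adds nothing to the tuning fields.
print(differing_fields(deploy_cfg, uniproc_cfg))
```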

4. Remove VoxCPM v1 integration

Removes VoxCPM v1 model code, deploy configs (sync + async_chunk, GPU + NPU), config / arch / arg-utils registrations, request-handling paths in serving_speech.py, examples, bench script, tests, and the TTS hub docs for v1. VoxCPM2 (openbmb/VoxCPM2) is unchanged. This also resolves the residual schema-migration question on voxcpm_async_chunk.yaml raised in the previous round — there's nothing left to migrate.

Test plan

  • Pre-commit green locally and in CI on every push (ruff-format hit one DEFAULT_REF_AUDIO literal on the first push — fixed; green since).
  • End-to-end CosyVoice3 client smoke test on H20 (GPU 1, port 8091, real vllm serve FunAudioLLM/Fun-CosyVoice3-0.5B-2512 --omni --trust-remote-code):
    • default mode (Chinese voice-cloning) → 1.34 MB RIFF WAVE 16-bit mono 24000 Hz, ~28 s audio.
    • --stream mode (English) → 384 KB PCM, 8.0 s @ 24 kHz.
    • Both modes complete cleanly; server pkill leaves no StageEngineCoreProc orphans, GPU mem back to 4 MiB.
  • Existing tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py unchanged (no code paths touched).
  • No remaining v1 references after the deletion: grep -rE "VoxCPMForConditionalGeneration|VoxCPMConfig|openbmb/VoxCPM-0\.5B|VLLM_OMNI_VOXCPM_HF_CONFIG_PATH|bench_voxcpm_offline" returns empty (voxcpm2 and the shared voxcpm PyPI package used by VoxCPM2 are preserved).
  • pyproject.toml: voxcpm>=2.0.2 kept (VoxCPM2 path-locating in tests/e2e/offline_inference/test_voxcpm2.py); only the comment updated to reflect v2-only use.
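
The grep check in the test plan can also be written as a small Python scan, for example to run it from a test. This is a sketch under assumptions: the pattern is copied from the command above, but `find_v1_references` is a hypothetical helper and is not part of this PR.

```python
import re
from pathlib import Path

# Same alternation as the grep -rE pattern in the test plan.
V1_PATTERN = re.compile(
    r"VoxCPMForConditionalGeneration|VoxCPMConfig|openbmb/VoxCPM-0\.5B"
    r"|VLLM_OMNI_VOXCPM_HF_CONFIG_PATH|bench_voxcpm_offline"
)


def find_v1_references(root):
    """Return (path, line number, line) for each line matching V1_PATTERN."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if V1_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

The pattern matches only v1 identifiers. Names such as `VoxCPM2Config` or `openbmb/VoxCPM2` do not match, which is what keeps the VoxCPM2 code out of the results.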

CosyVoice3 already works behind `/v1/audio/speech` (see the
`tests/e2e/online_serving/test_cosyvoice3_tts_expansion.py` suite and
`vllm_omni/deploy/cosyvoice3.yaml`), but the online-serving docs still
claimed it was unsupported and the model was missing from the speech-API
intro and Quick Start. This change closes that gap.

Changes:
- `docs/serving/speech_api.md`: add CosyVoice3 to the intro bullet list
  and a serve command to Quick Start (Supported Models table already
  carried the entry).
- `docs/user_guide/examples/online_serving/text_to_speech.md`: add a
  CosyVoice3 row to the Supported Models table, drop the stale
  "intentionally absent" line, add a full `## CosyVoice3` section
  (Prerequisites / Launch / CLI client / Notes) mirroring the OmniVoice
  pattern, include 22.05 kHz in the sample-rate hint, and add
  `--8<--` example-materials includes.
- `examples/online_serving/text_to_speech/cosyvoice3/run_server.sh`:
  launch helper honoring `MODEL`, `PORT`, and `NO_ASYNC_CHUNK`.
- `examples/online_serving/text_to_speech/cosyvoice3/speech_client.py`:
  voice-cloning client that always sends `ref_audio` + `ref_text`
  (required by `_validate_cosyvoice3_request`), with the official
  upstream zero-shot prompt as the default so it runs out-of-box.
  Supports `--stream`, `--ref-audio`, `--ref-text`, `--response-format`,
  `--output`.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Bundled `cosyvoice3.yaml` on the Fun-CosyVoice3-0.5B-2512 checkpoint sets
`sample_rate: 24000`, and the WAV header on actual `/v1/audio/speech`
output reads "mono 24000 Hz" — not 22.05 kHz as I originally wrote
(propagated from a stale claim in the offline TTS README). Fix the
intro bullet and the per-model section header; drop the now-redundant
CosyVoice3 entry from the player-sample-rate hint since 24 kHz is the
default for "the others".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The offline TTS README listed Fun-CosyVoice3-0.5B-2512 as 22.05 kHz in
both the Supported Models table and the per-model section header. The
bundled `cosyvoice3.yaml` on the checkpoint sets `sample_rate: 24000`
and the actual WAV header on `/v1/audio/speech` output reads "mono
24000 Hz" — this was the source the online docs propagated from. Bring
the offline README in line with reality.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…yamls to deploy/

vllm-project#3338 migrated CosyVoice3 / OmniVoice / VoxCPM TTS configs into
`vllm_omni/deploy/` but left two TTS variants behind in
`vllm_omni/model_executor/stage_configs/`:

- `voxcpm_async_chunk.yaml` — VoxCPM streaming mode (default `async`
  path in `examples/online_serving/text_to_speech/voxcpm/run_server.sh`
  and `DEFAULT_STAGE_ASYNC` in `benchmarks/tts/bench_voxcpm_offline.py`).
- `qwen3_tts_uniproc.yaml` — Qwen3-TTS Base uniproc-executor variant
  documented in the online TTS hub (issues vllm-project#2603 / vllm-project#2604).

Both are still actively referenced and serve real user-facing modes, so
they are not stale; the upstream refactor simply hadn't finished
relocating them. Move the files into `vllm_omni/deploy/` (same legacy
`stage_args` schema, same `--stage-configs-path` invocation) and
sweep the 7 call sites — `run_server.sh`, the bench script, two README
files, and three doc pages — to point at the new location.

No behavioral change; this is a relocation + reference update.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
linyueqian changed the title from "[Doc][TTS] Document CosyVoice3 online serving + add example client" to "[Doc][TTS] Document CosyVoice3 online serving + relocate residual TTS deploy yamls" on May 20, 2026
After PR vllm-project#2604 (`[Qwen3-TTS] Remove hardcoded distributed_executor_backend`)
and the pipeline.py absorption of `model_arch` / `model_stage` /
`engine_output_type` / `sampling_constraints`, the uniproc variant
diffs to nothing meaningful against `vllm_omni/deploy/qwen3_tts.yaml`
on single-GPU: same `max_num_seqs`, same memory budgets, same connector
config, same `max_model_len`. Single-GPU serves now default to the
uniproc executor — that was the entire purpose of the variant.

Delete the redundant yaml and rewrite the two "Choosing an executor
backend (uniproc vs mp)" doc sections (online TTS hub doc + online
README) to explain that uniproc is the single-GPU default since vllm-project#2604;
`qwen3_tts.yaml` is now the only Qwen3-TTS deploy config.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Removes VoxCPM v1 model code, deploy configs (sync + async_chunk,
GPU + NPU), config / arch / arg-utils registrations, request-handling
paths in serving_speech, examples, bench script, tests, and the TTS
hub docs for v1. VoxCPM2 (`openbmb/VoxCPM2`) is unchanged.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
linyueqian changed the title from "[Doc][TTS] Document CosyVoice3 online serving + relocate residual TTS deploy yamls" to "[Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1" on May 20, 2026
linyueqian added the "ready" label (label to trigger buildkite CI) on May 20, 2026
Resolve one modify/delete conflict: this branch removes VoxCPM v1
integration entirely (commit 81c7308 deletes 10 voxcpm files), so
upstream's tweak to tests/entrypoints/openai_api/test_serving_speech_voxcpm.py
is moot. Keeping the delete.

vllm_omni/entrypoints/openai/serving_speech.py auto-merged cleanly;
all remaining voxcpm references in the file are VoxCPM2 (v2), which
this PR does not touch.

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
hsliuustc0106 merged commit fc8486c into vllm-project:main on May 20, 2026
7 of 9 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request May 20, 2026
…e VoxCPM v1 (vllm-project#3748)

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
BeatSeat added a commit to BeatSeat/vllm-omni that referenced this pull request May 20, 2026
VoxCPM v1 was removed in vllm-project#3748 but _register_omni_hf_configs still
imported voxcpm.VoxCPMConfig. Since all imports share one try block,
this failure skipped ALL config registrations (including Qwen3-TTS,
CosyVoice3, GLM-TTS), breaking any model relying on AutoConfig
resolution.

Signed-off-by: BeatSeat <wendavid552@gmail.com>
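
The failure mode described in that commit, one bad import inside a shared `try` block skipping every registration, can be avoided in general by guarding each import separately. The sketch below illustrates that pattern only and is not necessarily how the fix was made. The function and list names are hypothetical, and standard-library classes stand in for the model config classes.

```python
import importlib
import logging

logger = logging.getLogger(__name__)

# Hypothetical (module, attribute) pairs; stdlib names stand in for config classes.
CANDIDATES = [
    ("json", "JSONDecoder"),
    ("module_that_was_removed", "RemovedConfig"),  # simulates the stale import
    ("collections", "OrderedDict"),
]


def register_available(candidates, registry=None):
    """Import each entry in its own try block so a failure skips only that entry."""
    registry = {} if registry is None else registry
    for module_name, attr in candidates:
        try:
            registry[attr] = getattr(importlib.import_module(module_name), attr)
        except (ImportError, AttributeError) as exc:
            logger.warning("skipping %s.%s: %s", module_name, attr, exc)
    return registry


registry = register_available(CANDIDATES)
print(sorted(registry))  # → ['JSONDecoder', 'OrderedDict']
```

The entry after the failing one is still registered, which is the property the shared `try` block lacked.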

hsliuustc0106 left a comment

Collaborator

Review: [Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1

Verdict: LGTM. No blocking issues.

Section-by-section

1. CosyVoice3 online serving docs — Adds the missing model to speech_api.md intro + Quick Start, a full ## CosyVoice3 section in the online TTS guide, and a launch helper + voice-cloning client. The client defaults to the official upstream zero-shot prompt so it works out-of-box, handles both local paths (base64) and URLs for ref_audio, and supports --stream. Verified by a real-hardware smoke test on H20 per the PR body. Correct.

The sample rate correction (22.05 kHz → 24 kHz) in both offline and online docs matches the actual cosyvoice3.yaml config and the real WAV output. Correct.

2. VoxCPM v1 removal — 30 files deleted: model code (voxcpm.py, voxcpm_loader.py, etc.), both GPU and NPU deploy/stage configs, model registry entries, config resolution, arg_utils injection, serving_speech.py request handling, examples, benchmark, and tests. VoxCPM2 is untouched (only the detection/dispatch logic in serving_speech.py is simplified since the v1 vs v2 branch is now dead). The voxcpm>=2.0.2 dev dependency is kept — comment updated to reflect v2-only use. Clean and complete.

The _detect_tts_model_type simplification in serving_speech.py is correct: after removing _VOXCPM_TTS_MODEL_STAGES = {"latent_generator", "vae"}, the remaining _VOXCPM2_TTS_MODEL_STAGES = {"latent_generator"} no longer needs the VAE-stage-check to disambiguate.
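
To make the disambiguation concrete, here is a minimal sketch of detection before and after the removal. The two stage-name sets are the ones quoted above. The function names, the returned labels, and the subset test are assumptions for illustration and do not reproduce the real `_detect_tts_model_type`.

```python
# Stage-name sets quoted in the review.
VOXCPM_V1_STAGES = {"latent_generator", "vae"}  # removed by this PR
VOXCPM2_STAGES = {"latent_generator"}


def detect_before(stage_names):
    """Old shape: both versions have latent_generator, so the VAE stage decides."""
    stages = set(stage_names)
    if VOXCPM2_STAGES <= stages:
        return "voxcpm" if VOXCPM_V1_STAGES <= stages else "voxcpm2"
    return None


def detect_after(stage_names):
    """New shape: with v1 gone, latent_generator alone identifies VoxCPM2."""
    return "voxcpm2" if VOXCPM2_STAGES <= set(stage_names) else None


print(detect_before(["latent_generator", "vae"]))
print(detect_after(["latent_generator"]))
```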

3. qwen3_tts_uniproc.yaml removal — Since single-GPU serves now default to the uniproc executor, this variant config is redundant. The "Choosing an executor backend" doc sections are rewritten to explain the new default and reference the single remaining deploy config. Correct.

4. voxcpm_async_chunk.yaml removal — Both the GPU and NPU variants are deleted along with the rest of VoxCPM v1, resolving the leftover schema-migration question from the prior round. Correct.

Verification checklist

  • All VoxCPM v1 references removed from code paths (model code, configs, registry, serving)
  • VoxCPM2 preserved (voxcpm2 model code, deploy configs, docs, tests untouched)
  • CosyVoice3 sample rate corrected in both offline and online docs
  • No orphaned tests (all v1-specific tests deleted; CosyVoice3 e2e test unchanged)
  • pyproject.toml dependency kept (voxcpm>=2.0.2) with updated comment
  • Pre-commit green
  • E2E smoke test on hardware passes (documented in PR body)

Minor observations (non-blocking)

  • cosyvoice3/speech_client.py uses httpx.Client (sync) rather than httpx.AsyncClient. For a client example this is fine — simplicity over concurrency. Just noting that the other TTS clients in the repo (qwen3_tts/streaming_speech_client.py, etc.) tend to use async patterns, so there's a stylistic inconsistency if anyone cares about that. Not worth holding up the PR.

  • docs/api/README.md has a 1-line deletion (removing a VoxCPM v1 reference) — confirmed clean.

zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026
…e VoxCPM v1 (vllm-project#3748)

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Labels

ready label to trigger buildkite CI

2 participants