Merged
10 changes: 5 additions & 5 deletions .claude/skills/add-tts-model/SKILL.md
@@ -211,8 +211,8 @@ See `plan/voxcpm2_native_ar_design.md`.

- Model files in `vllm_omni/model_executor/models/<model_name>/`
- Stage config YAML
- Working `end2end.py` with correct audio output
- README.md in the example directory
- Working `end2end.py` at `examples/offline_inference/text_to_speech/<model>/end2end.py`
- New section in `examples/offline_inference/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/offline_inference/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/` — the hub README is the documented surface and the mkdocs `generate_examples` hook only descends one level into `examples/<category>/`.

## Phase 3: Online Serving

@@ -308,11 +308,11 @@ def build_voice_clone_prompt(ref_audio_path: str, text: str, codec) -> list:
### Deliverables

- Updated `serving_speech.py` with all 5 integration points (single commit)
- Client scripts and server launcher
- Gradio demo with streaming and voice cloning UI
- Client scripts and server launcher under `examples/online_serving/text_to_speech/<model>/`
- Gradio demo with streaming and voice cloning UI in the same dir
- E2E online serving test (`tests/e2e/online_serving/test_<model>.py`)
- Buildkite CI entry in `.buildkite/test-merge.yml`
- Documentation (offline + online serving docs)
- New section in `examples/online_serving/text_to_speech/README.md` (table row + per-model section). Do **not** create a top-level `examples/online_serving/<model>/` dir or a per-model `README.md` inside `text_to_speech/<model>/`.

### E2E test pitfalls to avoid

11 changes: 11 additions & 0 deletions docs/contributing/model/adding_tts_model.md
@@ -186,6 +186,17 @@ vllm_omni/model_executor/stage_configs/
your_model_name_async_chunk.yaml # Streaming mode config
```

### Example placement

TTS examples live in the consolidated text-to-speech hub, **not** in their
own top-level directory. Place per-model scripts under
`examples/offline_inference/text_to_speech/<your_model>/` and
`examples/online_serving/text_to_speech/<your_model>/`, and add a section
to the hub `README.md` files (table row + per-model section) instead of a
new per-model `README.md`. The mkdocs `generate_examples` hook treats the
`text_to_speech/` parent as a single example, so per-model READMEs inside
it would not be picked up — the hub README is the documented surface.
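
The placement rules above can be checked mechanically before opening a PR. The following is a minimal sketch, not part of the repository: `find_layout_problems` is a hypothetical helper, and the "hub README mentions the model directory name" test is a heuristic assumption about how a per-model section would reference its scripts.

```python
from pathlib import Path

# The two example categories named in the placement rules.
CATEGORIES = ("offline_inference", "online_serving")


def find_layout_problems(repo_root: Path, model: str) -> list[str]:
    """Return human-readable placement violations for one TTS model.

    Hypothetical helper: `model` is the example directory name,
    e.g. "moss_tts_nano".
    """
    problems: list[str] = []
    for category in CATEGORIES:
        category_dir = repo_root / "examples" / category
        hub_dir = category_dir / "text_to_speech"

        # Rule 1: no top-level examples/<category>/<model>/ directory.
        if (category_dir / model).is_dir():
            problems.append(
                f"{category}: move '{model}/' under text_to_speech/"
            )

        # Rule 2: no per-model README.md inside the hub.
        if (hub_dir / model / "README.md").is_file():
            problems.append(
                f"{category}: fold text_to_speech/{model}/README.md "
                "into the hub README"
            )

        # Rule 3 (heuristic): the hub README should reference the model dir.
        hub_readme = hub_dir / "README.md"
        if not hub_readme.is_file() or model not in hub_readme.read_text(
            encoding="utf-8"
        ):
            problems.append(
                f"{category}: hub README has no section mentioning '{model}'"
            )
    return problems
```

Calling `find_layout_problems(Path("."), "your_model")` from the repository root returns an empty list when the layout follows the rules, so it could be wired into a pre-commit hook or a quick local check.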

**Qwen3-TTS reference files:**

| File | Purpose |
68 changes: 0 additions & 68 deletions docs/user_guide/examples/offline_inference/voxtral_tts.md

This file was deleted.

2 changes: 1 addition & 1 deletion examples/offline_inference/ming_flash_omni/README.md
@@ -9,7 +9,7 @@ vLLM-Omni supports two deployment modes:
| Thinker + Talker (omni-speech, default) | `vllm_omni/deploy/ming_flash_omni.yaml` | Text + Audio |
| Thinker only (multimodal understanding) | `vllm_omni/deploy/ming_flash_omni_thinker_only.yaml` | Text |

For standalone TTS (talker only), see [`examples/offline_inference/ming_flash_omni_tts/`](../ming_flash_omni_tts/).
For standalone TTS (talker only), see the [Ming-flash-omni-TTS section in the Text-To-Speech hub](../text_to_speech/README.md#ming-flash-omni-tts).

## Setup

47 changes: 0 additions & 47 deletions examples/offline_inference/ming_flash_omni_tts/README.md

This file was deleted.

97 changes: 0 additions & 97 deletions examples/offline_inference/moss_tts_nano/README.md

This file was deleted.

74 changes: 73 additions & 1 deletion examples/offline_inference/text_to_speech/README.md
@@ -16,9 +16,11 @@ list of supported architectures across all modalities, see
|---|---|---|---|---|---|---|
| CosyVoice3 | `FunAudioLLM/Fun-CosyVoice3-0.5B-2512` | 2 (talker + code2wav) | ✓ | ✓ | — | 22.05 kHz |
| Fish Speech S2 Pro | `fishaudio/s2-pro` | dual-AR | ✓ | ✓ | — | 44.1 kHz |
| Ming-flash-omni-TTS | `Jonathan1909/Ming-flash-omni-2.0` | single (talker only) | — (caption-controlled) | — | style / IP / basic captions | 44.1 kHz |
| MOSS-TTS-Nano | `OpenMOSS-Team/MOSS-TTS-Nano` | single (AR + codec) | ✓ (required) | ✓ | voice_clone, continuation | 48 kHz |
| OmniVoice | `k2-fsa/OmniVoice` | 2 (gen + dec) | ✓ | — | voice design, language hint | 24 kHz |
| Qwen3-TTS | `Qwen/Qwen3-TTS-12Hz-1.7B-{CustomVoice,VoiceDesign,Base}` | 2 (talker + code2wav) | ✓ (Base) | ✓ | 3 task variants | 24 kHz |
| VoxCPM | local model dir | split | ✓ | ✓ | — | 24 kHz |
| VoxCPM | `openbmb/VoxCPM-0.5B` | split | ✓ | ✓ | — | 24 kHz |
| VoxCPM2 | `openbmb/VoxCPM2` | single (native AR) | ✓ | ✓ (online) | continuation | 48 kHz |
| Voxtral TTS | `mistralai/Voxtral-4B-TTS-2603` | varies | ✓ | ✓ | voice presets | 24 kHz |

@@ -126,6 +128,76 @@ Streaming requires `async_chunk: true` in the stage config.

---

## Ming-flash-omni-TTS

Standalone talker-only deployment of Ming-flash-omni-2.0 at 44.1 kHz. Voice is controlled through caption fields rather than reference audio: `风格` (style), `IP`, and the basic controls `语速` (speed), `基频` (pitch), and `音量` (volume).

### Prerequisites
The example calls into `vllm_omni.model_executor.models.ming_flash_omni.prompt_utils` for the default prompt and instruction builder; no extra pip install is needed on top of the base vLLM-Omni install.

### Quick start
```bash
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style
```

### Cases
```bash
# ASMR-style whisper (caption-driven)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case style

# IP voice (preset character voice via caption)
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case ip

# Basic speed/pitch/volume control
python examples/offline_inference/text_to_speech/ming_flash_omni_tts/end2end.py --case basic
```

Override the default text per case with `--text`, and write to a custom path with `--output`.

### Notes
- Talker-only deployment — for the multimodal Ming-flash-omni example, see [`examples/offline_inference/ming_flash_omni/`](../../ming_flash_omni/).
- Deploy config: `vllm_omni/deploy/ming_flash_omni_tts.yaml` (single GPU, `enforce_eager`, `max_num_seqs: 1`).
- Decode defaults from the Ming cookbook: `max_decode_steps=200`, `cfg=2.0`, `sigma=0.25`, `temperature=0.0`, `use_zero_spk_emb=True`.

---

## MOSS-TTS-Nano

Single-stage 0.1B AR LM + MOSS-Audio-Tokenizer-Nano codec at 48 kHz mono (mixed down from upstream stereo). Supports Chinese, English, and Japanese (ZH / EN / JA). Every request requires a reference clip via `--ref-audio`.

> **No built-in speaker presets.** `--ref-audio` is required on every call. Default `--mode voice_clone` matches upstream's recommended workflow; `--mode continuation` is exposed for completeness but upstream's continuation-with-prompt path emits very short / near-silent output, so it is rarely useful in practice. Sample reference clips ship in the upstream repo under [`assets/audio/`](https://github.com/OpenMOSS/MOSS-TTS-Nano/tree/main/assets/audio) (e.g. `zh_1.wav`, `en_2.wav`, `jp_2.wav`).

### Quick start
```bash
# Fetch a sample reference clip (one-off, user-scoped cache).
REF_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano"
mkdir -p "$REF_DIR"
[ -s "$REF_DIR/zh_1.wav" ] || \
curl -L -o "$REF_DIR/zh_1.wav" https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/main/assets/audio/zh_1.wav

python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
--text "你好,这是MOSS-TTS-Nano的语音合成演示。" \
--ref-audio "$REF_DIR/zh_1.wav"
```
The first run downloads `OpenMOSS-Team/MOSS-TTS-Nano` and `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` from Hugging Face.
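
When the example is driven from Python rather than a shell, the cache-then-download step of the quick start can be mirrored as below. This is a sketch under assumptions: `ref_cache_dir` and `fetch_ref_clip` are hypothetical names, not helpers shipped with the example. Only the URL and the cache location come from the quick start.

```python
import os
import urllib.request
from pathlib import Path

# Sample clip URL taken from the quick start above.
REF_URL = (
    "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS-Nano/"
    "main/assets/audio/zh_1.wav"
)


def ref_cache_dir(env=None) -> Path:
    """Mirror `${XDG_CACHE_HOME:-$HOME/.cache}/moss-tts-nano`."""
    env = os.environ if env is None else env
    # Like `:-` in the shell, an unset or empty variable falls back to ~/.cache.
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "moss-tts-nano"


def fetch_ref_clip(url: str = REF_URL, cache_dir=None) -> Path:
    """Download the clip once; reuse it on later calls."""
    cache_dir = ref_cache_dir() if cache_dir is None else Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    # Mirror `[ -s file ] ||`: fetch only when the file is missing or empty.
    if not target.is_file() or target.stat().st_size == 0:
        urllib.request.urlretrieve(url, target)
    return target
```

The returned path can then be passed as the `--ref-audio` argument when launching `end2end.py`.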

### Reproducible runs
```bash
python examples/offline_inference/text_to_speech/moss_tts_nano/end2end.py \
--text "Deterministic test." \
--ref-audio "$REF_DIR/en_2.wav" \
--seed 42
```

### Notes
- Output: 48 kHz mono WAV (the tokenizer is internally stereo at 48 kHz; the wrapper averages to mono before reaching the engine).
- Deploy config: `vllm_omni/deploy/moss_tts_nano.yaml` (auto-loaded; override with `--deploy-config`).
- Default `--max-new-frames 375` ≈ 14 s of audio; raise for longer outputs.
- `--ref-text` is rejected in `voice_clone` mode and required only with `--mode continuation`.
- Run the script with `--help` for the full sampling-knob surface (`--audio-temperature`, `--audio-top-k`, `--audio-top-p`, `--text-temperature`).
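
To pick a `--max-new-frames` value for a longer clip, the default ratio in the notes above (375 frames for roughly 14 s) can be extrapolated. The sketch below assumes audio duration scales linearly with frame count, which the notes do not state explicitly. `frames_for_duration` is a hypothetical helper, and the result is a rough budget rather than an exact bound.

```python
import math

# Ratio taken from the note above: 375 frames is approximately 14 s of audio.
DEFAULT_MAX_NEW_FRAMES = 375
DEFAULT_SECONDS = 14.0


def frames_for_duration(seconds: float) -> int:
    """Estimate a --max-new-frames budget for a target duration.

    Assumes duration scales linearly with the frame count.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * DEFAULT_MAX_NEW_FRAMES / DEFAULT_SECONDS)
```

Under that assumption, a 30 s target works out to `frames_for_duration(30)`, which rounds up to 804 frames.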

---

## OmniVoice

Zero-shot multilingual TTS supporting 600+ languages, with three modes (auto / clone / design).